Reindexing a dataframe on exact text match

Problem description

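I want to create a mapping from the index of one dataframe to the index of another, matching rows on the (given) text column. The two dataframes have the same length, and every row always has an exact match in the other dataframe.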

import pandas as pd

df_original = pd.DataFrame(dict(text=['The cat sat on the table', 'There is a kind of hush', 'The boy kicked the ball', 'He shot the elephant', 'I want to eat right now!']))
df = pd.DataFrame(dict(text=['He shot the elephant', 'The boy kicked the ball', 'The cat sat on the table', 'I want to eat right now!', 'There is a kind of hush']))

df_original looks like:

0   The cat sat on the table
1   There is a kind of hush
2   The boy kicked the ball
3   He shot the elephant
4   I want to eat right now!

df looks like:

0   He shot the elephant
1   The boy kicked the ball
2   The cat sat on the table
3   I want to eat right now!
4   There is a kind of hush

I want to get a dictionary mapping like this:

d = {2: 0, 4: 1, 1: 2, 0: 3, 3: 4}

For example, index 2 of df matches index 0 of df_original, so the two must be mapped together, and so on for the other rows.

If possible, I would prefer a vectorized operation, and that is what I am looking for.

I tried:

d = {}
for i1, r1 in df_original.iterrows():
    for i2, r2 in df.iterrows():
        if r1['text'] == r2['text']:
            d[i2] = i1
print(d)
# {2: 0, 4: 1, 1: 2, 0: 3, 3: 4}

But this is very slow, because my dataframes have millions of rows.

Tags: python, pandas

Solution


You can try map:

df['text'].map(df_original.reset_index().set_index('text')['index']).to_dict()

{0: 3, 1: 2, 2: 0, 3: 4, 4: 1}
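
For reference, the same idea can be written as a self-contained sketch that builds the lookup Series directly instead of going through reset_index and set_index. The name lookup is made up for illustration and is not part of the original answer. The sketch assumes every text value in df_original is unique, since map needs a uniquely indexed Series to look values up in. The resulting dictionary holds the same key-value pairs as the d from the question, only listed in the order of df's index.

```python
import pandas as pd

df_original = pd.DataFrame(dict(text=[
    'The cat sat on the table', 'There is a kind of hush',
    'The boy kicked the ball', 'He shot the elephant',
    'I want to eat right now!']))
df = pd.DataFrame(dict(text=[
    'He shot the elephant', 'The boy kicked the ball',
    'The cat sat on the table', 'I want to eat right now!',
    'There is a kind of hush']))

# Lookup Series (hypothetical name): text -> index label in df_original.
lookup = pd.Series(df_original.index, index=df_original['text'])

# Assumption: texts are unique, otherwise the lookup is ambiguous.
if not lookup.index.is_unique:
    raise ValueError('duplicate text values in df_original')

# For each row of df, fetch the matching index label from df_original.
mapping = df['text'].map(lookup).to_dict()
print(mapping)
```

Each text is looked up once rather than compared against every row, which avoids the quadratic cost of the nested iterrows loops.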
