python - Set DataFrame column values from another DataFrame based on a condition
Question
I have a DataFrame:
import pandas as pd

# Around 100000 rows
df = pd.DataFrame({'text': ['Apple is healthy', 'Potato is round', 'Apple might be green'],
                   'category': ["", "", ""],
                   })
And a second DataFrame:
#Around 3000 rows
df_2 = pd.DataFrame({'keyword': [ 'Apple ', 'Potato'],
'category': ["fruit","vegetable"],
})
Desired result:
#Around 100000 rows
df = pd.DataFrame({'text': [ 'Apple is healthy', 'Potato is round', 'Apple might be green'],
'category': ["fruit","vegetable", "fruit"],
})
This is what I have tried so far:
df.set_index('text')
df_2.set_index('keyword')
df.update(df_2)
The result is:
text category
Apple is healthy fruit
Potato is round vegetable
Apple might be green
As you can see, it did not add a category for the last row. How can I achieve this?
Answer
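You need to assign the output of DataFrame.set_index back to a variable, because unlike DataFrame.update it is not an in-place operation. To build an index that can be matched against df_2, use Series.str.extract with a pattern made from the df_2["keyword"] column: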
df = df.set_index(df['text'].str.extract(f'({"|".join(df_2["keyword"])})', expand=False))
df_2 = df_2.set_index('keyword')
print (df)
text category
text
Apple Apple is healthy
Potato Potato is round
Apple Apple might be green
df.update(df_2)
print (df)
text category
text
Apple Apple is healthy fruit
Potato Potato is round vegetable
Apple Apple might be green fruit
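To see why the original attempt filled only the first two rows, the following minimal sketch reproduces it with the sample data from the question. It assumes nothing beyond pandas itself; the names index_unchanged and categories_after_update are introduced here only for illustration. Because the set_index results are discarded, both frames keep their default integer index, so update aligns row 0 with row 0 and row 1 with row 1, and the third row has no counterpart in df_2.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['Apple is healthy', 'Potato is round', 'Apple might be green'],
    'category': ["", "", ""],
})
df_2 = pd.DataFrame({
    'keyword': ['Apple ', 'Potato'],
    'category': ["fruit", "vegetable"],
})

# set_index returns a new DataFrame; the results are thrown away here,
# so df and df_2 still carry their default integer index.
df.set_index('text')
df_2.set_index('keyword')

# Illustrative check (hypothetical name): both indexes are unchanged.
index_unchanged = list(df.index) == [0, 1, 2] and list(df_2.index) == [0, 1]
print(index_unchanged)

# update modifies df in place and aligns on index labels, so only
# labels 0 and 1 receive a category; label 2 is left as the empty string.
df.update(df_2)
categories_after_update = df['category'].tolist()
print(categories_after_update)
```

The match on the first two rows is therefore a coincidence of row order, not a keyword lookup.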
If you only need to add a single column, use Series.str.extract together with Series.map:
s = df['text'].str.extract(f'({"|".join(df_2["keyword"])})', expand=False)
df['category'] = s.map(df_2.set_index(['keyword'])['category'])
print (df)
text category
0 Apple is healthy fruit
1 Potato is round vegetable
2 Apple might be green fruit
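As a self-contained variant of the Series.map approach, the sketch below additionally escapes each keyword with re.escape before joining them into the pattern. That step is an assumption on my part rather than something from the answer: it only matters if some of the roughly 3000 keywords contain regex metacharacters such as "+" or ".". The names pattern, matched, lookup and result are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'text': ['Apple is healthy', 'Potato is round', 'Apple might be green'],
    'category': ["", "", ""],
})
df_2 = pd.DataFrame({
    'keyword': ['Apple ', 'Potato'],
    'category': ["fruit", "vegetable"],
})

# Build one alternation pattern with a single capture group.
# re.escape makes every keyword match literally (assumed precaution).
pattern = "(" + "|".join(re.escape(k) for k in df_2["keyword"]) + ")"

# Extract the first keyword found in each text; rows without any
# keyword yield NaN.
matched = df["text"].str.extract(pattern, expand=False)

# Turn df_2 into a keyword -> category lookup Series and map through it.
lookup = df_2.set_index("keyword")["category"]
df["category"] = matched.map(lookup)

result = df["category"].tolist()
print(df)
```

Rows whose text contains none of the keywords end up with NaN in category; if an empty string is preferred there, chaining .fillna("") after map is one option.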