python - 如果数据框中包含来自另一个数据框中的任何元素,如何在数据框中分配类别?
问题描述
我有两张excel表格。一个包含摘要,另一个包含具有潜在过滤词的类别。如果第二个数据框中的任何元素匹配,我需要为第一个数据框分配类别。
我试图通过将术语与第一个数据框中的任何单词匹配来扩展第二个数据框中的列表并进行映射。
用于测试的数据。
import pandas as pd
data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}
data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
Bucket Summary
0 basket This is a basket of red apples. They are sour.
1 bushel We found a bushel of fruit. They are red and s...
2 peck There is a peck of pears that taste sweet. The...
3 box We have a box of plums. They are sour and have...
print(df2)
Category Filters
0 Fruit apple, pear, plum, grape
1 Color red, purple, green
这行脚本将表中的 Category 列转换为列表以供以后使用。
category_list = df2['Category'].values
category_list = list(set(category_list))
尝试匹配文本。
for item in category_list:
item = df2.loc[df2['Category'] == item]
filter_list = item['Filters'].values
filter_list = list(set(filter_list))
df1 = df1 [df1 ['Summary'].isin(filter_list)]
我希望第一个数据框有分配给它的类别,用逗号分隔。
结果:
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit We have a box of plums. They are sour and have...
我希望这很清楚。我已经用头撞了一个星期了。
先感谢您
解决方案
使用pandas.Series.str.contains通过循环检查过滤器:
df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Fruit=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Fruit']]).any()
Color=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Color']]).any()
print(Fruit)
print(Color)
0 True
1 False
2 True
3 True
dtype: bool
0 True
1 True
2 True
3 False
dtype: bool
然后使用np.where和Series.str.cat来获取数据帧输出:
df1['Fruit']=np.where(Fruit,'Fruit','')
df1['Color']=np.where(Color,'Color','')
df1['Category']=df1['Fruit'].str.cat(df1['Color'],sep=', ')
df1=df1[['Bucket','Category','Summary']]
print(df1)
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel , Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit, We have a box of plums. They are sour and have...
到 n 类别过滤器:
df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Categories=[pd.Series(np.where(( pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters'][category_filter]]).any() ),category_filter,'')) for category_filter in df2['Category']]
df1['Category']=Categories[0].str.cat(Categories[1:],sep=', ')
df1=df1.reindex(columns=['Bucket','Category','Summary'])
print(df1)
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel , Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit, We have a box of plums. They are sour and have...
推荐阅读
- azure - Azure Active Directory - 获取租户列表
- unique - nativescript-nfc rfid阅读器插件唯一ID问题
- c - 产生关于“左值”的错误。原因是什么?
- android - 将 Gatt BLE 通知从 tizen 可穿戴应用程序发送到 Android 应用程序
- javascript - 在 redux 中更新状态
- javascript - 在 Django 模板中执行 JavaScript 函数
- android - ConstraintSet 不适用于嵌套的 ConstraintLayouts
- laravel - Laravel - 路线[公司]未定义
- python - 使用 Python 导出 4 字节浮点数
- reactjs - 使用 React Hooks 跨 Material-UI Stepper 传递数据