首页 > 解决方案 > 如果数据框中包含来自另一个数据框中的任何元素,如何在数据框中分配类别?

问题描述

我有两张excel表格。一个包含摘要,另一个包含具有潜在过滤词的类别。如果第二个数据框中的任何元素匹配,我需要为第一个数据框分配类别。

我试图通过将术语与第一个数据框中的任何单词匹配来扩展第二个数据框中的列表并进行映射。

用于测试的数据。

import pandas as pd

data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}

data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}

df1 = pd.DataFrame(data1)

df2 = pd.DataFrame(data2)

print(df1)

   Bucket                                            Summary
0  basket     This is a basket of red apples. They are sour.
1  bushel  We found a bushel of fruit. They are red and s...
2    peck  There is a peck of pears that taste sweet. The...
3     box  We have a box of plums. They are sour and have...

print(df2)

  Category                   Filters
0    Fruit  apple, pear, plum, grape
1    Color        red, purple, green

这行脚本将表中的 Category 列转换为列表以供以后使用。

category_list =  df2['Category'].values

category_list = list(set(category_list))

尝试匹配文本。

for item in category_list:

    item = df2.loc[df2['Category'] == item]

    filter_list =  item['Filters'].values

    filter_list = list(set(filter_list))

    df1 = df1 [df1 ['Summary'].isin(filter_list)] 

我希望第一个数据框有分配给它的类别,用逗号分隔。

结果:

Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel         Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box         Fruit  We have a box of plums. They are sour and have...

我希望这很清楚。我已经用头撞了一个星期了。

先感谢您

标签: pythonpandasdataframefiltering

解决方案


使用pandas.Series.str.contains通过循环检查过滤器:

df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Fruit=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Fruit']]).any()
Color=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Color']]).any()
print(Fruit)
print(Color)

0     True
1    False
2     True
3     True
dtype: bool 

0     True
1     True
2     True
3    False
dtype: bool

然后使用np.whereSeries.str.cat来获取数据帧输出:

df1['Fruit']=np.where(Fruit,'Fruit','')
df1['Color']=np.where(Color,'Color','')
df1['Category']=df1['Fruit'].str.cat(df1['Color'],sep=', ')
df1=df1[['Bucket','Category','Summary']]
print(df1)

   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel       , Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box       Fruit,   We have a box of plums. They are sour and have...

到 n 类别过滤器:

df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Categories=[pd.Series(np.where(( pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters'][category_filter]]).any() ),category_filter,'')) for category_filter in df2['Category']]
df1['Category']=Categories[0].str.cat(Categories[1:],sep=', ')
df1=df1.reindex(columns=['Bucket','Category','Summary'])
print(df1)

   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel       , Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box       Fruit,   We have a box of plums. They are sour and have...

推荐阅读