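python - pandas: counting how often a string occurs in one column, conditional on another column
Problem description
I have a very large dataframe about cars. It looks like this: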
   Text                                  Terms
0  Car's model porche year in data       [tech, window, tech]
1  we’re simply making fossil fuel cars  [brakes, window, Italy, nice]
2  Year of cars Ferrari to make          [Detroit, window, seats, engine]
3  reading the specs of Ferrari file     [tech, window, engine, v8, window]
4  likelihood Porche in the car list     [from, wheel, tech]
I also have these:
term_list = ['tech', 'engine', 'window']
cap_list = ['Ferrari', 'porche']
term_cap_dict = {'Ferrari': ['engine', 'window'], 'Porche': ['tech']}
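I want a result dataframe that counts how many times each term (in term_list) appears in the "Terms" column, counting an occurrence only when the "Text" column of the same row contains the corresponding key from term_cap_dict. For example, the conditional count of "tech" given Porche is 3, because only rows whose "Text" contains "Porche" are counted, even though "tech" appears 4 times in total. If the count is 0, or the conditioning text is not present, the conditional count defaults to 0. Desired output: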
   Terms   Cap      ConditionalCount
0  engine  Ferrari  2
1  engine  porche   0
2  tech    Ferrari  0
3  tech    porche   3
4  window  Ferrari  3
5  window  porche   1
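The "tech" given Porche figure from the example can be checked by hand. Below is a minimal sketch in plain Python, assuming that "contains" means a case-insensitive substring match (the question does not state this explicitly). The helper name conditional_count is made up for illustration, and the rows are copied from the question's sample data.

```python
# Sample rows copied from the question.
texts = [
    "Car's model porche year in data",
    "we’re simply making fossil fuel cars",
    "Year of cars Ferrari to make",
    "reading the specs of Ferrari file",
    "likelihood Porche in the car list",
]
terms = [
    ["tech", "window", "tech"],
    ["brakes", "engine", "Italy", "nice"],
    ["Detroit", "window", "seats", "engine"],
    ["tech", "window", "engine", "v8", "window"],
    ["from", "wheel", "tech"],
]


def conditional_count(term, cap):
    """Count `term` over the rows whose text mentions `cap`.

    Assumption: the match is a case-insensitive substring test.
    """
    return sum(
        row_terms.count(term)
        for text, row_terms in zip(texts, terms)
        if cap.lower() in text.lower()
    )


total_tech = sum(row_terms.count("tech") for row_terms in terms)
print(total_tech)                           # 4
print(conditional_count("tech", "Porche"))  # 3
```

Rows 0 and 4 mention the capability and contribute 2 and 1 occurrences of "tech"; the remaining occurrence sits in row 3, which only mentions Ferrari.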
Here is what I have so far (it only computes TotalCount, not the conditional count):
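from collections import Counter
from itertools import chain, product
import pandas as pd

# Normalise the capability keys and their terms to lower case
term_cap_dict = {k.lower(): list(map(str.lower, v)) for k, v in term_cap_dict.items()}
# Count every term across all rows of the 'Terms' column
terms_counter = Counter(chain.from_iterable(df['Terms']))
terms_series = pd.Series(terms_counter)
terms_df = pd.DataFrame({'Term': terms_series.index, 'TotalCount': terms_series.values})
# Keep only the terms of interest
df1 = terms_df[terms_df['Term'].isin(term_list)]
# Build every (term, capability) pair and attach the total counts
product_terms = product(term_list, cap_list)
df_cp = pd.DataFrame(product_terms, columns=['Terms', 'Capability'])
dff = df_cp.set_index('Terms').combine_first(df1.set_index('Term')).reset_index()
dff.rename(columns={'index': 'Terms'}, inplace=True)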
This gives the TotalCount:
   Terms   Capability  TotalCount
0  engine  Ferrari     3.0
1  engine  porche      3.0
2  tech    Ferrari     4.0
3  tech    porche      4.0
4  window  Ferrari     4.0
5  window  porche      4.0
From here, I don't know how to compute ConditionalCount. Any suggestions are appreciated.
df.to_dict()
{'Title': {0: "Car's model porche year in data",
1: 'we’re simply making fossil fuel cars',
2: 'Year of cars Ferrari to make',
3: 'reading the specs of Ferrari file',
4: 'likelihood Porche in the car list'},
'Terms': {0: ['tech', 'window', 'tech'],
1: ['brakes', 'engine', 'Italy', 'nice'],
2: ['Detroit', 'window', 'seats', 'engine'],
3: ['tech', 'window', 'engine', 'v8', 'window'],
4: ['from', 'wheel', 'tech']}}
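To reproduce the problem, the frame can be rebuilt from the dump. Below is a minimal sketch. One assumption is made: the dump's 'Title' column is treated as the 'Text' column shown in the table and used in the rest of the post, so it is renamed here.

```python
import pandas as pd

# Data copied from the df.to_dict() dump.
data = {
    "Title": {
        0: "Car's model porche year in data",
        1: "we’re simply making fossil fuel cars",
        2: "Year of cars Ferrari to make",
        3: "reading the specs of Ferrari file",
        4: "likelihood Porche in the car list",
    },
    "Terms": {
        0: ["tech", "window", "tech"],
        1: ["brakes", "engine", "Italy", "nice"],
        2: ["Detroit", "window", "seats", "engine"],
        3: ["tech", "window", "engine", "v8", "window"],
        4: ["from", "wheel", "tech"],
    },
}

# Assumption: 'Title' in the dump is the 'Text' column used elsewhere.
df = pd.DataFrame(data).rename(columns={"Title": "Text"})

term_list = ["tech", "engine", "window"]
cap_list = ["Ferrari", "porche"]
term_cap_dict = {"Ferrari": ["engine", "window"], "Porche": ["tech"]}

print(df)
```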
Solution
Update, counting only the (capability, term) pairs listed in term_cap_dict:
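# One row per (Text, term) pair
df1 = df.explode(column='Terms')
# Capture the first capability name found in each Text (case-sensitive, so 'Porche' in row 4 is not matched)
regcap = '|'.join(cap_list)
df1['Cap'] = df1['Text'].str.extract(f'({regcap})')
# Allowed (Cap, Terms) pairs taken from term_cap_dict
filter_df = pd.concat([pd.DataFrame({'Cap': cap, 'Terms': terms}) for cap, terms in term_cap_dict.items()])
filter_df = filter_df.apply(lambda x: x.str.lower())
df1 = df1.apply(lambda x: x.str.lower())
# The inner merge keeps only rows whose (Terms, Cap) pair is allowed
df1_filt = df1.merge(filter_df)
# Reindex on every (term, capability) pair so that missing pairs get a count of 0
idx = pd.MultiIndex.from_product([term_list, list(map(str.lower, cap_list))], names=['Term','Cap'])
df_out = df1_filt[df1_filt['Terms'].isin(term_list)].groupby(['Terms','Cap']).count()\
    .rename(columns= {'Text':'Count'})\
    .reindex(idx, fill_value=0).reset_index()
print(df_out)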
Output:
   Term    Cap      Count
0  tech    ferrari  0
1  tech    porche   2
2  engine  ferrari  2
3  engine  porche   0
4  window  ferrari  3
5  window  porche   0
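The update counts 2 for tech/porche, while the question expects 3. The gap comes from the capability pattern being matched case-sensitively, so "Porche" in the last row is missed. Below is a hedged sketch of a case-insensitive variant. It is not the answerer's code; the names long_df, allowed and matched are made up for illustration, and the sample frame is rebuilt so the snippet runs on its own.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Text": [
        "Car's model porche year in data",
        "we’re simply making fossil fuel cars",
        "Year of cars Ferrari to make",
        "reading the specs of Ferrari file",
        "likelihood Porche in the car list",
    ],
    "Terms": [
        ["tech", "window", "tech"],
        ["brakes", "engine", "Italy", "nice"],
        ["Detroit", "window", "seats", "engine"],
        ["tech", "window", "engine", "v8", "window"],
        ["from", "wheel", "tech"],
    ],
})
term_list = ["tech", "engine", "window"]
cap_list = ["Ferrari", "porche"]
term_cap_dict = {"Ferrari": ["engine", "window"], "Porche": ["tech"]}

caps = [c.lower() for c in cap_list]

# One row per (Text, term) pair, with a fresh unique index.
long_df = df.explode("Terms").reset_index(drop=True)
long_df["Terms"] = long_df["Terms"].str.lower()

# Case-insensitive capture of a capability name, normalised to lower case.
pattern = "(" + "|".join(map(re.escape, caps)) + ")"
long_df["Cap"] = (
    long_df["Text"].str.extract(pattern, flags=re.IGNORECASE, expand=False).str.lower()
)

# Allowed (Cap, Terms) pairs, lower-cased to match the columns above.
allowed = pd.DataFrame(
    [(cap.lower(), term.lower()) for cap, terms in term_cap_dict.items() for term in terms],
    columns=["Cap", "Terms"],
)

# The inner merge drops every row whose pair is not allowed.
matched = long_df.merge(allowed, on=["Cap", "Terms"])

# Count per pair; pairs that never occur are filled with 0.
idx = pd.MultiIndex.from_product([term_list, caps], names=["Terms", "Cap"])
out = (
    matched.groupby(["Terms", "Cap"]).size()
    .reindex(idx, fill_value=0)
    .reset_index(name="ConditionalCount")
)
print(out)
```

On the sample data this gives 3 for tech/porche, 2 for engine/ferrari and 3 for window/ferrari, with every other pair at 0.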
IIUC, try this:
# One row per (Text, term) pair
df1 = df.explode(column='Terms')
# Capture the first capability name found in each Text (case-sensitive)
regcap = '|'.join(cap_list)
df1['Cap'] = df1['Text'].str.extract(f'({regcap})')
# Count terms per (term, capability) pair; missing pairs are filled with 0
idx = pd.MultiIndex.from_product([term_list, cap_list], names=['Term','Cap'])
df_out = df1[df1['Terms'].isin(term_list)].groupby(['Terms','Cap']).count()\
    .rename(columns= {'Text':'Count'})\
    .reindex(idx, fill_value=0).reset_index()
print(df_out)
Output:
   Term    Cap      Count
0  tech    Ferrari  1
1  tech    porche   2
2  engine  Ferrari  2
3  engine  porche   0
4  window  Ferrari  3
5  window  porche   1
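The same unfiltered counts can also be laid out as a term-by-capability grid. Below is a sketch that uses pd.crosstab in place of groupby and reindex. This is a swapped-in technique, not the answerer's method; the names long_df and wide are made up for illustration, and the sample frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Text": [
        "Car's model porche year in data",
        "we’re simply making fossil fuel cars",
        "Year of cars Ferrari to make",
        "reading the specs of Ferrari file",
        "likelihood Porche in the car list",
    ],
    "Terms": [
        ["tech", "window", "tech"],
        ["brakes", "engine", "Italy", "nice"],
        ["Detroit", "window", "seats", "engine"],
        ["tech", "window", "engine", "v8", "window"],
        ["from", "wheel", "tech"],
    ],
})
term_list = ["tech", "engine", "window"]
cap_list = ["Ferrari", "porche"]

# One row per (Text, term) pair, with a fresh unique index.
long_df = df.explode("Terms").reset_index(drop=True)

# Case-sensitive match, as in the answer: "Porche" in row 4 is not captured.
pattern = "(" + "|".join(cap_list) + ")"
long_df["Cap"] = long_df["Text"].str.extract(pattern, expand=False)

# Rows are terms, columns are capabilities; absent pairs become 0.
wide = pd.crosstab(long_df["Terms"], long_df["Cap"]).reindex(
    index=term_list, columns=cap_list, fill_value=0
)
print(wide)
```

The grid holds the same six numbers as the output above, with one row per term and one column per capability.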