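python - Counting the frequency of element combinations in Python
Problem description
I have the following df:
What I want to do is count the frequency of element combinations. For example:
- umbrella appears 8 times in the whole df
- detergent appears 5 times
- (beer, diaper) appears 2 times
- (beer, milk) appears 2 times
- (umbrella, milk, beer) appears 2 times
Compute the frequencies of all single items and all item combinations, and keep only the single items and combinations whose frequency is >= n, where n is any positive integer. For this example, let's say n -> {1, 2, 3, 4}.
I have been trying the following code: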
# candidate itemsets
records = []
num_records = len(data)  # assumed: data is the DataFrame (df) shown above
# generate a list of lists of products that were bought together (convert df to list of lists)
for i in range(0, num_records):
    records.append([str(data.values[i, j]) for j in range(0, len(data.columns))])
# clean list (delete NaN values)
records = [[x for x in y if str(x) != 'nan'] for y in records]
OUTPUT:
[['detergent'],
['bread', 'water'],
['bread', 'umbrella', 'milk', 'diaper', 'beer'],
['detergent', 'beer', 'umbrella', 'milk'],
['cheese', 'detergent', 'diaper', 'umbrella'],
['umbrella', 'water', 'beer'],
['umbrella', 'water'],
['water', 'umbrella'],
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
['umbrella', 'cheese', 'detergent', 'water', 'beer']]
Then:
setOfItems = []
newListOfItems = []
for item in records:
    if item in setOfItems:
        continue
    setOfItems.append(item)
    temp = list(item)
    occurence = records.count(item)
    temp.append(occurence)
    newListOfItems.append(temp)
OUTPUT:
['detergent', 1]
['bread', 'water', 1]
['bread', 'umbrella', 'milk', 'diaper', 'beer', 1]
['detergent', 'beer', 'umbrella', 'milk', 1]
['cheese', 'detergent', 'diaper', 'umbrella', 1]
['umbrella', 'water', 'beer', 1]
['umbrella', 'water', 1]
['water', 'umbrella', 1]
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella', 1]
['umbrella', 'cheese', 'detergent', 'water', 'beer', 1]
As you can see, it only counts the frequency of whole rows (from image 1), but my expected output is the one shown in the second image.
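To illustrate why the attempt above only counts whole rows: `list.count` compares each element of `records` with `item` by equality, so a row only matches when it is the same list in the same order. A small sketch on a hypothetical three-row sample (not the real df) shows the difference between that and counting the rows that contain a combination:

```python
# Hypothetical mini-sample, chosen only to show the two kinds of counting.
records = [
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['umbrella', 'water', 'beer'],
]

# list.count matches only rows equal to the given list, order included.
whole_row_count = records.count(['umbrella', 'water'])

# Counting rows that contain the combination needs a subset test instead.
wanted = {'umbrella', 'water'}
containing_count = sum(1 for row in records if wanted <= set(row))

print(whole_row_count)   # 1
print(containing_count)  # 3
```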
Solution
Interesting problem! I use itertools.combinations() to generate all possible combinations and collections.Counter() to count how often each combination occurs:
import itertools
from collections import Counter

import numpy as np
import pandas as pd

# create sample data
df = pd.DataFrame([
    ['detergent', np.nan],
    ['bread', 'water', None],
    ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['umbrella', 'water']
])

def get_all_combinations_without_nan_or_None(row):
    # remove nan, None and duplicate values; sort so that the same items
    # always produce the same tuple, whatever their order in the row
    items = sorted({value for value in row if isinstance(value, str)})
    # generate all possible combinations of the values in a row
    all_combinations = []
    for i in range(1, len(items) + 1):
        all_combinations.extend(itertools.combinations(items, i))
    return all_combinations

# get all possible combinations of values in a row
all_rows = df.apply(get_all_combinations_without_nan_or_None, axis=1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))
# use Counter to count how many there are of each combination
count_combinations = Counter(all_rows_flatten)
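The question also asks to keep only the items and combinations whose frequency is >= n, which the code above does not do. A minimal sketch of that step follows, run on the cleaned `records` list printed in the question instead of on a DataFrame. The helper names `count_itemsets` and `keep_frequent` are made up for this example, and each combination is stored as an alphabetically sorted tuple.

```python
from collections import Counter
from itertools import combinations

# The cleaned transactions printed in the question.
records = [
    ['detergent'],
    ['bread', 'water'],
    ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
    ['detergent', 'beer', 'umbrella', 'milk'],
    ['cheese', 'detergent', 'diaper', 'umbrella'],
    ['umbrella', 'water', 'beer'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
    ['umbrella', 'cheese', 'detergent', 'water', 'beer'],
]

def count_itemsets(rows):
    """Count, for every item combination, how many rows contain it."""
    counts = Counter()
    for row in rows:
        items = sorted(set(row))  # drop duplicates, fix the order
        for size in range(1, len(items) + 1):
            counts.update(combinations(items, size))
    return counts

def keep_frequent(counts, n):
    """Keep only the combinations that occur in at least n rows."""
    return {itemset: c for itemset, c in counts.items() if c >= n}

counts = count_itemsets(records)
frequent = keep_frequent(counts, 4)

print(counts[('umbrella',)])                 # 8
print(counts[('detergent',)])                # 5
print(counts[('beer', 'diaper')])            # 2
print(counts[('beer', 'milk', 'umbrella')])  # 2
```

Because the tuples are sorted, a lookup has to use alphabetical order too: `('beer', 'milk', 'umbrella')`, not `('umbrella', 'milk', 'beer')`.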
Documentation for collections.Counter(): https://docs.python.org/2/library/collections.html#collections.Counter
Documentation for itertools.combinations(): https://docs.python.org/2/library/itertools.html#itertools.combinations
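One thing to keep in mind with this approach: a row with k distinct items yields 2**k - 1 non-empty combinations, so long rows get expensive quickly. If only small combinations matter, the size can be capped. The sketch below assumes a hypothetical helper `combinations_up_to` with a `max_size` parameter; neither is part of the answer above.

```python
from itertools import combinations

def combinations_up_to(row, max_size):
    """Return all combinations of the row's distinct items with at most max_size items."""
    items = sorted(set(row))  # drop duplicates, fix the order
    result = []
    for size in range(1, min(max_size, len(items)) + 1):
        result.extend(combinations(items, size))
    return result

row = ['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella']

all_sizes = combinations_up_to(row, len(row))
at_most_pairs = combinations_up_to(row, 2)

print(len(all_sizes))      # 63, i.e. 2**6 - 1
print(len(at_most_pairs))  # 21, i.e. 6 single items + 15 pairs
```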