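python - Faster collections.Counter-like operation for pandas Series
Problem description
Hello, I am currently doing the following to find all items that occur exactly once across multiple pandas Series: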
In [44]: data = [Series([1,2,7,4]), Series([2,5,3,1]), Series([3, 2, 4])]
In [45]: counts = Counter(chain.from_iterable(data))
In [46]: unique_occurrences = [item for item, count in counts.items() if count == 1]
In [47]: unique_occurrences
Out[47]: [7, 5]
Since the real data is large, is there a way to speed this up?
Thanks.
Feedback on the answers
Code:
from collections import Counter
from itertools import chain

import numpy as np
import pandas as pd
from pandas import Series


def uniq_0(data):  # Original
    counts = Counter(chain.from_iterable(data))
    return [item for item, count in counts.items() if count == 1]


def uniq_1(data):  # Divakar #1
    a = np.concatenate(data)
    unq, c = np.unique(a, return_counts=True)
    return unq[c == 1]


def uniq_2(data):  # Divakar #2
    a = np.concatenate(data)
    return np.flatnonzero(np.bincount(a) == 1)


def uniq_3(data):  # Divakar #3
    counts = Counter(chain.from_iterable(data))
    k = np.array(list(counts.keys()))
    v = np.array(list(counts.values()))
    return k[v == 1]


def uniq_4(data):  # Divakar #4
    L = max([i.max() for i in data]) + 1
    return np.flatnonzero(np.sum([np.bincount(i, minlength=L)
                                  for i in data], axis=0) == 1)


def uniq_5(data):  # Divakar #5
    L = max([i.max() for i in data]) + 1
    sums = np.zeros(L, dtype=int)
    for i in data:
        sums += np.bincount(i, minlength=L)
    return np.flatnonzero(sums == 1)


def uniq_6(data):  # Erfan
    v = pd.concat(data).value_counts()
    return v.index[v == 1]


if __name__ == '__main__':
    data = [Series([1,2,7,4]), Series([2,5,3,1]), Series([3, 2, 4])]
    funcs = [uniq_0, uniq_1, uniq_2, uniq_3, uniq_4, uniq_5, uniq_6]
    answers = [f(data) for f in funcs]
    # The original approach is the reference; compare as sets because
    # the functions return different container types and orderings.
    golden = set(answers[0])
    check = [set(a) == golden for a in answers]
    for n, a in enumerate(answers):
        if set(a) != golden:
            print(f' Error with uniq_{n}(data)')
        else:
            print(f' Confirmed uniq_{n}(data) == golden')
Spyder session:
Confirmed uniq_0(data) == golden
Confirmed uniq_1(data) == golden
Confirmed uniq_2(data) == golden
Confirmed uniq_3(data) == golden
Confirmed uniq_4(data) == golden
Confirmed uniq_5(data) == golden
Confirmed uniq_6(data) == golden
In [73]: # 1000 Series. Averaging 10000.0 ints/Series. 405 ints unique.
In [74]: for f in funcs:
...: print(f.__name__, end=': ')
...: %timeit -r 3 f(data2)
uniq_0: 2.21 s ± 18.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
uniq_1: 465 ms ± 2.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
uniq_2: 126 ms ± 215 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
uniq_3: 2.22 s ± 48.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
uniq_4: 1.12 s ± 10.8 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
uniq_5: 374 ms ± 1.28 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
uniq_6: 831 ms ± 20.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
In [75]:
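The session above times the functions on `data2`, but the code that built it is not shown. The following is only a sketch of how comparable test data might be generated and timed outside IPython. The helper `make_benchmark_data`, its value range `high`, the length distribution, and the seed are all assumptions made for illustration, and it will not reproduce the "405 ints unique" figure. Plain NumPy arrays stand in for the Series; each could be wrapped with `pd.Series(...)`.

```python
import timeit

import numpy as np


def make_benchmark_data(n_series=1000, mean_len=10_000, high=1_000_000, seed=0):
    """Hypothetical generator: n_series integer arrays of roughly mean_len values."""
    rng = np.random.default_rng(seed)
    # Lengths vary uniformly around mean_len (an assumed distribution).
    lengths = rng.integers(mean_len // 2, mean_len * 3 // 2 + 1, size=n_series)
    # Values are non-negative so the bincount-based approaches also apply.
    return [rng.integers(0, high, size=n) for n in lengths]


def uniq_unique(data):
    """Same technique as uniq_1: concatenate, then count with np.unique."""
    unq, c = np.unique(np.concatenate(data), return_counts=True)
    return unq[c == 1]


# Kept deliberately small here so the sketch runs quickly; scale up to taste.
data2 = make_benchmark_data(n_series=20, mean_len=100, high=500)

elapsed = timeit.timeit(lambda: uniq_unique(data2), number=5)
print(f"uniq_unique: {elapsed / 5 * 1e3:.3f} ms per call")
```

The relative ranking of the approaches depends heavily on the value range: the `np.bincount` variants allocate an array as long as the largest value, so a wide range with few items favors them less.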
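Comments
Thank you very much. My actual data is larger, but it doesn't fit on this laptop; still, I feel I now have enough options to really tackle this. Thanks again!
Solution
Approach #1
Here's a NumPy array based one: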
a = np.concatenate(data)
unq,c = np.unique(a, return_counts=1)
out = unq[c==1]
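As a self-contained illustration of Approach #1, here is the same snippet run on the sample data from the question, with the intermediate arrays printed. NumPy arrays stand in for the pandas Series here, which is an assumption made to keep the sketch dependency-light; `np.concatenate` accepts a list of Series as well.

```python
import numpy as np

# NumPy arrays stand in for the pandas Series from the question.
data = [np.array([1, 2, 7, 4]), np.array([2, 5, 3, 1]), np.array([3, 2, 4])]

a = np.concatenate(data)                    # one flat array of all values
unq, c = np.unique(a, return_counts=True)   # sorted distinct values and their counts
out = unq[c == 1]                           # keep the values seen exactly once

print(unq.tolist())  # [1, 2, 3, 4, 5, 7]
print(c.tolist())    # [2, 3, 2, 2, 1, 1]
print(out.tolist())  # [5, 7]
```

Note that the result comes back sorted, whereas the `Counter` version in the question returns items in first-seen order (`[7, 5]`).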
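Approach #2 (for positive integer data)
For positive ints data (np.bincount requires non-negative integers), we can use np.bincount to get out directly from a: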
out = np.flatnonzero(np.bincount(a)==1) # a from app#1
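To make the restriction in Approach #2 concrete, the sketch below wraps the one-liner in a small helper that rejects input `np.bincount` cannot handle. The function name `once_only_bincount` and its validation policy are hypothetical, added only for illustration.

```python
import numpy as np


def once_only_bincount(a):
    """Values occurring exactly once in a 1-D array of non-negative integers."""
    a = np.asarray(a)
    if a.size == 0:
        return np.array([], dtype=np.intp)
    # np.bincount indexes by value, so it only works for non-negative integers.
    if not np.issubdtype(a.dtype, np.integer) or a.min() < 0:
        raise ValueError("np.bincount needs non-negative integers")
    counts = np.bincount(a)             # counts[v] == number of times v occurs
    return np.flatnonzero(counts == 1)  # indices (i.e. values) with count 1


a = np.array([1, 2, 7, 4, 2, 5, 3, 1, 3, 2, 4])
print(np.bincount(a).tolist())         # [0, 2, 3, 2, 2, 1, 0, 1]
print(once_only_bincount(a).tolist())  # [5, 7]
```

The counts array has `a.max() + 1` entries, so memory use grows with the largest value rather than with the number of distinct values.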
Approach #3
If we want to use counts, which we might prefer when dealing with a large number of Series, since concatenation could be slower in that case:
k = np.array(list(counts.keys()))
v = np.array(list(counts.values()))
out = k[v==1]
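For reference, a runnable version of Approach #3 including the construction of `counts` from the question. NumPy arrays again stand in for the Series, which is an assumption of this sketch.

```python
from collections import Counter
from itertools import chain

import numpy as np

# NumPy arrays stand in for the pandas Series from the question.
data = [np.array([1, 2, 7, 4]), np.array([2, 5, 3, 1]), np.array([3, 2, 4])]

counts = Counter(chain.from_iterable(data))  # tally without concatenating
k = np.array(list(counts.keys()))            # distinct values, first-seen order
v = np.array(list(counts.values()))          # matching counts
out = k[v == 1]                              # vectorized filter on the counts

print(out.tolist())  # [7, 5]
```

Only the final filtering step is vectorized here; the counting itself still runs in Python, which is consistent with `uniq_3` timing about the same as the original `uniq_0` in the session above.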
Approach #4 (for positive integer data)
With a large number of Series holding positive ints, we can use bincount on each Series and thus avoid concatenation:
L = max([i.max() for i in data])+1
out = np.flatnonzero(np.sum([np.bincount(i,minlength=L) for i in data],axis=0)==1)
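The intermediate result in Approach #4 is easier to see when the steps are split up. The sketch below does that on the question's sample data; the name `per_series` is introduced here only for illustration, and NumPy arrays stand in for the Series.

```python
import numpy as np

# NumPy arrays stand in for the pandas Series from the question.
data = [np.array([1, 2, 7, 4]), np.array([2, 5, 3, 1]), np.array([3, 2, 4])]

L = max(int(i.max()) for i in data) + 1  # common length for every count vector
# One row of counts per series; minlength pads them all to the same length L.
per_series = np.array([np.bincount(i, minlength=L) for i in data])
print(per_series.shape)  # (3, 8)

totals = per_series.sum(axis=0)          # total count of each value
print(totals.tolist())   # [0, 2, 3, 2, 2, 1, 0, 1]

out = np.flatnonzero(totals == 1)
print(out.tolist())      # [5, 7]
```

The stacked array has one row per series and `L` columns, so with many series and a large maximum value it can become far larger than the input itself.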
Approach #5 (for positive integer data)
This can be made more memory-efficient, like so:
L = max([i.max() for i in data])+1
sums = np.zeros(L,dtype=int)
for i in data:
    sums += np.bincount(i,minlength=L)
out = np.flatnonzero(sums==1)
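Packaged as a function, the accumulating variant might look like the sketch below. The name `once_only_accumulate` is hypothetical, and the code assumes every series is non-empty and holds non-negative integers; NumPy arrays stand in for the Series.

```python
import numpy as np


def once_only_accumulate(data):
    """Values occurring exactly once across all series (non-negative ints only)."""
    # First pass: find the largest value so every count vector has length L.
    L = max(int(np.max(s)) for s in data) + 1
    sums = np.zeros(L, dtype=np.int64)
    # Second pass: add each series' counts into a single running total,
    # so only one length-L accumulator is kept alive at a time.
    for s in data:
        sums += np.bincount(s, minlength=L)
    return np.flatnonzero(sums == 1)


data = [np.array([1, 2, 7, 4]), np.array([2, 5, 3, 1]), np.array([3, 2, 4])]
print(once_only_accumulate(data).tolist())  # [5, 7]
```

Compared with stacking all per-series counts, this holds one accumulator plus one temporary count vector, which matches the lower time reported for `uniq_5` versus `uniq_4` in the session above.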