python - Efficient way to get the count of all values within a range in a group

Problem description

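I have already looked at this question, and a slightly modified version of its solution works on a sample, but on my full dataset (about 3 GB of data) it runs out of memory.

I am trying to find, for each row, the count of all values in its group (grouped by anchor) that fall within a range. The range is y_val ± (anchor_val / 20). Note that anchor_val is the same for every row of a given anchor, for example: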

ID	anchor	y_val	anchor_val
12	ab	80	40
13	ab	20	40
14	abc	80	50
15	abc	80	50
16	ab	81	40
17	abd	80	50

This would result in:

ID	anchor	y_val	anchor_val	(anchor_val / 20)	count
12	ab	80	40	2	1
13	ab	20	40	2	0
14	abc	80	50	2.5	1
15	abc	79	50	2.5	1
16	ab	81	40	2	1
17	abd	80	50	2.5	0

(I added the anchor_val / 20 column for clarity.)
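
Written as a formula, the count for row i could be stated as follows. This is only a restatement, assuming the bounds are inclusive and that a row does not count itself, which is how the code in the edit further down behaves:

```latex
\mathrm{count}_i = \bigl|\{\, j \neq i : \mathrm{anchor}_j = \mathrm{anchor}_i,\ |y_j - y_i| \le \mathrm{anchor\_val}_i / 20 \,\}\bigr|
```

For ID 12, for instance, y_val is 80 and anchor_val / 20 is 2, so the range is [78, 82]. Among the other rows with the same anchor, only ID 16 (y_val 81) falls inside it, which gives a count of 1.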

Edit:

The current code, which causes the out-of-memory error:

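    # Lower and upper bounds of each row's range: y_val ± anchor_val / 20
    df["rule_8_comp_low"] = df["y_val"] - df["anchor_val"] / 20
    df["rule_8_comp_high"] = df["y_val"] + df["anchor_val"] / 20
    # Self-join on the anchor: pairs every row with every row of its group,
    # so the intermediate frame grows with the square of each group's size
    m = df.reset_index().merge(
        df[["anchor_col", "y_val"]].reset_index(), on="anchor_col"
    )
    # Flag pairs where the other row's y_val is in range, excluding self-pairs
    m["rule_8_to_count"] = (
        m.y_val_y.ge(m.rule_8_comp_low)
        & m.y_val_y.le(m.rule_8_comp_high)
        & (m.index_x != m.index_y)
    )
    df["y_val_between"] = m.groupby("index_x").rule_8_to_count.sum()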
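
One possible way to avoid the quadratic self-join is to sort the values of each group once and locate both range bounds with a binary search. The sketch below is not taken from the original thread. It assumes the column names used in the code above (anchor_col, y_val, anchor_val), inclusive bounds, and no missing values in y_val. The function name count_within_range is hypothetical.

```python
import numpy as np
import pandas as pd


def count_within_range(df, group_col="anchor_col", val_col="y_val",
                       width_col="anchor_val"):
    """Count, per row, the other rows of the same group whose value lies in
    [y_val - anchor_val / 20, y_val + anchor_val / 20] (bounds inclusive)."""
    y_all = df[val_col].to_numpy(dtype=float)
    half_all = df[width_col].to_numpy(dtype=float) / 20
    out = np.zeros(len(df), dtype=np.int64)
    # GroupBy.indices maps each group key to the positional row indices
    for pos in df.groupby(group_col).indices.values():
        y = y_all[pos]
        half = half_all[pos]
        y_sorted = np.sort(y)
        # Number of group values strictly below the lower bound
        lo = np.searchsorted(y_sorted, y - half, side="left")
        # Number of group values less than or equal to the upper bound
        hi = np.searchsorted(y_sorted, y + half, side="right")
        # Values inside the range, minus the row itself
        out[pos] = hi - lo - 1
    return pd.Series(out, index=df.index)


# Sample data from the first table in the question
df = pd.DataFrame({
    "ID": [12, 13, 14, 15, 16, 17],
    "anchor_col": ["ab", "ab", "abc", "abc", "ab", "abd"],
    "y_val": [80, 20, 80, 80, 81, 80],
    "anchor_val": [40, 40, 50, 50, 40, 50],
})
df["y_val_between"] = count_within_range(df)
print(df["y_val_between"].tolist())  # [1, 0, 1, 1, 1, 0]
```

Apart from the sorted copy of one group at a time, this only allocates a few arrays of the same length as the frame, so memory use should grow linearly with the number of rows rather than with the square of the group sizes.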

Tags: python, pandas, numpy
