How to vectorize a Pandas comparison between 2 columns

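Problem description

I am trying to apply various filters to a pandas df. Where possible, I want to use vectorized operations instead of iterating over every row (which is what I do now, and it is unacceptably slow). However, some of the filters are not straightforward.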

import pandas as pd

productids = [
    '01t0J00000HcoqpQAB', '01t0J00000HcoqnQAB', '01t0J00000HcoqyQAB',
    '01t0J00000Hcor3QAB', '01t0J00000Hcor5QAB', '01t0J00000Hcor6QAB',
    '01t0J00000Hcor9QAB', '01t0J00000HcorCQAR', '01t0J00000HcorGQAR',
    '01t0J00000IDGAOQA5'
]

previous_products = [
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'},
{'01t0J00000Hcor3QAB', '01t0J00000IDGAOQA5', '01t0J00000Hcor5QAB', '01t0J00000HcoqyQAB', '01t0J00000Hcor9QAB', '01t0J00000HcorGQAR', '01t0J00000Hcor6QAB', '01t0J00000HcorCQAR', '01t0J00000HcoqnQAB'}
]

df_test = pd.DataFrame({'productids': productids, 'previous_products': previous_products}, index=range(len(productids)))

df_test

This is the filter I want to apply:

df_test.productids.isin(df_test.previous_products)

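The logic behind this is that I need to know, for each row, whether the id in column 1 is present in the set of ids in column 2. Column 2 is the result of a separate set of functions that compute each customer's previous products. What I am doing now looks roughly like this:

for i, row in df_test.iterrows():
    if row['productids'] in row['previous_products']:
        pass  # do more stuff
    else:
        pass  # do different stuff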

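The problem is that as the df gets larger, the loop takes a very long time to finish.

Are there any other suggestions for solving this?

Tags: python, pandas, dataframe, numpy

Solution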


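# Sample frame and a lookup table of the values to keep
df1 = pd.DataFrame([[1, 4], [2, 5], [3, 6]], columns=['col1', 'col2'])
df2 = pd.DataFrame([1, 2], columns=['lookup_col'])

# An inner merge keeps only the rows of df1 whose col1 appears in lookup_col
df_merge = df1.merge(df2, left_on='col1', right_on='lookup_col')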
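
The merge above filters against one shared lookup column, whereas the question needs a per-row check against a different set in each row. A minimal sketch of one way to do that follows, using a shortened copy of the question's data. It is an alternative to the merge, not the answerer's method, and it is not truly vectorized: the membership test still runs in Python, but it avoids the per-row overhead of iterrows. The names mask, seen_before and not_seen are made up for illustration.

```python
import pandas as pd

# Shortened version of the question's data: the first id is absent
# from its row's set, the other two are present.
df_test = pd.DataFrame({
    'productids': [
        '01t0J00000HcoqpQAB', '01t0J00000HcoqnQAB', '01t0J00000HcoqyQAB',
    ],
    'previous_products': [
        {'01t0J00000HcoqnQAB', '01t0J00000HcoqyQAB'},
        {'01t0J00000HcoqnQAB', '01t0J00000HcoqyQAB'},
        {'01t0J00000HcoqnQAB', '01t0J00000HcoqyQAB'},
    ],
})

# Row-wise set membership, collected into a boolean Series that is
# aligned with df_test's index so it can be used as a filter.
mask = pd.Series(
    [pid in prev for pid, prev in
     zip(df_test['productids'], df_test['previous_products'])],
    index=df_test.index,
)

# Split the frame once, then handle each branch on the whole subset
# instead of branching inside a row loop.
seen_before = df_test[mask]   # rows for "do more stuff"
not_seen = df_test[~mask]     # rows for "do different stuff"
```

Once the mask exists, each branch of the original loop can usually be written as an operation on seen_before or not_seen as a whole, which is where most of the speedup would come from.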
