首页 > 解决方案 > pandas - 遍历行并计算 - 更快

问题描述

我已经有了一个解决方案 - 但它非常慢(800 行需要 13 分钟)。这是数据框的示例:

import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
df

在一个新列中,我想计算 col2 的先前值中有多少(例如三个)大于或等于 col1 的行值。我也继续第一行。

这是我的慢代码:

start_at_nr = 3 #variable in which row start to calculate
df["overlap_count"] = "" #create new column

for row in range(len(df)):
    if row <= start_at_nr - 1:
       df["overlap_count"].loc[row] = "x"
    else:
       df["overlap_count"].loc[row] = (
           df["col2"].loc[row - start_at_nr:row - 1] >=
           (df["col1"].loc[row])).sum()

df

我获得了更快的解决方案 - 感谢您抽出宝贵时间!

这是我得到的结果:

col1    col2    overlap_count
0   20  39  x
1   23  32  x
2   40  42  x
3   41  50  1
4   48  63  1
5   49  68  2
6   50  68  3
7   50  69  3

标签: pythonpandasloopspandas-apply

解决方案


IIUC,你可以这样做:

df['overlap_count'] = 0
for i in range(1,start_at_nr+1):
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))

# mask the first few rows
df.iloc[:start_at_nr, -1] = np.nan

输出:

   col1  col2  overlap_count
0    20    39            NaN
1    23    32            NaN
2    40    42            NaN
3    41    50            1.0
4    48    63            1.0
5    49    68            2.0
6    50    68            3.0
7    50    69            3.0

800 行和start_at_nr=3.


推荐阅读