Iterating over a pandas DataFrame

Question

What is the (best practice) correct way to iterate over DataFrames?

I am using:

for i in range(working.shape[0]):
    for j in range(1, working.shape[1]):
        working.iloc[i,j] = (100 - working.iloc[i,j])*100

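The above works, but it is inconsistent with other Stack Overflow answers. I am hoping someone can explain why it is not optimal and suggest a better implementation.

I am new to programming in general and to pandas in particular. Apologies for asking a question that has already been addressed on Stack Overflow: I did not really understand the usual answers to it. This is possibly a duplicate, but an answer that is easy for a newcomer to follow would help, even if it is less comprehensive.

Tags: python, pandas, dataframe, for-loop, iteration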

Answer


What is the (best practice) correct way to iterate over DataFrames?

There are several ways (for example iterrows), but in general you should avoid explicit iteration whenever a vectorized alternative exists. pandas offers several tools for vectorized operations, which operate on whole columns or blocks at once and will almost always be faster than an iterative solution.
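To make the contrast concrete, here is a minimal sketch that computes the same per-row total once with iterrows and once with column arithmetic. The frame and its column names are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the two styles.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-by-row: iterrows yields (index, row) pairs, where row is a Series.
totals_loop = []
for idx, row in df.iterrows():
    totals_loop.append(int(row["a"] + row["b"]))

# Vectorized: one expression over whole columns, no Python-level loop.
totals_vec = df["a"] + df["b"]

print(totals_loop)
print(totals_vec.tolist())
```

Both approaches give the same totals; the vectorized form pushes the loop down into pandas instead of running it in Python.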

The example you provided can be vectorized in the following way using iloc:

working.iloc[:, 1:] = (100 - working.iloc[:, 1:]) * 100
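As a sanity check, the sketch below applies both the nested loop from the question and the vectorized assignment to copies of the same small frame and compares the results. The sample data and the names original, looped and vectorized are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data; the first column is left untouched,
# mirroring range(1, ...) in the question's inner loop.
original = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30], "y": [95, 50, 0]})

# Element-by-element update, as in the question.
looped = original.copy()
for i in range(looped.shape[0]):
    for j in range(1, looped.shape[1]):
        looped.iloc[i, j] = (100 - looped.iloc[i, j]) * 100

# Vectorized update of every column except the first.
vectorized = original.copy()
vectorized.iloc[:, 1:] = (100 - vectorized.iloc[:, 1:]) * 100

# Element-wise comparison reduced to a single boolean.
same = bool((looped == vectorized).all().all())
print(same)
```

If the two versions agree, the comparison should print True, which suggests the one-line assignment is a drop-in replacement for the nested loop.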

Some timings:

from timeit import Timer

import pandas as pd

working = pd.DataFrame({'a': range(50), 'b': range(50)})


def iteration():
    for i in range(working.shape[0]):
        for j in range(1, working.shape[1]):
            working.iloc[i, j] = (100 - working.iloc[i, j]) * 100


def direct():
    # only the computation is timed here; in real code you would
    # assign the result back to working.iloc[:, 1:]
    (100 - working.iloc[:, 1:]) * 100


print(min(Timer(iteration).repeat(50, 50)))
print(min(Timer(direct).repeat(50, 50)))

Output (in seconds):

0.38473859999999993
0.05334049999999735

That is roughly a 7x difference, and that's with only 50 rows.
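To see how the gap behaves at other sizes, a variant of the benchmark can rebuild the frame on every run and include the assignment back in the vectorized case. This is only a sketch: the helper names make_frame, loop_update and vectorized_update are hypothetical, and the measured times will depend on your machine and pandas version.

```python
from timeit import timeit

import pandas as pd


def make_frame(n_rows):
    # Fresh frame per run so earlier updates do not affect later timings.
    return pd.DataFrame({"a": range(n_rows), "b": range(n_rows)})


def loop_update(df):
    # Element-by-element update, as in the question.
    for i in range(df.shape[0]):
        for j in range(1, df.shape[1]):
            df.iloc[i, j] = (100 - df.iloc[i, j]) * 100
    return df


def vectorized_update(df):
    # Single vectorized assignment, including the write back.
    df.iloc[:, 1:] = (100 - df.iloc[:, 1:]) * 100
    return df


for n_rows in (50, 500):
    loop_time = timeit(lambda: loop_update(make_frame(n_rows)), number=5)
    vec_time = timeit(lambda: vectorized_update(make_frame(n_rows)), number=5)
    print(f"{n_rows} rows: loop {loop_time:.4f}s, vectorized {vec_time:.4f}s")
```

Both timings include the cost of building the frame, so the comparison is between complete update paths rather than the arithmetic alone.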

