Copying pandas data faster under certain conditions

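Problem description

I have a DataFrame (df_main) into which I want to copy data, by looking up the matching rows and the required columns in another DataFrame (df_data).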

df_data
   name  Index     par_1   par_2 ... par_n
0    A1      1        a0      b0
1    A1      2        a1
2    A1      3        a2
3    A1      4        a3 
4    A2      2        a4
...    

df_main
   name Index_0  Index_1    
0    A1       1        2
1    A1       1        3
2    A1       1        4
3    A1       2        3 
4    A1       2        4
5    A1       3        4
...

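I want to copy the parameter columns from df_data into df_main, so that each row of df_main receives all the parameters of the df_data row that has the same name and index. I implemented this with for loops as shown below, but it is far too slow to be usable: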

import tqdm

# `columns` is assumed to be a list of the parameter column names
# (e.g. ['par_1', 'par_2', ...]) defined elsewhere in the script.

def data_copy(df, df_data, indice):
    '''indice: whether Index_0 or Index_1 is being checked'''
    # Get all distinct names in the dataset to loop over
    names = df['name'].unique()
    for name in tqdm.tqdm(names):
        # Get the unique index values for this specific name
        indexes = df[df['name'] == name][indice].unique()
        for index in indexes:
            # Rows of df_data that match this name and index
            mask = (df_data['Index'] == index) & (df_data['name'] == name)

            for col in columns:
                # For each parameter column, get the value at this index.
                # The original post read from an undefined `df_struc` here;
                # df_data is presumably what was meant.
                val = df_data.loc[mask, col]
                df.loc[(df[indice] == index) & (df['name'] == name), col] = val[val.index.item()]
    return df

df_main = data_copy(df_main, df_data, 'Index_0') 

This gives me what I need:

df_main
   name Index_0  Index_1   par_1    par_2 ...
0    A1       1        2      a0
1    A1       1        3      a0    
2    A1       1        4      a0
3    A1       2        3      a1
4    A1       2        4      a1
5    A1       3        4      a2

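However, running it on very large data takes a long time. What is the best way to avoid the for loops and speed up the implementation?

Tags: python, pandas, for-loop, bigdata

Solution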


For each DataFrame, you can create a new column that concatenates the name and the index. See below:

import pandas as pd

df1 = {'name':['A1','A1'],'index':['1','2'],'par_1':['a0','a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index'] 
df1

df2 = {'name':['A1','A1'],'index_0':['1','2'],'index_1':['2','3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0'] 
df2

for i, row in df1.iterrows():
    df2.loc[(df2['new'] == row['new']) , 'par_1'] = row['par_1']
df2 

Result:

    name index_0 index_1 new    par_1
0   A1   1       2       A11    a0
1   A1   2       3       A12    a1
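
The snippet above still loops over df1 row by row, so it may not remove the bottleneck on large data. Also, a concatenated string key can be ambiguous: 'A1' + '12' and 'A11' + '2' both give 'A112'. As a possible alternative, here is a minimal sketch that joins on the two key columns directly with DataFrame.merge. The helper name copy_params and the toy values are made up for illustration, and the sketch assumes each (name, Index) pair appears at most once in df_data.

```python
import pandas as pd

# Toy frames mirroring the layout in the question (values are illustrative).
df_data = pd.DataFrame({
    "name": ["A1", "A1", "A1", "A1", "A2"],
    "Index": [1, 2, 3, 4, 2],
    "par_1": ["a0", "a1", "a2", "a3", "a4"],
})
df_main = pd.DataFrame({
    "name": ["A1"] * 6,
    "Index_0": [1, 1, 1, 2, 2, 3],
    "Index_1": [2, 3, 4, 3, 4, 4],
})


def copy_params(df_main, df_data, index_col):
    """Hypothetical helper: left-join df_data's parameter columns onto
    df_main, matching (name, index_col) against (name, Index)."""
    merged = df_main.merge(
        df_data,
        how="left",                      # keep every row of df_main
        left_on=["name", index_col],
        right_on=["name", "Index"],
        validate="many_to_one",          # raise if (name, Index) repeats in df_data
    )
    # The lookup key from df_data is no longer needed after the join.
    return merged.drop(columns="Index")


result = copy_params(df_main, df_data, "Index_0")
print(result)
```

Calling copy_params(df_main, df_data, "Index_1") would match on the second index column instead. Rows of df_main with no match in df_data get NaN in the parameter columns rather than being dropped, because the join is a left join.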
