pandas: merging two DataFrames (same columns) with a condition... how can I improve this code?

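Question

(Sorry, my English is not good...)

I am working with public data. I am trying to merge two Excel files under certain conditions. I tried a nested-loop approach, but it is far too slow... How can I improve my code?

Please help me TVT

An example of the data structure:

Old data (the full file, 整个文件.xlsx)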

            KeyCode       Date                     Something
    0       aaa           2020-01-01 00:00:00      adaf
    1       bbb           2020-02-01 00:00:00      awd
    2       ccc           2020-03-01 00:00:00      feq
    ...
    6000    aewi          2020-03-03 00:00:00      awefeaw

New data (file2.xlsx)

            KeyCode       Date                     Something
    1       bbb           2020-06-01 20:00:00      aafewfaewfaw
    2       ccc           2020-06-01 20:00:00      dfqefqe
    3       new           2020-06-01 20:00:00      newrow

Desired result (file3.xlsx)

            KeyCode       Date                     Something
    0       aaa           2020-01-01 00:00:00      adaf
    1       bbb           2020-06-01 20:00:00      aafewfaewfaw
    2       ccc           2020-06-01 20:00:00      dfqefqe
    ...
    6000    aewi          2020-03-03 00:00:00      awefeaw
    6001    new           2020-06-01 20:00:00      newrow

Code:

    import numpy as np
    import pandas as pd
    %matplotlib notebook
    import matplotlib.pyplot as plt
    
    data = pd.read_excel('fulldata_01_01_01_P_병원.xlsx', index_col='번호')
    tmp = pd.read_excel('(20200601~20200607)_01_01_01_P_병원.xlsx', index_col='번호')
    
    print('{} is tmp rows count'.format(len(tmp.index)))
    print('{} is data rows count'.format(len(data.index)))
    
    new_data = pd.DataFrame([])
    for j in range(len(tmp.index)):
        ischange = False;
        isexist = False;
        for i in range(len(data.index)):
            if (data.iloc[i].loc['KeyCode'] == tmp.iloc[j].loc['KeyCode']) and (data.iloc[i].loc['Date'] < tmp.iloc[j].loc['Date']) :
                ischange = True
                data.iloc[i] = tmp.iloc[j]
                break
            elif (data.iloc[i].loc['KeyCode'] == tmp.iloc[j].loc['KeyCode']) :
                isexist = True
                break
                
        if ischange :
            print('{} is change'.format(j))
        elif isexist :
            print('{} is exist'.format(j))
        elif not ischange and not isexist :
            print('{} is append'.format(j))
            new_data.append(tmp.iloc[j], ignore_index=True)
    
    data.append(new_data, ignore_index=True)
    print('{} is tmp rows count'.format(len(tmp.index)))
    print('{} is data rows count'.format(len(data.index)))

But... it doesn't work...

Tags: python, pandas

Answer


If you just want to keep the rows that are new, plus the most recent version of the rows that already exist:

    result = pd.concat([data, tmp], ignore_index=True, sort=False)
    result = result.sort_values(['KeyCode', 'Date'], ascending=[True, True])  # oldest row first within each KeyCode
    result = result.drop_duplicates('KeyCode', keep='last')  # keep the newest row, drop old data
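
For reference, here is a minimal, self-contained sketch of the same concat / sort / drop_duplicates idea, run on the sample rows from the question. The helper name `merge_latest` and its `key` / `date` parameters are made up for illustration, and the small frames stand in for the real Excel files. It also restores the original row order (existing keys first, new keys at the end), which is an assumption about what the desired file3.xlsx layout requires.

```python
import pandas as pd


def merge_latest(old, new, key="KeyCode", date="Date"):
    """Combine old and new, keeping the most recent row for each key.

    Hypothetical helper for illustration; not part of pandas.
    """
    combined = pd.concat([old, new], ignore_index=True, sort=False)
    # Remember the order in which each key first appears (old rows, then new keys).
    order = combined[key].drop_duplicates().tolist()
    # Stable sort by date: the newest row per key ends up last; on a date tie
    # the row from `new` stays after the row from `old`, so it is the one kept.
    combined = combined.sort_values(date, kind="mergesort")
    latest = combined.drop_duplicates(key, keep="last")
    # Put the rows back in first-appearance order and rebuild a clean index.
    return latest.set_index(key).loc[order].reset_index()


# Sample rows taken from the question (the real data has about 6000 rows).
old = pd.DataFrame({
    "KeyCode": ["aaa", "bbb", "ccc", "aewi"],
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-03-03"]),
    "Something": ["adaf", "awd", "feq", "awefeaw"],
})
new = pd.DataFrame({
    "KeyCode": ["bbb", "ccc", "new"],
    "Date": pd.to_datetime(["2020-06-01 20:00:00"] * 3),
    "Something": ["aafewfaewfaw", "dfqefqe", "newrow"],
})

merged = merge_latest(old, new)
print(merged)
```

With the real files, `old` and `new` would come from `pd.read_excel(...)` as in the question, and the result could presumably be saved with `merged.to_excel('file3.xlsx', index=False)`. As for why the loop in the question appeared to do nothing: `DataFrame.append` returned a new DataFrame instead of modifying the existing one, so its result was discarded in both places it was called. The method was also removed in pandas 2.0, where `pd.concat` is the replacement.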
