Compare DataFrames/CSVs and return only the columns that differ, including the key value

Question

I have two CSV files that I am comparing, and I want to return only the columns whose values differ.

df1
Country    1980     1981     1982     1983     1984
Bermuda    0.00793  0.00687  0.00727  0.00971  0.00752
Canada     9.6947   9.58952  9.20637  9.18989  9.78546
Greenland  0.00791  0.00746  0.00722  0.00505  0.00799
Mexico     3.72819  4.11969  4.33477  4.06414  4.18464

df2
Country    1980     1981     1982     1983     1984
Bermuda    0.77777  0.00687  0.00727  0.00971  0.00752
Canada     9.6947   9.58952  9.20637  9.18989  9.78546
Greenland  0.00791  0.00746  0.00722  0.00505  0.00799
Mexico     3.72819  4.11969  4.33477  4.06414  4.18464

import pandas as pd
import numpy as np


df1=pd.read_csv('csv1.csv')
df2=pd.read_csv('csv2.csv')



def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        # The original bare string was a no-op; print it so the message is visible
        print("Data types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        print("Dataframes are the same")
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['Country', 'Column']
        difference_locations = np.where(diff_mask)
        # Keep every changed value, not just the first one
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        result = pd.DataFrame({'From': changed_from, 'To': changed_to},
                              index=changed.index)
        print(result)
        return result



diff_pd(df1,df2)

My current output is:

                   From       To
Country Column                  
0       1980    0.00793  0.77777

So I would like to get the country name of the row with the mismatched value instead of the index 0. Below is an example.

The output I want is:

                   From       To
Country Column                  
Bermuda  1980    0.00793  0.77777

Thanks to anyone who can provide a solution.

Tags: python, pandas, numpy, dataframe

Solution


A shorter approach, renaming along the way:

def process_df(df):
    res = df.set_index('Country').stack()
    res.index.rename('Column', level=1, inplace=True)
    return res

df1 = process_df(df1)
df2 = process_df(df2)
mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
df3 = pd.concat([df1[mask], df2[mask]], axis=1).rename({0:'From', 1:'To'}, axis=1)
df3
                   From       To
Country Column                  
Bermuda 1980    0.00793  0.77777
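
The question's output shows 0 because `Country` was left as an ordinary column, so the row labels were the default integer positions. Setting `Country` as the index before stacking is what puts the country name into the result. The following is a minimal, self-contained sketch of that idea, not a definitive implementation. It assumes both frames contain the same countries and year columns. The inline CSV text stands in for csv1.csv and csv2.csv, and the helper names `to_long` and `diff_by_country` are hypothetical, introduced only for illustration.

```python
import io

import pandas as pd

# Sample data copied from the question. The inline text replaces the CSV
# files (an assumption, so that the sketch runs without external files).
CSV1 = """Country,1980,1981,1982,1983,1984
Bermuda,0.00793,0.00687,0.00727,0.00971,0.00752
Canada,9.6947,9.58952,9.20637,9.18989,9.78546
Greenland,0.00791,0.00746,0.00722,0.00505,0.00799
Mexico,3.72819,4.11969,4.33477,4.06414,4.18464
"""
# The second file differs only in Bermuda's 1980 value, as in the question.
CSV2 = CSV1.replace("Bermuda,0.00793", "Bermuda,0.77777")


def to_long(df):
    """Hypothetical helper: index by Country, then stack the year columns."""
    stacked = df.set_index("Country").stack()
    stacked.index.names = ["Country", "Column"]
    return stacked


def diff_by_country(left, right):
    """Hypothetical helper: return differing cells keyed by (Country, Column)."""
    a, b = to_long(left), to_long(right)
    # NaN != NaN evaluates to True, so cells missing in both frames are excluded.
    mask = (a != b) & ~(a.isnull() & b.isnull())
    return pd.DataFrame({"From": a[mask], "To": b[mask]})


df1 = pd.read_csv(io.StringIO(CSV1))
df2 = pd.read_csv(io.StringIO(CSV2))
result = diff_by_country(df1, df2)
print(result)
```

When reading from real files, passing `index_col='Country'` to `pd.read_csv` should have the same effect as calling `set_index('Country')` afterwards.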
