Is there a way to compare and highlight differences between multiple csv files sequentially?

Problem description

I am a newbie in the programming space and there is something that I need to do across multiple folders that I feel will be easier if I can code it out.

I have a folder containing 12 CSV files, and I need to compare one particular column across them in Python. The files share the same columns and hold data collected in each of the twelve months of the year (Jan-Dec). Is there a way to compare the January file with the February file, then February with March, March with April, and so on, highlighting the differences and saving them all in one dataframe?

The data is numerical and I would like to run this comparison across this specific column.

Tags: python, pandas

Solution


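If you happen to have an index column, you can extract the insertions and deletions by comparing the indices of the dataframes (one per file). However, this only works if you have an identifier that is unique across files; that is, a single observation or row always has the same ID (the value in the index column), regardless of which file it is in.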

import pandas as pd


def series_diff(series1: pd.Series, series2: pd.Series) -> pd.DataFrame:
    """Compare two series via their indices, returning a table of differences.

    Returns the additions and deletions made to ``series1`` to obtain ``series2``.
    """
    added = series2.index.difference(series1.index)
    deleted = series1.index.difference(series2.index)

    return pd.concat(
        [
            series1.loc[deleted].to_frame(name="value").assign(action="deleted"),
            series2.loc[added].to_frame(name="value").assign(action="added"),
        ]
    )

For example, if you have the following files and want to compare the target column:

  • jan.csv
    id,filler1,target,filler2
    0,spam,0.6059782788074047,eggs
    1,spam,0.7333693611934982,eggs
    2,spam,0.13894715672839875,eggs
    3,spam,0.31267308385468695,eggs
    4,spam,0.9972432813403187,eggs
    5,spam,0.1281623754189607,eggs
    6,spam,0.17899310595018803,eggs
    7,spam,0.7529254287760938,eggs
    8,spam,0.662160514309534,eggs
    9,spam,0.7843101321411227,eggs
    
  • feb.csv
    id,filler1,target,filler2
    0,spam,0.6059782788074047,eggs
    1,spam,0.7333693611934982,eggs
    2,spam,0.13894715672839875,eggs
    4,spam,0.9972432813403187,eggs
    5,spam,0.1281623754189607,eggs
    6,spam,0.17899310595018803,eggs
    8,spam,0.662160514309534,eggs
    9,spam,0.7843101321411227,eggs
    10,spam,0.09689439592486082,eggs
    11,spam,0.058571285088035996,eggs
    12,spam,0.9623959902103917,eggs
    13,spam,0.6165574438945741,eggs
    14,spam,0.08662996124854716,eggs
    

Here the index column is named id. Note that feb.csv contains additional rows with IDs 10 to 14, while rows 3 and 7 from jan.csv have been deleted.

Let's load the files:

jan = pd.read_csv("jan.csv", index_col="id")
feb = pd.read_csv("feb.csv", index_col="id")

And run the diff:

series_diff(jan["target"], feb["target"])
       value   action
id                   
3   0.312673  deleted
7   0.752925  deleted
10  0.096894    added
11  0.058571    added
12  0.962396    added
13  0.616557    added
14  0.086630    added
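
The question asks for this comparison to run sequentially (January vs February, February vs March, and so on) with everything saved in one dataframe. A minimal sketch of one way to chain the pairwise diff is below. The helper name sequential_diffs, the extra compared column, and the small in-memory frames are assumptions made for illustration; series_diff is repeated only so that the block runs on its own.

```python
import pandas as pd


def series_diff(series1: pd.Series, series2: pd.Series) -> pd.DataFrame:
    """Additions and deletions needed to turn ``series1`` into ``series2``."""
    added = series2.index.difference(series1.index)
    deleted = series1.index.difference(series2.index)
    return pd.concat(
        [
            series1.loc[deleted].to_frame(name="value").assign(action="deleted"),
            series2.loc[added].to_frame(name="value").assign(action="added"),
        ]
    )


def sequential_diffs(frames: dict, column: str) -> pd.DataFrame:
    """Diff each frame against the next one and stack the results.

    ``frames`` maps a label (e.g. a month name) to a DataFrame indexed by
    the shared ID column; the dict's insertion order defines the sequence.
    (Hypothetical helper, not part of pandas.)
    """
    labels = list(frames)
    parts = []
    for old, new in zip(labels, labels[1:]):
        diff = series_diff(frames[old][column], frames[new][column])
        # Record which pair of files each difference came from.
        parts.append(diff.assign(compared=f"{old} -> {new}"))
    if not parts:  # fewer than two frames: nothing to compare
        return pd.DataFrame(columns=["value", "action", "compared"])
    return pd.concat(parts)


# Tiny made-up frames standing in for three monthly files.
jan = pd.DataFrame({"target": [0.1, 0.2, 0.3]}, index=pd.Index([0, 1, 2], name="id"))
feb = pd.DataFrame({"target": [0.1, 0.3, 0.4]}, index=pd.Index([0, 2, 3], name="id"))
mar = pd.DataFrame({"target": [0.3, 0.4]}, index=pd.Index([2, 3], name="id"))

result = sequential_diffs({"jan": jan, "feb": feb, "mar": mar}, "target")
print(result)
```

With real files, the dict could be built by reading each CSV in calendar order with pd.read_csv(..., index_col="id"), assuming every file uses the same ID column. Note that this sketch, like series_diff itself, only reports added and deleted IDs; it does not report rows whose value changed while the ID stayed the same.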

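If you don't have an index column, it is much harder to identify differences precisely: for example, two rows with the same value could be the same observation, or different observations that happen to share a value.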
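
To make that ambiguity concrete, here is a small sketch that compares two columns purely by how often each value occurs, using collections.Counter from the standard library. The function name value_count_diff and the sample numbers are made up for illustration. It can tell that one occurrence of a value disappeared, but not which row it was.

```python
from collections import Counter


def value_count_diff(old, new):
    """Compare two columns as multisets of values (hypothetical helper).

    Returns two dicts: values (with counts) that occur more often in
    ``old`` than in ``new``, and values that occur more often in ``new``.
    """
    old_counts, new_counts = Counter(old), Counter(new)
    # Counter subtraction keeps only positive counts.
    removed = old_counts - new_counts
    added = new_counts - old_counts
    return dict(removed), dict(added)


# 0.5 appears twice in the old column and once in the new one.
removed, added = value_count_diff([0.5, 0.5, 0.7], [0.5, 0.9])
print(removed)  # one occurrence of 0.5 and the single 0.7 are gone
print(added)    # 0.9 is new
```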

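If we assume that the order of the rows is not shuffled between files, and that any additions are made at the end of the previous table, one idea is to compare the rows line by line with a text diff, for example using the difflib module, which highlights additions and deletions.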

import difflib

print(
    *difflib.unified_diff(
        [f"{x}\n" for x in jan["target"]],
        [f"{x}\n" for x in feb["target"]],
        fromfile="jan",
        tofile="feb",
    ),
    sep="",
)
--- jan
+++ feb
@@ -1,10 +1,13 @@
 0.6059782788074047
 0.7333693611934982
 0.13894715672839875
-0.31267308385468695
 0.9972432813403187
 0.1281623754189607
 0.17899310595018803
-0.7529254287760938
 0.662160514309534
 0.7843101321411227
+0.09689439592486082
+0.058571285088035996
+0.9623959902103917
+0.6165574438945741
+0.08662996124854716
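
If the differences are needed as data rather than printed text, the same standard-library module exposes difflib.SequenceMatcher, whose get_opcodes() method describes how one sequence turns into the other. The sketch below relies on the same assumption as above (row order is preserved between files). The function name positional_diff and the shortened sample values are illustrative only.

```python
import difflib


def positional_diff(old: list, new: list) -> list:
    """Return (action, position, value) tuples describing how ``old`` becomes ``new``.

    Positions are 0-based row numbers: in ``old`` for deletions and in
    ``new`` for additions. (Hypothetical helper.)
    """
    changes = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" opcode is treated as a deletion followed by an addition.
        if tag in ("delete", "replace"):
            changes.extend(("deleted", i, old[i]) for i in range(i1, i2))
        if tag in ("insert", "replace"):
            changes.extend(("added", j, new[j]) for j in range(j1, j2))
    return changes


# Shortened, made-up values standing in for two monthly columns.
jan_values = [0.61, 0.73, 0.14, 0.31, 0.99]
feb_values = [0.61, 0.73, 0.14, 0.99, 0.10, 0.06]

changes = positional_diff(jan_values, feb_values)
for action, position, value in changes:
    print(action, position, value)
```

The resulting tuples can be passed to pd.DataFrame(changes, columns=["action", "position", "value"]) to collect them in a table alongside the other results.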
