python - Is there a way to compare and highlight differences between multiple csv files sequentially?
Problem description
I am a newbie in the programming space and there is something that I need to do across multiple folders that I feel will be easier if I can code it out.
I have a folder containing 12 csv files, and I need to compare a particular column across these files in python. The files share common columns and contain data collected in each of the twelve months of the year (Jan-Dec). Is there a way I can compare the January file with the February file, then the February file with the March file, the March file with the April file, and so on, highlighting the differences along the way and saving them in one dataframe, in python?
The data is numerical and I would like to run this comparison across this specific column.
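Solution
If you happen to have an index column, you can extract insertions/deletions by comparing the indexes of the dataframes (one per file). However, this only works if you have an identifier that is unique across files; that is, a single observation or row always has the same ID (the value in the index column), no matter which file it is in.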
import numpy as np
import pandas as pd


def series_diff(series1: pd.Series, series2: pd.Series) -> pd.DataFrame:
    """Compare two series via their indices, returning a table of differences.

    Returns the additions and deletions made to ``series1`` to obtain ``series2``.
    """
    added = series2.index.difference(series1.index)
    deleted = series1.index.difference(series2.index)
    return pd.concat(
        [
            series1.loc[deleted].to_frame(name="value").assign(action="deleted"),
            series2.loc[added].to_frame(name="value").assign(action="added"),
        ]
    )
For example, suppose you have the following files and want to compare the target column:
jan.csv:
id,filler1,target,filler2
0,spam,0.6059782788074047,eggs
1,spam,0.7333693611934982,eggs
2,spam,0.13894715672839875,eggs
3,spam,0.31267308385468695,eggs
4,spam,0.9972432813403187,eggs
5,spam,0.1281623754189607,eggs
6,spam,0.17899310595018803,eggs
7,spam,0.7529254287760938,eggs
8,spam,0.662160514309534,eggs
9,spam,0.7843101321411227,eggs
feb.csv:
id,filler1,target,filler2
0,spam,0.6059782788074047,eggs
1,spam,0.7333693611934982,eggs
2,spam,0.13894715672839875,eggs
4,spam,0.9972432813403187,eggs
5,spam,0.1281623754189607,eggs
6,spam,0.17899310595018803,eggs
8,spam,0.662160514309534,eggs
9,spam,0.7843101321411227,eggs
10,spam,0.09689439592486082,eggs
11,spam,0.058571285088035996,eggs
12,spam,0.9623959902103917,eggs
13,spam,0.6165574438945741,eggs
14,spam,0.08662996124854716,eggs
Here the index column is named id. Note that feb.csv contains additional rows with IDs 10 through 14, while rows 3 and 7 of jan.csv have been deleted.
Let's load the files:
jan = pd.read_csv("jan.csv", index_col="id")
feb = pd.read_csv("feb.csv", index_col="id")
And run the diff:
series_diff(jan["target"], feb["target"])
       value   action
id
3   0.312673  deleted
7   0.752925  deleted
10  0.096894    added
11  0.058571    added
12  0.962396    added
13  0.616557    added
14  0.086630    added
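To cover the sequential part of the question (January vs. February, February vs. March, and so on, all collected in one dataframe), one possible sketch is to loop over consecutive pairs and stack the results. The helper name `sequential_diffs` and the `from_file`/`to_file` columns are illustrative assumptions, not part of the original answer, and the data below is a tiny in-memory stand-in for the monthly files.

```python
import pandas as pd


def series_diff(series1: pd.Series, series2: pd.Series) -> pd.DataFrame:
    """Index-based diff: rows deleted from series1 and rows added in series2."""
    added = series2.index.difference(series1.index)
    deleted = series1.index.difference(series2.index)
    return pd.concat(
        [
            series1.loc[deleted].to_frame(name="value").assign(action="deleted"),
            series2.loc[added].to_frame(name="value").assign(action="added"),
        ]
    )


def sequential_diffs(monthly: dict[str, pd.Series]) -> pd.DataFrame:
    """Diff each series against the next one and stack the results.

    ``monthly`` maps a label (e.g. "jan") to a series, in chronological order.
    (Hypothetical helper, named here for illustration only.)
    """
    labels = list(monthly)
    pieces = []
    for prev, curr in zip(labels, labels[1:]):
        diff = series_diff(monthly[prev], monthly[curr])
        # Record which pair of files each difference came from.
        pieces.append(diff.assign(from_file=prev, to_file=curr))
    return pd.concat(pieces)


# Illustrative data only; real input would come from the monthly csv files.
monthly = {
    "jan": pd.Series([0.1, 0.2, 0.3], index=pd.Index([0, 1, 2], name="id")),
    "feb": pd.Series([0.1, 0.3, 0.4], index=pd.Index([0, 2, 3], name="id")),
    "mar": pd.Series([0.3, 0.4, 0.5], index=pd.Index([2, 3, 4], name="id")),
}
all_diffs = sequential_diffs(monthly)
print(all_diffs)
```

With real files, the dictionary could be built in month order from something like `pd.read_csv("jan.csv", index_col="id")["target"]`, assuming the remaining files follow the same naming pattern as jan.csv and feb.csv. Like `series_diff` itself, this only reports added and deleted IDs, not values that changed for an ID present in both files.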
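If you don't have an index column, it is hard to identify the differences precisely: for example, two rows with the same values could be the same observation, or they could be different observations that happen to have the same values.
If we assume that the order of the rows is not shuffled between files, and that any additions are made at the end of the previous table, one idea is to compare the rows line by line with a text diff, for example by using the difflib module, which will highlight additions and deletions.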
import difflib

print(
    *difflib.unified_diff(
        [f"{x}\n" for x in jan["target"]],
        [f"{x}\n" for x in feb["target"]],
        fromfile="jan",
        tofile="feb",
    ),
    sep="",
)
--- jan
+++ feb
@@ -1,10 +1,13 @@
 0.6059782788074047
 0.7333693611934982
 0.13894715672839875
-0.31267308385468695
 0.9972432813403187
 0.1281623754189607
 0.17899310595018803
-0.7529254287760938
 0.662160514309534
 0.7843101321411227
+0.09689439592486082
+0.058571285088035996
+0.9623959902103917
+0.6165574438945741
+0.08662996124854716
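If the positional diff should end up as data rather than printed text (for instance, to build a dataframe from it afterwards), a minimal sketch is to walk the opcodes of `difflib.SequenceMatcher` and collect the removed and added values. The function name `positional_diff` is a hypothetical helper introduced here, and the same ordering assumptions as above apply.

```python
import difflib


def positional_diff(old: list[str], new: list[str]) -> list[tuple[str, str]]:
    """Return (action, value) pairs for lines removed from ``old`` or added in ``new``."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" means a run of old lines was swapped for a run of new ones,
        # so it contributes both deletions and additions.
        if tag in ("delete", "replace"):
            changes.extend(("deleted", value) for value in old[i1:i2])
        if tag in ("insert", "replace"):
            changes.extend(("added", value) for value in new[j1:j2])
    return changes


# Illustrative values only, already converted to strings.
old = ["0.6", "0.7", "0.1", "0.3"]
new = ["0.6", "0.1", "0.3", "0.9"]
print(positional_diff(old, new))
```

For the files above, the inputs would be `[str(x) for x in jan["target"]]` and `[str(x) for x in feb["target"]]`; the resulting pairs can then be passed to `pd.DataFrame(..., columns=["action", "value"])`.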