perl - UNIX - Compare 2 files based on primary key column(s)
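Problem description
I need to compare 2 files column by column based on a primary key (the key can consist of one or more columns). The comparison should generate 3 CSV files as output: differences, extra records in file1, and extra records in file2.
Note: I tried sdiff, but it did not give the desired output.
Example:
Here the first column is the primary key.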
file1 :
abc 234 123
bcd 567 890
cde 678 789
file2 :
abc 234 012
bcd 532 890
cdf 678 789
Output files
differences file :
abc,234,123::012
bcd,567::532,890
extra records in file1 :
cde,678,789
extra records in file2
cdf,678,789
Solution
If the files fit comfortably in memory, this is easy to do in Perl using hashes. For example:
#!/bin/bash
# create test data files
>cmp.d1 cat <<'EOD'
abc 234 123
bcd 567 890
cde 678 789
EOD
>cmp.d2 cat <<'EOD'
abc 234 012
bcd 532 890
cdf 678 789
EOD
# create script
>dif.pl cat <<'EOD'
#!/usr/bin/perl -w
if ( $#ARGV!=0 or ! -f "$ARGV[0]" ) {
    die "Usage: $0 file1 <file2\n";
}
@KEYS = ( 0 ); # list of columns to use for primary key
# read file1 from filename given on commandline
while (<<>>) {
    chomp;
    @a1 = (split); # split line into individual fields
    $k = join "\0", @a1[ @KEYS ];
    # if $k is not unique, only final line is kept
    warn "duplicate key: $k\n" if exists $h1{$k};
    # store line in %h1 for later use
    $h1{$k} = [ @a1 ];
}
# now read file2 from stdin
# process each line as we read it
while (<STDIN>) {
    chomp;
    @a2 = (split); # split line into individual fields
    $k = join "\0", @a2[ @KEYS ];
    if ( exists $h1{$k} ) {
        # record exists in both files
        # calculate differences
        @a1 = @{ $h1{$k} }; # retrieve file1 version
        # overwrite any difference fields in @a2
        # ($_ is an alias, so assigning to it updates @a2 in place)
        for (@a2) {
            $a1 = shift @a1;
            $_ = "${a1}::$_" if $a1 ne $_;
        }
        # save difference records in %hd
        # (rows with identical fields are stored too, without any "::")
        $hd{$k} = [ @a2 ];
        # this will not be an extra file1 record
        delete $h1{$k};
    }
    else {
        # this record only exists in file2
        $h2{$k} = [ @a2 ];
    }
}
# format record as csv line
sub print_csv {
    print join(",", @{ $_ }), "\n";
}
print "differences file :\n";
print_csv for values %hd;
print "\n";
print "extra records in file1 :\n";
print_csv for values %h1;
print "\n";
print "extra records in file2\n";
print_csv for values %h2;
EOD
# try it out
perl dif.pl cmp.d1 <cmp.d2
Output:
differences file :
bcd,567::532,890
abc,234,123::012
extra records in file1 :
cde,678,789
extra records in file2
cdf,678,789
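Note: CSV output usually does not need to be sorted, so this code does no sorting. Because the records are held in hashes, rows within each section can come out in any order (bcd before abc in the output above), and that order may change from run to run.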