Check if list of patterns from file1.csv is present in file2.csv and change records in file2.csv

Problem description

I have 2 files, file1.csv and file2.csv.

file1.csv contains only 1 column with hundreds of rows.

aaa
ddd
fff
ggg

file2.csv contains 5 fields with thousands of rows.

aaa,2,3,4,
aaa,2,3,4, 
bbb,2,3,4,
ccc,2,3,4, 
ccc,2,3,4, 
ddd,2,3,4, 
ddd,2,3,4,
ddd,2,3,4,  
eee,2,3,4, 
fff,2,3,4, 
ggg,2,3,4, 
hhh,2,3,4, 
hhh,2,3,4,   

My task is to check whether col1 in file1.csv matches col1 in file2.csv and, if it does, change the 5th column to Y in file2.csv.

Desired output

aaa,2,3,4,Y
aaa,2,3,4,Y 
bbb,2,3,4, 
ccc,2,3,4, 
ccc,2,3,4, 
ddd,2,3,4,Y
ddd,2,3,4,Y
ddd,2,3,4,Y  
eee,2,3,4, 
fff,2,3,4,Y 
ggg,2,3,4,Y 
hhh,2,3,4, 
hhh,2,3,4, 

What I tried is:

for i in $(cat file1.csv); do awk -F "," '$1==$i{$5="Y"}1' OFS="," file2.csv ; done

But I am getting only the matched records but not the unmatched records.

Is there a better way I can achieve this in UNIX using awk, sed or another common utility?

EDIT: Update question with clear example

Tags: bash, csv, unix, awk, sed

Solution


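There is no need for a loop; awk can simply read both files. Setting OFS as well as FS keeps the comma separator when awk rebuilds a record after $5 is assigned:

awk -F, -v OFS=, 'NR==FNR{a[$1]++;next;}a[$1]{$5="Y"}1' file1.csv file2.csv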

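I am not sure whether the header and the second line are only part of your explanation. If you want to get rid of them:

awk -F, -v OFS=, 'NR==FNR{a[$1]++;next;}a[$1]{$5="Y"}FNR>2' file1.csv file2.csv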

Note that if file1.csv may be empty, you should replace NR==FNR with a different way of checking which file is being read, such as ARGIND==1 (GNU awk only) or FILENAME=="file1.csv".
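To illustrate the empty-file caveat, here is a minimal sketch. The scratch directory and the sample rows are made up for the demo, and FILENAME==ARGV[1] is used as a portable stand-in for the file checks named above (it assumes the pattern file is the first file operand).

```shell
# Hypothetical demo files in a scratch directory.
tmpdir=$(mktemp -d)
: > "$tmpdir/file1.csv"                                   # empty pattern file
printf 'aaa,2,3,4,\nbbb,2,3,4,\n' > "$tmpdir/file2.csv"

# NR==FNR: with an empty first file, the lines of file2.csv satisfy NR==FNR,
# so the first block swallows them all and nothing is printed.
naive=$(awk -F, -v OFS=, 'NR==FNR{a[$1]=1;next} a[$1]{$5="Y"} 1' \
        "$tmpdir/file1.csv" "$tmpdir/file2.csv")

# FILENAME==ARGV[1]: true only while the first named file is being read,
# so file2.csv is still printed when file1.csv is empty.
safe=$(awk -F, -v OFS=, 'FILENAME==ARGV[1]{a[$1]=1;next} a[$1]{$5="Y"} 1' \
       "$tmpdir/file1.csv" "$tmpdir/file2.csv")

printf 'naive: [%s]\nsafe:  [%s]\n' "$naive" "$safe"
rm -rf "$tmpdir"
```

With the NR==FNR test the output is empty, while the FILENAME test passes file2.csv through unchanged.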

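If you have a lot of data to process, changing a[$1]++ to a[$1]=1 will improve speed slightly.
Also, if you want to keep the header (or the second line), it is better to start filling the array a only once FNR>1 or FNR>2. You can improve the command yourself; I am sure you get the idea ;)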
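One possible reading of the header advice, as a sketch only: it assumes both files begin with a single header line (the sample data in the question has none, so the header rows below are invented for the demo) and uses a[$1]=1 instead of a[$1]++.

```shell
# Hypothetical demo files, each with one invented header line.
tmpdir=$(mktemp -d)
printf 'id\naaa\n' > "$tmpdir/file1.csv"
printf 'id,b,c,d,flag\naaa,2,3,4,\nbbb,2,3,4,\n' > "$tmpdir/file2.csv"

with_header=$(awk -F, -v OFS=, '
  NR==FNR { if (FNR>1) a[$1]=1; next }   # store keys, skipping the header of file1.csv
  a[$1]   { $5="Y" }                     # flag rows whose first field was stored
  1                                      # print every record of file2.csv
' "$tmpdir/file1.csv" "$tmpdir/file2.csv")

printf '%s\n' "$with_header"
rm -rf "$tmpdir"
```

Because the header key "id" never enters the array, the header row of file2.csv is printed untouched instead of being flagged with Y.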

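NR==FNR means the first file is being read, because NR is the total number of records read so far and FNR is the record number within the current file. a is an array that actually stores $1 as its keys. next skips execution of the remaining blocks.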

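When NR==FNR is false, the current file is not the first one, and the first block is not executed.
(You could also put NR>FNR in front of the later block to make this explicit, but since I use next in the NR==FNR block, it is not necessary.)
Then the other instructions come into play: a[$1] tests whether the key exists in the array a (it does so by referencing the value; a literal key-exists check would actually be $1 in a), and if it does, $5 is changed.
Finally, 1 simply supplies a true value, which is equivalent to {print}.
(An expression without a block implies {print}, and the block is executed whenever the preceding expression evaluates to true; 1 is always true.)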
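The literal key-exists form mentioned above can be sketched as follows. The scratch directory and the shortened sample data are assumptions made for the demo; the awk program differs from the answer's command only in storing bare keys and testing them with in.

```shell
# Hypothetical demo files with a shortened version of the sample data.
tmpdir=$(mktemp -d)
printf 'aaa\nddd\n' > "$tmpdir/file1.csv"
printf 'aaa,2,3,4,\nbbb,2,3,4,\nddd,2,3,4,\n' > "$tmpdir/file2.csv"

marked=$(awk -F, -v OFS=, '
  NR==FNR { a[$1]; next }   # referencing a[$1] is enough to create the key
  $1 in a { $5="Y" }        # membership test, adds no new keys to a
  1                         # always true, so every record is printed
' "$tmpdir/file1.csv" "$tmpdir/file2.csv")

printf '%s\n' "$marked"
rm -rf "$tmpdir"
```

Unlike a[$1], the in operator does not create an empty entry for every unmatched key of file2.csv, which keeps the array limited to the keys from file1.csv.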

