Merging two csv files, can't get rid of newline

Problem description

I am merging two csv files. For simplicity, I am showing relevant columns only. There are more than four columns in both files.

file_a.csv

col2, col6, col7, col17
a, b, c, 145
e, f, g, 101
x, y, z, 243

file_b.csv

col2, col6, col7, col17
a, b, c, 88
e, f, g, 96
x, k, l, 222

Output should look like this:

col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96

So col17 of file_b is added to file_a as col18 when the contents of col2, col6 and col7 match.

I tried this:

awk -F, 'NR == FNR {a[$2,$6,$7] = $17;next;} {if (! (b = a[$2,$6,$7])) b = "N/A";print $0,FS,b;}' file_a.csv file_b.csv > out.csv

The output looks like this:

col2, col6, col7, col17, 
 , col18
a, b, c, 145
 , 88
e, f, g, 101
 , 96

So the column 17 from file_b I am trying to add does get added but shows up on a new line.

I think this is because there are carriage returns after each line of file_a and file_b. In Notepad++, I can see CRLF. But I can't get rid of them. Also, I would rather not go through two steps: getting rid of carriage returns first and then merging. Instead, if I can bypass the carriage returns during the merge, it will be much faster.

Also, I will appreciate it if you could tell me how to get rid of the spaces before and after the comma separating the merged column. Note that I put spaces between the columns and commas for the other columns for better readability. That is not how it is in the actual files. But there are indeed spaces between col17 and "," and col18 in the merged file and I don't know why.

If you insist on marking this as a duplicate, kindly explain in a comment below how the answers to the previous question(s) address my issue. I tried figuring it out from those previous similar questions and I failed.

Tags: csv, awk, merge, text-processing

Solution


Please try this (GNU awk):

awk -F, -v RS="[\r\n]+" 'NR == FNR {a[$2,$6,$7] = $17;next;} {b=a[$2,$6,$7]; print $0 FS (b? b : "N/A")}' file_a.csv file_b.csv 

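The problems you ran into:
1. Carriage returns: with RS="[\r\n]+", awk treats any run of line-ending characters, including \r\n, as the record separator. Note that this also skips empty lines; if you don't want that, change it to RS="\r\n".
2. Spaces: these come from awk's default OFS, which is a single space. In your print statement you separated the arguments with commas (print $0,FS,b), so awk inserted OFS between them. Separate the expressions with a space instead, or write them next to each other, and they will be concatenated with nothing in between.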
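
Both points can be checked on a small scale. The sketch below is only an illustration under stated assumptions: it uses hypothetical 4-column files (columns 1-3 as the key, column 4 as the value) in place of the real col2/col6/col7/col17 layout, and it strips the carriage return with sub(/\r$/, "") rather than a regex RS, so it should not depend on GNU awk. It also passes file_b first so that file_a's lines are printed with file_b's value appended, which is how the expected output in the question reads; swap the file names if the opposite is wanted.

```shell
#!/bin/sh
# Hypothetical 4-column stand-ins for the real files: columns 1-3 form the key,
# column 4 holds the value. Lines end in CRLF, as in the question.
tmp=$(mktemp -d)
printf 'a,b,c,145\r\ne,f,g,101\r\nx,y,z,243\r\n' > "$tmp/file_a.csv"
printf 'a,b,c,88\r\ne,f,g,96\r\nx,k,l,222\r\n'   > "$tmp/file_b.csv"

# Comma-separated print arguments are joined with OFS (a space by default).
with_commas=$(printf 'a,b\n' | awk -F, '{ print $0, FS, "c" }')

# Adjacent expressions are concatenated with nothing in between.
concatenated=$(printf 'a,b\n' | awk -F, '{ print $0 FS "c" }')

# Strip the trailing CR from every record, remember the value from file_b
# per key, then append it (or N/A) to each line of file_a.
merged=$(awk -F, '
    { sub(/\r$/, "") }                        # drop the CR; $0 is re-split
    NR == FNR { val[$1,$2,$3] = $4; next }    # first file: file_b
    { k = ($1 SUBSEP $2 SUBSEP $3)
      print $0 FS ((k in val) ? val[k] : "N/A") }
' "$tmp/file_b.csv" "$tmp/file_a.csv")

printf '%s\n' "$with_commas" "$concatenated" "$merged"
rm -rf "$tmp"
```

Testing membership with (k in val) instead of the truthiness of the stored value is a deliberate choice here: a legitimate value of 0 would otherwise be reported as N/A.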

