How can I merge two files by column with awk?

Problem description

I have the two following text files:

file1

-7.7
-7.4
-7.3
-7.3
-7.3

file2

4.823
5.472
5.856
4.770
4.425

And I want to merge them side by side, separated by a comma:

file3

-7.7,4.823
-7.4,5.472
-7.3,5.856
-7.3,4.770
-7.3,4.425

I know this can be easily done with paste -d ',' file1 file2 > file3, but I want a solution that allows me to have control over each iteration, since my dataset is big and I also need to add other columns to the output file. E.g.:

A,-7.7,4.823,3
A,-7.4,5.472,2
B,-7.3,5.856,3
A,-7.3,4.770,1
B,-7.3,4.425,1

Here's what I got so far:

awk 'NR==FNR {a[$count]=$1; count+=1; next} {print a[$count] "," $1; count+=1;}' file1 file2 > file3

Output:

-7.3,4.823
-7.3,5.472
-7.3,5.856
-7.3,4.770
-7.3,4.425

I am new to bash and awk, so a detailed response would be appreciated :)

Edit:
Suppose I have a directory with pairs of files ending in two extensions, .ext1 and .ext2. The file names encode parameters; for example, file_0_par1_par2.ext1 is paired with file_0_par1_par2.ext2. Each file contains 5 values. I have a function that extracts the serial number and the parameters from a file's name. My goal is to write the values in the files, along with the parameters extracted from their names, to a single csv file (file_out.csv).
Code:

for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's a O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2;
            do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv" 
            done
        fi
    done
done

Tags: bash, csv, awk

Solution


An efficient way to do what the updated question describes, replacing the nested for loops above, would be (untested):

for file1 in *.ext1; do
    base="${file1%.*}"          # e.g. file_0_par1_par2
    file2="${base}.ext2"        # derive the pair's name directly, no scanning
    paste -d ',' "$file1" "$file2" |
    awk -v base="$base" '
        BEGIN { split(base,b,/_/); FS=OFS="," }   # b[2]=file number, b[3]=par1, b[4]=par2
        { print b[3], b[4], $1, $2, b[2] }
    '
done > 'file_out.csv'

Deriving the pair's name directly with base="${file1%.*}"; file2="${base}.ext2" turns the O(N^2) scan of for file2 in *.ext2 ; do if [[ "${file1%.*}" == "${file2%.*}" ]] into an O(N) lookup (given N pairs of files), and | awk '...' is an order of magnitude more efficient than | while IFS="," read -r var1 var2; do echo ...; done (see "Why is using a shell loop to process text considered bad practice?"), so you can expect a huge improvement over the performance of your existing script.
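As for the original two-file merge: the reason your attempt prints -7.3 on every line is that in awk, $count means "field number count", not the array index you intended, so the script was storing and fetching values under the wrong keys. The standard idiom is to index the first file by FNR, the per-file line number. A minimal sketch, recreating the sample data from the question:

```shell
# sample data from the question
printf '%s\n' -7.7 -7.4 -7.3 -7.3 -7.3 > file1
printf '%s\n' 4.823 5.472 5.856 4.770 4.425 > file2

# NR==FNR is true only while reading the first file: store each
# value keyed by its per-file line number FNR, then pair each
# line of file2 with the stored value for the same line number.
awk 'NR==FNR {a[FNR]=$1; next} {print a[FNR] "," $1}' file1 file2 > file3

cat file3
```

Because every output line is produced inside the second action block, you get the per-iteration hook you asked for: any extra columns can be computed and printed there.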
