bash - How can I merge two files by column with awk?
Problem description
I have the two following text files:
file1
-7.7
-7.4
-7.3
-7.3
-7.3
file2
4.823
5.472
5.856
4.770
4.425
And I want to merge them side by side, separated by a comma:
file3
-7.7,4.823
-7.4,5.472
-7.3,5.856
-7.3,4.770
-7.3,4.425
I know this can be easily done with paste -d ',' file1 file2 > file3, but I want a solution that gives me control over each iteration, since my dataset is big and I also need to add other columns to the output file. E.g.:
A,-7.7,4.823,3
A,-7.4,5.472,2
B,-7.3,5.856,3
A,-7.3,4.770,1
B,-7.3,4.425,1
Here's what I got so far:
awk 'NR==FNR {a[$count]=$1; count+=1; next} {print a[$count] "," $1; count+=1;}' file1 file2 > file3
Output:
-7.3,4.823
-7.3,5.472
-7.3,5.856
-7.3,4.770
-7.3,4.425
I am new to bash and awk, so a detailed response would be appreciated :)
Edit:
Suppose I have a directory with pairs of files, ending with two extensions: .ext1 and .ext2. Those files have parameters included in their names, for example file_0_par1_par2.ext1 has its pair, file_0_par1_par2.ext2. Each file contains 5 values. I have a function to extract its serial number and its parameters from its name. My goal is to write, on a single csv file (file_out.csv), the values present in the files along with the parameters extracted from their names.
Code:
for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's a O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2;
            do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv"
            done
        fi
    done
done
Solution
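An efficient way to do what your updated question describes:
Suppose I have a directory with pairs of files, ending with two extensions: .ext1 and .ext2. Those files have parameters included in their names, for example file_0_par1_par2.ext1 has its pair, file_0_par1_par2.ext2. Each file contains 5 values. I have a function to extract its serial number and its parameters from its name. My goal is to write, on a single csv file (file_out.csv), the values present in the files along with the parameters extracted from their names.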
for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's a O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2;
            do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv"
            done
        fi
    done
done
would be (untested):
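for file1 in *.ext1; do
    base="${file1%.*}"        # file name without its extension
    file2="${base}.ext2"      # the matching pair, built directly instead of searched for
    paste -d ',' "$file1" "$file2" |
    awk -v base="$base" '
        # file_0_par1_par2 splits into b[1]="file", b[2]="0", b[3]="par1", b[4]="par2"
        BEGIN { split(base,b,/_/); FS=OFS="," }
        { print b[3], b[4], $1, $2, b[2] }
    '
done > 'file_out.csv'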
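To try the loop on sample data, a sketch like the following may help. The scratch directory, file names, and values are made up for illustration, and the -e test that skips an .ext1 file with no .ext2 partner is an addition that is not part of the code above.

```shell
# Hypothetical sample data in a scratch directory (names and values are illustrative)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' -7.7 -7.4 > file_0_A_3.ext1
printf '%s\n' 4.823 5.472 > file_0_A_3.ext2
printf '%s\n' -7.3 > file_1_B_1.ext1   # deliberately has no .ext2 partner

for file1 in *.ext1; do
    base="${file1%.*}"
    file2="${base}.ext2"
    [ -e "$file2" ] || continue   # assumption: unpaired files are skipped silently
    paste -d ',' "$file1" "$file2" |
    awk -v base="$base" '
        BEGIN { split(base,b,/_/); FS=OFS="," }
        { print b[3], b[4], $1, $2, b[2] }
    '
done > file_out.csv

cat file_out.csv
```

With these inputs, file_out.csv should contain the two rows A,3,-7.7,4.823,0 and A,3,-7.4,5.472,0, and nothing for the unpaired file.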
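Building the name yourself with base="${file1%.*}"; file2="${base}.ext2" replaces the N^2 filename comparisons (given N pairs of files) made by for file2 in *.ext2 ; do if [[ "${file1%.*}" == "${file2%.*}" ]] ; then with N direct lookups, and using | awk '...' instead of | while IFS="," read -r var1 var2; do echo ...; done would be an order of magnitude more efficient (see "Why is using a shell loop to process text considered bad practice?"), so you can expect to see a huge improvement over the performance of your existing script.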
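For the original two-file case, the attempt in the question prints -7.3 on every line because $count is a field reference, not the value of count. With one field per line, $2, $3 and so on are empty, so from the third line onward every assignment and every lookup hits the same a[""] element. A minimal sketch of an awk-only merge indexes the array by FNR, the per-file line number, instead. The sample files are recreated in a scratch directory here only so the snippet is self-contained.

```shell
# Recreate the sample inputs from the question in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' -7.7 -7.4 -7.3 -7.3 -7.3 > "$tmp/file1"
printf '%s\n' 4.823 5.472 5.856 4.770 4.425 > "$tmp/file2"

# NR==FNR is true only while the first file is being read:
# store each of its values under its line number, then skip to the next line.
# For the second file, print the stored value next to the current one.
awk 'NR==FNR { a[FNR] = $1; next }
     { print a[FNR] "," $1 }' "$tmp/file1" "$tmp/file2" > "$tmp/file3"

cat "$tmp/file3"
```

Extra columns can then be added inside the print statement, for example a value passed in with -v. Note that this holds all of file1 in memory, which paste does not need to do.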