unix - Unix performance improvement - possibly using AWK
Problem description
I have two files: File1.txt (6 pipe-delimited columns) and File2.txt (2 pipe-delimited columns).
File1.txt
NEW|abcd|1234|10000000|Hello|New_value|
NEW|abcd|1234|20000000|Hello|New_value|
NEW|xyzq|5678|30000000|myname|New_Value|
File2.txt
10000000|10000001>10000002>10000003>10000004
19000000|10000000>10000001>10000002>10000003>10000004
17000000|10000099>10000000>10000001>10000002>10000003>10000004
20000000|10000001>10000002>10000003>10000004>30000000
29000000|20000000>10000001>10000002>10000003>10000004
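The goal is that, for each line in File1.txt, I have to take the 4th column and search for that value in File2.txt. If any match is found in File2.txt, I have to pick up every matching line from File2.txt, but only its first column.
This may produce more records in the target file than there are in File1.txt. The output should look like this (the last column, 123, comes from a fixed variable):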
NEW|abcd|1234|10000000|Hello|New_value|123 (this row comes as it matches 1st row & 4th column of File1.txt with 1st row of File2.txt)
NEW|abcd|1234|19000000|Hello|New_value|123 (this row comes as it matches 1st row & 4th column of File1.txt with 2nd row of File2.txt)
NEW|abcd|1234|17000000|Hello|New_value|123 (this row comes as it matches 1st row & 4th column of File1.txt with 3rd row of File2.txt)
NEW|abcd|1234|20000000|Hello|New_value|123 (this row comes as it matches 2nd row & 4th column of File1.txt with 4th row of File2.txt)
NEW|abcd|1234|29000000|Hello|New_value|123 (this row comes as it matches 2nd row & 4th column of File1.txt with 5th row of File2.txt)
NEW|xyzq|5678|20000000|myname|New_Value|123 (this row comes as it matches 3rd row & 4th column of File1.txt with 4th row of File2.txt)
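The matching rule can be illustrated for a single value. The following is only a minimal sketch, assuming that a "match" means the value appears as a complete ID anywhere on a File2.txt line (in either column); `key` and `tmpdir` are hypothetical names introduced for the illustration, and the sample data is the File2.txt shown above.

```shell
# Recreate the sample File2.txt from the question in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/File2.txt" <<'EOF'
10000000|10000001>10000002>10000003>10000004
19000000|10000000>10000001>10000002>10000003>10000004
17000000|10000099>10000000>10000001>10000002>10000003>10000004
20000000|10000001>10000002>10000003>10000004>30000000
29000000|20000000>10000001>10000002>10000003>10000004
EOF

# key (hypothetical name) holds one column-4 value taken from File1.txt.
# Splitting on both | and > turns every ID into its own field, so the
# comparison is a whole-field match rather than a substring match.
matches=$(awk -F'[|>]' -v key=10000000 '
    { for (i = 1; i <= NF; i++) if ($i == key) { print $1; break } }
' "$tmpdir/File2.txt")
printf '%s\n' "$matches"
```

For the first row of File1.txt (column 4 = 10000000) this lists 10000000, 19000000 and 17000000, which are the values substituted into column 4 of the first three expected output rows.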
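I was able to write a solution like the one below, and it gives me the correct output. But when File1.txt and File2.txt each have about 150K lines, it takes 21 minutes. The final target file contains more than 10 million lines.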
VAL1=123
for ROW in `cat File1.txt`
do
    Fld1=`echo $ROW | cut -d'|' -f'1-3'`
    Fld2=`echo $ROW | cut -d'|' -f4`
    Fld3=`echo $ROW | cut -d'|' -f'5-6'`
    grep -i $Fld2 File2.txt | cut -d'|' -f1 > File3.txt
    sed 's/^/'$Fld1'|/g' File3.txt | sed 's/$/|'${Fld3}'|'${VAL1}'/g' >> Target.txt
done
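But my question is: can this solution be optimized? Can it be rewritten to run faster, using AWK or any other approach?
Solution
I'm fairly sure this will be faster (because using the implicit loop within a single awk or sed process is generally faster than invoking it over and over in a shell loop), but you will have to try it and let us know:
Edit: this version should fix the problem of duplicates in the output.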
$ cat a.awk
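# First file (file2.txt): map every ID on the line to that line's first column.
NR == FNR {
    for (i=1; i<=NF; ++i) {
        if ($i in a)
            a[$i] = a[$i] "," $1    # append to the comma-separated list
        else
            a[$i] = $1;
    }
    next
}
# Second file (file1.txt): print one row per first-column value recorded for $4.
$4 in a {
    split(a[$4], b, ",")
    for (i in b) {                  # iteration order is unspecified
        if (!(b[i] in seen)) {      # skip values already printed for this row
            print $1, $2, $3, b[i], $5, $6, new_value
            seen[b[i]]
        }
    }
    delete seen
}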
The output contains the required rows, although in a different order:
$ awk -v new_value=123 -v OFS="|" -f a.awk FS='[|>]' file2.txt FS='|' file1.txt
NEW|abcd|1234|19000000|Hello|New_value|123
NEW|abcd|1234|17000000|Hello|New_value|123
NEW|abcd|1234|10000000|Hello|New_value|123
NEW|abcd|1234|29000000|Hello|New_value|123
NEW|abcd|1234|20000000|Hello|New_value|123
NEW|xyzq|5678|20000000|myname|New_Value|123
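The order differs because `for (i in b)` visits array elements in an unspecified order. If the rows should come out in File2.txt order, one possible variation (not part of the answer above, only a sketch) is to use the element count returned by `split` and walk the indexes numerically. The names `tmpdir`, `out`, `n` and `j` are introduced here for illustration, and the inputs are the sample files from the question, recreated so the block can be run as-is.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/File1.txt" <<'EOF'
NEW|abcd|1234|10000000|Hello|New_value|
NEW|abcd|1234|20000000|Hello|New_value|
NEW|xyzq|5678|30000000|myname|New_Value|
EOF
cat > "$tmpdir/File2.txt" <<'EOF'
10000000|10000001>10000002>10000003>10000004
19000000|10000000>10000001>10000002>10000003>10000004
17000000|10000099>10000000>10000001>10000002>10000003>10000004
20000000|10000001>10000002>10000003>10000004>30000000
29000000|20000000>10000001>10000002>10000003>10000004
EOF

out=$(awk -v new_value=123 -v OFS='|' '
# Pass 1 (File2.txt): same lookup table as in a.awk.
NR == FNR {
    for (i = 1; i <= NF; ++i) {
        if ($i in a)
            a[$i] = a[$i] "," $1
        else
            a[$i] = $1
    }
    next
}
# Pass 2 (File1.txt): split returns the element count, so indexes 1..n
# can be walked in the order the values were appended in pass 1.
$4 in a {
    n = split(a[$4], b, ",")
    for (j = 1; j <= n; ++j) {
        if (!(b[j] in seen)) {
            print $1, $2, $3, b[j], $5, $6, new_value
            seen[b[j]]
        }
    }
    delete seen
}
' FS='[|>]' "$tmpdir/File2.txt" FS='|' "$tmpdir/File1.txt")
printf '%s\n' "$out"
```

On the sample data this prints the same six rows as above, in the order given in the question. Whether it is as fast as the `for (i in b)` version on 150K-line inputs would have to be measured.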