for-loop - Comparing two files when the key values repeat
Problem description
I have the following two files:
BC.txt
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
PB.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
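I am trying to compare column 1 of BC.txt with column 12 of PB.txt and print the matching lines side by side. BC.txt can hold the same value in column 1 with different values in columns 2 and 3, so my comparison only produces output for one of the BC.txt entries. I want output for all of them.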
awk 'BEGIN {OFS=FS} NR==FNR {a[$1]=($2" "$3);next} $12 in a {print $0,a[$12]}' BC.txt PB.txt
Expected output
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
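I want to compare every entry in BC.txt against the entries in PB.txt, but because the key value is the same in both BC.txt entries, my code does not work.
Solution
If you don't care about the output line order relative to the expected output in your question, read BC.txt into memory, since it is the shorter file in this example: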
$ cat tst.awk
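NR==FNR {
    # First file (BC.txt): keep every entry for a key, numbered 1..cnt[key]
    map[$1,++cnt[$1]] = $2 OFS $3
    next
}
{
    # Second file (PB.txt): print the line once per stored BC.txt entry
    for (c=1; c<=cnt[$12]; c++) {
        print $0, map[$12,c]
    }
}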
$ awk -f tst.awk BC.txt PB.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
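The one-liner in the question keeps only one BC.txt entry per key because `a[$1] = ...` is a plain assignment, so each later line with the same key replaces the earlier value. Adding a per-key counter to the array index, as tst.awk does, keeps every occurrence. A minimal sketch of the difference, using a made-up two-line input (the key `k`, the values `v1` and `v2`, and the array names `last` and `all` are illustrative and not part of the question's data):

```shell
# Hypothetical input: the key "k" appears twice with different values.
out=$(printf 'k v1\nk v2\n' | awk '
{
    last[$1] = $2              # plain assignment: a later line overwrites the earlier one
    all[$1, ++cnt[$1]] = $2    # counter in the key: every occurrence is kept
}
END {
    print "last:", last["k"]
    for (c = 1; c <= cnt["k"]; c++)
        print "all:", all["k", c]
}')
printf '%s\n' "$out"
# prints:
# last: v2
# all: v1
# all: v2
```

Looping from 1 to `cnt[key]` rather than using `for (i in all)` is what preserves the input order of the stored entries.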
But if you do care about the order, read PB.txt into memory instead and loop over BC.txt:
$ cat tst.awk
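NR==FNR {
    # First file (PB.txt): keep every line for a key, in input order
    map[$12,++cnt[$12]] = $0
    next
}
{
    # Second file (BC.txt): print each stored PB.txt line, then the UMI and BC fields
    for (c=1; c<=cnt[$1]; c++) {
        print map[$1,c], $2, $3
    }
}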
$ awk -f tst.awk PB.txt BC.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
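To try the order-preserving variant without creating the files by hand, something like the following should work. It is only a sketch: it writes a trimmed copy of the sample data (two PB.txt lines instead of four) into a temporary directory and runs the same logic inline. The variable names `tmpdir` and `result` are my own and not part of the original answer.

```shell
# Sketch: recreate a trimmed version of the sample files in a temp directory.
tmpdir=$(mktemp -d)

cat > "$tmpdir/BC.txt" <<'EOF'
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
EOF

cat > "$tmpdir/PB.txt" <<'EOF'
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
EOF

# PB.txt is read first and stored per key; BC.txt then drives the output order.
result=$(awk '
NR == FNR { map[$12, ++cnt[$12]] = $0; next }
{ for (c = 1; c <= cnt[$1]; c++) print map[$1, c], $2, $3 }
' "$tmpdir/PB.txt" "$tmpdir/BC.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With this trimmed input the output has four lines: both PB.txt lines paired with the first BC.txt entry, followed by both PB.txt lines paired with the second. The trade-off between the two variants is which file is held in memory, so with very large inputs it may be preferable to store the smaller file and sort the output afterwards.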