bash - 将多个 grep 变量合并到一个列式文件中
问题描述
我有一些 grep 表达式来计算匹配字符串的行数,每个表达式用于一组具有不同扩展名的文件:
Nreads_ini=$(grep -c '^>' $WDIR/*_R1.trim.contigs.fasta)
Nreads_align=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.align)
Nreads_preclust=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.fasta)
Nreads_final=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.pick.fasta)
这些 grep 中的每一个都会输出样本名称和出现次数,如下所示。
第一个:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053
第二个:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
等等。我需要创建一个 .txt 文件,将这些数字 grep 输出作为列,将样本名称作为键列。示例名称是文件名中“_R1”之前的部分(V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA、V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG...):
Sample | Nreads_ini | Nreads_align |
-----------------------------------------------------------------------
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589 |
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934 |
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981 |
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896 |
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617 |
任何想法?我的问题还有其他更简单的解决方案吗?谢谢!
解决方案
In this answers the variable names are shortened to ini
and align
.
First, we extract the sample name and count from grep's output. Since we have to do this multiple times, we define the function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
Then we join the extracted data into one file. Lines with the same sample name will be combined.
join -t $'\t' <(e <<< "$ini") <(e <<< "$align")
Now we nearly have the expected output. We only have to add the header and draw lines for the table.
join ... | column -to " | " -N Sample,ini,align
This will print
Sample | ini | align
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617
Adding a horizontal line after the header is left as an exercise for the reader :)
This approach also works with more than two number columns. The join
and -N
parts have to be extended. join
can only work with two files, requiring us to use an unwieldy workaround ...
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
join -t $'\t' <(e <<< "$var1") <(e <<< "$var2") |
join -t $'\t' - <(e <<< "$var3") | ... | join -t $'\t' - <(e <<< "$varN") |
column -to " | " -N Sample,Col1,Col2,...,ColN
... so it would be easier to add another helper function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
j2() { join -t $'\t' <(e <<< "$1") <(e <<< "$2"); }
j() { join -t $'\t' - <(e <<< "$1"); }
j2 "$var1" "$var2" | j "$var3" | ... | j "$varN" |
column -to " | " -N Sample,Col1,Col2,...,ColN
Alternatively, if all inputs contain the same samples in the same order, join
can be replaced with one single paste
command.
推荐阅读
- amp-html - 使用 AMP 插件在 Google PageSpeed 上获得 100 分?这甚至可能吗?
- regex - 带有数字和字母字符的单行,正则表达式无法识别数字
- python - 如何解决我在编写 while 循环时遇到的“关键错误:t”?
- spring-boot - Spring Cloud Stream:Kafka生产者和消费者的多个绑定器具有单独的jaas配置不能一起工作
- python - 多值请求解析即将用于选择多个
- bash - 如何将除前两个之外的所有 bash 参数设置为 git 别名
- c++ - C++ 错误:调用非 constexpr 函数
- sql - 尝试在一个查询中使用两个联接时重复计数
- html - Bootstrap 垂直导航无法按预期工作
- java - 有没有更好的方法在 Java 中使用通配符?