Combining multiple grep variables into one column-wise file

Problem description

I have several grep expressions that count the lines matching a string, each one applied to a group of files with a different extension:

Nreads_ini=$(grep -c '^>' $WDIR/*_R1.trim.contigs.fasta)
Nreads_align=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.align)
Nreads_preclust=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.fasta)
Nreads_final=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.pick.fasta)

Each of these greps prints the sample name and the number of occurrences, as shown below.

The first one:

PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053

The second one:

PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617

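And so on. I need to create a .txt file with these grep counts as columns and the sample name as the key column. The sample name is the part of the file name before "_R1" (V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA, V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG, ...):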

Sample                                   | Nreads_ini | Nreads_align  |
-----------------------------------------------------------------------
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT  | 13175      | 12589         | 
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT  | 14801      | 13934         | 
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT  | 13475      | 12981         | 
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG  | 13424      | 12896         |
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA  | 12053      | 11617         |

Any ideas? Is there a simpler solution to my problem? Thanks!

Tags: bash, count, grep

Solution


In this answer, the variable names are shortened to ini and align.

First, we extract the sample name and count from grep's output. Since we have to do this multiple times, we define the function

e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
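
To make the extraction step concrete, here is a minimal sketch that runs e on a single line of the grep output shown in the question. It assumes GNU sed (for \t in the replacement); the variable names line and extracted are only for illustration.

```shell
# Same helper as above: keep the sample name (text between the last "/" and "_R1")
# and the count (text after the last ":"), separated by a tab.
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }

# One line of grep -c output, copied from the question.
line='PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175'

extracted=$(printf '%s\n' "$line" | e)
printf '%s\n' "$extracted"
```

The result is the sample name, a tab, and the count 13175.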

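Then we join the extracted data into one file. Lines with the same sample name will be combined. Note that join expects both inputs to be sorted on the join field; the shell expands globs in sorted order, so this usually holds for grep output like the above.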

join -t $'\t' <(e <<< "$ini") <(e <<< "$align")
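
As a quick check of the join step, the following sketch uses the first two samples from the question. It assumes bash (for <<< and <(...)) and GNU sed; the variable contents are typed in by hand here instead of coming from grep.

```shell
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }

# First two lines of each grep output from the question.
ini='PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801'
align='PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934'

# Join on the first (sample name) field, using a tab as the field separator.
joined=$(join -t $'\t' <(e <<< "$ini") <(e <<< "$align"))
printf '%s\n' "$joined"
```

Each output line holds the sample name followed by its ini and align counts, tab-separated.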

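This is nearly the expected output; all that remains is to add the header and draw the column separators. With column from util-linux, -t aligns the columns, -o sets the output separator, and -N names the columns.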

join ... | column -to " | " -N Sample,ini,align

This will print

Sample                                  | ini   | align
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617

Adding a horizontal line after the header is left as an exercise for the reader :)
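
For readers who would rather skip the exercise, one possible sketch is to let awk repeat the header as a row of dashes of the same width. The helper name add_rule is hypothetical and not part of the original answer; the sample table is the first two lines of the output above.

```shell
# add_rule (hypothetical helper): print the first line, then the same line
# with every character replaced by "-", then all remaining lines unchanged.
add_rule() {
  awk 'NR == 1 { print; gsub(/./, "-"); print; next } { print }'
}

table='Sample                                  | ini   | align
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589'

with_rule=$(printf '%s\n' "$table" | add_rule)
printf '%s\n' "$with_rule"
```

In the full pipeline this would simply be appended after column, as in join ... | column ... | add_rule.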

This approach also works with more than two count columns; only the join and -N parts have to be extended. Because join accepts exactly two files at a time, the joins have to be chained, which makes for an unwieldy workaround ...

e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
join -t $'\t' <(e <<< "$var1") <(e <<< "$var2") |
join -t $'\t' - <(e <<< "$var3") | ... | join -t $'\t' - <(e <<< "$varN") |
column -to " | " -N Sample,Col1,Col2,...,ColN

... so it is easier to add two more helper functions: j2 joins the first two inputs, and j joins each further input onto the result read from stdin.

e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
j2() { join -t $'\t' <(e <<< "$1") <(e <<< "$2"); }
j() { join -t $'\t' - <(e <<< "$1"); }
j2 "$var1" "$var2" | j "$var3" | ... | j "$varN" |
column -to " | " -N Sample,Col1,Col2,...,ColN
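
To illustrate the chained helpers with three inputs, here is a small sketch. The sample names S1 and S2 and all counts are made-up toy data, not values from the question; bash and GNU sed are assumed.

```shell
e()  { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
j2() { join -t $'\t' <(e <<< "$1") <(e <<< "$2"); }
j()  { join -t $'\t' - <(e <<< "$1"); }

# Toy inputs (invented for illustration), shaped like grep -c output.
var1=$'PATH/S1_R1.a:10\nPATH/S2_R1.a:20'
var2=$'PATH/S1_R1.b:9\nPATH/S2_R1.b:18'
var3=$'PATH/S1_R1.c:7\nPATH/S2_R1.c:15'

# Join the first two inputs, then join the third onto the running result.
merged=$(j2 "$var1" "$var2" | j "$var3")
printf '%s\n' "$merged"
```

Each line of merged contains the sample name followed by its three counts, ready to be piped into column.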

Alternatively, if all inputs contain the same samples in the same order, join can be replaced with one single paste command.
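
A minimal sketch of the paste variant follows, assuming bash and GNU sed, and assuming (as stated above) that both inputs list the same samples in the same order. Since paste does not match keys, the sample name is kept from the first input only and cut -f2 drops it from the second.

```shell
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }

# First two lines of each grep output from the question.
ini='PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801'
align='PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934'

# Name + count from the first input, count only from the second.
pasted=$(paste <(e <<< "$ini") <(e <<< "$align" | cut -f2))
printf '%s\n' "$pasted"
```

Unlike join, this silently produces wrong rows if a sample is missing from one input or the order differs, so it is only safe when that assumption has been checked.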

