Count specific lines that do not contain a specific word

Question

I have a file like this:

@HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;@?++8A?;C;F92+2A@19:1*1?DDDECDE?B4:BDEEI
@BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
@HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
@C@FF?EDGFDHH@HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
@HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ

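I am interested in the lines that start with "@HWI", but what I want to count is all the lines that do not start with "@HWI". In the example shown, the result would be 1, because one line starts with "@BBB".

To be clearer: each record spans four lines, and I only want to count the first lines of those records that do not start with '@HWI'. I hope that is clear enough; please tell me if you need more details.

Tags: grep, bioinformatics, biopython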

Answer


With GNU sed, you can use its first~step address (a GNU extension) to print every fourth line starting from the first, then use grep to count the ones that don't start with @HWI:

sed -n '1~4p' file.fastq | grep -cv '^@HWI'

Otherwise, you can use Perl, for example:

perl -ne 'print if 1 == $. % 4' -- file.fastq | grep -cv '^@HWI'

$. contains the current line number, % is the modulo operator.
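
To see what the `1 == $. % 4` test selects, here is a small sketch. It assumes a made-up two-record FASTQ file; the read names, sequences, and the shell variable names are illustrative only. The first quality line deliberately starts with `@`, which is why the header lines are picked by position and not by their leading character.

```shell
# Build a tiny two-record FASTQ file (contents are made up for illustration).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@HWI-read1/1
ACGT
+
@C@F
@BBBB-read2/1
TTGA
+
IIII
EOF

# Matching on a leading "@" alone also catches the quality line "@C@F".
naive=$(grep -c '^@' "$tmp")

# Lines 1 and 5 leave a remainder of 1 when divided by 4, so only headers pass.
headers=$(perl -ne 'print if 1 == $. % 4' -- "$tmp")
printf '%s\n' "$headers"

# Count the headers that do not start with @HWI.
count=$(printf '%s\n' "$headers" | grep -cv '^@HWI')
echo "$count"

rm -f "$tmp"
```

Here `naive` counts three lines although the file holds only two records, while `count` reports the single non-@HWI header.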

But once we're running Perl, we don't need grep anymore:

perl -lne '++$c if 1 == $. % 4 and !/^\@HWI/; END { print $c + 0 }' -- file.fastq

-l removes newlines from input and adds them to output.
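
If neither GNU sed nor Perl is at hand, the same single-pass count can be sketched in awk. This is an alternative that the answer above does not mention; the temporary file and the variable names `fastq` and `non_hwi` are assumptions made for the demonstration, and the data is the sample from the question.

```shell
# Write the four sample records from the question to a temporary file.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;@?++8A?;C;F92+2A@19:1*1?DDDECDE?B4:BDEEI
@BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
@HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
@C@FF?EDGFDHH@HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
@HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ
EOF

# NR is awk's line counter; count header lines (1, 5, 9, ...) not starting with @HWI.
# "c + 0" makes awk print 0 instead of an empty line when nothing matches.
non_hwi=$(awk 'NR % 4 == 1 && !/^@HWI/ { c++ } END { print c + 0 }' "$fastq")
echo "$non_hwi"

rm -f "$fastq"
```

On the sample data this prints 1, matching the result the question asks for.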

