How do I find the character sequence matching an identifier in Linux/Unix?

Problem description

I have a file named mytext.fasta.

mytext.fasta

>lcl|NW_001820834.1_gene_4 [locus_tag=SS1G_01081] [db_xref=GeneID:5493597] [partial=5',3'] [location=complement(<6452..>8801)] [gbkey=Gene]
ATGCAATTGGCAGCAGTCCTAAGCCTCGTGGGCTTGGTTACGGCTCAATGTCCGTACGGATTTGACACAC
CACTTCAAAAGCGTGAATCTATTGATGCTCAAGCCAGTAGTTCTAGTTTCTTGAATCAATTCACAATTAA
CGATACCGATGCACACTTTACCACCGACGCAGGTGGGCCTATGCAAGAGGACACTAGTTTGAAAGCTGGG
>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA
>lcl|NW_001820834.1_gene_6 [locus_tag=SS1G_01083] [db_xref=GeneID:5494096] [partial=5',3'] [location=<12203..>15199] [gbkey=Gene]
ATGAGAGGCAAGCTTGGTGTCACAGTTGCTGCATTTGCGACGGCATTTCTAAATACGACACTTGCTCAAG
ACTCAACATCATCACAAGCGGATGCGGATACTACCACAAGTTATTGTCCCGTTTACACGCTCACAGCTTC
AGTTGATGCCAGCGCACCTATTATCCCAAACATCCACGATCCGCAGGCAATTAATCCACAAGATGTTTGT
CCGGGGTATACTGCATCCAATGTGAAGCGAACCTCTCACGGATTGACGGCTTCTCTGTCATTGGCTGGTG

When I run grep -A1 'SS1G_01082' mytext.fasta, I get:

>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA

Instead, I would like to get:

>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA

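As you may notice, every sequence in this file starts with a > header line, so I would like the grep to return the full sequence, however many lines it spans. How can I do this?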

Tags: linux, bash, unix, sed, grep

Solution


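This is easier with GNU awk and a custom record separator RS, which makes each FASTA entry (the header plus all of its sequence lines) a single record: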

awk -v RS='(^|\n)>' '/SS1G_01082/{print ">" $0}' mytext.fasta

>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA
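
If GNU awk is not available, a portable variant is possible; this is a sketch that is not part of the original answer. It keeps the default line-by-line records and uses a flag that is switched on or off at every header line, so all sequence lines following a matching header are printed. The function name fasta_grep and its argument order are hypothetical, chosen only for illustration. Unlike the RS approach above, it tests the pattern against header lines only, not against the sequence data.

```shell
# Hypothetical helper: print every FASTA record whose header matches a pattern.
# $1 = regular expression to look for in header lines
# $2 = path to the FASTA file
fasta_grep() {
    awk -v pat="$1" '
        /^>/ { keep = ($0 ~ pat) }   # at each header, decide whether this record is wanted
        keep                         # print the header and its sequence lines while keep is true
    ' "$2"
}

# Usage (assuming the file from the question):
#   fasta_grep 'SS1G_01082' mytext.fasta
```

Because the flag is re-evaluated only when a new > header is reached, the record ends automatically at the next header or at end of file, regardless of how many lines the sequence spans.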
