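How can I save a matched regular expression in a column of the same txt file where the pattern was found?

Problem description

I have a file with many columns (the first line is shown below):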

TRINITY_DN3472760_c4_g4 TRINITY_DN3472760_c4_g4_i1  DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase 
{ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: 
Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; 
Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex`DHAS_AQUAE^DHAS_AQUAE^Q:2-361,H:214-332^53.333%ID^E:4.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex TRINITY_DN3472760_c4_g4_i1.p2   2-373[+]    DHAS_AQUAE^DHAS_AQUAE^Q:1-120,H:214-332^53.333%ID^E:1.37e-32^RecName: Full=Aspartate-semialdehyde dehydrogenase {ECO:0000255|HAMAP-Rule:MF_02121};^Bacteria; Aquificae; Aquificales; Aquificaceae; Aquifex  PF02774.15^Semialdhyde_dhC^Semialdehyde dehydrogenase, dimerisation domain^1-108^E:6.4e-24  COG0136^Catalyzes the NADPH-dependent formation of L-aspartate- semialdehyde (L-ASA) by the reductive dephosphorylation of L- aspartyl-4-phosphate (By similarity)  KEGG:aae:aq_1866`KO:K00133  KEGG:aae:aq_1866`KO:K00133  GO:0005737^cellular_component^cytoplasm`GO:0004073^molecular_function^aspartate-semialdehyde dehydrogenase activity`GO:0003942^molecular_function^N-acetyl-gamma-glutamyl-phosphate reductase activity`GO:0051287^molecular_function^NAD binding`GO:0050661^molecular_function^NADP binding`GO:0071266^biological_process^'de novo' L-methionine biosynthetic process`GO:0019877^biological_process^diaminopimelate biosynthetic process`GO:0009097^biological_process^isoleucine biosynthetic process`GO:0009089^biological_process^lysine biosynthetic process via diaminopimelate`GO:0009088^biological_process^threonine biosynthetic process   GO:0003942^molecular_function^N-acetyl-gamma-glutamyl-phosphate reductase activity`GO:0016620^molecular_function^oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or 
NADP as acceptor`GO:0046983^molecular_function^protein dimerization activity`GO:0008652^biological_process^cellular amino acid biosynthetic process`GO:0055114^biological_process^oxidation-reduction process`GO:0005737^cellular_component^cytoplasm   GGAGCGTAAGGTCACCTGGGAGACGCGCAAGATCATGGACCTGCCCGACCTCCCTGTGTCGTGCACGTGCGTGCGCATCCCCACGCTGCGCGCGCACGGCGAGTCGATCACCATCGAGACGGAGAAGCCGATCAACATGGAGAGGGCCTACGCTGTGCTCAACGAGGCCTCCGGCGTCGTCGTCGTCGACGACACCTCGAAGAACCTCTACCCGATGCCGATCACCGCCTCGACCAAGTTCGACGTCGAGGTCGGCCGCCTCCGCATCAACGACGTCTTCGGCGAGAACGGCCTCGACATGTTCGTCGTCGGCGATCAGCTCCTCCGCGGCGCGGCGCTCAACGCCGTCCTCATCGCGGAGGCCGTCATGTAAACTTGTTTACACCCGCGCCGCCACTCGTGCTGTTTGCTGCCGCCGGCCCGCTTCGGCCCAAACCGCGACGCCCTTGCGTGGCTTGGC    ERKVTWETRKIMDLPDLPVSCTCVRIPTLRAHGESITIETEKPINMERAYAVLNEASGVVVVDDTSKNLYPMPITASTKFDVEVGRLRINDVFGENGLDMFVVGDQLLRGAALNAVLIAEAVM*

One of the columns holds annotations like these:

KEGG:aag:AaeL_AAEL000291`KO:K02155
KEGG:aag:AaeL_AAEL003872
KEGG:aag:AaeL_AAEL005901`KEGG:aag:AaeL_AAEL013158`KO:K02984
KEGG:ago:AGOS_AGR122C`KO:K13126
KEGG:ame:408385`KO:K03231

I am interested in extracting the part that carries the KO annotation with grep:

grep -P 'K[0-9]{5}' myfile

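But then I would like to save the matched pattern in the same file, say in column 15. Another option that would work for me is to keep the matched pattern where it is and delete everything else.

So my expected result is the identifier matching K[0-9]{5}, saved in the same file.

Can anyone help me?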

Tags: regex, unix, grep, output

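Solution

Check whether field 9 actually ends with the pattern you need, then use sub(/.*:/, "", r) to strip everything up to the last colon and append the result to the end of the matching lines only: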

awk -F"\t" '{if ($9 ~ /KO:K[0-9]{5}$/) { r=$9; sub(/.*:/, "", r); print $0 "\t" r; } else print $0; }' file > outfile

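Here,

  • -F"\t" splits each line into fields on tab characters.
  • if ($9 ~ /KO:K[0-9]{5}$/) is a condition that holds only when field 9 ($9) ends with KO:K followed by 5 digits. In that case:
    • r=$9; assigns the value of field 9 to r
    • sub(/.*:/, "", r); then removes everything up to and including the last :
    • print $0 "\t" r; then prints the whole record followed by a tab and the value of r
  • else
    • print $0; prints the record as is.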
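The same idea can be tried on a small sample before running it on the real file. The sketch below is only an illustration under a few assumptions: the file names sample.tsv, out.tsv and replaced.tsv are made up, the eight placeholder fields stand in for the real columns, and the KEGG annotation is assumed to sit in field 9 as in the command above. It uses match() with substr() instead of sub(), so the KO identifier is also found when it is not at the very end of the field, and it spells out the five digits because some older awk builds do not accept the {5} interval syntax.

```shell
# Build a small tab-separated sample: 8 placeholder fields plus a KEGG field in column 9.
# The file names here are hypothetical.
printf 'a\tb\tc\td\te\tf\tg\th\t%s\n' \
  'KEGG:aag:AaeL_AAEL000291`KO:K02155' \
  'KEGG:aag:AaeL_AAEL003872' \
  'KEGG:ame:408385`KO:K03231' > sample.tsv

# Option 1: append the KO identifier as a new last column.
# Lines without a KO annotation get an empty column, so every line keeps the same field count.
awk -F'\t' -v OFS='\t' -v col=9 '
{
    ko = ""
    # match() sets RSTART and RLENGTH when the regex is found anywhere in the field
    if (match($col, /KO:K[0-9][0-9][0-9][0-9][0-9]/))
        ko = substr($col, RSTART + 3, RLENGTH - 3)   # skip the leading "KO:"
    print $0, ko
}' sample.tsv > out.tsv

# Option 2: keep the match in place and drop the rest of column 9.
awk -F'\t' -v OFS='\t' -v col=9 '
{
    ko = ""
    if (match($col, /KO:K[0-9][0-9][0-9][0-9][0-9]/))
        ko = substr($col, RSTART + 3, RLENGTH - 3)
    $col = ko      # assigning to a field rebuilds $0 using OFS
    print
}' sample.tsv > replaced.tsv
```

Both commands write to a new file rather than to the input, because redirecting awk's output onto the file it is still reading would truncate it. Once the result looks right, the new file can be moved over the original.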
