awk script to cut and paste a string within a file

Problem description

I have a file formatted like this (each space is a tab delimiter):

NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG

I want to cut the :AATGT+GTGTA part and paste it at the end of the line, tab-separated, to get:

NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA

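Important detail: I want the last string after the final ':' of the first field, together with that ':', to be cut out, whatever its length (it can be AAAA, AAAA+GGGG, etc.).

I used the following awk script: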

awk '/^@/ {print;next} {N=split($1,n,":"); print $0 "\tRX:Z:" n[N] ; sub("[:]"n[N],"") ; print $0}'

My problem is that the original line is still there, so I get this result:

NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA

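Basically, I don't know how to redirect the result to a new file (or overwrite the original file) with awk. A bash script would also be a fine solution for me. Thanks for your help.

Edit: I forgot to mention that lines starting with @ must be excluded: the script should not be applied to them. (This is a BAM file of NGS data, and the header lines must not be changed.)

The file looks like this: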

@SQ     SN:chrY LN:59373566
@RG     ID:1    PL:ILLUMINA     PU:PU   LB:001  SM:TeCoriell
@PG     ID:MarkDuplicates       VN:2.23.7       CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.11 
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.11
@PG     ID:GATK PrintReads      VN:3.8-1-0-gf15c1c3ef   CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG     ID:samtools.3   PN:samtools     PP:samtools.2   VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507:CACTC-CCGTC   371     chr1    10257   0       2H48M59H        chr7    128036692       0       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA        AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4>   SA:Z:chr7,128036692,+,76M33S,60,0;      BC:Z:TGCCACCA+GAGCAGCC  MC:Z:76M33H     BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM   MD:Z:36A5A5     PG:Z:MarkDuplicates     RG:
Z:1     BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ   NM:i:2  OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E   AS:i:38 XS:i:38
NB551027:724:HTWHHAFXY:2:11110:2230:8695:AGTCT-AAAGT    163     chr1    15596   0       113M    =       15596   113     CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG  =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/       BC:Z:TGGCACCA+GAGCAGCA  MC:Z:113M       BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ     MD:Z:113        PG:Z:MarkDuplicates     RG:Z:1  BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN     NM:i:0  OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/  AS:i:113        XS:i:113

I should get this result:

@SQ     SN:chrY LN:59373566
@RG     ID:1    PL:ILLUMINA     PU:PU   LB:001  SM:TeCoriell
@PG     ID:MarkDuplicates       VN:2.23.7       CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.11 
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.11
@PG     ID:GATK PrintReads      VN:3.8-1-0-gf15c1c3ef   CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG     ID:samtools.3   PN:samtools     PP:samtools.2   VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507   371     chr1    10257   0       2H48M59H        chr7    128036692       0       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA        AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4>   SA:Z:chr7,128036692,+,76M33S,60,0;      BC:Z:TGCCACCA+GAGCAGCC  MC:Z:76M33H     BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM   MD:Z:36A5A5     PG:Z:MarkDuplicates     RG:
Z:1     BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ   NM:i:2  OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E   AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695    163     chr1    15596   0       113M    =       15596   113     CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG  =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/       BC:Z:TGGCACCA+GAGCAGCA  MC:Z:113M       BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ     MD:Z:113        PG:Z:MarkDuplicates     RG:Z:1  BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN     NM:i:0  OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/  AS:i:113        XS:i:113 RX:Z:AGTCT-AAAGT

Tags: awk

Solution


You can use this GNU awk (gawk) command:

awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '(n=split($1, a, /:/)) > 1 {sub(/:[^:\t]+\t/, OFS); sub(/\n$/, ""); $0 = $0 OFS "RX:Z:" a[n]} {ORS=RT} 1; END {print "\n"}' file

@SQ SN:chrY LN:59373566
@RG ID:1    PL:ILLUMINA PU:PU   LB:001  SM:TeCoriell
@PG ID:MarkDuplicates   VN:2.23.7   CL:MarkDuplicates   BARCODE_TAG=RX  DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam   METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam    PN:MarkDuplicates
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa   mem -C  -M  -t  4   -R  @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz   TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa  VN:1.11
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads  VN:3.8-1-0-gf15c1c3ef   CL:readGroup=null   platform=null   number=-1   sample_file=[]  sample_name=[]  simplify=false  no_pg_tag=false
@PG ID:samtools.2   PN:samtools PP:samtools.1   VN:1.11 CL:samtools sort    -o  TeCoriell.bwamem.bam    -l  5   -T  TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX   -@  4   TeCoriell.bwamem.compress.bam
@PG ID:samtools.3   PN:samtools PP:samtools.2   VN:1.11 CL:samtools view    -h  TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507   371 chr1    10257   0   2H48M59H    chr7    128036692   0   ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA    AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4>   SA:Z:chr7,128036692,+,76M33S,60,0;  BC:Z:TGCCACCA+GAGCAGCC  MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM   MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ   NM:i:2  OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E   AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695    163 chr1    15596   0   113M    =   15596   113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG  =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/   BC:Z:TGGCACCA+GAGCAGCA  MC:Z:113M   BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113    PG:Z:MarkDuplicates RG:Z:1  BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0  OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/  AS:i:113    XS:i:113    RX:Z:AGTCT-AAAGT

A more readable version:

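awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '
# only reads have ":" in their first field; header fields such as "SQ" do not
(n=split($1, a, /:/)) > 1 {
   sub(/:[^:\t]+\t/, OFS)      # cut the ":barcode" that ends the first field
   sub(/\n$/, "")              # drop the trailing newline of the last record
   $0 = $0 OFS "RX:Z:" a[n]    # append the barcode as an RX:Z: tag
}
{
   ORS=RT                      # re-emit the separator text that RS matched
}
1;                             # print every record
END {
   print "\n"
}' file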
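
The command above writes to standard output. For the redirection part of the question, the result can be sent to a new file with `>` and then moved over the original. The sketch below is a portable alternative rather than the method above: it assumes each alignment record sits on one physical line (the wrapped lines in the sample may only be a display artifact) and that the barcode is the last ':'-separated piece of the first tab-separated field. The function name `move_barcode` and the file names are hypothetical.

```shell
# Hypothetical helper: move the barcode from the read name to a trailing RX:Z: tag.
# Assumption: tab-separated input with one record per physical line.
move_barcode() {
    awk 'BEGIN { FS = OFS = "\t" }
    /^@/ { print; next }                 # header lines pass through unchanged
    {
        n = split($1, parts, ":")
        if (n > 1) {
            barcode = parts[n]
            # drop ":" plus the barcode from the end of the read name
            $1 = substr($1, 1, length($1) - length(barcode) - 1)
            $0 = $0 OFS "RX:Z:" barcode
        }
        print                            # exactly one print per input line
    }' "$@"
}

# Write to a new file, then replace the original only if awk succeeded:
# move_barcode input.sam > output.sam && mv output.sam input.sam
```

Two design notes. Each line is printed once, which avoids the duplicated lines from the script in the question. The barcode is removed with `substr` rather than `sub`, because a barcode such as AATGT+GTGTA contains `+`, which `sub` would treat as a regular-expression quantifier instead of a literal character.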
