awk - awk script to cut/paste a string within a file
Problem description
I have a file formatted like this (each space = tab separator):
NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
I want to cut the :AATGT+GTGTA part and paste it at the end of the line, separated by a tab, to get
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
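Important detail: I want the last string after the final ':' of the first field, together with that ':', to be cut and pasted, regardless of the string's length (it could be AAAA, or AAAA+GGGG, etc.).
I used the following awk script: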
awk '/^@/ {print;next} {N=split($1,n,":"); print $0 "\tRX:Z:" n[N] ; sub("[:]"n[N],"") ; print $0}'
My problem is that the original line is still there, so I get this result:
NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
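Basically, I don't know how to use awk to redirect the result to a new file (or overwrite the original file). A bash script would also be a fine solution for me. Thanks for your help.
Edit: I forgot to mention that I must exclude lines starting with @: the script should not be applied to those lines. (This is a bam file of NGS data, and the header lines must not be changed.)
The file looks like this: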
@SQ SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507:CACTC-CCGTC 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38
NB551027:724:HTWHHAFXY:2:11110:2230:8695:AGTCT-AAAGT 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113
I should get this result:
@SQ SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113 RX:Z:AGTCT-AAAGT
Solution
You can use this gnu-awk command (it relies on the RT variable, which is a gawk extension):
awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '(n=split($1, a, /:/)) > 1 {sub(/:[^:\t]+\t/, OFS); sub(/\n$/, ""); $0 = $0 OFS "RX:Z:" a[n]} {ORS=RT} 1; END {print "\n"}' file
@SQ SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113 RX:Z:AGTCT-AAAGT
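The question also asks how to save the result or overwrite the original file. awk only writes to standard output, so one common approach is to redirect into a temporary file and then move it over the original. The sketch below is illustrative only: the file name reads.sam, the demo input, and the short awk program are placeholders, not the asker's real data or the command above.

```shell
# Hypothetical demo file: one header line and one tab-separated record.
workdir=$(mktemp -d)
file="$workdir/reads.sam"
printf '@SQ\tSN:chrY\tLN:59373566\nread1:AAAA\tblabla\n' > "$file"

# Write to a temporary file first; replace the original only if awk succeeded.
tmp=$(mktemp "$workdir/reads.XXXXXX")
awk 'BEGIN { FS = OFS = "\t" }
/^@/ { print; next }                     # header lines are copied unchanged
{ sub(/:[^:]*$/, "", $1); print }        # placeholder edit: strip ":<last part>" from field 1
' "$file" > "$tmp" && mv "$tmp" "$file"

cat "$file"
```

To keep the original and write a new file instead, redirect to a different name and skip the mv step. With gawk 4.1 or later, gawk -i inplace is another option.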
A more readable version of the gnu-awk one-liner:
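awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '
# A record ends wherever a line starts with "@" or "NB"; that prefix lands in RT.
(n = split($1, a, /:/)) > 1 {
    sub(/:[^:\t]+\t/, OFS)      # cut ":<last part>" out of the first field
    sub(/\n$/, "")              # drop a trailing newline, if any
    $0 = $0 OFS "RX:Z:" a[n]    # append the tag at the end of the record
}
{
    ORS = RT                    # re-emit the separator text that ended this record
}
1;
END {
    print "\n"                  # finish the output with a newline
}' file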
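The attempt in the question calls print twice per record, which is why every read shows up twice. If each alignment record sits on a single physical line (an assumption; the sample in the question is wrapped across several lines), a line-oriented script that prints once may be enough and should run in any POSIX awk. The following is a minimal sketch under that assumption; the file names in.sam and out.sam and the array name parts are made up for illustration.

```shell
# Hypothetical demo input: one header line and the record from the question.
tmpdir=$(mktemp -d)
printf '@SQ\tSN:chrY\tLN:59373566\n' > "$tmpdir/in.sam"
printf 'NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA\tblabla\tLASTTAG\n' >> "$tmpdir/in.sam"

awk 'BEGIN { FS = OFS = "\t" }
/^@/ { print; next }                  # header lines pass through untouched
{
    n = split($1, parts, ":")         # parts[n] is the text after the last ":"
    if (n > 1) {
        sub(/:[^:]*$/, "", $1)        # drop ":<last part>" from field 1
        $(NF + 1) = "RX:Z:" parts[n]  # append the tag as a new last field
    }
    print                             # exactly one print per record
}' "$tmpdir/in.sam" > "$tmpdir/out.sam"

cat "$tmpdir/out.sam"
```

Because the field separator is a tab on both input and output, the remaining columns keep their original spacing, and the redirection at the end sends the result to a new file rather than the terminal.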