awk - 如何重新格式化 fasta 文件头的字段并折叠序列?
问题描述
- 生物信息学中最常用的文件之一是fasta 文件格式。
- Fasta 文件很简单:它们包含以“>”开头的“Header”记录,然后是“Sequence”记录,即在标题之后但在下一个记录分隔符之前的所有内容(即“>”)。
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
> next record...
> another one...
- 标头可以非常简单(例如,">ENSP00000488314.1")或复杂。
- 复杂的标题重要但可变的信息。
- 对于上面的示例序列(来自ENSEMBL),标题记录由以下部分组成:
Field 01: ENSP00000488314.1 <=Protein ID
Field 02: pep <=Peptide record
Field 03: chromosome:GRCh38:X:143884071:143885255:1 <=Chromosome and chromosomal coordinates
Field 04: gene:ENSG00000276380.2 <=Gene ID
Field 05: transcript:ENST00000618570.1 <=Transcript ID
Field 06: gene_biotype:polymorphic_pseudogene <=Gene Biotype
Field 07: transcript_biotype:polymorphic_pseudogene <=Transcript Biotype
Field 08: gene_symbol:UBE2NL <=Gene Symbol
Up to here the fields are all nicely separated by spaces, and then...Field 09 (Variable)
Field 09: description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
Field 10: [Source:HGNC Symbol;Acc:HGNC:31710] <=Predictable
- 很多时候,其他生物信息学应用程序不能很好地接受长标题,因此需要缩短标题。
- 以聪明的方式做到这一点会很好。因此,使用 AWK,并使用下面的示例序列,我想:
- 第一:控制表头记录的打印如下:
- 始终保留第一个字段:
>ENSP00000488314.1
- 但随后能够省略和/或包含字段。例子:
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
Field: 01 04 05
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 [Source:HGNC Symbol;Acc:HGNC:31710]
Field: 01 02 03 10
- 为简单起见,完全忽略字段 09 是完全可以接受的,但能够使用字段 10 会很好
- 然后能够将序列“折叠”到用户指定的数字。例如,每 60 个字符折叠一次的记录:
>ENSP00000441696.1 pep chromosome:GRCh38:14:21868839:21869365:1 gene:ENSG00000211788.2 transcript:ENST00000390436.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRAV13-1 description:T cell receptor alpha variable 13-1 [Source:HGNC Symbol;Acc:HGNC:12108]
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 pep chromosome:GRCh38:22:42140203:42141924:-1 gene:ENSG00000205702.11 transcript:ENST00000435101.1 gene_biotype:polymorphic_pseudogene transcript_biotype:nonsense_mediated_decay gene_symbol:CYP2D7 description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:2624]
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
- 可能变成(每 120 个字符折叠一次的序列):
>ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
- 到目前为止,我能做的最好的事情就是调用一个包含以下代码的脚本:
awk -v w=60 -f script.awk fasta_file.fa
#!/usr/bin/env gawk
## Script.awk
/^>/ {
if (seq != "") print seq; print $1,$4,$5; seq = ""; next
}
{
seq = seq $1
while (length(seq) > w) {
print substr(seq, 1,w)
seq = substr(seq, 1+w)
}
}
END { if (seq != "") print seq }
上面代码的问题是 $1、$4 和 $5 字段是硬编码的。
Ed Morton为类似问题提出了一个优雅的解决方案 ,但是,它需要我理解\s/\S gawk 扩展和 AWK 数组,这是我正在努力做的事情。
任何关于如何使用 AWK(不是 Perl/Python)改进上述代码的想法将不胜感激
解决方案
这不仅显示了如何使用 awk 执行您想要的操作,还显示了如何正确构建 shell 脚本以在解析参数后调用 awk(如果您使用 shebang 调用 awk 则不能这样做 - 不要那样做)。它使用 GNU awk 进行 gensub() 和第三个参数来匹配():
$ cat tst.sh
#!/usr/bin/env bash
while getopts ":w:f:" opt; do
case "$opt" in
w) wid=${OPTARG}
;;
f) flds=${OPTARG}
;;
*) printf 'bad argument "%s"\n' "$opt" >&2
exit 1
;;
esac
done
shift "$((OPTIND-1))"
awk -v wid="$wid" -v flds="$flds" '
BEGIN {
wid=(wid ? wid : 120)
flds=(flds ? flds : "protein gene transcript")
numTags = split(flds,tags)
}
sub(/^>/,"") {
if (NR > 1) {
prt()
}
match($0,/(description:.*\S)\s+\[([^]]+)/,a)
$0 = substr($0,1,RSTART-1)
f["description"] = a[1]
f["predictable"] = a[2]
f["protein"] = $1
f["peptide"] = $2
for (i=3; i<=NF; i++) {
tag = gensub(/:.*/,"",1,$i)
f[tag] = $i
}
next
}
{ f["sequence"] = f["sequence"] $0 }
END { prt() }
function prt( tagNr, tag) {
printf ">"
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
printf "%s%s", f[tag], (tagNr<numTags ? OFS : ORS)
}
print gensub(".{"wid"}","&"RS,"g",f["sequence"])
delete f
}
' "${@:--}"
.
$ ./tst.sh file
>ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
.
$ ./tst.sh -w 60 -f 'gene_symbol chromosome' file
>gene_symbol:TRAV13-1 chromosome:GRCh38:14:21868839:21869365:1
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>gene_symbol:UBE2NL chromosome:GRCh38:X:143884071:143885255:1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>gene_symbol:CYP2D7 chromosome:GRCh38:22:42140203:42141924:-1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
.
$ ./tst.sh -w 10000 -f 'description' file
>description:T cell receptor alpha variable 13-1
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene)
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
.
$ ./tst.sh -w 10000 -f 'predictable' file
>Source:HGNC Symbol;Acc:HGNC:12108
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>Source:HGNC Symbol;Acc:HGNC:31710
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>Source:HGNC Symbol;Acc:HGNC:2624
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
推荐阅读
- python - 你如何改变python中可以添加到整数的内容?
- python - 这是尝试代理的正确方法吗?
- node.js - 节点js mongodb查询
- python-3.x - 我在 python 中创建了 2 种不同的 2D 列表方式,然后用相同的方式更新,但我得到不同的结果
- javascript - NGX 在 Angular 9 中翻译
- next.js - 错误:失败的道具类型:道具`href`在` `中需要`string`或`object`,但得到`undefined`
- unity3d - 如何使用 google cardboard 在 VR 中开始?
- apache-camel - Camel fromF文件路由
- flutter - 更新 Android Studio 后 Flutter 插件不起作用
- javascript - 鼠标滚轮后刷卡器不会恢复自动播放