awk - 将类似 GenBank 的多行记录转换为新的文件格式(fasta 格式)
问题描述
我有一个具有以下架构的文件:
source 1..3566367
/organism="Laccaria bicolor S238N-H82"
/mol_type="genomic DNA"
/strain="S238N-H82"
/db_xref="taxon:486041"
gene <143..>783
/locus_tag="LACBIDRAFT_300552"
/db_xref="GeneID:6069681"
mRNA join(<143..224,274..309,357..470,524..>783)
/locus_tag="LACBIDRAFT_300552"
/product="helix-turn-helix transcription factor, AraC
type"
/note="Has EST support"
/transcript_id="XM_001873113.1"
/db_xref="GeneID:6069681"
CDS join(143..224,274..309,357..470,524..783)
/locus_tag="LACBIDRAFT_300552"
/note="Helix-turn-helix transcription factor; AraC type"
/codon_start=1
/product="helix-turn-helix transcription factor, AraC
type"
/protein_id="XP_001873148.1"
/db_xref="GeneID:6069681"
/translation="MHAKIFLTILFASAVSVYASPQLEERQILSGVVSAITSAGGAVA
SGVTSVAGQVTSVAGSIGGDITSEAGQVFQTVTSIGGKAVTIVTSVGGDAITLATSGA
GVATSKFGSVYTVATAAAASEASAATGKSSAALPIHGLQSSLIVGLVTVVGSALLGAA
ITL"
gap 1104..3435
/estimated_length=2332
gene complement(<3738..>4636)
/locus_tag="LACBIDRAFT_242762"
/db_xref="GeneID:6069481"
mRNA complement(join(<3738..3852,3910..4045,4101..4244,
4296..4356,4409..4491,4540..4565,4620..>4636))
/locus_tag="LACBIDRAFT_242762"
/product="predicted protein"
/transcript_id="XM_001873722.1"
/db_xref="GeneID:6069481"
CDS complement(join(3738..3852,3910..4045,4101..4244,
4296..4356,4409..4491,4540..4565,4620..>4636))
/locus_tag="LACBIDRAFT_242762"
/note="Ribulose-5-phosphate 3-epimerase. RPE2. Fourth step
of pentose phosphate pathway; Ribulose-5-phosphate
3-epimerase. RPE2. TRuncated gene. No EST evidence"
/codon_start=1
/product="hypothetical protein"
/protein_id="XP_001873757.1"
/db_xref="InterPro:IPR000056"
/db_xref="GeneID:6069481"
/translation="LDVMDGHFVPNITMGAPILSCVHKGVPGIFMDCHMMVAKPEQWV
DDIADAGGSLYCFHIEATSDPVSLINTIHKRNMKAGVAISPDTPSTAITDEIANAADM
LLVMTVYPGRGGQKFIERCVPKVAELRARFPEKDIEVDGGVGPNTIGICADAGCNVIV
AGTAIFGSENPMEVIQRLKDTVNAAQAQSGAKY"
gap 4637..5256
/estimated_length=620
gap 7701..7750
/estimated_length=50
在这个文件中,我感兴趣的记录是以单词开头的CDS
:
CDS join(143..224,274..309,357..470,524..783)
/locus_tag="LACBIDRAFT_300552"
/note="Helix-turn-helix transcription factor; AraC type"
/codon_start=1
/product="helix-turn-helix transcription factor, AraC
type"
/protein_id="XP_001873148.1"
/db_xref="GeneID:6069681"
/translation="MHAKIFLTILFASAVSVYASPQLEERQILSGVVSAITSAGGAVA
SGVTSVAGQVTSVAGSIGGDITSEAGQVFQTVTSIGGKAVTIVTSVGGDAITLATSGA
GVATSKFGSVYTVATAAAASEASAATGKSSAALPIHGLQSSLIVGLVTVVGSALLGAA
ITL"
CDS complement(join(3738..3852,3910..4045,4101..4244,
4296..4356,4409..4491,4540..4565,4620..>4636))
/locus_tag="LACBIDRAFT_242762"
/note="Ribulose-5-phosphate 3-epimerase. RPE2. Fourth step
of pentose phosphate pathway; Ribulose-5-phosphate
3-epimerase. RPE2. TRuncated gene. No EST evidence"
/codon_start=1
/product="hypothetical protein"
/protein_id="XP_001873757.1"
/db_xref="InterPro:IPR000056"
/db_xref="GeneID:6069481"
/translation="LDVMDGHFVPNITMGAPILSCVHKGVPGIFMDCHMMVAKPEQWV
DDIADAGGSLYCFHIEATSDPVSLINTIHKRNMKAGVAISPDTPSTAITDEIANAADM
LLVMTVYPGRGGQKFIERCVPKVAELRARFPEKDIEVDGGVGPNTIGICADAGCNVIV
AGTAIFGSENPMEVIQRLKDTVNAAQAQSGAKY"
从他们那里,我想将这些记录的信息转换成以下格式:
>XP_001873148.1 GeneID:6069681 LACBIDRAFT_300552 helix-turn-helix transcription factor, AraC
MHAKIFLTILFASAVSVYASPQLEERQILSGVVSAITSAGGAVA
SGVTSVAGQVTSVAGSIGGDITSEAGQVFQTVTSIGGKAVTIVT
SVGGDAITLATSGAGVATSKFGSVYTVATAAAASEASAATGKSS
AALPIHGLQSSLIVGLVTVVGSALLGAAITL
>XP_001873757.1 GeneID:6069481 LACBIDRAFT_242762 hypothetical protein
LDVMDGHFVPNITMGAPILSCVHKGVPGIFMDCHMMVAKPEQWV
DDIADAGGSLYCFHIEATSDPVSLINTIHKRNMKAGVAISPDTP
STAITDEIANAADMLLVMTVYPGRGGQKFIERCVPKVAELRARF
PEKDIEVDGGVGPNTIGICADAGCNVIVAGTAIFGSENPMEVIQ
RLKDTVNAAQAQSGAKY
此新文件的记录应以“>”符号开头,并包含从以下内容中提取的以下信息:
XP_001873148.1
从: /protein_id="XP_001873148.1"
GeneID:6069681
从: /db_xref="GeneID:6069681"
LACBIDRAFT_300552
从:/locus_tag="LACBIDRAFT_300552"
helix-turn-helix transcription factor, AraC
从:/product="helix-turn-helix transcription factor, AraC
最后,顺序:
MHAKIFLTILFASAVSVYASPQLEERQILSGVVSAITSAGGAVA
SGVTSVAGQVTSVAGSIGGDITSEAGQVFQTVTSIGGKAVTIVT
SVGGDAITLATSGAGVATSKFGSVYTVATAAAASEASAATGKSS
AALPIHGLQSSLIVGLVTVVGSALLGAAITL
我们可以“折叠”到任何数字吗(在这个例子中通常是 80、44)
如果有人能指出我使用 AWK 而不是 Perl 或 Python 完成这项工作的正确方向,我将不胜感激。我用来执行此任务的 Python/Perl 工具越来越难以维护/编译。我希望/相信好的/旧的 AWK 有可能执行这些任务。
解决方案
每当您的数据中有标签值对时,我发现最好首先构建一个数组来包含f[]
下面的映射 (),然后您可以通过它们的标签/名称访问这些值。
使用 GNU awk 作为第三个参数match()
,然后利用\s/\S
gawk 扩展:
$ cat tst.awk
BEGIN {
wid=(wid ? wid : 44)
}
/^ ?\S/ {
prt()
key = $1
sub(/\s*\S+/,"")
}
{
gsub(/^\s+|\s+$/,"")
rec = (rec == "" ? "" : rec " ") $0
}
END { prt() }
function prt( f, tag, val) {
if ( key == "CDS" ) {
while ( match(rec,/\/([^=]+)=(\S+|"[^"]+")/,a) ) {
tag = a[1]
val = a[2]
gsub(/^"|"$/,"",val)
f[tag] = val
rec = substr(rec,RSTART+RLENGTH)
}
gsub(/\s+/,"",f["translation"])
gsub(".{"wid"}","&"RS,f["translation"])
sub(RS"$","",f["translation"])
print ">" f["protein_id"], f["db_xref"], f["locus_tag"], f["product"]
print f["translation"]
}
rec = ""
}
.
$ awk -f tst.awk file
>XP_001873148.1 GeneID:6069681 LACBIDRAFT_300552 helix-turn-helix transcription factor, AraC type
MHAKIFLTILFASAVSVYASPQLEERQILSGVVSAITSAGGAVA
SGVTSVAGQVTSVAGSIGGDITSEAGQVFQTVTSIGGKAVTIVT
SVGGDAITLATSGAGVATSKFGSVYTVATAAAASEASAATGKSS
AALPIHGLQSSLIVGLVTVVGSALLGAAITL
>XP_001873757.1 GeneID:6069481 LACBIDRAFT_242762 hypothetical protein
LDVMDGHFVPNITMGAPILSCVHKGVPGIFMDCHMMVAKPEQWV
DDIADAGGSLYCFHIEATSDPVSLINTIHKRNMKAGVAISPDTP
STAITDEIANAADMLLVMTVYPGRGGQKFIERCVPKVAELRARF
PEKDIEVDGGVGPNTIGICADAGCNVIVAGTAIFGSENPMEVIQ
RLKDTVNAAQAQSGAKY
如果您想为该translation
字段使用不同的宽度,则可以更改代码或在命令行上指定它:
$ awk -v wid=80 -f tst.awk file
>XP_001873148.1 GeneID:6069681 LACBIDRAFT_300552 helix-turn-helix transcription factor, AraC type
MHAKIFLTILFASAVSVYASPQLEERQILSGVVSAITSAGGAVASGVTSVAGQVTSVAGSIGGDITSEAGQVFQTVTSIG
GKAVTIVTSVGGDAITLATSGAGVATSKFGSVYTVATAAAASEASAATGKSSAALPIHGLQSSLIVGLVTVVGSALLGAA
ITL
>XP_001873757.1 GeneID:6069481 LACBIDRAFT_242762 hypothetical protein
LDVMDGHFVPNITMGAPILSCVHKGVPGIFMDCHMMVAKPEQWVDDIADAGGSLYCFHIEATSDPVSLINTIHKRNMKAG
VAISPDTPSTAITDEIANAADMLLVMTVYPGRGGQKFIERCVPKVAELRARFPEKDIEVDGGVGPNTIGICADAGCNVIV
AGTAIFGSENPMEVIQRLKDTVNAAQAQSGAKY
推荐阅读
- kotlin - Kotlin SDK 与 Java SDK
- c - 为什么根本没有内联函数符号?
- java - Java - 如何创建一个可以处理特定 [restricted] 类型的泛型对象的类?
- http-headers - 如何在 SAP Hybrid Application Toolkit 中将 Content-ID 标头设置为正确的请求?
- reactjs - React:如何将登录/注册页面与其他应用程序内容分开
- c++ - 为每个登录用户查找活动应用程序
- objective-c - audit_token_to_pid 未定义符号
- c++ - 如何在 C++ 中声明类的外部成员?
- java - Java单例,它是如何工作的?
- android - 模拟 Android Snackbar 以进行 Mockk 单元测试