Merging the lines that occur between lines containing a specific character

Problem description

I am trying to manipulate a FASTA file with the general format:

>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG

I am trying to take each read (ACTG...) and append it to the end of the line containing the ReadID, using

paste -sd "\t\n" input.file > output.file

This works as it should, except that, for some reason, some reads are deliberately split across two lines:

>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTG
ACTG

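This means I can't simply replace alternating newlines with a tab delimiter.

What I want to do is take all the lines that fall between two lines beginning with > and combine them into a single line. How can I merge all the lines between the > lines into one line?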

Tags: bash, awk, newline, paste, csv

Solution


You can use the following Perl one-liner to join each read's sequence into a single line (the header stays on its own line):

perl -ne 'sub out {return unless chomp @_; print shift, "\n", @_, "\n" } if (/^>/) {out(@buffer); @buffer = ()} push @buffer, $_; END {out(@buffer)}' -- input.fasta

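Expanded and commented, it corresponds to the following script (to be run with perl -n, which supplies the loop over input lines):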

# Subroutine which prints a header and concatenates the following lines.
sub out {
    return unless chomp @_;       # Remove newlines. Do nothing if there's no buffer.
    print shift, "\n", @_, "\n";  # Print the first line, newline, remaining lines, and newline.
}
if (/^>/) {        # If the line starts with a ">",
    out(@buffer);  # output the previous read
    @buffer = ();  # and empty the buffer.
}
push @buffer, $_;  # Store the current line to the buffer.
END {
    out(@buffer);  # Output the final read.
}
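
As an alternative sketch (not part of the answer above), the same buffering idea can be written in awk, which also lets you emit the tab-separated layout from the question in one pass instead of piping through paste afterwards. The function name fasta_to_tsv is made up for illustration, and the sketch assumes Unix line endings and that every record starts with a line beginning with >.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not from the original answer):
# joins wrapped sequence lines and prints "header<TAB>sequence" for each record.
fasta_to_tsv() {
    awk '
        /^>/ {                                   # a header line starts a new record
            if (seen) print header "\t" seq      # flush the previous record, if any
            header = $0; seq = ""; seen = 1
            next
        }
        seen { seq = seq $0 }                    # append sequence lines to the current record
        END  { if (seen) print header "\t" seq } # flush the final record
    ' "$@"
}
```

It could be used as fasta_to_tsv input.file > output.file. With no file argument it reads standard input. Lines that appear before the first header are ignored, which is a design choice of this sketch rather than something the question specifies.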
