Change the delimiter only between specific columns

Problem description

I'm trying to change the delimiter only between columns 1 and 9. After that, I want to keep the original delimiter.

These are the first lines of my file, both read directly and as shown by od -c file:

#description: evidence-based annotation of the human genome (GRCh38), version 35 (Ensembl 101), mapped to GRCh37 with gencode-backmap
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2020-06-03
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";

0000000   #   d   e   s   c   r   i   p   t   i   o   n   :       e   v
0000020   i   d   e   n   c   e   -   b   a   s   e   d       a   n   n
0000040   o   t   a   t   i   o   n       o   f       t   h   e       h
0000060   u   m   a   n       g   e   n   o   m   e       (   G   R   C
0000100   h   3   8   )   ,       v   e   r   s   i   o   n       3   5
0000120       (   E   n   s   e   m   b   l       1   0   1   )   ,
0000140   m   a   p   p   e   d       t   o       G   R   C   h   3   7
0000160       w   i   t   h       g   e   n   c   o   d   e   -   b   a
0000200   c   k   m   a   p  \n   #   p   r   o   v   i   d   e   r   :
0000220       G   E   N   C   O   D   E  \n   #   c   o   n   t   a   c
0000240   t   :       g   e   n   c   o   d   e   -   h   e   l   p   @
0000260   e   b   i   .   a   c   .   u   k  \n   #   f   o   r   m   a
0000300   t   :       g   f   f   3  \n   #   d   a   t   e   :       2
0000320   0   2   0   -   0   6   -   0   3  \n   c   h   r   1       H
0000340   A   V   A   N   A       g   e   n   e       1   1   8   6   9
0000360       1   4   4   0   9       .       +       .       g   e   n
0000400   e   _   i   d       "   E   N   S   G   0   0   0   0   0   2
0000420   2   3   9   7   2   .   5   _   4   "   ;       g   e   n   e
0000440   _   t   y   p   e       "   t   r   a   n   s   c   r   i   b
0000460   e   d   _   u   n   p   r   o   c   e   s   s   e   d   _   p
0000500   s   e   u   d   o   g   e   n   e   "   ;       g   e   n   e
0000520   _   n   a   m   e       "   D   D   X   1   1   L   1   "   ;
0000540       l   e   v   e   l       2   ;       h   g   n   c   _   i
0000560   d       "   H   G   N   C   :   3   7   1   0   2   "   ;
0000600   h   a   v   a   n   a   _   g   e   n   e       "   O   T   T
0000620   H   U   M   G   0   0   0   0   0   0   0   0   9   6   1   .
0000640   2   _   4   "   ;       r   e   m   a   p   _   s   t   a   t
0000660   u   s       "   f   u   l   l   _   c   o   n   t   i   g   "
0000700   ;       r   e   m   a   p   _   n   u   m   _   m   a   p   p
0000720   i   n   g   s       1   ;       r   e   m   a   p   _   t   a
0000740   r   g   e   t   _   s   t   a   t   u   s       "   o   v   e
0000760   r   l   a   p   "   ;  \n   c   h   r   1       H   A   V   A
0001000   N   A       t   r   a   n   s   c   r   i   p   t       1   1
0001020   8   6   9       1   4   4   0   9       .       +       .    
0001040   g   e   n   e   _   i   d       "   E   N   S   G   0   0   0
0001060   0   0   2   2   3   9   7   2   .   5   _   4   "   ;       t
0001100   r   a   n   s   c   r   i   p   t   _   i   d       "   E   N
0001120   S   T   0   0   0   0   0   4   5   6   3   2   8   .   2   _
0001140   1   "   ;       g   e   n   e   _   t   y   p   e       "   t
0001160   r   a   n   s   c   r   i   b   e   d   _   u   n   p   r   o
0001200   c   e   s   s   e   d   _   p   s   e   u   d   o   g   e   n
0001220   e   "   ;       g   e   n   e   _   n   a   m   e       "   D
0001240   D   X   1   1   L   1   "   ;       t   r   a   n   s   c   r
0001260   i   p   t   _   t   y   p   e       "   p   r   o   c   e   s
0001300   s   e   d   _   t   r   a   n   s   c   r   i   p   t   "   ;
0001320       t   r   a   n   s   c   r   i   p   t   _   n   a   m   e
0001340       "   D   D   X   1   1   L   1   -   2   0   2   "   ;
0001360   l   e   v   e   l       2   ;       t   r   a   n   s   c   r
0001400   i   p   t   _   s   u   p   p   o   r   t   _   l   e   v   e

How can I convert it to:

#description: evidence-based annotation of the human genome (GRCh38), version 35 (Ensembl 101), mapped to GRCh37 with gencode-backmap
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2020-06-03
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";

0000000   #   d   e   s   c   r   i   p   t   i   o   n   :       e   v
0000020   i   d   e   n   c   e   -   b   a   s   e   d       a   n   n
0000040   o   t   a   t   i   o   n       o   f       t   h   e       h
0000060   u   m   a   n       g   e   n   o   m   e       (   G   R   C
0000100   h   3   8   )   ,       v   e   r   s   i   o   n       3   5
0000120       (   E   n   s   e   m   b   l       1   0   1   )   ,
0000140   m   a   p   p   e   d       t   o       G   R   C   h   3   7
0000160       w   i   t   h       g   e   n   c   o   d   e   -   b   a
0000200   c   k   m   a   p  \n   #   p   r   o   v   i   d   e   r   :
0000220       G   E   N   C   O   D   E  \n   #   c   o   n   t   a   c
0000240   t   :       g   e   n   c   o   d   e   -   h   e   l   p   @
0000260   e   b   i   .   a   c   .   u   k  \n   #   f   o   r   m   a
0000300   t   :       g   f   f   3  \n   #   d   a   t   e   :       2
0000320   0   2   0   -   0   6   -   0   3  \n   c   h   r   1  \t   H
0000340   A   V   A   N   A  \t   g   e   n   e  \t   1   1   8   6   9
0000360  \t   1   4   4   0   9  \t   .  \t   +  \t   .  \t   g   e   n
0000400   e   _   i   d       "   E   N   S   G   0   0   0   0   0   2
0000420   2   3   9   7   2   .   5   _   4   "   ;       g   e   n   e
0000440   _   t   y   p   e       "   t   r   a   n   s   c   r   i   b
0000460   e   d   _   u   n   p   r   o   c   e   s   s   e   d   _   p
0000500   s   e   u   d   o   g   e   n   e   "   ;       g   e   n   e
0000520   _   n   a   m   e       "   D   D   X   1   1   L   1   "   ;
0000540       l   e   v   e   l       2   ;       h   g   n   c   _   i
0000560   d       "   H   G   N   C   :   3   7   1   0   2   "   ;
0000600   h   a   v   a   n   a   _   g   e   n   e       "   O   T   T
0000620   H   U   M   G   0   0   0   0   0   0   0   0   9   6   1   .
0000640   2   _   4   "   ;       r   e   m   a   p   _   s   t   a   t
0000660   u   s       "   f   u   l   l   _   c   o   n   t   i   g   "
0000700   ;       r   e   m   a   p   _   n   u   m   _   m   a   p   p
0000720   i   n   g   s       1   ;       r   e   m   a   p   _   t   a
0000740   r   g   e   t   _   s   t   a   t   u   s       "   o   v   e
0000760   r   l   a   p   "   ;  \n   c   h   r   1  \t   H   A   V   A
0001000   N   A  \t   t   r   a   n   s   c   r   i   p   t  \t   1   1
0001020   8   6   9  \t   1   4   4   0   9  \t   .  \t   +  \t   .  \t
0001040   g   e   n   e   _   i   d       "   E   N   S   G   0   0   0
0001060   0   0   2   2   3   9   7   2   .   5   _   4   "   ;       t
0001100   r   a   n   s   c   r   i   p   t   _   i   d       "   E   N
0001120   S   T   0   0   0   0   0   4   5   6   3   2   8   .   2   _
0001140   1   "   ;       g   e   n   e   _   t   y   p   e       "   t
0001160   r   a   n   s   c   r   i   b   e   d   _   u   n   p   r   o
0001200   c   e   s   s   e   d   _   p   s   e   u   d   o   g   e   n
0001220   e   "   ;       g   e   n   e   _   n   a   m   e       "   D
0001240   D   X   1   1   L   1   "   ;       t   r   a   n   s   c   r
0001260   i   p   t   _   t   y   p   e       "   p   r   o   c   e   s
0001300   s   e   d   _   t   r   a   n   s   c   r   i   p   t   "   ;
0001320       t   r   a   n   s   c   r   i   p   t   _   n   a   m   e
0001340       "   D   D   X   1   1   L   1   -   2   0   2   "   ;
0001360   l   e   v   e   l       2   ;       t   r   a   n   s   c   r
0001400   i   p   t   _   s   u   p   p   o   r   t   _   l   e   v   e

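As you can see, there is a header that I want to keep exactly the same. After that, I only want to separate the first 9 columns with tabs. That way, after the tabs, the rest of the text stays together as a single column.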

Thanks!

Tags: awk, sed

Solution


perl

perl -pe 'if(!/^#/){$c=8; s/ /\t/ while $c--}' ip.txt

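Note that the above solution will not work if you want to use \s+ instead of a single space, because the inserted \t is itself matched by \s, so every later substitution would match that same tab again.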



If your 9th column always starts with gene_id and gene_id cannot occur anywhere else in the line:

perl -pe 's/ (?=.*gene_id\s)/\t/g'

# if gene_id can occur in header lines
perl -pe 's/ (?=.*gene_id\s)/\t/g if !/^#/'

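These solutions also work with \s+ in place of the single space, because they use a single substitution command: with the g flag, matching resumes after each replacement, so the inserted tabs are never matched again.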

