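awk - Change the delimiter only between specific columns
Problem description
I am trying to change the delimiter only between columns 1 and 9. After that, I want to keep the original delimiter.
These are the first lines of my file, shown as plain text and as the output of od -c file: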
#description: evidence-based annotation of the human genome (GRCh38), version 35 (Ensembl 101), mapped to GRCh37 with gencode-backmap
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2020-06-03
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
0000000 # d e s c r i p t i o n : e v
0000020 i d e n c e - b a s e d a n n
0000040 o t a t i o n o f t h e h
0000060 u m a n g e n o m e ( G R C
0000100 h 3 8 ) , v e r s i o n 3 5
0000120 ( E n s e m b l 1 0 1 ) ,
0000140 m a p p e d t o G R C h 3 7
0000160 w i t h g e n c o d e - b a
0000200 c k m a p \n # p r o v i d e r :
0000220 G E N C O D E \n # c o n t a c
0000240 t : g e n c o d e - h e l p @
0000260 e b i . a c . u k \n # f o r m a
0000300 t : g f f 3 \n # d a t e : 2
0000320 0 2 0 - 0 6 - 0 3 \n c h r 1 H
0000340 A V A N A g e n e 1 1 8 6 9
0000360 1 4 4 0 9 . + . g e n
0000400 e _ i d " E N S G 0 0 0 0 0 2
0000420 2 3 9 7 2 . 5 _ 4 " ; g e n e
0000440 _ t y p e " t r a n s c r i b
0000460 e d _ u n p r o c e s s e d _ p
0000500 s e u d o g e n e " ; g e n e
0000520 _ n a m e " D D X 1 1 L 1 " ;
0000540 l e v e l 2 ; h g n c _ i
0000560 d " H G N C : 3 7 1 0 2 " ;
0000600 h a v a n a _ g e n e " O T T
0000620 H U M G 0 0 0 0 0 0 0 0 9 6 1 .
0000640 2 _ 4 " ; r e m a p _ s t a t
0000660 u s " f u l l _ c o n t i g "
0000700 ; r e m a p _ n u m _ m a p p
0000720 i n g s 1 ; r e m a p _ t a
0000740 r g e t _ s t a t u s " o v e
0000760 r l a p " ; \n c h r 1 H A V A
0001000 N A t r a n s c r i p t 1 1
0001020 8 6 9 1 4 4 0 9 . + .
0001040 g e n e _ i d " E N S G 0 0 0
0001060 0 0 2 2 3 9 7 2 . 5 _ 4 " ; t
0001100 r a n s c r i p t _ i d " E N
0001120 S T 0 0 0 0 0 4 5 6 3 2 8 . 2 _
0001140 1 " ; g e n e _ t y p e " t
0001160 r a n s c r i b e d _ u n p r o
0001200 c e s s e d _ p s e u d o g e n
0001220 e " ; g e n e _ n a m e " D
0001240 D X 1 1 L 1 " ; t r a n s c r
0001260 i p t _ t y p e " p r o c e s
0001300 s e d _ t r a n s c r i p t " ;
0001320 t r a n s c r i p t _ n a m e
0001340 " D D X 1 1 L 1 - 2 0 2 " ;
0001360 l e v e l 2 ; t r a n s c r
0001400 i p t _ s u p p o r t _ l e v e
How can I convert it to:
#description: evidence-based annotation of the human genome (GRCh38), version 35 (Ensembl 101), mapped to GRCh37 with gencode-backmap
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2020-06-03
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
0000000 # d e s c r i p t i o n : e v
0000020 i d e n c e - b a s e d a n n
0000040 o t a t i o n o f t h e h
0000060 u m a n g e n o m e ( G R C
0000100 h 3 8 ) , v e r s i o n 3 5
0000120 ( E n s e m b l 1 0 1 ) ,
0000140 m a p p e d t o G R C h 3 7
0000160 w i t h g e n c o d e - b a
0000200 c k m a p \n # p r o v i d e r :
0000220 G E N C O D E \n # c o n t a c
0000240 t : g e n c o d e - h e l p @
0000260 e b i . a c . u k \n # f o r m a
0000300 t : g f f 3 \n # d a t e : 2
0000320 0 2 0 - 0 6 - 0 3 \n c h r 1 \t H
0000340 A V A N A \t g e n e \t 1 1 8 6 9
0000360 \t 1 4 4 0 9 \t . \t + \t . \t g e n
0000400 e _ i d " E N S G 0 0 0 0 0 2
0000420 2 3 9 7 2 . 5 _ 4 " ; g e n e
0000440 _ t y p e " t r a n s c r i b
0000460 e d _ u n p r o c e s s e d _ p
0000500 s e u d o g e n e " ; g e n e
0000520 _ n a m e " D D X 1 1 L 1 " ;
0000540 l e v e l 2 ; h g n c _ i
0000560 d " H G N C : 3 7 1 0 2 " ;
0000600 h a v a n a _ g e n e " O T T
0000620 H U M G 0 0 0 0 0 0 0 0 9 6 1 .
0000640 2 _ 4 " ; r e m a p _ s t a t
0000660 u s " f u l l _ c o n t i g "
0000700 ; r e m a p _ n u m _ m a p p
0000720 i n g s 1 ; r e m a p _ t a
0000740 r g e t _ s t a t u s " o v e
0000760 r l a p " ; \n c h r 1 \t H A V A
0001000 N A \t t r a n s c r i p t \t 1 1
0001020 8 6 9 \t 1 4 4 0 9 \t . \t + \t . \t
0001040 g e n e _ i d " E N S G 0 0 0
0001060 0 0 2 2 3 9 7 2 . 5 _ 4 " ; t
0001100 r a n s c r i p t _ i d " E N
0001120 S T 0 0 0 0 0 4 5 6 3 2 8 . 2 _
0001140 1 " ; g e n e _ t y p e " t
0001160 r a n s c r i b e d _ u n p r o
0001200 c e s s e d _ p s e u d o g e n
0001220 e " ; g e n e _ n a m e " D
0001240 D X 1 1 L 1 " ; t r a n s c r
0001260 i p t _ t y p e " p r o c e s
0001300 s e d _ t r a n s c r i p t " ;
0001320 t r a n s c r i p t _ n a m e
0001340 " D D X 1 1 L 1 - 2 0 2 " ;
0001360 l e v e l 2 ; t r a n s c r
0001400 i p t _ s u p p o r t _ l e v e
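As you can see, there is a header that I want to keep exactly the same. After that, I only want to separate the first 9 columns with tabs. That way, everything after the last of those tabs (the rest of the text) stays together as part of a single column.
Thanks!
Solution
With perl: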
perl -pe 'if(!/^#/){$c=8; s/ /\t/ while $c--}' ip.txt
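Note that the above solution will not work if you want to use \s+ instead of a single space, because the \t it inserts will itself match as a \s character, so every later substitution would hit the tab that was just inserted instead of the next space.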
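Since the question is tagged awk, the same counting idea can be sketched there as well. This is only an illustration, not part of the original answer: the function name first8_to_tabs is made up, and it assumes the first 8 separators are runs of plain spaces. Because the pattern / +/ matches spaces only, the tabs already inserted are never matched again, which sidesteps the \s+ problem described above.

```shell
# Hypothetical helper: turn the first 8 runs of spaces on each
# non-header line into tabs. Reads the files given as arguments,
# or standard input when there are none.
first8_to_tabs() {
  awk '
    !/^#/ {                      # leave header lines (starting with #) alone
      for (i = 1; i <= 8; i++)   # 8 separators sit between 9 columns
        sub(/ +/, "\t")          # replace the leftmost run of spaces in $0
    }
    1                            # print every line, changed or not
  ' "$@"
}

# Example call on a small made-up input; pipe through od -c to see the tabs.
printf '#header stays as is\na b c d e f g h i j k\n' | first8_to_tabs | od -c
```

Using [[:space:]]+ in place of / +/ would bring the problem back, since a tab is also a whitespace character.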
If your 9th column always starts with gene_id and gene_id cannot occur anywhere else in the line:
perl -pe 's/ (?=.*gene_id\s)/\t/g'
# if gene_id can occur in header lines
perl -pe 's/ (?=.*gene_id\s)/\t/g if !/^#/'
These solutions can use \s+ instead of a single space, because each of them uses a single substitution command.
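awk has no lookahead, but the same "everything before gene_id" idea can be approximated by splitting the line at the first " gene_id " and converting spaces only in the part before it. The sketch below is an assumption-laden illustration rather than the answer's method: tabs_before_gene_id is a made-up name, and it relies on gene_id being preceded and followed by a space and appearing as a standalone word only once per data line.

```shell
# Hypothetical helper: convert runs of spaces to tabs only in the part
# of each non-header line that precedes the first " gene_id ".
tabs_before_gene_id() {
  awk '
    !/^#/ && (p = index($0, " gene_id ")) {
      head = substr($0, 1, p)    # columns 1-8, including the space before gene_id
      tail = substr($0, p + 1)   # gene_id and the rest, left untouched
      gsub(/ +/, "\t", head)     # a single pass, so runs of spaces are safe
      $0 = head tail
    }
    1                            # print every line, changed or not
  ' "$@"
}

# Example call on a shortened, made-up record; od -c makes the tabs visible.
printf 'chr1 HAVANA gene 11869 14409 . + . gene_id "X"; level 2;\n' |
  tabs_before_gene_id | od -c
```

Lines without " gene_id " and header lines pass through unchanged, so the file header is preserved.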