Compare a column across two files and, on a match, change the string in another column

Question

I have two files.

file1 
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase gene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase gene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821

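What I want: if any line in file2 matches column 13 of file1 (only a partial match, because the value in column 13 is wrapped in "" and followed by a semicolon), I want to change the string in column 4 to "pseudogene"; otherwise the line should be left unchanged.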

Desired output

non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene  476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene  15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

So far I can get the matching lines, but I cannot do the rest.

grep -Ff file2 file1

Tags: unix, awk, grep

Answer


With your shown samples, please try the following awk code. It also preserves the original whitespace of file1.

awk '
BEGIN{ s1="\"" }
FNR==NR{
  arr[s1 $0 s1";"]
  next
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)
  firstPart=substr($0,RSTART,RLENGTH)
  $0=substr($0,RSTART+RLENGTH)
  match($0,/^[^ ]+/)
  restPart=substr($0,RSTART+RLENGTH)
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart
}
' file2 file1

Explanation: the same program with a comment on each step.

awk '                                          ##Start of the awk program.
BEGIN{ s1="\"" }                               ##Set s1 to a double quote in the BEGIN section.
FNR==NR{                                       ##FNR==NR is TRUE only while the first file (file2) is being read.
  arr[s1 $0 s1";"]                             ##Store the current line, wrapped as "line"; as an index of arr.
  next                                         ##Skip the remaining statements for this line.
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)  ##Match the first 3 fields together with the whitespace after them.
  firstPart=substr($0,RSTART,RLENGTH)          ##Save the matched part in firstPart for later use.
  $0=substr($0,RSTART+RLENGTH)                 ##Set the current line to everything after the matched part.
  match($0,/^[^ ]+/)                           ##Match from the start up to the 1st space, which is the 4th field.
  restPart=substr($0,RSTART+RLENGTH)           ##Save everything after the 4th field in restPart.
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart ##Print firstPart, then pseudogene if the last field is in arr or else the original 4th field, then restPart.
}
' file2 file1                                  ##Input file names; file2 must come first.
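
GTF-style annotation files are normally tab-delimited, so a shorter variant is possible if that holds for the real file1. The following is only a sketch under that assumption (the samples shown here appear to be space-aligned, so it will not work on them as pasted). The sample rows, the temporary directory and the array names `names` and `attr` are made up for the illustration. With tab as both FS and OFS, assigning to $4 rebuilds the line with the same separators, so the layout is kept:

```shell
#!/bin/sh
# Hypothetical sample data: tab-separated copies of two file1 rows and a two-line file2.
tmp=$(mktemp -d)
printf 'non-coding\tX\tFlyBase\tgene\t20025099\t20025170\t.\t+\t.\tgene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";\n' >  "$tmp/file1.tsv"
printf 'non-coding\tX\tFlyBase\tgene\t476857\t479309\t.\t-\t.\tgene_id "FBgn0029523"; gene_symbol "CR18275";\n' >> "$tmp/file1.tsv"
printf 'CR18275\nbetaNACtes5\n' > "$tmp/file2"

awk '
BEGIN { FS = OFS = "\t" }                      # assumption: file1 is tab-delimited
FNR == NR { names["\"" $0 "\";"]; next }       # file2: store each name wrapped as "name";
{
  n = split($NF, attr, " ")                    # split the last (attribute) column on blanks
  if (attr[n] in names) $4 = "pseudogene"      # last token is the quoted gene_symbol value
  print
}
' "$tmp/file2" "$tmp/file1.tsv" > "$tmp/out.tsv"

cat "$tmp/out.tsv"
```

Only the last blank-separated token of the final column is compared, so this assumes gene_symbol is the last attribute on every line, as it is in the samples.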
