unix - Compare columns in two files and, if they match, change the string in another column
Problem description
I have two files:
file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase gene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821
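What I want: if any line in file2 matches column 13 of file1 (only a partial match, because the values in file1 are wrapped in double quotes), I want to change the string in column 4 to "pseudogene"; otherwise nothing should change.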
Desired output
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
So far I can get the matching lines, but I can't do the rest.
grep -Ff file2 file1
Solution
With your shown samples, please try the following awk code. It also preserves the original spacing of file1.
awk '
BEGIN{ s1="\"" }
FNR==NR{
  arr[s1 $0 s1";"]
  next
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)
  firstPart=substr($0,RSTART,RLENGTH)
  $0=substr($0,RSTART+RLENGTH)
  match($0,/^[^[:space:]]+/)
  restPart=substr($0,RSTART+RLENGTH)
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart
}
' file2 file1
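The point about preserved spacing can be checked on a single line. The sketch below is only an illustration: the input line is hypothetical, fields are assumed to be separated by spaces or tabs, and the `{3}` interval is written out three times so that awk builds without interval-expression support (some older mawk releases, for instance) can run it too. It contrasts assigning to `$4`, which makes awk rebuild the record with single spaces, with the match()/substr() splice used in the answer.

```shell
#!/bin/sh
# A sample line with deliberately irregular spacing (hypothetical input).
line='a   b  c gene   10 "x";'

# Assigning to a field makes awk rebuild $0 using the default OFS (one space).
rebuilt=$(printf '%s\n' "$line" | awk '{ $4 = "pseudogene"; print }')

# Splicing with match()/substr() leaves the original spacing untouched.
spliced=$(printf '%s\n' "$line" | awk '
{
    # Match the first three fields plus the whitespace that follows them.
    match($0, /^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/)
    head = substr($0, RSTART, RLENGTH)     # first three fields, spacing intact
    rest = substr($0, RSTART + RLENGTH)    # 4th field and everything after it
    sub(/^[^ \t]+/, "pseudogene", rest)    # swap only the 4th field
    print head rest
}')

printf '%s\n%s\n' "$rebuilt" "$spliced"
# a b c pseudogene 10 "x";
# a   b  c pseudogene   10 "x";
```

The first output line shows the collapsed spacing; the second keeps every run of blanks exactly as it was in the input.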
Explanation: a commented version of the awk program from the solution.
awk '                                          ##Start the awk program.
BEGIN{ s1="\"" }                               ##Set s1 to a double quote in the BEGIN section.
FNR==NR{                                       ##FNR==NR is TRUE only while file2 (the first file) is being read.
  arr[s1 $0 s1";"]                             ##Create an arr entry keyed by the current line wrapped in double quotes plus a semicolon.
  next                                         ##Skip all further statements for this line.
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)  ##Use match to find the first 3 fields and the whitespace after them.
  firstPart=substr($0,RSTART,RLENGTH)          ##Save the matched part in firstPart for later use.
  $0=substr($0,RSTART+RLENGTH)                 ##Set the current line to everything after the matched part.
  match($0,/^[^[:space:]]+/)                   ##Match from the start up to the first whitespace character, which is the 4th field.
  restPart=substr($0,RSTART+RLENGTH)           ##Save everything after the 4th field in restPart.
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart  ##Print firstPart, then pseudogene if the last field is in arr (otherwise the original 4th field), then restPart.
}
' file2 file1                                  ##Name the input files: file2 first, then file1.
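To try the approach end to end, the following sketch recreates the sample inputs in a temporary directory and runs a variant of the program. It is not the answer's exact code but a sketch under assumptions: the interval `{3}` is spelled out so the script does not depend on interval-expression support, fields are assumed to be separated by spaces or tabs, file2 is assumed to be non-empty, and the names `tmpdir`, `wanted`, and `result` are hypothetical.

```shell
#!/bin/sh
# Recreate the sample inputs in a scratch directory (tmpdir is a hypothetical name).
tmpdir=$(mktemp -d)

cat > "$tmpdir/file1" <<'EOF'
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase gene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
EOF

printf '%s\n' betaNACtes5 CR18275 28SrRNA-Psi:CR45859 CR32821 > "$tmpdir/file2"

result=$(awk '
# First file (file2): store each symbol in the quoted form used by file1.
FNR==NR { wanted["\"" $0 "\";"]; next }
# Second file (file1): rewrite the 4th field only when the last field is wanted.
{
    line = $0
    if (($NF in wanted) && match(line, /^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/)) {
        head = substr(line, 1, RLENGTH)       # first three fields, spacing intact
        tail = substr(line, RLENGTH + 1)      # 4th field onward
        sub(/^[^ \t]+/, "pseudogene", tail)   # replace only the 4th field
        line = head tail
    }
    print line
}
' "$tmpdir/file2" "$tmpdir/file1")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With file2 exactly as shown in the question, the tRNA:Pro-CGG-1-1 row is left alone and the other three rows get "pseudogene" in column 4. That includes the CR18275 row, because CR18275 is listed in file2.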