Replacing numbers with their respective strings in awk

Problem description

I'm new to bash/awk programming, and my file looks like this:

1   10032154    10032154    A   C   Leber_congenital_amaurosis_9    criteria_provided,_single_submitter Benign  .   1
1   10032184    10032184    A   G   Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts    Pathogenic/Likely_pathogenic    .   1,4
1   10032209    10032209    G   A   not_provided    criteria_provided,_single_submitter Likely_benign   .   8,64,512

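Using awk, I want to replace the numbers in the last column ($10) with their descriptions. I have put the numbers and their definitions in two separate arrays. My idea was to replace the numbers by iterating over both arrays together. Here, 0 is "unknown", 1 is "germline", 2 is "somatic", and so on.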

z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")

number=$(IFS=,; echo "${z[*]}")
def=$(IFS=,; echo "${t[*]}")
    
awk -v a="$number" -v b="${def}" 'BEGIN { OFS="\t" } /#/ {next} 
{
    x=split(a, e, /,/)
    y=split(b, f, /,/)
    
    delete c
    m=split($10, c, /,/)
    for (i=1; i<=m; i++) {
        for (j=1; j<=x; j++) {
            if (c[i]==e[j]) {
                c[i]=f[j]
            }
        }
        $10+=sprintf("%s, ",c[i])
    }
    print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
}' input.vcf > output.vcf

The output should look like this:

1   10032154    10032154    A   C   Leber_congenital_amaurosis_9    criteria_provided,_single_submitter Benign  .   germline
1   10032184    10032184    A   G   Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts    Pathogenic/Likely_pathogenic    .   germline,inherited
1   10032209    10032209    G   A   not_provided    criteria_provided,_single_submitter Likely_benign   .   paternal,biparental,tested-inconclusive

I'd be grateful for any help!

All the best

Tags: bash, awk

Solution


Assuming you don't actually need the lists of numbers and names to be defined as two shell arrays for some other reason:

$ cat tst.awk
BEGIN {
    split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824",nrsArr)
    split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other",namesArr)
    for (i in nrsArr) {
        nr2name[nrsArr[i]] = namesArr[i]
    }
}
!/#/ {
    n = split($NF,nrs,/,/)
    sub(/[^[:space:]]+$/,"")
    printf "%s", $0
    for (i=1; i<=n; i++) {
        printf "%s%s", nr2name[nrs[i]], (i<n ? "," : ORS)
    }
}

$ awk -f tst.awk input.vcf
1   10032154    10032154    A   C   Leber_congenital_amaurosis_9    criteria_provided,_single_submitter Benign  .   germline
1   10032184    10032184    A   G   Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts    Pathogenic/Likely_pathogenic    .   germline,inherited
1   10032209    10032209    G   A   not_provided    criteria_provided,_single_submitter Likely_benign   .   paternal,biparental,tested-inconclusive

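The script above keeps each input line's original whitespace, in case that matters: it edits $0 with sub() rather than assigning to a field, which would make awk rebuild the record using OFS.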
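
The whitespace point can be seen in isolation with a small sketch. The sample line and the variable names `line`, `kept`, and `rebuilt` are made up for illustration and are not from the question. The first command strips the last field with sub() on $0, as tst.awk does; the second assigns to $NF, which causes awk to rebuild the record with OFS (a single space by default).

```shell
# Hypothetical sample line with several spaces between fields.
line='a   b    1'

# Editing $0 with sub() leaves the original spacing untouched.
kept=$(printf '%s\n' "$line" | awk '{ sub(/[^[:space:]]+$/, "X"); print }')

# Assigning to a field makes awk rebuild $0, joining the fields with OFS.
rebuilt=$(printf '%s\n' "$line" | awk '{ $NF = "X"; print }')

printf '[%s]\n[%s]\n' "$kept" "$rebuilt"
# prints:
# [a   b    X]
# [a b X]
```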

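If the two lists do have to come from the shell, a variant along the following lines may work. It is only a sketch under a few assumptions: `map_origin` is a hypothetical helper name, the comma-joined strings are written out literally here instead of being built from the `z` and `t` arrays, and a number with no known label is passed through unchanged (the question does not say what should happen in that case). Compared with the script in the question, the lookup table is built once in BEGIN instead of on every line, and the result is built by string concatenation; in awk `+=` is numeric addition, so `$10+=sprintf(...)` cannot append text.

```shell
# Comma-joined lists, as produced in the question from the z and t arrays with
#   number=$(IFS=,; echo "${z[*]}")  and  def=$(IFS=,; echo "${t[*]}")
number='0,1,2,4,8,16,32,64,128,256,512,1024,1073741824'
def='unknown,germline,somatic,inherited,paternal,maternal,de-novo,biparental,uniparental,not-tested,tested-inconclusive,not-reported,other'

# map_origin (hypothetical name) reads records on stdin and writes them to stdout.
map_origin() {
    awk -v a="$number" -v b="$def" '
    BEGIN {
        OFS = "\t"
        # Build the number -> name lookup once, not once per input line.
        n = split(a, codes, /,/)
        split(b, labels, /,/)
        for (i = 1; i <= n; i++) nr2name[codes[i]] = labels[i]
    }
    /#/ { next }    # skip lines containing "#", as in the question
    {
        m = split($NF, nrs, /,/)
        out = ""
        for (i = 1; i <= m; i++) {
            # Assumption: a number without a label is kept as it is.
            name = (nrs[i] in nr2name) ? nr2name[nrs[i]] : nrs[i]
            out = out (i > 1 ? "," : "") name
        }
        $NF = out   # assigning a field rebuilds the record with OFS (tabs)
        print
    }'
}

# Example call on one tab-separated record taken from the sample input.
printf '1\t10032209\t10032209\tG\tA\tnot_provided\tcriteria_provided,_single_submitter\tLikely_benign\t.\t8,64,512\n' | map_origin
```

For a whole file, the call would be `map_origin < input.vcf > output.vcf`. Unlike tst.awk, this variant assigns to `$NF`, so the output is always tab-separated regardless of the input's spacing.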
