Bash script error in Ubuntu: awk: line 1: regular expression exceeds implementation size limit

Question

I am trying to apply this code to an annotated file generated by snpEff (my OS is Ubuntu):

grep -v '^##' /home/zee/fdr_vs_wt.snp.annotated.vcf | awk 'BEGIN{FS=" "; OFS=" "} $1~/SL2.50chch/ || $10~/^1\/1/ && ($11~/^1\/0/ || $11~/^0\/0/ || $11~/^0\/1/) && $1~/^[0-9X]*$/ && /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion|exon_variant|exon_loss_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation|duplication|gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ {$3=$7=""; print $0}' | sed 's/  */ /g' | awk '{split($9,a,":"); split(a[2],b,","); if (b[1]>b[2] || $1~/SL2.50ch/) print $0}' > /home/zee/fdr_vs_wt.raw.vcfmutantbulk.cands2.txt

I get the following error:

awk: line 1: regular expression /splice_acc ... exceeds implementation size limit

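Can anyone help? I know someone else asked this a while ago, but I am not very technical and did not understand the solution given there. Thanks in advance.

I also plan to use this code later in my Java GUI, where I will run it with ProcessBuilder using the following code: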

    speciesFastaVersionCH = "SL2.50";

    String longInputcmd4b = "ch/ || $10~/^1\\/1/ && ($11~/^1\\/0/ || $11~/^0\\/0/ || $11~/^0\\/1/) && $1~/^[0-9X]*$/ && /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion|exon_variant|exon_loss_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation|duplication|gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ {$3=$7=\"\"; print $0}' | sed 's/  */ /g' | awk '{split($9,a,\":\"); split(a[2],b,\",\"); if (b[1]>b[2] || $1~/";
    StringBuilder cmd4 = new StringBuilder().append("\"").append("grep -v '^##' ").append(outputFilecmd3).append(" | awk 'BEGIN{FS=\" \"; OFS=\" \"} $1~/").append(speciesFastaVersionCH).append(longInputcmd4b).append(speciesFastaVersionCH).append("ch/) print $0}' > ").append(outputFilecmd5).append("\"");



    System.out.println("Here is cmd4:" + cmd4.toString());
    String [] gatkArray1 = cmd1.split(" ");
    String [] gatkArray2 = cmd2.split(" ");
    String [] gatkArray3 = {"bash", "-c", cmd3};


    String [][] gatkArrays = {gatkArray1, gatkArray2, gatkArray3};


    ProcessBuilder pb = new ProcessBuilder(gatkArray3);
    pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
    pb.redirectError(ProcessBuilder.Redirect.INHERIT);
    Process p = pb.start();

Tags: java, regex, bash, awk, processbuilder

Answer


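Your implementation of awk does not support regular expressions of that length.

Specifically, you are using mawk, which rejects a regex literal once it reaches 400 characters, counting the enclosing //: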

$ true | mawk "/$(printf '%397s')/"
(no output)

$ true | mawk "/$(printf '%398s')/" 
mawk: line 1: regular expression /           ... exceeds implementation size limit

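You can either rewrite your awk script to use shorter regex literals (the maximum size guaranteed by POSIX is 256 bytes), or switch to an implementation such as gawk, where the only limit you hit is Linux's maximum size for a single command-line argument, 128 KiB: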

$ true | gawk "/$(printf '%131069s')/"
(no output)

$ true | gawk "/$(printf '%131070s')/"
bash: /usr/bin/gawk: Argument list too long
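
For the first option, one possible rewrite is to split the long alternation into several shorter regex literals joined with `||`. The sketch below is not the original pipeline: the function name `filter_effects` is hypothetical, the grouping into four literals is arbitrary, and it assumes that keeping each literal well under 256 bytes is enough for the awk in use. It only reproduces the effect-name test; the column conditions from the question would be combined with it using `&&` as before.

```shell
# filter_effects (hypothetical name): print only the input lines that
# mention at least one of the effect names from the question.
# Each regex literal is kept short; a newline is allowed after || in awk,
# and the action must start on the same line as the last pattern.
filter_effects() {
  awk '
    /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained/ ||
    /missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion/ ||
    /exon_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation/ ||
    /gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ { print }
  '
}
```

It reads standard input, so it could be dropped into the pipeline after the `grep -v '^##'` step, for example `grep -v '^##' /home/zee/fdr_vs_wt.snp.annotated.vcf | filter_effects`. The duplicated `exon_loss_variant` and `duplication` entries from the original pattern are listed only once here, which does not change what matches.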
