Correcting a script that counts positive values in a text file

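Problem description

Some time ago I asked for help creating a Perl script that counts the values in a text file section by section. When some lines of a section contain positive values, the script should report how many there are, and then report the count again for each following section. For example, this is my text file: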

;YP_003858584.1_BtCoVBM48_gp2   25 NKSP   0.1462     (9/9)   ---   
;YP_003858584.1_BtCoVBM48_gp2   66 NLTW   0.7837     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  116 NTTQ   0.7013     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  126 NGTH   0.7112     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  163 NCTY   0.7620     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  173 NIST   0.6556     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  231 NITY   0.7442     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  273 NGTI   0.7109     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  322 NITQ   0.6116     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  334 NITS   0.7296     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  361 NSSA   0.5388     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  462 NPSG   0.4656     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  541 NSTK   0.5883     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  590 NASS   0.5643     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  603 NCTD   0.7117     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  646 NSSY   0.5467     (4/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  665 NVSS   0.7980     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  695 NNTI   0.4537     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  703 NFSI   0.5613     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  787 NFSQ   0.6209     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2 1060 NFTT   0.4540     (6/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1084 NGTH   0.5408     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1120 NNTV   0.5803     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1144 NHTS   0.3828     (8/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1149 NVSL   0.4879     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1159 NASV   0.5021     (3/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1180 NESL   0.5770     (7/9)   +     
;ADK66841.1_NA   25 NKSP   0.1462     (9/9)   ---   
;ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   
;ADK66841.1_NA  116 NTTQ   0.7013     (9/9)   ++    
;ADK66841.1_NA  126 NGTH   0.7112     (9/9)   ++    
;ADK66841.1_NA  163 NCTY   0.7620     (9/9)   +++   
;ADK66841.1_NA  173 NIST   0.6556     (8/9)   +     
;ADK66841.1_NA  231 NITY   0.7442     (9/9)   ++    
;ADK66841.1_NA  273 NGTI   0.7109     (9/9)   ++    
;ADK66841.1_NA  322 NITQ   0.6116     (8/9)   +     
;ADK66841.1_NA  334 NITS   0.7296     (9/9)   ++    
;ADK66841.1_NA  361 NSSA   0.5388     (6/9)   +     
;ADK66841.1_NA  462 NPSG   0.4656     (5/9)   -     
;ADK66841.1_NA  541 NSTK   0.5883     (8/9)   +     
;ADK66841.1_NA  590 NASS   0.5643     (6/9)   +     
;ADK66841.1_NA  603 NCTD   0.7117     (9/9)   ++    
;ADK66841.1_NA  646 NSSY   0.5467     (4/9)   +     
;ADK66841.1_NA  665 NVSS   0.7980     (9/9)   +++   
;ADK66841.1_NA  695 NNTI   0.4537     (5/9)   -     
;ADK66841.1_NA  703 NFSI   0.5613     (9/9)   ++    
;ADK66841.1_NA  787 NFSQ   0.6209     (9/9)   ++    
;ADK66841.1_NA 1060 NFTT   0.4540     (6/9)   -     
;ADK66841.1_NA 1084 NGTH   0.5408     (6/9)   +     
;ADK66841.1_NA 1120 NNTV   0.5803     (6/9)   +     
;ADK66841.1_NA 1144 NHTS   0.3828     (8/9)   -     
;ADK66841.1_NA 1149 NVSL   0.4879     (5/9)   -     
;ADK66841.1_NA 1159 NASV   0.5021     (3/9)   +     
;ADK66841.1_NA 1180 NESL   0.5770     (7/9)   +     

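The file gives me a value for each line; for my purposes, only values of 0.7 or higher (>= 0.7) count as positive. The text file has two sections: one for YP_003858584.1_BtCoVBM48_gp2 and one for ADK66841.1_NA. Counting the positive values (>= 0.7) in each section gives 9 per section. I have many files like this, each containing hundreds of sections, so I wanted a Perl script to count these values. This is the script: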

use strict;
use warnings;

my $cnt = {};
while(my $line = <STDIN>) {
    if($. == 1) {
        next;
    }else {
        my @cols = split(m{\s+},$line);
        if(@cols == 6) {
            my $potential = $cols[3];
            my $id = $cols[0];
            $id =~ s{^\;}{};
            if(0.7 >= $potential) {
                $cnt->{$id}++;
            };
        };
    };
};

my @ids_found = sort { $a cmp $b } (keys %$cnt);

for my $id (@ids_found) {
    print "PART $id:\n";
    print "$cnt->{$id} (values 0.7 >=)\n";
};

This runs, but I noticed an error in the output. Output:

$ cat Test00.txt | perl File_for_count_values.pl 
PART ADK66841.1_NA:
18 (values 0.7 >=)
PART YP_003858584.1_BtCoVBM48_gp2:
18 (values 0.7 >=)

The output is not what I want: when counting, the script seems to add up the positive values of both sections (9 + 9 = 18). The output should be:

$ cat Test00.txt | perl File_for_count_values.pl 
PART ADK66841.1_NA:
9 (values 0.7 >=)
PART YP_003858584.1_BtCoVBM48_gp2:
9 (values 0.7 >=)

Any ideas about what has to change in the script to achieve this?

Any comments are welcome.

Tags: perl, text-files

Solution


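Please check whether the following reworked Perl script is useful.

Note: the original code assumes a header line, based on the instruction if($. == 1); see $. in perlvar.

Some changes were made to improve the script's readability:

  • the variable $threshold is defined at the top of the script
  • the header/first line is skipped with next unless $. > 1 (next, unless the line counter exceeds one)
  • lines are split not only on spaces but also on ;, which avoids the substitution
  • $id and $potential are filled from the array @cols in a single instruction
  • the field numbers are adjusted, because the first field, the one before the ;, will be empty
  • format and write are used to format the output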
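The split-related points in the list above can be checked on a single row. This is only a small sketch using one line from the posted data; it shows the empty leading field that the ; produces and why the indexes 1 and 4 are the ones to pick.

```perl
use strict;
use warnings;

# One data row copied from the question, including the trailing blanks.
my $line = ";ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   ";

# Split on runs of ';' or spaces. The leading ';' matches at the start of
# the string, so the first field is empty; trailing empty fields are dropped.
my @cols = split /[; ]+/, $line;

printf "%d fields\n", scalar @cols;
printf "  [%d] '%s'\n", $_, $cols[$_] for 0 .. $#cols;

# With the empty field at index 0, the id is at 1 and the potential at 4.
my ($id, $potential) = @cols[1, 4];
print "id=$id potential=$potential\n";
```

Note that the character class [; ] does not include tabs, so a tab-separated file would need a different pattern.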

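Note: see $~, which defines the current output format for write; it is used here to close the table.

The script uses a __DATA__ block containing the originally posted data to demonstrate the output.

Change while( <DATA> ) to while( <> ) to accept input either from STDIN or from a file named as an argument to the script (run as ./script.pl file.dat).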
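If format and write are unfamiliar, roughly the same table can be built with sprintf. This is a substitute technique, not what the script uses; render_table is a hypothetical helper name, and the column widths (27 and 5) are read off the format pictures in the script.

```perl
use strict;
use warnings;

# Hypothetical helper: build the report as a string with sprintf
# instead of format/write.
sub render_table {
    my ($threshold, $counter) = @_;
    my $rule = '+' . ('-' x 29) . '+' . ('-' x 7) . "+\n";
    my $out  = sprintf("Criteria:          potential >= %5.2f\n\n", $threshold);
    $out .= $rule . sprintf("| %-27s | %-5s |\n", 'Part', 'Count') . $rule;
    for my $id (sort keys %$counter) {
        # %-27.27s pads and truncates to 27 characters, like the format picture.
        $out .= sprintf("| %-27.27s | %5d |\n", $id, $counter->{$id});
    }
    return $out . $rule;
}

print render_table(0.7, {
    'ADK66841.1_NA'                => 9,
    'YP_003858584.1_BtCoVBM48_gp2' => 9,
});
```

As with the format version, ids longer than 27 characters are cut off, so widen the column if the full id matters.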
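One thing to keep in mind with while( <> ) when several files are given: as far as I know, $. keeps counting across files unless ARGV is closed explicitly, so next unless $. > 1 skips only the first line of the first file. A minimal sketch, with two temporary stand-in files whose contents are made up for the demonstration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create two small stand-in input files (hypothetical content, not the real data).
my @names;
for my $n (1, 2) {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} "header $n\nrow $n\n";
    close $fh or die "close failed: $!";
    push @names, $name;
}

# Imitate running: ./script.pl file1 file2
@ARGV = @names;

my @kept;
while (<>) {
    chomp;
    next unless $. > 1;    # same skip rule as in the script
    push @kept, $_;
}

# The header of the second file is not skipped, because $. is 3 there.
print "$_\n" for @kept;
```

In the script itself a stray header line would most likely be dropped by the field-count check anyway. If each file needs its own line numbering, the eof entry in perlfunc shows the close ARGV if eof idiom.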

#!/usr/bin/env perl
#
# vim: ai ts=4 sw=4

use strict;
use warnings;

my($id,$counter);
my $threshold = 0.7;    # minimum potential that counts as positive

while( <DATA> ) {
    chomp;
    next unless $. > 1;                 # skip the first line
    my @cols = split("[; ]+", $_);      # the leading ; yields an empty first field
    next unless @cols == 7;
    my($id,$potential) = @cols[1,4];
    $counter->{$id}++ if $potential >= $threshold;
}

my @sorted_ids = sort { $a cmp $b } keys %$counter;

for $id (@sorted_ids) {
    write;
}

$~ = "STDOUT_BOTTOM";
write;

exit 0;

format STDOUT_TOP =

Criteria:          potential >= @#.##
$threshold

+-----------------------------+-------+
| Part                        | Count |
+-----------------------------+-------+
.

format STDOUT =
| @<<<<<<<<<<<<<<<<<<<<<<<<<< | @>>>> |
$id,$counter->{$id}
.

format STDOUT_BOTTOM =
+-----------------------------+-------+

.

__DATA__
;YP_003858584.1_BtCoVBM48_gp2   25 NKSP   0.1462     (9/9)   ---   
;YP_003858584.1_BtCoVBM48_gp2   66 NLTW   0.7837     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  116 NTTQ   0.7013     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  126 NGTH   0.7112     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  163 NCTY   0.7620     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  173 NIST   0.6556     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  231 NITY   0.7442     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  273 NGTI   0.7109     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  322 NITQ   0.6116     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  334 NITS   0.7296     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  361 NSSA   0.5388     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  462 NPSG   0.4656     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  541 NSTK   0.5883     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  590 NASS   0.5643     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  603 NCTD   0.7117     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  646 NSSY   0.5467     (4/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  665 NVSS   0.7980     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  695 NNTI   0.4537     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  703 NFSI   0.5613     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  787 NFSQ   0.6209     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2 1060 NFTT   0.4540     (6/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1084 NGTH   0.5408     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1120 NNTV   0.5803     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1144 NHTS   0.3828     (8/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1149 NVSL   0.4879     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1159 NASV   0.5021     (3/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1180 NESL   0.5770     (7/9)   +     
;ADK66841.1_NA   25 NKSP   0.1462     (9/9)   ---   
;ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   
;ADK66841.1_NA  116 NTTQ   0.7013     (9/9)   ++    
;ADK66841.1_NA  126 NGTH   0.7112     (9/9)   ++    
;ADK66841.1_NA  163 NCTY   0.7620     (9/9)   +++   
;ADK66841.1_NA  173 NIST   0.6556     (8/9)   +     
;ADK66841.1_NA  231 NITY   0.7442     (9/9)   ++    
;ADK66841.1_NA  273 NGTI   0.7109     (9/9)   ++    
;ADK66841.1_NA  322 NITQ   0.6116     (8/9)   +     
;ADK66841.1_NA  334 NITS   0.7296     (9/9)   ++    
;ADK66841.1_NA  361 NSSA   0.5388     (6/9)   +     
;ADK66841.1_NA  462 NPSG   0.4656     (5/9)   -     
;ADK66841.1_NA  541 NSTK   0.5883     (8/9)   +     
;ADK66841.1_NA  590 NASS   0.5643     (6/9)   +     
;ADK66841.1_NA  603 NCTD   0.7117     (9/9)   ++    
;ADK66841.1_NA  646 NSSY   0.5467     (4/9)   +     
;ADK66841.1_NA  665 NVSS   0.7980     (9/9)   +++   
;ADK66841.1_NA  695 NNTI   0.4537     (5/9)   -     
;ADK66841.1_NA  703 NFSI   0.5613     (9/9)   ++    
;ADK66841.1_NA  787 NFSQ   0.6209     (9/9)   ++    
;ADK66841.1_NA 1060 NFTT   0.4540     (6/9)   -     
;ADK66841.1_NA 1084 NGTH   0.5408     (6/9)   +     
;ADK66841.1_NA 1120 NNTV   0.5803     (6/9)   +     
;ADK66841.1_NA 1144 NHTS   0.3828     (8/9)   -     
;ADK66841.1_NA 1149 NVSL   0.4879     (5/9)   -     
;ADK66841.1_NA 1159 NASV   0.5021     (3/9)   +     
;ADK66841.1_NA 1180 NESL   0.5770     (7/9)   +     

Output


Criteria:          potential >=  0.70

+-----------------------------+-------+
| Part                        | Count |
+-----------------------------+-------+
| ADK66841.1_NA               |     9 |
| YP_003858584.1_BtCoVBM48_gp |     9 |
+-----------------------------+-------+
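A possible explanation for the 18 in the question, offered as a reading of the posted data rather than something stated in the thread: the original test 0.7 >= $potential selects values at or below 0.7. Each section has 27 rows, 9 of them at or above 0.7, which leaves 18. The sketch below checks this on the potentials of one section, copied from the sample data; the header skip is left out.

```perl
use strict;
use warnings;

# Potential values of one section, copied from the sample data in the question.
my @potentials = qw(
    0.1462 0.7837 0.7013 0.7112 0.7620 0.6556 0.7442 0.7109 0.6116
    0.7296 0.5388 0.4656 0.5883 0.5643 0.7117 0.5467 0.7980 0.4537
    0.5613 0.6209 0.4540 0.5408 0.5803 0.3828 0.4879 0.5021 0.5770
);

# grep in scalar context returns the number of matching elements.
my $at_or_below = grep { 0.7 >= $_ } @potentials;   # comparison as written in the question
my $at_or_above = grep { $_ >= 0.7 } @potentials;   # comparison used in the reworked script

print "0.7 >= potential: $at_or_below\n";   # prints 18
print "potential >= 0.7: $at_or_above\n";   # prints 9
```

So the two sections were probably not added together; the operands of the comparison were simply the wrong way round.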

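Notes:

The file of yours referenced on GitHub does not include the leading ; in the data lines. Because of this, the number of fields is one lower, and the script produced no results.

Please make the following change in the Perl script, replacing the first pair of lines with the second pair: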

       next unless @cols == 7;
       my($id,$potential) = @cols[1,4];

       next unless @cols == 6;
       my($id,$potential) = @cols[0,3];
