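perl - Correcting a script that counts positive values in a text file
Problem description
Some time ago I asked for help producing a Perl script that counts values in a text file section by section. The script should tell me how many positive values occur in the lines of one section and then, when the next section of the text begins, report the number of positive values again. For example, this is my text file: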
;YP_003858584.1_BtCoVBM48_gp2 25 NKSP 0.1462 (9/9) ---
;YP_003858584.1_BtCoVBM48_gp2 66 NLTW 0.7837 (9/9) +++
;YP_003858584.1_BtCoVBM48_gp2 116 NTTQ 0.7013 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 126 NGTH 0.7112 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 163 NCTY 0.7620 (9/9) +++
;YP_003858584.1_BtCoVBM48_gp2 173 NIST 0.6556 (8/9) +
;YP_003858584.1_BtCoVBM48_gp2 231 NITY 0.7442 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 273 NGTI 0.7109 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 322 NITQ 0.6116 (8/9) +
;YP_003858584.1_BtCoVBM48_gp2 334 NITS 0.7296 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 361 NSSA 0.5388 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 462 NPSG 0.4656 (5/9) -
;YP_003858584.1_BtCoVBM48_gp2 541 NSTK 0.5883 (8/9) +
;YP_003858584.1_BtCoVBM48_gp2 590 NASS 0.5643 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 603 NCTD 0.7117 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 646 NSSY 0.5467 (4/9) +
;YP_003858584.1_BtCoVBM48_gp2 665 NVSS 0.7980 (9/9) +++
;YP_003858584.1_BtCoVBM48_gp2 695 NNTI 0.4537 (5/9) -
;YP_003858584.1_BtCoVBM48_gp2 703 NFSI 0.5613 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 787 NFSQ 0.6209 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 1060 NFTT 0.4540 (6/9) -
;YP_003858584.1_BtCoVBM48_gp2 1084 NGTH 0.5408 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 1120 NNTV 0.5803 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 1144 NHTS 0.3828 (8/9) -
;YP_003858584.1_BtCoVBM48_gp2 1149 NVSL 0.4879 (5/9) -
;YP_003858584.1_BtCoVBM48_gp2 1159 NASV 0.5021 (3/9) +
;YP_003858584.1_BtCoVBM48_gp2 1180 NESL 0.5770 (7/9) +
;ADK66841.1_NA 25 NKSP 0.1462 (9/9) ---
;ADK66841.1_NA 66 NLTW 0.7837 (9/9) +++
;ADK66841.1_NA 116 NTTQ 0.7013 (9/9) ++
;ADK66841.1_NA 126 NGTH 0.7112 (9/9) ++
;ADK66841.1_NA 163 NCTY 0.7620 (9/9) +++
;ADK66841.1_NA 173 NIST 0.6556 (8/9) +
;ADK66841.1_NA 231 NITY 0.7442 (9/9) ++
;ADK66841.1_NA 273 NGTI 0.7109 (9/9) ++
;ADK66841.1_NA 322 NITQ 0.6116 (8/9) +
;ADK66841.1_NA 334 NITS 0.7296 (9/9) ++
;ADK66841.1_NA 361 NSSA 0.5388 (6/9) +
;ADK66841.1_NA 462 NPSG 0.4656 (5/9) -
;ADK66841.1_NA 541 NSTK 0.5883 (8/9) +
;ADK66841.1_NA 590 NASS 0.5643 (6/9) +
;ADK66841.1_NA 603 NCTD 0.7117 (9/9) ++
;ADK66841.1_NA 646 NSSY 0.5467 (4/9) +
;ADK66841.1_NA 665 NVSS 0.7980 (9/9) +++
;ADK66841.1_NA 695 NNTI 0.4537 (5/9) -
;ADK66841.1_NA 703 NFSI 0.5613 (9/9) ++
;ADK66841.1_NA 787 NFSQ 0.6209 (9/9) ++
;ADK66841.1_NA 1060 NFTT 0.4540 (6/9) -
;ADK66841.1_NA 1084 NGTH 0.5408 (6/9) +
;ADK66841.1_NA 1120 NNTV 0.5803 (6/9) +
;ADK66841.1_NA 1144 NHTS 0.3828 (8/9) -
;ADK66841.1_NA 1149 NVSL 0.4879 (5/9) -
;ADK66841.1_NA 1159 NASV 0.5021 (3/9) +
;ADK66841.1_NA 1180 NESL 0.5770 (7/9) +
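The file marks which values are positive: only values >= 0.7 count as positive. The text file has two parts, one for YP_003858584.1_BtCoVBM48_gp2 and one for ADK66841.1_NA. Counting the positive values (>= 0.7) in each part gives 9 per part. I have many files like this, containing hundreds of parts, so I asked for a Perl script to count these values. This is the script: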
use strict;
use warnings;
my $cnt = {};
while(my $line = <STDIN>) {
    if($. == 1) {
        next;
    }else {
        my @cols = split(m{\s+},$line);
        if(@cols == 6) {
            my $potential = $cols[3];
            my $id = $cols[0];
            $id =~ s{^\;}{};
            if(0.7 >= $potential) {
                $cnt->{$id}++;
            };
        };
    };
};
my @ids_found = sort { $a cmp $b } (keys %$cnt);
for my $id (@ids_found) {
    print "PART $id:\n";
    print "$cnt->{$id} (values 0.7 >=)\n";
};
This runs, but I noticed an error in the output. Output:
$ cat Test00.txt | perl File_for_count_values.pl
PART ADK66841.1_NA:
18 (values 0.7 >=)
PART YP_003858584.1_BtCoVBM48_gp2:
18 (values 0.7 >=)
The output is not what I want: when counting, the script seems to add up the positive values of both parts (9 + 9 = 18). The output should be:
$ cat Test00.txt | perl File_for_count_values.pl
PART ADK66841.1_NA:
9 (values 0.7 >=)
PART YP_003858584.1_BtCoVBM48_gp2:
9 (values 0.7 >=)
Any ideas about what has to change in the script to achieve this?
Any comments are welcome.
Solution
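Please check whether the following redesigned Perl script is of any use.
Note: the original code assumes a header line, based on the instruction if($. == 1); see $. (the input line counter).
Some changes were made to improve the readability of the script:
- the variable $threshold is defined at the top of the script
- the header/first line is skipped with next unless $. > 1 (next unless the line counter is above one)
- the line is split not only on spaces but also on ; to avoid the substitution
- $id and $potential are filled from the array @cols in one instruction
- the field numbers are adjusted, because the first field (before the ;) will be empty
- write is used for formatted output
Note: see $~, which defines the current output format for write; it is used here to close the table.
The script uses a __DATA__ block with the originally posted data to demonstrate the output.
Change while( <DATA> ) to while( <> ) to accept input from STDIN or from a file named as an argument to the script (run as ./script.pl file.dat).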
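The counting loop described above can also be wrapped in a subroutine that reads from any filehandle, which makes it easy to try on a few rows before pointing it at real files. This is only a minimal sketch: count_hits is a hypothetical helper name, not part of the original answer, and the "header" line in the sample is an assumption (the posted data shows no header).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): count rows per ID
# whose potential is at or above $threshold, reading from any filehandle.
sub count_hits {
    my ($fh, $threshold) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        next unless $. > 1;               # skip the header/first line
        chomp $line;
        my @cols = split /[; ]+/, $line;  # leading ";" gives an empty field 0
        next unless @cols == 7;
        my ($id, $potential) = @cols[1, 4];
        $count{$id}++ if $potential >= $threshold;
    }
    return \%count;
}

# Demonstration on an in-memory handle; the "header" line is an assumption
# and the data rows are copied from the posted file.
my $sample = join "\n",
    'header',
    ';ADK66841.1_NA 25 NKSP 0.1462 (9/9) ---',
    ';ADK66841.1_NA 66 NLTW 0.7837 (9/9) +++',
    ';ADK66841.1_NA 116 NTTQ 0.7013 (9/9) ++',
    ';ADK66841.1_NA 173 NIST 0.6556 (8/9) +';
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $count = count_hits( $fh, 0.7 );
print "$_: $count->{$_}\n" for sort keys %$count;   # ADK66841.1_NA: 2
```

To read real input, the same subroutine can be given \*STDIN or a handle opened on a data file.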
#!/usr/bin/env perl
#
# vim: ai ts=4 sw=4
use strict;
use warnings;

my($id,$counter);
my $threshold = 0.7;

while( <DATA> ) {
    chomp;
    next unless $. > 1;                 # skip the header/first line
    my @cols = split("[; ]+", $_);      # leading ; yields an empty first field
    next unless @cols == 7;
    my($id,$potential) = @cols[1,4];
    $counter->{$id}++ if $potential >= $threshold;
}

my @sorted_ids = sort { $a cmp $b } keys %$counter;

for $id (@sorted_ids) {
    write;                              # one table row per ID
}

$~ = "STDOUT_BOTTOM";                   # switch the format to close the table
write;

exit 0;

format STDOUT_TOP =
Criteria: potential >= @#.##
$threshold
+-----------------------------+-------+
| Part                        | Count |
+-----------------------------+-------+
.
format STDOUT =
| @<<<<<<<<<<<<<<<<<<<<<<<<<< | @>>>> |
$id,$counter->{$id}
.
format STDOUT_BOTTOM =
+-----------------------------+-------+
.
__DATA__
;YP_003858584.1_BtCoVBM48_gp2 25 NKSP 0.1462 (9/9) ---
;YP_003858584.1_BtCoVBM48_gp2 66 NLTW 0.7837 (9/9) +++
;YP_003858584.1_BtCoVBM48_gp2 116 NTTQ 0.7013 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 126 NGTH 0.7112 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 163 NCTY 0.7620 (9/9) +++
;YP_003858584.1_BtCoVBM48_gp2 173 NIST 0.6556 (8/9) +
;YP_003858584.1_BtCoVBM48_gp2 231 NITY 0.7442 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 273 NGTI 0.7109 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 322 NITQ 0.6116 (8/9) +
;YP_003858584.1_BtCoVBM48_gp2 334 NITS 0.7296 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 361 NSSA 0.5388 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 462 NPSG 0.4656 (5/9) -
;YP_003858584.1_BtCoVBM48_gp2 541 NSTK 0.5883 (8/9) +
;YP_003858584.1_BtCoVBM48_gp2 590 NASS 0.5643 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 603 NCTD 0.7117 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 646 NSSY 0.5467 (4/9) +
;YP_003858584.1_BtCoVBM48_gp2 665 NVSS 0.7980 (9/9) +++
;YP_003858584.1_BtCoVBM48_gp2 695 NNTI 0.4537 (5/9) -
;YP_003858584.1_BtCoVBM48_gp2 703 NFSI 0.5613 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 787 NFSQ 0.6209 (9/9) ++
;YP_003858584.1_BtCoVBM48_gp2 1060 NFTT 0.4540 (6/9) -
;YP_003858584.1_BtCoVBM48_gp2 1084 NGTH 0.5408 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 1120 NNTV 0.5803 (6/9) +
;YP_003858584.1_BtCoVBM48_gp2 1144 NHTS 0.3828 (8/9) -
;YP_003858584.1_BtCoVBM48_gp2 1149 NVSL 0.4879 (5/9) -
;YP_003858584.1_BtCoVBM48_gp2 1159 NASV 0.5021 (3/9) +
;YP_003858584.1_BtCoVBM48_gp2 1180 NESL 0.5770 (7/9) +
;ADK66841.1_NA 25 NKSP 0.1462 (9/9) ---
;ADK66841.1_NA 66 NLTW 0.7837 (9/9) +++
;ADK66841.1_NA 116 NTTQ 0.7013 (9/9) ++
;ADK66841.1_NA 126 NGTH 0.7112 (9/9) ++
;ADK66841.1_NA 163 NCTY 0.7620 (9/9) +++
;ADK66841.1_NA 173 NIST 0.6556 (8/9) +
;ADK66841.1_NA 231 NITY 0.7442 (9/9) ++
;ADK66841.1_NA 273 NGTI 0.7109 (9/9) ++
;ADK66841.1_NA 322 NITQ 0.6116 (8/9) +
;ADK66841.1_NA 334 NITS 0.7296 (9/9) ++
;ADK66841.1_NA 361 NSSA 0.5388 (6/9) +
;ADK66841.1_NA 462 NPSG 0.4656 (5/9) -
;ADK66841.1_NA 541 NSTK 0.5883 (8/9) +
;ADK66841.1_NA 590 NASS 0.5643 (6/9) +
;ADK66841.1_NA 603 NCTD 0.7117 (9/9) ++
;ADK66841.1_NA 646 NSSY 0.5467 (4/9) +
;ADK66841.1_NA 665 NVSS 0.7980 (9/9) +++
;ADK66841.1_NA 695 NNTI 0.4537 (5/9) -
;ADK66841.1_NA 703 NFSI 0.5613 (9/9) ++
;ADK66841.1_NA 787 NFSQ 0.6209 (9/9) ++
;ADK66841.1_NA 1060 NFTT 0.4540 (6/9) -
;ADK66841.1_NA 1084 NGTH 0.5408 (6/9) +
;ADK66841.1_NA 1120 NNTV 0.5803 (6/9) +
;ADK66841.1_NA 1144 NHTS 0.3828 (8/9) -
;ADK66841.1_NA 1149 NVSL 0.4879 (5/9) -
;ADK66841.1_NA 1159 NASV 0.5021 (3/9) +
;ADK66841.1_NA 1180 NESL 0.5770 (7/9) +
Output
Criteria: potential >= 0.70
+-----------------------------+-------+
| Part                        | Count |
+-----------------------------+-------+
| ADK66841.1_NA               |     9 |
| YP_003858584.1_BtCoVBM48_gp |     9 |
+-----------------------------+-------+
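A likely reason the script in the question printed 18 is the direction of its comparison: 0.7 >= $potential is true for values at or below 0.7, not above it. The sketch below checks this against the 27 potentials of one part, copied from the posted data. That this is the cause of the original output is an inference from the numbers, not something stated in the thread.

```perl
use strict;
use warnings;

# Potentials for one part, copied from the posted data (27 rows).
my @potentials = qw(
    0.1462 0.7837 0.7013 0.7112 0.7620 0.6556 0.7442 0.7109 0.6116
    0.7296 0.5388 0.4656 0.5883 0.5643 0.7117 0.5467 0.7980 0.4537
    0.5613 0.6209 0.4540 0.5408 0.5803 0.3828 0.4879 0.5021 0.5770
);

my $threshold = 0.7;

# Test used in the question: true when the value is at or BELOW 0.7.
my $at_or_below = grep { $threshold >= $_ } @potentials;

# Test used in the answer: true when the value is at or ABOVE 0.7.
my $at_or_above = grep { $_ >= $threshold } @potentials;

print "threshold >= value: $at_or_below\n";   # 18
print "value >= threshold: $at_or_above\n";   # 9
```

So 18 is the number of rows per part that fall below the threshold, which happens to equal 9 + 9.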
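Notes:
The file you reference on GitHub does not include the leading ; in the data file. For this reason the number of fields is reduced by one, so no results were produced.
Please make the following change in the Perl script: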
next unless @cols == 7;
my($id,$potential) = @cols[1,4];
to
next unless @cols == 6;
my($id,$potential) = @cols[0,3];
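If some data files carry the leading ; and others do not, one option is to strip the optional ; first and then always expect six fields. The following is a sketch under that assumption; parse_row is a hypothetical helper and not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return (id, potential) for a data row, or an empty
# list when the row does not have the expected six fields.
sub parse_row {
    my ($line) = @_;
    $line =~ s/^\s*;?\s*//;        # drop an optional leading ";"
    my @cols = split ' ', $line;   # split on runs of whitespace
    return unless @cols == 6;
    return @cols[0, 3];
}

# The same row with and without the leading ";" (row taken from the posted data).
for my $row ( ';ADK66841.1_NA 66 NLTW 0.7837 (9/9) +++',
              'ADK66841.1_NA 66 NLTW 0.7837 (9/9) +++' ) {
    my ($id, $potential) = parse_row($row);
    print "$id $potential\n";      # both rows print: ADK66841.1_NA 0.7837
}
```

With a helper like this the field indices stay at 0 and 3 for both file variants, so the script does not need to be edited per file.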