perl - Computing the mean or median of columns in a matrix with Perl
Problem description
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
A_1 18.66530716 0 10.45969216 52.71893547 40.04726048 32.16758825 38.27754435
A_2 25.19816467 0 12.5516306 37.95763354 28.39714834 25.7340706 37.581589
A_3 61.5006053 0 6.807664053 4.57493135 23.69514333 9.304974679 29.44245014
A_4 46.71317515 4.988346264 21.47872616 36.08568845 7.47600779 18.34871344 75.02919728
A_5 38.12488272 0 0 28.71499464 19.82997811 19.46785483 66.33787183
A_6 44.16019386 3.313750449 10.70121259 38.35466425 8.691025042 13.40792311 42.72152213
B_1 38.39720331 13.32601073 0 19.28006783 9.985810405 9.803455466 95.44530538
B_2 46.53021582 1.899838598 24.54086634 13.74342921 24.20186228 6.988206544 47.62545788
B_3 48.42890507 0 6.0308135 20.26433556 20.99119304 10.30393217 64.20344867
A_7 32.10687649 0 20.56239825 23.03079775 9.542753971 10.5395511 44.46513374
B_4 34.82673166 0 6.122746633 39.08916191 8.524472297 14.64540603 54.99744731
B_5 32.49685303 2.910517165 15.66506159 35.79294964 8.723952928 10.7058016 52.11522135
B_6 30.38974634 0 0 30.51870034 10.53778987 17.24225836 50.36058827
B_7 59.60856159 0 8.097826192 19.0468412 2.818575518 11.06841746 10.77608287
A_8 36.07790915 6.260541956 0 31.70212496 14.07396097 4.605650219 67.26011453
C_1 0 17.27445836 0 382.0309737 1.849224149 0 0
C_2 344.0389416 119.4010562 32.13217433 0 22.36821531 285.4766232 21.37974841
C_3 235.5547989 37.86357293 22.23167043 2.490045661 2.579360621 30.38709443 14.79226135
C_4 0 2.801263518 0 334.3615367 0 0 0
C_5 9.397916894 128.2900334 187.2504332 25.16745451 22.81140838 14.39668285 0
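This is the data matrix. Rows are variables and columns are sample IDs.
A_1 - A_8 form cluster A, B_1 - B_7 form cluster B, and C_1 - C_5 form cluster C.
I want to compute the mean or median of A_1 - A_8 for each sample and use it as the value for clusterA. For the median, the expected result is: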
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
clusterA 37.10139593 0 10.58045238 33.89390671 16.95196954 15.87831827 43.59332793
Can anyone help me solve this with a Perl script?
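The grouping rule described in the question (A_1 belongs to cluster A, B_7 to cluster B, and so on) can be sketched as a small helper. This is only an illustration: the name cluster_of is hypothetical, and it assumes every Var_ID has the form label, underscore, number. Under that assumption the header field Var_ID does not match and yields undef.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): derive the cluster
# label from a Var_ID such as "A_1". Assumes IDs look like <label>_<number>;
# anything else (e.g. the header field "Var_ID") returns undef.
sub cluster_of {
    my ($var_id) = @_;
    return $var_id =~ /^(.+)_\d+$/ ? $1 : undef;
}

# Print the label for a few IDs taken from the table
print cluster_of($_) // 'undef', "\n" for qw(A_1 B_7 C_5);
```

Matching on the whole prefix rather than the first character would also cope with multi-letter cluster names, should the data ever contain them.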
Solution
Computing the mean and the median:
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);
use POSIX qw(floor ceil);

my %data;      # cluster letter => list of rows (array refs of sample values)
my %avg;
my %median;

while (<>) {
    next if $. == 1;                 # skip the header line
    my @fields = split;
    next unless @fields;             # skip blank lines
    # The cluster is the first character of Var_ID (A_1 -> A)
    my $cluster = substr($fields[0], 0, 1);
    push @{ $data{$cluster} }, [ @fields[1 .. $#fields] ];
}

for my $cluster (keys %data) {
    for my $sampleNo (0 .. $#{ $data{$cluster}[0] }) {
        # Collect one column (sample) across all rows of the cluster
        my @samples = map { $_->[$sampleNo] } @{ $data{$cluster} };
        my $cnt = @samples;
        $avg{$cluster}[$sampleNo] = sum(@samples) / $cnt;
        # Sort numerically; the default sort compares strings
        my @sorted = sort { $a <=> $b } @samples;
        # Average the two middle elements (the same element when $cnt is odd)
        $median{$cluster}[$sampleNo] = ($sorted[floor(($cnt + 1) / 2) - 1] +
                                        $sorted[ceil(($cnt + 1) / 2) - 1]) / 2;
    }
}

print "Mean\n";
for my $cluster (sort keys %data) {
    print join("\t", $cluster, map { sprintf "%15.9f", $_ } @{ $avg{$cluster} }), "\n";
}
print "Median\n";
for my $cluster (sort keys %data) {
    print join("\t", $cluster, map { sprintf "%15.9f", $_ } @{ $median{$cluster} }), "\n";
}
Output:
perl test.pl <sample.txt
Mean
A 37.818389312 1.820329834 10.320165477 31.642471301 18.969159754 16.697040778 50.139427875
B 41.525459546 2.590909499 8.636759179 25.390783670 12.254808048 11.536782519 53.646221676
C 117.798331479 61.126076882 48.322855592 148.810002114 9.921641692 66.052080096 7.234401952
Median
A 37.101395935 0.000000000 10.580452375 33.893906705 16.951969540 15.878318275 43.593327935
B 38.397203310 0.000000000 6.122746633 20.264335560 9.985810405 10.705801600 52.115221350
C 9.397916894 37.863572930 22.231670430 25.167454510 2.579360621 14.396682850 0.000000000
The cluster A medians agree with the row expected in the question.
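One detail that matters for the median is the sort order: Perl's sort compares its arguments as strings unless it is given a comparator, so numeric data need an explicit <=> block. The sketch below shows the difference on the sample3 values of cluster A copied from the table. The median sub is a hypothetical helper written for this illustration, not something taken from the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: median of a list of numbers.
sub median {
    my @sorted = sort { $a <=> $b } @_;    # numeric comparison
    my $n = @sorted;
    return unless $n;                      # empty list: no median
    my $mid = int($n / 2);
    # Odd count: the middle element; even count: mean of the two middle ones
    return $n % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# sample3 values for A_1 .. A_8, copied from the question's table
my @sample3 = (10.45969216, 12.5516306, 6.807664053, 21.47872616,
               0, 10.70121259, 20.56239825, 0);

my @lexical = sort @sample3;                  # string order: 6.8... lands after 21.4...
my @numeric = sort { $a <=> $b } @sample3;    # numeric order

print "lexical: @lexical\n";
print "numeric: @numeric\n";
printf "median:  %.9f\n", median(@sample3);
```

With string ordering, 6.807664053 ends up last because "6" sorts after "2", which shifts the two middle elements. With numeric ordering the median is 10.580452375, in line with the 10.58045238 given in the question.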