Perl: counting the frequency of keys in a hash

Problem description

I extracted the first-level keys from a multidimensional hash, like this:

my @string = keys %hash;

print "@string\n";

Bacteroides fragilis (strain YCH46).Agrocybe aegerita (Black poplar mushroom) (Agaricus 
aegerita).Parabacteroides distasonis (strain ATCC 8503 / DSM 20701 / CIP 104284 / JCM 5825 / NCTC 
11152).Pelodictyon phaeoclathratiforme (strain DSM 5477 / BU-1).Clostridium kluyveri (strain NBRC 
12016).Torpedo marmorata (Marbled electric ray).Aethionema grandiflorum (Persian stone-cress).Conus 
consors (Singed cone).Saguinus labiatus (Red-chested mustached tamarin).Staphylococcus haemolyticus 
(strain JCSC1435).Aeromonas salmonicida (strain A449).Acinetobacter genomosp. 13.Staphylococcus 
aureus (strain USA300 / TCH1516).Loxosceles variegata (Recluse spider). and so on...

I am trying to count how many times the same organism is repeated (I am sure some of them are repeated many times).

I tried this code:

my %count;

foreach my $os (@string) {
    $count{$os}++;
}

foreach my $os (sort keys %count) {
    print $os, " ", $count{$os}, "\n";
}

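But the output shows every organism occurring only once, although I know that is not the case.

Strangely, when I manually define a test list with some organisms repeated, the code works.

What is happening with my hash keys?

I can access them individually in the list, so in principle they are well defined...

Any help?

Edit:

Data::Dumper structure when the organisms are the values: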

'ACYP_SYNJB' => {
                        '94' => 'Synechococcus sp. (strain JA-2-3B\'a(2-13)) 
(Cyanobacteria bacterium Yellowstone B-Prime).'
                      },
      'ACTM_STRPU' => {
                        '374' => 'Strongylocentrotus purpuratus (Purple sea 
urchin).'
                      },
      'A2ML1_HUMAN' => {
                         '1454' => 'Homo sapiens (Human).'
                       },
      'ACTP_SALDC' => {
                        '549' => 'Salmonella dublin (strain CT_02021853).'
                      },
      'ACBG2_XENLA' => {
                         '739' => 'Xenopus laevis (African clawed frog).'
                       },
      'ACO1_AJECA' => {
                        '476' => 'Ajellomyces capsulatus (Darling\'s disease 
fungus) (Histoplasma capsulatum).'
                      },
      'ACTM_PISOC' => {
                        '376' => 'Pisaster ochraceus (Ochre sea star) 
(Asterias ochracea).'
                      },
      '3MGH_RHOPB' => {
                        '200' => 'Rhodopseudomonas palustris (strain 
BisB18).'
                      }
    };

And when the organisms are the keys:

$VAR3585 = 'Geobacter sulfurreducens (strain ATCC 51573 / DSM 12127 / PCA).';
$VAR3586 = {
         'ACPS_GEOSL' => 126,
         'ACP_GEOSL' => 77,
         'ACKA_GEOSL' => 421,
         'ACYP_GEOSL' => 91,
         'ACCA_GEOSL' => 319
       };
$VAR3587 = 'Bactrocera dorsalis (Oriental fruit fly) (Dacus dorsalis).';
$VAR3588 = {
         'ACT3_BACDO' => 376,
         'ACT5_BACDO' => 376,
         'ACT1_BACDO' => 376,
         'ACT2_BACDO' => 376
       };
$VAR3589 = 'Caenorhabditis elegans.';
$VAR3590 = {
         'ACH5_CAEEL' => 511,
         '6PGD_CAEEL' => 484,
         'ACM2_CAEEL' => 627,
         'ACADM_CAEEL' => 417,
         'ADAL_CAEEL' => 388,
         'ACON_CAEEL' => 777,
         'ACBP3_CAEEL' => 116,
         '2AB1_CAEEL' => 495,
         '3HIDH_CAEEL' => 299,
         'ACH1_CAEEL' => 498,
         '6PGL_CAEEL' => 269,
         '2A51_CAEEL' => 542,
         '2AAA_CAEEL' => 590,
         'A16L2_CAEEL' => 534,
         'ACH4_CAEEL' => 548,
         'ACC2_CAEEL' => 445,
         'ADA17_CAEEL' => 686,
         'ACR5_CAEEL' => 598,
         'ACTL1_CAEEL' => 360,
         'ADBP1_CAEEL' => 217,
         'ACH8_CAEEL' => 474,
         '5NT3_CAEEL' => 376,
         'ACT2_CAEEL' => 376,
         'AAR2_CAEEL' => 357,
         'ACH23_CAEEL' => 545,
         'ACD11_CAEEL' => 617,
         'ABF2_CAEEL' => 85,
         'ABDH3_CAEEL' => 375,
         'ABF1_CAEEL' => 85,
         'ABH51_CAEEL' => 355,
         'ACX15_CAEEL' => 659,
         'ACC1_CAEEL' => 466,
         'ABL1_CAEEL' => 1224,
         'ACC3_CAEEL' => 517,
         'ABH52_CAEEL' => 444,
         'ACT4_CAEEL' => 376,
         'ACH2_CAEEL' => 493,
         'ACBP1_CAEEL' => 86,
         '14332_CAEEL' => 248,
         'ACR7_CAEEL' => 538,
         'ACC4_CAEEL' => 408,
         'ACE1_CAEEL' => 620,
         'AATC_CAEEL' => 408,
         'ACH6_CAEEL' => 502,
         'ACH3_CAEEL' => 564,
         'ACR3_CAEEL' => 487,
         'ACMSD_CAEEL' => 401,
         'ACH7_CAEEL' => 507,
         'ACR2_CAEEL' => 575,
         'ACASE_CAEEL' => 272,
         'ACM3_CAEEL' => 611,
         'AAPK2_CAEEL' => 626,
         'ACN1_CAEEL' => 906,
         '3HAO_CAEEL' => 281,
         'ADAS_CAEEL' => 597,
         'ACT1_CAEEL' => 376,
         'A4_CAEEL' => 686,
         'ADA10_CAEEL' => 922,
         'A16L1_CAEEL' => 578,
         'ACT3_CAEEL' => 376,
         'ACP1_CAEEL' => 426,
         'ACM1_CAEEL' => 713,
         'AAPK1_CAEEL' => 589,
         'ACOC_CAEEL' => 887,
         'ACLY_CAEEL' => 1106,
         '14331_CAEEL' => 248
       };
$VAR3591 = 'Anopheles stephensi (Indo-Pakistan malaria mosquito).';
$VAR3592 = {
         'ACES_ANOST' => 664
       };
$VAR3593 = 'Bacillus thuringiensis subsp. konkukian (strain 97-27).';
$VAR3594 = {
         'ACKA_BACHK' => 397,
         'ACCD_BACHK' => 289,
         'ACPS_BACHK' => 119,
         '3MGH_BACHK' => 205,
         'ACCA_BACHK' => 324,
         'ACP_BACHK' => 77
       };

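More precisely, I want to know which organisms have more than 50 protein IDs in my hash, then select those and discard the other organisms with fewer proteins.

Tags: string, perl, frequency

Solution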


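You wrote: "More precisely, I want to know which organisms have more than 50 protein IDs in my hash, then select those and discard the other organisms with fewer proteins."

I am not sure I fully understand your question, but it looks like you have a hash of the following type: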

my %hash = (
    'protein_id#1' => {
         'some-number' => 'organism-name'
    },
    'protein_id#2' => {
         'some-number' => 'same-or-other-organism-name',
    },
    ...
);

and you want to count how many protein_id#X entries there are for each distinct organism name.

In that case, the following should work:

 my %count;
 # "outer" hash has protein_id as key
 while (my ($protein, $h2) = each %hash) {
     # "inner" hash has organism-name as value
     # the same organism could appear several times inside the same inner hash
     # but should only be counted once per protein_id
     my %organism;
     while (my ($some_number, $o) = each %$h2) {
         $organism{$o}++;
     }
     for (keys %organism) {
         $count{$_}++;
     }
 }
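
To go one step further and keep only the organisms with more than 50 protein IDs, the counts can be filtered with grep. The following is a minimal sketch, assuming the protein-keyed layout shown above. The sample data, the $min_proteins threshold and the @selected list are illustrative names, not part of the original code, and the threshold is lowered to 1 so that the tiny sample still selects something.

```perl
use strict;
use warnings;

# Hypothetical sample data in the protein-keyed layout:
# protein_id => { some_number => organism_name }
my %hash = (
    'ACT1_BACDO'  => { '376'  => 'Bactrocera dorsalis (Oriental fruit fly) (Dacus dorsalis).' },
    'ACT2_BACDO'  => { '376'  => 'Bactrocera dorsalis (Oriental fruit fly) (Dacus dorsalis).' },
    'A2ML1_HUMAN' => { '1454' => 'Homo sapiens (Human).' },
);

# Assumption: use 50 for the real data; 1 keeps this small demo non-empty.
my $min_proteins = 1;

# Count each organism once per protein ID.
my %count;
for my $protein (keys %hash) {
    my %seen;
    $seen{$_} = 1 for values %{ $hash{$protein} };
    $count{$_}++ for keys %seen;
}

# Keep only the organisms with more than $min_proteins protein IDs.
my @selected = grep { $count{$_} > $min_proteins } sort keys %count;

print "$_ $count{$_}\n" for @selected;
```

With the sample above only Bactrocera dorsalis is selected, since it is the only organism with more than one protein ID.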

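
If the hash is instead keyed by organism, as in the second dump in the question, each organism can appear only once among the keys, because hash keys are unique. That would explain why counting the keys always gives 1. In that layout the number of protein IDs per organism is simply the number of keys in the inner hash. Again a minimal sketch under that assumption; %by_organism, $min_proteins and the sample data are illustrative names, with the threshold set to 3 instead of 50 so that the small sample keeps one organism.

```perl
use strict;
use warnings;

# Hypothetical sample in the organism-keyed layout:
# organism_name => { protein_id => some_number }
my %by_organism = (
    'Anopheles stephensi (Indo-Pakistan malaria mosquito).' => {
        'ACES_ANOST' => 664,
    },
    'Bactrocera dorsalis (Oriental fruit fly) (Dacus dorsalis).' => {
        'ACT1_BACDO' => 376,
        'ACT2_BACDO' => 376,
        'ACT3_BACDO' => 376,
        'ACT5_BACDO' => 376,
    },
);

# Assumption: use 50 for the real data; 3 suits this small sample.
my $min_proteins = 3;

# Delete every organism that does not have more than $min_proteins protein IDs.
# keys() returns the full list up front, so deleting inside the loop is safe.
for my $organism (keys %by_organism) {
    my $n = scalar keys %{ $by_organism{$organism} };
    delete $by_organism{$organism} unless $n > $min_proteins;
}

# Print the remaining organisms with their protein ID counts.
for my $organism (sort keys %by_organism) {
    printf "%s %d\n", $organism, scalar keys %{ $by_organism{$organism} };
}
```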