Perl: parsing a CSV file and iterating over it with curl

Problem description

I am trying to parse a CSV file and iterate over it with curl. Here is my data set:

Act No. 2,Sep/1900/28
Act No. 3,Sep/1900/28
Act No. 10,Oct/1900/28

I followed the Stack Overflow question "CSV into hash" to build a hash from my data set. Here is my code:

#!/usr/bin/perl
use strict;
use warnings;

use Text::CSV_XS;
use IO::File;

use WWW::Curl::Easy;

my $url = "https://elibrary.judiciary.gov.ph/thebookshelf/docmonth/";
#my $filestoprocess = 'list_acts.csv';

# Usage example:
my $hash_ref = csv_file_hashref('toharvest_og_sourcing.csv');

foreach my $key (sort keys %{$hash_ref}) {

    my $urlcomplete = "$url"."@{$hash_ref->{$key}}";

    #start the curl
    my $user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";

    my $curl = WWW::Curl::Easy->new;

    $curl->setopt(CURLOPT_HEADER,1);
    $curl->setopt(CURLOPT_USERAGENT, $user_agent);
    $curl->setopt(CURLOPT_FOLLOWLOCATION, 1);
    #$curl->setopt(CURLOPT_SSL_VERIFYPEER, 1L);
    #$curl->curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 1L);
    $curl->setopt(CURLOPT_SSL_VERIFYPEER, 0);
    $curl->setopt(CURLOPT_URL, $urlcomplete);

    # A filehandle, reference to a scalar or reference to a typeglob can be used here.
    my $response_body;
    $curl->setopt(CURLOPT_WRITEDATA,\$response_body);

    # Starts the actual request
    my $retcode = $curl->perform;

    # Looking at the results...
    if ($retcode == 0) {
        my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
        my $curledurldate = $response_body;
        our ($issuancelink) = $curledurldate =~ /a href='(https.*?)'>.*?<STRONG>$key/s;
        #print "$issuancelink\n";

        if (defined $issuancelink) {

            my $user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";

            #my $curl = WWW::Curl::Easy->new;

            $curl->setopt(CURLOPT_HEADER,1);
            $curl->setopt(CURLOPT_USERAGENT, $user_agent);
            $curl->setopt(CURLOPT_FOLLOWLOCATION, 1);
            #$curl->setopt(CURLOPT_SSL_VERIFYPEER, 1L);
            #$curl->curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 1L);
            $curl->setopt(CURLOPT_SSL_VERIFYPEER, 0);

            $curl->setopt(CURLOPT_URL, $issuancelink);

            # A filehandle, reference to a scalar or reference to a typeglob can be used here.
            my $response_body;
            $curl->setopt(CURLOPT_WRITEDATA,\$response_body);

            # Starts the actual request
            my $retcode = $curl->perform;

            # Looking at the results...
            if ($retcode == 0) {
                # print("Transfer went ok\n");
                my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
                my $curledsource = $response_body;
                our ($ogsourcing) = $curledsource =~ /<br>\s+(\w+.*?)\s+?<CENTER>.*?H2/s;

                my $filename = 'ogsourcingharvested.txt';
                open (FH, '>>', $filename) or die("Could not open file. $!");
                #print "Error processing ".$fh."$_\n";
                print FH $ogsourcing."|"."{$key}\n";
                close (FH);
            }

            else {
                # Error code, type of error, error message
                print("An error happened: $retcode ".$curl->strerror($retcode)." ".$curl->errbuf."\n");

            }
        } else {
            # Error code, type of error, error message
            print("An error happened: $retcode ".$curl->strerror($retcode)." ".$curl->errbuf."\n");
        }

    }
}

# Implementation:
sub csv_file_hashref {
   my ($filename) = @_;

   my $csv_fh = IO::File->new($filename, 'r');
   my $csv = Text::CSV_XS->new ();

   my %output_hash;

   while(my $colref = $csv->getline ($csv_fh))
   {
      $output_hash{shift @{$colref}} = $colref;
   }

   return \%output_hash;
}

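Basically, the code iterates over the second column, appends each value to the end of the URL, and fetches that URL with curl. The content of the fetched page is then searched for a specific pattern: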

our ($issuancelink) = $curledurldate =~ /a href='(https.*?)'>.*?<STRONG>$key/s;

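When the search finds that link, it is stored in a variable ($issuancelink), and the URL in $issuancelink is then fetched with curl as well. The fetched page is searched for a specific piece of text, which is captured and saved to a text file. My code works fine as long as the second column is not repeated (Sep/1900/28 and Oct/1900/28 in this case). When it is repeated, I run into a problem: it seems that only the first iteration is captured. In my case, Act No. 3 has the same URL as Act No. 2 (https://elibrary.judiciary.gov.ph/thebookshelf/docmonth/Sep/1900/28), and the link for Act No. 2 is the one that gets captured. Thanks in advance!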

Tags: csv, perl, parsing, curl

Solution


"My code works fine as long as the second column is not repeated (Sep/1900/28 and Oct/1900/28 in this case)."

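When you store values in a hash, the hash keys are unique. This means that if two entries have the same key name, the later one overwrites the earlier one.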
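A minimal sketch of that overwriting behaviour, using core Perl only. The rows and the name %by_date are made up for illustration, and the sketch assumes the repeated date is the value used as the hash key:

```perl
use strict;
use warnings;

# Hypothetical rows in which the same key (the date) appears twice.
my @rows = (
    [ 'Sep/1900/28', 'Act No. 2' ],
    [ 'Sep/1900/28', 'Act No. 3' ],
    [ 'Oct/1900/28', 'Act No. 10' ],
);

my %by_date;
for my $row (@rows) {
    my ($date, $act) = @$row;
    # A second assignment to an existing key replaces the earlier value.
    $by_date{$date} = $act;
}

# Three rows went in, but only two keys remain.
print scalar(keys %by_date), " keys\n";
print "$_ => $by_date{$_}\n" for sort keys %by_date;
```

Only the last value assigned to each key is kept, so 'Act No. 2' is no longer reachable through the hash.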

This part of your code:

   while(my $colref = $csv->getline ($csv_fh))
   {
      $output_hash{shift @{$colref}} = $colref;
   }

seems to be responsible. What you can do is store the values for each key in an array instead of a scalar (in this case, in an array ref).

I would do something like this:

   while(my $colref = $csv->getline ($csv_fh))
   {
      my ($key, $value) = @$colref;
      push @{$output_hash{$key}}, $value;       # store values in array
   }
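For context, here is a self-contained sketch of that loop. It assumes Text::CSV_XS is installed, reads from an in-memory string instead of a file, and uses made-up sample data with the repeated column placed first so that it becomes the key:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample data; the repeated date is the first column here.
my $data = "Sep/1900/28,Act No. 2\nSep/1900/28,Act No. 3\nOct/1900/28,Act No. 10\n";
open my $csv_fh, '<', \$data or die "Cannot open in-memory file: $!";

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

my %output_hash;
while (my $colref = $csv->getline($csv_fh)) {
    my ($key, $value) = @$colref;
    # push appends to the array for this key, creating it on first use.
    push @{ $output_hash{$key} }, $value;
}
close $csv_fh;

for my $key (sort keys %output_hash) {
    print "$key: ", join(', ', @{ $output_hash{$key} }), "\n";
}
```

Each key now maps to an array ref holding every value seen for it, so repeated keys accumulate instead of replacing one another.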

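An added benefit of doing it this way is that the values themselves are copied. In your code, it is the array ref that is copied. You are saved from trouble by the limited scope of the variable my $colref, but in general, copying the values will keep you out of trouble.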
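The difference between copying a reference and copying a value can be sketched as follows. This is an illustration only: the shared @buffer array and both hash names are hypothetical, chosen to show what can go wrong when a single array is reused across iterations:

```perl
use strict;
use warnings;

my %by_ref;
my %by_value;

my @buffer;    # one array reused for every row (hypothetical setup)
for my $row ( [ 'a', 1 ], [ 'b', 2 ] ) {
    @buffer = @$row;                         # overwrite the shared array in place
    $by_ref{ $buffer[0] }   = \@buffer;      # stores a reference to the shared array
    $by_value{ $buffer[0] } = $buffer[1];    # stores a copy of the scalar value
}

# Both references point at the same array, which now holds the last row.
print "by_ref:   a => $by_ref{a}[1], b => $by_ref{b}[1]\n";
print "by_value: a => $by_value{a}, b => $by_value{b}\n";
```

The copied scalars keep the value they had at the time of the assignment, while the stored references all see whatever the shared array contains at the end.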

To access the array values, you will need an inner loop over the array stored under each hash key. Something like:

for my $key (sort keys %$hash_ref) {
    for my $value (@{ $hash_ref->{$key} }) {
        # do stuff...
    }
}
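A runnable sketch of this nested loop, building one URL per key/value pair without making any network request. The grouped data and the @jobs array are hypothetical; the base URL is the one from the question, and the layout (date as key, act names as values) is an assumption:

```perl
use strict;
use warnings;

my $url = "https://elibrary.judiciary.gov.ph/thebookshelf/docmonth/";

# Hypothetical grouped data: date => [ acts listed under that date ]
my $hash_ref = {
    'Sep/1900/28' => [ 'Act No. 2', 'Act No. 3' ],
    'Oct/1900/28' => [ 'Act No. 10' ],
};

my @jobs;
for my $key (sort keys %$hash_ref) {
    for my $value (@{ $hash_ref->{$key} }) {
        # One entry per (date, act) pair; the real script would fetch the page here.
        push @jobs, sprintf('%s: %s%s', $value, $url, $key);
    }
}

print "$_\n" for @jobs;
```

Every act gets its own iteration, including the two that share the Sep/1900/28 page.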
