How to extract exact information from multiple tags in Perl

Question

I want to extract URL information from ask.com.

This is the tag:

<p class="PartialSearchResults-item-url">maps.google.com </p>

This is the code I tried, but it extracts junk along with the URLs.

use HTML::TokeParser;

# $rrs holds the HTML of the ask.com results page
$p = HTML::TokeParser->new(\$rrs);

while ($p->get_tag("p")) {
    my @link = $p->get_trimmed_text("/p");
    foreach (@link) { print "$_\n"; }
    open(OUT, ">>askurls.txt"); print OUT "@link\n"; close(OUT);
}

I only want the domain URLs, for example maps.google.com.

But it also extracts Source, Images, and the text of all the other p tags, filling askurls.txt with irrelevant junk.

Addition:

askurls.txt is filled with this information:
Videos
Change Settings
OK
Sites Google
Sites Google.com Br
Google
Cookie Policy
assistant.google.com
Meet your Google Assistant. Ask it questions. Tell it to do things. It's your own personal Google, always ready to help whenever you need it.
www.google.com/drive
Safely store and share your photos, videos, files and more in the cloud. Your first 15 GB of storage are free with a Google account.
translate.google.com
Google's free service instantly translates words, phrases, and web pages between English and over 100 other languages.
duo.google.com

Tags: html, perl

Answer

You can use a simple regular expression to extract what you want:

use strict;
use warnings;

my $text = <<'HTML'; # we are creating example data using a heredoc
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

while ($text =~ m/class="PartialSearchResults-item-url">(.*?)<\/p>/g) { # loop over every match of the regex
  print $1."\n";
}

If the domain may be surrounded by whitespace inside the tag

(as in <p class="PartialSearchResults-item-url">maps.google.com </p>),

you can use \s* like this:

m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g # \s* skips any whitespace before and after the URL
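
The whitespace-tolerant pattern can be dropped into a complete script. The following is only a minimal sketch: the helper name extract_urls and the sample data (including the class "other") are made up for illustration, and it assumes the page markup matches the tag shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <p> whose class is
# PartialSearchResults-item-url, without surrounding whitespace.
sub extract_urls {
    my ($html) = @_;
    my @urls;
    # /g walks through all matches, /s lets . cross line breaks
    while ($html =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/gs) {
        push @urls, $1;
    }
    return @urls;
}

# Sample data for illustration only
my $text = <<'HTML';
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="other">Cookie Policy</p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

my @urls = extract_urls($text);
print "$_\n" for @urls;
```

Collecting the matches in an array first means the output file only needs to be opened once, rather than once per match as in the question's loop.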

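If you want to check that the domain is valid, you can use is_domain() from the Data::Validate::Domain module. Entries that are not bare host names, such as www.google.com/drive, will not pass this check.

# continuing the previous script
use Data::Validate::Domain qw(is_domain);

# \s* strips surrounding whitespace, which is_domain() would reject
while ($text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g) {
   my $domain = $1;
   if (is_domain($domain)) {
      print "$domain\n";
   }
}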
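
As an alternative to a regular expression, the HTML::TokeParser loop from the question can be kept and restricted to the wanted class. This is a sketch rather than a definitive fix: the helper name urls_from_html and the sample data are assumptions, and it relies on get_tag() returning an array reference whose second element is the attribute hash of the start tag.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the text of <p> tags carrying the wanted class.
sub urls_from_html {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @urls;
    while (my $tag = $p->get_tag('p')) {
        # For a start tag, get_tag returns [$tagname, \%attr, \@attrseq, $text]
        my $class = $tag->[1]{class} // '';
        next unless $class eq 'PartialSearchResults-item-url';
        # Text up to the closing </p>, with surrounding whitespace trimmed
        push @urls, $p->get_trimmed_text('/p');
    }
    return @urls;
}

# Sample data for illustration only
my $html = <<'HTML';
<p class="PartialSearchResults-item-url">maps.google.com </p>
<p class="other">Cookie Policy</p>
HTML

print "$_\n" for urls_from_html($html);
```

Paragraphs with any other class are skipped before their text is read, so only the URL lines reach the output.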
