html - How to extract exact information from multiple tags in Perl
Problem description
I want to extract URL information from ask.com.
This is the tag:
<p class="PartialSearchResults-item-url">maps.google.com </p>
This is the code I tried, but it extracts junk along with the URLs.
$p = HTML::TokeParser->new(\$rrs);
while ($p->get_tag("p")) {
my @link = $p->get_trimmed_text("/p");
foreach(@link) { print "$_\n"; }
open(OUT, ">>askurls.txt"); print OUT "@link\n"; close(OUT);
}
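I only want the domain URLs, for example maps.google.com,
but it is also extracting Source, Images, and the text of every other p tag, filling askurls.txt with irrelevant junk.
Addendum:
askurls.txt gets filled with this information: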
Videos
Change Settings
OK
Sites Google
Sites Google.com Br
Google
Cookie Policy
assistant.google.com
Meet your Google Assistant. Ask it questions. Tell it to do things. It's your own personal Google, always ready to help whenever you need it.
www.google.com/drive
Safely store and share your photos, videos, files and more in the cloud. Your first 15 GB of storage are free with a Google account.
translate.google.com
Google's free service instantly translates words, phrases, and web pages between English and over 100 other languages.
duo.google.com
Solution
You can use a simple regular expression to pull out just the content you want:
use strict;
use warnings;
my $text = <<'HTML'; # we are creating example data using a heredoc
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML
while ($text =~ m/class="PartialSearchResults-item-url">(.*?)<\/p>/g) { # loop over every match of the regex in $text
print $1."\n";
}
If you are not sure whether there is whitespace around the domain inside the tag (as here: <p class="PartialSearchResults-item-url">maps.google.com </p>), you can use \s* like this:
m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g # here we are checking if there is space before and after the url
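Put together, the whitespace-tolerant pattern can collect every domain in one step. This is a minimal sketch that reuses the example data from above; the array name @domains is an assumption made for illustration, and it relies on m//g returning all captures when evaluated in list context.

```perl
use strict;
use warnings;

# Same example data as in the answer above.
my $text = <<'HTML';
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

# In list context, m//g returns the captured group from every match,
# so leading and trailing whitespace is already stripped by \s*.
my @domains = $text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g;

print "$_\n" for @domains;
```

With the sample data this prints maps.google.com and example.com, one per line, without the surrounding spaces.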
If you want to check whether the domain is valid, you can use is_domain() from the Data::Validate::Domain module:
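# previous script
use Data::Validate::Domain qw(is_domain);
# \s* keeps surrounding whitespace out of $1, which would otherwise make is_domain() reject it
while ($text =~ m/class="PartialSearchResults-item-url">\s*(.*?)\s*<\/p>/g) {
if (is_domain($1)) {
print $1."\n";
}
}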
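If you would rather stay with HTML::TokeParser, as in the question, the junk can probably be avoided by checking the class attribute of each p tag before reading its text. The sketch below is one possible way to do that, not the approach from the answer above; the sample HTML (including the class "other") and the variable names are assumptions made for illustration. It uses the fact that get_tag returns an array reference whose second element is a hash of the tag's attributes.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup: one unrelated <p> and two result URLs.
my $html = <<'HTML';
<p class="other">Cookie Policy</p>
<p class="PartialSearchResults-item-url"> maps.google.com </p>
<p class="PartialSearchResults-item-url">example.com</p>
HTML

my $parser = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";

my @urls;
while (my $tag = $parser->get_tag('p')) {
    # $tag is [ $tagname, \%attributes, \@attribute_order, $original_text ]
    my $class = $tag->[1]{class} // '';
    next unless $class eq 'PartialSearchResults-item-url';

    # Only now read the text up to the closing </p>; it is already trimmed.
    push @urls, $parser->get_trimmed_text('/p');
}

print "$_\n" for @urls;
```

Because paragraphs with any other class are skipped before their text is read, entries such as "Cookie Policy" never reach the output. The same @urls list could then be appended to askurls.txt.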