php - Replace certain child values if they don't contain certain strings? Or rewrite the XPath query? Web scraping
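Problem description
Preface: this is the first XPath and DOM script I have written.
The code below works, up to a point.
If the child's nodeValue that should be the price is empty, it throws off all the remaining elements, and the error snowballs from there. I have spent hours reading and rewriting but can't come up with a way to fix it.
I've reached the point where I think my XPath query may be the problem, because I don't know how to test that a given child value is the right one.
What I'm scraping looks like this (it doesn't really look like this, since each product has 148 lines of HTML, but these are the relevant ones):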
<div class="some really long class name">
    <h2 class="second class">
        <a class="a-link-normal s-no-outline" href="TheURLINeed.php">
            <span class="a-size-base-plus a-color-base a-text-normal">
                The Title I Need
            </span>
        </a>
    </h2>
    <span class="a-offscreen">
        $1,000,000
    </span>
</div>
Here is the code I'm using.
$html = file_get_contents('http://localhost:8888/scraper/source.html');
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);
$xpath->preserveWhiteSpace = FALSE;
$nodes = $xpath->query("//a[@class = 'a-link-normal s-no-outline'] | //span[@class = 'a-size-base-plus a-color-base a-text-normal'] | //span[@class = 'a-price']");
$data = [];
foreach ($nodes as $node) {
    $url = $node->getAttribute('href');
    if (trim($url, "\xc2\xa0 \n \t \r") != '') {
        array_push($data, $url);
    }
    foreach ($node->childNodes as $child) {
        if (trim($child->nodeValue, "\xc2\xa0 \n \t \r") != '') {
            array_push($data, $child->nodeValue);
        }
    }
}
$chunks = (array_chunk($data, 4));
foreach ($chunks as $chunk) {
    $newarray = [
        'url' => $chunk[0],
        'title' => $chunk[1],
        'todaysprice' => $chunk[2],
        'hiddenprice' => $chunk[3]
    ];
    echo '<p>' . $newarray['url'] . '<br>' . $newarray['title'] . '<br>' .
        $newarray['todaysprice'] . '</p>';
}
Output:
URL
Title
Price
URL
Title
Price
URL
Title
URL. <---- "Price was missing so it used the next child node value and now everything from here down is wrong."
Title
Price
URL
I know this code is a long way from right, but I have to start somewhere.
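The drift described above can be reproduced without any HTML. The following is a minimal sketch, assuming three values per product (URL, title, price) rather than the four used in the question; the labels in `$data` are hypothetical placeholders, not real scraped data. It shows that once one product contributes fewer values, fixed-size chunking misaligns every record after it.

```php
<?php
// Hypothetical flat list, as a node-by-node loop would build it when the
// third product has no price. These labels are placeholders only.
$data = [
    'url1', 'title1', 'price1',
    'url2', 'title2', 'price2',
    'url3', 'title3',            // price missing here
    'url4', 'title4', 'price4',
];

// Fixed-size chunking assumes every product contributes exactly 3 values.
$chunks = array_chunk($data, 3);

foreach ($chunks as $chunk) {
    echo implode(' | ', $chunk), "\n";
}
// The third chunk comes out as "url3 | title3 | url4":
// the next product's URL has slipped into the price slot.
```

Because the flat list carries no record boundaries, the missing value cannot be detected after the fact; the grouping has to happen per product while the DOM is still being queried.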
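Solution
If I understand you correctly, you are probably looking for something like the code below. For simplicity I skipped the array-building part and just echo the target data.
So, assuming your HTML looks like this: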
$html = '
<body>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The other Title I Need
</span>
</a>
</h2>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed3.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Final Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$2,000,000
</span>
</div>
</body>
';
Try this:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Use each product's <h2> as the anchor and run the other queries relative to it.
$data = $xpath->query('//h2[@class="second class"]');
foreach ($data as $datum) {
    echo trim($xpath->query('.//a/@href', $datum)[0]->nodeValue), "\r\n";
    echo trim($xpath->query('.//a/span', $datum)[0]->nodeValue), "\r\n";
    // Edited: restrict the sibling match to the price span
    // (the first version used './following-sibling::span').
    $price = $xpath->query('./following-sibling::span[@class="a-offscreen"]', $datum);
    if ($price->length > 0) {
        echo trim($price[0]->nodeValue), "\r\n";
    } else {
        // No price span next to this <h2>: print a placeholder instead of shifting values.
        echo "No Price", "\r\n";
    }
    echo "\r\n";
}
Output:
TheURLINeed.php
The Title I Need
$1,000,000
TheURLINeed2.php
The other Title I Need
No Price
TheURLINeed3.php
The Final Title I Need
$2,000,000
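The array-building part that was skipped could look roughly like the sketch below. It is only one possible arrangement: the shortened `$html`, the `$products` variable, and the choice of `null` for a missing price are assumptions made for illustration, not part of the original code. Each product is assembled inside the loop, so a missing price leaves a `null` in that record instead of shifting later values.

```php
<?php
// Shortened stand-in for the markup above; the second product has no price.
$html = '
<body>
<div class="some really long class name">
<h2 class="second class"><a class="a-link-normal s-no-outline" href="TheURLINeed.php"><span>The Title I Need</span></a></h2>
<span class="a-offscreen">$1,000,000</span>
</div>
<div class="some really long class name">
<h2 class="second class"><a class="a-link-normal s-no-outline" href="TheURLINeed2.php"><span>The other Title I Need</span></a></h2>
</div>
</body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$products = []; // hypothetical name: one associative array per product
foreach ($xpath->query('//h2[@class="second class"]') as $h2) {
    // item(0) returns null when the query matched nothing.
    $href  = $xpath->query('.//a/@href', $h2)->item(0);
    $title = $xpath->query('.//a/span', $h2)->item(0);
    $price = $xpath->query('./following-sibling::span[@class="a-offscreen"]', $h2)->item(0);

    $products[] = [
        'url'   => $href  ? trim($href->nodeValue)  : null,
        'title' => $title ? trim($title->nodeValue) : null,
        'price' => $price ? trim($price->nodeValue) : null, // null marks "no price"
    ];
}

print_r($products);
```

Since every record is built from queries scoped to a single `<h2>`, there is no need for `array_chunk` afterwards.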