Replace certain child values if they don't contain certain strings, or rewrite the XPath query? (Web scraping)

Problem description

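Preface: this is the first XPath and DOM script I have written.

The code below works to some extent.

If the child->nodeValue that should be the price is empty, it throws off the rest of the elements, and the error snowballs from there. I have spent hours reading and rewriting, but I can't figure out how to fix it.

I'm at the point where I think my XPath query may be the problem, because I don't know how to test that this is the right child value.

What I'm scraping looks like this (it doesn't actually look like this; each product has 148 lines of HTML, but these are the relevant ones):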

<div class="some really long class name">
    <h2 class="second class">
        <a class="a-link-normal s-no-outline" href="TheURLINeed.php">
            <span class="a-size-base-plus a-color-base a-text-normal">
                The Title I Need
            </span>
        </a>
    </h2>
    <span class="a-offscreen">
      $1,000,000
    </span>
</div>

Here is the code I'm using.

    $html =file_get_contents('http://localhost:8888/scraper/source.html');

    $doc = new \DOMDocument();
    $doc->loadHTML($html);
    $xpath = new \DOMXpath($doc);
    $xpath->preserveWhiteSpace = FALSE;

    $nodes= $xpath->query("//a[@class = 'a-link-normal s-no-outline'] | //span[@class = 'a-size-base-plus a-color-base a-text-normal'] | //span[@class = 'a-price']");

    $data =[];
    foreach ($nodes as $node) {
        $url =  $node->getAttribute('href');
        if(trim($url,"\xc2\xa0 \n \t \r") != ''){
            array_push($data,$url);
        }
        foreach ($node->childNodes as $child) {
            if (trim($child->nodeValue, "\xc2\xa0 \n \t \r") != '') {
                array_push($data, $child->nodeValue);
            }
        }
    }
    $chunks = (array_chunk($data, 4));

    foreach($chunks as $chunk) {
        $newarray = [
            'url' => $chunk[0],
            'title' => $chunk[1],
            'todaysprice' => $chunk[2],
            'hiddenprice' => $chunk[3]
            ];

        echo '<p>' . $newarray['url'] . '<br>' . $newarray['title'] . '<br>' .
            $newarray['todaysprice'] . '</p>';
    }

Output:

URL
Title
Price

URL
Title
Price

URL
Title
URL.   <---- "Price was missing so it used the next child node value and now everything from here down is wrong."

Title
Price
URL

I know this code is far from right, but I have to start somewhere.

Tags: php, dom, web-scraping, xpath

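Solution

If I understand you correctly, you are probably looking for something like the code below. For simplicity, I skipped the array-building part and just echo the target data.

So, assuming your HTML looks like this: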

$html = '
<body>
<div class="some really long class name">
    <h2 class="second class">
        <a class="a-link-normal s-no-outline" href="TheURLINeed.php">
            <span class="a-size-base-plus a-color-base a-text-normal">
                The Title I Need
            </span>
        </a>
    </h2>
    <span class="a-offscreen">
      $1,000,000
    </span>
</div>
<div class="some really long class name">
    <h2 class="second class">
        <a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
            <span class="a-size-base-plus a-color-base a-text-normal">
                The other Title I Need
            </span>
        </a>
    </h2>
   
</div>
<div class="some really long class name">
    <h2 class="second class">
        <a class="a-link-normal s-no-outline" href="TheURLINeed3.php">
            <span class="a-size-base-plus a-color-base a-text-normal">
                The Final Title I Need
            </span>
        </a>
    </h2>
    <span class="a-offscreen">
      $2,000,000
    </span>
</div>
</body>
';

Try this:

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXpath($doc);
// Each <h2> anchors one product, so every lookup below is relative to it.
$data = $xpath->query('//h2[@class="second class"]');

foreach ($data as $datum) {
    echo trim($xpath->query('.//a/@href', $datum)[0]->nodeValue), "\r\n";
    echo trim($xpath->query('.//a/span', $datum)[0]->nodeValue), "\r\n";
    // Before the edit: $price = $xpath->query('./following-sibling::span', $datum);
    // Edited: only accept the sibling span that actually holds the price.
    $price = $xpath->query('./following-sibling::span[@class="a-offscreen"]', $datum);
    if ($price->length > 0) {
        echo trim($price[0]->nodeValue), "\r\n";
    } else {
        // A missing price no longer shifts the following values.
        echo "No Price", "\r\n";
    }

    echo "\r\n";
}

Output:

TheURLINeed.php
The Title I Need
$1,000,000

TheURLINeed2.php
The other Title I Need
No Price

TheURLINeed3.php
The Final Title I Need
$2,000,000
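
Since the array-building part was skipped above, here is a rough sketch of how it might look. The helper name scrapeProducts, the $sampleHtml variable, and the array keys are made up for illustration and are not from the question or the answer. It swaps query() for DOMXPath::evaluate() with normalize-space(), which returns an empty string when nothing matches, so a missing price becomes null instead of shifting the next product's values.

```php
<?php
// Shortened sample markup (an assumption, modelled on the HTML above).
$sampleHtml = <<<'HTML'
<body>
<div class="some really long class name">
    <h2 class="second class">
        <a class="a-link-normal s-no-outline" href="TheURLINeed.php">
            <span class="a-size-base-plus a-color-base a-text-normal">The Title I Need</span>
        </a>
    </h2>
    <span class="a-offscreen">$1,000,000</span>
</div>
<div class="some really long class name">
    <h2 class="second class">
        <a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
            <span class="a-size-base-plus a-color-base a-text-normal">The other Title I Need</span>
        </a>
    </h2>
</div>
</body>
HTML;

// Hypothetical helper: returns one associative array per product.
function scrapeProducts(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages often trigger parser warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $products = [];

    foreach ($xpath->query('//h2[@class="second class"]') as $heading) {
        // normalize-space() trims the text and yields '' when nothing matches.
        $price = $xpath->evaluate(
            'normalize-space(./following-sibling::span[@class="a-offscreen"])',
            $heading
        );

        $products[] = [
            'url'   => $xpath->evaluate('normalize-space(.//a/@href)', $heading),
            'title' => $xpath->evaluate('normalize-space(.//a/span)', $heading),
            // Keep the slot, but mark a missing price explicitly.
            'price' => $price !== '' ? $price : null,
        ];
    }

    return $products;
}

print_r(scrapeProducts($sampleHtml));
```

Because every product always yields exactly one record with the same three keys, there is no need for array_chunk, and a product without a price cannot misalign the ones after it.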
