How to Set Up the PHP Spatie Crawler - Output Results

Problem Description

I'm trying to set up Spatie Crawler, a PHP crawler, but I'm having a hard time interpreting the documentation. The code itself looks quite robust, but the docs seem to have some very basic gaps: there's no clear path to "here's how to get a working example without making too many assumptions."

That said, I've been reading through a bunch of other GitHub threads and articles, trying to at least get things "closer" to a working setup.

What I've done

Where I'm stuck

Any insight into what I'm missing would be greatly appreciated.

My code:

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver; // I had to specify this full namespace; without it I kept getting an Exception: Class 'CrawlObserver' not found error
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface; // If I don't set this, I get an error: " Could not check compatibility between myClassExtendingCrawlObserver..."

class myClassExtendingCrawlObserver extends CrawlObserver {
    /**
     * Called when the crawler will crawl the url.
     *
     * @param \Psr\Http\Message\UriInterface $url
     */
    public function willCrawl(UriInterface $url)
    {
    }

    /**
     * Called when the crawler has crawled the given url successfully.
     *
     * @param \Psr\Http\Message\UriInterface $url
     * @param \Psr\Http\Message\ResponseInterface $response
     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
     */
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null
    ){

    }

    /**
     * Called when the crawler had a problem crawling the given url.
     *
     * @param \Psr\Http\Message\UriInterface $url
     * @param \GuzzleHttp\Exception\RequestException $requestException
     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
     */
    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null
    ){
      
    }

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling()
    {
    }
}

if (class_exists('Spatie\\Crawler\\CrawlObservers\\CrawlObserver')) { // I was using this to check what to include
  $myClassExtendingCrawlObserver = new myClassExtendingCrawlObserver();
  $url = 'https://www.example.com';
  try {
    Crawler::create()
      ->setCrawlObserver($myClassExtendingCrawlObserver)
      ->startCrawling($url);
  } catch (\Exception $e) {
    error_log($e->getMessage());
  }
}

Tags: php, web-crawler

Solution


Spatie Crawler walks the links found at each URL and reports the status and other information for every one of them. You can pull out more detail like this:

public function crawled(
    UriInterface      $url,
    ResponseInterface $response,
    ?UriInterface     $foundOnUrl = null
): void
{
    echo 'Crawling URL: ' . urldecode((string) $url) . ' ... ' . PHP_EOL;
    echo 'Crawl result: ' . $response->getStatusCode() . ' - ' . $response->getReasonPhrase() . PHP_EOL;
    if (isset($response->getHeaders()['Server'])) {
        echo 'Server: ' . $response->getHeaders()['Server'][0] . PHP_EOL;
    }

    if (isset($response->getHeaders()['Set-Cookie'])) {
        // You can use loop here
        echo 'Cookies: ' . $response->getHeaders()['Set-Cookie'][0] . PHP_EOL;
    }

    if ($response->getStatusCode() == 301 || $response->getStatusCode() == 302) {
        // Location may be absolute or relative; a relative value resolves against the crawled URL
        echo $response->getHeaders()['Location'][0] . PHP_EOL;
        echo "Redirect: " . rtrim((string) $url, '/') . $response->getHeaders()['Location'][0] . PHP_EOL;
    }
}
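Where the comment above suggests a loop for cookies: a response may carry several `Set-Cookie` headers, and PSR-7's `getHeader()` returns every value for a header name as an array of strings, so a sketch of that loop could look like this:

```php
// Print every Set-Cookie value instead of only the first one.
// $response is the Psr\Http\Message\ResponseInterface passed to crawled().
foreach ($response->getHeader('Set-Cookie') as $cookie) {
    echo 'Cookie: ' . $cookie . PHP_EOL;
}
```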

You can handle failed requests here:

public function crawlFailed(
    UriInterface     $url,
    RequestException $requestException,
    ?UriInterface    $foundOnUrl = null
): void
{
    echo '!!! Crawl Failed !!! : ' . $url . PHP_EOL;
} 
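A minimal sketch tying this together. One detail worth knowing: Guzzle follows redirects by default, so a 301/302 is normally resolved before `crawled()` ever sees it. `Crawler::create()` accepts an array of Guzzle client options, so passing `allow_redirects => false` lets the redirect handling above actually run. The exact fluent methods vary a bit between Crawler versions, so treat this as a sketch, not the canonical setup:

```php
use Spatie\Crawler\Crawler;
use GuzzleHttp\RequestOptions;

Crawler::create([
    // Don't let Guzzle follow redirects itself, so 301/302
    // responses reach the observer's crawled() method.
    RequestOptions::ALLOW_REDIRECTS => false,
    RequestOptions::TIMEOUT         => 30,
])
    ->setCrawlObserver(new myClassExtendingCrawlObserver())
    ->setConcurrency(5) // number of parallel requests
    ->startCrawling('https://www.example.com');
```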
