Scraping CNN with Scrapy and Selenium

Problem Description

I want to build a highly automated crawler that can open cnn.com search results pages (which is why I need Selenium), extract some information from each article, and then move on to the next page. So far I've had almost no success.

At the moment my code looks like this (I know it's probably terrible; it's a patchwork of other spiders I found):

import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from cnn.items import CNNitem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class CNNspider(CrawlSpider):
    name = "cnn_spider"
    allowed_domains = ['cnn.com']
    start_urls = ['https://www.cnn.com/search?q=elizabeth%20warren&size=10&page=1']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="cnn-search__results-list"]//h3/a/@href'), callback='parse_post', follow=True),
    ]

    def __init__(self, *a, **kw):
        self.driver = webdriver.Chrome()
        super(CNNspider, self).__init__(*a, **kw)
    def parse_page(self, response):
        # selenium part of the job
        self.driver.get(response.url)
        while True:
            more_btn = WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located((By.XPATH, "//div[@class='pagination-bar']/div[contains(text(), 'Next')]"))
            )

            more_btn.click()

            # stop when we reach the desired page
            if self.driver.current_url.endswith('page=161'):
                break

        # now scrapy should do the job
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//div[@class="cnn-search__results-list"]/div[@class="cnn-search__result cnn-search__result--article"]'):
            item = CNNitem()
            item['Title'] = post.xpath('.//h3[@class="cnn-search__result-headline"]/a/text()').extract()
            item['Link'] = post.xpath('.//h3[@class="cnn-search__result-headline"]/a/@href').extract()

            yield scrapy.Request(item['Link'], meta={'item': item}, callback=self.parse_post)

    def parse_post(self, response):
        item = response.meta['item']
        item["Body"] = response.xpath('//section[@id="body-text"]/div[1]/div/text()').extract()
        return item

Right now all Chrome does is open the first page and then close it almost immediately, without doing anything. Can someone help me put this together?

Tags: python, selenium, scrapy

Solution
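
There are several independent problems here, and together they explain why Chrome opens a window and then closes it almost immediately:

- The Rule's callback is parse_post, and nothing anywhere references parse_page. All of the Selenium code is therefore dead: __init__ opens a browser, no code ever drives it, and it is torn down when the process exits.
- Even if a callback did fire, the rules cannot extract anything, because the search results are rendered by JavaScript, so the raw HTTP response Scrapy downloads contains no result links. restrict_xpaths should also point at elements (the a tags), not at the @href attribute.
- The while True loop clicks "Next" before checking the current URL, and nothing is scraped until the loop exits, so at best only page 161 would ever be parsed.
- item['Link'] is filled with .extract(), which returns a list, but scrapy.Request needs a single URL string.
- self.driver.quit() is never called.

Here is a sketch that keeps your item class, your XPaths, and the 161-page budget, but uses a plain Spider and does all of the browser work in parse. Treat the selectors as assumptions carried over from your code: CNN's search markup and pagination widget may well have changed since it was written.

import scrapy
from scrapy.http import HtmlResponse

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from cnn.items import CNNitem  # your existing item with Title/Link/Body fields


class CNNspider(scrapy.Spider):
    # A plain Spider, not a CrawlSpider: the search results are rendered by
    # JavaScript, so Rule/LinkExtractor has nothing to extract from the raw
    # HTTP response anyway.
    name = "cnn_spider"
    allowed_domains = ['cnn.com']
    start_urls = ['https://www.cnn.com/search?q=elizabeth%20warren&size=10&page=1']
    max_pages = 161  # page budget taken from your original code

    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        self.driver = webdriver.Chrome()  # assumes chromedriver is on PATH

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; quit the browser here.
        self.driver.quit()

    def parse(self, response):
        # Selenium drives the search pagination; Scrapy parses the rendered
        # HTML and fetches the (server-rendered) article pages itself.
        self.driver.get(response.url)
        for page in range(1, self.max_pages + 1):
            WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located(
                    (By.XPATH, '//div[@class="cnn-search__results-list"]')
                )
            )
            # Wrap the rendered DOM so normal Scrapy selectors work on it.
            rendered = HtmlResponse(
                url=self.driver.current_url,
                body=self.driver.page_source,
                encoding='utf-8',
            )
            for post in rendered.xpath(
                '//div[@class="cnn-search__results-list"]'
                '/div[@class="cnn-search__result cnn-search__result--article"]'
            ):
                item = CNNitem()
                # extract_first() returns a single string; extract() returns
                # a list, which scrapy.Request rejects as a URL.
                item['Title'] = post.xpath(
                    './/h3[@class="cnn-search__result-headline"]/a/text()'
                ).extract_first()
                link = post.xpath(
                    './/h3[@class="cnn-search__result-headline"]/a/@href'
                ).extract_first()
                if not link:
                    continue
                item['Link'] = rendered.urljoin(link)
                yield scrapy.Request(
                    item['Link'], meta={'item': item}, callback=self.parse_post
                )

            if page == self.max_pages:
                break
            # Advance to the next results page in the same browser session.
            next_btn = WebDriverWait(self.driver, 10).until(
                EC.element_to_be_clickable(
                    (By.XPATH, "//div[@class='pagination-bar']"
                               "/div[contains(text(), 'Next')]")
                )
            )
            next_btn.click()

    def parse_post(self, response):
        item = response.meta['item']
        item['Body'] = response.xpath(
            '//section[@id="body-text"]/div[1]/div/text()'
        ).extract()
        return item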

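Run it the usual way, e.g. scrapy crawl cnn_spider -o warren.json. Two caveats worth flagging: if the "Next" button disappears on the last page, the wait will time out, so you may want to wrap the click in a try/except and break instead; and if the results list updates in place rather than navigating, waiting for visibility can return before the new results load, in which case waiting for staleness of the old list is more robust. Finally, check the browser's network tab: the search page may populate itself from a JSON endpoint, and if it does, requesting that endpoint directly from Scrapy would remove the need for Selenium entirely.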
