How to scrape a page from a tab that requires clicking using scrapy-selenium

Problem Description

So I want to scrape data from this website, specifically from the company details section:

Website to scrape

I got some help from someone to get it working with Python Playwright, but I need to do this with Python scrapy-selenium.

I want to rewrite the code from the answer there the scrapy-selenium way.

Original question

I tried doing it as suggested in this question:

scrapy-selenium

But no luck =/
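
For reference, scrapy-selenium only kicks in if its downloader middleware is enabled. A minimal settings.py sketch follows, assuming Firefox with geckodriver; the driver name, executable path, and headless flag are placeholders, not part of the original question:

# settings.py -- minimal scrapy-selenium wiring (driver choice/path are assumptions)
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'                        # or 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')  # path to the webdriver binary
SELENIUM_DRIVER_ARGUMENTS = ['-headless']               # run the browser headless

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}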

My code:

resources/search_results_searchpage.yml:

products:
    css: 'div[data-content="productItem"]'
    multiple: true
    type: Text
    children:
        link:
            css: a.elements-title-normal 
            type: Link

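To sanity-check this template outside Scrapy, selectorlib's Extractor can be run directly on an HTML string; a minimal sketch, where the HTML snippet is made up to match the selectors above:

# check_extractor.py -- standalone test of the selectorlib template above
from selectorlib import Extractor

extractor = Extractor.from_yaml_file('resources/search_results_searchpage.yml')

# Made-up HTML snippet mirroring what the template expects
html = '''
<div data-content="productItem">
  <a class="elements-title-normal" href="/product/123.html">Headphones</a>
</div>
'''

data = extractor.extract(html, base_url='https://www.alibaba.com/')
print(data)  # expect something like {'products': [{'link': 'https://www.alibaba.com/product/123.html', ...}]}
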
crawler.py:

import scrapy
import csv
from scrapy_selenium import SeleniumRequest
import os
from selectorlib import Extractor
from scrapy import Selector

class Spider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = ['http://alibaba.com/']
    link_extractor = Extractor.from_yaml_file(os.path.join(os.path.dirname(__file__), "../resources/search_results_searchpage.yml"))

    def start_requests(self):
        search_text="Headphones"
        url="https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText={0}&viewtype=G".format(search_text)

        yield SeleniumRequest(url=url, callback = self.parse, meta = {"search_text": search_text})


    def parse(self, response):
        data = self.link_extractor.extract(response.text, base_url=response.url)
        for product in data['products']:
            parsed_url=product["link"]

            yield SeleniumRequest(url=parsed_url, callback=self.crawl_mainpage)
    
    def crawl_mainpage(self, response):
        driver = response.request.meta['driver']
        button = driver.find_element_by_xpath( "//span[@title='Company Profile']")
        button.click()
        driver.quit()

        yield {
            'name': response.xpath("//h1[@class='module-pdp-title']/text()").extract(),
            'Year of Establishment': response.xpath("//td[contains(text(), 'Year Established')]/following-sibling::td/div/div/div/text()").extract()
         }
        

Running the code:

scrapy crawl alibaba_crawler -o out.csv -t csv

The company name is returned correctly. Year of Establishment stays empty, even though it should return the year.

Tags: python, selenium, scrapy, scrapy-selenium

Solution


I wasn't using the selector correctly. This works now:

def crawl_mainpage(self, response):
    # Click the "Company Profile" tab in the Selenium driver, then re-select
    # from driver.page_source -- response.text still holds the pre-click HTML.
    driver = response.request.meta['driver']
    driver.find_element_by_xpath("//span[@title='Company Profile']").click()
    sel = Selector(text=driver.page_source)
    # Do not call driver.quit() here: scrapy-selenium's middleware shares one
    # driver across requests and closes it itself when the spider finishes.

    yield {
        'name': sel.xpath("//h1[@class='module-pdp-title']/text()").extract(),
        'Year of Establishment': sel.xpath("//td[contains(text(), 'Year Established')]/following-sibling::td/div/div/div/text()").extract()
    }
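
One caveat: driver.page_source is grabbed immediately after the click, so if the tab content loads asynchronously the fields can still come back empty. A hedged variant that waits for the "Year Established" row to render before selecting (the 10-second timeout is an assumption):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def crawl_mainpage(self, response):
    driver = response.request.meta['driver']
    driver.find_element_by_xpath("//span[@title='Company Profile']").click()

    # Wait until the profile table has actually rendered after the click
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.XPATH, "//td[contains(text(), 'Year Established')]")
        )
    )

    sel = Selector(text=driver.page_source)
    yield {
        'name': sel.xpath("//h1[@class='module-pdp-title']/text()").extract(),
        'Year of Establishment': sel.xpath("//td[contains(text(), 'Year Established')]/following-sibling::td/div/div/div/text()").extract()
    }

scrapy-selenium's SeleniumRequest also accepts wait_time and wait_until arguments, but those waits run before the callback fires; the wait here has to happen after the click, so it is done inline in the callback.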
