Trouble running a parser created using Scrapy with Selenium

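Problem description

I've written a scraper in Python using Scrapy in combination with Selenium to scrape some titles from a website. The CSS selectors defined in my scraper work flawlessly. I want my scraper to keep clicking through to the next page and parse the information embedded in each page. It does fine on the first page, but when the Selenium part comes into play, the scraper keeps clicking on the same link over and over again.

As this is my first time working with Selenium alongside Scrapy, I don't know how to move on from here. Any fix will be highly appreciated.

If I try it like this, it works smoothly (there is nothing wrong with the selectors):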

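import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):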
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self,response):
        self.driver.get(response.url)

        while True:
            for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"h1.faqsno-heading"))):
                name = elem.find_element_by_css_selector("div[id^='arrowex']").text
                print(name)

            try:
                self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
                self.wait.until(EC.staleness_of(elem))
            except TimeoutException:
                break

But my intention is to get my script running this way:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self,link):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # It keeps clicking on the same link over and over again

        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()  
        self.wait.until(EC.staleness_of(elem))


    def parse(self,response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
            except TimeoutException:
                break

These are the titles visible on that landing page (to let you know what I'm after):

INDIA INCLUSION FOUNDATION
INDIAN WILDLIFE CONSERVATION TRUST
VATSALYA URBAN AND RURAL DEVELOPMENT TRUST

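I'm not interested in the data from that site as such, so any alternative approach other than what I've tried above is useless to me. My only intention is to get a solution along the lines of my second approach.

Tags: python, python-3.x, selenium, web-scraping, scrapy

Solution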
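
A likely reason the second approach keeps clicking the same link: click_nextpage calls self.driver.get(link) on every call, which loads the first page again, and parse keeps reading the same static response. The toy model below illustrates this. FakeBrowser is a hypothetical stand-in written for this illustration only, not part of Selenium or Scrapy, and the URL is a placeholder.

```python
class FakeBrowser:
    """Hypothetical stand-in for a browser; it only tracks the current page."""

    def __init__(self, last_page):
        self.page = 1
        self.last_page = last_page

    def get(self, url):
        # Loading the URL always lands on the first page of results.
        self.page = 1

    def click_next(self):
        # Returns False when there is no further page to go to.
        if self.page >= self.last_page:
            return False
        self.page += 1
        return True


def reload_every_time(browser, url, max_clicks):
    """Mirrors the second approach: get() runs before every click."""
    seen = []
    for _ in range(max_clicks):
        browser.get(url)
        seen.append(browser.page)
        browser.click_next()
    return seen


def load_once(browser, url):
    """Loads the URL once, then only clicks 'next'."""
    browser.get(url)
    seen = [browser.page]
    while browser.click_next():
        seen.append(browser.page)
    return seen


url = "https://example.invalid/list"  # placeholder URL
print(reload_every_time(FakeBrowser(3), url, 4))  # [1, 1, 1, 1]
print(load_once(FakeBrowser(3), url))             # [1, 2, 3]
```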


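In case you need a pure Selenium solution:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait

driver = webdriver.Chrome()
driver.get("https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx")

while True:
    # Print every title on the current page
    for item in wait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div[id^='arrowex']"))):
        print(item.text)
    try:
        # Click "Next" unless the button is disabled (last page)
        driver.find_element(By.XPATH, "//input[@text='Next' and not(contains(@class, 'disabledImageButton'))]").click()
    except NoSuchElementException:
        break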
