How to add a proxy to a Scrapy and Selenium script

Problem description

I want to add a proxy to my script.

How do I do that? Do I have to set it up through Selenium or through Scrapy?

I think Scrapy is the one making the initial request, so it makes sense to do it in Scrapy. But how exactly do I do that?

Can you recommend any reliable proxy lists?

This is my current script:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver

import re
import csv
from time import sleep

class PostsSpider(Spider):
    name = 'posts'
    allowed_domains = ['xyz']
    start_urls = ('xyz',)

    def parse(self, response):
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
            for url in start_urls:
                # Use a raw string so the backslashes in the Windows path are not treated as escapes
                self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
                self.driver.get(url)

                try:
                    self.driver.find_element_by_id('currentTab').click()
                    sleep(3)
                    self.logger.info('Sleeping for 3 sec.')
                    self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
                    sleep(7)
                    self.logger.info('Sleeping for 7 sec.')
                except NoSuchElementException:
                    self.logger.info('Blog does not exist anymore')

                while True:                 
                    try:
                        element = self.driver.find_element_by_id('last_item')
                        self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                        sleep(3)
                        self.driver.find_element_by_id('last_item').click()
                        sleep(7)


                    except NoSuchElementException:
                        self.logger.info('No more tips')
                        sel = Selector(text=self.driver.page_source)
                        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')

                        for post in allposts:
                            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()

                            yield {'Username': username,
                                'Publish date': publish_date}
                        self.driver.close()
                        break

Tags: selenium, web-scraping, scrapy

Solution


You should read up on Scrapy's built-in HttpProxyMiddleware; its documentation explains how a proxy is attached to requests and is the best place to start.
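For the Scrapy side, the usual approach is to set the proxy per request through request meta; HttpProxyMiddleware is enabled by default and picks the value up from there. Below is a minimal sketch for a spider like yours, assuming a hypothetical proxy address http://1.2.3.4:8080 that you would replace with your own:

import scrapy
from scrapy import Spider
from scrapy.http import Request

class PostsSpider(Spider):
    name = 'posts'

    # Hypothetical proxy address -- replace with a proxy you actually control.
    proxy = 'http://1.2.3.4:8080'

    def start_requests(self):
        with open("urls.txt", "rt") as f:
            for url in (line.strip() for line in f):
                # HttpProxyMiddleware reads the proxy from request meta.
                yield Request(url, callback=self.parse, meta={'proxy': self.proxy})

    def parse(self, response):
        self.logger.info('Fetched %s through the proxy', response.url)

If the proxy requires credentials, they can be placed in the URL itself, e.g. http://user:pass@host:port.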


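Keep in mind, though, that in your script the pages are actually fetched by Selenium, and Scrapy's middleware has no effect on the Chrome instance. Chrome has to be told about the proxy itself, for example via the --proxy-server switch passed through ChromeOptions. A sketch under the same assumption of a hypothetical proxy address; depending on your Selenium version the keyword may be chrome_options= instead of options=:

from selenium import webdriver

# Hypothetical proxy address -- replace with your own.
PROXY = 'http://1.2.3.4:8080'

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=%s' % PROXY)

# Same driver path as in the original script.
driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe', options=options)
driver.get('https://example.com')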