How to rotate proxies and user agents

Problem Description

I'm writing a Scrapy program that logs in to http://www.starcitygames.com/buylist/ and scrapes data on different cards. From that URL I only scrape the category ID values; I then use each ID to build a different URL, scrape the JSON page it returns, and do this for all 207 card categories. It looks a bit more legitimate than hitting the URLs with the JSON data directly. Anyway, I have previously written Scrapy programs with multiple URLs and was able to set them up to rotate proxies and user agents, but how would I do that here? Since there is technically only one start URL, is there a way to make it switch to a different proxy and user agent after scraping five or so JSON pages? I don't want the rotation to be random: I want the same JSON page to be scraped with the same proxy and user agent every time. I hope all of that makes sense.

# Import the needed libraries and project files
import scrapy
import json
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import DataItem

# Spider class
class LoginSpider(scrapy.Spider):
    # Name of spider
    name = "LoginSpider"

    # URL where the data is located
    start_urls = ["http://www.starcitygames.com/buylist/"]

    # Login function
    def parse(self, response):
        # Log in using email and password, then proceed to the after_login function
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'example@email.com', 'ex_usr_pass': 'password'},
            callback=self.after_login,
        )


    # Function to parse the buylist website
    def after_login(self, response):
        # Loop through the page, get the ID number for each card category,
        # plug it into the URL below, then go to the parse_data function
        for category_id in response.xpath('//select[@id="bl-category-options"]/option/@value').getall():
            yield scrapy.Request(
                    url="http://www.starcitygames.com/buylist/search?search-type=category&id={category_id}".format(category_id=category_id),
                    callback=self.parse_data,
                    )
    # Function to parse the JSON data
    def parse_data(self, response):
        # Parse the JSON body of the response
        jsonresponse = json.loads(response.text)

        # Loop over the result groups where the card data is located
        for result in jsonresponse['results']:
            # Run through every card in the group until all data is scraped
            for card in result:
                # Build a fresh DataItem (from items.py) for each card so that
                # items already yielded are not mutated afterwards
                items = DataItem()
                # Scrape category name
                items['Category'] = jsonresponse['search']
                # Scrape the rest of the needed data
                items['Card_Name'] = card['name']
                items['Condition'] = card['condition']
                items['Rarity'] = card['rarity']
                items['Foil'] = card['foil']
                items['Language'] = card['language']
                items['Buy_Price'] = card['price']
                # Yield the populated item
                yield items

Tags: python, scrapy, scrapy-splash

Solution


I would recommend the Scrapy-UserAgents package for this:

pip install scrapy-useragents

Then, in your settings.py file, disable Scrapy's built-in UserAgentMiddleware and enable the rotating one:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}

An example list of user agents to rotate (longer lists are easy to find online):

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]

Note that this middleware cannot handle the case where COOKIES_ENABLED is True and the website ties cookies to the user agent; that may lead to unpredictable results for the spider.
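One gap remains relative to the question: scrapy-useragents rotates user agents, but it does not cover proxies, and it does not guarantee that the same page is always fetched with the same identity. As a minimal sketch of that deterministic behavior, assuming the USER_AGENTS list above plus a hypothetical PROXIES list in settings.py, and assuming the category ids in the search URLs are numeric, a small custom downloader middleware could derive the proxy/user-agent pair from the id itself:

# middlewares.py -- a sketch, not part of scrapy-useragents.
# Assumes USER_AGENTS (as above) and a hypothetical PROXIES list,
# e.g. ['http://proxy1:8080', 'http://proxy2:8080'], in settings.py.
from urllib.parse import parse_qs, urlparse

class StickyIdentityMiddleware:
    def __init__(self, user_agents, proxies):
        self.user_agents = user_agents
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            crawler.settings.getlist('USER_AGENTS'),
            crawler.settings.getlist('PROXIES'),
        )

    def process_request(self, request, spider):
        query = parse_qs(urlparse(request.url).query)
        # Only touch the buylist search requests, which carry an id
        # parameter; everything else keeps the default identity.
        if 'id' not in query or not query['id'][0].isdigit():
            return None
        # Same id -> same bucket -> same identity on every run,
        # and the identity changes once per five consecutive ids.
        bucket = int(query['id'][0]) // 5
        request.headers['User-Agent'] = self.user_agents[bucket % len(self.user_agents)]
        if self.proxies:
            # meta['proxy'] is honored by Scrapy's built-in HttpProxyMiddleware
            request.meta['proxy'] = self.proxies[bucket % len(self.proxies)]
        return None

Register it in DOWNLOADER_MIDDLEWARES (for example 'yourproject.middlewares.StickyIdentityMiddleware': 550, where yourproject is a placeholder for your project name) in place of the random-rotation entry. Because the bucket is computed from the id rather than drawn at random, rerunning the spider fetches each JSON page through the same proxy with the same user agent.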

