When setting Tor to get rotating IPs, the page login breaks even though the IP has not changed yet

Problem description

Introduction

First of all, I did try the solutions from this topic, but none of them worked in my case. The IP was rotating, but I kept getting an "empty socket content" message; the website was still scraped, just not the data I wanted, since some information can only be scraped while logged in. So I set up the torrc file with MaxCircuitDirtiness 20 to get a rotating IP. The IP does rotate, and this time there is no SOCKS problem, but I quickly end up logged out, and then I no longer get the information I am interested in.
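As a side note, the rotation can be verified outside Scrapy too. Here is a minimal sketch, assuming the requests package is installed and Polipo is listening on 127.0.0.1:8123 as configured further down; it polls an IP-echo service through the proxy and prints the exit IP every 20 seconds:

import time
import requests

# Route traffic through Polipo, which forwards to Tor's SOCKS port.
proxies = {'http': 'http://127.0.0.1:8123'}

for _ in range(10):
    # checkip.dyndns.org returns a small HTML page containing the caller's IP.
    body = requests.get('http://checkip.dyndns.org/', proxies=proxies, timeout=30).text
    print(body)
    time.sleep(20)  # MaxCircuitDirtiness 20 -> a new circuit roughly every 20 s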

Here is the kind of item I scrape:

{'_id': 'Bidule',
'field1': ['A','C','D','E'], # requires being logged in to the page
'field2': 'truc de bidule',
'field3': [0,1,2,3], # requires being logged in to the page
'field4': 'le champ quatre'}

It works for the first items, but after a while it goes wrong and I get items like the following (a small validity check, sketched right after this example, can flag them):

{'_id': 'Machine',
'field1': [], # empty because not logged in
'field2': 'truc de machine',
'field3': [], # empty because not logged in
'field4': 'le champ quatre'}
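To pinpoint exactly when the login is lost, one option is an item pipeline that warns as soon as a login-only field comes back empty. This is a debugging sketch, not part of the original project; the class name and the field list are made up and need adapting:

import logging

class LoginCheckPipeline(object):
    """Warn as soon as fields that require a login come back empty."""

    LOGIN_ONLY_FIELDS = ('cat_de_course', 'id_de_course')  # adapt to your item

    def process_item(self, item, spider):
        for field in self.LOGIN_ONLY_FIELDS:
            if not item.get(field):
                logging.warning('Possible lost login: %s is empty for %s',
                                field, item.get('_id'))
        return item

Enable it through ITEM_PIPELINES in settings.py, like any other pipeline.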

When does it go wrong?

Below is an illustration of what happened during one of my crawls, based on the log file and the terminal output:

IP: 178.239.176.73 #first item scraped as expected
IP: 178.239.176.73 #second item scraped as expected
IP: 178.239.176.73 #third item scraped as expected
IP: 178.239.176.73 #fourth item scraped as expected
IP: 178.239.176.73 #fifth item scraped as expected
IP: 178.239.176.73 #sixth item scraped as expected
IP: 178.239.176.73 #seventh item scraped as expected
IP: 178.239.176.73 #eighth item scraped as expected
IP: 178.239.176.73 #ninth item NOT scraped as expected
IP: 178.239.176.73 #and items NOT scraped as expected until the end
IP: 178.239.176.73
IP: 178.239.176.73
IP: 162.247.74.27
IP: 162.247.74.27
IP: 162.247.74.27
IP: 162.247.74.27
IP: 162.247.74.27
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227

The bottom line: it logs me out even though the IP has not changed yet, and from then on my items are useless. I do not understand why this happens.

Note that when it is not set to rotate the IP, it runs fine, no problem at all; but I want a setup with a rotating IP.

I tried adding the setting COOKIES_ENABLED = True, because I wondered whether losing cookies was costing me my login to the page, but that is apparently not the cause. So I still wonder what the cause is.
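One way to dig into this (a debugging sketch, not part of the original project) is a tiny downloader middleware that logs the Cookie header actually sent with each request, which shows whether the session cookie survives the circuit changes:

class CookieLoggerMiddleware(object):
    """Log the Cookie header of every outgoing request (debugging only)."""

    def process_request(self, request, spider):
        spider.logger.debug('Cookie sent to %s: %s',
                            request.url, request.headers.get('Cookie'))

Register it in DOWNLOADER_MIDDLEWARES with a priority above 700, so it runs after the built-in CookiesMiddleware has already filled in the header.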

If you want to test it, reproduce the result and help, create the project:

scrapy startproject project

Layout of the project directory:

project #directory
 |_ scrapy.cfg #file
 |__project #directory
     |_ __init__.py (empty) #file
     |_ items.py (not needed for this test) #file
     |_ middlewares.py #file
     |_ pipelines.py (not needed for this test) #file
     |_ settings.py #file
     |__ spiders #directory
          |_ spiders.py #file

middlewares.py

from scrapy import signals
import random
from scrapy.conf import settings

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
        spider.log('Proxy : %s' % request.meta['proxy'])

class ProjectSpiderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class ProjectDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
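A compatibility note: scrapy.conf is deprecated (and removed in recent Scrapy releases); the current way to read settings from a middleware is through from_crawler. A minimal sketch of the first two middlewares written that way, with the logic unchanged:

import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read USER_AGENT_LIST from settings.py instead of scrapy.conf.
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.user_agents))

class ProxyMiddleware(object):
    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('HTTP_PROXY'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy
        spider.logger.debug('Proxy: %s', self.proxy)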

settings.py

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 OPR/43.0.2442.991',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36 OPR/42.0.2393.94',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36 OPR/48.0.2685.52'
]

# proxy for Polipo
HTTP_PROXY = 'http://127.0.0.1:8123'
# retry if needed
RETRY_ENABLED = True
RETRY_TIMES = 5  # initial request + 5 retries = 6 requests
RETRY_HTTP_CODES = [401, 403, 404, 408, 500, 502, 503, 504]
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Cookies (enabled by default)
COOKIES_ENABLED = True  # commented out or not, it made no difference for me
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.RandomUserAgentMiddleware': 400,
    'project.middlewares.ProxyMiddleware': 410,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # current path; the old scrapy.contrib path would not disable the built-in middleware
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
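While debugging, Scrapy's built-in cookie tracing can also help; it logs every Cookie and Set-Cookie header exchanged. Just add to settings.py:

# Very verbose: log all cookies sent and received.
COOKIES_DEBUG = True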

spiders.py, in the spiders directory:

import scrapy
import logging  # used by checkip below
from re import search

class ChevalSpider(scrapy.Spider):
    name = "fiche_cheval"
    start_urls = ['https://www.paris-turf.com/compte/login']
    def __init__(self, username=None, mdp=None, *args, **kwargs):
        super(ChevalSpider, self).__init__(*args, **kwargs)
        self.username = username  # create a fake account by yourself
        self.mdp = mdp

    def parse(self, response):
        token = response.css('[name="_csrf_token"]::attr(value)').get()
        data_log = {
                '_csrf_token': token,
                '_username': self.username,
                '_password': self.mdp
                 }
        yield scrapy.FormRequest.from_response(response, formdata=data_log, callback=self.after_login)

    def after_login(self, response):
        liste_ch=['alexandros-751044','annette-girl-735523','citoyenne-743132','everest-748084','goudurix-687456','lady-zorreghuietta-752292','petit-dandy-671825','ritvilik-708712','scarface-686119','siamese-713651','tic-tac-toe-685508',
        'velada-745272','wind-breaker-755107','zodev-715463','ballerian-813033','houpala-riquette-784415','jemykos-751551','madoudal-736164','margerie-778614','marquise-collonges-794335','mene-thou-du-plaid-780155']  # only a sample of thousands of ids
        url=['https://www.paris-turf.com/fiche-cheval/'+ch for ch in liste_ch]
        for link,cheval in zip(url,liste_ch):
            yield scrapy.Request(
                url=link,
                callback=self.page_cheval,
                meta={'nom':cheval}
            )
    def page_cheval(self, response):
        def lister_valeur(x_path, x_path2):
            """Custom helper: append None when the tag does not exist on
            the page, so that all the list fields keep the same length.
            Important for my application."""
            liste_valeur = []
            for valeur in response.xpath(x_path):
                val = valeur.xpath(x_path2).extract_first()
                if val in (None, "", ".", "-", " "):
                    liste_valeur.append(None)
                else:
                    liste_valeur.append(val)
            return liste_valeur

        cat_course1, cat_course2 = "//html//td[@class='italiques']", "text()"
        cat_course = lister_valeur(cat_course1, cat_course2)  # 'Course A', 'Course B', ...
        gains1, gains2 = "//html//td[@class='rapport']", "a/text()"
        gains = lister_valeur(gains1, gains2)  # '17 100', '14 850', '0', ...
        gains = [
            int(search(r'(\d{1,10})', gain.replace('\n', '').replace(' ', '').replace('.', '')).group(1))
            if gain is not None and search(r'(\d{1,10})', gain.replace('\n', '').replace(' ', '').replace('.', '')) is not None
            else 0  # also guards against the None values produced by lister_valeur
            for gain in gains
        ]
        _id_course=response.xpath("//html//td[1]/@data-id").extract()

        item={
            '_id':response.request.meta['nom'],
            'cat_de_course':cat_course,
            'gains':gains,
            'id_de_course':_id_course
        }
        yield scrapy.Request('http://checkip.dyndns.org/', callback=self.checkip, dont_filter=True)
        yield item
    def checkip(self, response):
        ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
        print('IP: {}'.format(ip))
        logging.warning('IP: {}'.format(ip))

Launch the spider:

scrapy crawl fiche_cheval -a username=yourfakeemailaccount -a mdp=password -o items.json -s LOG_FILE=Project.log

A few notes: in the last crawl I ran a few minutes ago, the IP was changing and yet every item except the last one was fine. So if you run it once and see every item come out right, that is certainly not a reproducible result when you launch it again.

Tor and Polipo configuration:

The /etc/tor/torrc file:

MaxCircuitDirtiness 20
SOCKSPort 9050
ControlPort 9051
CookieAuthentication 1
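Since torrc exposes ControlPort 9051 with cookie authentication, a fresh circuit can also be requested on demand instead of waiting for MaxCircuitDirtiness to expire. A minimal sketch using the stem library (not part of the original setup; pip install stem, and the user running it must be able to read Tor's auth cookie):

from stem import Signal
from stem.control import Controller

# Connect to Tor's control port and ask for a fresh circuit.
with Controller.from_port(port=9051) as controller:
    controller.authenticate()         # uses the cookie file (CookieAuthentication 1)
    controller.signal(Signal.NEWNYM)  # request a new identity / exit node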

The /etc/tor/torsocks.conf file:

TorAddress 127.0.0.1
TorPort 9050

The /etc/polipo/config file:

logSyslog = true
logFile = /var/log/polipo/polipo.log
socksParentProxy = localhost:9050
diskCacheRoot=""
disableLocalInterface=true

OS: Ubuntu 18.04.2 LTS. Tor: 0.3.2.10 (git-0edaa32732ec8930) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.0g, Zlib 1.2.11, Liblzma 5.2.2 and Libzstd 1.3.3. I could not find a way to check the Polipo version.

Update

I added a loop: when an item does not come back as expected, the spider goes back to parse and logs in again. I also checked inside that loop that data_log was still valid, and it was each time it got used again. So I am logged in and yet I still do not get the items as expected, which is strange.
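For reference, this is roughly what that re-login loop looks like. A minimal sketch, not the original code, using a made-up logged-out test (the presence of the login form's CSRF field in the response):

    def page_cheval(self, response):
        # Made-up check: if the login form shows up, the session was lost.
        if response.css('[name="_csrf_token"]'):
            # Go through the login page again; the items get re-requested
            # by after_login once the new login succeeds.
            yield scrapy.Request(
                'https://www.paris-turf.com/compte/login',
                callback=self.parse,
                dont_filter=True,
            )
            return
        # ... normal extraction continues here ...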

Tags: web-scraping, scrapy, tor
