SSL handshake failure when using a proxy with Scrapy

Problem description

I am trying to set up a proxy for a Scrapy project. I followed the instructions from this answer:

"1 - Create a new file called "middlewares.py", save it inside your Scrapy project, and add the following code to it:

import base64
class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic HTTP authentication for the proxy
        # (b64encode is used because base64.encodestring appends a trailing
        # newline, which corrupts the header value)
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
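
A detail worth knowing about building this header: on Python 2, base64.encodestring() appends a trailing newline to its output, which silently corrupts the header value, so b64encode is the safer call. A minimal check of the difference:

import base64

user_pass = "USERNAME:PASSWORD"
print(repr(base64.encodestring(user_pass)))  # 'VVNFUk5BTUU6UEFTU1dPUkQ=\n'  <- trailing newline
print(repr(base64.b64encode(user_pass)))     # 'VVNFUk5BTUU6UEFTU1dPUkQ='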

To get a proxy, I am using the following free subscription: https://proxy.webshare.io/

which provides the port, user, and address:

import base64
class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "sarnencj:password"
        # set up basic HTTP authentication for the proxy
        # (b64encode avoids the trailing newline that encodestring appends)
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
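
Note that this version supplies credentials twice, and inconsistently: the proxy URL embeds sarnencj-us-1:kd99722l2k7y, while the manually built header uses sarnencj:password. Recent Scrapy versions (including the 1.5.0 used in the solution below) extract the user:pass from meta['proxy'] and set Proxy-Authorization themselves, so a sketch of an equivalent, conflict-free middleware is just:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # credentials ride along in the proxy URL; the built-in
        # HttpProxyMiddleware turns them into a Proxy-Authorization header
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"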

But when I run the spider, I get the following error:

2018-04-30 21:44:30 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]

Edit:

The middlewares in settings are as follows:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'moocs.middlewares.ProxyMiddleware': 100,
}
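
As the deprecation warning in the full log below points out, the scrapy.contrib path is outdated; the same registration with the current module path (the custom middleware still running first, at the lower order value) would be:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'moocs.middlewares.ProxyMiddleware': 100,
}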

Full log:

2018-05-02 12:28:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: moocs)
2018-05-02 12:28:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-05-02 12:28:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'moocs.spiders', 'SPIDER_MODULES': ['moocs.spiders'], 'BOT_NAME': 'moocs'}
2018-05-02 12:28:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2018-05-02 12:28:39 [boto] DEBUG: Retrieving credentials from metadata server.
2018-05-02 12:28:39 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 101] Network is unreachable>
2018-05-02 12:28:40 [boto] ERROR: Unable to read instance data, giving up
2018-05-02 12:28:40 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapy/utils/deprecate.py:155: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` instead
  ScrapyDeprecationWarning)

2018-05-02 12:28:40 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-05-02 12:28:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-05-02 12:28:40 [scrapy] INFO: Enabled item pipelines: 
2018-05-02 12:28:40 [scrapy] INFO: Spider opened
2018-05-02 12:28:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-02 12:28:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-02 12:28:42 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:44 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] INFO: Closing spider (finished)
2018-05-02 12:28:45 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 909,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 5, 2, 16, 58, 45, 996708),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 5, 2, 16, 58, 40, 255414)}
2018-05-02 12:28:45 [scrapy] INFO: Spider closed (finished)
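
(The stats show ResponseNeverReceived for all three attempts, i.e. the TLS negotiation dies before any response comes back. One plausible culprit, given the middleware above, is the trailing newline that base64.encodestring adds to the Proxy-Authorization value, which corrupts the CONNECT request sent to the proxy.)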

Edit:

I tried setting the proxy in the spider class:

import scrapy
from scrapy import  Request
from scrapy.loader import ItemLoader

from urlparse import urljoin
from moocs.items import MoocsItem, MoocsReviewItem



class MoocsSpiderSpider(scrapy.Spider):
    name = "moocs_spider"
    #allowed_domains = ["https://www.coursetalk.com/subjects/data-science/courses"]
    start_urls = (
        'https://www.coursetalk.com/subjects/data-science/courses',
    )

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'moocs.middlewares.ProxyMiddleware': 100
        }
    }
    def parse(self, response):
        courses_xpath = '//*[@class="course-listing-card"]//a[contains(@href, "/courses/")]/@href'
        courses_url = [urljoin(response.url, relative_url) for relative_url in response.xpath(courses_xpath).extract()]
        for course_url in courses_url[0:30]:
            print(course_url)
            yield Request(url=course_url, callback=self.parse_reviews)
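
A note on custom_settings here: dictionary settings such as DOWNLOADER_MIDDLEWARES are merged with Scrapy's built-in defaults rather than replacing them, so the stock HttpProxyMiddleware stays enabled alongside the custom one (this is visible in the Edit 3 log further down).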

In middlewares.py:

class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"

Now I get a different error:

2018-05-03 18:07:17 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: Could not open CONNECT tunnel.
2018-05-03 18:07:17 [scrapy] INFO: Closing spider (finished)
2018-05-03 18:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
 'downloader/request_bytes': 245,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'finish_reason': 'finished',
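
This TunnelError means the proxy rejected or never completed the CONNECT request. One likely reason, given the versions in the first log: Scrapy 1.0.3 ignores credentials embedded in meta['proxy'] (support for extracting them was added in later releases), so the CONNECT goes out unauthenticated. A quick way to test the proxy itself, outside of Scrapy/Twisted, is a one-off call with the requests library (present in the pip freeze below):

import requests

# standalone sanity check with the same proxy URL as in the middleware
proxies = {
    'http': 'http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128',
    'https': 'http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128',
}
r = requests.get('https://www.coursetalk.com/subjects/data-science/courses',
                 proxies=proxies, timeout=30)
print(r.status_code)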

Edit 2:

I am using Linux Mint 17. Scrapy is not installed in a virtual environment.

From pip freeze:

Warning: cannot find svn location for apsw==3.8.2-r1
BeautifulSoup==3.2.1
CherryPy==3.2.2
EasyProcess==0.2.2
Flask==0.11.1
GDAL==2.1.0
GraphLab-Create==1.6.1
Jinja2==2.8
Mako==0.9.1
Markdown==2.4
MarkupSafe==0.18
PAM==0.4.2
Pillow==2.3.0
PyAudio==0.2.7
PyInstaller==2.1
PyVirtualDisplay==0.2
PyYAML==3.11
Pygments==2.0.2
Routes==2.0
SFrame==2.1
SQLAlchemy==0.8.4
Scrapy==1.0.3
Send2Trash==1.5.0
Shapely==1.5.17
Sphinx==1.2.2
Theano==0.8.2
Twisted==16.2.0
Twisted-Core==13.2.0
Twisted-Names==13.2.0
Twisted-Web==13.2.0
Werkzeug==0.11.10
adblockparser==0.7
## FIXME: could not find svn URL in dependency_links for this package:
apsw==3.8.2-r1
apt-xapian-index==0.45
apturl==0.4.1ubuntu4
argparse==1.2.1
backports-abc==0.4
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.4.1
bokeh==0.11.1
boto==2.41.0
branca==0.1.1
bz2file==0.98
captcha-solver==0.1.1
certifi==2015.9.6.2
characteristic==14.3.0
chardet==2.0.1
click==5.1
cloudpickle==0.2.1
colorama==0.2.5
command-not-found==0.3
configglue==1.1.2
cssselect==0.9.1
cssutils==0.9.10
cymem==1.31.2
debtagshw==0.1
decorator==4.0.2
defer==1.0.6
deluge==1.3.6
dirspec==13.10
dnspython==1.11.1
docutils==0.11
drawnow==0.71.1
duplicity==0.6.23
enum34==1.1.6
feedparser==5.1.3
folium==0.2.1
functools32==3.2.3-2
futures==3.0.5
gensim==0.13.1
geocoder==1.8.2
geolocation-python==0.2.2
geopandas==0.2.1
geopy==1.11.0
gmplot==1.1.1
googlemaps==2.4.2
gyp==0.1
html5lib==0.999
httplib2==0.8
ipykernel==4.0.3
ipython==4.0.0
ipython-genutils==0.1.0
ipywidgets==4.0.3
itsdangerous==0.24
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.2
jupyter-console==4.0.2
jupyter-core==4.4.0
jupyterlab==0.31.8
jupyterlab-launcher==0.10.5
lockfile==0.8
lxml==3.3.3
matplotlib==1.3.1
mechanize==0.2.5
mistune==0.7.1
mpmath==0.19
murmurhash==0.26.4
mysql-connector-python==1.1.6
nbconvert==4.0.0
nbformat==4.3.0
netifaces==0.8
nltk==3.2.1
nose==1.3.1
notebook==5.4.0
numpy==1.14.0
oauth2==1.9.0.post1
oauthlib==1.1.2
oneconf==0.3.7
opencage==1.1.4
pandas==0.22.0
paramiko==1.10.1
path.py==7.6
patsy==0.4.1
pexpect==3.1
pickleshare==0.5
piston-mini-client==0.7.5
plac==0.9.6
plotly==2.0.6
preshed==0.46.4
protobuf==2.5.0
psutil==5.0.1
psycopg2==2.4.5
ptyprocess==0.5
py==1.4.31
pyOpenSSL==0.13
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycrypto==2.6.1
pycups==1.9.66
pycurl==7.19.3
pygobject==3.12.0
pyinotify==0.9.4
pymongo==3.3.0
pyparsing==2.0.1
pyserial==2.7
pysmbc==1.0.14.1
pyspatialite==3.0.1
pysqlite==2.6.3
pytesseract==0.2.0
pytest==2.9.2
python-Levenshtein==0.12.0
python-apt==0.9.3.5
python-dateutil==2.6.1
python-debian==0.1.21-nmu2ubuntu2
python-libtorrent==0.16.13
pytz==2017.3
pyxdg==0.25
pyzmq==14.7.0
qt5reactor==0.3
qtconsole==4.0.1
queuelib==1.4.2
ratelim==0.1.6
reportlab==3.0
repoze.lru==0.6
requests==2.10.0
requests-oauthlib==0.6.2
roman==2.0.0
scikit-learn==0.17
scipy==0.17.1
scrapy-random-useragent==0.1
scrapy-splash==0.7.1
seaborn==0.7.0
selenium==2.53.6
semver==2.6.0
service-identity==14.0.0
sessioninstaller==0.0.0
shub==1.3.4
simpledbf==0.2.6
simplegeneric==0.8.1
simplejson==3.3.1
singledispatch==3.4.0.3
six==1.11.0
smart-open==1.3.3
smartystreets.py==0.2.4
spacy==0.101.0
sputnik==0.9.3
spyder==2.3.9
statsmodels==0.6.1
stevedore==0.14.1
subprocess32==3.2.7
sympy==1.0
system-service==0.1.6
terminado==0.8.1
tesseract==0.1.3
textblob==0.11.1
textrazor==1.2.2
thinc==5.0.8
tornado==4.3
traitlets==4.3.2
tweepy==3.3.0
uTidylib==0.2
urllib3==1.7.1
utils==0.9.0
vboxapi==1.0
vincent==0.4.4
virtualenv==15.0.2
virtualenv-clone==0.2.4
virtualenvwrapper==4.1.1
w3lib==1.12.0
wordcloud==1.2.1
wsgiref==0.1.2
yelp==1.0.2
zope.interface==4.0.5

I ran the following to see whether the proxy works:

curl -v --proxy "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128" "https://www.coursetalk.com/subjects/data-science/courses"

It does work and loads the page:

> Host: www.coursetalk.com:443
> Proxy-Authorization: Basic c2FybmVuY2otdXMtMTprZDk5NzIybDJrN3k=
> User-Agent: curl/7.35.0
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 200 Connection established
< Date: Fri, 04 May 2018 22:02:00 GMT
< Age: 0
< Transfer-Encoding: chunked
* CONNECT responded chunked
< Proxy-Connection: keep-alive
< Server: Webshare
< 
* Proxy replied OK to CONNECT request
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
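
Note that the Proxy-Authorization value curl sent matches the user:pass embedded in the proxy URL, not the sarnencj:password pair from the earlier middleware, which is easy to confirm by decoding it:

import base64

# decode the header value from the curl session above
print(base64.b64decode("c2FybmVuY2otdXMtMTprZDk5NzIybDJrN3k="))
# -> sarnencj-us-1:kd99722l2k7y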

Edit 3:

Here is the current log:

2018-05-04 19:17:07 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: moocs)
2018-05-04 19:17:07 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.6 (default, Jun 22 2015, 18:00:18) - [GCC 4.8.2], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Linux-3.13.0-107-generic-i686-with-LinuxMint-17-qiana
2018-05-04 19:17:07 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'moocs.spiders', 'SPIDER_MODULES': ['moocs.spiders'], 'DOWNLOAD_DELAY': 3, 'BOT_NAME': 'moocs'}
2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['moocs.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-04 19:17:07 [py.warnings] WARNING: /media/luis/DATA/articulos/moocs/scripts/moocs/moocs/pipelines.py:9: ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762
  from scrapy.xlib.pydispatch import dispatcher

2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled item pipelines:
['moocs.pipelines.MultiCSVItemPipeline']
2018-05-04 19:17:07 [scrapy.core.engine] INFO: Spider opened
2018-05-04 19:17:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-04 19:17:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
^C2018-05-04 19:17:08 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2018-05-04 19:17:08 [scrapy.core.engine] INFO: Closing spider (shutdown)

Tags: python, proxy, scrapy

Solution


I think the problem is with your modifications to ProxyMiddleware. I updated your code and ran it as below:

from scrapy import Spider

class Test(Spider):
    name = "proxyapp"
    start_urls = ["https://www.coursetalk.com/subjects/data-science/courses"]


    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'jobs.middlewares.ProxyMiddleware': 100
        }
    }

    def parse(self, response):
        print(response.text)
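
("jobs" here is the answerer's own project name; in the question's project the dotted path would be moocs.middlewares.ProxyMiddleware, as in the earlier edits.)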

middlewares.py

class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"

I ran the code, and it works fine:

[screenshot: the spider runs and the page loads through the proxy]

The Scrapy version I tested with is:

Scrapy==1.5.0
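
Worth noting: the question's first log shows Scrapy 1.0.3 with Twisted 16.2.0 and pyOpenSSL 0.13, while both Edit 3 and this test run Scrapy 1.5.0 with pyOpenSSL 17.5.0. Credentials embedded in meta['proxy'] are only honored on the newer versions, so the upgrade is likely part of why this plain middleware works here.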

To be 100% sure the proxy is actually being used, I ran it against ipinfo.io/json:

[screenshot: ipinfo.io/json response showing the proxy's IP and location]

Trust me, I am not sitting in Delaware, or even in the US.
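
A minimal sketch of that check, reusing the same middleware (the spider class and name here are hypothetical):

from scrapy import Spider

class IPCheckSpider(Spider):
    name = "ipcheck"
    # ipinfo.io/json echoes back the IP the request arrives from
    start_urls = ["https://ipinfo.io/json"]

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'jobs.middlewares.ProxyMiddleware': 100
        }
    }

    def parse(self, response):
        # should print the proxy's IP and geolocation, not your own
        print(response.text)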

