Unable to scrape data using requests and bs4

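Problem description

I wrote a script that extracts data from an e-commerce site. I use requests to fetch the page and bs4 to parse its content and extract the data. When I run the script locally on my machine, everything works fine. Listing the data takes 3-4 seconds, but it works. The problems start when I deploy the script to Heroku. After pushing it to Heroku the script still works, but it is somewhat slower, and the most annoying part is that it crashes often: it scrapes the data 6-7 times and then throws a large block of errors. As a beginner, I can't make sense of them. Here is the full traceback from the Heroku logs: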

2020-09-11T18:39:48.896959+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.897027+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
2020-09-11T18:39:48.897328+00:00 app[worker.1]: conn = connection.create_connection(
2020-09-11T18:39:48.897333+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
2020-09-11T18:39:48.897547+00:00 app[worker.1]: raise err
2020-09-11T18:39:48.897569+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
2020-09-11T18:39:48.897793+00:00 app[worker.1]: sock.connect(sa)
2020-09-11T18:39:48.897834+00:00 app[worker.1]: OSError: [Errno 113] No route to host
2020-09-11T18:39:48.897835+00:00 app[worker.1]: 
2020-09-11T18:39:48.897891+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2020-09-11T18:39:48.897892+00:00 app[worker.1]: 
2020-09-11T18:39:48.897898+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.897898+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
2020-09-11T18:39:48.898299+00:00 app[worker.1]: httplib_response = self._make_request(
2020-09-11T18:39:48.898322+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 381, in _make_request
2020-09-11T18:39:48.898652+00:00 app[worker.1]: self._validate_conn(conn)
2020-09-11T18:39:48.898672+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
2020-09-11T18:39:48.899235+00:00 app[worker.1]: conn.connect()
2020-09-11T18:39:48.899238+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connection.py", line 309, in connect
2020-09-11T18:39:48.899483+00:00 app[worker.1]: conn = self._new_conn()
2020-09-11T18:39:48.899488+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn
2020-09-11T18:39:48.899630+00:00 app[worker.1]: raise NewConnectionError(
2020-09-11T18:39:48.899656+00:00 app[worker.1]: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fd5906c0250>: Failed to establish a new connection: [Errno 113] No route to host
2020-09-11T18:39:48.899658+00:00 app[worker.1]: 
2020-09-11T18:39:48.899658+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2020-09-11T18:39:48.899659+00:00 app[worker.1]: 
2020-09-11T18:39:48.899661+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.899678+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
2020-09-11T18:39:48.899896+00:00 app[worker.1]: resp = conn.urlopen(
2020-09-11T18:39:48.899899+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
2020-09-11T18:39:48.900165+00:00 app[worker.1]: retries = retries.increment(
2020-09-11T18:39:48.900180+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/util/retry.py", line 439, in increment
2020-09-11T18:39:48.900369+00:00 app[worker.1]: raise MaxRetryError(_pool, url, error or ResponseError(cause))
2020-09-11T18:39:48.900409+00:00 app[worker.1]: urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.flipkart.com', port=443): Max retries exceeded with url: /search?q=shoes&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd5906c0250>: Failed to establish a new connection: [Errno 113] No route to host'))
2020-09-11T18:39:48.900411+00:00 app[worker.1]: 
2020-09-11T18:39:48.900411+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2020-09-11T18:39:48.900412+00:00 app[worker.1]: 
2020-09-11T18:39:48.900412+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.900414+00:00 app[worker.1]: File "server.py", line 103, in <module>
2020-09-11T18:39:48.900542+00:00 app[worker.1]: reply= bot.flipkart(product= message_type)
2020-09-11T18:39:48.900567+00:00 app[worker.1]: File "/app/bot.py", line 86, in flipkart
2020-09-11T18:39:48.900823+00:00 app[worker.1]: datas= Test.scrape(product)
2020-09-11T18:39:48.900828+00:00 app[worker.1]: File "/app/Test.py", line 7, in __init__
2020-09-11T18:39:48.901017+00:00 app[worker.1]: self.source= requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(search_query)).content
2020-09-11T18:39:48.901049+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/api.py", line 76, in get
2020-09-11T18:39:48.901257+00:00 app[worker.1]: return request('get', url, params=params, **kwargs)
2020-09-11T18:39:48.901262+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/api.py", line 61, in request
2020-09-11T18:39:48.901466+00:00 app[worker.1]: return session.request(method=method, url=url, **kwargs)
2020-09-11T18:39:48.901471+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
2020-09-11T18:39:48.901887+00:00 app[worker.1]: resp = self.send(prep, **send_kwargs)
2020-09-11T18:39:48.901891+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
2020-09-11T18:39:48.902410+00:00 app[worker.1]: r = adapter.send(request, **kwargs)
2020-09-11T18:39:48.902413+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
2020-09-11T18:39:48.902823+00:00 app[worker.1]: raise ConnectionError(e, request=request)
2020-09-11T18:39:48.902882+00:00 app[worker.1]: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.flipkart.com', port=443): Max retries exceeded with url: /search?q=shoes&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd5906c0250>: Failed to establish a new connection: [Errno 113] No route to host'))
2020-09-11T18:39:48.991351+00:00 heroku[worker.1]: Process exited with status 1
2020-09-11T18:39:49.047690+00:00 heroku[worker.1]: State changed from up to crashed

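I'm sorry for not sharing the whole code. I would share it, but it is split across two or three linked files, so posting all of it here isn't practical. I've tried hard but can't understand the error, so any help would be greatly appreciated!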

Tags: beautifulsoup, python-requests, python-3.7

Solution


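The error you posted is caused by a missing or very slow network connection: the traceback ends in `[Errno 113] No route to host`, which means the worker could not open a connection to www.flipkart.com at all. Check that the environment has a working internet connection, and if it still fails, try restarting your current Python environment.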
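If the connection failures are intermittent, one possible way to keep the worker from crashing on the first failed request is to set a timeout and retry a few times with a growing delay. The following is only a minimal sketch under that assumption, not the asker's actual code: `call_with_retries` and `fetch_search_page` are hypothetical helper names, the search URL is simplified to the `q` parameter, and the attempt count and delays are arbitrary. It relies on the fact that the exceptions raised by requests derive from `OSError`.

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the wait after each failure.

    requests.exceptions.RequestException subclasses OSError, so the default
    `retry_on` also covers requests' ConnectionError and Timeout.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...


def fetch_search_page(query, timeout=10):
    """Hypothetical helper: download one search page (simplified URL)."""
    import requests  # third-party; imported here so the retry helper stays stdlib-only

    response = requests.get(
        "https://www.flipkart.com/search",
        params={"q": query},
        timeout=timeout,  # without a timeout, a stalled connection can hang the worker
    )
    return response.content
```

A call such as `call_with_retries(lambda: fetch_search_page("shoes"))` would then make up to three attempts. If all of them fail, the last exception is re-raised, so the calling code can catch it and send a fallback reply instead of letting the process exit.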

