Getting an http.client.RemoteDisconnected error immediately when scraping a web page

Problem description

I am trying this code:

import gspread
import requests
import datetime 
from bs4 import BeautifulSoup
from oauth2client.service_account import ServiceAccountCredentials
from pprint import pprint
from datetime import timedelta

datetime.datetime.now()

scope = [
'https://www.googleapis.com/auth/spreadsheets',
'https://www.googleapis.com/auth/drive'
]

URL = 'https://colnect.com/cs/coins/list/country/57-%C4%8Cesk%C3%A1_republika/series/76375-1993~sou%C4%8Dasnost_-_ob%C4%9B%C5%BEn%C3%A9'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

#Google sheet
data = ServiceAccountCredentials.from_json_keyfile_name("data.json", scope)
client = gspread.authorize(data)
sheet = client.open("skript").worksheet('ColnectTest')
data = sheet.get_all_records()

#Scraping
results = soup.find_all('div', attrs={'class':'pl-it'})
for job_data in results:
    
    mince = job_data.find('h2', attrs={"class":"item_header"})
    mince_final = mince.text.strip()

    # add a row to the sheet
    insertRow = ["colnect.cz", mince_final]
    sheet.insert_row(insertRow,2)

But I immediately get this error message:

Traceback (most recent call last):
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen      
    httplib_response = self._make_request(
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Python\Python38-32\lib\http\client.py", line 1347, in getresponse
    response.begin()
  File "C:\Programs\Python\Python38-32\lib\http\client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "C:\Programs\Python\Python38-32\lib\http\client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:


Traceback (most recent call last):
  File "C:\Programs\Python\Python38-32\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\util\retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\packages\six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Programs\Python\Python38-32\lib\http\client.py", line 1347, in getresponse
    response.begin()
  File "C:\Programs\Python\Python38-32\lib\http\client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "C:\Programs\Python\Python38-32\lib\http\client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:/Projekty/python/skript/t.py", line 17, in <module>
    page = requests.get(URL)
  File "C:\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Programs\Python\Python38-32\lib\site-packages\requests\adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

What is wrong with my code? I use the same code on other web pages and it works fine for them. Is there anything I can fix on my side, or is there some kind of blocking on the web server side?

I want to insert some data from the web page into a Google Sheet. I am starting with the first element, the `h2 class=item_header`, to get the name of the coin, and once that works I will continue with the other elements.

Tags: python-3.x, web-scraping

Solution

You need to specify a User-Agent header to get a proper response from the server, for example:

import requests
from bs4 import BeautifulSoup


headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}

URL = "https://colnect.com/cs/coins/list/country/57-%C4%8Cesk%C3%A1_republika/series/76375-1993~sou%C4%8Dasnost_-_ob%C4%9B%C5%BEn%C3%A9"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

print(soup.title)

Prints:

<title>Česká republika : Mince [Série: 1993~současnost - oběžné] [1/2]</title>
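To apply the same fix throughout the original script, the headers can be attached to a `requests.Session` once, so every request sends them; adding a retry policy also helps with transient disconnects. This is a sketch, not part of the accepted answer, and the retry settings (3 attempts, backoff factor 1) are illustrative assumptions:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session sends these headers with every request it makes
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
})

# Retry transient failures a few times before raising (values are illustrative)
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Usage in the original script would then be, e.g.:
# page = session.get(URL, timeout=10)
# soup = BeautifulSoup(page.content, "html.parser")
```

Using a session also reuses the underlying TCP connection across the loop's requests, which is slightly faster and gentler on the server.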
