python-3.x - http.client.RemoteDisconnected error immediately when scraping a web page
Problem description
I tried this code:
import gspread
import requests
import datetime
from bs4 import BeautifulSoup
from oauth2client.service_account import ServiceAccountCredentials
from pprint import pprint
from datetime import timedelta
datetime.datetime.now()
scope = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive'
]
URL = 'https://colnect.com/cs/coins/list/country/57-%C4%8Cesk%C3%A1_republika/series/76375-1993~sou%C4%8Dasnost_-_ob%C4%9B%C5%BEn%C3%A9'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
#Google sheet
data = ServiceAccountCredentials.from_json_keyfile_name("data.json", scope)
client = gspread.authorize(data)
sheet = client.open("skript").worksheet('ColnectTest')
data = sheet.get_all_records()
#Scraping
results = soup.find_all('div', attrs={'class':'pl-it'})
for job_data in results:
    mince = job_data.find('h2', attrs={"class":"item_header"})
    mince_final = mince.text.strip()
    # add a row to the sheet
    insertRow = ["colnect.cz", mince_final]
    sheet.insert_row(insertRow, 2)
But I immediately get this error message:
Traceback (most recent call last):
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "C:\Python\Python38-32\lib\http\client.py", line 1347, in getresponse
response.begin()
File "C:\Programs\Python\Python38-32\lib\http\client.py", line 307, in begin
version, status, reason = self._read_status()
File "C:\Programs\Python\Python38-32\lib\http\client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Programs\Python\Python38-32\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\util\retry.py", line 403, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\packages\six.py", line 734, in reraise
raise value.with_traceback(tb)
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "C:\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "C:\Programs\Python\Python38-32\lib\http\client.py", line 1347, in getresponse
response.begin()
File "C:\Programs\Python\Python38-32\lib\http\client.py", line 307, in begin
version, status, reason = self._read_status()
File "C:\Programs\Python\Python38-32\lib\http\client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:/Projekty/python/skript/t.py", line 17, in <module>
page = requests.get(URL)
File "C:\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "C:\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "C:\Programs\Python\Python38-32\lib\site-packages\requests\adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
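What is wrong with my code? I use the same code for other web pages and it works fine there. Is there anything I can do on my side, or is the web server blocking me?
I want to insert some data from the web page into a Google Sheet. I am starting with the first element, h2 class=item_header, to get the coin's name, and will continue with the other elements once that insert works.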
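The extraction step described here can be checked offline, before any network request or Google Sheets call is involved. Below is a minimal sketch, assuming markup that mirrors the class names used in the question (pl-it and item_header). The sample HTML and the helper name extract_coin_names are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the class names come from the question.
SAMPLE_HTML = """
<div class="pl-it"><h2 class="item_header"> 1 Koruna </h2></div>
<div class="pl-it"><p>no header here</p></div>
<div class="pl-it"><h2 class="item_header">2 Koruny</h2></div>
"""


def extract_coin_names(html):
    """Return the stripped text of every h2.item_header inside a div.pl-it."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for item in soup.find_all("div", attrs={"class": "pl-it"}):
        header = item.find("h2", attrs={"class": "item_header"})
        if header is None:
            # Skip entries without a header instead of raising AttributeError.
            continue
        names.append(header.text.strip())
    return names


coin_names = extract_coin_names(SAMPLE_HTML)
print(coin_names)
```

The None check matters because find returns None when an item has no matching h2, and calling .text on None would raise AttributeError partway through the loop.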
Solution
You need to specify a User-Agent header to get a proper response from the server, for example:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
URL = "https://colnect.com/cs/coins/list/country/57-%C4%8Cesk%C3%A1_republika/series/76375-1993~sou%C4%8Dasnost_-_ob%C4%9B%C5%BEn%C3%A9"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
print(soup.title)
Prints:
<title>Česká republika : Mince [Série: 1993~současnost - oběžné] [1/2]</title>
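To see why the header makes a difference, it can help to compare what requests sends by default with what it sends once the header is set. By default requests identifies itself as python-requests/<version>, and it seems likely (though the server's actual rules are not known here) that this site drops such connections. The sketch below only prepares a request and never sends it, so it runs without network access. The site root URL is used purely as a placeholder.

```python
import requests

# Custom header taken from the answer above.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}

session = requests.Session()
# What requests would send if no User-Agent were given.
default_ua = session.headers["User-Agent"]

# Prepare (but do not send) a request to inspect the outgoing headers.
# The URL is a placeholder; nothing is fetched here.
request = requests.Request("GET", "https://colnect.com/", headers=headers)
prepared = session.prepare_request(request)

print("default:", default_ua)
print("sent   :", prepared.headers["User-Agent"])
```

If several pages are fetched, setting the header once with session.headers.update(headers) and calling session.get(...) avoids repeating headers=headers on every call.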