python - Python async scraping fails: aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
Problem description
I am currently exploring Python and the possibilities of asynchronous requests.
I want to scrape a page that lists more than 1,000 sites. Each site in turn has about 100 links that I need to follow and scrape. So I try to do it in two stages.
First: crawl all the main pages, collect the data and copy the links. Second: iterate over the responses and issue all the further requests for the links found on the main pages.
But I keep running into this error. Does anyone see what is going wrong here?
Traceback (most recent call last):
File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 91, in <module>
main()
File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 82, in main
loop.run_until_complete(future)
File "/usr/local/Cellar/python@3.8/3.8.4/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 53, in getPageQuestions
questions = await asyncio.gather(*tasks)
File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 40, in get_html
html = await resp.content.read()
File "/Users/jonathansaudhof/Programming/Python basics/env/lib/python3.8/site-packages/aiohttp/streams.py", line 358, in read
block = await self.readany()
File "/Users/jonathansaudhof/Programming/Python basics/env/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in readany
await self._wait('readany')
File "/Users/jonathansaudhof/Programming/Python basics/env/lib/python3.8/site-packages/aiohttp/streams.py", line 296, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
My code:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import json

result = {}
result["data"] = []

baseUrl = "https://example.com"

def getQuestionAnswers(html):
    soup = BeautifulSoup(html, "lxml")
    rows = soup.tbody.findAll("tr")
    r = []
    for row in rows:
        rowTd = row.findAll("td")
        data = {}
        data["answer"] = ("").join(rowTd[0].a.contents)
        data["characters"] = int(rowTd[1].contents[0].split()[0])
        r.append(data)
    return r

async def get_html(url, session, sem):
    # The semaphore caps the number of requests in flight at once
    async with sem:
        async with session.get(url) as resp:
            html = await resp.content.read()
            return html

async def getPageQuestions(until):
    r = []
    tasks = []
    # First step: fetch all main pages
    sem = asyncio.Semaphore(30)
    async with aiohttp.ClientSession() as session1:
        for n in range(until):
            task = asyncio.ensure_future(get_html(baseUrl + "/page=" + str(n), session1, sem))
            tasks.append(task)
        questions = await asyncio.gather(*tasks)
        # Second step: follow every link found on the main pages
        answers = []
        async with aiohttp.ClientSession() as session2:
            for html in questions:
                soup = BeautifulSoup(html, "lxml")
                rows = soup.tbody.findAll("tr")
                for row in rows:
                    rowTd = row.findAll("td")
                    answers.append(asyncio.ensure_future(get_html(baseUrl + rowTd[0].a.get("href"), session2, sem)))
            answers = await asyncio.gather(*answers)
            # TODO merge questions and answers together in an object
            result["data"].append(r)

def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(getPageQuestions(1751))
    loop.run_until_complete(future)
    print(result)
    with open('data_a.json', 'w') as outfile:
        json.dump(result, outfile)
    print("Done.")

if __name__ == '__main__':
    main()
Solution
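ClientPayloadError: Response payload is not completed typically means the server (or an intermediary) closed the connection before the full body arrived, which becomes likely when many thousands of requests hammer one host. Two common mitigations are lowering the concurrency limit and retrying failed fetches. Below is a minimal, hedged sketch of a generic retry helper; `fetch_with_retry` and `flaky_fetch` are illustrative names, not part of aiohttp, and the demo uses a stand-in exception class so it runs without network access. In the real crawler you would pass `exceptions=(aiohttp.ClientPayloadError,)` and call `get_html` inside it.

```python
import asyncio

async def fetch_with_retry(fetch, url, retries=3, delay=1.0,
                           exceptions=(Exception,)):
    """Retry an async fetch function on transient errors.

    In the crawler above, `exceptions` would be
    (aiohttp.ClientPayloadError,) and `fetch` would wrap get_html.
    """
    for attempt in range(1, retries + 1):
        try:
            return await fetch(url)
        except exceptions:
            if attempt == retries:
                raise  # give up after the last attempt
            await asyncio.sleep(delay * attempt)  # simple linear backoff

# --- demo with a stub that fails twice, then succeeds ---
class PayloadError(Exception):
    """Stand-in for aiohttp.ClientPayloadError in this offline demo."""

calls = {"n": 0}

async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise PayloadError("Response payload is not completed")
    return f"<html>ok: {url}</html>"

result = asyncio.run(fetch_with_retry(flaky_fetch, "https://example.com",
                                      exceptions=(PayloadError,), delay=0))
print(result, calls["n"])
```

Combined with a smaller `asyncio.Semaphore` (try 5-10 instead of 30) this usually makes long crawls survive the occasional dropped response instead of aborting the whole `gather`.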