Python async scraping fails: aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

Problem Description

I'm currently trying to get into Python and the possibilities of asynchronous requests.

I'd like to scrape an overview page that lists more than 1000 sites, and each of those sites has another 100 links that I need to follow and scrape. So I'm trying to do it in two stages.

First: crawl all the main pages and collect the data and the links. Second: iterate over the responses and make all the further requests for the links found on the main pages.

But I always run into the error below. Does anyone see what's going wrong here?

Traceback (most recent call last):
  File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 91, in <module>
    main()
  File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 82, in main
    loop.run_until_complete(future)
  File "/usr/local/Cellar/python@3.8/3.8.4/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 53, in getPageQuestions
    questions = await asyncio.gather(*tasks)
  File "/Users/jonathansaudhof/Programming/Python basics/test_async.py", line 40, in get_html
    html = await resp.content.read()
  File "/Users/jonathansaudhof/Programming/Python basics/env/lib/python3.8/site-packages/aiohttp/streams.py", line 358, in read
    block = await self.readany()
  File "/Users/jonathansaudhof/Programming/Python basics/env/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in readany
    await self._wait('readany')
  File "/Users/jonathansaudhof/Programming/Python basics/env/lib/python3.8/site-packages/aiohttp/streams.py", line 296, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

My code:

import aiohttp 
import asyncio
from bs4 import BeautifulSoup
import json


result = {"data": []}

baseUrl = "https://example.com"

def getQuestionAnswers(html):
    # Parse the table rows on a question page into a list of answer dicts.
    soup = BeautifulSoup(html, "lxml")
    rows = soup.tbody.findAll("tr")

    r = []
    for row in rows:
        rowTd = row.findAll("td")
        data = {}
        data["answer"] = "".join(rowTd[0].a.contents)
        data["characters"] = int(rowTd[1].contents[0].split()[0])
        r.append(data)
    return r



async def get_html(url, session, sem):
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        async with session.get(url) as resp:
            html = await resp.content.read()
            return html


async def getPageQuestions(until):
    r = []
    tasks = []

    # First step: fetch all the main (overview) pages.
    sem = asyncio.Semaphore(30)
    async with aiohttp.ClientSession() as session1:
        for n in range(until):
            task = asyncio.ensure_future(get_html(baseUrl + "/page=" + str(n), session1, sem))
            tasks.append(task)

        questions = await asyncio.gather(*tasks)

    # Second step: follow every link found on the main pages.
    answers = []
    async with aiohttp.ClientSession() as session2:
        for html in questions:
            soup = BeautifulSoup(html, "lxml")
            rows = soup.tbody.findAll("tr")

            for row in rows:
                rowTd = row.findAll("td")
                answers.append(asyncio.ensure_future(get_html(baseUrl + rowTd[0].a.get("href"), session2, sem)))
        answers = await asyncio.gather(*answers)

    # TODO merge questions and answers together in an object
    result["data"].append(r)


def main():
    # Kick off the two-stage crawl and dump the collected data to JSON.
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(getPageQuestions(1751))
    loop.run_until_complete(future)
    print(result)
    with open('data_a.json', 'w') as outfile:
        json.dump(result, outfile)

    print("Done.")


if __name__ == '__main__':
    main()

Tags: python, beautifulsoup, python-asyncio

Solution
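aiohttp raises ClientPayloadError: Response payload is not completed when the connection is closed before the complete response body has been received; note that the traceback fails inside resp.content.read(), not in any parsing code. With 1751 overview pages requested in a burst, a likely cause is the server (or an intermediate proxy) dropping connections under load, which would make the error transient rather than a logic bug in the scraper.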

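If the failures are transient, a common mitigation is to retry the request a few times with a short backoff. This is a minimal sketch, not a confirmed fix for this exact site; the retries parameter and the backoff policy are illustrative assumptions:

import asyncio
import aiohttp

async def get_html(url, session, sem, retries=3):
    # ClientPayloadError usually means the peer closed the connection
    # before the whole body arrived, so a retry often succeeds.
    async with sem:
        for attempt in range(retries):
            try:
                async with session.get(url) as resp:
                    return await resp.read()  # waits for the full body
            except aiohttp.ClientPayloadError:
                if attempt == retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff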

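It can also help to cap concurrency at the transport level and disable keep-alive, so a half-closed socket is never reused. Again a sketch under assumptions: fetch and fetch_all are hypothetical helpers, and the limit of 30 mirrors the semaphore above but should be tuned to what the target server tolerates:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.read()

async def fetch_all(urls):
    # limit caps simultaneous connections for the whole session;
    # force_close closes each connection after its request instead of
    # keeping it alive for reuse.
    connector = aiohttp.TCPConnector(limit=30, force_close=True)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

On Python 3.8 this can be driven with asyncio.run(fetch_all(urls)) instead of managing the event loop by hand as main() does.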