How do I get the page content with lxml-xml in BeautifulSoup?

Problem description

import asyncio
import aiohttp
import lxml
from bs4 import BeautifulSoup


async def get_content(session, url):
    async with session.get(url) as response:
        data = await response.read()
    return BeautifulSoup(data.decode('utf-8'), 'lxml-xml')
    

async def parse(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(get_content(session, i)) for i in urls]
        soups = await asyncio.gather(*tasks, return_exceptions=True)
    return soups


url = "https://kolesa.kz/cars/almaty/?page={}"
urls = [url.format(i) for i in range(2,201)]

loop = asyncio.get_event_loop()
soups = loop.run_until_complete(parse(urls))
loop.close()

print(soups[0])

I can't parse the site's 200 pages with BeautifulSoup and lxml-xml. soups[0] shows only <?xml version="1.0" encoding="utf-8"?>. Can I get the HTML pages using lxml-xml?

Tags: python, parsing, beautifulsoup, utf-8, lxml

Solution


The lxml-xml parser is the same as the xml parser. You probably don't want to parse the document as XML, but as HTML. Change it to lxml (or to html5lib / html.parser):

async def get_content(session, url):
    async with session.get(url) as response:
        data = await response.read()
    return BeautifulSoup(data.decode('utf-8'), 'lxml')  # <--- change to just 'lxml'

The output is then:

<!DOCTYPE html>
<html lang="en" xmlns:xlink="http://www.w3.org/1999/xlink">
<head>
<meta charset="utf-8"/>

...and so on.
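
To see why the XML parser drops almost the whole page, here is a minimal sketch (the HTML snippet is made up for illustration): ordinary HTML contains void tags such as <meta> and <br> that are not well-formed XML, so the strict parser keeps little or nothing of the tree, while the HTML parser repairs the markup:

from bs4 import BeautifulSoup

# Typical HTML: <meta> and <br> are not self-closed, which is valid
# HTML but not well-formed XML.
html = "<html><head><meta charset='utf-8'></head><body><p>Hi<br>there</body></html>"

# Parsed as XML, the tree may come back empty or truncated,
# similar to the lone <?xml ...?> line seen in the question.
print(BeautifulSoup(html, 'lxml-xml'))

# Parsed as HTML, lxml repairs the markup and the whole tree is kept.
print(BeautifulSoup(html, 'lxml').prettify())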
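
As a side note, unrelated to the parser choice: the get_event_loop() / run_until_complete() pattern in the question is the older way to drive coroutines. On Python 3.7 and later, a sketch of the same entry point with asyncio.run (reusing the parse and urls defined above) would be:

# asyncio.run creates the event loop, runs parse(urls) and closes the loop for you.
soups = asyncio.run(parse(urls))
print(soups[0])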
