How do I parse all the HTML tags in a BeautifulSoup object?

Question

I can't parse the HTML tags nested inside a BeautifulSoup object. Here is my code:

response = requests.get(
    'myurl',
    headers={'Authorization': 'Bearer ' + auth_token},
    params=params
)
soup = BeautifulSoup(response.content, 'html.parser')
soup = json.loads(str(soup))
all_data.extend(soup['data'])

But soup['data'] is a list of dictionaries like this:

[{"_id":"123","tags":[],"user":{"_id":"u1","name":"ASD Na"},"shared":"<p>Personal: Parents </p><p><br/></p><p>KM: </p><p><br/></p>","private":"","created":"2019-01-26T16:54:56.283Z","district":"543543","creator":{"_id":"c432","name":"Cass Man"},"lastModified":"2019-01-26T16:54:56.284Z"},
{"_id":"234","tags":[],"user":{"_id":"u2","name":"Tyler Dass"},"shared":"Hi,<p>It's great to see your clear.</p>","private":"","created":"2019-11-26T15:48:43.314Z","district":"543543","creator":{"_id":"432","name":"John"},"lastModified":"2019-11-26T15:48:43.315Z"}]

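Although the tags only appear under the shared key in this sample, they do occur in several fields. How can I go through soup and use the various BeautifulSoup functions to get the clean text from every field? I tried soup.get_text(), but it didn't work.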

Tags: python, beautifulsoup

Answer


From the example you posted, the response you receive is JSON, so you don't need BeautifulSoup to parse it:

response = requests.get('myurl', headers={'Authorization': 'Bearer ' + auth_token}, params=params)
data = response.json()   # <-- note the .json() call

all_data.extend(data['data'])
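
As a rough illustration of why the round trip through BeautifulSoup is best avoided, the sketch below parses a small JSON string directly and then via `str(BeautifulSoup(...))`. The payload `raw` is made up for this example (it is not from the question) and deliberately contains an unclosed `<p>` tag; the HTML parser treats the whole JSON text as markup and may re-serialize it differently, so the result is not guaranteed to be valid JSON anymore.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical payload: a JSON document whose string value holds an unclosed <p> tag.
raw = '{"data": [{"shared": "<p>unclosed"}]}'

# Parsing the JSON text directly leaves the embedded HTML untouched.
direct = json.loads(raw)["data"][0]["shared"]
print(direct)

# Running the same text through an HTML parser first lets the parser "repair" the markup.
round_tripped = str(BeautifulSoup(raw, "html.parser"))
print(round_tripped == raw)

try:
    json.loads(round_tripped)
    print("still valid JSON")
except json.JSONDecodeError as exc:
    print("no longer valid JSON:", exc)
```

With `response.json()` (or `json.loads(response.text)`) the HTML stays inside the string values exactly as the server sent it, and each value can be handed to BeautifulSoup separately.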

Then, to get the information out of the shared key, you can turn its value into a BeautifulSoup object:

for d in all_data:
    soup = BeautifulSoup(d['shared'], 'html.parser')
    # print only text from <p> tags:
    print([p.get_text(strip=True) for p in soup.select('p')])

Prints:

['Personal: Parents', '', 'KM:', '']
["It's great to see your clear."]
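
The question also mentions that tags can show up in more than one field. One possible way to handle that, sketched here under assumptions, is to walk each record recursively and strip markup from every string value. The helper name `strip_html` and the shortcut of skipping strings that contain no `<` are choices made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup


def strip_html(value):
    """Recursively replace HTML-bearing strings in a JSON-like structure with plain text."""
    if isinstance(value, str):
        # Assumption: strings without "<" hold no tags, so IDs and timestamps pass through unchanged.
        if "<" not in value:
            return value
        # Join the text fragments with a space so adjacent blocks do not run together.
        return BeautifulSoup(value, "html.parser").get_text(" ", strip=True)
    if isinstance(value, dict):
        return {key: strip_html(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_html(item) for item in value]
    return value  # numbers, booleans, None


# A trimmed copy of the second record from the question.
record = {
    "_id": "234",
    "tags": [],
    "user": {"_id": "u2", "name": "Tyler Dass"},
    "shared": "Hi,<p>It's great to see your clear.</p>",
    "private": "",
}

clean = strip_html(record)
print(clean["shared"])  # Hi, It's great to see your clear.
```

Unlike `soup.select('p')`, `get_text()` on the whole fragment also keeps text that sits outside any `<p>` tag, such as the leading "Hi," here. Applying it to the full list would look like `all_data = [strip_html(d) for d in all_data]`.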
