BeautifulSoup returns an empty result

Problem description

I am new to BeautifulSoup and am trying to scrape text from a website with the code below. However, find_all returns nothing.

import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://beta.regulations.gov/document/USCIS-2019-0010-9175').read()
soup = bs.BeautifulSoup(source, 'html.parser')
text = soup.find_all(class_="px-2")
print(text)


Tags: python, beautifulsoup

Solution


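As mentioned in the comments, the data is loaded dynamically by JavaScript, so it is not in the HTML that BeautifulSoup parses. If you open the Network tab in the Firefox or Chrome developer tools, you can see where the data comes from and request that endpoint directly: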

import requests

url = 'https://beta.regulations.gov/document/USCIS-2019-0010-9175'
ajax_url = 'https://beta.regulations.gov/api/documentdetails/{}'

document_id = url.split('/')[-1]
data = requests.get(ajax_url.format(document_id)).json()

# from pprint import pprint  # <-- uncomment to see all data
# pprint(data)

print(data['data']['attributes']['content'])

Prints:

Rescind the increase in fees. This is draconian. For all intents and purposes, denying access to this information will prevent many Americans from knowing where they came from. This is an outrage. This is not the mark of a democracy. I strongly disagree with this fee increase
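
If the same lookup is needed for several documents, the approach above could be wrapped in a few small helpers. The following is only a sketch under assumptions: it assumes the API endpoint and the data/attributes/content layout shown in the answer are still valid, it uses the standard library instead of requests so that it runs without extra packages, and the function names (document_id_from_url, content_from_payload, fetch_comment) are hypothetical names chosen for illustration.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Endpoint template taken from the answer above; the site may have changed it since.
AJAX_URL = 'https://beta.regulations.gov/api/documentdetails/{}'


def document_id_from_url(url):
    """Return the last path segment, ignoring any query string or trailing slash."""
    path = urlparse(url).path.rstrip('/')
    return path.split('/')[-1]


def content_from_payload(payload):
    """Return payload['data']['attributes']['content'], or None if a level is missing."""
    node = payload
    for key in ('data', 'attributes', 'content'):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def fetch_comment(url, timeout=10):
    """Request the JSON behind a document page URL and return its comment text."""
    api_url = AJAX_URL.format(document_id_from_url(url))
    # urlopen raises urllib.error.HTTPError / URLError on failure; callers may catch these.
    with urllib.request.urlopen(api_url, timeout=timeout) as response:
        payload = json.load(response)
    return content_from_payload(payload)


# Example call (performs a network request, so it is left commented out):
# print(fetch_comment('https://beta.regulations.gov/document/USCIS-2019-0010-9175'))
```

Returning None when a key is missing, rather than raising KeyError, is a design choice for this sketch: it lets a caller distinguish "request succeeded but the document has no content field" from a network failure, which still surfaces as an exception.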
