Extracting text from HTML with Python bs4

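Question

I am trying to extract the value from <div class="number">, as shown below, but the output returns None. How can I get this value?

The HTML:

(The HTML I want to extract from was attached here.)

The code I have tried: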

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from pylogix import PLC   

my_url = 'https://www.aeso.ca/'
uClient =  uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
report = page_soup.findAll("div",{"class":"number"})

print(report)

Tags: python, beautifulsoup

Solution


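The page's content is loaded dynamically with JavaScript, so a plain HTTP client such as requests or urllib only receives the initial HTML, which does not yet contain the values. As an alternative, we can use Selenium to render the page before scraping it.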
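
To illustrate, here is a small sketch of what a plain HTTP fetch would see. The HTML string is a made-up stand-in for the unrendered page (it is not the real aeso.ca markup): the container is empty because a script would normally fill it in later. Note that find_all returns an empty list when nothing matches, while find and select_one return None.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML a plain HTTP client receives:
# the values are missing because a script would insert them later.
static_html = """
<html><body>
  <div class="chart-price"></div>
  <script src="load-values.js"></script>
</body></html>
"""

static_soup = BeautifulSoup(static_html, "html.parser")

# No <div class="number"> exists yet in the static markup
matches = static_soup.find_all("div", {"class": "number"})
first = static_soup.find("div", {"class": "number"})

print(matches)  # []
print(first)    # None
```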

Install it with: pip install selenium

Download the ChromeDriver that matches your installed Chrome version.

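from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup


URL = "https://www.aeso.ca/"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(r"c:\path\to\chromedriver.exe"))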

driver.get(URL)
# Wait for the page to fully render before parsing it
sleep(5)

# The rendered page source is available in the driver's `page_source` attribute
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

report = soup.find_all("div", {"class": "number"})
print(report)

Output:

[<div class="number">10421 <span class="unit">MW</span></div>, <div class="number">$37.57 <span class="unit">/ MWh</span></div>]

To get only the text, use the .text attribute:

for tag in report:
    print(tag.text)

Output:

10421 MW
$37.57 / MWh
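
If the value and its unit are needed separately, one option is to iterate over each tag's stripped_strings instead of using .text. A minimal sketch, using the markup from the output above as inline sample data (the readings list is just an illustrative name):

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output above
html = (
    '<div class="number">10421 <span class="unit">MW</span></div>'
    '<div class="number">$37.57 <span class="unit">/ MWh</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

readings = []
for tag in soup.find_all("div", {"class": "number"}):
    # stripped_strings yields each text node with surrounding whitespace removed
    parts = list(tag.stripped_strings)
    value = parts[0]             # text directly inside the div
    unit = " ".join(parts[1:])   # text inside <span class="unit">
    readings.append((value, unit))

print(readings)  # [('10421', 'MW'), ('$37.57', '/ MWh')]
```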

To get only the "Pool Price" output, use a CSS selector:

print(soup.select_one(".chart-price div.number").text)

# Or uncomment this to only extract the price, and remove `/ MWh` from the output
# print(soup.select_one(".chart-price div.number").text.split("/")[0])

Output (at the time of writing):

$37.57 / MWh
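
To work with the price as a number rather than a string, the text could be cleaned up and converted, for example with Decimal. The helper parse_price below is hypothetical and assumes the format shown in the output above (a leading $, the amount, then / MWh).

```python
from decimal import Decimal


def parse_price(text):
    """Convert a string such as '$37.57 / MWh' into Decimal('37.57')."""
    amount = text.split("/")[0].strip()           # '$37.57'
    # Drop the currency sign and any thousands separators
    amount = amount.lstrip("$").replace(",", "")
    return Decimal(amount)


print(parse_price("$37.57 / MWh"))  # 37.57
```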
