BeautifulSoup - find() function not working for some elements

Problem description

I am trying to scrape financial data from this URL: https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals

On this page, scraping the h1 tag by referencing its class works perfectly.

Source HTML

<h1 _ngcontent-ng-lseg-c11="" class="company-name font-bold hero-font"><!----><!---->STANDARD CHARTERED PLC<!----><!----><!----></h1>

My Python code

from bs4 import BeautifulSoup
import requests


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = 'https://www.londonstockexchange.com/stock/{}/{}'

stock = 'STAN/standard-chartered-plc'
info = 'fundamentals'

full_url = url.format(stock, info)

print(full_url)

r = requests.get(full_url, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')

title = soup.find('title')
print(title)

rows = soup.find(class_='company-name font-bold hero-font')

print(rows)

Output

https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals
<title>STANDARD CHARTERED PLC STAN Fundamentals - Stock | London Stock Exchange</title>
<h1 _ngcontent-sc12="" class="company-name font-bold hero-font"><!-- --><!-- -->STANDARD CHARTERED PLC<!-- --><!-- --><!-- --></h1>

But find() stops working when I try to scrape another part of the page, namely the following tag:

<thead _ngcontent-ng-lseg-c21="" class="accordion-header gtm-trackable">

My Python code

from bs4 import BeautifulSoup
import requests


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = 'https://www.londonstockexchange.com/stock/{}/{}'

stock = 'STAN/standard-chartered-plc'
info = 'fundamentals'

full_url = url.format(stock, info)

print(full_url)

r = requests.get(full_url, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')

title = soup.find('title')
print(title)

rows = soup.find(class_='accordion-header gtm-trackable')

print(rows)

My output is as follows:

https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals
<title>STANDARD CHARTERED PLC STAN Fundamentals - Stock | London Stock Exchange</title>
None

I have tried both 'html.parser' and 'lxml'; both give the same result.
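The parser is not the issue: BeautifulSoup can only match markup that is present in the HTML the server actually returned, and the accordion table is rendered later by client-side JavaScript. A minimal offline sketch of the symptom, using a made-up snippet standing in for the server response:

```python
from bs4 import BeautifulSoup

# Hypothetical server-rendered HTML: the <h1> is present, but the <thead>
# is injected later in the browser, so it never reaches BeautifulSoup.
server_html = """
<html><body>
<h1 class="company-name font-bold hero-font">STANDARD CHARTERED PLC</h1>
<app-root></app-root>  <!-- the Angular app fills this in at runtime -->
</body></html>
"""

soup = BeautifulSoup(server_html, 'html.parser')
print(soup.find(class_='company-name font-bold hero-font'))  # the <h1> tag
print(soup.find(class_='accordion-header gtm-trackable'))    # None
```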

Tags: python, html, web-scraping, beautifulsoup

Solution


The data is loaded dynamically from a script tag. You can pull out the string that holds the data, replace certain custom entities so that the string becomes valid JSON, convert it with json, and then parse out whatever you need.

import requests
from bs4 import BeautifulSoup
import json

r = requests.get('https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals')
soup = BeautifulSoup(r.text, 'lxml')

# The Angular app serialises its state into <script id="ng-lseg-state">,
# with double quotes encoded as the custom entity &q;. Restoring the
# quotes turns the string back into valid JSON.
data = json.loads(soup.select_one('#ng-lseg-state').string.replace('&q;', '"'))
print(data['sortedComponents']['content'][1]['status']['childComponents'][1]['content'].keys())
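The long index chain above will break silently if the page layout shifts. As a hypothetical helper (not part of the original answer), a small recursive search can locate a key anywhere in the nested structure instead:

```python
def find_key(obj, key):
    """Recursively yield every value stored under `key` in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)

# Tiny stand-in for the real page state (the actual structure is far deeper).
sample = {'sortedComponents': {'content': [{'status': {'childComponents': [
    {'content': {'table': [1, 2, 3]}}]}}]}}
print(list(find_key(sample, 'table')))  # [[1, 2, 3]]
```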

There may be some other entities that need replacing. Adding the following may be enough:

import html

then

data = json.loads(html.unescape(soup.select_one('#ng-lseg-state').string.replace('&q;','"')))
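To see the two-step decoding on its own, here is a synthetic state string (the real one is far longer, and the value below is invented): replace() handles the custom &q; entity, while html.unescape() takes care of standard entities such as &amp;:

```python
import html
import json

# Synthetic example of the encoded state (assumed shape, not real page data).
encoded = '{&q;name&q;:&q;Standard &amp; Chartered&q;}'

decoded = html.unescape(encoded.replace('&q;', '"'))
data = json.loads(decoded)
print(data)  # {'name': 'Standard & Chartered'}
```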

Sample of the data (shown as a screenshot in the original answer).

To reproduce the screenshot's content:

from pprint import pprint

pprint(data['sortedComponents']['content'][1]['status']['childComponents'][1]['content'])

A string you can paste into a JSON viewer:

json.dumps(data['sortedComponents']['content'][1]['status']['childComponents'][1]['content'])
