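python-3.x - Scraping a React chart in Python using selenium
Problem description
I am trying to use selenium to scrape data from a React chart on the website linked below. I can locate the element, but I cannot get the data out of it. The specific data I need from the chart sits in a nested series: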
"data":[{"name":"December 2019",
"....",
"coverage":107.9}
inside the element <script id="react_5X8YGgN8H0GoMMQ4RLqjrQ"></script>
The end result should look like this, extracted from data.name and data.coverage:
months = [December 2019, Januari 2020, Februari 2020, etc.]
coverages = [107.9, 107.8, 107.2, etc.]
Some code so far:
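import time

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
website = url
driver = webdriver.Firefox()
driver.get(website)
time.sleep(4)  # crude fixed wait for the page to load
# The element can be located, but its data still has to be extracted
element = driver.find_element(By.ID, "react_5X8YGgN8H0GoMMQ4RLqjrQ")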
Solution 2
As chitown88 pointed out, the script tag is static, so selenium is not needed and requests can do the job. Here is another solution that gets the data I need.
import re

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs4

# Fetch site data
url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = bs4(r.content, 'html.parser')

# Find script
script_data = soup.find('script', attrs={'id': 'react_5X8YGgN8H0GoMMQ4RLqjrQ'})
script_to_string = str(script_data)  # cast to string for regex

# Regex
# Positive lookbehind: match what follows "coverage": when it is 2 or 3 digits, a dot, and one more digit
coverage_pattern = r'(?<="coverage":)\d{2,3}\.\d{1}'
# Same idea as coverage_pattern, now a word followed by a four-digit year
months_pattern = r'(?<="name":")\w+\s\d{4}'

# Data (re.findall returns strings, so the coverages are strings here)
coverages = re.findall(coverage_pattern, script_to_string)
months = re.findall(months_pattern, script_to_string)
frame = pd.DataFrame({'months': months, 'coverages': coverages})
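To check the two patterns without hitting the site, here is a minimal sketch that runs them against a made-up string. The `sample` value is an assumption that only mimics the `"name"` / `"coverage"` layout shown in the question; the real script contents contain more fields. It also converts the matched coverages to floats, since `re.findall` returns strings.

```python
import re

# Hypothetical sample that only mimics the "name"/"coverage" layout
# from the question; it is not the real script contents.
sample = (
    '{"data":[{"name":"December 2019","coverage":107.9},'
    '{"name":"Januari 2020","coverage":107.8}]}'
)

# Lookbehind patterns from the solution above, with the dot escaped
coverage_pattern = r'(?<="coverage":)\d{2,3}\.\d'
months_pattern = r'(?<="name":")\w+\s\d{4}'

months = re.findall(months_pattern, sample)
# findall yields strings, so cast each match to float
coverages = [float(value) for value in re.findall(coverage_pattern, sample)]

print(months)
print(coverages)
```

Note that the coverage pattern only matches values with exactly one decimal; a value such as `108` without a decimal part would be skipped, which would leave the two lists with different lengths.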
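Solution
There is actually no need to use selenium, because the data is embedded in a script tag in the static response. Just pull it out, manipulate the string a little to get it into JSON format, and read it in. Then simply iterate through it: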
import pandas as pd
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the first script tag that mentions 'coverage'
scripts = soup.find_all('script')
for script in scripts:
    if 'coverage' in script.text:
        jsonStr = script.text
        break

# Keep only the text after 'Section, ', where the JSON payload starts
jsonStr = jsonStr.split('Section, ')[-1]

# Trim from the right, one '}' at a time, until the string parses as JSON
while True:
    try:
        jsonData = json.loads(jsonStr + '}')
        break
    except json.JSONDecodeError:
        jsonStr = jsonStr.rsplit('}', 1)[0]

data = jsonData['data']['data']
months = []
coverages = []
for each in data:
    months.append(each['name'])
    coverages.append(each['coverage'])
Output:
print(months)
['December 2019', 'Januari 2020', 'Februari 2020', 'Maart 2020', 'April 2020', 'Mei 2020', 'Juni 2020', 'Juli 2020', 'Augustus 2020', 'September 2020', 'Oktober 2020', 'November 2020']
and
print(coverages)
[107.9, 107.8, 107.2, 106.1, 105.1, 104.3, 103.7, 103.0, 102.8, 102.3, 101.9, 101.6]
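As a possible alternative to trimming the string until it parses, the standard library's `json.JSONDecoder.raw_decode` can decode one JSON value starting at a given index and ignore whatever follows it. The sketch below is only an illustration: the helper name `extract_json_after` and the `sample_script` string are hypothetical, and the sample merely assumes that the payload follows the text `'Section, '` as in the solution above.

```python
import json


def extract_json_after(text, marker):
    """Decode the JSON object that starts after the last `marker` in `text`."""
    # Mirror split(marker)[-1]: look after the last occurrence of the marker
    after_marker = text.rindex(marker) + len(marker)
    start = text.index('{', after_marker)
    # raw_decode returns (object, end_index) and tolerates trailing text
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Hypothetical script contents, made up for illustration only
sample_script = (
    'render(Section, {"data": {"data": ['
    '{"name": "December 2019", "coverage": 107.9}, '
    '{"name": "Januari 2020", "coverage": 107.8}]}}, '
    'document.getElementById("chart"));'
)

payload = extract_json_after(sample_script, 'Section, ')
rows = payload['data']['data']
months = [row['name'] for row in rows]
coverages = [row['coverage'] for row in rows]

print(months)
print(coverages)
```

This avoids the repeated parse attempts and raises a `ValueError` if the marker or an opening brace is missing, rather than looping.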