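javascript - Scraping a JavaScript website and its script tags with Python
Question
I am trying to scrape a JavaScript-driven web page. After reading a few posts, I managed to write the following: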
from bs4 import BeautifulSoup
import requests
website_url = requests.get('https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm').text
soup= BeautifulSoup(website_url,'lxml')
print(soup.prettify())
and to retrieve the script I am interested in like this:
soup.find_all('script')[3]
This gives:
<script type="text/javascript">
// Initialize script parameters.
var exportTitle ="Centralised medicinal products for human use by ATC code";
// Initialise the dataset.
var dataSet = [
{"id":"A","parent":"#","text":"A - Alimentary tract and metabolism"},
{"id":"A02","parent":"A","text":"A02 - Drugs for acid related disorders"},
{"id":"A02B","parent":"A02","text":"A02B - Drugs for treatment of peptic ulcer"},
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
...
{"id":"h154","parent":"V09IA05","text":"NeoSpect (withdrawn)","type":"pl"},
{"id":"V09IA09","parent":"V09IA","text":"V09IA09 - technetium (<sup>99m</sup>Tc) tilmanocept"},
{"id":"h955","parent":"V09IA09","text":"Lymphoseek (active)","type":"pl"},
{"id":"V09IB","parent":"V09I","text":"V09IB - Indium (<sup>111</sup>In) compounds"},
{"id":"V09IB03","parent":"V09IB","text":"V09IB03 - indium (<sup>111</sup>In) antiovariumcarcinoma antibody"},{"id":"h025","parent":"V09IB03","text":"Indimacis 125 (withdrawn)","type":"pl"},
...
]; </script>
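The problem I now face is applying .text() to soup.find_all('script')[3]
and recovering a JSON file from it. When I try to apply .text(), the result is an empty string: ''.
So my question is: why does this happen? Ideally, I would like to end up with: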
A02BC01 Losec and associated names (referral)
...
V09IA05 NeoSpect (withdrawn)
V09IA09 Lymphoseek
V09IB03 Indimacis 125 (withdrawn)
...
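Answer
First, get the text of the script tag, then do some string processing: take everything after 'dataSet = ' and cut at the ';' that ends the statement, which leaves a clean JSON array. Finally, parse the JSON array with json.loads and print the data.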
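import json

# .string returns the raw contents of the <script> tag
data = soup.find_all("script")[3].string
# Keep what follows 'dataSet = ' and cut at the first ';' to isolate the JSON array
dataJson = data.split('dataSet = ')[1].split(';')[0]
jsonArray = json.loads(dataJson)
for jsonElement in jsonArray:
    print(jsonElement['parent'], end=' ')
    print(jsonElement['text'])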
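The loop above prints every record, including the ATC category rows, while the output wanted in the question only lists the product entries (the ones carrying "type":"pl"). As a minimal sketch of one way to narrow it down, the snippet below works on a short sample string copied from the script shown in the question rather than on the live page. The helper names extract_dataset and product_lines are hypothetical, introduced here for illustration only. It also swaps the split(';') step for json.JSONDecoder().raw_decode, which stops at the end of the array by itself, so a ';' inside one of the text values would not cut the data short; that such a value occurs on the real page is an assumption, not something the question shows.

```python
import json

# Sample copied from the script in the question; the real page has far more rows.
SAMPLE_SCRIPT = '''
// Initialise the dataset.
var dataSet = [
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
{"id":"h154","parent":"V09IA05","text":"NeoSpect (withdrawn)","type":"pl"}
]; 
'''


def extract_dataset(script_text, marker="dataSet = "):
    """Hypothetical helper: parse the JSON array that follows `marker`."""
    start = script_text.index(marker) + len(marker)
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever comes after it (here the closing ';').
    records, _end = json.JSONDecoder().raw_decode(script_text, start)
    return records


def product_lines(records):
    """Hypothetical helper: keep only product rows, formatted as 'parent text'."""
    return [f"{r['parent']} {r['text']}" for r in records if r.get("type") == "pl"]


lines = product_lines(extract_dataset(SAMPLE_SCRIPT))
print("\n".join(lines))
```

With the real page, SAMPLE_SCRIPT would be replaced by soup.find_all("script")[3].string from the answer above.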