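python - How do I extract values from a script with Python regular expressions when web scraping?
Question
I'm trying to learn how to use regular expressions to extract values with Python. Here is the script; how do I get salePrice, seller_name, and skuId?
<script>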
define('app/pc', ['//laz-g-cdn.alicdn.com/lzdfe/pdp-platform/0.1.8/pc.js'], function(app) {
try {
app.run( {
"data": {
"root": {
"fields": {"skuInfos": {
"0": {
"categoryId":"8711", "dataLayer": {
"pdt_category":["Mother & Baby", "Feeding", "Milk Formula", "Follow On (6 - 12 mnths)"], "pagetype":"pdp", "pdt_discount":"-8%", "pdt_photo":"//laz-img-sg.alicdn.com/original/6bdf9b4b759b97f57b438a605f0e37e7.jpg", "v_voya":1, "brand_name":"Dumex", "brand_id":"30360", "pdt_sku":153105871, "core": {
"country": "SG", "layoutType": "desktop", "language": "en", "currencyCode": "SGD"
}
, "seller_name":"Dumex", "pdt_simplesku":191142180, "pdt_name":"Dumex Mamil Gold Stage 2 Follow On Baby Milk Formula (850g)", "page": {
"regCategoryId": "180101030000", "xParams": "_p_typ=pdp&_p_ispdp=1&_p_item=DU741TBAATAO7DSGAMZ-61110782&_p_prod=153105871&_p_sku=191142180&_p_slr=100047849"
}
, "supplier_id":100047849, "pdt_price":"47.9"
}
, "image":"//laz-img-sg.alicdn.com/original/6bdf9b4b759b97f57b438a605f0e37e7.jpg", "inWishlist":false, "itemId":"153105871", "operation": {
"operationWeight": 6, "text": "Add to Cart", "type": "default"
}
, "price": {
"discount":"-8%", "originalPrice": {
"text": "SGD47.90", "value": 47.9
}
, "salePrice": {
"text": "SGD44.29", "value": 44.29
}
} ,
], "sellerId":"100047849", "skuId":"191142180", "stock":18, "stockList":[ {
"stoock": 18, "type": "default"
}
]
}
</script>
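Solution
One approach combines HTML parsing, a regular expression, and JSON loading:
- locate the desired script element with BeautifulSoup (you have shown a single script element, but I assume it actually sits inside a larger HTML document)
- use a regular expression to extract the desired JavaScript object
- use json.loads() to load it into a Python dict/list
- get the desired values from this Python object
Something along these lines: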
import json
import re
from bs4 import BeautifulSoup
data = """
<script>
define('app/pc', ['//laz-g-cdn.alicdn.com/lzdfe/pdp-platform/0.1.8/pc.js'], function(app) {
try {
app.run( {
"data": {
"root": {
"fields": {"skuInfos": {
"0": {
"categoryId":"8711", "dataLayer": {
"pdt_category":["Mother & Baby", "Feeding", "Milk Formula", "Follow On (6 - 12 mnths)"], "pagetype":"pdp", "pdt_discount":"-8%", "pdt_photo":"//laz-img-sg.alicdn.com/original/6bdf9b4b759b97f57b438a605f0e37e7.jpg", "v_voya":1, "brand_name":"Dumex", "brand_id":"30360", "pdt_sku":153105871, "core": {
"country": "SG", "layoutType": "desktop", "language": "en", "currencyCode": "SGD"
}
, "seller_name":"Dumex", "pdt_simplesku":191142180, "pdt_name":"Dumex Mamil Gold Stage 2 Follow On Baby Milk Formula (850g)", "page": {
"regCategoryId": "180101030000", "xParams": "_p_typ=pdp&_p_ispdp=1&_p_item=DU741TBAATAO7DSGAMZ-61110782&_p_prod=153105871&_p_sku=191142180&_p_slr=100047849"
}
, "supplier_id":100047849, "pdt_price":"47.9"
}
, "image":"//laz-img-sg.alicdn.com/original/6bdf9b4b759b97f57b438a605f0e37e7.jpg", "inWishlist":false, "itemId":"153105871", "operation": {
"operationWeight": 6, "text": "Add to Cart", "type": "default"
}
, "price": {
"discount":"-8%", "originalPrice": {
"text": "SGD47.90", "value": 47.9
}
, "salePrice": {
"text": "SGD44.29", "value": 44.29
}
} ,
"sellerId":"100047849", "skuId":"191142180", "stock":18, "stockList":[ {
"stoock": 18, "type": "default"
}
]
}
</script>"""
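soup = BeautifulSoup(data, "html.parser")
# Capture the object stored under "skuInfos" -> "0", up to the last "}" that ends a line.
pattern = re.compile(r'"skuInfos": {\s+"0": ({.*})$', re.MULTILINE | re.DOTALL)
# Find the <script> element whose text matches the pattern
# (string= replaces the deprecated text= argument).
script = soup.find("script", string=pattern)
json_string = pattern.search(script.string).group(1)
# Parse the captured JavaScript object literal as JSON.
sku = json.loads(json_string)
print(sku['price']['salePrice']['value'])
print(sku['skuId'])
print(sku['dataLayer']['pdt_category'])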
Prints:
44.29
191142180
['Mother & Baby', 'Feeding', 'Milk Formula', 'Follow On (6 - 12 mnths)']
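The question also asks for seller_name, which in the sample data sits under dataLayer. Here is a minimal sketch of pulling out all three requested values, assuming the parsed object has the same nesting as the sample. The trimmed `sku` dict and the `extract_fields` helper are hypothetical stand-ins for illustration, not part of the original answer:

```python
# Trimmed stand-in for the dict returned by json.loads() in the answer's code
# (assumption: the real object is nested the same way as the sample data).
sku = {
    "dataLayer": {"seller_name": "Dumex"},
    "price": {"salePrice": {"text": "SGD44.29", "value": 44.29}},
    "skuId": "191142180",
}


def extract_fields(sku):
    """Return the three values the question asks about (hypothetical helper)."""
    return {
        "sale_price": sku["price"]["salePrice"]["value"],
        "seller_name": sku["dataLayer"]["seller_name"],
        "sku_id": sku["skuId"],
    }


fields = extract_fields(sku)
print(fields)
```

If a key might be missing on some pages, `dict.get()` with a default is a safer lookup than direct indexing.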
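Note that to make this work I had to fix syntax errors in the JS itself, since I assume this is not the complete script you actually have. In any case, you will probably need to adjust the pattern so that it better matches the JS object needed for your specific use case.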
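One possible way to make the pattern less fragile is to use the regular expression only to locate where the object starts, and let `json.JSONDecoder.raw_decode()` work out where it ends. The sketch below is an alternative to the greedy `({.*})$` capture, not the method used above, and the shortened `script_text` is a hypothetical stand-in for the real script body:

```python
import json
import re

# Hypothetical, shortened script body; a real page would contain far more.
script_text = '''
app.run({"data": {"root": {"fields": {"skuInfos": {
"0": {"skuId": "191142180", "price": {"salePrice": {"value": 44.29}}}
}}}}});
'''

# The regex only finds the key; the match ends right before the opening "{"
# of the object we want.
match = re.search(r'"skuInfos":\s*\{\s*"0":\s*', script_text)

# raw_decode parses one JSON value starting at the given index and returns
# (value, end_index), ignoring whatever JavaScript follows it.
sku, end = json.JSONDecoder().raw_decode(script_text, match.end())

print(sku["skuId"])
print(sku["price"]["salePrice"]["value"])
```

This still requires the object itself to be valid JSON, but it no longer depends on the closing brace falling at the end of a line.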