How to extract JSON from HTML source code using regex

Problem description

Python script

import requests
import json
from bs4 import BeautifulSoup
import re

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')

# Save source code to file for testing
with open("sourcecode.html", "w", encoding='utf-8') as file:
    file.write(str(soup))

# Regex pattern to capture JSON data within webpage source code
regex_pattern = r"{\"delivery\"*.*false*}}}"

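URL: https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125

I am trying to use regex to extract the JSON data embedded in the source code of the URL listed above.

I manually copied the source code from the listed URL and tested it on regex101.com with the following regex pattern: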

{\"delivery\"*.*false*}}}

The regex pattern appears to capture the required JSON data.
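For reference, here is a minimal sketch of applying a pattern of this shape in Python. The HTML snippet and the `window.__DATA__` name are made up for illustration; the real page source may be structured differently, so the pattern may need adjusting.

```python
import json
import re

# Made-up stand-in for the page source; the real markup may differ.
html = '<script>window.__DATA__ = {"delivery":{"home":{"available":false}}};</script>'

# Same idea as the pattern above, with the braces escaped explicitly.
# re.DOTALL lets "." also match newlines in case the JSON spans several lines.
pattern = re.compile(r'\{"delivery".*false\}\}\}', re.DOTALL)

match = pattern.search(html)
# group(0) is the full matched text; json.loads turns it into a dict.
data = json.loads(match.group(0)) if match else None
print(data)
```

Because `.*` is greedy, the match runs to the last `false}}}` in the text, which is worth checking against the real source.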

Question

When I view the contents of the soup variable or saved file it appears to capture the HTML source code.
However, I do not know how to process regex to only capture the JSON data string needed to build my desired Python Dictionary.

Any help would be greatly appreciated.

Tags: python, html, json, regex, parsing

Solution


Maybe something like this can help you:

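import json
import re

import requests

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
source_text = r.text

# Find every substring that matches the pattern (returns a list of strings)
matches = re.findall('put your regex here', source_text)

To convert the first match into a Python dictionary, parse it with json.loads (json.dumps goes the other way, from a Python object to a JSON string):

# Assumes at least one match and that the matched text is valid JSON
data = json.loads(matches[0])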
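If a greedy pattern turns out to be fragile, one alternative (not the approach in the answer above) is to locate the start of the object and let the standard library's `json.JSONDecoder.raw_decode` find where it ends. This is only a sketch: the helper name `extract_json_object`, the `marker` default, and the sample text are hypothetical, and it assumes the embedded data is valid JSON beginning at the marker.

```python
import json


def extract_json_object(text, marker='{"delivery"'):
    """Return the JSON object that starts at `marker`, or None if absent.

    Hypothetical helper: assumes the text at `marker` begins a valid JSON object.
    """
    start = text.find(marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value starting at index `start` and
    # ignores whatever follows it; it returns (object, end_index).
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Made-up sample; trailing text after the object does not affect the result.
sample = 'var state = {"delivery":{"a":false}}; var other = {"b":1};'
print(extract_json_object(sample))
```

With the real page, `r.text` would be passed in place of `sample`. `raw_decode` raises `json.JSONDecodeError` if the text at the marker is not valid JSON, so a `try`/`except` may be appropriate.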
