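python - How do I extract URLs from JavaScript code?
Problem description
One of my websites went offline a while ago and I need to recover its images. I have managed to write some Python that uses Beautiful Soup to extract the code from a script tag. I now need to parse some URLs out of the extracted text. The URLs I need are the ones for the "large" images. I am not sure how to add a loop so that I get all of the images rather than just the first one, or how to remove the quotation marks. Any help would be greatly appreciated.
Extracted text: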
var gallery_items = [{
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
"caption": ""
}, {
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
"caption": ""
}];
Python script
from bs4 import BeautifulSoup
import urllib.request as request
import re
folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')
scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text
try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'
print(found)
Output
"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg
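Solution
The given data is valid JSON, so Python's json library can parse it easily. All you need to do is isolate the JSON text (the array that follows the "var gallery_items =" assignment) and pass it to the JSON parser. The code might look like this: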
import json
script_str = r'''var gallery_items = [{ "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg", "caption": "" }, { "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg", "caption": "" }];'''
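# Slice out just the array: from the end of the assignment up to the first semicolon after it.
prefix = 'var gallery_items = '
start = script_str.find(prefix) + len(prefix)
gallery_items = json.loads(script_str[start:script_str.find(';', start)])
for item in gallery_items:
    # json.loads turns the escaped "\/" into "/", so the URLs print clean and without quotes.
    print(item['large'])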
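Cutting at a semicolon works for the sample above, but it would break if a caption ever contained a ";". As a hedged alternative, here is a minimal sketch that lets the JSON decoder find the end of the array itself with json.JSONDecoder.raw_decode. The helper name extract_gallery_items and the sample string are made up for illustration; in the question's script, the text to pass in would be the script variable taken from the page.

```python
import json


def extract_gallery_items(script_text, var_name="gallery_items"):
    """Return the list assigned to `var <var_name> = [...]` in a script's text.

    Hypothetical helper for illustration; not part of any library.
    """
    marker = f"var {var_name} ="
    marker_pos = script_text.find(marker)
    if marker_pos == -1:
        raise ValueError(f"no assignment to {var_name!r} found")
    # raw_decode does not skip leading whitespace, so start exactly at the '['.
    array_start = script_text.index("[", marker_pos + len(marker))
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so trailing JavaScript or semicolons inside captions do no harm.
    items, _end = json.JSONDecoder().raw_decode(script_text, array_start)
    return items


# Made-up sample in the same shape as the extracted text (an assumption).
sample = r'''
var gallery_items = [{"type": "image", "large": "https:\/\/www.example.com\/a-675x450.jpg", "caption": "one; two"},
                     {"type": "image", "large": "https:\/\/www.example.com\/b-675x450.jpg", "caption": ""}];
var unrelated = 1;
'''

large_urls = [item["large"] for item in extract_gallery_items(sample)]
for url in large_urls:
    print(url)
```

Because the values come out of the JSON parser as ordinary Python strings, there are no quotation marks or backslashes left to strip, and the loop covers every image in the array.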
Hope this helps! Cheers!