javascript - Scraping a Javascript variable into Python
Question
I want to scrape the following data from http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/:
var hoodFeatures = {
type: "FeatureCollection",
features: [{
type: "Feature",
properties: {
name: "Koreatown",
slug: "koreatown",
url: "/neighborhoods/neighborhood/koreatown/",
has_statistics: true,
label: 'Rank: 1<br>Population per Sqmi: 42,611',
population: "115,070",
stratum: "high"
},
geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }
},
From the HTML above, I want to get, for each feature:
name
population per sqmi
population
geometry
and convert the result into a data frame keyed by name.
So far I have tried:
import requests
import json
from bs4 import BeautifulSoup
response_obj = requests.get('http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/').text
soup = BeautifulSoup(response_obj,'lxml')
This object contains the script information, but I don't understand how to use the json module as suggested in this thread: Parsing variable data out of a javascript tag using python
json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
value = json.loads(json_text)
value
I get this error:
TypeError Traceback (most recent call last)
<ipython-input-12-37c4c0188ed0> in <module>
1 #Splits the text on the first bracket and last bracket of the javascript into JSON format
----> 2 json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
3 value = json.loads(json_text)
4 value
5 #import pprint
TypeError: 'NoneType' object is not callable
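Any suggestions? Thanks.
Solution
I'm not sure how to do this with Beautiful Soup, but another option might be to design a regular expression that extracts the values we want: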
(?:name|population per sqmi|population)\s*:\s*"?(.*?)\s*["']|(?:geometry)\s*:\s*({.*})
Test
import re
regex = r"(?:name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']|(?:geometry)\s*:\s*({.*})"
test_str = ("var hoodFeatures = {\n"
" type: \"FeatureCollection\",\n"
" features: [{\n"
" type: \"Feature\",\n"
" properties: {\n"
" name: \"Koreatown\",\n"
" slug: \"koreatown\",\n"
" url: \"/neighborhoods/neighborhood/koreatown/\",\n"
" has_statistics: true,\n"
" label: 'Rank: 1<br>Population per Sqmi: 42,611',\n"
" population: \"115,070\",\n"
" stratum: \"high\"\n"
" },\n"
" geometry: { \"type\": \"MultiPolygon\", \"coordinates\": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }\n"
" },\n")
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report every capture group of this match (groups are numbered from 1)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
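As a side note, the TypeError in the question most likely arises because soup is a BeautifulSoup object rather than a string: an unknown attribute such as soup.partition is treated as a lookup for a <partition> tag and returns None, which is then called. Working on the raw response text (response_obj in the question) avoids this. The following is a minimal sketch of collecting the four requested fields into one record per neighborhood. It assumes every feature follows the layout shown in the question (name, then the label with "Population per Sqmi", then population, then geometry). SAMPLE_JS, PROPS_RE and parse_features are hypothetical names introduced here for illustration, and the sample geometry is shortened.

```python
import json
import re

# One feature copied from the question (coordinates shortened for brevity).
SAMPLE_JS = r'''
var hoodFeatures = {
    type: "FeatureCollection",
    features: [{
        type: "Feature",
        properties: {
            name: "Koreatown",
            slug: "koreatown",
            url: "/neighborhoods/neighborhood/koreatown/",
            has_statistics: true,
            label: 'Rank: 1<br>Population per Sqmi: 42,611',
            population: "115,070",
            stratum: "high"
        },
        geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.286908, 34.076510 ] ] ] ] }
    },
'''

# Assumption: the three properties always appear in this order within a feature.
PROPS_RE = re.compile(
    r'name:\s*"(?P<name>[^"]*)"'
    r'.*?Population per Sqmi:\s*(?P<density>[\d,]+)'
    r'.*?population:\s*"(?P<population>[\d,]+)"',
    re.DOTALL,
)


def parse_features(js_text):
    """Return one dict per feature found in the script text."""
    decoder = json.JSONDecoder()
    rows = []
    for m in PROPS_RE.finditer(js_text):
        # The geometry value itself is valid JSON, so decode it in place,
        # starting at the first "{" after the "geometry:" key.
        geom_key = js_text.index("geometry:", m.end())
        geom_start = js_text.index("{", geom_key)
        geometry, _ = decoder.raw_decode(js_text, geom_start)
        rows.append({
            "name": m.group("name"),
            "population_per_sqmi": int(m.group("density").replace(",", "")),
            "population": int(m.group("population").replace(",", "")),
            "geometry": geometry,
        })
    return rows


rows = parse_features(SAMPLE_JS)
print(rows[0]["name"], rows[0]["population_per_sqmi"], rows[0]["population"])
```

With the real page, js_text would be the response text rather than SAMPLE_JS. If pandas is available, pandas.DataFrame(rows).set_index("name") should give the data frame keyed by name that the question asks for. Features that lack one of the properties would need extra handling, since the pattern could otherwise span two features.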