首页 > 解决方案 > 将 Javascript 变量抓取到 Python 中

问题描述

我想从http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/抓取以下数据:

  var hoodFeatures = {
            type: "FeatureCollection",
            features: [{
                type: "Feature",
                properties: {
                    name: "Koreatown",
                    slug: "koreatown",
                    url: "/neighborhoods/neighborhood/koreatown/",
                    has_statistics: true,
                    label: 'Rank: 1<br>Population per Sqmi: 42,611',
                    population: "115,070",
                    stratum: "high"
                },
                geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }
            },

从上面的 html 中,我想获取每个:

name
population per sqmi
population
geometry

并按名称将其转换为数据框

到目前为止我已经尝试过

import requests
import json
from bs4 import BeautifulSoup

response_obj = requests.get('http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/').text
soup = BeautifulSoup(response_obj,'lxml')

该对象具有脚本信息,但我不明白如何使用该线程中建议的 json 模块: Parsing variable data out of a javascript tag using python

json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
value = json.loads(json_text)
value

我收到这个错误

TypeError                                 Traceback (most recent call last)
<ipython-input-12-37c4c0188ed0> in <module>
      1 #Splits the text on the first bracket and last bracket of the javascript into JSON format
----> 2 json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
      3 value = json.loads(json_text)
      4 value
      5 #import pprint

TypeError: 'NoneType' object is not callable

有什么建议么?谢谢

标签: javascriptpythonbeautifulsoup

解决方案


我不太确定如何用漂亮的汤做到这一点,但另一种选择可能是设计一个表达式并提取我们想要的值:

(?:name|population per sqmi|population)\s*:\s*"?(.*?)\s*["']|(?:geometry)\s*:\s*({.*})

演示

测试

import re

regex = r"(?:name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']|(?:geometry)\s*:\s*({.*})"

test_str = ("var hoodFeatures = {\n"
    "            type: \"FeatureCollection\",\n"
    "            features: [{\n"
    "                type: \"Feature\",\n"
    "                properties: {\n"
    "                    name: \"Koreatown\",\n"
    "                    slug: \"koreatown\",\n"
    "                    url: \"/neighborhoods/neighborhood/koreatown/\",\n"
    "                    has_statistics: true,\n"
    "                    label: 'Rank: 1<br>Population per Sqmi: 42,611',\n"
    "                    population: \"115,070\",\n"
    "                    stratum: \"high\"\n"
    "                },\n"
    "                geometry: { \"type\": \"MultiPolygon\", \"coordinates\": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }\n"
    "            },\n")

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

推荐阅读