Web scraping does not work for all links. Error: Expecting ',' delimiter: line 1 column 2198 (char 2197)

Problem description

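I am scraping web pages to retrieve the data behind a chart, for several URLs that all share the same page format. This is the code, where list_resorts is the list of URLs to look up: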

for resort in list_resorts:
    html_doc = requests.get(resort).text
    data = re.search(r"wpDataCharts\[.*?\] = ({.*})", html_doc).group(1)
    data = re.sub(r"([a-z_]+):", r'"\1":', data)
    data = re.sub(r'"http":', "http:", data)
    data = json.loads(data)
    for series in data["render_data"]["options"]["series"]:
        for i in range(0,len(data["render_data"]["options"]["xAxis"]["categories"])):
            df.at[resort,series["name"]+"_"+str(data["render_data"]["options"]["xAxis"]["categories"][i])]=series["data"][i]
df

So my goal is to extract the numbers of interest and put them into a previously created df.

For some URLs it works perfectly, while for others it raises this error: JSONDecodeError: Expecting ',' delimiter: line 1 column 2198 (char 2197). For example, this one: ski-resort-stats.com/westendorf-skiwelt-wilder-kaiser-brixental.

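I tried to look into the differences between the working URLs and the failing ones, but I really don't understand what is going on. Can anyone help?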

Tags: python, web-scraping, beautifulsoup

Solution


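Try the following. The only change from the question's code is the second re.sub, which now reverts both "http": and "https": (the first re.sub quotes every lowercase word followed by a colon, including the scheme of any URL in the data):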

import re
import json
import requests

# url = "https://ski-resort-stats.com/Hemsedal/"
url = "https://ski-resort-stats.com/westendorf-skiwelt-wilder-kaiser-brixental/"
html_doc = requests.get(url).text

data = re.search(r"wpDataCharts\[.*?\] = ({.*})", html_doc).group(1)
data = re.sub(r"([a-z_]+):", r'"\1":', data)
data = re.sub(r'"(https?)":', r"\1:", data)
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for series in data["render_data"]["options"]["series"]:
    print(series["name"], series["data"])

print()
print("week =", data["render_data"]["options"]["xAxis"]["categories"])

Prints:

2013-2014 [0, 0, 0, 0, 21, 32.5, 32.5, 32, 32.5, 30.5, 31, 50.5, 64, 55, 53, 50, 49.5, 40.5, 37.5, 27, 0, 0, 0]
2014-2015 [0, 0, 0, 0, 9, 11, 12.5, 31.5, 77.5, 47, 39, 57.5, 85, 90, 74, 60, 62, 62, 48.5, 48.5, 40.5, 35, 20.5]
2015-2016 [0, 0, 0, 3.5, 20, 20, 20, 17, 23, 29.5, 63, 60, 57, 60, 60, 59, 55, 55, 55, 55, 50.5, 8.5, 0]
2016-2017 [0, 0, 0, 0, 18, 31, 36.5, 30, 35, 59, 69, 70, 68, 65, 63, 55, 56, 60, 46, 28.5, 0, 0, 0]
2017-2018 [0, 0, 0, 35, 65, 65, 90, 92.5, 92.5, 82.5, 75, 122.5, 117.5, 127.5, 127.5, 127.5, 127.5, 127.5, 112.5, 87.5, 87.5, 67.5, 0]

week = [45, 46, 47, 48, 49, 50, 51, 52, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
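
To see why only some pages fail, the failure can be reproduced offline on a small made-up snippet. The sketch below is not taken from the site: SAMPLE_HTML, parse_chart, chart_to_row and the scheme_pattern parameter are hypothetical names, and the snippet assumes (as the working regex implies) that only the top-level keys are unquoted and that the data contains an https URL.

```python
import json
import re

# Hypothetical stand-in for a page's inline script; real pages are much larger.
SAMPLE_HTML = (
    'wpDataCharts[12] = {render_data: {"options": {'
    '"xAxis": {"categories": [45, 46]}, '
    '"series": [{"name": "2013-2014", "data": [0, 21]}]}}, '
    'engine: "chartjs", url: "https://example.com/chart"};'
)


def parse_chart(html_doc, scheme_pattern="https?"):
    """Quote the bare JS keys, then undo the quoting of URL schemes."""
    data = re.search(r"wpDataCharts\[.*?\] = ({.*})", html_doc).group(1)
    # Quotes render_data:, engine:, url: and, as a side effect, https:
    data = re.sub(r"([a-z_]+):", r'"\1":', data)
    # Turns "http": / "https": back into http: / https:
    data = re.sub(rf'"({scheme_pattern})":', r"\1:", data)
    return json.loads(data)


def chart_to_row(chart):
    """Flatten one chart into {"<series name>_<week>": value} pairs."""
    options = chart["render_data"]["options"]
    weeks = options["xAxis"]["categories"]
    return {
        f'{series["name"]}_{week}': value
        for series in options["series"]
        for week, value in zip(weeks, series["data"])
    }


try:
    # scheme_pattern="http" mimics the question's code, which leaves "https": quoted
    parse_chart(SAMPLE_HTML, scheme_pattern="http")
except json.JSONDecodeError as exc:
    print("question's version fails:", exc.msg)

print(chart_to_row(parse_chart(SAMPLE_HTML)))
```

Inside the question's loop, each key/value pair from chart_to_row could then be written with df.at[resort, key] = value, which keeps the column naming used there.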
