Scrape latitude and longitude locations obtained from Mapbox

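Question

I'm working on a project with the Divvy dataset.

I want to scrape the information for every suggested location, along with its comments, from http://suggest.divvybikes.com/.

Can I scrape this information from Mapbox? It is displayed on the map, so the data must be available somewhere.

Tags: python, web-scraping

Answer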


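I visited the page and logged my network traffic with Google Chrome's developer tools. Filtering the requests to show only XHR (XMLHttpRequest) requests, I saw a lot of HTTP GET requests to various REST APIs. These REST APIs return JSON, which is ideal. Only two of them seem relevant for your purposes: one for places, and one for the comments associated with those places. The places API's JSON contains the interesting information, such as place ids and coordinates. The comments API's JSON contains all comments for a specific place, identified by its id. Mimicking these calls is straightforward with the third-party requests module. Fortunately, the APIs don't seem to care about request headers. The query-string parameters (the params dictionary) do need to be well-formulated, of course.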
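To make the role of the params dictionary concrete, here is a minimal sketch of how such a dictionary ends up in the request URL. It builds the URL by hand with the standard library's urllib.parse.urlencode, which approximates what requests does with its params argument. The helper name build_places_url is made up for illustration; the parameter names page and include_submissions are the ones seen in the network log.

```python
from urllib.parse import urlencode

API_URL = "http://suggest.divvybikes.com/api/places"


def build_places_url(page_nr):
    """Return the URL for one page of the places API (no request is sent)."""
    # Hypothetical helper: it only shows how the query string is assembled.
    params = {
        "page": str(page_nr),
        "include_submissions": "true",
    }
    return "{}?{}".format(API_URL, urlencode(params))


print(build_places_url(1))
# http://suggest.divvybikes.com/api/places?page=1&include_submissions=true
```

Pasting a URL like this into the browser is a quick way to inspect the raw JSON before writing any scraping code.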

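I came up with the following two functions. get_places makes multiple calls to the same API, each time with a different page query-string parameter. "Page" seems to be the term they use internally to split all their data into chunks: all the places/features/stations are split across multiple pages, and each API call returns only one page. The loop accumulates all places in one big list and keeps going until we receive a response that tells us there are no more pages. Once the loop ends, we return the list of places.

The other function is get_comments, which takes a place id (string) as a parameter. It makes an HTTP GET request to the appropriate API and returns a list of comments for that place. This list may be empty if there are no comments.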
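The page-accumulation logic can be tried offline before touching the live API. This is only a sketch: fetch_page is a hypothetical callable standing in for the HTTP call, and the fake pages assume the response shape described above (a "features" list plus a "metadata" object whose "next" entry is None on the last page).

```python
def collect_all_pages(fetch_page):
    """Accumulate features from every page until no next page is reported.

    fetch_page is a hypothetical callable: it takes a page number and
    returns the decoded JSON for that page as a dict.
    """
    features = []
    page_nr = 1
    while True:
        content = fetch_page(page_nr)
        features.extend(content["features"])
        # The last page is assumed to carry "next": None in its metadata.
        if content["metadata"]["next"] is None:
            break
        page_nr += 1
    return features


# Fake three-page API, made up purely for this demonstration.
FAKE_PAGES = {
    1: {"features": [{"id": 1}, {"id": 2}], "metadata": {"next": "page=2"}},
    2: {"features": [{"id": 3}], "metadata": {"next": "page=3"}},
    3: {"features": [{"id": 4}], "metadata": {"next": None}},
}

all_features = collect_all_pages(FAKE_PAGES.__getitem__)
print([f["id"] for f in all_features])  # [1, 2, 3, 4]
```

In the real functions below, the role of fetch_page is played by the requests.get call followed by response.json().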

import requests
from itertools import count


def get_places():
    """Return every place (feature) from the places API, across all pages."""
    api_url = "http://suggest.divvybikes.com/api/places"

    places = []

    # Request page 1, 2, 3, ... until the response reports no next page.
    for page_nr in count(1):

        params = {
            "page": str(page_nr),
            "include_submissions": "true"
        }

        response = requests.get(api_url, params=params)
        response.raise_for_status()

        content = response.json()

        places.extend(content["features"])

        if content["metadata"]["next"] is None:
            break

    return places


def get_comments(place_id):
    """Return the list of comments for the given place id (may be empty)."""
    api_url = "http://suggest.divvybikes.com/api/places/{}/comments".format(place_id)

    response = requests.get(api_url)
    response.raise_for_status()

    return response.json()["results"]


def main():

    from operator import itemgetter

    places = get_places()

    place_id = places[12]["id"]

    print("Printing comments for the thirteenth place (id: {})\n".format(place_id))

    for comment in map(itemgetter("comment"), get_comments(place_id)):
        print(comment)

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Printing comments for the thirteenth place (id: 107062)

I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette.  Please, please, please contact me directly.  Thanks.

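For this example, I print all comments for the thirteenth place in the list (index 12). I picked that one because it is the first place that actually has comments (indices 0 through 11 have none; most places don't seem to have any). In this case, the place has just one comment.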
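Rather than hard-coding index 12, the first place with comments could be located programmatically. A minimal sketch, assuming a hypothetical fetch_comments callable that behaves like get_comments (takes a place id, returns a possibly empty list of comment dicts); the demo data is made up so the snippet runs without network access.

```python
def first_place_with_comments(places, fetch_comments):
    """Return (index, place, comments) for the first place that has comments.

    fetch_comments is a hypothetical stand-in for get_comments.
    Returns None if no place has any comments.
    """
    for index, place in enumerate(places):
        comments = fetch_comments(place["id"])
        if comments:
            return index, place, comments
    return None


# Made-up data for demonstration; real places would come from get_places().
demo_places = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
demo_comments = {
    "a": [],
    "b": [],
    "c": [{"comment": "Please add a station here."}],
}

result = first_place_with_comments(demo_places, demo_comments.__getitem__)
print(result[0], result[1]["id"], result[2][0]["comment"])
# 2 c Please add a station here.
```

Against the live API this would be called as first_place_with_comments(get_places(), get_comments). Note that it issues one request per place until it finds a match.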


EDIT - If you want to save the place ids, latitude, longitude and comments in a CSV, you can try changing the main function to:

def main():

    import csv

    print("Getting places...")
    places = get_places()
    print("Got all places.")

    # The coordinates arrive as [longitude, latitude] (the first value is
    # around -87 for Chicago), so the columns are named in that order.
    fieldnames = ["place id", "longitude", "latitude", "comments"]

    print("Writing to CSV file...")

    # newline="" stops the csv module from writing blank lines between rows on Windows.
    with open("output.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames)
        writer.writeheader()

        num_places_to_write = 25

        for place_nr, place in enumerate(places[:num_places_to_write], start=1):
            print("Writing place #{}/{} with id {}".format(place_nr, num_places_to_write, place["id"]))
            longitude, latitude = place["geometry"]["coordinates"][:2]
            comments = [c["comment"] for c in get_comments(place["id"])]
            writer.writerow({
                "place id": place["id"],
                "longitude": longitude,
                "latitude": latitude,
                "comments": comments
            })

    return 0

With this, I got results like the following:

place id,longitude,latitude,comments
107098,-87.6711076553,41.9718155716,[]
107097,-87.759540081,42.0121073671,[]
107096,-87.747695446,42.0263916146,[]
107090,-87.6642036438,42.0162096564,[]
107089,-87.6609444613,41.8852953922,[]
107083,-87.6007853815,41.8199433342,[]
107082,-87.6355862613,41.8532736671,[]
107075,-87.6210737228,41.8862644836,[]
107074,-87.6210737228,41.8862644836,[]
107073,-87.6210737228,41.8862644836,[]
107065,-87.6499611139,41.9627251578,[]
107064,-87.6136027649,41.8332984674,[]
107062,-87.7073025402,42.0760990584,"[""I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette.  Please, please, please contact me directly.  Thanks.""]"

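In this case, I used list-slicing syntax (places[:num_places_to_write]) to write only the first 25 places to the CSV file, for demonstration purposes only. However, after the first thirteen were written, I got this exception message: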

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

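So I guess the comments API doesn't expect to receive so many requests in such a short time. You may have to sleep for a bit inside the loop to get around this. It's also possible that the API doesn't care and the request just happened to time out.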
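One way to combine "sleep for a bit" with tolerance for the occasional timeout is a small retry wrapper. This is a sketch under assumptions, not a known fix: call_with_retries is a made-up helper, the attempt count and delays are arbitrary, and whether the API actually rate-limits is unknown. The demonstration uses a fake function and records the pauses instead of really sleeping.

```python
import time


def call_with_retries(func, *args, attempts=3, delay=1.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(*args), pausing and retrying when it raises.

    The pause doubles after each failed attempt; the last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2


# Fake stand-in for get_comments that fails twice, then succeeds.
calls = {"count": 0}


def flaky_get_comments(place_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return [{"comment": "ok"}]


pauses = []  # collects the requested pauses instead of actually sleeping
comments = call_with_retries(flaky_get_comments, 107062,
                             delay=0.5, sleep=pauses.append)
print(comments, pauses)  # [{'comment': 'ok'}] [0.5, 1.0]
```

With the real function, the call inside the CSV loop would become call_with_retries(get_comments, place["id"], exceptions=(requests.exceptions.RequestException,)), optionally followed by a short time.sleep between places. Passing a timeout to requests.get (for example timeout=10) also keeps a stalled request from hanging for long.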

