Loading JSON from a URL with Python pandas

Problem description

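I am preparing learning materials for my students. For convenience, I would like to access the data from a URL rather than ask them to download it in advance. In this example, I am trying to access the bird drawings from Google's Quick, Draw! dataset.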

Here is a working example that reads a locally stored copy of the data, with the results shown in comments:

import pandas as pd
import os
import json
from glob import glob

# Convert top row to one dict
top_row_dict = lambda in_df: list(in_df.head(1).T.to_dict().values())[0]
# Load file from computer
base_dir = os.path.join('input', 'quickdraw_simplified')
obj_files = glob(os.path.join(base_dir, '*.ndjson'))
print(obj_files[0])
# input\quickdraw_simplified\full_simplified_bird.ndjson

c_json = pd.read_json(obj_files[0], lines = True, chunksize = 1)
# <pandas.io.json._json.JsonReader at 0x158ae631f10>

f_row = next(c_json)
# word  countrycode     timestamp   recognized  key_id  drawing
# 0     bird    US  2017-03-09 00:28:55.637750+00:00    True    4926006882205696    [[[0, 11, 23, 50, 72, 96, 97, 132, 158, 224, 2...

f_dict = top_row_dict(f_row)
# {'word': 'bird',
#  'countrycode': 'US',
#  'timestamp': Timestamp('2017-03-09 00:28:55.637750+0000', tz='UTC'),
#  'recognized': True,
#  'key_id': 4926006882205696,
#  'drawing': [[[0, 11, 23, 50, 72, 96, 97, 132, 158, 224, 255],
#    [22, 9, 2, 0, 26, 45, 71, 40, 27, 10, 9]]]}

However, when I try to do the same thing using the API link, it fails:

import pandas as pd
import json

top_row_dict = lambda in_df: list(in_df.head(1).T.to_dict().values())[0]

url = 'https://console.cloud.google.com/storage/browser/quickdraw_dataset/full/simplified/bird.ndjson'
# Load dataset
c_json = pd.read_json(url, lines = True, chunksize = 1)
# <pandas.io.json._json.JsonReader at 0x24980a20a90>
f_row = next(c_json)
# __
f_dict = top_row_dict(f_row)
# IndexError: list index out of range

Tags: python, json, pandas, google-cloud-platform

Solution


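The URL you are trying to use requires a login because it links to the Cloud Console, so pandas receives a web page rather than the NDJSON file.

However, the dataset is stored in a publicly accessible Google Cloud Storage bucket.

This means you can use the google-cloud-storage package (http://pypi.org/p/google-cloud-storage) to load the file directly from the bucket.

Something like: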

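import pandas as pd
from google.cloud import storage

# The bucket is public, so an anonymous client avoids the need for credentials
client = storage.Client.create_anonymous_client()
bucket = client.bucket('quickdraw_dataset')
blob = bucket.blob('full/simplified/bird.ndjson')

# read_json expects a path or file-like object, so open the blob as a text
# stream (blob.open needs a reasonably recent google-cloud-storage release)
with blob.open('r') as f:
    c_json = pd.read_json(f, lines = True, chunksize = 1)
    f_row = next(c_json)
...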
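
As an alternative that avoids the extra dependency, a public object can usually be read over plain HTTPS. The sketch below assumes the standard public endpoint pattern https://storage.googleapis.com/<bucket>/<object> applies to this bucket. The helper names public_gcs_url and first_record are hypothetical and are not part of pandas or any Google library.

```python
import pandas as pd


def public_gcs_url(bucket: str, blob_path: str) -> str:
    """Build the HTTPS URL for an object in a publicly readable bucket.

    Assumption: the object is served from the standard public endpoint.
    """
    return f"https://storage.googleapis.com/{bucket}/{blob_path}"


def first_record(source) -> dict:
    """Return the first NDJSON record from a path, URL, or file-like object."""
    reader = pd.read_json(source, lines=True, chunksize=1)
    try:
        first_chunk = next(reader)  # a one-row DataFrame
    finally:
        reader.close()
    return first_chunk.iloc[0].to_dict()


url = public_gcs_url("quickdraw_dataset", "full/simplified/bird.ndjson")
# f_dict = first_record(url)  # performs a network request; uncomment to run
```

Note that pandas may download the whole file before yielding the first chunk when given a URL, so chunksize limits parsing work rather than transfer size. For classroom use it may still be worth caching a local copy.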
