Memory error when loading a very large JSON file from a URL

Problem description

I've seen answers to similar questions use a library like ijson, but I can't figure out how to use it to solve my problem.

Here is my code so far:

import urllib.request
import ijson

f = urllib.request.urlopen("https://data.medicaid.gov/resource/4qik-skk9.json?$limit=646259")
stuff = ijson.items(f, "")
for items in stuff:
    print(items)

This is what the JSON structure looks like:

[
    {
        "package_size_code": "60",
        "fda_ther_equiv_code": "NR",
        "fda_application_number": "204153",
        "clotting_factor_indicator": "N",
        "year": "2018",
        "fda_product_name": "LUZU Cream 1% 60gm",
        "labeler_name": "MEDICIS DERMATOLOGICS, INC.",
        "ndc": "99207085060",
        "product_code": "0850",
        "unit_type": "GM",
        "fda_approval_date": "2013-11-14T00:00:00",
        "market_date": "2014-03-14T00:00:00",
        "pediatric_indicator": "N",
        "package_size_intro_date": "2014-03-14T00:00:00",
        "units_per_pkg_size": "60000",
        "labeler_code": "99207",
        "desi_indicator": "1",
        "drug_category": "S",
        "quarter": "3",
        "cod_status": "3"
    },
    {
        "package_size_code": "60",
        "fda_ther_equiv_code": "AB",
        "fda_application_number": "21758",
        "clotting_factor_indicator": "N",
        "year": "2018",
        "fda_product_name": "VANOS CREAM .1%",
        "labeler_name": "MEDICIS DERMATOLOGICS, INC.",
        "ndc": "99207052560",
        "product_code": "0525",
        "unit_type": "GM",
        "fda_approval_date": "2005-02-11T00:00:00",
        "market_date": "2005-02-21T00:00:00",
        "pediatric_indicator": "N",
        "package_size_intro_date": "2005-02-21T00:00:00",
        "units_per_pkg_size": "60000",
        "labeler_code": "99207",
        "desi_indicator": "1",
        "drug_category": "I",
        "quarter": "3",
        "cod_status": "3"
    },
.
.
.
.
]

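What I want to do is fetch all the results and apply some filters to them to extract values. For example, I'm trying to get the maximum value of year across all entries. For that, I think I need to read all of the data.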

This seems to solve the memory problem:

parser = ijson.parse(urllib.request.urlopen('https://data.medicaid.gov/resource/4qik-skk9.json?$limit=646259'))
for prefix, event, value in parser:
    # print(prefix)
    print(event)
    # print(value)

The data doesn't come out very tidy, but it's progress.
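One way the event stream might be tidied up is to keep only the events whose prefix points at the field of interest. The sketch below assumes the structure shown earlier (a top-level array of objects with a string-valued "year"), in which case ijson.parse reports each year under the prefix "item.year" with the event "string". The helper name years_from_events is made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def years_from_events(events: Iterable[Tuple[str, str, object]]) -> Iterator[int]:
    """Yield each "year" value as an int from (prefix, event, value) triples.

    Hypothetical helper: it expects triples shaped like those produced by
    ijson.parse for a top-level array of objects.
    """
    for prefix, event, value in events:
        # "item" is the prefix ijson uses for elements of an array,
        # so "item.year" is the "year" key of each array element.
        if prefix == "item.year" and event == "string":
            yield int(value)


# Possible usage with the stream from the question (not run here):
#   parser = ijson.parse(urllib.request.urlopen(url))
#   print(max(years_from_events(parser)))
```

Because the generator only keeps one event at a time, memory use should stay flat regardless of how many records the response contains.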

Tags: python, json

Solution
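A likely cause of the memory error is the empty prefix passed to ijson.items: it matches the whole document, so the entire array is built as a single item. Passing "item" instead yields the elements of the top-level array one by one. The following is a minimal sketch under that assumption, not a tested fix for this exact endpoint; the function names max_year and stream_records are illustrative, and ijson is a third-party package that has to be installed separately.

```python
import urllib.request
from typing import Iterable, Iterator, Mapping, Optional

URL = "https://data.medicaid.gov/resource/4qik-skk9.json?$limit=646259"


def max_year(records: Iterable[Mapping[str, str]]) -> Optional[int]:
    """Return the largest "year" seen, holding one record at a time."""
    best = None
    for record in records:
        year = record.get("year")
        if year is None:
            continue  # skip records that have no "year" field
        year = int(year)  # the sample data stores years as strings, e.g. "2018"
        if best is None or year > best:
            best = year
    return best


def stream_records(url: str) -> Iterator[dict]:
    """Yield each object of the top-level JSON array as it is parsed."""
    import ijson  # third-party: pip install ijson

    with urllib.request.urlopen(url) as response:
        # "item" addresses every element of the top-level array
        yield from ijson.items(response, "item")


# Possible usage (performs a large download, so it is left commented out):
#   print(max_year(stream_records(URL)))
```

Keeping the aggregation in max_year separate from the download makes it easy to swap in other filters, since any function that consumes an iterable of records can be fed from the same stream.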

