How to search and copy an item given the ID in a large json file

Problem description

I have two large files: a text file of IDs and a JSON file.

I need to search for those IDs in a certain field of the JSON file and, for each match, copy the whole object it refers to into a new file for later analysis.

I give an example:

IDs.txt

    unique_id_1
    unique_id_2
    ...

schema.json

[
    {
        "id": "unique_id_1",
        "name": "",
        "text": "",
        "date": ""
    },
    {
        "id": "unique_id_aaa",
        "name": "",
        "text": "",
        "date": ""
    },
    {
        "id": "unique_id_2",
        "name": "",
        "text": "",
        "date": ""
    },
    ...
]

I am doing this analysis with Python and pandas, but I am running into trouble because of the size of the files. What is the best way to do this? I am also open to other software or languages.

Tags: python, json, pandas, bigdata

Solution


I implemented my second suggestion: this only works if the schema is flat (there are no nested objects in the JSON file). I also did not check what happens when a value in the JSON file is itself a dictionary; that case would probably need more careful handling, since I currently just check for a "}" at the start of a line to decide that the object is over.

You still need to load the entire IDs file, since you need some way to check whether each object is wanted.

If the useful_objects list grows too large, you can easily save it to disk periodically while parsing the file.
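That periodic-save idea could look like the following sketch. The JSON Lines output format, the flush helper, the output filename, and the tiny threshold are all my own illustrative choices, not part of the script below:

```python
import json
from pathlib import Path

FLUSH_EVERY = 2  # tiny threshold, just for demonstration

def flush(objects, out_path):
    # append one JSON document per line (JSON Lines), then clear the buffer,
    # so partial results survive even if parsing dies halfway through
    with Path(out_path).open("a") as out_f:
        for obj in objects:
            out_f.write(json.dumps(obj) + "\n")
    objects.clear()

out = Path("useful.jsonl")
out.unlink(missing_ok=True)  # start fresh for the demo

buffer = []
for obj in [{"id": "a"}, {"id": "b"}, {"id": "c"}]:
    buffer.append(obj)
    # in the real script this check would sit inside the parsing loop,
    # right after appending a matched object
    if len(buffer) >= FLUSH_EVERY:
        flush(buffer, out)
flush(buffer, out)  # write whatever is left at the end

print(out.read_text())
```

Appending line-delimited JSON keeps each save cheap: you never re-serialize objects that were already written.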

import json
from pathlib import Path
import re
from typing import Dict

schema_name = "schema.json"
schema_path = Path(schema_name)
ids_name = "IDs.txt"
ids_path = Path(ids_name)

# read the ids
useful_ids = set()
with ids_path.open() as id_f:
    for line in id_f:
        id_ = line.strip()
        useful_ids.add(id_)
print(useful_ids)

useful_objects = []
temp: Dict[str, str] = {}
was_useful = False

with schema_path.open() as sc_f:

    for line in sc_f:
        # remove start/end whitespace
        line = line.strip()
        print(f"Parsing line {line}")

        # an object is ending; startswith also safely skips blank lines
        if line.startswith("}"):
            # add it
            if was_useful:
                useful_objects.append(temp)
            # reset the usefulness for the next object
            was_useful = False
            # reset the temp object
            temp = {}

        # parse the line
        match = re.match(r'"(.*?)": "(.*)"', line)

        # if this did not match, skip the line
        if match is None:
            continue

        # extract the data from the regex match
        key = match.group(1)
        value = match.group(2)
        print(f"\tMatched: {key} {value}")

        # build the temp object incrementally
        temp[key] = value

        # check if this object is useful
        if key == "id" and value in useful_ids:
            was_useful = True

# save the matching objects to a new file
useful_json = json.dumps(useful_objects, indent=4)
with open("useful.json", "w") as out_f:
    out_f.write(useful_json)
print(useful_json)

Again, not very elegant and not very robust, but as long as you are aware of the limitations, it does the job.
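For comparison: if the JSON file does fit in memory after all, the whole task collapses to a set lookup with the standard json module. A minimal sketch (the sample data and filename are illustrative):

```python
import json
from pathlib import Path

# write a tiny sample file so the snippet is self-contained
sample = [
    {"id": "unique_id_1", "name": ""},
    {"id": "unique_id_aaa", "name": ""},
    {"id": "unique_id_2", "name": ""},
]
Path("schema_sample.json").write_text(json.dumps(sample))

useful_ids = {"unique_id_1", "unique_id_2"}

with open("schema_sample.json") as sc_f:
    data = json.load(sc_f)  # loads the whole array at once

# set membership is O(1), so filtering is linear in the number of objects
matches = [obj for obj in data if obj.get("id") in useful_ids]
print([obj["id"] for obj in matches])  # ['unique_id_1', 'unique_id_2']
```

This is worth trying first; the line-by-line parser above only pays off when json.load genuinely runs out of memory.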

Cheers!
