python - How to search and copy an item given the ID in a large json file
Problem description
I have two large files:
- one is a text file with a lot of IDs: one ID per row;
- the other one is a 6+ GB json file, containing many items.
I need to search for those IDs in a certain field of the json file and copy the whole item it refers to for later analysis (creating a new file).
Here is an example:
IDs.txt
unique_id_1
unique_id_2
...
schema.json
[
{
"id": "unique_id_1",
"name": "",
"text": "",
"date": "",
},
{
"id": "unique_id_aaa",
"name": "",
"text": "",
"date": "",
},
{
"id": "unique_id_2",
"name": "",
"text": "",
"date": "",
},
...
]
I am doing this analysis with Python (Pandas), but I am running into trouble because of the size of the files. What is the best way to do this? I can also consider using other software or languages.
Solution
I implemented my second suggestion. It only works if the schema is flat (there are no nested objects in the JSON file). I also did not check what happens if a value in the JSON file is a dictionary; that case should probably be handled more carefully, since I currently check for } at the start of a line to decide whether the object is over.
You still need to load the entire IDs file, because you need some way to check whether an object is needed.
If the useful_objects list grows too large, you can easily save it periodically while parsing the file.
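The periodic save mentioned above could look roughly like the sketch below. It is only an illustration: the helper name flush_batch, the batch size and the output file name are assumptions, not part of the original answer. Each object is appended as one JSON line, so the output never has to be held in memory as a whole.

```python
import json
from pathlib import Path
from typing import Dict, List


def flush_batch(batch: List[Dict[str, str]], out_path: Path) -> None:
    """Append every object in `batch` to `out_path` as one JSON line,
    then empty the list so the memory can be reused."""
    with out_path.open("a", encoding="utf-8") as out_f:
        for obj in batch:
            out_f.write(json.dumps(obj) + "\n")
    batch.clear()


# Hypothetical usage inside the parsing loop:
#
#     useful_objects.append(temp)
#     if len(useful_objects) >= 10_000:   # assumed batch size
#         flush_batch(useful_objects, Path("useful_objects.jsonl"))
#
# and one final flush_batch(...) call after the loop for the remainder.
```

The resulting file can be read back line by line with json.loads, which also keeps the later analysis step memory-friendly.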
import json
import re
from pathlib import Path
from typing import Dict, List

schema_path = Path("schema.json")
ids_path = Path("IDs.txt")

# read the ids into a set for fast membership checks
useful_ids = set()
with ids_path.open(encoding="utf-8") as id_f:
    for line in id_f:
        id_ = line.strip()
        if id_:
            useful_ids.add(id_)
print(f"Loaded {len(useful_ids)} ids")

useful_objects: List[Dict[str, str]] = []
temp: Dict[str, str] = {}
was_useful = False

with schema_path.open(encoding="utf-8") as sc_f:
    for line in sc_f:
        # remove start/end whitespace
        line = line.strip()
        # skip blank lines (indexing line[0] on them would raise IndexError)
        if not line:
            continue
        # an object is ending
        if line.startswith("}"):
            # add it
            if was_useful:
                useful_objects.append(temp)
            # reset the usefulness for the next object
            was_useful = False
            # reset the temp object
            temp = {}
            continue
        # parse the line; only flat "key": "string value" pairs are recognised
        match = re.match(r'"(.*?)": "(.*)"', line)
        # if this did not match, skip the line
        if match is None:
            continue
        # extract the data from the regex match
        key = match.group(1)
        value = match.group(2)
        # build the temp object incrementally
        temp[key] = value
        # check if this object is useful
        if key == "id" and value in useful_ids:
            was_useful = True

useful_json = json.dumps(useful_objects, indent=4)
print(useful_json)
Again, not very elegant and not very robust, but as long as you are aware of the limitations, it does the job.
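If nested objects do show up, one possible alternative (a different technique from the line-based parser above, not what the answer implemented) is to decode one array item at a time with the standard library's json.JSONDecoder.raw_decode. The sketch below assumes the file is strictly valid JSON (so no trailing commas as in the example in the question), that the top level is an array, and that every item is an object. The names iter_array_items and select_by_id are made up for this illustration.

```python
import json
from typing import IO, Iterator, Set


def iter_array_items(fp: IO[str], chunk_size: int = 65536) -> Iterator[dict]:
    """Yield the items of a top-level JSON array one by one,
    reading the file in chunks instead of loading it whole."""
    decoder = json.JSONDecoder()
    buf, pos, eof = "", 0, False
    while True:
        # Skip whitespace, the commas between items and the opening "[".
        # Items are assumed to be objects, so a bare "[" only occurs
        # at the very start of the file.
        while pos < len(buf) and buf[pos] in " \t\r\n,[":
            pos += 1
        if pos < len(buf):
            if buf[pos] == "]":
                return  # end of the top-level array
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                # Most likely the item is cut off at the chunk boundary;
                # if there is nothing left to read, the file is malformed.
                if eof:
                    raise
            else:
                yield obj
                continue
        elif eof:
            return
        # Drop what was already consumed and read more data.
        chunk = fp.read(chunk_size)
        eof = not chunk
        buf = buf[pos:] + chunk
        pos = 0


def select_by_id(fp: IO[str], wanted_ids: Set[str]) -> Iterator[dict]:
    """Yield only the items whose "id" field is in wanted_ids."""
    for item in iter_array_items(fp):
        if isinstance(item, dict) and item.get("id") in wanted_ids:
            yield item
```

With the files from the question this would be used as `with open("schema.json", encoding="utf-8") as f: for obj in select_by_id(f, useful_ids): ...`, writing each match out as it is found. Memory use stays around one chunk plus the largest single item, at the cost of re-parsing an item whenever it straddles a chunk boundary.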
Cheers!