Removing malformed JSON objects in Python

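Question

I'm building a chatbot database at the moment, using data from pushshift.io. To handle the large data files (I know json loads everything into RAM, so working with 30GB of data when you only have 16GB of RAM is a no-no), I wrote a bash script that splits the large file into smaller 3GB chunks so that I can run them through json.loads (or pd.read_json). Whenever I run my code, it returns this error: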

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

So I looked at the temp JSON file I had just created, and found that this is what happened in it:

ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}

A correct sample of the data looks like this:

{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}

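I noticed that my bash script split the file without paying attention to JSON object boundaries. So my question is: is there a way to write a function in Python that detects malformed JSON objects and removes them?

Tags: python, json

Solution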


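There isn't much information to go on, but I'll challenge the framing a little.

There are several incremental JSON parsers available for Python. A quick search shows that ijson should let you iterate over very large data structures without blowing up.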
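
As a minimal sketch of the incremental idea using only the standard library (not ijson itself): assuming the dump stores one JSON object per line, as the sample in the question suggests, each line can be parsed on its own, and any line that fails to decode, such as a fragment cut off at a chunk boundary, can simply be skipped. The names iter_valid_objects and load_chunk are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_valid_objects(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield each line that parses as a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise malformed fragment
        if isinstance(obj, dict):
            yield obj


def load_chunk(path: str) -> Iterator[Dict[str, Any]]:
    """Stream objects from one chunk file (assumed one object per line)."""
    with open(path, "r", encoding="utf-8") as handle:
        yield from iter_valid_objects(handle)
```

Because the file is read line by line, memory use stays roughly proportional to one object rather than to the whole chunk. Note that this drops the object that straddles two chunks; splitting the original file on line boundaries rather than byte offsets would avoid losing it.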

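You should also consider another data format (or a real database); otherwise you will easily find yourself spending time reimplementing much slower versions of features that already exist in the right tools.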
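
To illustrate what a real database could look like here, the sketch below loads parsed comments into SQLite via the standard-library sqlite3 module. It is only one possible choice, not something the answer prescribes. The table name comments, the function store_comments, and the selection of columns (taken from the fields visible in the question's sample) are assumptions.

```python
import sqlite3
from typing import Any, Dict, Iterable

# Columns kept from each comment object (an assumed subset of the sample's fields).
FIELDS = ("id", "parent_id", "subreddit", "author", "body", "score")


def store_comments(conn: sqlite3.Connection,
                   objects: Iterable[Dict[str, Any]]) -> int:
    """Insert comment objects into a SQLite table; return the number of rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "id TEXT PRIMARY KEY, parent_id TEXT, subreddit TEXT, "
        "author TEXT, body TEXT, score INTEGER)"
    )
    before = conn.total_changes
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO comments "
            "VALUES (:id, :parent_id, :subreddit, :author, :body, :score)",
            # Missing keys become NULL; duplicate ids are ignored.
            ({key: obj.get(key) for key in FIELDS} for obj in objects),
        )
    return conn.total_changes - before
```

Once the data is in a table, lookups such as finding all replies to a given parent_id become a query instead of a scan over 30GB of text.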

