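python - Read a large JSON file (> 5 GB) line by line, process each line, and create a DataFrame with Pandas
Question
I am reading a file line by line and processing each line, but I am not getting the output I need.
inputfile.txt: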
{"M":{"1":"data","2":"esf"},"D":{"4":12312,"6":"err"},"R":{"33":"eres","wer":454}}
{"M":{"1":"a","2":"2"},"D":{"4":3456,"6":"esrr"},"R":{"33":"esre","wer":447}}
{"M":{"1":"data3","2":"fer"},"D":{"4":9873,"6":"errs"},"R":{"33":"eret","wer":189,"55":"rt"}}
Code:
import pandas as pd;
import json
with open("inputfile.txt") as f:
    for line in f:
        data=(json.loads(f))
        d=[{k1+k2:v2 for k2,v2 in v1.items()} for k1,v1 in data.items()]
        keys=[k for x in d for k in x.items()]
        keys=list(set(keys))
        df=pd.DataFrame(d,columns=keys)
        print (df)
Desired output:
M1,M2,D4,D6,R33,Rwer,R55
data,esf,12312,err,eres,454,NA
a,2,3456,esrr,esre,447,NA
data3,fer,9873,errs,eret,189,rt
Solution
An extended solution that uses an intermediate text I/O buffer (io.StringIO, which also works as a context manager):
import pandas as pd
import json
import io

with open('input.json') as f, io.StringIO() as temp_file:
    for line in f:
        json_data = json.loads(line)
        # Flatten one level: outer key + inner key becomes the column name
        d = {k + sub_k: val for k, inner_d in json_data.items()
             for sub_k, val in inner_d.items()}
        # Write the flattened record back out as one JSON line
        temp_file.write(json.dumps(d) + '\n')
    # Rewind the buffer so pandas reads it from the start
    temp_file.seek(0)
    df = pd.read_json(temp_file, orient='columns', lines=True)
    print(df.to_string())
Sample output:
      D4    D6     M1   M2   R33  R55  Rwer
0  12312   err   data  esf  eres  NaN   454
1   3456  esrr      a    2  esre  NaN   447
2   9873  errs  data3  fer  eret   rt   189
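As a variation on the answer above, the flattened records can also be collected directly instead of being round-tripped through a text buffer. The following is only a minimal sketch: the helper names flatten_record and iter_flat_records are hypothetical (they do not appear in the original answer), and an in-memory io.StringIO stands in for the real input file. Writing the result with na_rep="NA" gives the comma-separated form with NA for missing values that the question asks for.

```python
import io
import json

import pandas as pd


def flatten_record(record):
    """Merge outer and inner keys, e.g. {"M": {"1": "a"}} -> {"M1": "a"}."""
    return {outer + inner: value
            for outer, inner_dict in record.items()
            for inner, value in inner_dict.items()}


def iter_flat_records(lines):
    """Yield one flattened dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield flatten_record(json.loads(line))


# In-memory stand-in for the input file (an assumption for this sketch);
# with real data, pass the open file object instead.
sample = io.StringIO(
    '{"M":{"1":"data","2":"esf"},"D":{"4":12312,"6":"err"},"R":{"33":"eres","wer":454}}\n'
    '{"M":{"1":"a","2":"2"},"D":{"4":3456,"6":"esrr"},"R":{"33":"esre","wer":447}}\n'
    '{"M":{"1":"data3","2":"fer"},"D":{"4":9873,"6":"errs"},"R":{"33":"eret","wer":189,"55":"rt"}}\n'
)

# Keys missing from a record (here "R55") become NaN in the DataFrame
df = pd.DataFrame(list(iter_flat_records(sample)))

# Render missing values as "NA" and drop the index column
csv_text = df.to_csv(index=False, na_rep="NA")
print(csv_text)
```

Note that this still holds every record in memory before the DataFrame is built, so for a file of several gigabytes it is subject to the same memory limits as the buffer-based version.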