Read a large JSON file (> 5 GB) line by line, process each line, and create a DataFrame with Pandas

Problem description

I am reading a file line by line and processing each line, but I am not getting the output I need.

inputfile.txt

{"M":{"1":"data","2":"esf"},"D":{"4":12312,"6":"err"},"R":{"33":"eres","wer":454}}
{"M":{"1":"a","2":"2"},"D":{"4":3456,"6":"esrr"},"R":{"33":"esre","wer":447}}
{"M":{"1":"data3","2":"fer"},"D":{"4":9873,"6":"errs"},"R":{"33":"eret","wer":189,"55":"rt"}}

Code:

import pandas as pd;
import json
with open("inputfile.txt") as f:
  for line in f:
    data=(json.loads(f))
    d=[{k1+k2:v2 for k2,v2 in v1.items()} for k1,v1 in data.items()]
    keys=[k for x in d for k in x.items()]
    keys=list(set(keys))
    df=pd.DataFrame(d,columns=keys)
    print (df)

Desired output:

M1,M2,D4,D6,R33,Rwer,R55
data,esf,12312,err,eres,454,NA
a,2,3456,esrr,esre,447,NA
data3,fer,9873,errs,eret,189,rt

Tags: python, pandas

Solution


An extended solution that uses an intermediate text I/O buffer (io.StringIO, which also acts as a context manager):

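import io
import json

import pandas as pd

# Flatten each JSON line into a single-level dict and stage it in an
# in-memory text buffer, then let pandas parse the buffer as JSON Lines.
with open('inputfile.txt') as f, io.StringIO() as temp_file:
    for line in f:
        json_data = json.loads(line)
        # Join outer and inner keys: {"M": {"1": ...}} becomes {"M1": ...}
        d = {k + sub_k: val for k, inner_d in json_data.items()
             for sub_k, val in inner_d.items()}
        temp_file.write(json.dumps(d) + '\n')
    temp_file.seek(0)  # rewind so pandas reads from the start

    df = pd.read_json(temp_file, orient='columns', lines=True)
    print(df.to_string())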

Sample output:

      D4    D6     M1   M2   R33  R55  Rwer
0  12312   err   data  esf  eres  NaN   454
1   3456  esrr      a    2  esre  NaN   447
2   9873  errs  data3  fer  eret   rt   189
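
For a file larger than 5 GB, holding every flattened line in memory before building the DataFrame may not be practical. A possible alternative, sketched here under assumptions and not part of the original answer, is to stream the input twice with only the standard library: once to collect the column names, and once to write a CSV in the layout the question asks for, with NA for missing keys. The helper names flatten, collect_columns and jsonl_to_csv are hypothetical, and the sketch assumes every line is a JSON object whose values are themselves objects.

```python
import csv
import json


def flatten(record):
    """Join each outer key with its inner keys: {"M": {"1": "a"}} -> {"M1": "a"}."""
    return {outer + inner: value
            for outer, inner_dict in record.items()
            for inner, value in inner_dict.items()}


def collect_columns(path):
    """First pass: gather flattened column names in order of first appearance."""
    columns = {}  # a dict keeps insertion order, so it serves as an ordered set
    with open(path) as f:
        for line in f:
            if line.strip():  # skip blank lines
                columns.update(dict.fromkeys(flatten(json.loads(line))))
    return list(columns)


def jsonl_to_csv(src_path, dst_path):
    """Second pass: write one CSV row per input line, using 'NA' for missing keys."""
    columns = collect_columns(src_path)
    with open(src_path) as src, open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=columns, restval="NA")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(flatten(json.loads(line)))
    return columns
```

Calling jsonl_to_csv("inputfile.txt", "output.csv") (both file names are placeholders) keeps only one line in memory at a time. The resulting CSV can then be loaded with pd.read_csv, optionally with its chunksize parameter if the whole table does not fit in memory either.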
