How to fix a Python function to iterate over a list of JSON files in a directory and merge them into a single JSON file

Problem description

I have a device that continuously generates JSON files (a.json, b.json, c.json, and so on) and stores them in a directory, as shown below.

"Data/d/a.json"
"Data/d/b.json"
"Data/d/c.json"
.
.
.
.
"Data/d/g.json"

Sample data in each JSON file

a.json

{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}

b.json

{"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}

c.json

{"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Tamba Trio","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":4,"lastName":"Summers","length":177.18812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Quem Quiser Encontrar O Amor","status":200,"ts":1541106496796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}

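The device can produce up to 1,000 JSON files per day and several thousand per week. To process the data further, I have to bulk-insert the contents of each JSON file into PostgreSQL, as shown in the code snippet below, but the current process is too manual and inefficient because I insert the files one after another.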

import json
import psycopg2

connection = psycopg2.connect("host=localhost dbname=devicedb user=#### password=####")
cursor = connection.cursor()
connection.set_session(autocommit=True)
cursor.execute("create table if not exists events_table(artist text, auth text, firstName text, gender varchar, itemInSession int, lastName text, length text, level text, location text, method varchar, page text, registration text, sessionId int, song text, status int, ts bigint, userAgent text, userId int );")

data = []
with open('Data/d/a.json') as f:
    for line in f:
        data.append(json.loads(line))

columns = [
    'artist',
    'auth',
    'firstName',
    'gender',
    'itemInSession',
    'lastName',
    'length',
    'level',
    'location',
    'method',
    'page',
    'registration',
    'sessionId',
    'song',
    'status',
    'ts',
    'userAgent',
    'userId'
]

for item in data:
    my_data = [item[column] for column in columns]
    for i, v in enumerate(my_data):
        if isinstance(v, dict):
            my_data[i] = json.dumps(v)

    insert_query = "INSERT INTO events_table VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    cursor.execute(insert_query, tuple(my_data))

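To improve the current process, I searched online and found the function below, which merges multiple files into one. My understanding was that I could define output_filename and input_filenames by pointing the former at merged.json and the latter at the directory containing my input JSON files, and then run the function, but it seems I was wrong. Can anyone tell me what I am doing wrong?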

def cat_json(output_filename, input_filenames):
    with file(output_filename, "w") as outfile:
        first = True
        for infile_name in input_filenames:
            with file(infile_name) as infile:
                if first:
                    outfile.write('[')
                    first = False
                else:
                    outfile.write(',')
                outfile.write(mangle(infile.read()))
        outfile.write(']')

output_filename = 'data/d/merged.json'
input_filenames = 'data/d/*.json'
cat_json(output_filename, input_filenames)

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-19-3ff012d91d76> in <module>()
      1 output_filename = 'data/d/merged.json'
      2 input_filenames = 'data/d/*.json'
----> 3 cat_json(output_filename, input_filenames)

<ipython-input-18-760b670f79b1> in cat_json(output_filename, input_filenames)
      1 def cat_json(output_filename, input_filenames):
----> 2     with file(output_filename, "w") as outfile:
      3         first = True
      4         for infile_name in input_filenames:
      5             with file(infile_name) as infile:

TypeError: 'str' object is not callable

@deusxmachine Thanks for your contribution. I changed the function as suggested to:

def cat_json(output_filename, input_filenames):
    with open(output_filename, "w") as outfile:
        first = True
        for infile_name in input_filenames:
            with open(infile_name) as infile:
                if first:
                    outfile.write('[')
                    first = False
                else:
                    outfile.write(',')
                outfile.write(mangle(infile.read()))
        outfile.write(']')

The code created the merged.json file, but it is empty, and I get the following error:

-------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-16-40d7387f704a> in <module>()
      1 output_filename = 'merged.json'
      2 input_filenames = 'data/d/*.json'
----> 3 cat_json(output_filename, input_filenames)

<ipython-input-15-951cbaba7765> in cat_json(output_filename, input_filenames)
      3         first = True
      4         for infile_name in input_filenames:
----> 5             with open(infile_name) as infile:
      6                 if first:
      7                     outfile.write('[')

FileNotFoundError: [Errno 2] No such file or directory: 'd'

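I cannot figure out why it raises the error above and says there is no such file or directory. a.json, b.json, c.json ... are in the directory "data/d/". Or do I need to list each file name instead of using *.json?

Tags: python, json

Solution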


I don't really understand what you mean by merging JSON, but I know why you are getting this error.

Instead of

with file(output_filename, "w") as outfile:

do this

with open(output_filename, "w") as outfile:

file is not a built-in function in Python 3; open is what you use to open files.
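
As for the follow-up FileNotFoundError: open() does not expand wildcards, and a for loop over the string 'data/d/*.json' visits it one character at a time, so the first name handed to open() is 'd'. Here is a minimal sketch of the difference, using the standard-library glob module to expand the pattern. The temporary directory and the empty placeholder files are made up for illustration and only stand in for data/d/.

```python
import glob
import os
import tempfile

# Iterating over a string yields single characters, not file names.
pattern = 'data/d/*.json'
first_item = next(iter(pattern))  # the name open() complained about

# A throwaway directory stands in for data/d/ (illustrative only).
demo_dir = tempfile.mkdtemp()
for name in ('a.json', 'b.json', 'c.json'):
    with open(os.path.join(demo_dir, name), 'w') as f:
        f.write('{}\n')

# glob.glob() expands the wildcard into a list of matching paths.
input_filenames = sorted(glob.glob(os.path.join(demo_dir, '*.json')))
basenames = [os.path.basename(path) for path in input_filenames]

print(first_item)
print(basenames)
```

So you do not have to list each file by hand; passing the list returned by glob.glob() as input_filenames should be enough to get past this error.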

Hope this helps.
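
If by merging you mean collecting every record into one JSON array, a sketch along these lines might work. It assumes, as in your sample data, that each input file holds one JSON object per line, and it parses every line with json.loads instead of calling mangle(), which is not defined anywhere in the question. The name merge_json_lines is hypothetical. The output file is skipped on purpose in case it sits in the same directory as the inputs.

```python
import glob
import json
import os


def merge_json_lines(output_filename, pattern):
    """Merge newline-delimited JSON files matching `pattern` into one JSON array."""
    records = []
    out_path = os.path.abspath(output_filename)
    for infile_name in sorted(glob.glob(pattern)):
        # Do not read a previously written merged file back in as input.
        if os.path.abspath(infile_name) == out_path:
            continue
        with open(infile_name) as infile:
            for line in infile:
                line = line.strip()
                if line:  # ignore blank lines
                    records.append(json.loads(line))
    # Write all records as a single, valid JSON array.
    with open(output_filename, 'w') as outfile:
        json.dump(records, outfile)
    return records
```

Called as merge_json_lines('data/d/merged.json', 'data/d/*.json'), it also returns the parsed records, so the same list could feed the insert loop from the question without re-reading the merged file.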

