Automatically save output to a new JSON file each time a counter reaches a certain number (Python)

Problem description

I have several folders, each containing several CSV files with a large number of rows and columns. I am trying to concatenate certain columns from the CSV files into a JSON file. My code runs fine on folders with fewer than 100 CSV files. With more than 100 files, the code becomes very slow and gets stuck after the first few files.

I created mock data with 4 dataframes that replicate my original data:

import pandas as pd
import numpy as np
import glob

data_1 = {'host_identity_verified': ['t', 't', 't', 't', 't', 't', 't', 't', 't', 't'],
          'neighbourhood': ['q', 'q', 'q', 'q', 'q', 'q', 'q', 'q', 'q', 'q'],
          'neighbourhood_cleansed': ['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
                                     'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
          'neighbourhood_group_cleansed': ['NaN'] * 10,
          'latitude': [52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}

data_2 = {'host_identity_verified': ['t', 't', 't', 't', 't', 't', 't', 't', 't', 't'],
          'neighbourhood': ['w', 'w', 'w', 'w', 'w', 'w', 'w', 'w', 'w', 'w'],
          'neighbourhood_cleansed': ['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
                                     'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
          'neighbourhood_group_cleansed': ['NaN'] * 10,
          'latitude': [52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}

data_3 = {'host_identity_verified': ['t', 't', 't', 't', 't', 't', 't', 't', 't', 't'],
          'neighbourhood': ['NaN', 'NaN', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US'],
          'neighbourhood_cleansed': ['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
                                     'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
          'neighbourhood_group_cleansed': ['NaN'] * 10,
          'latitude': [52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}

data_4 = {'host_identity_verified': ['t', 't', 't', 't', 't', 't', 't', 't', 't', 't'],
          'neighbourhood': ['NaN', 'NaN', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US', 'Chicago, US'],
          'neighbourhood_cleansed': ['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
                                     'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
          'neighbourhood_group_cleansed': ['NaN'] * 10,
          'latitude': [52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}


df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
df_3 = pd.DataFrame(data_3)
df_4 = pd.DataFrame(data_4)

df_list_1 = []
df_list_2 = []
df_list_3 = []
df_list_4 = []

df_list_1.append(df_1)
df_list_2.append(df_2)
df_list_3.append(df_3)
df_list_4.append(df_4)

df_all = df_list_1 + df_list_2 + df_list_3 + df_list_4
def Get_Columns(file_name):
    return file_name[['host_identity_verified', 'latitude']]


count = 0
li = []
for df in df_all:
    count += 1
    print(count)
    if count < 3:
        li.append(df)
        frame_1 = pd.concat(li, axis=0, ignore_index=True)
        concat_data_1 = Get_Columns(frame_1)
        with open('Booking_Data_%s.json' % count, 'w') as outfile:
            outfile.write(concat_data_1.to_json())

As you can see, to concatenate x files and save them into one JSON file, I have to do this manually by writing many elif statements. I have a folder with just under 900 files, so I would have to write about 19 conditions to save every 50 CSV files into a JSON file.

So I want to shorten the code and automatically save the output to a new JSON file each time the counter reaches a multiple of 20: the first 20 in one file, the second 20 in another, and so on.

For example, if a folder holds 58 files and I save every 20 files into one JSON file, I should end up with 3 JSON files: the first two containing 20 CSVs each and the last one containing 18.
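The file count in this example follows from ceiling division, which can be checked directly:

```python
import math

n_files, chunk = 58, 20        # 58 CSVs, 20 per output file
n_out = math.ceil(n_files / chunk)
last = n_files - (n_out - 1) * chunk   # size of the final, partial chunk
print(n_out, last)  # 3 18
```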

Also, will I run into problems analyzing the JSON files because they are so large? Is JSON the best file type for storing big data? We are talking about close to a million rows per file, if not more, at hundreds of MB each.
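For very large outputs, one option worth noting is line-delimited JSON, which pandas can write directly; tools can then process the file one row at a time instead of parsing it whole. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'host_identity_verified': ['t', 'f'],
                   'latitude': [52.36575, 52.36509]})

# orient='records' with lines=True emits one JSON object per row
print(df.to_json(orient='records', lines=True))
```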

Tags: python, json, counter, large-data

Solution
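One possible sketch, assuming the column names and the `Booking_Data_*.json` naming pattern from the question: slice the list of dataframes into groups of `chunk_size`, concatenate each group once, and write it to its own file. This avoids both the manual elif chain and the repeated re-concatenation of the growing list inside the loop.

```python
import math
import pandas as pd

def save_in_chunks(df_all, chunk_size, prefix='Booking_Data'):
    """Concatenate the dataframes in groups of `chunk_size` and write
    each group to its own JSON file: prefix_1.json, prefix_2.json, ..."""
    n_out = math.ceil(len(df_all) / chunk_size)
    paths = []
    for i in range(n_out):
        chunk = df_all[i * chunk_size:(i + 1) * chunk_size]
        frame = pd.concat(chunk, axis=0, ignore_index=True)
        # keep only the columns of interest, as in the question
        subset = frame[['host_identity_verified', 'latitude']]
        path = f'{prefix}_{i + 1}.json'
        subset.to_json(path)
        paths.append(path)
    return paths
```

With 58 input dataframes and `chunk_size=20`, this writes three files (20 + 20 + 18 rows' worth); for the question's mock data, `save_in_chunks(df_all, 20)` would put all four dataframes into a single `Booking_Data_1.json`.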
