Merging data and removing duplicate records in many JSON files with Python

Problem description

I have received many JSON files in the following format.

{
    "y":[
        [0,0,0,0,0,0,0,0,0,0],
        [0,0,0,0,0,0,0,0,0,0],
        [0,0,0,0,0,0,0,0,0,0],
        [0,0,0,15866,15866,15866,16869,17116,17400,17412],
        [53,3253,3253,3253,3253,3253,3253,3253,3249,3249],
        [0,0,0,0,0,0,0,0,0,0],
        [342,16342,16342,16342,16342,16342,16342,16342,16342,16342],
        [13427,14033,14606,115822,120711,121270,125757,145946,150498,150634],
        [0,0,0,25,81,12,0,0,0,0],
        [0,0,0,0,0,0,0,0,0,0],
        [0,2193,2175,2175,4050,4059,4059,4089,4079,3695],
        [4,0,0,0,0,0,0,77,0,0],
        [0,75,75,75,78,78,78,734,732,732]
        ],
    "labels":[
        "Developer 1",
        "Developer 10",
        "Developer 2",
        "Developer 3",
        "Developer 4",
        "Developer 11",
        "Developer 5",
        "Developer 6",
        "Developer 7",
        "Developer 12",
        "Developer 8",
        "Developer 6",
        "Developer 7"
        ]
}

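Each data element in y has the same index as its label in labels. The problem is that the same label sometimes appears twice. In this example, Developer 6 appears at indices 7 and 11, and Developer 7 appears at indices 8 and 12.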
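To make the duplicate positions explicit, here is a minimal sketch that maps each repeated label to the indices where it occurs. The helper name `duplicate_label_indices` is made up for illustration; the label list is copied from the sample file above.

```python
from collections import defaultdict

def duplicate_label_indices(labels):
    """Map each label that occurs more than once to the list of its indices."""
    positions = defaultdict(list)
    for index, label in enumerate(labels):
        positions[label].append(index)
    # keep only the labels that were seen at more than one position
    return {label: idxs for label, idxs in positions.items() if len(idxs) > 1}

labels = [
    "Developer 1", "Developer 10", "Developer 2", "Developer 3",
    "Developer 4", "Developer 11", "Developer 5", "Developer 6",
    "Developer 7", "Developer 12", "Developer 8", "Developer 6",
    "Developer 7",
]

print(duplicate_label_indices(labels))
# {'Developer 6': [7, 11], 'Developer 7': [8, 12]}
```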

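I want to merge the data of the duplicates. I can do this by adding the lists of the duplicate records together element by element. Here is an example for Developer 6.

The duplicate data rows are: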

[13427,14033,14606,115822,120711,121270,125757,145946,150498,150634],
[4,0,0,0,0,0,0,77,0,0],

The merged record would be:

[13431,14033,14606,115822,120711,121270,125757,146023,150498,150634],
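The element-wise addition can be sketched in plain Python with `zip`, assuming both rows have the same length. The variable names are illustrative only.

```python
# The two rows recorded for Developer 6 in the sample file
row_a = [13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946, 150498, 150634]
row_b = [4, 0, 0, 0, 0, 0, 0, 77, 0, 0]

# Add the values that share the same position
merged = [x + y for x, y in zip(row_a, row_b)]

print(merged)
# [13431, 14033, 14606, 115822, 120711, 121270, 125757, 146023, 150498, 150634]
```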

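This is where I get stuck. I want to delete one of the old rows along with the duplicated label. Then I need to repeat the process for any other duplicated labels, but by that point I have already messed up the indices.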
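The index trouble typically comes from deleting while walking forward: once index 11 is removed, the entry that was at index 12 moves to 11. One possible way around it, sketched below, is to collect the positions first and delete from the highest index down. The helper `drop_indices` and the stand-in rows are assumptions made for illustration, not part of the original data.

```python
def drop_indices(rows, labels, indices):
    """Delete the given positions from both lists in place, highest index first."""
    for index in sorted(set(indices), reverse=True):
        del rows[index]
        del labels[index]

labels = [
    "Developer 1", "Developer 10", "Developer 2", "Developer 3",
    "Developer 4", "Developer 11", "Developer 5", "Developer 6",
    "Developer 7", "Developer 12", "Developer 8", "Developer 6",
    "Developer 7",
]
# Stand-in rows: each "row" is simply its original index, to show what survives
rows = list(range(len(labels)))

# Remove the second occurrence of Developer 6 (11) and Developer 7 (12)
drop_indices(rows, labels, [11, 12])

print(rows)        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(labels[-1])  # Developer 8
```

Deleting from the end means the positions still waiting to be removed are never shifted by an earlier deletion.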

How can I merge the duplicate data rows, remove the duplicated labels, and do this for every duplicated label in a file?

Tags: python, json, list, duplicates

Solution


You can try this.

import numpy as np
# a is the dict parsed from one JSON file (for example via json.load)
out=list(zip(a['y'],a['labels']))
''' out looks like this
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 1')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 10')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 2')
([0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400, 17412], 'Developer 3')
([53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249], 'Developer 4')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 11')
([342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342], 'Developer 5')
([13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946, 150498, 150634], 'Developer 6')
([0, 0, 0, 25, 81, 12, 0, 0, 0, 0], 'Developer 7')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 12')
([0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695], 'Developer 8')
([4, 0, 0, 0, 0, 0, 0, 77, 0, 0], 'Developer 6')
([0, 75, 75, 75, 78, 78, 78, 734, 732, 732], 'Developer 7')'''

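out=list(map(list,out)) # tuples are immutable, so turn each pair into a list

# convert every data row to a NumPy array so rows can be added element-wise
for i,val in enumerate(out):
    out[i][0]=np.array(val[0])

# group the arrays by label; a duplicated label collects several arrays
new_dict={}
for v,k in out:
    if k not in new_dict:
        new_dict[k]=[v]
    else:
        new_dict[k].append(v)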

''' new_dict looks like this
('Developer 1', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 10', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 2', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 3', [array([    0,     0,     0, 15866, 15866, 15866, 16869, 17116, 17400,
       17412])])
('Developer 4', [array([  53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])])
('Developer 11', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 5', [array([  342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
       16342])])
('Developer 6', [array([ 13427,  14033,  14606, 115822, 120711, 121270, 125757, 145946,
       150498, 150634]), array([ 4,  0,  0,  0,  0,  0,  0, 77,  0,  0])])
('Developer 7', [array([ 0,  0,  0, 25, 81, 12,  0,  0,  0,  0]), array([  0,  75,  75,  75,  78,  78,  78, 734, 732, 732])])
('Developer 12', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 8', [array([   0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])'''

temp=np.zeros(10) # each developer's array has 10 elements
for k,arrays in new_dict.items():
    if len(arrays)>1: # only labels that occur more than once
        for l in arrays:
            temp=temp+l
        new_dict[k]=temp # replacing the value of an existing key is safe while iterating
        temp=np.zeros(10)

'''Now new_dict.items() will look like this
('Developer 1', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 10', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 2', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 3', [array([    0,     0,     0, 15866, 15866, 15866, 16869, 17116, 17400,
       17412])])
('Developer 4', [array([  53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])])
('Developer 11', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 5', [array([  342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
       16342])])
('Developer 6', array([ 13431.,  14033.,  14606., 115822., 120711., 121270., 125757.,
       146023., 150498., 150634.]))
('Developer 7', array([  0.,  75.,  75., 100., 159.,  90.,  78., 734., 732., 732.]))
('Developer 12', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 8', [array([   0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])'''

labels,values=zip(*new_dict.items())
res={'y':values,'labels':labels}

res is what you need. Note that the merged rows are float arrays (temp starts from np.zeros), while the rows that were not duplicated remain one-element lists of integer arrays.


Output

import pandas as pd
print(res)
df=pd.DataFrame(res)
print(df)

{'y': ([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([    0,     0,     0, 15866, 15866, 15866, 16869, 17116, 17400,
           17412])], [array([  53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([  342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
           16342])], array([ 13431.,  14033.,  14606., 115822., 120711., 121270., 125757.,
           146023., 150498., 150634.]), array([  0.,  75.,  75., 100., 159.,  90.,  78., 734., 732., 732.]), [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([   0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])]),
 'labels': ('Developer 1', 'Developer 10', 'Developer 2', 'Developer 3', 'Developer 4',
'Developer 11', 'Developer 5', 'Developer 6', 'Developer 7', 'Developer 12', 'Developer 8')}

                                                    y        labels
0                    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]   Developer 1
1                    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]  Developer 10
2                    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]   Developer 2
3   [[0, 0, 0, 15866, 15866, 15866, 16869, 17116, ...   Developer 3
4   [[53, 3253, 3253, 3253, 3253, 3253, 3253, 3253...   Developer 4
5                    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]  Developer 11
6   [[342, 16342, 16342, 16342, 16342, 16342, 1634...   Developer 5
7   [13431.0, 14033.0, 14606.0, 115822.0, 120711.0...   Developer 6
8   [0.0, 75.0, 75.0, 100.0, 159.0, 90.0, 78.0, 73...   Developer 7
9                    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]  Developer 12
10  [[0, 2193, 2175, 2175, 4050, 4059, 4059, 4089,...   Developer 8
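As an alternative to the NumPy answer above (not the answerer's code), here is a plain-Python sketch that keeps every row as a list of integers, so the result can be written straight back to JSON. It assumes all rows for a label have the same length. The names `merge_duplicates` and `merge_json_files`, and the input/output directory layout, are hypothetical.

```python
import json
from pathlib import Path

def merge_duplicates(data):
    """Return a new dict in which rows sharing a label are summed element-wise.

    Labels keep the order of their first appearance; the input is not modified.
    """
    merged = {}  # label -> summed row (dicts preserve insertion order)
    for row, label in zip(data["y"], data["labels"]):
        if label in merged:
            merged[label] = [x + y for x, y in zip(merged[label], row)]
        else:
            merged[label] = list(row)
    return {"y": list(merged.values()), "labels": list(merged.keys())}

def merge_json_files(in_dir, out_dir):
    """Apply merge_duplicates to every *.json file in in_dir (assumed layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(in_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        with (out_dir / path.name).open("w", encoding="utf-8") as f:
            json.dump(merge_duplicates(data), f)

# Small example using the duplicated rows from the question
sample = {
    "y": [
        [13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946, 150498, 150634],
        [0, 0, 0, 25, 81, 12, 0, 0, 0, 0],
        [4, 0, 0, 0, 0, 0, 0, 77, 0, 0],
        [0, 75, 75, 75, 78, 78, 78, 734, 732, 732],
    ],
    "labels": ["Developer 6", "Developer 7", "Developer 6", "Developer 7"],
}
result = merge_duplicates(sample)
print(result["labels"])  # ['Developer 6', 'Developer 7']
print(result["y"][1])    # [0, 75, 75, 100, 159, 90, 78, 734, 732, 732]
```

Because a new dict is built instead of deleting entries in place, no index bookkeeping is needed, and writing the results to a separate directory leaves the original files untouched.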
