python - Merge data and remove duplicate records across many JSON files with Python
Problem description
I have received many JSON files in the following format.
{
"y":[
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,15866,15866,15866,16869,17116,17400,17412],
[53,3253,3253,3253,3253,3253,3253,3253,3249,3249],
[0,0,0,0,0,0,0,0,0,0],
[342,16342,16342,16342,16342,16342,16342,16342,16342,16342],
[13427,14033,14606,115822,120711,121270,125757,145946,150498,150634],
[0,0,0,25,81,12,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,2193,2175,2175,4050,4059,4059,4089,4079,3695],
[4,0,0,0,0,0,0,77,0,0],
[0,75,75,75,78,78,78,734,732,732]
],
"labels":[
"Developer 1",
"Developer 10",
"Developer 2",
"Developer 3",
"Developer 4",
"Developer 11",
"Developer 5",
"Developer 6",
"Developer 7",
"Developer 12",
"Developer 8",
"Developer 6",
"Developer 7"
]
}
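The data elements in y share the same index as the labels in labels. The problem I'm running into is that the same label sometimes appears twice. In this example, Developer 6 appears at indices 7 and 11, and Developer 7 appears at indices 8 and 12.
I want to merge the data for the duplicates. I can do that by adding the items in the duplicate records' lists position by position. Example for Developer 6.
The duplicate data rows are:
[13427,14033,14606,115822,120711,121270,125757,145946,150498,150634],
[4,0,0,0,0,0,0,77,0,0],
The merged record would be:
[13431,14033,14606,115822,120711,121270,125757,146023,150498,150634],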
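As a quick check of the arithmetic above, the two Developer 6 rows can be added position by position. This is only a minimal sketch in plain Python; the variable names are illustrative and do not come from the original files.

```python
# The two rows that share the label "Developer 6" in the example file.
row_a = [13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946, 150498, 150634]
row_b = [4, 0, 0, 0, 0, 0, 0, 77, 0, 0]

# zip pairs up the values at the same position; each pair is then summed.
merged = [x + y for x, y in zip(row_a, row_b)]
print(merged)
```

Only positions 0 and 7 change, because those are the only non-zero entries in the second row.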
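This is where I'm stuck. I want to delete one of the old rows and the duplicate label. Then I need to be able to repeat the process for any other duplicate labels, but by that point I have already messed up the indices.
How can I merge the duplicate data rows, delete the duplicate labels, and do this for every duplicate label in a file?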
Solution
You can try this.
import numpy as np
# Assumption: `a` is the dict parsed from one of the JSON files, e.g. a = json.load(f)
out=list(zip(a['y'],a['labels']))
''' out looks like this
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 1')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 10')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 2')
([0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400, 17412], 'Developer 3')
([53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249], 'Developer 4')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 11')
([342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342], 'Developer 5')
([13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946, 150498, 150634], 'Developer 6')
([0, 0, 0, 25, 81, 12, 0, 0, 0, 0], 'Developer 7')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 12')
([0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695], 'Developer 8')
([4, 0, 0, 0, 0, 0, 0, 77, 0, 0], 'Developer 6')
([0, 75, 75, 75, 78, 78, 78, 734, 732, 732], 'Developer 7')'''
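out=list(map(list,out))
for i,val in enumerate(out):
    out[i][0]=np.array(val[0])  # arrays so that rows can be added element-wise
new_dict={}
for v,k in out:
    if k not in new_dict:
        new_dict[k]=[v]  # first time this label is seen
    else:
        new_dict[k].append(v)  # duplicate label: collect the extra row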
''' new_dict looks like this
('Developer 1', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 10', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 2', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 3', [array([ 0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400,
17412])])
('Developer 4', [array([ 53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])])
('Developer 11', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 5', [array([ 342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
16342])])
('Developer 6', [array([ 13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946,
150498, 150634]), array([ 4, 0, 0, 0, 0, 0, 0, 77, 0, 0])])
('Developer 7', [array([ 0, 0, 0, 25, 81, 12, 0, 0, 0, 0]), array([ 0, 75, 75, 75, 78, 78, 78, 734, 732, 732])])
('Developer 12', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 8', [array([ 0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])'''
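temp=np.zeros(10) #each array corresponding to each developer is of size 10
for k,rows in list(new_dict.items()):  # iterate over a copy because values are replaced below
    if len(rows)>1:  # only labels that occur more than once
        for l in rows:
            temp=temp+l  # element-wise sum; float because temp starts as np.zeros
        new_dict[k]=temp
        #print(temp)
        temp=np.zeros(10)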
'''Now new_dict.items() will look like this
('Developer 1', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 10', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 2', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 3', [array([ 0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400,
17412])])
('Developer 4', [array([ 53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])])
('Developer 11', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 5', [array([ 342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
16342])])
('Developer 6', array([ 13431., 14033., 14606., 115822., 120711., 121270., 125757.,
146023., 150498., 150634.]))
('Developer 7', array([ 0., 75., 75., 100., 159., 90., 78., 734., 732., 732.]))
('Developer 12', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 8', [array([ 0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])'''
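a,b=zip(*new_dict.items())  # a holds the labels, b holds the data rows
res={'y':a,'label':b}  # note: as written, 'y' receives the labels and 'label' the rows; swap a and b to match the input layout
res is what you need.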
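As an alternative to the NumPy steps above, the merge can also be sketched in a single pass with only the standard library. This is not the answerer's code: the function name `merge_duplicate_labels` is hypothetical, and the sketch assumes every row has the same length. It keeps labels in first-seen order and returns plain integer lists under the same `y` and `labels` keys as the input.

```python
def merge_duplicate_labels(data):
    """Sum rows that share a label; keep labels in first-seen order."""
    merged = {}  # label -> summed row (dicts preserve insertion order)
    for row, label in zip(data["y"], data["labels"]):
        if label in merged:
            # Duplicate label: add this row to the running total, position by position.
            merged[label] = [x + y for x, y in zip(merged[label], row)]
        else:
            merged[label] = list(row)
    return {"y": list(merged.values()), "labels": list(merged.keys())}


# Small illustrative input with one duplicated label.
sample = {
    "y": [[1, 2, 3], [10, 20, 30], [4, 0, 0]],
    "labels": ["Developer 6", "Developer 7", "Developer 6"],
}
result = merge_duplicate_labels(sample)
print(result)
```

Because the rows are keyed by label and never deleted by position, the index bookkeeping described in the question does not come up.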
Output
import pandas as pd
print(res)
df=pd.DataFrame(res)
print(df)
{'y': ('Developer 1', 'Developer 10', 'Developer 2', 'Developer 3', 'Developer 4',
'Developer 11', 'Developer 5', 'Developer 6', 'Developer 7', 'Developer 12', 'Developer 8'),
'label': ([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([ 0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400,
17412])], [array([ 53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([ 342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
16342])], array([ 13431., 14033., 14606., 115822., 120711., 121270., 125757.,
146023., 150498., 150634.]), array([ 0., 75., 75., 100., 159., 90., 78., 734., 732., 732.]), [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([ 0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])}
               y                                              label
0    Developer 1                   [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
1   Developer 10                   [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
2    Developer 2                   [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
3    Developer 3  [[0, 0, 0, 15866, 15866, 15866, 16869, 17116, ...
4    Developer 4  [[53, 3253, 3253, 3253, 3253, 3253, 3253, 3253...
5   Developer 11                   [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
6    Developer 5  [[342, 16342, 16342, 16342, 16342, 16342, 1634...
7    Developer 6  [13431.0, 14033.0, 14606.0, 115822.0, 120711.0...
8    Developer 7  [0.0, 75.0, 75.0, 100.0, 159.0, 90.0, 78.0, 73...
9   Developer 12                   [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
10   Developer 8  [[0, 2193, 2175, 2175, 4050, 4059, 4059, 4089,...
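Since the question involves many files, the same merge could be applied file by file. The sketch below rests on several assumptions: the function names, the `*.json` glob pattern, and the decision to overwrite each file in place are all hypothetical, so adjust them to your own layout and keep backups while testing.

```python
import json
from pathlib import Path


def merge_rows(data):
    """Sum rows that share a label, keeping labels in first-seen order."""
    merged = {}
    for row, label in zip(data["y"], data["labels"]):
        if label in merged:
            merged[label] = [x + y for x, y in zip(merged[label], row)]
        else:
            merged[label] = list(row)
    return {"y": list(merged.values()), "labels": list(merged)}


def dedupe_json_files(directory):
    """Rewrite every *.json file in `directory` with duplicate labels merged.

    Returns the names of the files that were processed.
    """
    paths = sorted(Path(directory).glob("*.json"))
    for path in paths:
        data = json.loads(path.read_text(encoding="utf-8"))
        path.write_text(json.dumps(merge_rows(data)), encoding="utf-8")
    return [p.name for p in paths]
```

Calling `dedupe_json_files("some/folder")` would process each file independently, so a duplicate label is only merged with rows from the same file.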