python - Optimizing iteration and substitution over a large dataset
Question
I posted this question elsewhere, but since I have not received any answers there yet, I thought I would also try here, as it seems relevant.
I have the following code:
import pandas as pd
import numpy as np
import itertools
from pprint import pprint

# Importing the data
df = pd.read_csv('./GPr.csv', sep=',', header=None)
data = df.values
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)

# This function makes the size-n subsets of each list
def subsets(m, n):
    z = []
    for i in m:
        z.append(list(itertools.combinations(i, n)))
    return z

# Make the subsets of size 2
l = subsets(res, 2)
l = [val for sublist in l for val in sublist]
Pairs = list(dict.fromkeys(l))

# Modify the pairs, e.g. ('a', 'b') -> 'a:b'
mod = [':'.join(x) for x in Pairs]

# Define new lists
t0 = res.tolist()
t0 = list(map(tuple, t0))  # materialize as a list: a bare map object would be exhausted after the first pass
t1 = Pairs
t2 = mod

# Make substitutions
result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)
pprint(result, width=200)

# Delete duplicates
d = {tuple(x): x for x in result}
remain = list(d.values())
What it does is the following: first we import the csv file we want to work with. You can see it is a list of lists, and for each list we find its subsets of size 2. Then we modify each subset and call the result mod: it turns, say, ('a','b') into 'a:b'. Then, for each pair, we go through the original data, and wherever we find the pair we substitute it. Finally, we delete all the duplicates that are produced along the way.
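On toy data (the names below are made up for illustration, not from the real csv), the whole pipeline boils down to this sketch:

```python
import itertools

rows = [('a', 'b', 'c'), ('d', 'e')]

# all size-2 subsets across rows, and their joined forms
pairs = [p for row in rows for p in itertools.combinations(row, 2)]
mods = [':'.join(p) for p in pairs]

# for each pair, replace its two items with the joined form
# in every row that contains both of them
result = []
for pair, mod in zip(pairs, mods):
    out = []
    for row in rows:
        if set(pair) <= set(row):
            out.append(tuple(set(row) - set(pair)) + (mod,))
        else:
            out.append(row)
    result.append(out)

print(result[0])  # [('c', 'a:b'), ('d', 'e')]
```

So the pair ('a','b') collapses to 'a:b' inside the first row and leaves the second row untouched.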
The code works fine on a small amount of data. The problem, however, is that my file contains 30082 pairs, and for each of them the code has to scan a list of ~49000 lists and substitute the pair. I run it in Jupyter and after a while the kernel dies. How can I optimize this?
Solution
Tested on the entire file. Here you go:
=^..^=
import pandas as pd
import numpy as np
import itertools

# Importing the data
df = pd.read_csv('./GPr_test.csv', sep=',', header=None)

# set up a new data frame
df2 = pd.DataFrame()
pd.options.display.max_colwidth = 200

for index, row in df.iterrows():
    # clean data
    clean_list = [x for x in list(row.values) if str(x) != 'nan']
    # create combinations
    items_combinations = list(itertools.combinations(clean_list, 2))
    # create set combinations
    joint_items_combinations = [':'.join(x) for x in items_combinations]
    # collect the rest of the item names; the first row has no
    # predecessor, so it takes them from the following row instead
    if index == 0:
        additional_names = list(df.loc[1].values)
    else:
        additional_names = list(df.loc[index - 1].values)
    additional_names = [x for x in additional_names if str(x) != 'nan']
    # get set data
    result = []
    for combination, joint_combination in zip(items_combinations, joint_items_combinations):
        set_data = [item for item in clean_list if item not in combination] + [joint_combination]
        result.append((set_data, additional_names))
    # add data to the data frame
    data = pd.DataFrame({"result": result})
    df2 = df2.append(data)

df2 = df2.reset_index().drop(columns=['index'])
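One caveat: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, and calling it inside a loop copies the whole frame on every iteration. Collecting the per-row frames and concatenating once is both compatible with current pandas and noticeably faster. A self-contained sketch of that change, using a toy frame in place of the csv (names are illustrative):

```python
import itertools
import pandas as pd

# toy frame standing in for the csv
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', None]])

frames = []
for index, row in df.iterrows():
    clean = [x for x in row.tolist() if pd.notna(x)]
    combos = list(itertools.combinations(clean, 2))
    result = [[i for i in clean if i not in c] + [':'.join(c)] for c in combos]
    frames.append(pd.DataFrame({"result": result}))

# one concat at the end instead of repeated DataFrame.append
df2 = pd.concat(frames, ignore_index=True)
print(len(df2))  # 4: three combinations from the first row, one from the second
```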
For the rows:
chicken cinnamon ginger onion soy_sauce
cardamom coconut pumpkin
the output is:
result
0 ([ginger, onion, soy_sauce, chicken:cinnamon], [cardamom, coconut, pumpkin])
1 ([cinnamon, onion, soy_sauce, chicken:ginger], [cardamom, coconut, pumpkin])
2 ([cinnamon, ginger, soy_sauce, chicken:onion], [cardamom, coconut, pumpkin])
3 ([cinnamon, ginger, onion, chicken:soy_sauce], [cardamom, coconut, pumpkin])
4 ([chicken, onion, soy_sauce, cinnamon:ginger], [cardamom, coconut, pumpkin])
5 ([chicken, ginger, soy_sauce, cinnamon:onion], [cardamom, coconut, pumpkin])
6 ([chicken, ginger, onion, cinnamon:soy_sauce], [cardamom, coconut, pumpkin])
7 ([chicken, cinnamon, soy_sauce, ginger:onion], [cardamom, coconut, pumpkin])
8 ([chicken, cinnamon, onion, ginger:soy_sauce], [cardamom, coconut, pumpkin])
9 ([chicken, cinnamon, ginger, onion:soy_sauce], [cardamom, coconut, pumpkin])
10 ([pumpkin, cardamom:coconut], [chicken, cinnamon, ginger, onion, soy_sauce])
11 ([coconut, cardamom:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
12 ([cardamom, coconut:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
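With ~30082 pairs against ~49000 rows, even the improved version may still hold more in memory than the kernel allows, since the full result is materialized before anything is written out. One way to keep memory flat, regardless of input size, is to stream each finished substitution straight to disk instead of accumulating it. A minimal sketch of that idea (file name and toy rows are illustrative):

```python
import csv
import itertools

rows = [('a', 'b', 'c'), ('d', 'e')]  # stand-in for the cleaned csv rows

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        for pair in itertools.combinations(row, 2):
            # remaining items plus the joined pair, written immediately
            rest = [i for i in row if i not in pair]
            writer.writerow(rest + [':'.join(pair)])
```

Nothing is retained across iterations, so memory use stays constant no matter how many rows or pairs there are.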