Large operations that kill the kernel

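Problem description

I wrote the following code to load data from a Postgres database and perform some operations on it. There are about 1 million rows, and the kernel keeps dying. When I limit the data to around 10k rows, it works.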

import psycopg2
import sys, os
import numpy as np
import pandas as pd
import creds as creds
import pandas.io.sql as psql


## ****** LOAD PSQL DATABASE ***** ##
# Sets up a connection to the postgres server.
conn_string = "host="+ creds.PGHOST +" port="+ "5432" +" dbname="+ creds.PGDATABASE +" user=" + creds.PGUSER \
+" password="+ creds.PGPASSWORD
conn=psycopg2.connect(conn_string)
print("Connected!")

# Create a cursor object
cursor = conn.cursor()


sql_command = "SELECT * FROM {};".format(str("events"))
print (sql_command)

# Load the data
data = pd.read_sql(sql_command, conn)

# Taking a subset of the data until the algorithm is perfected.
# seed = np.random.seed(42)

# n = data.shape[0]
# ix = np.random.choice(n,10000)
# df_tmp = data.iloc[ix]

# Taking the source and destination and combining it into a list in another column 
# df_tmp['accounts'] = df_tmp.apply(lambda x: [x['source'], x['destination']], axis=1)
data['accounts'] = data.apply(lambda x: (x['source'], x['destination']), axis=1)
data['accounts_acc'] = data['accounts'].cumsum().apply(set)

Is there a more efficient way to do this that does not keep failing?

Tags: python, pandas, numpy, bigdata

Solution


I suspect the problem lies in the "apply" method, because it consumes a lot of memory.

Try replacing it with:

data['accounts'] = [(t.source, t.destination) for t in data.itertuples()]
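
As a minimal sketch of this replacement, the snippet below builds a small made-up DataFrame and checks that the itertuples comprehension yields the same column as the row-wise apply. Only the column names source and destination come from the question; the account values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data = pd.DataFrame({
    "source": ["a", "b", "a", "c"],
    "destination": ["b", "c", "d", "a"],
})

# Row-wise apply: pandas constructs a Series object for every row.
via_apply = data.apply(lambda x: (x["source"], x["destination"]), axis=1)

# itertuples: each row arrives as a lightweight namedtuple instead.
via_itertuples = [(t.source, t.destination) for t in data.itertuples(index=False)]

# Both approaches produce the same tuples, in the same order.
assert via_apply.tolist() == via_itertuples

data["accounts"] = via_itertuples
print(data)
```

Passing index=False is optional here; it only skips the index field that itertuples would otherwise prepend to each row.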

Let's test this on a DataFrame with 600,000 rows and 4 columns.

Memory performance

%memit df['accounts1'] = df.apply(lambda x: (x['col1'], x['col2']), axis=1)

peak memory: 506.66 MiB, increment: 114.62 MiB

%memit run_loop()

peak memory: 475.82 MiB, increment: 82.15 MiB

%memit df['accounts2'] = [(t.col1, t.col2) for t in df.itertuples()]

peak memory: 430.07 MiB, increment: 38.02 MiB

# run_loop() as measured above: a plain iterrows() loop that builds "col1,col2" strings.
def run_loop():
    new_col = []
    for i, row in df.iterrows():
        result = str(row.col1)+","+str(row.col2)
        new_col.append(result)
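
The %memit magic used above comes from the memory_profiler IPython extension. Outside a notebook, a rough comparison can be sketched with the standard library's tracemalloc. Note that tracemalloc tracks Python-level allocations rather than process memory, so the absolute numbers will not match the figures reported here. The DataFrame below is synthetic and much smaller than the 600,000-row frame measured in the answer, and the helper name peak_mib is made up for illustration.

```python
import tracemalloc

import numpy as np
import pandas as pd

# Synthetic stand-in for the benchmark frame (assumption: integer columns).
rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.integers(0, 1000, size=(5_000, 4)),
    columns=["col1", "col2", "col3", "col4"],
)


def peak_mib(func):
    """Run func() and return (result, peak traced allocation in MiB)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


def with_apply():
    # Row-wise apply, as in the question.
    return df.apply(lambda x: (x["col1"], x["col2"]), axis=1).tolist()


def with_itertuples():
    # List comprehension over itertuples, as suggested in the answer.
    return [(t.col1, t.col2) for t in df.itertuples()]


res_apply, peak_apply = peak_mib(with_apply)
res_tuples, peak_tuples = peak_mib(with_itertuples)

print(f"apply:      peak {peak_apply:.2f} MiB")
print(f"itertuples: peak {peak_tuples:.2f} MiB")
```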

Time performance

%timeit df['accounts1'] = df.apply(lambda x: (x['col1'], x['col2']), axis=1)

9.93 s ± 345 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['accounts2'] = [(t.col1, t.col2) for t in df.itertuples()]

598 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
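
The measurements above cover only the first of the two lines in the question. As a side note that is not part of the original answer: data['accounts'].cumsum() concatenates the tuples cumulatively, so the i-th row holds a tuple of 2·i elements and the total size grows quadratically with the row count. At one million rows this may matter as much as apply does. A minimal sketch, assuming that only the accounts seen so far are needed, keeps a single running set and records its size per row. The column name n_accounts_seen and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data = pd.DataFrame({
    "source": ["a", "b", "a", "c"],
    "destination": ["b", "c", "d", "a"],
})

seen = set()   # one running set, updated in place
n_seen = []    # number of distinct accounts seen after each row
for src, dst in zip(data["source"], data["destination"]):
    seen.update((src, dst))
    n_seen.append(len(seen))

data["n_accounts_seen"] = n_seen
print(data)
```

If the full set really is needed on every row, storing a copy per row (for example frozenset(seen)) would reproduce the question's accounts_acc column, but memory would then still grow with the number of rows times the number of distinct accounts.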

