python - Parallel programming approach to solve pandas problems
Problem description
I have a DataFrame in the following format.
df
A B Target
5 4 3
1 3 4
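I am using pd.DataFrame(df.corr().iloc[:-1,-1]) to get the correlation of each column with Target.
The problem is that my actual DataFrame has shape (216, 72391), and this takes at least 30 minutes to process on my system. Is there a way to parallelize it using a GPU? I need to compute values like this many times, so I cannot wait the usual 30 minutes of processing time on every run.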
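For context, `df.corr()` builds the full column-by-column correlation matrix (about 72391 x 72391 entries for the shape above), and `.iloc[:-1, -1]` then keeps only its last column. The sketch below uses a made-up toy frame to show that pandas' `corrwith` gives the same numbers while only correlating each feature with `Target`. The values and the names `full` and `direct` are illustrative, not from the question.

```python
import pandas as pd

# Toy frame with the same layout as the question (values are made up).
df = pd.DataFrame({
    "A": [5, 1, 2, 8],
    "B": [4, 3, 7, 1],
    "Target": [3, 4, 6, 2],
})

# Approach from the question: full correlation matrix, then the last column.
full = df.corr().iloc[:-1, -1]

# Alternative: correlate each feature column with Target directly,
# without computing feature-vs-feature correlations.
direct = df.drop(columns="Target").corrwith(df["Target"])

print(full)
print(direct)
```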
Solution
Here, I tried using numba:
import numpy as np
import pandas as pd
from numba import jit, prange

#
#------------You can ignore the code starting from here---------
#
# Create a random DataFrame with 72391 feature columns and 300 rows,
# plus a 'target' column
data = np.random.randint(100, size=(300, 72391))
df = pd.DataFrame(data)
df['target'] = np.random.randint(100, size=300)
# ----------Ignore code till here. This is just to generate dummy data-------

# Assume df is your original DataFrame
target_array = df['target'].values.astype(np.float64)

# You can choose to restore this column later.
# For now we remove it, since we will call df.values
# and find the correlation of each column with the target
df.drop(['target'], inplace=True, axis=1)

# This function takes a 2D NumPy array and a target array as input.
# The 2D array holds the data of all the columns, transposed so that
# each row is one column of df, i.e. its shape is (72391, 300),
# while the target array's shape is (300,)
def do_stuff(df_values, target_arr):
    # Allocate an array to store the result
    # df_values.shape[0] = 72391, equal to the number of columns in df
    result = np.empty(df_values.shape[0])
    # Iterate over each column; with parallel=True, prange lets Numba
    # split the iterations across threads (a plain range would stay serial)
    for i in prange(df_values.shape[0]):
        # Correlation of one column with the target column
        result[i] = np.corrcoef(df_values[i], target_arr)[0, 1]
    return result

# Decorate the function do_stuff
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)

# Transpose so that each row is one column of df, as contiguous float64
values = np.ascontiguousarray(df.values.T, dtype=np.float64)

# This contains all the correlations
result_array = do_stuff_numba(values, target_array)
Link to the Colab notebook.
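As a point of comparison, the per-column correlation can also be written with plain NumPy array operations, with no explicit loop. This is a minimal sketch and not the method used in the answer above: the function name `corr_with_target` and the reduced dummy shape are assumptions made for illustration, and a constant column would produce a division by zero (NaN) that the sketch does not handle.

```python
import numpy as np

def corr_with_target(X, y):
    """Pearson correlation of every column of X with y.

    X: 2D array of shape (n_rows, n_cols); y: 1D array of length n_rows.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Center the columns and the target.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Numerator: sum of products of deviations, one value per column.
    num = Xc.T @ yc
    # Denominator: product of the root sums of squared deviations.
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

# Dummy data: 216 rows as in the question, fewer columns to keep it quick.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(216, 1000))
y = rng.integers(0, 100, size=216)

r = corr_with_target(X, y)
print(r[:5])
```

Since this relies only on array arithmetic, the same expressions may carry over to a GPU array library such as CuPy with few changes, though that is not tested here.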