Improve the performance of computing random samples matching specific conditions in pandas

Problem description

For some dataset group_1, I need to iterate over all rows k times for robustness and find a matching random sample from another data frame group_2 that satisfies certain criteria expressed as data frame columns. Unfortunately, this is fairly slow. How can I improve the performance?

The bottleneck is the apply-ed function, i.e. randomMatchingCondition.

import numpy as np
import pandas as pd
from tqdm import tqdm
from IPython.display import display  # display() is only built in inside Jupyter

tqdm.pandas()

seed = 47
np.random.seed(seed)

###################################################################
# generate dummy data
size = 10000
df = pd.DataFrame({'metric': np.random.randint(1, 100, size=size)})
df['label'] = np.random.randint(0, 2, size=size)
df['group_1'] = pd.Series(np.random.randint(1, 12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1, 10, size=size)).astype(object)

group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
group_0 = group_0.rename(index=str, columns={"metric": "metric_group_0"})

join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric_group_0']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())

###################################################################
# naive find random element matching condition
def randomMatchingCondition(original_element, group_0, join_columns, random_state):
    # build a query such as "group_1 == 3 & group_2 == 7" from the row's values
    limits_dict = original_element[join_columns].to_dict()
    query = ' & '.join([f"{k} == {v}" for k, v in limits_dict.items()])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        # draw a single random matching metric value
        return candidates.sample(n=1, random_state=random_state)['metric_group_0'].values[0]
    else:
        return np.nan
###################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None
for i in range(1, k+1):
    group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
                                                       args=[group_0, join_columns_enrich, None],
                                                       axis=1)
    group_1['run'] = i
    if resulting_df is None:
        resulting_df = group_1.copy()
    else:
        resulting_df = pd.concat([resulting_df, group_1])
resulting_df.head()

I tried pre-sorting the data:

group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)

It doesn't make any difference.

Tags: python, pandas, performance, sampling

Solution


  1. IIUC, you want k random samples for each row (metric combination) of the input data frame. So why not candidates.sample(n=k, ...) and get rid of the for loop? Alternatively, you could replicate the data frame k times with pd.concat([group1] * k). (See the first sketch after this list.)

  2. It depends on your real data, but I would try grouping the input data frame by the metric columns with group1.groupby(join_columns_enrich) (if their cardinality is low enough) and applying the random sampling to these groups, picking k * len(group.index) random samples for each group. groupby is expensive; OTOH, once it is done, you may save a lot on the iteration/sampling. (See the second sketch below.)

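A minimal sketch of the first suggestion, reusing the question's setup. It assumes drawing with replacement is acceptable (replace=True; without it, sample raises whenever a candidate set holds fewer than k rows); the helper name random_matching_samples and the way the k replicas are stitched together are illustrative choices, not taken from the answer:

import numpy as np
import pandas as pd

def random_matching_samples(row, group_0, join_columns, k):
    # draw all k matching samples for this row in a single pass
    query = ' & '.join([f"{c} == {row[c]}" for c in join_columns])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        return candidates['metric_group_0'].sample(n=k, replace=True).to_numpy()
    return np.full(k, np.nan)

k = 3
samples = group_1.apply(random_matching_samples,
                        args=(group_0, join_columns_enrich, k), axis=1)

# replicate group_1 k times (the pd.concat([group_1] * k) idea) and attach
# the i-th sample of every row to the i-th replica
stacked = np.stack(samples.to_list())          # shape: (len(group_1), k)
resulting_df = pd.concat([group_1] * k, keys=range(1, k + 1), names=['run'])
resulting_df['metric_group_0'] = np.concatenate([stacked[:, i] for i in range(k)])

This keeps one .query call per row but collapses the k runs into a single pass over the frame.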

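And a minimal sketch of the second suggestion, again reusing the question's variables (group_0, group_1, join_columns_enrich, k). group_0_lookup is a hypothetical helper name, and sampling with replacement via np.random.choice is an assumption rather than something the answer prescribes:

# group group_0 once by the join columns, keyed by (group_1, group_2) tuples
group_0_lookup = {key: g['metric_group_0'].to_numpy()
                  for key, g in group_0.groupby(join_columns_enrich)}

pieces = []
for key, g in group_1.groupby(join_columns_enrich):
    # one replica of the group per run, stacked under a 'run' index level
    replica = pd.concat([g] * k, keys=range(1, k + 1), names=['run'])
    pool = group_0_lookup.get(key)
    if pool is not None:
        # k * len(group.index) random draws at once for the whole group
        replica['metric_group_0'] = np.random.choice(pool, size=k * len(g))
    else:
        replica['metric_group_0'] = np.nan
    pieces.append(replica)

resulting_df = pd.concat(pieces)

The per-row .query scans disappear entirely: the pairing of each group_1 group with its group_0 candidate pool is paid for once in the groupby, after which sampling is a single vectorized draw per group.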