How do I create columns from 2 related list columns in Python?

Problem description

sampleID   testnames                    results
23939332   [32131,34343,35566]          [NEGATIVE,0.234,3.331]
32332323   [34343,96958,39550,88088]    [0.312,0.008,0.1,0.2]

The table above is what I have, and the table below is what I want to achieve:

sampleID   32131     34343  39550  88088  96958  35566
23939332   NEGATIVE  0.234                       3.331
32332323             0.312  0.1    0.2    0.008

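So I need to create one column for each unique value in the testnames column, and fill the cells with the corresponding values from the results column.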

Keep in mind that this is only a sample from a very large dataset (table).

Tags: pandas, dataframe

Solution


Here is a commented solution:

(df.set_index(['sampleID'])  # keep sampleID out of the expansion
   .apply(pd.Series.explode) # expand testnames and results
   .reset_index()            # reset the index
   .groupby(['sampleID', 'testnames']) # one group per (sample, test) pair
   .first()                            # keep the first result in each group
   .unstack())                         # pivot testnames into columns

It gives the result you expect, but with a different column order:

            results                                 
testnames     32131  34343  35566 39550 88088  96958
sampleID                                            
23939332   NEGATIVE  0.234  3.331   NaN   NaN    NaN
32332323        NaN  0.312    NaN   0.1   0.2  0.008
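For reference, the chain above can be run end to end on the two sample rows from the question. This is a minimal sketch: the frame below is typed in by hand from the question (with 0.312 and 96958 as they appear in the output above), and the final reindex step, which restores the column order the question asked for, is an addition that is not part of the original answer.

```python
import pandas as pd

# Sample data copied by hand from the question
df = pd.DataFrame({
    "sampleID": [23939332, 32332323],
    "testnames": [[32131, 34343, 35566], [34343, 96958, 39550, 88088]],
    "results": [["NEGATIVE", 0.234, 3.331], [0.312, 0.008, 0.1, 0.2]],
})

wide = (df.set_index("sampleID")          # keep sampleID out of the expansion
          .apply(pd.Series.explode)       # expand testnames and results in parallel
          .reset_index()
          .groupby(["sampleID", "testnames"])
          .first()
          .unstack())

# Addition for illustration: drop the outer "results" level and
# put the columns back in the order shown in the question.
wide = wide["results"].reindex(columns=[32131, 34343, 39550, 88088, 96958, 35566])
print(wide)
```

Cells with no matching test stay NaN, as in the output above.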

Let's see how it handles generated data:

import numpy as np
import pandas as pd

def build_df(n_samples, n_tests_per_sample, n_test_types):
    df = pd.DataFrame(columns=['sampleID', 'testnames', 'results'])
    # draw the pool of distinct test IDs
    test_types = np.random.choice(range(0,100000), size=n_test_types, replace=False)
    for i in range(n_samples):
        testnames = list(np.random.choice(test_types,size=n_tests_per_sample))
        results = list(np.random.random(size=n_tests_per_sample))
        # DataFrame.append copies the frame on every call and was removed in pandas 2.0
        df = df.append({'sampleID': i, 'testnames':testnames, 'results':results}, ignore_index=True)
    return df

def reshape(df):
    df2 = (df.set_index(['sampleID'])  # keep the sampleID out of the expansion
             .apply(pd.Series.explode) # expand testnames and results
             .reset_index()            # reset the index
             .groupby(['sampleID', 'testnames']) # one group per (sample, test) pair
             .first()                            # keep the first result in each group
             .unstack())                         # pivot testnames into columns
    return df2

%time df = build_df(60000, 10, 100)
# Wall time: 9min 48s (yes, it was ugly)

%time df2 = reshape(df)
# Wall time: 1.01 s
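Most of the build time reported above is likely spent in df.append, which copies the whole frame on every iteration. A possible faster variant, sketched here under the assumption that only the final frame matters, collects the rows in a plain list and constructs the DataFrame once. The name build_df_fast is hypothetical and not part of the original answer, and no timing was measured for it.

```python
import numpy as np
import pandas as pd

def build_df_fast(n_samples, n_tests_per_sample, n_test_types):
    # Hypothetical variant of build_df: same columns, rows gathered in a list first
    test_types = np.random.choice(range(0, 100000), size=n_test_types, replace=False)
    rows = []
    for i in range(n_samples):
        rows.append({
            "sampleID": i,
            "testnames": list(np.random.choice(test_types, size=n_tests_per_sample)),
            "results": list(np.random.random(size=n_tests_per_sample)),
        })
    # Build the frame once instead of growing it row by row
    return pd.DataFrame(rows, columns=["sampleID", "testnames", "results"])

df = build_df_fast(1000, 10, 100)
```

This version also avoids DataFrame.append, so it should run on pandas 2.0 and later.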

reshape() breaks when n_test_types becomes too large, with ValueError: Unstacked DataFrame is too big, causing int32 overflow.
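One possible way around that error is to reshape the data a block of rows at a time, so that each unstack call sees fewer rows, and then concatenate the wide pieces. The sketch below is not from the original answer: reshape_in_chunks and chunk_size are hypothetical names, and it assumes each sampleID appears in exactly one input row. The concatenated result still has to fit in memory, so this only helps with the size check inside unstack.

```python
import pandas as pd

def reshape_in_chunks(df, chunk_size=10_000):
    # Hypothetical helper: apply the same reshape to chunk_size rows at a time
    pieces = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        piece = (chunk.set_index("sampleID")
                      .apply(pd.Series.explode)
                      .reset_index()
                      .groupby(["sampleID", "testnames"])
                      .first()
                      .unstack())
        pieces.append(piece)
    # concat aligns on column labels; tests missing from a chunk become NaN
    return pd.concat(pieces)

# Small made-up frame, used only to exercise the helper
small = pd.DataFrame({
    "sampleID": [1, 2, 3, 4],
    "testnames": [[10, 20], [20, 30], [10, 30], [40]],
    "results": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7]],
})
wide_small = reshape_in_chunks(small, chunk_size=2)["results"]
print(wide_small)
```

If the wide table is only an intermediate step, keeping the exploded long format (one row per sample and test) may be the more practical choice for a very large number of test types.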
