How to sample a dataframe so that it has the same distribution as a column in another dataframe

Problem description

Using Pandas:

I have a dataframe that has people in it like this:

           member_id on_service  start_date    end_date days_in_study dod  \
12345678    12345678      False  2019-11-03  2020-05-31           210 NaT   
23456789    23456789       True  2019-12-27  2020-05-31           156 NaT    

          last_enrollment_date       RAF   Expense       Age admits_in_range  \
12345678            2020-05-31  0.144511  0.042008  0.716981               0   
23456789            2020-05-31  0.145709  0.033580  0.547170               0   

I am doing some analysis between the on_service group versus not on service.

I would like to sample the not on_service group so that it has the same Age distribution as the on_service group.

I have tried

weights = on_service_members["Age"] 
df = no_on_service_members.sample(weights = weights)

But I am getting the error "Invalid weights: weights sum to zero".

I think it is because it is not using the Age column to look up the weight? Or perhaps I am completely on the wrong track.

Tags: python, pandas

Solution


I believe I have found a solution, but it seems like something that should be part of the standard library and that I am missing.

import pandas as pd

def sample_with_distribution(source_of_distribution, source_to_sample, column_name):
    # Draw roughly as many rows in total as the reference frame has.
    size_to_sample = len(source_of_distribution)

    # Share of the reference rows that falls into each of 8 equal-width bins.
    bins = source_of_distribution[column_name].value_counts(bins=8, normalize=True)

    samples = []

    # .items() replaces .iteritems(), which was removed in pandas 2.0.
    for iv, bin_size in bins.items():
        # Candidate rows whose value lies in this bin's interval (left, right].
        m = source_to_sample[(source_to_sample[column_name] > iv.left) & (source_to_sample[column_name] <= iv.right)]
        how_many = int(bin_size * size_to_sample)
        if how_many > len(m):
            print("ISSUE: How many we want ", how_many, " How big is it ", len(m))
            how_many = len(m)

        samples.append(m.sample(n=how_many, random_state=100))

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    return pd.concat(samples)

It does seem to work. When I run a t-test and plot the result with a KDE, it looks like I get what I want.
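For a quick numeric check alongside the t-test and KDE plot, one option is to compare the share of rows per bin in both frames. The sketch below is only an illustration: `compare_binned_distribution` is a hypothetical helper name, and it assumes both frames carry the same numeric column.

```python
import pandas as pd


def compare_binned_distribution(reference, sampled, column_name, bins=8):
    """Return the share of rows per bin for both frames, side by side."""
    # Bin edges come from the reference frame so both frames share the same bins.
    _, edges = pd.cut(reference[column_name], bins=bins, retbins=True)

    ref_share = (
        pd.cut(reference[column_name], bins=edges)
        .value_counts(normalize=True)
        .sort_index()
    )
    smp_share = (
        pd.cut(sampled[column_name], bins=edges)
        .value_counts(normalize=True)
        .sort_index()
    )

    out = pd.DataFrame({"reference": ref_share, "sampled": smp_share})
    # Small values in every bin suggest the two distributions are close.
    out["abs_diff"] = (out["reference"] - out["sampled"]).abs()
    return out
```

Calling it with the on_service frame as `reference` and the result of `sample_with_distribution` as `sampled` gives one row per bin. Rows of `sampled` that fall outside the reference range are left out of the shares.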

[Figure: distribution match]
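On the error in the question: according to the pandas documentation, a Series passed as `weights` to `sample` is aligned with the sampled frame on its index, and rows with no matching index label get weight zero. The on_service rows share no index labels with the other group, so every weight ends up zero. If a weights-based approach is still preferred, the sketch below is one possible way to do it, not the method used above. `sample_with_weights` is a hypothetical name, and the 8 bins and `random_state=100` are simply carried over from the answer.

```python
import numpy as np
import pandas as pd


def sample_with_weights(source_of_distribution, source_to_sample, column_name,
                        n, bins=8, random_state=100):
    """Sample n rows so column_name roughly follows the source's binned distribution."""
    src_values = source_of_distribution[column_name].dropna()

    # Shared bin edges, derived from the frame whose distribution we want to copy.
    src_codes, edges = pd.cut(src_values, bins=bins, labels=False, retbins=True)
    n_bins = len(edges) - 1
    target_share = np.bincount(src_codes.astype(int), minlength=n_bins) / len(src_codes)

    # Bin number of each candidate row; rows outside the source's range get NaN.
    own_codes = pd.cut(source_to_sample[column_name], bins=edges, labels=False)
    inside = own_codes.notna().to_numpy()
    own_idx = own_codes[inside].astype(int).to_numpy()
    own_share = np.bincount(own_idx, minlength=n_bins) / max(inside.sum(), 1)

    # Per-bin weight: how under- or over-represented the bin is among candidates.
    ratio = np.divide(target_share, own_share,
                      out=np.zeros(n_bins), where=own_share > 0)

    # Positional array with one weight per candidate row, so no index alignment happens.
    weights = np.zeros(len(source_to_sample))
    weights[inside] = ratio[own_idx]
    return source_to_sample.sample(n=n, weights=weights, random_state=random_state)
```

With the frames from the question this would be called as `sample_with_weights(on_service_members, no_on_service_members, "Age", n=len(on_service_members))`. Because the draw is without replacement, the match is only approximate, and it gets worse when `n` is a large fraction of the candidate rows.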

