For each unique value in a pandas DataFrame column, how can I randomly select a proportion of rows?

Question

Python newbie here. Imagine a csv file that looks like this:

[Image not available: sample CSV with a Person column and a data column]

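(...except that in real life, there are 20 different names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)

What I want to do is randomly flag 10% of each Person's rows and mark this in a new column. I came up with a ridiculously convoluted way to do this: it involved creating a helper column of random numbers and all sorts of unnecessarily complicated guessing games. It worked, but it was crazy. More recently, I came up with this: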

import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''

names = list(df['Person'].unique())  # gets list of unique names

for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp) / 10)  # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!'  # a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how='left', on=['Person', 'data'])
    df['temp'] = [f"{a} {b}" for a, b in zip(df['selected_x'], df['selected_y'])]
    # Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    # df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person', 'data', 'temp']]
    df = df.rename(columns={'temp': 'selected'})

df['selected'] = df['selected'].str.replace('nan', '').str.strip()  # cleans up the column

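As you can see, I basically extract a temporary DataFrame for each Person, use DF.sample(number) to do the randomizing, and then use DF.merge to get the "marked" rows back into the original DataFrame. It involves iterating through a list to create each temporary DataFrame ... and my understanding is that iterating is kind of lame.

There has to be a more Pythonic, vectorized way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice are much appreciated.

EDIT: Here is another way that avoids merge... but it is still pretty clunky: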

import math

import numpy as np
import pandas as pd

# SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data': z})
df = df.sample(frac=1)  # shuffle (optional--just to show order doesn't matter)
percent = 10  # CHANGE AS NEEDED

# Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1)  # this shuffles data, just to show order doesn't matter

# CREATE A HELPER LIST of [name, row count] pairs
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons', 'rand']]
    lim = math.ceil(len(df_temp) * percent * 0.01)
    # append this person's cutoff: the smallest of their top `lim` random numbers
    row.append(df_temp.nlargest(lim, 'rand').iloc[-1]['rand'])

def flag(name, num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'

df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
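
The helper-column idea in the snippet above can also be written without the explicit loop or the row-wise apply. The following is only a sketch, not part of the original question: it assumes smaller made-up group sizes, and it swaps in a per-group rank of the random numbers in place of the helper list. It should flag the same rows the loop would, as long as no two random numbers tie.

```python
import numpy as np
import pandas as pd

# Hypothetical test data, smaller than in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "persons": ["Alex"] * 232 + ["Doug"] * 341 + ["Chuck"] * 201,
    "data": "xyz",
})
percent = 10

# Helper column of random numbers, as in the snippet above.
df["rand"] = rng.random(len(df))

grouped = df.groupby("persons")["rand"]
# Rank rows within each person; rank 1 is the largest random number.
rank = grouped.rank(method="first", ascending=False)
# Rows to flag per person: ceil(group size * percent / 100), broadcast to every row.
limit = np.ceil(grouped.transform("count") * percent / 100)

df["flag"] = np.where(rank <= limit, "yes", "no")
```

Because both `rank` and `limit` are aligned to the original index, the row order of `df` does not matter here.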

Tags: python, pandas, dataframe, random, vectorization

Answer

If I understand you correctly, you can achieve this using:

import pandas as pd

df = pd.DataFrame(data={'persons':['A']*10 + ['B']*10, 'col_1':[2]*20})
percentage_to_flag = 0.5
# Per group: mark which rows fall in a random sample (assumes df is sorted by 'persons' with a default index)
a = df.groupby(['persons'])['col_1'].apply(lambda x: pd.Series(x.index.isin(x.sample(frac=percentage_to_flag, random_state= 5, replace=False).index))).reset_index(drop=True)
df['flagged'] = a

Input:

       persons  col_1
    0        A      2
    1        A      2
    2        A      2
    3        A      2
    4        A      2
    5        A      2
    6        A      2
    7        A      2
    8        A      2
    9        A      2
    10       B      2
    11       B      2
    12       B      2
    13       B      2
    14       B      2
    15       B      2
    16       B      2
    17       B      2
    18       B      2
    19       B      2

Output with 50% flagged rows in each group:

   persons  col_1  flagged
0        A      2    False
1        A      2    False
2        A      2     True
3        A      2    False
4        A      2     True
5        A      2     True
6        A      2    False
7        A      2     True
8        A      2    False
9        A      2     True
10       B      2    False
11       B      2    False
12       B      2     True
13       B      2    False
14       B      2     True
15       B      2     True
16       B      2    False
17       B      2     True
18       B      2    False
19       B      2     True
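
For data that is not sorted by group, one possible alternative is to sample within each group and then flag rows by index label. This is a minimal sketch under a few assumptions: the test data and the column name `flagged` are made up, the DataFrame index is unique, and pandas is version 1.1 or later (where `DataFrameGroupBy.sample` is available).

```python
import numpy as np
import pandas as pd

# Hypothetical test data: four people with different row counts.
counts = {"Alex": 230, "Doug": 340, "Chuck": 200, "Bob": 90}
df = pd.DataFrame({
    "persons": np.repeat(list(counts), list(counts.values())),
    "data": "xyz",
})
df = df.sample(frac=1, random_state=0)  # shuffle; index labels stay unique

frac = 0.10  # share of rows to flag per person

# Draw a random 10% of the rows within each group.
picked = df.groupby("persons").sample(frac=frac, random_state=42)

# Flag by index label, so the row order of df does not matter.
df["flagged"] = df.index.isin(picked.index)
```

The sampled rows keep their original index labels, which is why a plain `isin` check is enough and no merge is needed. Dropping `random_state` gives a different draw on each run.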
