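python - For each unique value in a pandas DataFrame column, how can I randomly select a proportion of rows?
Question
Python newbie here. Imagine a csv file that looks something like this:
(...except that in real life there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)
What I want to do is randomly flag 10% of each Person's rows and mark them in a new column. I came up with a ridiculously convoluted way to do this: it involved creating a helper column of random numbers and all kinds of unnecessarily complicated guesswork. It worked, but it was crazy. More recently, I came up with this: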
import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''
names = list(df['Person'].unique()) #gets list of unique names
for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp)/10) # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!' #a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how = 'left', on = ['Person','data'])
    df['temp'] = [f"{a} {b}" for a,b in zip(df['selected_x'],df['selected_y'])]
    #Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    #df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person','data','temp']]
    df = df.rename(columns = {'temp':'selected'})
df['selected'] = df['selected'].str.replace('nan','').str.strip() #cleans up the column
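As you can see, I basically extract a temporary DataFrame for each Person, use DF.sample(number) to do the randomizing, and then use DF.merge to get the "marked" rows back into the original DataFrame. It involves iterating through a list to create each temporary DataFrame... and my understanding is that iterating is kind of lame.
There has to be a more Pythonic, vectorized way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice are much appreciated.
EDIT: Here is another approach that avoids merge... but it is still pretty clunky: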
import math
import numpy as np
import pandas as pd

#SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #shuffle (optional--just to show order doesn't matter)
percent = 10 #CHANGE AS NEEDED
#Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1) #this shuffles data, just to show order doesn't matter
#CREATE A HELPER LIST of [name, row count] pairs
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons','rand']]
    lim = math.ceil(len(df_temp) * percent*0.01)
    #append the threshold: the lim-th largest random number for this person
    row.append(df_temp.nlargest(lim,'rand').iloc[-1]['rand'])
def flag(name,num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'
df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
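For comparison, the same helper-column idea can be written without the explicit loops or the row-wise apply. The following is only a sketch under stated assumptions: the function name flag_top_fraction and its seed parameter are hypothetical, not part of the original post. It draws one random number per row, ranks those numbers within each group, and flags the rows whose rank falls within the top ceil(percent%) of their group.

```python
import numpy as np
import pandas as pd


def flag_top_fraction(df, group_col, percent, seed=None):
    """Return a 'yes'/'no' Series flagging ceil(percent %) of the rows in each group.

    Hypothetical helper for illustration; not from the original post.
    """
    rng = np.random.default_rng(seed)
    # Scratch frame with a fresh 0..n-1 index, so the caller's index
    # (shuffled or even duplicated) does not affect the computation.
    tmp = pd.DataFrame({
        "group": df[group_col].to_numpy(),
        "rand": rng.random(len(df)),
    })
    grouped = tmp.groupby("group")["rand"]
    # Rank the random numbers within each group, largest first (1 = largest).
    rank = grouped.rank(method="first", ascending=False)
    # Number of rows to flag per group, broadcast back to every row.
    limit = np.ceil(grouped.transform("count") * percent / 100)
    flags = np.where(rank <= limit, "yes", "no")
    return pd.Series(flags, index=df.index, name="flag")


# Usage with test data shaped like the example above.
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
df = pd.DataFrame({'persons': y, 'data': ['xyz'] * len(y)}).sample(frac=1)
df['flag'] = flag_top_fraction(df, 'persons', percent=10)
```

Because the ranks within a group are distinct integers, exactly ceil(n * percent / 100) rows are flagged in a group of n rows, which matches the math.ceil rule used in the loop version.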
Solution
If I understand you correctly, you can achieve this with the following:
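import pandas as pd
df = pd.DataFrame(data={'persons':['A']*10 + ['B']*10, 'col_1':[2]*20})
percentage_to_flag = 0.5
# For each group, mark the rows whose index label appears in a random sample of that group.
# Note: assigning the result back assumes df has the default 0..n-1 index and its rows are already ordered by 'persons'.
a = df.groupby(['persons'])['col_1'].apply(lambda x: pd.Series(x.index.isin(x.sample(frac=percentage_to_flag, random_state= 5, replace=False).index))).reset_index(drop=True)
df['flagged'] = a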
Input:
   persons  col_1
0        A      2
1        A      2
2        A      2
3        A      2
4        A      2
5        A      2
6        A      2
7        A      2
8        A      2
9        A      2
10       B      2
11       B      2
12       B      2
13       B      2
14       B      2
15       B      2
16       B      2
17       B      2
18       B      2
19       B      2
Output with 50% flagged rows in each group:
   persons  col_1  flagged
0        A      2    False
1        A      2    False
2        A      2     True
3        A      2    False
4        A      2     True
5        A      2     True
6        A      2    False
7        A      2     True
8        A      2    False
9        A      2     True
10       B      2    False
11       B      2    False
12       B      2     True
13       B      2    False
14       B      2     True
15       B      2     True
16       B      2    False
17       B      2     True
18       B      2    False
19       B      2     True
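If the DataFrame is shuffled or does not carry the default index, a variant built on groupby(...).sample may be easier to reason about. This is a sketch, not the answerer's method: it assumes pandas 1.1 or later (where DataFrameGroupBy.sample is available), and the helper name flag_fraction_per_group and its seed parameter are made up for illustration.

```python
import pandas as pd


def flag_fraction_per_group(df, group_col, frac, seed=None):
    """Return a boolean Series that is True for a random `frac` of the rows in each group.

    Hypothetical helper for illustration; assumes pandas >= 1.1.
    """
    # Work on a scratch copy with a fresh 0..n-1 index so that shuffled or
    # duplicated index labels in df cannot cause mismatches.
    tmp = df[[group_col]].reset_index(drop=True)
    # Draw the requested fraction of rows from every group in one call.
    sampled = tmp.groupby(group_col).sample(frac=frac, random_state=seed)
    # A row is flagged if its position was drawn.
    mask = tmp.index.isin(sampled.index)
    return pd.Series(mask, index=df.index, name="flagged")


df = pd.DataFrame(data={'persons': ['A'] * 10 + ['B'] * 10, 'col_1': [2] * 20})
df['flagged'] = flag_fraction_per_group(df, 'persons', frac=0.5, seed=5)
```

With frac, pandas rounds the per-group row count itself, so for group sizes that are not a multiple of 10 the count may differ by one row from the int(len/10) or math.ceil rules used in the question. To get the 'bingo!' style labels instead of booleans, the mask can be mapped afterwards, for example with df['flagged'].map({True: 'bingo!', False: ''}).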