python - Get n users from pandas dataframe by id
Problem description
This is a mock dataframe:
import pandas as pd

df_test = pd.DataFrame({
    'ID': [8972685, 8972685, 8972685, 8972685, 8972685, 8972685, 9834561, 9834561, 9834561, 9834561, 9834561, 9834561],
    'POST': ['texteghteh', 'tethrtxt', 'tetrhrtxt', 'terthtrxt', 'teetrwxt', 'twetrhext', 'tethdxt', 'texthdt', 'texdhtrt', 'texdthdt', 'tdghgdhtext', 'tthtdext']
})
The real dataframe contains approximately 90,000 distinct users and 28,000,000 rows, where each row is a post made by some user. I want to pick n users from the dataframe along with their posts. For example, if I pick the first 500 users and each has 1000 posts, I should get 500,000 rows.
I previously asked this and it was instantly marked as a duplicate, which I don't think it is. There is another answer, but I did not manage to apply those solutions successfully. I need it the other way round: the first n groups, regardless of how many entries each has.
I tried this:
df_test.groupby('ID')['POST'].head(2)
which yields:
0 texteghteh
1 tethrtxt
6 tethdxt
7 texthdt
Name: POST, dtype: object
This gives me the first two posts from each user. What I want instead is the first 2 users together with all of their posts.
Solution
It depends on how you want to sample the users and their posts. For example, if you want to get the first 500 users with at least 1000 posts:
n_users, min_posts = 500, 1000
groups = df_test.groupby('ID')
sizes = groups.size()
# get the first n_users with at least min_posts
users = sizes[sizes>=min_posts].head(n_users).index
Now, if you don't want to get the first users, but rather sample them randomly, you can do:
users = sizes[sizes>=min_posts].sample(n_users).index
Once you have the users, you can filter with isin:
df_test[df_test['ID'].isin(users)]
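As a minimal sketch, the steps above can be run end to end on the mock dataframe from the question. The thresholds `n_users = 1` and `min_posts = 6` are illustrative values chosen to fit the small frame. Note that `groupby` sorts group keys by default, so `head` here takes the smallest IDs rather than the first to appear. The last two lines show one possible way to pick users in order of appearance regardless of post count, assuming that is what "first" should mean; the names `first_seen` and `subset_first_seen` are hypothetical.

```python
import pandas as pd

df_test = pd.DataFrame({
    'ID': [8972685] * 6 + [9834561] * 6,
    'POST': ['texteghteh', 'tethrtxt', 'tetrhrtxt', 'terthtrxt', 'teetrwxt', 'twetrhext',
             'tethdxt', 'texthdt', 'texdhtrt', 'texdthdt', 'tdghgdhtext', 'tthtdext'],
})

n_users, min_posts = 1, 6  # illustrative values for the small mock frame

# Number of posts per user; the index is sorted by ID (groupby default)
sizes = df_test.groupby('ID').size()

# First n_users (by sorted ID) that have at least min_posts posts
users = sizes[sizes >= min_posts].head(n_users).index

# All posts of the selected users
subset = df_test[df_test['ID'].isin(users)]

# Alternative (assumption): first n users in order of appearance,
# with no minimum post count
first_seen = df_test['ID'].drop_duplicates().head(n_users)
subset_first_seen = df_test[df_test['ID'].isin(first_seen)]
```

On the real data, the same code should work with `n_users, min_posts = 500, 1000`; the `isin` filter keeps every post of each selected user, so the result can have more than `n_users * min_posts` rows.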
And you can use the same logic with either groupby().head() or groupby().sample() to sample this data. For example, to randomly sample min_posts posts for each of these users:
df_test[df_test['ID'].isin(users)].groupby('ID').sample(min_posts)
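If the random selection needs to be repeatable, both `Series.sample` and `groupby().sample()` accept a `random_state` argument. The sketch below uses a small hypothetical dataframe (the IDs 1, 2, 3 and the `post0`…`post10` values are made up for illustration) and assumes pandas 1.1 or later, where `groupby().sample()` is available.

```python
import pandas as pd

# Hypothetical data: user 1 has 4 posts, user 2 has 2, user 3 has 5
df_test = pd.DataFrame({
    'ID': [1] * 4 + [2] * 2 + [3] * 5,
    'POST': [f'post{i}' for i in range(11)],
})

n_users, min_posts = 2, 3  # illustrative values

sizes = df_test.groupby('ID').size()

# Users with at least min_posts posts (here: users 1 and 3)
eligible = sizes[sizes >= min_posts]

# Randomly pick n_users of them; random_state makes the draw repeatable
users = eligible.sample(n_users, random_state=0).index

# Randomly draw exactly min_posts posts for each selected user
sampled = (
    df_test[df_test['ID'].isin(users)]
    .groupby('ID')
    .sample(n=min_posts, random_state=0)
)
```

Because the users were filtered to those with at least `min_posts` posts, sampling without replacement does not run out of rows in any group.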