Get n users from pandas dataframe by id

Question

This is a mock dataframe.

import pandas as pd

df_test = pd.DataFrame({
  'ID': [8972685, 8972685, 8972685, 8972685, 8972685, 8972685, 9834561, 9834561, 9834561, 9834561, 9834561, 9834561],
  'POST': ['texteghteh', 'tethrtxt', 'tetrhrtxt', 'terthtrxt', 'teetrwxt', 'twetrhext', 'tethdxt', 'texthdt', 'texdhtrt', 'texdthdt', 'tdghgdhtext', 'tthtdext']
})

The full dataframe contains approximately 90,000 distinct users and 28,000,000 rows, where each row is a post made by some user. I want to pick n users from the dataframe along with their posts. For example, if I pick the first 500 users and each has 1,000 posts, I should obtain 500,000 rows.

I previously asked this and it was instantly marked as a duplicate, which I don't think it is. There is another answer, but I did not manage to apply those solutions successfully. I need it the other way round: the first n groups, regardless of how many entries each has.

I tried this:

df_test.groupby('ID')['POST'].head(2)

which yields:

0    texteghteh
1      tethrtxt
6       tethdxt
7       texthdt
Name: POST, dtype: object

This gives me the first two posts from each user. What I want instead is the first 2 users together with all of their posts.

Tags: python, pandas, dataframe

Answer


It depends on how you want to sample the users and their posts. For example, if you want to get the first 500 users with at least 1000 posts:

n_users, min_posts = 500, 1000
groups = df_test.groupby('ID')
sizes = groups.size()

# get the first n_users with at least min_posts
# (groupby sorts the keys by default, so "first" means the smallest IDs)
users = sizes[sizes>=min_posts].head(n_users).index
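
If "first" should mean order of appearance in the dataframe rather than sorted ID order, one option is to pass sort=False to groupby. The following is a minimal sketch on a small made-up frame (the IDs and posts are illustrative, not from the real data):

```python
import pandas as pd

# Illustrative frame: user 9834561 appears first but has the larger ID.
df = pd.DataFrame({
    'ID':   [9834561, 9834561, 8972685, 8972685, 8972685, 7000001],
    'POST': ['a', 'b', 'c', 'd', 'e', 'f'],
})

n_users, min_posts = 2, 2

# Default groupby sorts the group keys, so head() returns the smallest IDs.
sizes_sorted = df.groupby('ID').size()
users_sorted = sizes_sorted[sizes_sorted >= min_posts].head(n_users).index

# sort=False keeps the groups in order of first appearance in the frame.
sizes_seen = df.groupby('ID', sort=False).size()
users_seen = sizes_seen[sizes_seen >= min_posts].head(n_users).index
```

Both variants select the same two users here, but in a different order; on the real data with more eligible users than n_users they would generally pick different users.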

Now, if you don't want to get the first users, but rather sample them randomly, you can do:

users = sizes[sizes>=min_posts].sample(n_users).index
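
Note that sample raises a ValueError if fewer than n_users users meet the threshold (when sampling without replacement). A hedged sketch of one way to guard against that and make the draw reproducible, using a made-up frame and an arbitrary random_state:

```python
import pandas as pd

# Illustrative data: only users 1, 2 and 3 have at least 2 posts.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 3, 3, 3, 4],
    'POST': list('abcdefghi'),
})

n_users, min_posts = 5, 2

sizes = df.groupby('ID').size()
eligible = sizes[sizes >= min_posts]

# Cap the request at the number of eligible users so sample() cannot fail,
# and fix random_state (value chosen arbitrarily) for a repeatable draw.
k = min(n_users, len(eligible))
users = eligible.sample(k, random_state=0).index
```

Whether silently returning fewer users is acceptable depends on the use case; raising an explicit error may be preferable.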

Once you have the users, you can filter with isin:

df_test[df_test['ID'].isin(users)]
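
If no minimum post count is needed (the "first n groups regardless of entries" case from the question), a simpler alternative is to take the first n unique IDs directly. This is a sketch on made-up data, not the only way to do it:

```python
import pandas as pd

# Illustrative frame with three users and differing post counts.
df = pd.DataFrame({
    'ID':   [8972685, 8972685, 9834561, 9834561, 9834561, 1234567],
    'POST': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
})

n_users = 2

# unique() returns the IDs in order of first appearance.
first_users = df['ID'].unique()[:n_users]

# Keep every post of those users; row order and index are unchanged.
subset = df[df['ID'].isin(first_users)]
```

Here subset contains all five posts of the first two users and none from the third.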

The same logic applies to the posts: use either groupby().head() or groupby().sample() on the filtered data. For example, to randomly sample min_posts posts for each of these users (groupby().sample() requires pandas 1.1 or later):

df_test[df_test['ID'].isin(users)].groupby('ID').sample(min_posts)
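
To compare the two options side by side, here is a small end-to-end sketch on made-up data. With n_users users and exactly min_posts posts each, the result has n_users * min_posts rows, which matches the 500 * 1000 = 500000 rows expected in the question:

```python
import pandas as pd

# Illustrative data: users 1 and 2 have at least 3 posts, user 3 does not.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2, 2, 3],
    'POST': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})

n_users, min_posts = 2, 3

sizes = df.groupby('ID').size()
users = sizes[sizes >= min_posts].head(n_users).index
selected = df[df['ID'].isin(users)]

# First min_posts rows per user, kept in original row order.
first_posts = selected.groupby('ID').head(min_posts)

# Random min_posts rows per user; random_state is an arbitrary seed.
random_posts = selected.groupby('ID').sample(n=min_posts, random_state=0)
```

Both results have 6 rows (2 users times 3 posts); head() is deterministic, while sample() depends on the seed.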
