Creating a pandas pivot table to count number of times items appear in a list together

Problem description

I am trying to count the number of times users look at pages in the same session.

I am starting with a data frame listing user_ids and the page slugs they have visited:

user_id page_view_page_slug
1       slug1
1       slug2
1       slug3
1       slug4
2       slug5
2       slug3
2       slug2
2       slug1

What I am looking to get is a pivot table that counts, for each pair of slugs, the user_ids that viewed both:

. slug1 slug2 slug3 slug4 slug5
slug1 2 2 2 1 1
slug2 2 2 2 1 1
slug3 2 2 2 1 1
slug4 1 1 1 1 0
slug5 1 1 1 0 1

I realize the table is symmetric (slug1 vs. slug2 holds the same count as slug2 vs. slug1), but I can't think of a better way. So far I have done a listagg:

def listagg(df, grouping_idx):
    return df.groupby(grouping_idx).agg(list)
new_df = listagg(df,'user_id')

Returning:

          page_view_page_slug
user_id                                                   
1        [slug1, slug2, slug3, slug4]
2        [slug5, slug3, slug2, slug2]
7        [slug6, slug4, slug7]
9        [slug3, slug5, slug1]

But I am struggling to think of a loop that counts when items appear in a list together (regardless of order) and of how to store the result. I also do not know how I would get this into a pivot-table format.

Tags: python, pandas, numpy, pivot-table

Solution


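Here is another way, using numpy broadcasting. Compare each value in user_id with every other value to get a boolean matrix, then create a new dataframe from this matrix with both index and columns set to page_view_page_slug. Finally, sum over the duplicate labels (level=0) along axis=0 and axis=1 to count the user_ids in each cross section of slugs: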

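import pandas as pd

a = df['user_id'].values
i = list(df['page_view_page_slug'])

# DataFrame.sum(level=...) was removed in pandas 2.0; group on the labels instead
pd.DataFrame(a[:, None] == a, index=i, columns=i)\
   .groupby(level=0).sum()\
   .T.groupby(level=0).sum().T.astype(int)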

       slug1  slug2  slug3  slug4  slug5
slug1      2      2      2      1      1
slug2      2      2      2      1      1
slug3      2      2      2      1      1
slug4      1      1      1      1      0
slug5      1      1      1      0      1
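
To try the approach end to end, the following is a minimal self-contained sketch that rebuilds the sample frame from the question and runs the broadcasting steps one at a time. It is written with groupby(level=0), since sum(level=0) is not available in pandas 2.0 and later. The crosstab cross-check at the end is an assumption of this sketch and not part of the original answer, and the names same_user, by_broadcast, indicator and by_crosstab are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "page_view_page_slug": ["slug1", "slug2", "slug3", "slug4",
                            "slug5", "slug3", "slug2", "slug1"],
})

# Step 1: broadcasting compares every row's user_id with every other row's.
a = df["user_id"].to_numpy()
same_user = a[:, None] == a          # boolean matrix, one row/column per visit

# Step 2: label rows and columns by slug, then collapse duplicate labels.
i = df["page_view_page_slug"].tolist()
by_broadcast = (
    pd.DataFrame(same_user, index=i, columns=i)
    .groupby(level=0).sum()          # collapse duplicate row labels
    .T.groupby(level=0).sum().T      # collapse duplicate column labels
    .astype(int)
)

# Cross-check (not from the original answer): a 0/1 user-by-slug
# indicator matrix multiplied by its own transpose.
indicator = pd.crosstab(df["user_id"], df["page_view_page_slug"]).clip(upper=1)
by_crosstab = indicator.T @ indicator

print(by_broadcast)
```

On this sample both results agree. They can differ when a user views the same slug more than once: the broadcasting version counts matching pairs of rows, while the clipped crosstab counts each user at most once per slug pair, so it may be worth deduplicating (user_id, slug) rows first if distinct users are what should be counted.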
