Group and filter column values that contain a string, with aggregate functions?

Problem description

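How can I group by several columns, but only for rows where a given column's value contains a string anywhere in it?

For example, if I want to see the state and theatre name, but only a count of the titles that contain the word dog anywhere, how do I group and filter?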

State  Theatre  Title          TicketPrice
NY     B        Dog in heaven  5.50
NJ     C        Basketball     3.33
NY     B        Cats           9.00
NY     B        Hair of Dog    44.00
NY     B        Lions          22.00
NJ     C        Dog Land       4.99

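Grouping by State and Theatre, I only want a count of the titles where Dog appears as a word in the Title column, and a total for each group covering only the titles where Dog appears.

Thanks!

Tags: python, python-3.x, pandas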

Solution


Build a boolean mask for the Title column with Series.str.contains, convert it to integers (True -> 1, False -> 0), and count the 1s per group with sum:

df1 = (df.assign(count = df['Title'].str.contains('Dog').astype(int))
        .groupby(['State', 'Theatre'])['count']
        .sum()
        .reset_index())

print (df1)
  State Theatre  count
0    NJ       C      1
1    NY       B      2
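
The question asks for Dog "as a word", while a plain str.contains('Dog') is a case-sensitive substring test that would also match a title such as "Dogma". The following is a minimal sketch, assuming a word-boundary regular expression and case-insensitive matching are what is wanted; the DataFrame is rebuilt from the sample table, and the names is_dog and word_counts are illustrative only.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'State': ['NY', 'NJ', 'NY', 'NY', 'NY', 'NJ'],
    'Theatre': ['B', 'C', 'B', 'B', 'B', 'C'],
    'Title': ['Dog in heaven', 'Basketball', 'Cats',
              'Hair of Dog', 'Lions', 'Dog Land'],
    'TicketPrice': [5.50, 3.33, 9.00, 44.00, 22.00, 4.99],
})

# \b is a word boundary, so "Dogma" or "hotdog" would not match.
# case=False also accepts "dog" and "DOG"; na=False treats missing
# titles as non-matches instead of propagating NaN into the mask.
is_dog = df['Title'].str.contains(r'\bdog\b', case=False, regex=True, na=False)

word_counts = (df.assign(count=is_dog.astype(int))
                 .groupby(['State', 'Theatre'])['count']
                 .sum()
                 .reset_index())

print(word_counts)
```

On the sample data this gives the same counts as the substring version, because every match there is already a whole word.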

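If you also want to aggregate the TicketPrice column with sum for each group (note that this totals every row in the group, not only the titles containing Dog):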

df2 = (df.assign(count = df['Title'].str.contains('Dog').astype(int))
        .groupby(['State', 'Theatre'])[['count', 'TicketPrice']]
        .sum()
        .reset_index())

print (df2)
  State Theatre  count  TicketPrice
0    NJ       C      1         8.32
1    NY       B      2        80.50
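
The question asks for a total covering only the titles where Dog appears. One way to get that while still keeping every group in the result is to zero out the price on non-matching rows before summing. This is a sketch under that assumption; the column name dog_price and the variable dog_totals are hypothetical, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'State': ['NY', 'NJ', 'NY', 'NY', 'NY', 'NJ'],
    'Theatre': ['B', 'C', 'B', 'B', 'B', 'C'],
    'Title': ['Dog in heaven', 'Basketball', 'Cats',
              'Hair of Dog', 'Lions', 'Dog Land'],
    'TicketPrice': [5.50, 3.33, 9.00, 44.00, 22.00, 4.99],
})

# Boolean mask: True where the title contains "Dog".
is_dog = df['Title'].str.contains('Dog')

dog_totals = (df.assign(count=is_dog.astype(int),
                        # Keep the price on matching rows, use 0 elsewhere,
                        # so the group sum covers only Dog titles.
                        dog_price=df['TicketPrice'].where(is_dog, 0))
                .groupby(['State', 'Theatre'])[['count', 'dog_price']]
                .sum()
                .reset_index())

print(dog_totals)
```

A group with no Dog titles would still appear here, with a count of 0 and a dog_price of 0.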

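Alternatively, filter the rows first and then count them; note that groups with no matching rows are dropped from the result: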

df1 = (df[df['Title'].str.contains('Dog')]
        .groupby(['State', 'Theatre'])['TicketPrice']
        .size()
        .reset_index(name='count'))

print (df1)
  State Theatre  count
0    NJ       C      1
1    NY       B      2
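
The filter-first approach can also return the count and the Dog-only total in one pass. A minimal sketch, assuming pandas 0.25 or later for named aggregation; the output column names count and total and the variable dog_stats are illustrative choices.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'State': ['NY', 'NJ', 'NY', 'NY', 'NY', 'NJ'],
    'Theatre': ['B', 'C', 'B', 'B', 'B', 'C'],
    'Title': ['Dog in heaven', 'Basketball', 'Cats',
              'Hair of Dog', 'Lions', 'Dog Land'],
    'TicketPrice': [5.50, 3.33, 9.00, 44.00, 22.00, 4.99],
})

dog_stats = (df[df['Title'].str.contains('Dog')]   # keep only Dog titles
               .groupby(['State', 'Theatre'])
               .agg(count=('Title', 'size'),          # rows per group
                    total=('TicketPrice', 'sum'))     # price sum per group
               .reset_index())

print(dog_stats)
```

As with the filtered count, any group without a Dog title is absent from dog_stats rather than shown with zeros.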
