Group names by gender and count them

Problem description

The DataFrame looks like this:

df = pd.DataFrame({'name':["a"," b", "c","d", "e","a"," a", "a"," b", "c","d", "e","a"," a"],
           'gender': ["male", "female", "female", "female", "male","male","male","female","female", "female", "male","male","male", "male"],
          'year':[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019],
          'month':[1, 12, 4, 3, 6, 7, 2, 4, 5, 1, 12, 4, 3, 6 ],
          'count':[100, 30, 10, 90, 34, 100, 30, 10, 90, 34, 100, 30, 10, 90]})

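It lists the name, gender, birth year, birth month, and the number of people. For example, in January 2005 there were 100 babies named "a". I want to find the 10 most common names for males and for females, as follows: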

[Image: desired output table]

Here is my code:

f_data=data.groupby(['gender','name'])['count'].count().nlargest(10)

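But how can I get the table into the desired format? A name can appear in more than one row, so the counts have to be added up: for example, there were 100 "a" in 2005 and 100 "a" in 2009, and so on, so the total is 100 + 100 + ... The final result should say, for example, that "a" is male and that a total of 500 people were named "a" over 10 years, and so on.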

Tags: python, dataframe, jupyter-notebook, data-analysis

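Solution

Edit, version 2: removing duplicates

The result below contains no duplicates. Unfortunately, the values in 'name' had leading or trailing spaces, which took me a while to figure out (the sample data below has them removed). I use unique() and a [:10] slice to get the top 10.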

import pandas as pd
df = pd.DataFrame({'name':["a", "b", "c", "d", "e", "a", "a", "a", "b", "c", "d", "e", "a", "a"],
           'gender': ["male", "female", "female", "female", "male","male","male","female","female", "female", "male","male","male", "male"],
          'year':[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019],
          'month':[1, 12, 4, 3, 6, 7, 2, 4, 5, 1, 12, 4, 3, 6 ],
          'count':[100, 30, 10, 90, 34, 100, 30, 10, 90, 34, 100, 30, 10, 90]})

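# Sort so that, within each gender, the rows with the largest count come first.
df.sort_values(['gender','count'],ascending=False,inplace=True)
print(df)
# unique() keeps first-seen order, so the slice takes the first 10 distinct names.
male = df.loc[df['gender']=='male','name'].unique()[:10]
print(male)
female = df.loc[df['gender']=='female','name'].unique()[:10]
# Wrapping each array in pd.Series pads the shorter column with NaN.
results = pd.DataFrame({'Male':pd.Series(male),'Female':pd.Series(female)})
print(results)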

The output will be:

  Male Female
0    a      d
1    d      b
2    e      c
3  NaN      a
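
The question also asks for each name's counts to be added up across years, which ranking individual rows does not do. The following is a minimal sketch of one way to get those totals, not part of the original answer. It assumes the stray spaces are removed with str.strip() and the totals are computed with groupby(...).sum(); the variable names totals and top are illustrative, and the year and month columns are left out because the calculation does not use them.

```python
import pandas as pd

# Sample data from the question (year and month omitted; they are not needed here).
df = pd.DataFrame({
    'name': ["a", " b", "c", "d", "e", "a", " a", "a", " b", "c", "d", "e", "a", " a"],
    'gender': ["male", "female", "female", "female", "male", "male", "male",
               "female", "female", "female", "male", "male", "male", "male"],
    'count': [100, 30, 10, 90, 34, 100, 30, 10, 90, 34, 100, 30, 10, 90],
})

# Strip leading/trailing whitespace so " a" and "a" count as the same name.
df['name'] = df['name'].str.strip()

# Add up the counts for each (gender, name) pair across all rows.
totals = df.groupby(['gender', 'name'], as_index=False)['count'].sum()

# Order by total within each gender, then keep at most 10 names per gender.
top = (totals.sort_values(['gender', 'count'], ascending=[True, False])
             .groupby('gender')
             .head(10))
print(top)
```

With this sample data the male totals are a (330), d (100) and e (64), and the female totals are b (120), d (90), c (44) and a (10). Note that the question's own attempt used count(), which counts rows rather than adding up the 'count' column.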

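Edit, version 1

You don't need groupby. Instead, sort the values in descending order and use head(10) to get the top 10.

Here is how to do it. Note that we do not have 10 male names and 10 female names, so I use pd.Series to build the DataFrame. This also takes care of the length mismatch between the two lists (8 items for male, 6 for female).

The code looks like this: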

import pandas as pd
df = pd.DataFrame({'name':["a"," b", "c","d", "e","a"," a", "a"," b", "c","d", "e","a"," a"],
           'gender': ["male", "female", "female", "female", "male","male","male","female","female", "female", "male","male","male", "male"],
          'year':[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019],
          'month':[1, 12, 4, 3, 6, 7, 2, 4, 5, 1, 12, 4, 3, 6 ],
          'count':[100, 30, 10, 90, 34, 100, 30, 10, 90, 34, 100, 30, 10, 90]})

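# Sort so that, within each gender, the rows with the largest count come first.
df.sort_values(['gender','count'],ascending=False,inplace=True)
# Take up to 10 names per gender; duplicates are kept in this version.
male = df.loc[df['gender']=='male','name'].head(10).to_list()
female = df.loc[df['gender']=='female','name'].head(10).to_list()
# Wrapping each list in pd.Series pads the shorter column with NaN.
results = pd.DataFrame({'Male':pd.Series(male),'Female':pd.Series(female)})
print(results)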

The output is:

  Male Female
0    a      d
1    a      b
2    d      c
3    a      b
4    e      c
5    a      a
6    e    NaN
7    a    NaN
