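python - Pandas dataframe grouping and counting with validation in Python
Question
I am currently working on an analysis that needs to do the following:
1. I need to count, per year, whether there are 4 entries in 'No. People' for 2018 and 2019. Rows with the same date should be excluded (no matter which one).
It should look like this: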
Year  Gender       No. People
18    Men          11
      Woman        8
      Not Applied  3
19    Men          14
      Woman        5
      Not Applied  0
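The No. People column shows the count of No. People entries.
2. Check, by gender, whether within the last 10 months there are more than 6 entries in No. People within any 10-day period.
The result might look like this: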
Period                   Gender       Entries
01/23/2019 - 01/15/2019  Men          6
N/A                      Woman        N/A
N/A                      Not Applied  N/A
3. Check whether there are 11 entries in No. People over the last 3 months
Period                   Gender       Entries
12/20/2018 - 01/23/2019  Men          26
12/20/2018 - 01/23/2019  Woman        13
12/20/2018 - 12/26/2018  Not Applied  N/A
Somehow this looks complicated, which is why I am struggling with the code.
I started with the following code:
import pandas as pd
path = 'path'
filename = 'excel.xls'
final_path = path + '/' + filename
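ws_name = 'Sheet1'
# Load the sheet into df (this line was missing; df is used below)
df = pd.read_excel(final_path, sheet_name=ws_name)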
df.groupby(df['Date'].dt.year)['No. People'].agg(['count'])
But somehow I am struggling with either the results or errors.
The data in Excel looks like this:
Date Gender No. People
12/20/18 Men 4
12/21/18 Men 9
12/22/18 Men 3
12/23/18 Men 9
12/24/18 Men 6
12/25/18 Men 1
12/26/18 Men 3
12/27/18 Men 8
12/28/18 Men 3
12/29/18 Men 5
12/30/18 Men 8
12/31/18 Men
01/01/19 Men
01/02/19 Men
01/03/19 Men
01/04/19 Men 9
01/05/19 Men 7
01/06/19 Men 5
01/07/19 Men 1
01/08/19 Men 8
01/09/19 Men 5
01/10/19 Men 6
01/11/19 Men 9
01/12/19 Men 7
01/13/19 Men
01/14/19 Men
01/15/19 Men
01/16/19 Men
01/17/19 Men
01/18/19 Men
01/19/19 Men 6
01/20/19 Men 5
01/21/19 Men 2
01/22/19 Men 5
01/23/19 Men 1
12/20/18 Women 6
12/21/18 Women 6
12/22/18 Women 2
12/23/18 Women 2
12/24/18 Women 2
12/25/18 Women
12/26/18 Women
12/27/18 Women
12/28/18 Women 1
12/29/18 Women 1
12/30/18 Women 4
12/31/18 Women
01/01/19 Women
01/02/19 Women
01/03/19 Women
01/04/19 Women
01/05/19 Women
01/06/19 Women
01/07/19 Women
01/08/19 Women
01/09/19 Women
01/10/19 Women
01/11/19 Women
01/12/19 Women
01/13/19 Women
01/14/19 Women
01/15/19 Women
01/16/19 Women
01/17/19 Women
01/18/19 Women
01/19/19 Women 4
01/20/19 Women 6
01/21/19 Women 8
01/22/19 Women 9
01/23/19 Women 4
12/20/18 Not Applied 6
12/21/18 Not Applied 2
12/22/18 Not Applied 3
12/23/18 Not Applied
12/24/18 Not Applied
12/25/18 Not Applied
12/26/18 Not Applied
Answer
First, you can add Gender to the grouping as well
df['Date'] = pd.to_datetime(df['Date'])
df.groupby([df['Date'].dt.year, 'Gender'])['No. People'].agg(['count'])
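The year-and-gender count above can be tried on a small made-up sample. This is only a sketch: the `sample` frame and its values are hypothetical, and dropping repeated (Date, Gender) rows is just one possible reading of "the same date should be excluded". Note that `count()` skips empty cells, which is why empty rows do not inflate the totals.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same shape as the Excel sheet (not the real data).
sample = pd.DataFrame({
    "Date": ["12/30/18", "12/30/18", "12/31/18", "01/04/19",
             "01/05/19", "12/30/18", "01/01/19"],
    "Gender": ["Men", "Men", "Men", "Men", "Men", "Women", "Women"],
    "No. People": [8, 8, np.nan, 9, 7, 4, np.nan],
})
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%y")

# Assumption: "exclude the same date" means keep one row per (Date, Gender).
sample = sample.drop_duplicates(subset=["Date", "Gender"])

# count() ignores NaN, so only rows with a value in 'No. People' are counted.
yearly_counts = (
    sample.groupby([sample["Date"].dt.year.rename("Year"), "Gender"])["No. People"]
    .count()
)
print(yearly_counts)
```

A group that has rows but no values (Women in 2019 here) still shows up with a count of 0, which matches the "Not Applied 0" row in the expected output.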
For the second one, to group into 10-day periods, you can use the pandas Grouper class
df.sort_values(by=['Date'], ascending=False, inplace=True)
from_date = df.iloc[0]['Date'] - pd.DateOffset(months=10)
last_10_months = df[df.Date >= from_date]
count_people = last_10_months.groupby([pd.Grouper(key='Date', freq='10D'), 'Gender']).count()
count_people[count_people['No. People'] > 6]
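As a rough, self-contained illustration of the 10-day grouping, the sketch below uses a hypothetical `sample` frame (20 daily rows per gender, with Women mostly empty). It takes the latest date with `max()` instead of sorting, and counts only the 'No. People' column; both are small variations on the code above, not part of it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 20 consecutive days for each gender.
dates = pd.date_range("2018-12-20", periods=20, freq="D")
men = pd.DataFrame({"Date": dates, "Gender": "Men", "No. People": 5.0})
women = pd.DataFrame({"Date": dates, "Gender": "Women", "No. People": np.nan})
women.loc[:2, "No. People"] = [6.0, 6.0, 2.0]  # only the first 3 days have values
sample = pd.concat([men, women], ignore_index=True)

# Keep only rows from the 10 months before the latest date.
from_date = sample["Date"].max() - pd.DateOffset(months=10)
recent = sample[sample["Date"] >= from_date]

# Bin the dates into consecutive 10-day windows and count non-empty entries
# per window and gender.
counts = (
    recent.groupby([pd.Grouper(key="Date", freq="10D"), "Gender"])["No. People"]
    .count()
)

# Windows with more than 6 entries.
busy = counts[counts > 6]
print(busy)
```

The windows produced by `freq="10D"` are fixed, back-to-back bins rather than a sliding window, so a busy stretch that straddles two bins can be split between them.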
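The third works the same way, using a 3-month window
df.sort_values(by=['Date'], ascending=False, inplace=True)
from_date = df.iloc[0]['Date'] - pd.DateOffset(months=3)
last_3_months = df[df.Date >= from_date]
# Count within the 3-month window and keep the result for the filter below
count_people = last_3_months.groupby(['Gender']).count()
count_people[count_people['No. People'] > 11]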
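To get closer to the Period / Gender / Entries layout from the question, one option is named aggregation, which returns the first date, last date, and entry count per gender in one pass. This is a sketch on a hypothetical `sample` frame; the column names `Start`, `End`, and `Entries` are made up for illustration, and the `> 11` threshold simply mirrors the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Men have a value every day, Not Applied only on 3 days.
dates = pd.date_range("2018-12-20", "2019-01-23", freq="D")
men = pd.DataFrame({"Date": dates, "Gender": "Men", "No. People": 5.0})
other = pd.DataFrame({"Date": dates[:7], "Gender": "Not Applied", "No. People": np.nan})
other.loc[:2, "No. People"] = [6.0, 2.0, 3.0]
sample = pd.concat([men, other], ignore_index=True)

# Keep only rows from the 3 months before the latest date.
from_date = sample["Date"].max() - pd.DateOffset(months=3)
recent = sample[sample["Date"] >= from_date]

# One row per gender: period boundaries plus the number of non-empty entries.
summary = recent.groupby("Gender").agg(
    Start=("Date", "min"),
    End=("Date", "max"),
    Entries=("No. People", "count"),
)

# Genders with more than 11 entries in the window.
enough = summary[summary["Entries"] > 11]
print(enough)
```

If "11 entries" is meant to include exactly 11, the comparison would need to be `>=` instead.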