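python - Using groupby on a DataFrame loaded from a .csv file and plotting the result
Problem description
I am trying to build a view of the data that gives the total number of births for each month of each year. A pandas date needs to be generated for every row. Each row contains the year and month values from the .csv file. Add a "day" column to this DataFrame and set the day value to 1, then use the year, month and day columns to add a "Date" column.
My problem is the logic of using groupby to get the total births for each month of EACH YEAR. A pandas date has to be generated for each row, with the row holding a year value and a month value, plus a day column added on top. That information is then used to build a "Date" column, and the total births per month for each year are plotted.
My plot currently shows no data at all, only the empty figure frame.
Here is my code so far: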
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (8.0, 3.0)
births = pd.read_csv("births.csv")
births = births.drop(births[births.day == 99].index)
births = births.drop(births[births.births < 500].index)
births['day'].fillna(1, inplace=True)
births['day'] = births['day'].astype('int32')
births.dtypes
births['Date'] = pd.to_datetime(births[['year','month','day']])
print(births)
newbirth = births[births['day'] == 31]
start = pd.to_datetime(births['Date'].iloc[0])
end = pd.to_datetime(newbirth['Date'].iloc[-1])
births1 = births[births['gender'] == 'M']
births2 = births[births['gender'] == 'F']
births1.reset_index(inplace = True)
births2.reset_index(inplace = True)
births.reset_index()
#new dataframe
df = pd.DataFrame(births1['Date'])
df['births'] = births1['births'] + births2['births']
print(df)
#newplot
ax1 = plt.subplot(3,1,1)
ax2 = plt.subplot(3,1,3)
start1 = pd.to_datetime("01-01-1984")
end1 = pd.to_datetime("12-31-1986")
start2 = pd.to_datetime("01-01-1984")
end2 = pd.to_datetime("12-31-1986")
adjusted_268 = pd.Timedelta('268 days')
ax1.plot(df['Date'],df['births'])
ax2.plot(df['Date'],df['births'])
ax1.set_xlim(start1, end1)
ax1.set_ylim(8000,12000)
ax1.set_title("Total Daily Births")
ax1.set_ylabel("Daily Births")
ax1.tick_params(axis='x', rotation=45)
ax2.set_xlim(start2, (end2-adjusted_268))
ax2.set_ylim(8000,12000)
ax2.set_title("Likely Conception Dates")
ax2.set_ylabel("Conception Events")
ax2.tick_params(axis='x', rotation=45)
#newdf needed
#newplot needed
new_df = births.groupby('month').sum()
print(new_df)
ax3 = plt.subplot()
ax3.plot(new_df)
ax3.set_ylim(240000,380000)
ax3.set_title("Total Births Per Month")
ax3.set_ylabel("Births")
ax3.tick_params(axis='x', rotation=45)
This is what the DataFrame read from the .csv file looks like when printed, after I updated it with pandas:
year month day gender births Date
0 1969 1 1 F 4046 1969-01-01
1 1969 1 1 M 4440 1969-01-01
2 1969 1 2 F 4454 1969-01-02
3 1969 1 2 M 4548 1969-01-02
4 1969 1 3 F 4548 1969-01-03
5 1969 1 3 M 4994 1969-01-03
6 1969 1 4 F 4440 1969-01-04
7 1969 1 4 M 4520 1969-01-04
8 1969 1 5 F 4192 1969-01-05
9 1969 1 5 M 4198 1969-01-05
10 1969 1 6 F 4710 1969-01-06
11 1969 1 6 M 4850 1969-01-06
12 1969 1 7 F 4646 1969-01-07
13 1969 1 7 M 5092 1969-01-07
14 1969 1 8 F 4800 1969-01-08
15 1969 1 8 M 4934 1969-01-08
16 1969 1 9 F 4592 1969-01-09
17 1969 1 9 M 4842 1969-01-09
18 1969 1 10 F 4852 1969-01-10
19 1969 1 10 M 5190 1969-01-10
20 1969 1 11 F 4580 1969-01-11
21 1969 1 11 M 4598 1969-01-11
22 1969 1 12 F 4126 1969-01-12
23 1969 1 12 M 4324 1969-01-12
24 1969 1 13 F 4758 1969-01-13
25 1969 1 13 M 5076 1969-01-13
26 1969 1 14 F 5070 1969-01-14
27 1969 1 14 M 5296 1969-01-14
28 1969 1 15 F 4798 1969-01-15
29 1969 1 15 M 5096 1969-01-15
... ... ... ... ... ... ...
15517 2007 10 1 F 180912 2007-10-01
15518 2007 10 1 M 189157 2007-10-01
15519 2007 11 1 F 173513 2007-11-01
15520 2007 11 1 M 180814 2007-11-01
15521 2007 12 1 F 173787 2007-12-01
15522 2007 12 1 M 181426 2007-12-01
15523 2008 1 1 F 174255 2008-01-01
15524 2008 1 1 M 182789 2008-01-01
15525 2008 2 1 F 165669 2008-02-01
15526 2008 2 1 M 173434 2008-02-01
15527 2008 3 1 F 172053 2008-03-01
15528 2008 3 1 M 179129 2008-03-01
15529 2008 4 1 F 169585 2008-04-01
15530 2008 4 1 M 177399 2008-04-01
15531 2008 5 1 F 173141 2008-05-01
15532 2008 5 1 M 182294 2008-05-01
15533 2008 6 1 F 169958 2008-06-01
15534 2008 6 1 M 179267 2008-06-01
15535 2008 7 1 F 183391 2008-07-01
15536 2008 7 1 M 192714 2008-07-01
15537 2008 8 1 F 182713 2008-08-01
15538 2008 8 1 M 191315 2008-08-01
15539 2008 9 1 F 179696 2008-09-01
15540 2008 9 1 M 188964 2008-09-01
15541 2008 10 1 F 175314 2008-10-01
15542 2008 10 1 M 183219 2008-10-01
15543 2008 11 1 F 158939 2008-11-01
15544 2008 11 1 M 165468 2008-11-01
15545 2008 12 1 F 173215 2008-12-01
15546 2008 12 1 M 181235 2008-12-01
Solution
Try this:
Generate random data:
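import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Random daily counts stand in for the real births data
np.random.seed(2019)
rng = pd.date_range('1969-01-01', '2008-12-01', freq='D')
df = pd.DataFrame({'birth': np.random.randint(1000, size=len(rng))}, index=rng)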
Printing df:
birth
1969-01-01 72
1969-01-02 370
1969-01-03 159
1969-01-04 805
1969-01-05 728
1969-01-06 190
1969-01-07 758
1969-01-08 24
1969-01-09 797
... ...
2008-11-02 222
2008-11-03 409
2008-11-04 65
2008-11-05 451
2008-11-06 369
2008-11-07 815
2008-11-08 295
2008-11-09 232
Then create the year, month and day columns:
df['month'] = df.index.month
df['day'] = df.index.day
df['year'] = df.index.year
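The question's data starts from year and month columns rather than from a DatetimeIndex, so there the conversion goes the other way. A minimal sketch, assuming a frame laid out like births.csv (the sample values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample in the same layout as births.csv (values are made up)
births = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1969],
    "month":  [1, 1, 2, 2],
    "gender": ["F", "M", "F", "M"],
    "births": [100, 110, 90, 95],
})

# Each row is treated as a monthly figure, so pin the day to the 1st
births["day"] = 1

# pd.to_datetime assembles a date from columns named year, month and day
births["Date"] = pd.to_datetime(births[["year", "month", "day"]])
print(births)
```

The column names must be exactly year, month and day for this form of pd.to_datetime to work.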
Then group by year and month and take the sum of births:
df_grouped = df.groupby(['year', 'month'])['birth'].sum()
The grouped result:
year month
1969 1 14212
2 14982
3 15063
4 12726
5 14131
6 16597
7 14487
8 12517
9 14483
10 17620
11 13883
12 17778
1970 1 15345
2 14524
3 18429
4 12981
5 16855
6 17135
7 18174
8 17031
9 14734
10 15743
11 12761
12 18428
1971 1 15063
2 17224
3 15088
4 17015
5 17032
6 17275
...
2006 7 14718
8 13212
9 17389
10 13850
11 13154
12 14126
2007 1 15762
2 12744
3 17176
4 12777
5 16254
6 16297
7 16038
8 15067
9 13724
10 14386
11 15701
12 17136
2008 1 16047
2 15094
3 14539
4 18662
5 16575
6 13553
7 17107
8 15824
9 15096
10 15139
11 15419
12 250
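The same grouping should carry over to the question's layout, where each date has one row per gender. A small sketch under that assumption, again with made-up sample values, showing that grouping on both keys adds the F and M rows together for each month of each year:

```python
import pandas as pd

# Hypothetical sample: two genders per month, two years (values are made up)
births = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1969, 1970, 1970],
    "month":  [1, 1, 2, 2, 1, 1],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "births": [100, 110, 90, 95, 120, 130],
})

# Grouping on both keys keeps January 1969 separate from January 1970;
# selecting ["births"] avoids trying to sum the gender strings
monthly = births.groupby(["year", "month"], as_index=False)["births"].sum()
print(monthly)
```

Grouping on 'month' alone, as in the question's code, would merge every January across all years into a single total.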
Now let's plot:
fig, ax = plt.subplots(figsize=(15,7))
# Pass ax so the lines are drawn on the figure created above
df_grouped.unstack().plot(ax=ax)
plt.show()
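If a single line over time is preferred, one option is to rebuild a "Date" column from the grouped keys and plot against it. This is a sketch rather than the only way to do it; the sample frame and the name monthly are assumptions for illustration, and the Agg backend line is only there so the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample in the births.csv layout (values are made up)
births = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1969, 1970, 1970],
    "month":  [1, 1, 2, 2, 1, 1],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "births": [100, 110, 90, 95, 120, 130],
})

# Total per month of each year, then one date per group (day fixed to 1)
monthly = births.groupby(["year", "month"], as_index=False)["births"].sum()
monthly["Date"] = pd.to_datetime(monthly[["year", "month"]].assign(day=1))

# Draw on an explicit Axes and pass x and y separately
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(monthly["Date"], monthly["births"])
ax.set_title("Total Births Per Month")
ax.set_ylabel("Births")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
# plt.show()  # uncomment when running interactively
```

No fixed y-limits are set here, so matplotlib scales the axis to the data. A hard-coded set_ylim range that the totals fall outside of is one possible reason for a plot that shows only an empty frame.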
If you like, you can put some work into making the plot prettier.
[Edit] I think a bar chart would look neater:
df_grouped.unstack().plot(kind='bar')
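With forty years of data the bar chart gets crowded, since it draws twelve bars per year. A possible variation, sketched here on random data restricted to three years (the date range and the name table are assumptions for illustration), is to plot only a slice of years:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove to get an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random daily counts for three years only (illustrative data)
np.random.seed(2019)
rng = pd.date_range("2006-01-01", "2008-12-31", freq="D")
df = pd.DataFrame({"birth": np.random.randint(1000, size=len(rng))}, index=rng)

# Group directly on the index parts instead of adding helper columns
df_grouped = (
    df.groupby([df.index.year, df.index.month])["birth"]
      .sum()
      .rename_axis(["year", "month"])
)

# unstack() gives one row per year and one column per month
table = df_grouped.unstack()

fig, ax = plt.subplots(figsize=(15, 7))
table.plot(kind="bar", ax=ax)
ax.set_ylabel("Births")
ax.legend(title="month", ncol=2)
# plt.show()  # uncomment when running interactively
```

Each group of bars is one year and each colour is one month; calling table.T.plot(kind="bar", ax=ax) instead would put the months on the x-axis with one colour per year.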