Using the groupby function with a DataFrame from a .csv file and plotting the result

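Problem description

I am trying to create a view of the data that gives the total number of births for each month of each year. A pandas date needs to be generated for every row. Each row contains the year and month values from the .csv file. I add a "day" column to this DataFrame and set the day value to 1, then use the year, month, and day columns to add a "Date" column.

My problem is the logic of using groupby to get the total births for each month of EACH YEAR. I need to generate a pandas date for every row, where each row holds a year value and a month value, and add a day column on top of that. That information is then used to build a "Date" column, and the total births per month per year are plotted.

My current plot has no data drawn in it, only the figure frame.

Here is my code so far: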

     import pandas as pd
     import matplotlib
     import matplotlib.pyplot as plt
     matplotlib.rcParams['figure.figsize'] = (8.0, 3.0)

     births = pd.read_csv("births.csv")
     births = births.drop(births[births.day == 99].index)
     births = births.drop(births[births.births < 500].index)
     births['day'].fillna(1, inplace=True)
     births['day'] = births['day'].astype('int32')
     births.dtypes
     births['Date'] = pd.to_datetime(births[['year','month','day']])
     print(births)

     newbirth = births[births['day'] == 31]


     start = pd.to_datetime(births['Date'].iloc[0])
     end = pd.to_datetime(newbirth['Date'].iloc[-1])

     births1 = births[births['gender'] == 'M']
     births2 = births[births['gender'] == 'F']
     births1.reset_index(inplace = True)
     births2.reset_index(inplace = True)
     births.reset_index()
     #new dataframe
     df = pd.DataFrame(births1['Date'])
     df['births'] = births1['births'] + births2['births']
     print(df)
     #newplot
     ax1 = plt.subplot(3,1,1)
     ax2 = plt.subplot(3,1,3)

     start1 = pd.to_datetime("01-01-1984")
     end1 = pd.to_datetime("12-31-1986")

     start2 = pd.to_datetime("01-01-1984")
     end2 = pd.to_datetime("12-31-1986")
     adjusted_268 = pd.Timedelta('268 days')

     ax1.plot(df['Date'],df['births'])
     ax2.plot(df['Date'],df['births'])

     ax1.set_xlim(start1, end1)
     ax1.set_ylim(8000,12000)
     ax1.set_title("Total Daily Births")
     ax1.set_ylabel("Daily Births")
     ax1.tick_params(axis='x', rotation=45)



     ax2.set_xlim(start2, (end2-adjusted_268))
     ax2.set_ylim(8000,12000)
     ax2.set_title("Likely Conception Dates")
     ax2.set_ylabel("Conception Events")
     ax2.tick_params(axis='x', rotation=45)

     #newdf needed
     #newplot needed

     new_df = births.groupby('month').sum()
     print(new_df)
     ax3 = plt.subplot()
     ax3.plot(new_df)
     ax3.set_ylim(240000,380000)
     ax3.set_title("Total Births Per Month")
     ax3.set_ylabel("Births")
     ax3.tick_params(axis='x', rotation=45)

This is what the DataFrame prints after being read from the .csv file and updated with pandas:

           year  month  day gender  births       Date
    0      1969      1    1      F    4046 1969-01-01
    1      1969      1    1      M    4440 1969-01-01
    2      1969      1    2      F    4454 1969-01-02
    3      1969      1    2      M    4548 1969-01-02
    4      1969      1    3      F    4548 1969-01-03
    5      1969      1    3      M    4994 1969-01-03
    6      1969      1    4      F    4440 1969-01-04
    7      1969      1    4      M    4520 1969-01-04
    8      1969      1    5      F    4192 1969-01-05
    9      1969      1    5      M    4198 1969-01-05
    10     1969      1    6      F    4710 1969-01-06
    11     1969      1    6      M    4850 1969-01-06
    12     1969      1    7      F    4646 1969-01-07
    13     1969      1    7      M    5092 1969-01-07
    14     1969      1    8      F    4800 1969-01-08
    15     1969      1    8      M    4934 1969-01-08
    16     1969      1    9      F    4592 1969-01-09
    17     1969      1    9      M    4842 1969-01-09
    18     1969      1   10      F    4852 1969-01-10
    19     1969      1   10      M    5190 1969-01-10
    20     1969      1   11      F    4580 1969-01-11
    21     1969      1   11      M    4598 1969-01-11
    22     1969      1   12      F    4126 1969-01-12
    23     1969      1   12      M    4324 1969-01-12
    24     1969      1   13      F    4758 1969-01-13
    25     1969      1   13      M    5076 1969-01-13
    26     1969      1   14      F    5070 1969-01-14
    27     1969      1   14      M    5296 1969-01-14
    28     1969      1   15      F    4798 1969-01-15
    29     1969      1   15      M    5096 1969-01-15
    ...     ...    ...  ...    ...     ...        ...
    15517  2007     10    1      F  180912 2007-10-01
    15518  2007     10    1      M  189157 2007-10-01
    15519  2007     11    1      F  173513 2007-11-01
    15520  2007     11    1      M  180814 2007-11-01
    15521  2007     12    1      F  173787 2007-12-01
    15522  2007     12    1      M  181426 2007-12-01
    15523  2008      1    1      F  174255 2008-01-01
    15524  2008      1    1      M  182789 2008-01-01
    15525  2008      2    1      F  165669 2008-02-01
    15526  2008      2    1      M  173434 2008-02-01
    15527  2008      3    1      F  172053 2008-03-01
    15528  2008      3    1      M  179129 2008-03-01
    15529  2008      4    1      F  169585 2008-04-01
    15530  2008      4    1      M  177399 2008-04-01
    15531  2008      5    1      F  173141 2008-05-01
    15532  2008      5    1      M  182294 2008-05-01
    15533  2008      6    1      F  169958 2008-06-01
    15534  2008      6    1      M  179267 2008-06-01
    15535  2008      7    1      F  183391 2008-07-01
    15536  2008      7    1      M  192714 2008-07-01
    15537  2008      8    1      F  182713 2008-08-01
    15538  2008      8    1      M  191315 2008-08-01
    15539  2008      9    1      F  179696 2008-09-01
    15540  2008      9    1      M  188964 2008-09-01
    15541  2008     10    1      F  175314 2008-10-01
    15542  2008     10    1      M  183219 2008-10-01
    15543  2008     11    1      F  158939 2008-11-01
    15544  2008     11    1      M  165468 2008-11-01
    15545  2008     12    1      F  173215 2008-12-01
    15546  2008     12    1      M  181235 2008-12-01

Tags: python, pandas, matplotlib, logic

Solution

Try this:

Generate random data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    np.random.seed(2019)
    rng = pd.date_range('1969-01-01', '2008-12-01', freq='D')
    df = pd.DataFrame({'birth': np.random.randint(1000, size=len(rng))}, index=rng)

Printing df:

                birth
    1969-01-01     72
    1969-01-02    370
    1969-01-03    159
    1969-01-04    805
    1969-01-05    728
    1969-01-06    190
    1969-01-07    758
    1969-01-08     24
    1969-01-09    797
    ...           ...
    2008-11-02    222
    2008-11-03    409
    2008-11-04     65
    2008-11-05    451
    2008-11-06    369
    2008-11-07    815
    2008-11-08    295
    2008-11-09    232

Then create the year, month, and day columns:

    df['month'] = df.index.month
    df['day'] = df.index.day
    df['year'] = df.index.year
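With the question's CSV the `year` and `month` columns already exist, so the direction is reversed: the `Date` column is assembled from them. A minimal sketch of that step, assuming the column names from the question's printout; the four rows are copied from that printout, and the `day` values are assumed to be missing here purely for illustration:

```python
import pandas as pd

# Four rows taken from the question's printout; "day" is assumed missing
# (NaN) for these monthly rows, which is an assumption made for this sketch.
births = pd.DataFrame({
    "year":   [2007, 2007, 2008, 2008],
    "month":  [12, 12, 1, 1],
    "day":    [float("nan")] * 4,
    "gender": ["F", "M", "F", "M"],
    "births": [173787, 181426, 174255, 182789],
})

# Set the missing day to 1 and store it as an integer column.
births["day"] = births["day"].fillna(1).astype("int32")

# pd.to_datetime assembles a date from columns named year, month, day.
births["Date"] = pd.to_datetime(births[["year", "month", "day"]])
print(births)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on `births['day']`, avoids the chained-assignment warnings newer pandas versions emit.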

Then group by year and month and sum the births:

    df_grouped = df.groupby(['year', 'month'])['birth'].sum()

The grouped result:

    year  month
    1969  1        14212
          2        14982
          3        15063
          4        12726
          5        14131
          6        16597
          7        14487
          8        12517
          9        14483
          10       17620
          11       13883
          12       17778
    1970  1        15345
          2        14524
          3        18429
          4        12981
          5        16855
          6        17135
          7        18174
          8        17031
          9        14734
          10       15743
          11       12761
          12       18428
    1971  1        15063
          2        17224
          3        15088
          4        17015
          5        17032
          6        17275
                   ...
    2006  7        14718
          8        13212
          9        17389
          10       13850
          11       13154
          12       14126
    2007  1        15762
          2        12744
          3        17176
          4        12777
          5        16254
          6        16297
          7        16038
          8        15067
          9        13724
          10       14386
          11       15701
          12       17136
    2008  1        16047
          2        15094
          3        14539
          4        18662
          5        16575
          6        13553
          7        17107
          8        15824
          9        15096
          10       15139
          11       15419
          12         250
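Since the random frame is indexed by dates, the same monthly totals can also be obtained without the helper columns by resampling on the index. This is an alternative to the groupby shown here, not a replacement for it; the name `monthly` is chosen just for this sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(2019)
rng = pd.date_range("1969-01-01", "2008-12-01", freq="D")
df = pd.DataFrame({"birth": np.random.randint(1000, size=len(rng))}, index=rng)

# "MS" means month start: each bin is labelled with the first day of its month.
monthly = df["birth"].resample("MS").sum()

# Cross-check against grouping on the year and month of the index.
grouped = df.groupby([df.index.year, df.index.month])["birth"].sum()
assert (monthly.to_numpy() == grouped.to_numpy()).all()

print(monthly.head())
```

The result is a single Series with one first-of-month timestamp per row, which is close to the "Date" column with day set to 1 that the question asks for.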

Now let's plot:

    fig, ax = plt.subplots(figsize=(15, 7))
    # Pass ax so the lines are drawn on the figure created above
    df_grouped.unstack().plot(ax=ax)
    plt.show()


If you like, you can work on making the plot "prettier".
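One possible direction, sketched under the assumption that a single line over time is what is wanted: turn each (year, month) pair into a first-of-month date and plot the totals against it. The `Date` column name and the axis labels follow the question; `monthly` is a name chosen for this sketch.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(2019)
rng = pd.date_range("1969-01-01", "2008-12-01", freq="D")
df = pd.DataFrame({"birth": np.random.randint(1000, size=len(rng))}, index=rng)
df["year"] = df.index.year
df["month"] = df.index.month

# One row per (year, month), with the keys kept as ordinary columns.
monthly = df.groupby(["year", "month"], as_index=False)["birth"].sum()

# First-of-month date for each row, used as the x axis.
monthly["Date"] = pd.to_datetime(monthly[["year", "month"]].assign(day=1))

fig, ax = plt.subplots(figsize=(15, 7))
ax.plot(monthly["Date"], monthly["birth"])
ax.set_title("Total Births Per Month")
ax.set_xlabel("Date")
ax.set_ylabel("Births")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

Call `plt.show()` afterwards to display the figure. No y-limits are set, so the axis scales to the data; hard-coded limits that do not match the data range are one way to end up with an empty frame.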

[Edit] I think a bar chart would look tidier:

    df_grouped.unstack().plot(kind='bar')
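With forty years of data the bar chart gets crowded, so it may help to slice a few years first. A rough sketch, assuming the 1984 to 1986 window from the question's `set_xlim` calls is the range of interest; `subset` is a name chosen here:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(2019)
rng = pd.date_range("1969-01-01", "2008-12-01", freq="D")
df = pd.DataFrame({"birth": np.random.randint(1000, size=len(rng))}, index=rng)
df_grouped = df.groupby([df.index.year, df.index.month])["birth"].sum()
df_grouped.index.names = ["year", "month"]

# Rows are years, columns are months; .loc slicing on labels is inclusive.
subset = df_grouped.unstack().loc[1984:1986]

fig, ax = plt.subplots(figsize=(15, 7))
subset.plot(kind="bar", ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Births")
ax.legend(title="Month", ncol=6)
fig.tight_layout()
```

Each year becomes one cluster of twelve bars, one per month. Call `plt.show()` to display it.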


