Using a newly assigned column in a "groupby" statement? (Method chaining with pandas)

Question

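I am an R (dplyr) user learning how to use pandas. I am practicing with a wind turbine dataset, and I would like to return a data frame containing the number of turbines per manufacturer per year in British Columbia since 2000.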

The block below raises the error NameError: name 'year' is not defined. Is there a way to pipe the newly created year column into the groupby statement within a single chain?

import pandas as pd

wind_raw = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-27/wind-turbine.csv"
)

(
    wind_raw
    .loc[:,['province_territory', 'manufacturer', 'commissioning_date']]
    .assign(year = wind_raw.commissioning_date.str.replace(r'(\d{4})(\/\d{4})*', r'\1'))
    .assign(year = lambda row: pd.to_datetime(row.year))
    .query('province_territory == "British Columbia" and year >= 2000')
    .groupby(wind_raw.manufacturer, year)
    .size()
)

Tags: python, pandas

Solution


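You almost had it. In groupby(wind_raw.manufacturer, year), the bare name year is evaluated as a Python variable, and no such variable exists, hence the NameError. Pass the column names as a list of strings instead, so that pandas looks them up in the data frame produced by the chain: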

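(
    wind_raw
    .loc[:,['province_territory', 'manufacturer', 'commissioning_date']]
    # regex=True is needed in pandas >= 2.0, where str.replace matches the pattern literally by default
    .assign(year = wind_raw.commissioning_date.str.replace(r'(\d{4})(\/\d{4})*', r'\1', regex=True))
    .assign(year = lambda row: pd.to_datetime(row.year))
    .query('province_territory == "British Columbia" and year >= 2000')
    # column names as strings refer to the chained frame, including the new year column
    .groupby(["manufacturer", "year"])
    .size()
)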

Output

manufacturer  year      
Enercon       2009-01-01    34
              2019-01-01     4
GE            2017-01-01    61
Leitwind      2010-01-01     1
Senvion       2017-01-01    10
Vestas        2011-01-01    48
              2012-01-01    79
              2014-01-01    55
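
To see the same idea without downloading the dataset, here is a minimal sketch on made-up data. The toy frame and its values are assumptions for illustration only, and the year is taken from the first four characters of the date string rather than with the regex used above. It shows that a column created with assign can be used by name in a later groupby in the same chain.

```python
import pandas as pd

# Hypothetical toy data standing in for the wind turbine dataset.
toy = pd.DataFrame(
    {
        "province_territory": ["British Columbia"] * 4 + ["Alberta"],
        "manufacturer": ["Vestas", "Vestas", "GE", "Enercon", "GE"],
        "commissioning_date": ["2011", "2011", "2017", "1998", "2014/2015"],
    }
)

counts = (
    toy
    # The lambda receives the data frame as it exists at this point in the chain.
    .assign(year=lambda df: df["commissioning_date"].str[:4].astype(int))
    .query('province_territory == "British Columbia" and year >= 2000')
    # String column names are resolved against the chained frame,
    # so "year" is found even though no Python variable named year exists.
    .groupby(["manufacturer", "year"])
    .size()
)
print(counts)
```

Using a lambda inside assign, instead of referring back to the original variable, also keeps the chain independent of the name of the starting data frame.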

In addition, a few things can be simplified:

(
    wind_raw[['province_territory', 'manufacturer']]
    # raw string for the regex; expand=False returns a Series instead of a one-column DataFrame
    .assign(year = wind_raw.commissioning_date.str.extract(r"(\d{4})", expand=False).astype(int))
    .query('province_territory == "British Columbia" and year >= 2000')
    .groupby(["manufacturer", "year"])
    .size()
)
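
As a small illustration of the extraction step on its own, the sketch below applies str.extract to a few made-up date strings in the style of the commissioning_date column (the sample values are assumptions, not taken from the dataset). With a single capture group and expand=False, the result is a Series holding the first four-digit match of each string.

```python
import pandas as pd

# Hypothetical sample values: a single year, or several years separated by "/".
dates = pd.Series(["2001", "2005/2006/2012", "2014/2015"])

# str.extract keeps only the first match of the capture group in each string.
years = dates.str.extract(r"(\d{4})", expand=False).astype(int)
print(years.tolist())  # [2001, 2005, 2014]
```

Because the year is an integer here rather than a datetime, the comparison year >= 2000 in query is a plain numeric comparison.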
