pandas - 试图找到给定城市的最高得分提升
问题描述
我尝试按城市、日期和分数对其进行分组。然后按城市和分数排序,最后应用线性回归计算最高斜率。
下面的代码
from scipy.stats import linregress
result = data.groupby(['facility_city', 'score', 'activity_date'], as_index=False).sum()
result = result.sort_values(by=['facility_city', 'activity_date'], ascending=True)
print(result)
result = result.apply(lambda v: linregress(v['activity_date'], v['score'])[0])
print(result)
数据的一个例子是
"activity_date","employee_id","facility_address","facility_city","facility_id","facility_name","facility_state","facility_zip","grade","owner_id","owner_name","pe_description","program_element_pe","program_name","program_status","record_id","score","serial_number","service_code","service_description"
2017-05-09,"EE0000593","17660 CHATSWORTH ST","GRANADA HILLS","FA0175397","HOVIK'S FAMOUS MEAT & DELI","CA","91344","A","OW0181955","JOHN'S FAMOUS MEAT & DELI INC.","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"HOVIK'S FAMOUS MEAT & DELI","ACTIVE","PR0168541",98,"DAHDRUQZO",1,"ROUTINE INSPECTION"
2017-04-10,"EE0000126","3615 PACIFIC COAST HWY","TORRANCE","FA0242138","SHAKEY'S PIZZA","CA","90505","A","OW0237843","SCO, LLC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"SHAKEY'S PIZZA","ACTIVE","PR0190290",94,"DAL3SBUE0",1,"ROUTINE INSPECTION"
2017-04-04,"EE0000593","17515 CHATSWORTH ST","GRANADA HILLS","FA0007801","BAITH AL HALAL","CA","91344","A","OW0031150","SABIR MOHAMMAD SHAHID","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"BAITH AL HALAL","INACTIVE","PR0036723",95,"DAL2PIKJU",1,"ROUTINE INSPECTION"
2017-08-15,"EE0000971","44455 VALLEY CENTRAL WAY","LANCASTER","FA0013858","FOOD 4 LESS #306","CA","93536","A","OW0012108","FOOD 4 LESS, INC.","RESTAURANT (0-30) SEATS HIGH RISK",1632,"FOOD 4 LESS DELI/BAKERY#306","ACTIVE","PR0039905",98,"DA0ZMAJXZ",1,"ROUTINE INSPECTION"
2016-09-26,"EE0000145","11700 SOUTH ST","ARTESIA","FA0179671","PHO LITTLE SAIGON","CA","90701","A","OW0185167","PHO SOUTH ST INC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"PHO LITTLE SAIGON","ACTIVE","PR0173311",96,"DA41DBXA2",1,"ROUTINE INSPECTION"
2016-05-11,"EE0000720","1309 S HOOVER ST","LOS ANGELES","FA0179745","HAPPY TACOS TO GO","CA","90006-4903","A","OW0185239","MAT L. MORA","RESTAURANT (0-30) SEATS HIGH RISK",1632,"HAPPY TACOS TO GO","INACTIVE","PR0173403",96,"DAURQTTVR",1,"ROUTINE INSPECTION"
2017-02-28,"EE0000741","4959 PATATA ST","CUDAHY","FA0012590","EL POTRERO CLUB","CA","90201","B","OW0036634","TSAY, SHYR JIN","RESTAURANT (151 + ) SEATS HIGH RISK",1641,"EL POTRERO CLUB","ACTIVE","PR0041708",87,"DAUNXDSVP",1,"ROUTINE INSPECTION"
当我尝试应用线性回归时,出现错误 KeyError: 'activity_date'。欢迎任何提示
解决方案
您需要groupby
再次apply
像以前一样使用。此外,线性回归不能直接使用日期,您需要先将它们转换为数字(参见这篇文章)。
将您的最后一个结果分配更改为:
result = (result.groupby('facility_city')
.apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
)
您的完整代码应如下所示:
import pandas as pd
import datetime as dt
from scipy.stats import linregress
# To make sure activity_date is a Timestamp:
data['activity_date'] = pd.to_datetime(data.activity_date)
# Sort cities and dates
result = data.sort_values(['facility_city', 'activity_date'])
# Calculate linear regressions and retrieve their slopes
result = (result.groupby('facility_city')
.apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
)
# Show higheest slopes first
result.sort_values(ascending=False)
推荐阅读
- spark-streaming - 使用 Spark 缓存小型 RDD 需要很长时间,并且 Spark 似乎被冻结了
- javascript - 如何使用两种不同的表单或方法处理提交
- javascript - 使用 php 代理的跨域 Ajax POST 请求
- javascript - 为什么 Gulp-Bundle 任务抱怨 Gulp v4 配置错误?
- django - 获取学生模型django中引用的元素(教师)的主键
- c++ - 如何在 C++17 中检索 std::filesystem::file_time_type 的时钟类型
- entity-framework - 无法读取映射到 DbView 的上下文的 MetadataWorkspace?
- r - 使用 H2O 在 R 中平衡随机森林
- docker - 在大学实验室环境中管理 docker 容器
- python - 如何使用python检查文件是否包含电子邮件地址或md5