首页 > 解决方案 > 试图找到给定城市的最高得分提升

问题描述

我尝试按城市、日期和分数对其进行分组。然后按城市和分数排序,最后应用线性回归计算最高斜率。

下面的代码

from scipy.stats import linregress
result = data.groupby(['facility_city', 'score', 'activity_date'], as_index=False).sum()
result = result.sort_values(by=['facility_city', 'activity_date'], ascending=True)
print(result)
result = result.apply(lambda v: linregress(v['activity_date'], v['score'])[0])
print(result)

数据的一个例子是

"activity_date","employee_id","facility_address","facility_city","facility_id","facility_name","facility_state","facility_zip","grade","owner_id","owner_name","pe_description","program_element_pe","program_name","program_status","record_id","score","serial_number","service_code","service_description"
2017-05-09,"EE0000593","17660 CHATSWORTH ST","GRANADA HILLS","FA0175397","HOVIK'S FAMOUS MEAT & DELI","CA","91344","A","OW0181955","JOHN'S FAMOUS MEAT & DELI INC.","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"HOVIK'S FAMOUS MEAT & DELI","ACTIVE","PR0168541",98,"DAHDRUQZO",1,"ROUTINE INSPECTION"
2017-04-10,"EE0000126","3615 PACIFIC COAST HWY","TORRANCE","FA0242138","SHAKEY'S PIZZA","CA","90505","A","OW0237843","SCO, LLC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"SHAKEY'S PIZZA","ACTIVE","PR0190290",94,"DAL3SBUE0",1,"ROUTINE INSPECTION"
2017-04-04,"EE0000593","17515 CHATSWORTH ST","GRANADA HILLS","FA0007801","BAITH AL HALAL","CA","91344","A","OW0031150","SABIR MOHAMMAD SHAHID","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"BAITH AL HALAL","INACTIVE","PR0036723",95,"DAL2PIKJU",1,"ROUTINE INSPECTION"
2017-08-15,"EE0000971","44455 VALLEY CENTRAL WAY","LANCASTER","FA0013858","FOOD 4 LESS #306","CA","93536","A","OW0012108","FOOD 4 LESS, INC.","RESTAURANT (0-30) SEATS HIGH RISK",1632,"FOOD 4 LESS DELI/BAKERY#306","ACTIVE","PR0039905",98,"DA0ZMAJXZ",1,"ROUTINE INSPECTION"
2016-09-26,"EE0000145","11700 SOUTH ST","ARTESIA","FA0179671","PHO LITTLE SAIGON","CA","90701","A","OW0185167","PHO SOUTH ST INC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"PHO LITTLE SAIGON","ACTIVE","PR0173311",96,"DA41DBXA2",1,"ROUTINE INSPECTION"
2016-05-11,"EE0000720","1309 S HOOVER ST","LOS ANGELES","FA0179745","HAPPY TACOS TO GO","CA","90006-4903","A","OW0185239","MAT L. MORA","RESTAURANT (0-30) SEATS HIGH RISK",1632,"HAPPY TACOS TO GO","INACTIVE","PR0173403",96,"DAURQTTVR",1,"ROUTINE INSPECTION"
2017-02-28,"EE0000741","4959 PATATA ST","CUDAHY","FA0012590","EL POTRERO CLUB","CA","90201","B","OW0036634","TSAY, SHYR JIN","RESTAURANT (151 + ) SEATS HIGH RISK",1641,"EL POTRERO CLUB","ACTIVE","PR0041708",87,"DAUNXDSVP",1,"ROUTINE INSPECTION"

当我尝试应用线性回归时,出现错误 KeyError: 'activity_date'。欢迎任何提示

标签: pandasdataframe

解决方案


您需要groupby再次apply像以前一样使用。此外,线性回归不能直接使用日期,您需要先将它们转换为数字(参见这篇文章)。

将您的最后一个结果分配更改为:

result = (result.groupby('facility_city')
 .apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
)

您的完整代码应如下所示:

import pandas as pd
import datetime as dt
from scipy.stats import linregress

# To make sure activity_date is a Timestamp:
data['activity_date'] = pd.to_datetime(data.activity_date)

# Sort cities and dates
result = data.sort_values(['facility_city', 'activity_date'])

# Calculate linear regressions and retrieve their slopes
result = (result.groupby('facility_city')
 .apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
)

# Show higheest slopes first
result.sort_values(ascending=False)

推荐阅读