r - R: Group_by function is not aggregating data correctly with sum & max/min date
问题描述
I have data with 6 Variables : EmployeeID, JobID, Name, JobLocation, Date, and HoursWorked. I want to group my data by EmployeeID and JobID (i.e. find all records with the same EmployeeID and JobID in one row), and then find the minimum and maximum date by group, and the sum of all HoursWorked between those dates. I want my data to end up with columns: EmployeeID, JobID, Name JobLocation, MinDate, MaxDate, TotalHoursWorked.
I have tried this so far but the MinDate,MaxDate, and TotalHoursWorked are showing the same date for every record.
Data$EmployeeID<- as.factor(Data$EmployeeID)
Data$JobID<- as.factor(Data$JobID)
Data$Date<- as.factor(Data$Date)
Data$Date<- as.Date(Data$Date,format="%m/%d/%Y")
Data$HoursWorked<-as.numeric(Data$HoursWorked)
Data<-Data[c("EmployeeID", "Name","JobID", "JobLocation", "Date", "HoursWorked")]
Data<- Data%>%
group_by(Data$EmployeeID,Data$JobID, Data$Name,Data$JobLocation) %>%
summarize(TotalHoursWorked = sum(Data$HoursWorked)) %>%
mutate(MaxDate=max(Data$Date), MinDate=min(Data$Date))
Output of sample(Data) without the "Name" column:
> sample(Data)
# A tibble: 1,000 x 5
EmployeeID HoursWorked JobID Date JobLocation
<fct> <dbl> <fct> <date> <chr>
1 32589 4 B3031-002513-00 2016-03-14 #
2 32590 8 B3031-002562-00 2016-04-08 #
3 32591 9 B3031-002564-00 2016-04-05 #
4 32591 2.5 B3031-002564-00 2016-04-06 #
5 32591 3 B3031-002562-00 2016-04-07 #
6 32591 7.5 B3031-002562-00 2016-04-08 #
7 32605 0 B3031-002348-00 2016-01-04 #
8 32605 3 B3031-002419-00 2016-01-04 #
9 32605 0 B3031-002348-00 2016-01-05 #
10 32605 3 B3031-002419-00 2016-01-05 #
# ... with 990 more rows
And output after I run the group_by code:
> sample(Data)
# A tibble: 80 x 6
MaxDate `Data$JobID` MinDate `Data$\`Job Location\`` TotalHoursWorked `Data$EmployeeID`
<date> <fct> <date> <chr> <dbl> <fct>
1 2016-07-29 B3031-002513-00 2016-01-04 # 3288. 32589
2 2016-07-29 B3031-002562-00 2016-01-04 # 3288. 32590
3 2016-07-29 B3031-002562-00 2016-01-04 # 3288. 32591
4 2016-07-29 B3031-002564-00 2016-01-04 # 3288. 32591
5 2016-07-29 B3031-002348-00 2016-01-04 # 3288. 32605
6 2016-07-29 B3031-002419-00 2016-01-04 # 3288. 32605
7 2016-07-29 B3031-002445-00 2016-01-04 # 3288. 32605
8 2016-07-29 B3031-002502-00 2016-01-04 # 3288. 32605
9 2016-07-29 B3031-002504-00 2016-01-04 # 3288. 32605
10 2016-07-29 B3031-002505-00 2016-01-04 # 3288. 32605
# ... with 70 more rows
解决方案
其实很简单,你在用summarise
,mutate
什么时候应该只用summarise
.
可能不需要第一个指令,我Date
在阅读下面的数据后运行它来强制列。
Data$Date <- as.Date(Data$Date)
现在解决方案。
library(tidyverse)
Data %>%
group_by(EmployeeID, JobID) %>%
summarise(TotalHoursWorked = sum(HoursWorked),
MaxDate = max(Date), MinDate = min(Date))
数据。
Data <- read.table(text = "
EmployeeID HoursWorked JobID Date JobLocation
1 32589 4 B3031-002513-00 2016-03-14 #
2 32590 8 B3031-002562-00 2016-04-08 #
3 32591 9 B3031-002564-00 2016-04-05 #
4 32591 2.5 B3031-002564-00 2016-04-06 #
5 32591 3 B3031-002562-00 2016-04-07 #
6 32591 7.5 B3031-002562-00 2016-04-08 #
7 32605 0 B3031-002348-00 2016-01-04 #
8 32605 3 B3031-002419-00 2016-01-04 #
9 32605 0 B3031-002348-00 2016-01-05 #
10 32605 3 B3031-002419-00 2016-01-05 #
", header = TRUE, comment.char = "")
推荐阅读
- rest - 应该从参数传递并获取 id 还是应该从 JWT 令牌获取?
- c++ - RtMidi - MidiInDummy:此类不提供任何功能
- python - 在 Pandas Python 中删除列
- python - Cassandra Python 驱动程序协议版本和连接限制不匹配
- mapbox - 标签的 Z 偏移
- reactjs - 禁止从“”导出 *
- python - 我无法使用 Python 附加到我的 JSON 数组
- android - 如何从包含正确符号的发布构建输出中生成 cSYM?
- amazon-web-services - 是否可以重命名或更改 AWS codedeploy appspec 文件的目录?
- excel - 使用 Excel-VBA 和 ADODB 连接在 OneDrive 上打开 Access 数据库