首页 > 解决方案 > Calculating the time spent with dates and time format in r

问题描述

I am working on a time/date variable and trying to estimate the time spent for each record. I am following two steps for this analysis. (a) modify the variables in the desired format (b) calculate the time spent on each question. Here how my dataset looks like:

id <-     c(1,1,1,1,1, 2,2,2,2,2)
item.id <- c(1,2,3,4,5, 1,2,3,4,5)
submit.time <-c("2019-04-09 09:50:30.340","2019-04-09 09:52:12.440","2019-04-09 09:52:15.787","2019-04-09 09:53:21.587","2019-04-09 09:53:49.047",
                "2019-04-09 09:49:45.243","2019-04-09 09:52:53.663","2019-04-09 09:53:23.293","2019-04-09 09:54:00.727","2019-04-09 09:54:52.400")
start.time <- c("04/09/2019 09:50:02.317 AM","04/09/2019 09:50:02.317 AM","04/09/2019 09:50:02.317 AM","04/09/2019 09:50:02.317 AM","04/09/2019 09:50:02.317 AM",
                "04/09/2019 09:47:42.583 AM","04/09/2019 09:47:42.583 AM","04/09/2019 09:47:42.583 AM","04/09/2019 09:47:42.583 AM","04/09/2019 09:47:42.583 AM")

data <- data.frame(id, item.id,start.time, submit.time)

> data
   id item.id                 start.time             submit.time
1   1       1 04/09/2019 09:50:02.317 AM 2019-04-09 09:50:30.340
2   1       2 04/09/2019 09:50:02.317 AM 2019-04-09 09:52:12.440
3   1       3 04/09/2019 09:50:02.317 AM 2019-04-09 09:52:15.787
4   1       4 04/09/2019 09:50:02.317 AM 2019-04-09 09:53:21.587
5   1       5 04/09/2019 09:50:02.317 AM 2019-04-09 09:53:49.047
6   2       1 04/09/2019 09:47:42.583 AM 2019-04-09 09:49:45.243
7   2       2 04/09/2019 09:47:42.583 AM 2019-04-09 09:52:53.663
8   2       3 04/09/2019 09:47:42.583 AM 2019-04-09 09:53:23.293
9   2       4 04/09/2019 09:47:42.583 AM 2019-04-09 09:54:00.727
10  2       5 04/09/2019 09:47:42.583 AM 2019-04-09 09:54:52.400

id for each student, item.id is the id of the question, start.time is the for login time to the exam(unique time for each student), and submit.time is the time when a student submits the answer for each question.

(a) editing data: this step includes removing AM|PM and switching the order of the start.time because I would like to use the submit.time's format, and edited start.time's format.

data$start.time <- gsub(" AM| PM", "", data$start.time) # exclude AM or PM
data$start.time <- gsub("/", "-", data$start.time) #replace / with -
dtparts = t(as.data.frame(strsplit(data$start.time,' '))) # split date and time
row.names(dtparts) = NULL
data$newdate <- strptime(as.character(dtparts[,1]), "%m-%d-%Y") # switch the date order
data$newdate <- as.POSIXct(data$newdate) # R was complaining about the time format-had to change here
data$start.time <- paste0(data$newdate," ",dtparts[,2]) # bring the time back

Now, the two-timing variables look the same. I transformed these dates and times to seconds.

data %>%
  mutate(start.time.num = as.numeric(as.POSIXct(start.time), units="secs"))  %>%
  mutate(submit.time.num = as.numeric(as.POSIXct(submit.time), units="secs"))

   id item.id              start.time             submit.time    newdate start.time.num submit.time.num
1   1       1 2019-04-09 09:50:02.317 2019-04-09 09:50:30.340 2019-04-09     1554817802      1554817830
2   1       2 2019-04-09 09:50:02.317 2019-04-09 09:52:12.440 2019-04-09     1554817802      1554817932
3   1       3 2019-04-09 09:50:02.317 2019-04-09 09:52:15.787 2019-04-09     1554817802      1554817936
4   1       4 2019-04-09 09:50:02.317 2019-04-09 09:53:21.587 2019-04-09     1554817802      1554818002
5   1       5 2019-04-09 09:50:02.317 2019-04-09 09:53:49.047 2019-04-09     1554817802      1554818029
6   2       1 2019-04-09 09:47:42.583 2019-04-09 09:49:45.243 2019-04-09     1554817663      1554817785
7   2       2 2019-04-09 09:47:42.583 2019-04-09 09:52:53.663 2019-04-09     1554817663      1554817974
8   2       3 2019-04-09 09:47:42.583 2019-04-09 09:53:23.293 2019-04-09     1554817663      1554818003
9   2       4 2019-04-09 09:47:42.583 2019-04-09 09:54:00.727 2019-04-09     1554817663      1554818041
10  2       5 2019-04-09 09:47:42.583 2019-04-09 09:54:52.400 2019-04-09     1554817663      1554818092

(b) in this step, I would like to calculate the time spent on each question here. For the first question of the first student, the time spent should be submit.time.num(1554817830) - start.time.num(1554817802)=28. For the second question of the first student, the time spent should be submit.time.num(1554817932) - previous submit.time.num(1554817830)=102. This procedure needs to be repeated for each student. When it gets to the second student, it should take the start.time again for the first row of the second student.

So, the additional column should look like this:

> time.spent
   time.spent
1          28
2         102
3           4
4          66
5          27
6         122
7         189
8          29
9          38
10         51

I apologize for posting this long, if you also have any suggestions for the first part, please let me know, more importantly, any suggestions for the part (b)?

Thanks

标签: rtime

解决方案


This can be done a bit faster in a single pipeline.

library(dplyr)
data %>%
  mutate(
    start.time = as.POSIXct(start.time, format = "%m/%d/%Y %H:%M:%OS"),
    submit.time = as.POSIXct(submit.time),
    time.spent = difftime(submit.time, start.time, units = "secs")
  ) %>%
  group_by(id) %>%
  mutate(
    time.spent = c(time.spent[1], diff(time.spent))
  ) %>%
  ungroup()
# # A tibble: 10 x 5
#       id item.id start.time          submit.time         time.spent  
#    <dbl>   <dbl> <dttm>              <dttm>              <drtn>      
#  1     1       1 2019-04-09 09:50:02 2019-04-09 09:50:30  28.023 secs
#  2     1       2 2019-04-09 09:50:02 2019-04-09 09:52:12 102.100 secs
#  3     1       3 2019-04-09 09:50:02 2019-04-09 09:52:15   3.347 secs
#  4     1       4 2019-04-09 09:50:02 2019-04-09 09:53:21  65.800 secs
#  5     1       5 2019-04-09 09:50:02 2019-04-09 09:53:49  27.460 secs
#  6     2       1 2019-04-09 09:47:42 2019-04-09 09:49:45 122.660 secs
#  7     2       2 2019-04-09 09:47:42 2019-04-09 09:52:53 188.420 secs
#  8     2       3 2019-04-09 09:47:42 2019-04-09 09:53:23  29.630 secs
#  9     2       4 2019-04-09 09:47:42 2019-04-09 09:54:00  37.434 secs
# 10     2       5 2019-04-09 09:47:42 2019-04-09 09:54:52  51.673 secs

Using @akrun's suggestion, we can shorten the code a little:

data %>%
  group_by(id) %>%
  mutate(
    start.time = as.POSIXct(start.time, format = "%m/%d/%Y %H:%M:%OS"),
    submit.time = as.POSIXct(submit.time),
    time.spent = submit.time - lag(submit.time, default = first(start.time))
  ) %>%
  ungroup()

(and optionally remove the columns you no longer need).


推荐阅读