Generating presence/absence and count data frames from timestamped occurrence data in R, binned hourly by location and ID (all possible combinations)

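Problem description

I have a data frame of timestamps recording when individual animals (ID) were detected at specific locations. Here is a short example of the data (sorted by location, then by timestampUTC, for clarity):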

df <- read.table(text =
                   "timestampUTC    location    ID
                 '2017-10-11 04:57:16'  JB12    A69-1602-2429
                 '2017-10-11 04:58:25'  JB12    A69-1602-2429
                 '2017-10-11 04:59:08'  JB12    A69-1602-2429
                 '2017-10-11 05:00:01'  JB12    A69-1602-2429
                 '2017-10-11 05:03:04'  JB12    A69-1602-2429
                 '2017-10-11 05:04:38'  JB12    A69-1602-2429
                 '2017-10-11 04:56:20'  JB13    A69-1601-47280
                 '2017-10-11 04:57:44'  JB13    A69-1601-47280
                 '2017-10-11 04:59:00'  JB13    A69-1601-47280
                 '2017-10-11 04:59:56'  JB13    A69-1601-47280
                 '2017-10-11 05:01:16'  JB13    A69-1601-47280
                 '2017-10-11 05:02:20'  JB13    A69-1601-47280
                 '2017-10-11 05:02:56'  JB13    A69-1601-47280
                 '2017-10-11 05:03:34'  JB13    A69-1601-47280
                 '2017-10-11 05:08:29'  JB13    A69-1602-2429
                 '2017-10-11 04:56:43'  WBR A69-1602-2428
                 '2017-10-11 04:57:09'  WBR A69-1602-2425
                 '2017-10-11 04:58:11'  WBR A69-1602-2428
                 '2017-10-11 04:58:42'  WBR A69-1602-2425
                 '2017-10-11 04:59:34'  WBR A69-1602-2428
                 '2017-10-11 05:00:13'  WBR A69-1602-2425
                 '2017-10-11 05:00:47'  WBR A69-1602-2428
                 '2017-10-11 05:01:39'  WBR A69-1602-2428
                 '2017-10-11 05:01:43'  WBR A69-1602-2425
                 '2017-10-11 05:02:49'  WBR A69-1602-2428
                 '2017-10-11 05:03:36'  WBR A69-1602-2428
                 '2017-10-11 05:04:38'  WBR A69-1602-2428
                 '2017-10-11 05:07:32'  WBR A69-1602-2428", header = T)

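From these data I would like to generate a data frame that has, for each hourly bin, each location, and each ID, one column indicating whether that ID was present (1) or absent (0) at that location during that hour, and one column with the count of observations of that ID at that location during that hour. Put slightly differently: for each ID, I want to know whether it was present (1) or absent (0), along with the count of observations, for every hourly bin of the study period at each location.

For example, the data above have 4 unique IDs, 3 unique locations, and observations falling into two hourly bins (2017-10-11 04:00:00 and 2017-10-11 05:00:00). The output would be a df with 24 rows (4 IDs x 3 locations x 2 hourly bins). The output should look like this: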

output <- read.table(text =
                   "HourlyBinUTC    location    ID  count   present
                 '2017-10-11 04:00:00'  JB13    A69-1601-47280  4   1
                 '2017-10-11 04:00:00'  JB13    A69-1602-2429   0   0
                 '2017-10-11 04:00:00'  JB13    A69-1602-2428   0   0
                 '2017-10-11 04:00:00'  JB13    A69-1602-2425   0   0
                 '2017-10-11 04:00:00'  JB12    A69-1601-47280  0   0
                 '2017-10-11 04:00:00'  JB12    A69-1602-2429   4   1
                 '2017-10-11 04:00:00'  JB12    A69-1602-2428   0   0
                 '2017-10-11 04:00:00'  JB12    A69-1602-2425   0   0
                 '2017-10-11 04:00:00'  WBR A69-1601-47280  0   0
                 '2017-10-11 04:00:00'  WBR A69-1602-2429   0   0
                 '2017-10-11 04:00:00'  WBR A69-1602-2428   3   1
                 '2017-10-11 04:00:00'  WBR A69-1602-2425   2   1
                 '2017-10-11 05:00:00'  JB13    A69-1601-47280  4   1
                 '2017-10-11 05:00:00'  JB13    A69-1602-2429   1   1
                 '2017-10-11 05:00:00'  JB13    A69-1602-2428   0   0
                 '2017-10-11 05:00:00'  JB13    A69-1602-2425   0   0
                 '2017-10-11 05:00:00'  JB12    A69-1601-47280  0   0
                 '2017-10-11 05:00:00'  JB12    A69-1602-2429   3   1
                 '2017-10-11 05:00:00'  JB12    A69-1602-2428   0   0
                 '2017-10-11 05:00:00'  JB12    A69-1602-2425   0   0
                 '2017-10-11 05:00:00'  WBR A69-1601-47280  0   0
                 '2017-10-11 05:00:00'  WBR A69-1602-2429   0   0
                 '2017-10-11 05:00:00'  WBR A69-1602-2428   6   1
                 '2017-10-11 05:00:00'  WBR A69-1602-2425   2   1", header = T)

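(My real df spans 19,887 hourly bins, with 17 individuals and 4 locations, so I expect the output to have 1,352,316 rows.)

I am new to R and have played with reshape() and dcast(), but I am not familiar enough with them to work out the arguments I need. I suspect this is a job for the tidyverse, which I (evidently) badly need to learn!

Thanks in advance!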

Tags: r, dataframe, datetime

Solution

You don't need any packages. I read in your df, but with the option stringsAsFactors set to FALSE:

df <- read.table(text =
                 "timestampUTC    location    ID
                 '2017-10-11 04:57:16'  JB12    A69-1602-2429
                 ## ...
                 '2017-10-11 05:07:32'  WBR A69-1602-2428",
                 header = TRUE, stringsAsFactors = FALSE)

Now tell R that the first column is a timestamp, then truncate each timestamp to the start of its hour.

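# Parse the strings as date-times in UTC.
df[["timestampUTC"]] <- as.POSIXct(df[["timestampUTC"]], tz = "UTC")
# trunc() returns POSIXlt; convert back to POSIXct, which is safer as a data frame column.
df[["timestampUTC"]] <- as.POSIXct(trunc(df[["timestampUTC"]], "hours"))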
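
Whether a timestamp is truncated or rounded decides which bin a detection late in the hour lands in. The following is a small sketch that contrasts the two, using one detection time from the example data; the variable names x, truncated and rounded are illustrative and not part of the code above.

```r
# One detection time taken from the example data
x <- as.POSIXct("2017-10-11 04:57:16", tz = "UTC")

# trunc() drops the minutes and seconds, so the detection stays in the 04:00 bin
truncated <- as.POSIXct(trunc(x, "hours"))

# round() moves to the nearest hour, which would put it in the 05:00 bin
rounded <- as.POSIXct(round(x, "hours"))

format(truncated, "%Y-%m-%d %H:%M:%S")  # "2017-10-11 04:00:00"
format(rounded, "%Y-%m-%d %H:%M:%S")    # "2017-10-11 05:00:00"
```

For hourly bins labelled by their start time, as in the expected output of the question, truncation is the behaviour that fits.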

Then call table and convert the result to a data frame.

result <- as.data.frame(table(df))

##           timestampUTC location             ID Freq
## 1  2017-10-11 04:00:00     JB12 A69-1601-47280    0
## 2  2017-10-11 05:00:00     JB12 A69-1601-47280    0
## 3  2017-10-11 04:00:00     JB13 A69-1601-47280    4
## 4  2017-10-11 05:00:00     JB13 A69-1601-47280    4
## 5  2017-10-11 04:00:00      WBR A69-1601-47280    0
## 6  2017-10-11 05:00:00      WBR A69-1601-47280    0
## 7  2017-10-11 04:00:00     JB12  A69-1602-2425    0
## 8  2017-10-11 05:00:00     JB12  A69-1602-2425    0
## 9  2017-10-11 04:00:00     JB13  A69-1602-2425    0
## 10 2017-10-11 05:00:00     JB13  A69-1602-2425    0
## 11 2017-10-11 04:00:00      WBR  A69-1602-2425    2
## 12 2017-10-11 05:00:00      WBR  A69-1602-2425    2
## 13 2017-10-11 04:00:00     JB12  A69-1602-2428    0
## 14 2017-10-11 05:00:00     JB12  A69-1602-2428    0
## 15 2017-10-11 04:00:00     JB13  A69-1602-2428    0
## 16 2017-10-11 05:00:00     JB13  A69-1602-2428    0
## 17 2017-10-11 04:00:00      WBR  A69-1602-2428    3
## 18 2017-10-11 05:00:00      WBR  A69-1602-2428    6
## 19 2017-10-11 04:00:00     JB12  A69-1602-2429    3
## 20 2017-10-11 05:00:00     JB12  A69-1602-2429    3
## 21 2017-10-11 04:00:00     JB13  A69-1602-2429    0
## 22 2017-10-11 05:00:00     JB13  A69-1602-2429    1
## 23 2017-10-11 04:00:00      WBR  A69-1602-2429    0
## 24 2017-10-11 05:00:00      WBR  A69-1602-2429    0
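
As a quick check on the "all possible combinations" requirement, the number of rows that table produces should equal the product of the number of distinct values in each column, and the counts should add up to the number of detections. Here is a minimal sketch on a hypothetical three-row data frame called toy (made-up values, not the data from the question):

```r
# Hypothetical miniature data set: 2 hours x 2 locations x 2 IDs
toy <- data.frame(
  timestampUTC = c("2017-10-11 04:00:00", "2017-10-11 04:00:00",
                   "2017-10-11 05:00:00"),
  location     = c("JB12", "JB13", "JB12"),
  ID           = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# Cross-tabulate all three columns and flatten to one row per combination
toy_result <- as.data.frame(table(toy))

# Product of the number of distinct values per column: 2 * 2 * 2
expected_rows <- prod(vapply(toy, function(col) length(unique(col)), integer(1)))

nrow(toy_result) == expected_rows   # TRUE: every combination is present
sum(toy_result$Freq) == nrow(toy)   # TRUE: no detection was lost or duplicated

# The same arithmetic for the sizes quoted in the question
19887 * 17 * 4                      # 1352316
```

The same two checks can be run on the real result before any further processing.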

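If you want to include, for instance, an ID that never occurred in the data, you need to tell table which IDs are possible. You can do that by making ID a factor.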

df[["ID"]] <- factor(df[["ID"]],
                     levels = c("A69-1602-2429", "A69-1601-47280",
                                "A69-1602-2428", "A69-1602-2425",
                                "foo"))

Or additional hours:

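# Every hourly bin that should appear in the result. The levels must cover
# the observed hours; timestamps outside them would become NA.
hours <- seq(from = as.POSIXct("2017-10-11 00:00:00", tz = "UTC"),
             to   = as.POSIXct("2017-10-11 23:00:00", tz = "UTC"),
             by   = "1 hour")
# Format both sides the same way so the values match the levels as strings.
df[["timestampUTC"]] <-
    factor(format(df[["timestampUTC"]], "%Y-%m-%d %H:%M:%S"),
           levels = format(hours, "%Y-%m-%d %H:%M:%S"))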

The result will then include the zero counts.

## ...
## 25 2017-10-11 04:00:00     JB12            foo    0
## 26 2017-10-11 05:00:00     JB12            foo    0
## 27 2017-10-11 04:00:00     JB13            foo    0
## 28 2017-10-11 05:00:00     JB13            foo    0
## 29 2017-10-11 04:00:00      WBR            foo    0
## 30 2017-10-11 05:00:00      WBR            foo    0
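
The mechanism behind this is that table counts every level of a factor, whether or not it occurs in the data. A minimal sketch with a made-up vector of IDs (the names ids and counts are illustrative only):

```r
# "foo" is declared as a possible ID but never observed
ids <- factor(c("a", "a", "b"), levels = c("a", "b", "foo"))

# table() reports one cell per level, so "foo" appears with a count of 0
counts <- as.data.frame(table(ID = ids))
counts
```

Without the explicit levels, "foo" would simply be missing from the result rather than listed with a zero.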

For present, I suggest a logical vector rather than a 0-1 variable:

present <- result$Freq > 0
cbind(result, present)

##           timestampUTC location             ID Freq present
## 1  2017-10-11 04:00:00     JB12 A69-1601-47280    0   FALSE
## 2  2017-10-11 05:00:00     JB12 A69-1601-47280    0   FALSE
## 3  2017-10-11 04:00:00     JB13 A69-1601-47280    4    TRUE
## 4  2017-10-11 05:00:00     JB13 A69-1601-47280    4    TRUE
## 5  2017-10-11 04:00:00      WBR A69-1601-47280    0   FALSE
## ...
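
If the exact layout from the question is needed (columns HourlyBinUTC, count and a 0/1 present), the columns can be renamed and the logical converted to an integer. This is only a sketch: the two-row result below is a stand-in for the full data frame built above, and output is the name used in the question.

```r
# Stand-in for the `result` data frame from the steps above (two rows only)
result <- data.frame(
  timestampUTC = c("2017-10-11 04:00:00", "2017-10-11 04:00:00"),
  location     = c("JB12", "JB13"),
  ID           = c("A69-1601-47280", "A69-1601-47280"),
  Freq         = c(0L, 4L)
)

# Rename the columns and store presence as 0/1 instead of FALSE/TRUE
output <- data.frame(
  HourlyBinUTC = result$timestampUTC,
  location     = result$location,
  ID           = result$ID,
  count        = result$Freq,
  present      = as.integer(result$Freq > 0)
)
output
```

Keeping present as a logical is usually more convenient inside R; the integer form is mainly useful when exporting to other software.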
