Aggregating data across data frames in R

Problem description

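I have two data frames. df1 lists the locations that belong to each unit, with several locations per unit. df2 contains daily observations of maximum temperature (df2$tmax) for every location. For each unit in df1, I want to calculate the average daily max. temperature across all locations within that unit.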

The code below generates a sample of each data frame. I need to scale this up to about 240 units and 8 years of daily data.

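These lookup/matching exercises in R always seem to trip me up. There must be an obvious way to do this, but I'm currently stuck unless I resort to some real brute-force joining.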

df1 <- 
structure(list(unitID = c("98008", "98008", "98008", "98008", 
"98065", "98065", "98065", "98065", "98146", "98146", "98146", 
"98146", "98584", "98584", "98584"), locationID = c("USW00094290", "USW00094248", 
"USW00024234", "USC00454169", "USC00458508", "USS0021B60S", "USR0000WFTA", 
"USC00451233", "USW00024234", "USW00024233", "USW00094248", "USC00454169", 
"USW00094227", "USC00451939", "USC00455086")), class = "data.frame", row.names = c(NA, 
-15L))

df1
unitID  locationID
1   98008 USW00094290
2   98008 USW00094248
3   98008 USW00024234
4   98008 USC00454169
5   98065 USC00458508
6   98065 USS0021B60S
7   98065 USR0000WFTA
8   98065 USC00451233
9   98146 USW00024234
10  98146 USW00024233
11  98146 USW00094248
12  98146 USC00454169
13  98584 USW00094227
14  98584 USC00451939
15  98584 USC00455086
df2 <- 
structure(list(id = c("USW00094290", "USW00094290", "USW00094248", 
"USW00094248", "USW00024234", "USW00024234", "USC00454169", "USC00454169", 
"USC00458508", "USC00458508", "USS0021B60S", "USS0021B60S", "USR0000WFTA", 
"USR0000WFTA", "USC00451233", "USC00451233", "USW00024233", "USW00024233", 
"USW00094227", "USW00094227", "USC00451939", "USC00451939", "USC00455086", 
"USC00455086"), date = structure(c(17167, 17168, 17167, 17168, 
17167, 17168, 17167, 17168, 17167, 17168, 17167, 17168, 17167, 
17168, 17167, 17168, 17167, 17168, 17167, 17168, 17167, 17168, 
17167, 17168), class = "Date"), tmax = c(28, 28, 28, 28, 33, 
28, 33, 28, -11, -28, -17, -50, 11, -17, 0, -11, 28, 11, 44, 
33, 50, 39, 39, 28)), row.names = c(NA, -24L), class = c("tbl_df", 
"tbl", "data.frame"))

df2
# A tibble: 24 x 3
   id          date        tmax
   <chr>       <date>     <dbl>
 1 USW00094290 2017-01-01    28
 2 USW00094290 2017-01-02    28
 3 USW00094248 2017-01-01    28
 4 USW00094248 2017-01-02    28
 5 USW00024234 2017-01-01    33
 6 USW00024234 2017-01-02    28
 7 USC00454169 2017-01-01    33
 8 USC00454169 2017-01-02    28
 9 USC00458508 2017-01-01   -11
10 USC00458508 2017-01-02   -28
# ... with 14 more rows

The output should include unitID, date, and the average max. temperature.

unitID  date         avg_temp
98008   2009-01-01   30.5
98008   2009-01-02   ...
98008   2009-01-03   ...

Tags: r

Solution


We can use left_join, group_by unitID and locationID, and take the mean of tmax. This gives one mean per location within each unit.

library(dplyr)

df1  %>%
  left_join(df2, by = c("locationID" = "id")) %>%
  group_by(unitID, locationID) %>%
  summarise(tmx = mean(tmax, na.rm = TRUE))


#   unitID locationID    tmx
#   <chr>  <chr>       <dbl>
# 1 98008  USC00454169  30.5
# 2 98008  USW00024234  30.5
# 3 98008  USW00094248  28  
# 4 98008  USW00094290  28  
# 5 98065  USC00451233  -5.5
# 6 98065  USC00458508 -19.5
# 7 98065  USR0000WFTA  -3  
# 8 98065  USS0021B60S -33.5
# 9 98146  USC00454169  30.5
#10 98146  USW00024233  19.5
#11 98146  USW00024234  30.5
#12 98146  USW00094248  28  
#13 98584  USC00451939  44.5
#14 98584  USC00455086  33.5
#15 98584  USW00094227  38.5
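
The result above has one row per location, whereas the question asks for one row per unit and date. A minimal sketch of that variant follows, assuming the same join but grouping by unitID and date instead. The small df1 and df2 here are a subset of the question's sample data, and the names unit_daily and avg_temp are chosen only for illustration. Recent dplyr versions may warn about a many-to-many relationship in the join, which is expected because one location can belong to several units.

```r
library(dplyr)

# Subset of the question's sample data; USW00024234 belongs to two units.
df1 <- data.frame(
  unitID     = c("98008", "98008", "98146", "98146"),
  locationID = c("USW00094290", "USW00024234", "USW00024234", "USW00024233"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id   = rep(c("USW00094290", "USW00024234", "USW00024233"), each = 2),
  date = rep(as.Date(c("2017-01-01", "2017-01-02")), times = 3),
  tmax = c(28, 28, 33, 28, 28, 11),
  stringsAsFactors = FALSE
)

# Attach the daily observations to every (unit, location) pair,
# then average across the locations of each unit for each date.
unit_daily <- df1 %>%
  left_join(df2, by = c("locationID" = "id")) %>%
  group_by(unitID, date) %>%
  summarise(avg_temp = mean(tmax, na.rm = TRUE), .groups = "drop")

print(unit_daily)
```

With this subset, unit 98008 on 2017-01-01 averages 28 and 33, giving 30.5. Because the grouping is by unit and date, the same pipeline should carry over to more units and longer date ranges without changes.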

In base R, we can use merge and aggregate

aggregate(tmax~unitID + locationID, 
          merge(df1, df2, by.x = "locationID", by.y = "id", all.x = TRUE), 
          mean, na.rm = TRUE)
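
The per-unit, per-date average requested in the question can be sketched in base R the same way, by putting unitID and date on the right-hand side of the formula. This is an illustrative variant, not part of the original answer; the data are a subset of the question's sample, and unit_daily and avg_temp are names assumed here.

```r
# Subset of the question's sample data; USW00024234 belongs to two units.
df1 <- data.frame(
  unitID     = c("98008", "98008", "98146", "98146"),
  locationID = c("USW00094290", "USW00024234", "USW00024234", "USW00024233"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id   = rep(c("USW00094290", "USW00024234", "USW00024233"), each = 2),
  date = rep(as.Date(c("2017-01-01", "2017-01-02")), times = 3),
  tmax = c(28, 28, 33, 28, 28, 11),
  stringsAsFactors = FALSE
)

# Join locations to their observations (inner join: unmatched locations drop out).
merged <- merge(df1, df2, by.x = "locationID", by.y = "id")

# Average tmax over all locations of a unit, separately for each date.
unit_daily <- aggregate(tmax ~ unitID + date, data = merged, FUN = mean)
names(unit_daily)[names(unit_daily) == "tmax"] <- "avg_temp"

# aggregate() does not return rows sorted by unit first, so sort explicitly.
unit_daily <- unit_daily[order(unit_daily$unitID, unit_daily$date), ]

print(unit_daily)
```

The formula interface of aggregate drops rows with missing tmax by default, so no na.rm argument is passed in this sketch.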
