首页 > 解决方案 > 计算不平衡集合中的差异价格

问题描述

我有一个包含日期、地点和价格的不平衡数据框。我想按日期计算不同地区之间的差异价格。我的数据是不平衡的,并获得我认为在创建数据(地区)以平衡数据时的所有差异价格。

我的数据看起来像:

library(dplyr)
set.seed(123)
df= data.frame(date=(1:3),
    locality= rbinom(21,3, 0.2),
    price=rnorm(21, 50, 20)) 

df %>%
arrange(date, locality)

>    date locality     price
1     1        0  60.07625
2     1        0  35.32994
3     1        0  63.69872
4     1        1  54.76426
5     1        1  66.51080
6     1        1  28.28602
7     1        2  47.09213
8     2        0  26.68910
9     2        1 100.56673
10    2        1  48.88628
11    2        1  48.29153
12    2        2  29.02214
13    2        2  45.68269
14    2        2  43.59887
15    3        0  60.98193
16    3        0  75.89527
17    3        0  43.30174
18    3        0  71.41221
19    3        0  33.62969
20    3        1  34.31236
21    3        1  23.76955

为了获得平衡的数据,我认为:

>    date locality     price
1     1        0  60.07625
2     1        0  35.32994
3     1        0  63.69872
4     1        1  54.76426
5     1        1  66.51080
6     1        1  28.28602
7     1        2  47.09213
8     1        2  NA
9     1        2  NA
10    2        0  26.68910
10    2        0  NA
10    2        0  NA
11    2        1 100.56673
12    2        1  48.88628
13    2        1  48.29153
14    2        2  29.02214
15    2        2  45.68269
16    2        2  43.59887
etc...

最后,为了获得不同的价格 beetwen 对地区,我认为:

>   date      diff(price, 0-1)     diff(price, 0-2)    diff(price, 1-2)
1     1      60.07625-54.76426    60.07625-47.09213    etc...
2     1      35.32994-66.51080    35.32994-NA
3     1      63.69872-28.28602    63.69872-NA

标签: rdplyr

解决方案


您无需平衡数据。如果您使用dcast,它将NA为您添加 s。

首先转换数据以显示每个位置的单独列

library(data.table)
library(tidyverse)
setDT(df)

df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')

#     rid date        0         1        2
#  1:   1    1 60.07625  54.76426 47.09213
#  2:   1    2 26.68910 100.56673 29.02214
#  3:   1    3 60.98193  34.31236       NA
#  4:   2    1 35.32994  66.51080       NA
#  5:   2    2       NA  48.88628 45.68269
#  6:   2    3 75.89527  23.76955       NA
#  7:   3    1 63.69872  28.28602       NA
#  8:   3    2       NA  48.29153 43.59887
#  9:   3    3 43.30174        NA       NA
# 10:   4    3 71.41221        NA       NA
# 11:   5    3 33.62969        NA       NA

然后创建一个to_diff差异数据框来计算,并pmap在其上计算差异。这里c0_1对应于您在问题中所称的内容diff(price, 0-1)

to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>% 
  setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>% 
  bind_cols(df2[, 1:2])

# A tibble: 11 x 5
#      c0_1   c0_2  c1_2   rid  date
#     <dbl>  <dbl> <dbl> <int> <int>
#  1   5.31  13.0   7.67     1     1
#  2 -73.9   -2.33 71.5      1     2
#  3  26.7   NA    NA        1     3
#  4 -31.2   NA    NA        2     1
#  5  NA     NA     3.20     2     2
#  6  52.1   NA    NA        2     3
#  7  35.4   NA    NA        3     1
#  8  NA     NA     4.69     3     2
#  9  NA     NA    NA        3     3
# 10  NA     NA    NA        4     3
# 11  NA     NA    NA        5     3

推荐阅读