Summarise groups using the runner package

Problem description

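I have a table of house prices and sale dates. I want to use the runner package to calculate a rolling median price over a 365-day window, and I want exactly one median price per date.

My problem is that when I run the code below, any date that occurs more than once gets more than one median price. This is not what I expected: I thought that using group_by/summarise would give one result per day.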

library(runner)
library(tidyverse)
library(lubridate)

startDate <- as_date("2018-01-01")
endDate <- as_date("2020-01-01")

# Create data
soldData <- tibble(
  price = round(rnorm(100, mean = 500000, sd = 100000), -3),
  date = sample(seq.Date(startDate, endDate, by = "days"), 100, replace = TRUE))

# Fill in the missing dates between startDate and endDate
soldData <- bind_rows(
  soldData,
  anti_join(tibble(date = seq.Date(startDate, endDate, by = "day")), soldData, by = "date")) %>%
  arrange(date)

# Find the duplicated dates
duplicatedDates <- soldData[duplicated(soldData$date),]$date

# I thought using group_by/summarise would return one medianPrice per date
results <- soldData %>%
  group_by(date) %>%
  summarise(medianPrice = runner(
    price,
    k = "365 days",
    idx = date,
    f = function(x) median(x, na.rm = TRUE)))

# These are the problem rows.
duplicatedResults <- results %>%
  filter(date %in% duplicatedDates)

Any idea where I'm going wrong?

Tags: r, runner

Solution


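Since dplyr 1.0.0, summarise() can return more than one row per group (dplyr 1.1.0 and later deprecate this in favour of reframe()). runner() returns one value for each element of its input, so a date that occurs n times produces n rows. Because the call runs inside group_by(date), each window also sees only that single date's prices, so the result is not a 365-day rolling median.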
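To see why a duplicated date yields extra rows, here is a minimal sketch on a hypothetical two-row group (the dates and prices are made up for illustration). It only checks the length of the result, which is the point at issue.

```r
library(runner)

# Hypothetical toy group: two sales recorded on the same date
dates  <- as.Date(c("2018-01-05", "2018-01-05"))
prices <- c(500000, 700000)

# runner() evaluates f once per element, so two inputs give two outputs
out <- runner(
  prices,
  k = "365 days",
  idx = dates,
  f = function(x) median(x, na.rm = TRUE)
)

length(out)  # same length as `prices`
```

Inside summarise(), those two values become two rows for the same date.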

First, you need to deal with the duplicated dates in your data. What do you want to do with dates that occur more than once? One option is to take their median or mean.

library(dplyr)
library(runner)

# Collapse duplicated dates to a single price per date
df <- soldData %>%
  group_by(date) %>%
  summarise(price = median(price, na.rm = TRUE))

Now df has only one value per date, so you can apply the runner function to it.

df %>%
  mutate(medianPrice = runner(
    price,
    k = "365 days",
    idx = date,
    f = function(x) median(x, na.rm = TRUE)
  ))
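One thing to keep in mind is that this computes a rolling median of daily medians, which can differ from the median of all individual sales in the window. If the latter is what you want, the following is a rough sketch in plain dplyr and base R rather than runner. The helper rolling_median_at() and the toy data are hypothetical, and the window is assumed to cover the current date plus the 364 days before it.

```r
library(dplyr)

# Toy data (made-up values): 2018-01-02 has three sales, 2018-01-03 has none
sales <- tibble(
  date  = as.Date(c("2018-01-01", "2018-01-02", "2018-01-02",
                    "2018-01-02", "2018-01-03")),
  price = c(100, 200, 400, 600, NA)
)

# Hypothetical helper: median of every price sold in the
# `window_days` days ending at (and including) date `d`
rolling_median_at <- function(d, dates, prices, window_days = 365) {
  in_window <- dates > d - window_days & dates <= d
  median(prices[in_window], na.rm = TRUE)
}

# One row per distinct date, each using all sales in its window
result <- tibble(date = sort(unique(sales$date))) %>%
  mutate(medianPrice = vapply(
    date,
    rolling_median_at,
    numeric(1),
    dates  = sales$date,
    prices = sales$price
  ))
```

Here result has one row per distinct date. This loops over the dates and rescans the data each time, so it is meant to show the idea rather than to be fast on large tables.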

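There is also zoo::rollmedianr, which calculates a right-aligned rolling median. Its window is counted in observations rather than days (and its width must be odd), so it matches a 365-day window only when the data has exactly one row per consecutive day.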
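As a rough sketch of the zoo route, the example below uses zoo::rollapplyr() with median rather than rollmedianr itself, so that na.rm = TRUE and partial windows can be passed through. The input vector is made up and a width of 3 stands in for 365. It assumes one value per consecutive day, as in df.

```r
library(zoo)

# Made-up daily prices, one per consecutive day, with a missing day
daily_price <- c(1, NA, 3, 5)

# Right-aligned rolling median over the last 3 observations;
# partial = TRUE lets the first windows be shorter than `width`
rolled <- rollapplyr(
  daily_price,
  width = 3,
  FUN = median,
  na.rm = TRUE,
  partial = TRUE
)
```

With partial = TRUE the result has the same length as the input, so it can be added to df as a new column.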

