Unnesting year ranges in R when dummy variables are present

Question

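I am working on a dataset containing individuals' residence and occupation information. Originally, each row says that a person lived at an address from one year to another, for example from 1920 to 1925. If the individual moved to that address in 1920, a dummy variable (moved_in) has the value 1. Likewise, if the individual moved out of that address in 1925, a second dummy (moved_out) has the value 1.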

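The problem is that when I unnest the year range into one row per year, every observation from 1920 to 1925 carries the value 1 for both moved_in and moved_out.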

Example data:

library(tidyr)
library(dplyr)

individual <- c('John Doe','Peter Gynn','Jolie Hope', 'Jolie Hope')
occupation <- c('banker', 'butcher', 'clerk', 'clerk')
first_obs <- c(1920, 1920, 1920, 1925)
last_obs <- c(1925, 1925, 1925, 1926)
moved_in <- c(1, 0, 1, 1)
moved_out <- c(0, 0, 1, 0)
address <- c('king street', 'market street', 'montgomery road', 'princes ave')


df <- data.frame(individual, occupation, address, first_obs, last_obs, moved_in, moved_out)

# Build a list column holding every year in each first_obs..last_obs range
df$year <- mapply(seq, df$first_obs, df$last_obs, SIMPLIFY = FALSE)


new_df <- df %>% 
  unnest(year) %>% 
  select(-first_obs,-last_obs)

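As you can see, Jolie Hope, for example, appears to move into and out of her address every year from 1920 to 1925, whereas she should move in in 1920 and move out in 1925. Is there a solution for this?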

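In addition, because people move out and in within the same year, I run into duplicate values. For example, Jolie Hope moved out of Montgomery Road in 1925 and moved into Princes Avenue in 1925. I think the best solution is to keep only the "moved in" row. Is it possible to systematically remove all "moved out" rows where such duplicates exist?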

Tags: r, dplyr, tidyr

Answer


We can group_by individual and address, then assign 1 to the first year if they moved in and 1 to the last year if they moved out.

library(dplyr)

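df %>% 
  tidyr::unnest(year) %>% 
  select(-first_obs, -last_obs) %>%
  group_by(individual, address) %>%
  # Keep the move-in flag only on the first year of each stay and the
  # move-out flag only on the last year; stays without a flag are left as-is.
  mutate(moved_in = if (any(moved_in == 1)) replace(moved_in, 
                    row_number() != 1, 0) else moved_in, 
         moved_out = if (any(moved_out == 1)) replace(moved_out, 
                     row_number() != n(), 0) else moved_out)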

#   individual occupation address         moved_in moved_out  year
#   <fct>      <fct>      <fct>              <dbl>     <dbl> <int>
# 1 John Doe   banker     king street            1         0  1920
# 2 John Doe   banker     king street            0         0  1921
# 3 John Doe   banker     king street            0         0  1922
# 4 John Doe   banker     king street            0         0  1923
# 5 John Doe   banker     king street            0         0  1924
# 6 John Doe   banker     king street            0         0  1925
# 7 Peter Gynn butcher    market street          0         0  1920
# 8 Peter Gynn butcher    market street          0         0  1921
# 9 Peter Gynn butcher    market street          0         0  1922
#10 Peter Gynn butcher    market street          0         0  1923
#11 Peter Gynn butcher    market street          0         0  1924
#12 Peter Gynn butcher    market street          0         0  1925
#13 Jolie Hope clerk      montgomery road        1         0  1920
#14 Jolie Hope clerk      montgomery road        0         0  1921
#15 Jolie Hope clerk      montgomery road        0         0  1922
#16 Jolie Hope clerk      montgomery road        0         0  1923
#17 Jolie Hope clerk      montgomery road        0         0  1924
#18 Jolie Hope clerk      montgomery road        0         1  1925
#19 Jolie Hope clerk      princes ave            1         0  1925
#20 Jolie Hope clerk      princes ave            0         0  1926
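
As a possible alternative to the grouped approach above (a sketch, not part of the original answer), the flags can be reset by comparing each unnested year with first_obs and last_obs before those columns are dropped. This assumes each original row describes exactly one stay; the name alt_df is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  individual = c("John Doe", "Peter Gynn", "Jolie Hope", "Jolie Hope"),
  occupation = c("banker", "butcher", "clerk", "clerk"),
  address    = c("king street", "market street", "montgomery road", "princes ave"),
  first_obs  = c(1920, 1920, 1920, 1925),
  last_obs   = c(1925, 1925, 1925, 1926),
  moved_in   = c(1, 0, 1, 1),
  moved_out  = c(0, 0, 1, 0)
)
df$year <- mapply(seq, df$first_obs, df$last_obs, SIMPLIFY = FALSE)

alt_df <- df %>%
  unnest(year) %>%
  mutate(
    # A move-in can only happen in the first year of the stay
    moved_in  = as.integer(moved_in == 1 & year == first_obs),
    # A move-out can only happen in the last year of the stay
    moved_out = as.integer(moved_out == 1 & year == last_obs)
  ) %>%
  select(-first_obs, -last_obs)
```

Unlike the grouped version, this treats every original row as its own stay, so two separate stays at the same address by the same person would each keep their own flags.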

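As for the duplicate values, I think it is better to keep both rows for the same year, since they show that the person moved out of the old address and into the new one in that year.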
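
If one row per person per year is needed anyway, as the question asks, the following is a minimal sketch of one way to drop the "moved out" rows. The rule is an assumption: within each individual and year, a row flagged only as moved_out is removed when another row in that group is flagged moved_in. The names clean_df and dedup_df are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  individual = c("John Doe", "Peter Gynn", "Jolie Hope", "Jolie Hope"),
  occupation = c("banker", "butcher", "clerk", "clerk"),
  address    = c("king street", "market street", "montgomery road", "princes ave"),
  first_obs  = c(1920, 1920, 1920, 1925),
  last_obs   = c(1925, 1925, 1925, 1926),
  moved_in   = c(1, 0, 1, 1),
  moved_out  = c(0, 0, 1, 0)
)
df$year <- mapply(seq, df$first_obs, df$last_obs, SIMPLIFY = FALSE)

# Step 1: the grouped clean-up from the answer
clean_df <- df %>%
  unnest(year) %>%
  select(-first_obs, -last_obs) %>%
  group_by(individual, address) %>%
  mutate(
    moved_in  = if (any(moved_in == 1)) replace(moved_in, row_number() != 1, 0) else moved_in,
    moved_out = if (any(moved_out == 1)) replace(moved_out, row_number() != n(), 0) else moved_out
  ) %>%
  ungroup()

# Step 2 (assumed rule): per individual and year, drop a pure move-out row
# when the same person also has a move-in row in that year
dedup_df <- clean_df %>%
  group_by(individual, year) %>%
  filter(!(moved_out == 1 & moved_in == 0 & any(moved_in == 1))) %>%
  ungroup()
```

With the example data this removes only Jolie Hope's 1925 row for montgomery road. Note that the move-out from that address is then no longer recorded anywhere, which is the trade-off behind the recommendation to keep both rows.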

