How to calculate the duration of spells from long data with certain requirements

Question

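I have a large dataset in which people's employment status is recorded monthly over one year (more than 6,000,000 observations from more than 500,000 people). It looks like this: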

ID <- c(1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3)
Month <- c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar", "Mar", "Mar", "Apr", "Apr", "Apr", 
           "May", "May", "May", "Jun", "Jun", "Jun", "Jul", "Jul", "Jul", "Aug", "Aug", "Aug", 
           "Sep", "Sep", "Sep", "Oct", "Oct", "Oct", "Nov", "Nov", "Nov", "Dec", "Dec", "Dec")
Activity <- c("Unemployed", "Unemployed", "Other", "Employed", "Employed", "Unemployed", "Employed", 
         "Employed", "Employed", "Unemployed", "Unemployed", "Unemployed", "Unemployed", "Unemployed", 
         "Unemployed", "Employed", "Employed", "Employed", "Other", "NA", "Unemployed", "Employed", "Unemployed", 
         "Employed", "Inactive", "Unemployed", "Employed", "Employed", "Employed", "Unemployed", 
         "Other", "Unemployed", "Employed", "Unemployed", "Other", "Other")
df <- data.frame(ID, Month, Activity)
df[order(ID),]

   ID Month   Activity
1   1   Jan Unemployed
4   1   Feb   Employed
7   1   Mar   Employed
10  1   Apr Unemployed
13  1   May Unemployed
16  1   Jun   Employed
19  1   Jul      Other
22  1   Aug   Employed
25  1   Sep   Inactive
28  1   Oct   Employed
31  1   Nov      Other
34  1   Dec Unemployed
2   2   Jan Unemployed
5   2   Feb   Employed
8   2   Mar   Employed
11  2   Apr Unemployed
14  2   May Unemployed
17  2   Jun   Employed
20  2   Jul         NA
23  2   Aug Unemployed
26  2   Sep Unemployed
29  2   Oct   Employed
32  2   Nov Unemployed
35  2   Dec      Other
3   3   Jan      Other
6   3   Feb Unemployed
9   3   Mar   Employed
12  3   Apr Unemployed
15  3   May Unemployed
18  3   Jun   Employed
21  3   Jul Unemployed
24  3   Aug   Employed
27  3   Sep   Employed
30  3   Oct Unemployed
33  3   Nov   Employed
36  3   Dec      Other

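What I want to achieve is a new data frame with one row per employment spell, but only for employment spells that are immediately preceded and immediately followed by a period of unemployment. In other words, I only want to include spells in which a person moves from unemployment into employment and back into unemployment, and I want to be able to calculate the duration of each such spell. I also need the starting and ending month of each employment spell. The final data frame should look like this: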

  ID Starting Ending Duration
1  1      Feb    Mar        2
3  2      Feb    Mar        2
5  2      Oct    Oct        1
6  3      Mar    Mar        1
7  3      Jun    Jun        1
8  3      Aug    Sep        2

Tags: r

Answer


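To keep only the employment spells that fall directly between two unemployment spells, you can collapse each run of identical consecutive states into one row and then filter on the neighbouring rows: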

library(dplyr)

df %>%
  arrange(ID) %>%
  # rleid() gives every run of identical consecutive Activity values its own id
  group_by(ID, grp = data.table::rleid(Activity)) %>%
  summarise(Activity = first(Activity),
            Starting = first(Month), 
            Ending = last(Month), 
            Duration = n(),
            .groups = "drop_last") %>%
  # Still grouped by ID here, so lag()/lead() only look at the same person's spells.
  # Keep only those 'Employed' rows that are surrounded by 'Unemployed' ones
  filter(Activity == 'Employed', 
         lag(Activity) == 'Unemployed', lead(Activity) == 'Unemployed') %>%
  ungroup() %>%
  select(-grp, -Activity)

#     ID Starting Ending Duration
#  <int> <chr>    <chr>     <int>
#1     1 Feb      Mar           2
#2     2 Feb      Mar           2
#3     2 Oct      Oct           1
#4     3 Mar      Mar           1
#5     3 Jun      Jun           1
#6     3 Aug      Sep           2
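
For readers who would rather not depend on dplyr and data.table, the same run-length idea can be sketched in base R with rle(). This is only an illustrative sketch, not part of the original answer: the helper name employment_spells is hypothetical, and it assumes Month holds the English three-letter abbreviations found in month.abb so that rows can be put into calendar order explicitly.

```r
# Hypothetical helper (not from the original answer): employment spells that
# are bracketed by unemployment spells, using base R only.
employment_spells <- function(df) {
  # Assumption: Month values are English abbreviations as in month.abb
  df <- df[order(df$ID, match(df$Month, month.abb)), ]

  per_id <- lapply(split(df, df$ID), function(d) {
    r <- rle(as.character(d$Activity))   # runs of identical consecutive states
    n <- length(r$values)
    end <- cumsum(r$lengths)             # last row index of each run
    start <- end - r$lengths + 1         # first row index of each run
    prev_state <- c(NA, r$values[-n])    # state of the preceding run
    next_state <- c(r$values[-1], NA)    # state of the following run

    # which() drops the NA comparisons at the first and last run
    keep <- which(r$values == "Employed" &
                    prev_state == "Unemployed" &
                    next_state == "Unemployed")

    data.frame(ID = d$ID[start[keep]],
               Starting = d$Month[start[keep]],
               Ending = d$Month[end[keep]],
               Duration = r$lengths[keep])
  })

  result <- do.call(rbind, per_id)
  rownames(result) <- NULL
  result
}
```

Called as employment_spells(df) on the sample data from the question, this sketch should give the same six rows as the pipeline above. Unlike the pipeline, it sorts by month explicitly instead of relying on the rows already being in chronological order within each ID.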
