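r - How to calculate the duration of spells from long data under certain requirements
Problem description
I have a large dataset in which people's employment status is recorded monthly over one year (more than 6,000,000 observations from more than 500,000 people). It looks like this: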
ID <- c(1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3)
Month <- c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar", "Mar", "Mar", "Apr", "Apr", "Apr",
"May", "May", "May", "Jun", "Jun", "Jun", "Jul", "Jul", "Jul", "Aug", "Aug", "Aug",
"Sep", "Sep", "Sep", "Oct", "Oct", "Oct", "Nov", "Nov", "Nov", "Dec", "Dec", "Dec")
Activity <- c("Unemployed", "Unemployed", "Other", "Employed", "Employed", "Unemployed", "Employed",
"Employed", "Employed", "Unemployed", "Unemployed", "Unemployed", "Unemployed", "Unemployed",
"Unemployed", "Employed", "Employed", "Employed", "Other", "NA", "Unemployed", "Employed", "Unemployed",
"Employed", "Inactive", "Unemployed", "Employed", "Employed", "Employed", "Unemployed",
"Other", "Unemployed", "Employed", "Unemployed", "Other", "Other")
df <- data.frame(ID, Month, Activity)
df[order(ID),]
   ID Month   Activity
1   1   Jan Unemployed
4   1   Feb   Employed
7   1   Mar   Employed
10  1   Apr Unemployed
13  1   May Unemployed
16  1   Jun   Employed
19  1   Jul      Other
22  1   Aug   Employed
25  1   Sep   Inactive
28  1   Oct   Employed
31  1   Nov      Other
34  1   Dec Unemployed
2   2   Jan Unemployed
5   2   Feb   Employed
8   2   Mar   Employed
11  2   Apr Unemployed
14  2   May Unemployed
17  2   Jun   Employed
20  2   Jul         NA
23  2   Aug Unemployed
26  2   Sep Unemployed
29  2   Oct   Employed
32  2   Nov Unemployed
35  2   Dec      Other
3   3   Jan      Other
6   3   Feb Unemployed
9   3   Mar   Employed
12  3   Apr Unemployed
15  3   May Unemployed
18  3   Jun   Employed
21  3   Jul Unemployed
24  3   Aug   Employed
27  3   Sep   Employed
30  3   Oct Unemployed
33  3   Nov   Employed
36  3   Dec      Other
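What I want to achieve is a new data frame with one row per employment spell, with the requirement that an employment spell is included only if a period of unemployment occurs both directly before and directly after it. In other words, I only want to include employment spells in which a person moves from unemployment into employment and then back into unemployment, and I want to be able to calculate the duration of each of these spells. I also need the starting and ending month of each employment spell. The final data frame should look like this: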
  ID Starting Ending Duration
1  1      Feb    Mar        2
3  2      Feb    Mar        2
5  2      Oct    Oct        1
6  3      Mar    Mar        1
7  3      Jun    Jun        1
8  3      Aug    Sep        2
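The rule can be illustrated on a single person's sequence. The following is a minimal base R sketch, assuming the twelve monthly values of ID 1 from the sample data above; the variable names (`activity_id1`, `runs`, `prev_value`, `next_value`, `qualifies`) are illustrative and not part of the original question. It collapses consecutive identical months into runs with `rle()` and flags the "Employed" runs that have an "Unemployed" run directly before and directly after.

```r
# Monthly activity of ID 1 from the sample data (Jan to Dec).
activity_id1 <- c("Unemployed", "Employed", "Employed", "Unemployed",
                  "Unemployed", "Employed", "Other", "Employed",
                  "Inactive", "Employed", "Other", "Unemployed")

# Collapse consecutive identical values into runs (value + length).
runs <- rle(activity_id1)

# Value of the run before and after each run (NA at the two edges).
n_runs     <- length(runs$values)
prev_value <- c(NA, runs$values[-n_runs])
next_value <- c(runs$values[-1], NA)

# An employment spell qualifies only if it sits between two unemployment runs.
# %in% treats the NA edges as "no match", so the result is never NA.
qualifies <- runs$values == "Employed" &
  prev_value %in% "Unemployed" &
  next_value %in% "Unemployed"

data.frame(Activity = runs$values, Duration = runs$lengths, Qualifies = qualifies)
```

For ID 1 only the second run (Feb to Mar, two months) qualifies. The later employment months border "Other" or "Inactive" and are therefore excluded, which matches the single row for ID 1 in the desired result.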
Solution
To keep only the employment spells that lie between two unemployment spells, you can use:
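library(dplyr)  # data.table must also be installed for rleid()

df %>%
  arrange(ID) %>%
  # rleid() starts a new group id whenever Activity changes between rows
  group_by(ID, grp = data.table::rleid(Activity)) %>%
  # One row per spell; stay grouped by ID so lag()/lead() work within a person
  summarise(Activity = first(Activity),
            Starting = first(Month),
            Ending = last(Month),
            Duration = n(),
            .groups = "drop_last") %>%
  # Keep only those 'Employed' rows that are surrounded by 'Unemployed' ones
  filter(Activity == 'Employed',
         lag(Activity) == 'Unemployed', lead(Activity) == 'Unemployed') %>%
  ungroup() %>%
  select(-grp, -Activity)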
#     ID Starting Ending Duration
#  <int> <chr>    <chr>     <int>
#1     1 Feb      Mar           2
#2     2 Feb      Mar           2
#3     2 Oct      Oct           1
#4     3 Mar      Mar           1
#5     3 Jun      Jun           1
#6     3 Aug      Sep           2
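The pipeline above relies on the rows of each ID already being in calendar order, which holds for the sample data. As a rough, dependency-free alternative, the sketch below first sorts by ID and calendar month and then applies the same run-length idea with base R's `rle()`. The helper name `employment_spells` is hypothetical, and the code assumes English three-letter month labels as in `month.abb`, a single year per person, and the "Employed"/"Unemployed" coding from the question. It has not been benchmarked on data of the size described in the question.

```r
# Hypothetical helper: employment spells enclosed by unemployment, base R only.
employment_spells <- function(df) {
  # Sort chronologically within each ID (assumes month.abb-style labels).
  df <- df[order(df$ID, match(df$Month, month.abb)), ]

  per_id <- lapply(split(df, df$ID), function(d) {
    r <- rle(as.character(d$Activity))
    n <- length(r$values)
    end_row   <- cumsum(r$lengths)         # last row of each run
    start_row <- end_row - r$lengths + 1   # first row of each run
    prev_value <- c(NA, r$values[-n])
    next_value <- c(r$values[-1], NA)
    # Employed run with an Unemployed run directly before and after.
    keep <- r$values %in% "Employed" &
      prev_value %in% "Unemployed" &
      next_value %in% "Unemployed"
    data.frame(ID = d$ID[start_row[keep]],
               Starting = d$Month[start_row[keep]],
               Ending = d$Month[end_row[keep]],
               Duration = r$lengths[keep])
  })

  out <- do.call(rbind, per_id)
  rownames(out) <- NULL
  out
}

# Sample data from the question, rebuilt compactly.
ID <- rep(1:3, times = 12)
Month <- rep(month.abb, each = 3)
Activity <- c("Unemployed", "Unemployed", "Other", "Employed", "Employed", "Unemployed", "Employed",
              "Employed", "Employed", "Unemployed", "Unemployed", "Unemployed", "Unemployed", "Unemployed",
              "Unemployed", "Employed", "Employed", "Employed", "Other", "NA", "Unemployed", "Employed", "Unemployed",
              "Employed", "Inactive", "Unemployed", "Employed", "Employed", "Employed", "Unemployed",
              "Other", "Unemployed", "Employed", "Unemployed", "Other", "Other")
df <- data.frame(ID, Month, Activity)

result <- employment_spells(df)
result
```

On the sample data this yields the same six spells as the dplyr pipeline. Sorting on `match(df$Month, month.abb)` is the main design choice here: it makes the result independent of the row order in which the data arrive.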