How do I create an index variable for the unique values of X within group Y?

Question

I have the following table:

id_question  id_event   num_events
2015012713    49508          1
2015012711    49708          1
2015011523    41808          3
2015011523    44008          3
2015011523    44108          3
2015011522    41508          3
2015011522    43608          3
2015011522    43708          3
2015011521    39708          1
2015011519    44208          1

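The third column gives the number of events per question. I want to create a variable that indexes the events within each question, but only for questions that have more than one event. It should look like this: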

id_question  id_event   num_events  index_event
2015012713    49508          1          
2015012711    49708          1          
2015011523    41808          3          1
2015011523    44008          3          2
2015011523    44108          3          3
2015011522    41508          3          1
2015011522    43608          3          2
2015011522    43708          3          3
2015011521    39708          1          
2015011519    44208          1          

How can I do this?

Tags: r, group-by, dplyr, case-when

Answer


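With the tidyverse, we can create 'index_event' after grouping by 'id_question'. If the group has more than one row (n() > 1), we take the row sequence (row_number()); otherwise case_when falls through to its default, NA.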

library(dplyr)
df1 %>%
   group_by(id_question) %>%
   mutate(index_event = case_when(n() >1 ~ row_number()))
# A tibble: 10 x 4
# Groups:   id_question [6]
#   id_question id_event num_events index_event
#         <int>    <int>      <int>       <int>
# 1  2015012713    49508          1          NA
# 2  2015012711    49708          1          NA
# 3  2015011523    41808          3           1
# 4  2015011523    44008          3           2
# 5  2015011523    44108          3           3
# 6  2015011522    41508          3           1
# 7  2015011522    43608          3           2
# 8  2015011522    43708          3           3
# 9  2015011521    39708          1          NA
#10  2015011519    44208          1          NA

Or with data.table, we use rowid on 'id_question' and change the elements where 'num_events' is 1 to NA by multiplying by NA^ (this uses the fact that NA^0 is 1 and NA^1 is NA).

library(data.table)
setDT(df1)[, index_event := rowid(id_question) * NA^(num_events == 1)]
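The NA^ trick can be checked on its own in base R. The following is only a minimal sketch; the vectors idx and num_events are made-up illustration values, not columns of df1.

```r
# Made-up illustration values (not taken from df1)
num_events <- c(1L, 3L, 3L, 3L, 1L)
idx        <- c(1L, 1L, 2L, 3L, 1L)

# TRUE is coerced to 1 and FALSE to 0, so the mask is NA where
# num_events == 1 (NA^1) and 1 everywhere else (NA^0)
mask <- NA^(num_events == 1)

# Multiplying by the mask blanks out the single-event rows
masked <- idx * mask
print(masked)
```

Because ^ returns a double, the masked index is stored as double rather than integer.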

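Another option, in base R, applies sequence to the frequency table of 'id_question' and changes the elements to NA as in the previous case. Note that table() returns the counts ordered by sorted 'id_question', so this relies on the rows being grouped contiguously in an order whose counts line up with the data, as they do here.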

df1$index_event <-  with(df1, sequence(table(id_question)) * NA^(num_events == 1))
df1$index_event
#[1] NA NA  1  2  3  1  2  3 NA NA
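If the rows of a group are not stored next to each other, a base R variant built on ave may be easier to reason about, since it computes positions per group regardless of row order. This is a sketch under assumptions: df2 is a hypothetical, reshuffled subset of the example rows, and pos and size are illustrative helper names.

```r
# Hypothetical reshuffled subset of the example rows, so that the
# rows of question 2015011523 are not contiguous
df2 <- data.frame(
  id_question = c(2015011523L, 2015012713L, 2015011523L, 2015012711L, 2015011523L),
  id_event    = c(41808L, 49508L, 44008L, 49708L, 44108L)
)

# Running position of each row within its id_question group
pos <- ave(seq_along(df2$id_question), df2$id_question, FUN = seq_along)

# Size of the group each row belongs to
size <- ave(seq_along(df2$id_question), df2$id_question, FUN = length)

# Keep the position only for groups with more than one row
df2$index_event <- ifelse(size > 1L, pos, NA_integer_)
print(df2)
```

This version derives the group size from the data itself, so it does not need the 'num_events' column.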

Data

df1 <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L, 
2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L, 
2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L, 
44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L, 
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-10L))
