How do you select rows based on a string in R when you only want to reference part of that string?

Question

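I have a dataset with an ID column whose values are 8-10 characters long. Each ID encodes the subject's family, their position in that family (parent, proband, or sibling), and their location. Here is a snippet of the ID column: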

library(data.table)

# One-column data.table holding the sample IDs
temp <- data.table(nidaid = c("45-D11150341", "45-D11180321", 
                              "45-D11220022", "45-D11240432", "45-D11270422", "45-D11290422", 
                              "45-D11320321", "45-D11500021", "45-D11500311", "45-D11520011", 
                              "H0050022S", "H0050432S", "H0060331S", "H0180422S", "H0200021S", 
                              "H0200432S", "H0210011S", "H0210422S", "H0250021S", "H0250311S"))

> temp
          nidaid
 1: 45-D11150341
 2: 45-D11180321
 3: 45-D11220022
 4: 45-D11240432
 5: 45-D11270422
 6: 45-D11290422
 7: 45-D11320321
 8: 45-D11500021
 9: 45-D11500311
10: 45-D11520011
11:    H0050022S
12:    H0050432S
13:    H0060331S
14:    H0180422S
15:    H0200021S
16:    H0200432S
17:    H0210011S
18:    H0210422S
19:    H0250021S
20:    H0250311S

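I need to create a column that indicates whether a subject is a proband, as opposed to a parent or sibling. A proband is marked by "00" at a specific position. For subjects in the "45-D" group, this code is the 5th and 6th digits, right after the first 4 digits (e.g., 45-Dxxxx00xx). In the "H" group, it is the 4th and 5th digits, right after the first 3 digits (e.g., Hxxx00xxS). If those positions hold anything other than "00", the subject is not a proband.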

The column I want to create should contain 1 for a proband and 2 for a non-proband. It should look like this:

> temp
          nidaid goal
 1: 45-D11150341    2
 2: 45-D11180321    2
 3: 45-D11220022    1
 4: 45-D11240432    2
 5: 45-D11270422    2
 6: 45-D11290422    2
 7: 45-D11320321    2
 8: 45-D11500021    1
 9: 45-D11500311    2
10: 45-D11520011    1
11:    H0050022S    1
12:    H0050432S    2
13:    H0060331S    2
14:    H0180422S    2
15:    H0200021S    1
16:    H0200432S    2
17:    H0210011S    1
18:    H0210422S    2
19:    H0250021S    1
20:    H0250311S    2

I tried the following code, but it treats a consecutive "00" anywhere in the ID as a match, not only at the positions I care about.

temp2 <- temp %>% 
  mutate(pro.sib = fifelse(grepl("00", nidaid) == TRUE, 1, 2))
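
A quick check on two of the IDs above illustrates the false positive. "H0050432S" has "04" in the proband slot, so it should not be flagged, but the unanchored pattern still matches the "00" inside "H005". The vector name `ids` is only for illustration.

```r
ids <- c("H0050432S", "H0050022S")

# grepl() looks for "00" anywhere in the string, so both IDs match,
# even though only the second one has "00" in the proband slot.
grepl("00", ids)
#> [1] TRUE TRUE
```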

Thanks for your help!

Tags: r, select

Answer


One option:

library(data.table)
library(dplyr)
library(stringr)

# One-column data.table holding the sample IDs
temp <- data.table(nidaid = c("45-D11150341", "45-D11180321", 
                              "45-D11220022", "45-D11240432", "45-D11270422", "45-D11290422", 
                              "45-D11320321", "45-D11500021", "45-D11500311", "45-D11520011", 
                              "H0050022S", "H0050432S", "H0060331S", "H0180422S", "H0200021S", 
                              "H0200432S", "H0210011S", "H0210422S", "H0250021S", "H0250311S"))

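# Anchor each pattern to the end of the ID so that only the proband slot counts:
#   "45..." IDs: "00" followed by exactly two more characters
#   "H..."  IDs: "00" followed by exactly three more characters (two digits plus "S")
temp %>% 
  mutate(goal = case_when(str_detect(nidaid, pattern = "^45.*00.{2}$") ~ 1,
                          str_detect(nidaid, pattern = "^H.*00.{3}$") ~ 1,
                          TRUE ~ 2))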

#>          nidaid goal
#> 1  45-D11150341    2
#> 2  45-D11180321    2
#> 3  45-D11220022    1
#> 4  45-D11240432    2
#> 5  45-D11270422    2
#> 6  45-D11290422    2
#> 7  45-D11320321    2
#> 8  45-D11500021    1
#> 9  45-D11500311    2
#> 10 45-D11520011    1
#> 11    H0050022S    1
#> 12    H0050432S    2
#> 13    H0060331S    2
#> 14    H0180422S    2
#> 15    H0200021S    1
#> 16    H0200432S    2
#> 17    H0210011S    1
#> 18    H0210422S    2
#> 19    H0250021S    1
#> 20    H0250311S    2

Created on 2020-02-24 by the reprex package (v0.3.0)
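
As an alternative to regular expressions, the proband code can also be read directly by character position. The following is a minimal base R sketch, not part of the original answer. It assumes that every "45-D" ID keeps the code at characters 9-10 and every "H" ID keeps it at characters 5-6, as in the sample data. The names `ids`, `start`, and `code` are illustrative.

```r
ids <- c("45-D11220022", "45-D11240432", "H0050022S", "H0050432S")

# Assumption: "45-D" IDs carry the proband code at characters 9-10,
# "H" IDs at characters 5-6 (true for the sample data shown above).
start <- ifelse(startsWith(ids, "45-D"), 9L, 5L)

# substr() is vectorised over start/stop, so each ID uses its own offset.
code <- substr(ids, start, start + 1L)

# 1 = proband ("00" in the slot), 2 = not a proband
goal <- ifelse(code == "00", 1, 2)
goal
#> [1] 1 2 1 2
```

If the ID layout ever varies in length within a group, the fixed offsets would no longer hold, and an anchored pattern like the one in the answer is probably the safer choice.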

