首页 > 解决方案 > 在R中的查找表中按值拆分列

问题描述

我有一张桌子,每个 hpo_term 一行,所以一个病人每个 ID 可以有很多行。

ID hpo_term
123 kidney failure
123 hand tremor
123 kidney transplant
432 hypertension
432 exotropia
432 scissor gait

我还有另外两张表格,一张是肾脏术语,另一张是非肾脏术语,肾脏一个看起来像这样:

kidney failure
kidney transplant
hypertension

非肾脏的看起来像这样:

hand tremor
exotropia
scissor gait

我想要的结果是这样的表格:

ID kidney_hpo_term                   non_kidney_hpo_term
123 kidney failure;kidney transplant hand tremor
432 hypertension                     exotropia;scissor gait

实际上有数百名患者和数百个 HPO 术语。

我可以访问基础 R;dplyr 但我真的不知道如何解决这个问题。

您的帮助将不胜感激。

非常感谢

编辑:

真正的 table1 有更多不相关的额外列,并且每个 ID 总是相同的,我也想导入它。例如:

 ID hpo_term              year_of_birth  affected_relative   genome
    123 kidney failure    2000               Y                38
    123 hand tremor       2000               Y                38
    123 kidney transplant 2000               Y                38
    432 hypertension      1980               N                37
    432 exotropia         1980               N                37
    432 scissor gait      1980               N                37

标签: rmergedata.tablelookup

解决方案


这是一个 dplyr 解决方案:

library(dplyr)

table1 = data.frame(ID = c(123,123,123,432,432,432),
                    hpo_term = c("kidney failure","hand tremor","kidney transplant","hypertension","exotropia","scissor gait"))

kid_terms = c("kidney failure","kidney transplant","hypertension")
nonkid_terms = c("hand tremor","exotropia","scissor gait")

table1$term_type = NA
table1$term_type[table1$hpo_term %in% kid_terms] = "kidney_hpo_term"
table1$term_type[table1$hpo_term %in% nonkid_terms] = "non_kidney_hpo_term"

table2 = table1 %>% group_by(ID,term_type) %>%
  summarize(term_list=paste(hpo_term,collapse=";")) %>%
  spread(term_type,term_list)

> table2
    ID kidney_hpo_term                  non_kidney_hpo_term   
1   123 kidney failure;kidney transplant hand tremor           
2   432 hypertension                     exotropia;scissor gait

这是一个data.table解决方案:

library(data.table)

table1 = data.table(ID = c(123,123,123,432,432,432),
                    hpo_term = c("kidney failure","hand tremor","kidney transplant","hypertension","exotropia","scissor gait"))

kid_terms = c("kidney failure","kidney transplant","hypertension")
nonkid_terms = c("hand tremor","exotropia","scissor gait")

table1$term_type = NA
table1$term_type[table1$hpo_term %in% kid_terms] = "kidney_hpo_term"
table1$term_type[table1$hpo_term %in% nonkid_terms] = "non_kidney_hpo_term"

table2 = table1[,.(term_list=paste(hpo_term,collapse=";")),by=.(ID,term_type)]

table3 = dcast(table2, ID~term_type, value.var = "term_list")

> table3
    ID                  kidney_hpo_term    non_kidney_hpo_term
1: 123 kidney failure;kidney transplant            hand tremor
2: 432                     hypertension exotropia;scissor gait

推荐阅读