Find the features common to one group that distinguish it from another

Problem description

I have two groups of patients (ill and healthy). Each patient has a set of ranked features like this:

healthy_patient1 <- data.frame(feature=c("a", "b", "c", "d", "e", "f"), rank = c(0.001, 0.002, 0.002, 0.003, 0.05, 0.067))
healthy_patient2 <- data.frame(feature=c("a", "d", "e", "f", "g", "h", "q"), rank = c(0.001, 0.008, 0.01, 0.02, 0.05, 0.067, 1.2))
healthy_patient3 <- data.frame(feature=c("c", "d", "e", "g", "k", "l"), rank = c(0.003, 0.005, 0.01, 0.02, 0.05, 0.08))
healthy_patient4 <- data.frame(feature=c("b", "e", "g", "d", "k", "q", "o"), rank = c(0.001, 0.008, 0.01, 0.021, 0.054, 0.078, 1.1))

ill_patient1 <- data.frame(feature=c("c", "d", "e", "f", "o", "p", "q"), rank = c(0.002, 0.004, 0.005, 0.006, 0.02, 0.067, 0.09))
ill_patient2 <- data.frame(feature=c("e", "f", "o", "p", "r"), rank = c(0.001, 0.003, 0.02, 0.02, 0.03))
ill_patient3 <- data.frame(feature=c("c", "e", "o", "n", "k", "r"), rank = c(0.003, 0.005, 0.01, 0.03, 0.04, 0.08))
ill_patient4 <- data.frame(feature=c("b", "e", "o", "h", "n", "r", "s"), rank = c(0.002, 0.007, 0.01, 0.02, 0.03, 0.068, 1.1))

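The rank indicates how specific a feature is for a given patient: the lower the rank, the more important the feature. I want to find the features that are common to all healthy patients but not common to the ill patients and, vice versa, the features that are common to all ill patients but not to the healthy ones.

I also need the rank sum of each of these features.

I tried this: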

healthy_comm <- intersect(intersect(healthy_patient1$feature, healthy_patient2$feature),intersect(healthy_patient3$feature, healthy_patient4$feature))
ill_comm <- intersect(intersect(ill_patient1$feature, ill_patient2$feature),intersect(ill_patient3$feature, ill_patient4$feature))
setdiff(healthy_comm, ill_comm)

    healthy_comm 
[1] "d" "e"
    ill_comm 
[1] "e" "o"
    setdiff(healthy_comm, ill_comm) 
[1] "d"

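I could go back to the original data and look up the rank sums for "d", but my real data set has many more patients and features, so perhaps there is a more elegant and efficient way to solve this.

Update: in this case the desired output would be "d", with sum_rank_healthy(d) = 0.037 and sum_rank_ill(d) = 0.004.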

Tags: r, intersection

Solution


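The basic idea is as follows:

  1. Add the name of each data frame to it as a column.
  2. Build the data frames df_healthy and df_ill with bind_rows.
  3. Apply inner_join by feature (you could also join by rank). From the output you can identify the features the two groups share and those that differ.
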
ill_patient1$patient <- "ill_patient1"
ill_patient2$patient <- "ill_patient2"
ill_patient3$patient <- "ill_patient3"
ill_patient4$patient <- "ill_patient4"

healthy_patient1$patient <- "healthy_patient1"
healthy_patient2$patient <- "healthy_patient2"
healthy_patient3$patient <- "healthy_patient3"
healthy_patient4$patient <- "healthy_patient4"
 

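# dplyr must be loaded before bind_rows() and inner_join() are called
library(dplyr)

# Stack the patients of each group into one long data frame
df_healthy <- bind_rows(healthy_patient1, healthy_patient2, healthy_patient3, healthy_patient4)
df_ill <- bind_rows(ill_patient1, ill_patient2, ill_patient3, ill_patient4)

# One row for every ill/healthy pair of patients that share a feature
inner_join(df_ill, df_healthy, by = "feature")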
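
With many patients, tagging each data frame by hand gets tedious. The following is a minimal sketch of one alternative, not part of the original answer. It assumes every patient object follows a naming pattern such as healthy_patient1, healthy_patient2 (that pattern is an assumption), and repeats only two data frames from the question so that it runs on its own. The .id argument of bind_rows fills in the patient column from the list names.

```r
library(dplyr)

# Two data frames from the question, repeated so the snippet is self-contained.
healthy_patient1 <- data.frame(feature = c("a", "b", "c", "d", "e", "f"),
                               rank = c(0.001, 0.002, 0.002, 0.003, 0.05, 0.067))
healthy_patient2 <- data.frame(feature = c("a", "d", "e", "f", "g", "h", "q"),
                               rank = c(0.001, 0.008, 0.01, 0.02, 0.05, 0.067, 1.2))

# Collect every object whose name matches the (assumed) pattern into a named list.
healthy_list <- mget(ls(pattern = "^healthy_patient[0-9]+$"))

# .id = "patient" stores each list name in a new "patient" column,
# replacing the manual `df$patient <- "..."` assignments.
df_healthy <- bind_rows(healthy_list, .id = "patient")
```

The same two lines with the pattern "^ill_patient[0-9]+$" would build df_ill.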

You can extend the join, for example by flagging the pairs whose ranks are equal:

library(dplyr)
inner_join(df_ill, df_healthy, by = "feature") %>% 
  mutate(common_rank = as.logical(rank.x == rank.y))

Output

   feature rank.x patient.x    rank.y patient.y        common_rank
   <chr>    <dbl> <chr>         <dbl> <chr>            <lgl>      
 1 c        0.002 ill_patient1  0.002 healthy_patient1 TRUE       
 2 c        0.002 ill_patient1  0.003 healthy_patient3 FALSE      
 3 d        0.004 ill_patient1  0.003 healthy_patient1 FALSE      
 4 d        0.004 ill_patient1  0.008 healthy_patient2 FALSE      
 5 d        0.004 ill_patient1  0.005 healthy_patient3 FALSE      
 6 d        0.004 ill_patient1  0.021 healthy_patient4 FALSE      
 7 e        0.005 ill_patient1  0.05  healthy_patient1 FALSE      
 8 e        0.005 ill_patient1  0.01  healthy_patient2 FALSE      
 9 e        0.005 ill_patient1  0.01  healthy_patient3 FALSE      
10 e        0.005 ill_patient1  0.008 healthy_patient4 FALSE      
# ... with 29 more rows
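
The join lists every shared feature, but it does not yet produce the summary requested in the question. The following is a hedged sketch of one way to get there, not the original answer's method: for each group, mark a feature as common when it appears in every patient of that group, sum its ranks, and keep the features that are common in exactly one group. The helper summarise_group and the column names common and sum_rank are hypothetical names introduced for illustration.

```r
library(dplyr)

# Data from the question, stored as named lists so the names can serve as patient IDs.
healthy <- list(
  healthy_patient1 = data.frame(feature = c("a", "b", "c", "d", "e", "f"),
                                rank = c(0.001, 0.002, 0.002, 0.003, 0.05, 0.067)),
  healthy_patient2 = data.frame(feature = c("a", "d", "e", "f", "g", "h", "q"),
                                rank = c(0.001, 0.008, 0.01, 0.02, 0.05, 0.067, 1.2)),
  healthy_patient3 = data.frame(feature = c("c", "d", "e", "g", "k", "l"),
                                rank = c(0.003, 0.005, 0.01, 0.02, 0.05, 0.08)),
  healthy_patient4 = data.frame(feature = c("b", "e", "g", "d", "k", "q", "o"),
                                rank = c(0.001, 0.008, 0.01, 0.021, 0.054, 0.078, 1.1))
)
ill <- list(
  ill_patient1 = data.frame(feature = c("c", "d", "e", "f", "o", "p", "q"),
                            rank = c(0.002, 0.004, 0.005, 0.006, 0.02, 0.067, 0.09)),
  ill_patient2 = data.frame(feature = c("e", "f", "o", "p", "r"),
                            rank = c(0.001, 0.003, 0.02, 0.02, 0.03)),
  ill_patient3 = data.frame(feature = c("c", "e", "o", "n", "k", "r"),
                            rank = c(0.003, 0.005, 0.01, 0.03, 0.04, 0.08)),
  ill_patient4 = data.frame(feature = c("b", "e", "o", "h", "n", "r", "s"),
                            rank = c(0.002, 0.007, 0.01, 0.02, 0.03, 0.068, 1.1))
)

# Hypothetical helper: one row per feature with its rank sum and a flag
# that is TRUE when the feature occurs in every patient of the group.
summarise_group <- function(patients) {
  n_total <- length(patients)
  bind_rows(patients, .id = "patient") %>%
    group_by(feature) %>%
    summarise(
      common   = n_distinct(patient) == n_total,
      sum_rank = sum(rank),
      .groups  = "drop"
    )
}

# Combine both summaries; a feature missing from a group is not common there.
result <- full_join(summarise_group(healthy), summarise_group(ill),
                    by = "feature", suffix = c("_healthy", "_ill")) %>%
  mutate(
    common_healthy = coalesce(common_healthy, FALSE),
    common_ill     = coalesce(common_ill, FALSE)
  ) %>%
  # Keep the features that are common in exactly one of the two groups.
  filter(common_healthy != common_ill)

print(result)
```

For the data in the question, result should contain "d" (common to all healthy patients, with rank sums 0.037 and 0.004, matching the update in the question) and "o" (common to all ill patients). Because the helper takes a list of any length, the same code applies unchanged when there are more patients or features.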
