R merge function fails to find shared matches between data frames

Problem description

Hello, I have the following two data frames:

# dataframe 1 --> clst1_trimmed

> head(clst1_trimmed)
# A tibble: 6 x 2
  GeneName Clst.1
  <fct>     <dbl>
1 Cd74      1.20 
2 Lyz2      1.02 
3 Malat1    0.196
4 Ftl1      0.577
5 H2-Ab1    1.04 
6 B2m       0.639

# dataframe2 --> immgen_trimmed
> head(immgen_trimmed)
# A tibble: 6 x 6
  ProbeSetID GeneName Description                                      Cell.A Cell.B Cell.C
       <int> <fct>    <fct>                                             <dbl>  <dbl>  <dbl>
1   10344620 Cd74     " predicted gene 10568"                            15.6   15.3   17.2
2   10344622 Cd74     " predicted gene 10568"                           240.   255.   224. 
3   10344624 Lyz2     " lysophospholipase 1"                            421.   474.   349. 
4   10344633 Malat1   " transcription elongation factor A (SII) 1"      802.   950.   864. 
5   10344637 Flt1     " ATPase H+ transporting lysosomal V1 subunit H"  199.   262.   167. 
6   10344653 Cd3e     " opioid receptor kappa 1"                         14.8   12.8   18.0

I want to merge these based on their shared GeneName values. I tried the following, and it worked:

merged <- merge(clst1_trimmed, immgen_trimmed)
 merged
  GeneName    Clst.1 ProbeSetID                                   Description    Cell.A    Cell.B
1     Cd74 1.1954372   10344622                          predicted gene 10568 239.86400 255.05600
2     Cd74 1.1954372   10344620                          predicted gene 10568  15.62080  15.33110
3   Ifitm3 1.7265938   10344674  family with sequence similarity 150 member A   9.40599   9.22875
4     Lyz2 1.0227826   10344624                           lysophospholipase 1 420.51800 474.19000
5   Malat1 0.1962251   10344633     transcription elongation factor A (SII) 1 801.62400 949.96800
    Cell.C
1 223.8960
2  17.2005
3  10.3231
4 349.0890
5 863.5060

However, merging the two full-size data frames the same way fails:

> dim(sel_clst)
[1] 984   2
> dim(immgen_log2)
[1] 24922   212

merge2 <- merge(sel_clst, immgen_log2)
str(merge2)
'data.frame':   0 obs. of  213 variables:
 $ GeneName                      : Factor w/ 984 levels "0610012G03Rik",..: 
 $ Cluster.1.Log2.Fold.Change    : num 
 $ ProbeSetID                    : int 
 $ Description                   : Factor w/ 21246 levels " "," 1-acylglycerol-3-phosphate O-acyltransferase 1 (lysophosphatidic acid acyltransferase alpha)",..: 
 $ X.proB_CLP_BM.                : num 
 $ X.proB_CLP_FL.                : num 
 $ X.proB_FrA_BM.                : num 

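I think the problem is that GeneName in the immgen_log2 data frame is not being recognized correctly. I looked up a gene that I know should exist in both data frames, "Cd74", but it does not show up in the immgen_log2 data frame.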

> "Cd74" %in% sel_clst$GeneName
[1] TRUE
> "Cd74" %in% immgen_log2$GeneName
[1] FALSE

Any ideas why this fails?

Tags: r, dataframe, merge

Solution

Try this (after making backup copies of these data frames):

levels(sel_clst$GeneName) <- trimws( levels( sel_clst$GeneName ))
levels(immgen_log2$GeneName) <- trimws( levels( immgen_log2$GeneName ))
merge2 <- merge(sel_clst, immgen_log2)
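
To illustrate why stray whitespace can produce an empty merge, here is a minimal sketch with made-up toy data. The data frames `a` and `b` and their values are hypothetical, not the asker's data, and the sketch assumes the cause is leading whitespace in one key column, which the question's output suggests but does not prove.

```r
# Toy data (hypothetical): the key in `b` carries a leading space,
# as read.csv can leave behind when strip.white is not set.
a <- data.frame(GeneName = factor(c("Cd74", "Lyz2")),
                Clst.1   = c(1.20, 1.02))
b <- data.frame(GeneName = factor(c(" Cd74", " Lyz2", " Cd3e")),
                Cell.A   = c(15.6, 421, 14.8))

# "Cd74" and " Cd74" are different strings, so no rows match.
n_before <- nrow(merge(a, b))
n_before                          # 0
"Cd74" %in% b$GeneName            # FALSE

# Trimming the factor levels repairs the key while keeping the column a factor.
levels(b$GeneName) <- trimws(levels(b$GeneName))
merged_ok <- merge(a, b)
nrow(merged_ok)                   # 2
```

If the merge is still empty after trimming, the mismatch lies elsewhere (for example, different capitalization or a different identifier scheme), and comparing `head(levels(...))` of both key columns with `dput` is a reasonable next step.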

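read.csv does not trim whitespace on input by default (strip.white = FALSE), so running trimws after every read.csv operation may be a sensible precaution for future work. For the TL;DR version: set strip.white=TRUE as an argument whenever you use read.csv. I would even say you should overwrite your copy of read.csv with the following: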

read.csv <- 
       function ( ...){ utils::read.csv(..., strip.white=TRUE) }
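
The effect of strip.white can be checked on a small inline CSV. This is a minimal sketch: the `csv_text` string and its values are invented for illustration, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version's default.

```r
# Hypothetical inline CSV: the first data row has padded text in GeneName.
csv_text <- "GeneName,Cell.A\n Cd74 ,15.6\nLyz2,421\n"

raw      <- read.csv(text = csv_text, stringsAsFactors = FALSE)
stripped <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                     strip.white = TRUE)

raw$GeneName                    # " Cd74 " "Lyz2"
stripped$GeneName               # "Cd74" "Lyz2"

"Cd74" %in% raw$GeneName        # FALSE
"Cd74" %in% stripped$GeneName   # TRUE
```

Note that strip.white only affects unquoted character fields; whitespace inside quoted fields is kept, so trimws may still be needed for files that quote their strings.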

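There is an options() setting, accessible through default.stringsAsFactors(), that lets you avoid a lot of newcomer confusion about factor creation (in R 4.0.0 and later, stringsAsFactors defaults to FALSE), but there is no adjustable default for strip.white.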

See this transcript:

> dat <- read.csv(text= "hd1 , hd2, hd3\n 1, a ,   c\n1,b,d\n")
> dat
  hd1 hd2  hd3
1   1  a     c
2   1   b    d
> dput(dat)
structure(list(hd1 = c(1L, 1L), hd2 = structure(1:2, .Label = c(" a ", 
"b"), class = "factor"), hd3 = structure(1:2, .Label = c("   c", 
"d"), class = "factor")), .Names = c("hd1", "hd2", "hd3"), class = "data.frame", row.names = c(NA, 
-2L))
> dat <- data.frame(
             lapply(read.csv(text= "hd1 , hd2, hd3\n 1, a ,   c\n1,b,d\n"), 
                    trimws)
                    )
# could also have used a two step process starting with the original `dat` 
# dat[] <- lapply(dat, trimws)   .... the `[]` preserves structure

> dat
  hd1 hd2 hd3
1   1   a   c
2   1   b   d
> dput(dat)
structure(list(hd1 = structure(c(1L, 1L), .Label = "1", class = "factor"), 
    hd2 = structure(1:2, .Label = c("a", "b"), class = "factor"), 
    hd3 = structure(1:2, .Label = c("c", "d"), class = "factor")), .Names = c("hd1", 
"hd2", "hd3"), row.names = c(NA, -2L), class = "data.frame")
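
In the transcript above, applying trimws to every column also turns the integer column hd1 into a factor. If that side effect is unwanted, one option is to trim only the text columns. The helper name `trim_text_cols` below is hypothetical, and this is a sketch rather than the original answer's method.

```r
dat <- read.csv(text = "hd1,hd2,hd3\n1, a ,   c\n1,b,d\n",
                stringsAsFactors = TRUE)

# Hypothetical helper: trim factor levels and character values,
# and leave every other column type untouched.
trim_text_cols <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) {
      levels(col) <- trimws(levels(col))
      col
    } else if (is.character(col)) {
      trimws(col)
    } else {
      col
    }
  })
  df  # `df[] <-` keeps the data.frame structure
}

clean <- trim_text_cols(dat)
class(clean$hd1)     # "integer"
levels(clean$hd2)    # "a" "b"
levels(clean$hd3)    # "c" "d"
```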
