Merging data frames based on a substring in a column

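Problem description

I have two data frames. One (df_protein) contains experimental measurements from protein fragments that carry modifications; the other (df_modificaton) is a database holding the "names" of all modifications. I am now trying to merge the two.

Both have a column with the modified sequence (the modified amino acid is followed by an asterisk). However, df_protein stores the sequence of the whole fragment (!), starting and ending with "_", whereas df_modificaton gives only the 7 amino acids before and after the modification (if the modification lies near the start or end of the protein, the remaining positions are filled with "_").

To illustrate, here is an MWE: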

library(tibble)  # tibble() replaces the deprecated data_frame()

df_protein <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250)
)

df_modificaton <- tibble(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG", "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)

# How can I merge the two data frames above into the following result?
df_merged <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250),
  Modification = c("Y77", "S125", "S127", "T456", "S3")
)

I am using the tidyverse, but I am open to other packages as well. Thanks.

Tags: r, dataframe, dplyr, merge

Solution


One approach is to use the fuzzyjoin package to perform a string-distance join:

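library(dplyr)
library(fuzzyjoin)

# Join on approximate similarity of the Sequence columns (Jaro-Winkler
# distance), then keep the closest candidate for each fragment.
stringdist_inner_join(df_protein, df_modificaton,
                      by = "Sequence", method = "jw", distance_col = "distance") %>%
  group_by(Sequence.x) %>%
  slice_min(distance)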
# A tibble: 5 x 7
# Groups:   Sequence.x [5]
  Protein.x Sequence.x              Counts Protein.y Sequence.y       Modification distance
  <chr>     <chr>                    <dbl> <chr>     <chr>            <chr>           <dbl>
1 A         _EPTPSIASDIY*LPIATQELR_   3.46 A         PSIASDIY*LPIATQ  Y77             0.260
2 A         _S*SSSLLASPGHISVK_        6.13 A         PEQRLSSS*SLLASPG S127            0.294
3 B         _SMS*VDLSHIPLK_           7.25 B         _____SMS*VDLSHIP S3              0.15 
4 A         _SSS*SLLASPGHISVK_       10.0  A         PEQRLSSS*SLLASPG S127            0.294
5 B         _TQDPVPPET*PSDSDHK_       0    B         DPVPPET*PSDSDHK  T456            0.137
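
Note that in the output above, the fragment _S*SSSLLASPGHISVK_ is paired with S127, whereas the desired result lists S125, so the closest string distance does not always pick the intended entry. As an alternative, here is a minimal sketch of an exact match anchored at the asterisk. It assumes that every sequence contains exactly one asterisk and that "_" is only padding. The helper names (before_mod, after_mod, same_suffix, same_prefix) and the Window column are made up for this illustration and are not part of any package.

```r
library(dplyr)

df_protein <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_",
               "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250)
)

df_modificaton <- tibble(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG",
               "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)

# Residues up to and including the modified one, with "_" padding removed.
before_mod <- function(s) sub("\\*.*$", "", gsub("_", "", s, fixed = TRUE))
# Residues that follow the asterisk, with "_" padding removed.
after_mod <- function(s) sub("^[^*]*\\*", "", gsub("_", "", s, fixed = TRUE))

# TRUE where the shorter of a and b is a suffix of the longer one.
same_suffix <- function(a, b) {
  n <- pmin(nchar(a), nchar(b))
  substring(a, nchar(a) - n + 1, nchar(a)) == substring(b, nchar(b) - n + 1, nchar(b))
}

# TRUE where the shorter of a and b is a prefix of the longer one.
same_prefix <- function(a, b) {
  n <- pmin(nchar(a), nchar(b))
  substring(a, 1, n) == substring(b, 1, n)
}

df_merged <- df_protein %>%
  mutate(row = row_number()) %>%                       # remember the original order
  merge(rename(df_modificaton, Window = Sequence),     # all pairs within a protein
        by = "Protein") %>%
  filter(same_suffix(before_mod(Sequence), before_mod(Window)),
         same_prefix(after_mod(Sequence), after_mod(Window))) %>%
  arrange(row) %>%
  select(Protein, Sequence, Counts, Modification) %>%
  as_tibble()

df_merged
```

On the example data this keeps one row per fragment, with Modification equal to Y77, S125, S127, T456, S3. The match compares the residues on each side of the asterisk over the length the fragment and the database window have in common, so it works whether the fragment is longer or shorter than the window. If several database windows are compatible with one fragment, all of them are returned.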
