Extracting tricky text in R using REBUS (or plain regular expressions)

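Question

I downloaded protein annotations on localisation from UNIPROT, but unfortunately I cannot get REBUS and STRINGR to extract what I need. After too many failed attempts, I would like to ask for some help. Much appreciated!

I am using stringr and rebus, but plain regular expressions would probably also do the trick (I prefer rebus because it is easier to read).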

#df
startDF <- data.frame(UNIPROT = c("U123", "U223", "U334"),
                   localisation = c("SUBCELLULAR LOCATION: Cell membrane {ECO:0000250}. Membrane {ECO:0000305}; Single-pass membrane protein {ECO:0000305}. Note=Colocalizes with EHD1 and EHD2 at plasma membrane in myoblasts and myotubes. Localizes into foci at the plasma membrane (By similarity). {ECO:0000250}.", "SUBCELLULAR LOCATION: Cytoplasm, cytosol {ECO:0000269|PubMed:11554768}. Endoplasmic reticulum {ECO:0000269|PubMed:11554768}. Note=May transiently interact with the endoplasmic reticulum.", "SUBCELLULAR LOCATION: Lysosome membrane {ECO:0000305|PubMed:14592447}; Multi-pass membrane protein {ECO:0000255}."))

#packages
library(stringr)
library(rebus)

#tried to extract the first entry like this, but no success:
str_extract(startDF$localisation, pattern = "SUBCELLULAR LOCATION:" %R% WRD %R% OPEN_BRACKET %R% END)


#hoped for result
resultDF <- data.frame(UNIPROT = c("U123", "U223", "U334"),
                       primary_loc = c("Cell membrane", "Cytoplasm", "Lysosome membrane"),
                       other_loc = c("Membrane;Single-pass membrane protein" , "Endoplasmic reticulum",  "Multi-pass membrane protein"),
                       note = c(NA, "May transiently interact with the endoplasmic reticulum", NA))


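In the end, I would like the information separated into columns: ideally the primary location first, then the secondary locations, then the note (if there is one). Bonus: if you can distinguish actual secondary localisations from descriptions of the transmembrane domain type, you deserve a medal!

Thank you very much for your help!

Tags: r, regex, stringr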

Solution


There are probably easier ways to reach the same result, but this is my first go at the problem. Hopefully it gets you started.
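As an aside, here is a rough look at why the pattern in the question returns NA. This assumes the rebus constants expand the way I read them (WRD to `\\w`, OPEN_BRACKET to `\\[`, END to `$`), so the pattern asks for exactly one word character directly after the colon, then a literal `[`, then the end of the string. None of the annotations look like that. The base R sketch below uses the third annotation from the question; the names `attempt` and `primary` are only illustrative.

```r
# Third annotation string from the question.
x <- "SUBCELLULAR LOCATION: Lysosome membrane {ECO:0000305|PubMed:14592447}; Multi-pass membrane protein {ECO:0000255}."

# Plain-regex reading of the rebus pattern (assumed expansion, see above).
attempt <- "SUBCELLULAR LOCATION:\\w\\[$"
grepl(attempt, x)   # no match, which is why the extraction yields NA

# Capture everything between the prefix and the first "{" instead,
# then trim the surrounding whitespace.
primary <- trimws(sub("^SUBCELLULAR LOCATION:([^{]*)\\{.*$", "\\1", x))
primary
```

Note that the evidence tags use curly braces, so a brace (not a bracket) is the character to stop at.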

library( data.table )

#1 split the location-strings, using "Note=" as split character
l <- data.table::tstrsplit( startDF$localisation, "Note=", fixed = FALSE )

#2 now, get the locations by splitting the location-strings
#first, strip the `SUBCELLULAR LOCATION:`
l <- lapply( l, function(x) gsub( "^SUBCELLULAR LOCATION: ", "", x ) )
#and get rid of all the stuff within { ... }
l <- lapply( l, function(x) gsub( "\\{.*?\\}", "", x ) )
#now split the locations on . and ;, and trim whitespace
locations <- lapply( strsplit( l[[1]], "[.;]", fixed = FALSE ), trimws )
#remove any empty locations
locations <- lapply( locations, function(x) subset(x, nchar(x) > 0) )
#paste locations together
locations <- lapply( locations, paste0, collapse = ";")

#3 and the note?
notes <- l[[2]]

#4 now we build the final data.table
#first step is easy ;-)
dt <- data.table( UNIPROT  = startDF$UNIPROT )
#get the maximum number of locations
max_loc <- length( tstrsplit( locations,";" ) )
#input the locations
dt[, paste0("location_", 1:max_loc) := tstrsplit( locations, ";" ) ]
#add the note
dt[, note := notes ]
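
For comparison, the same steps can be sketched in base R so that they produce the three columns the question asks for (`primary_loc`, `other_loc`, `note`). This is only a sketch under the assumption that every annotation follows the layout of the examples: a fixed prefix, locations separated by `.` or `;`, and an optional trailing `Note=`. It uses two of the strings from the question verbatim.

```r
loc <- c(
  "SUBCELLULAR LOCATION: Cytoplasm, cytosol {ECO:0000269|PubMed:11554768}. Endoplasmic reticulum {ECO:0000269|PubMed:11554768}. Note=May transiently interact with the endoplasmic reticulum.",
  "SUBCELLULAR LOCATION: Lysosome membrane {ECO:0000305|PubMed:14592447}; Multi-pass membrane protein {ECO:0000255}."
)

# Drop the prefix and every {ECO:...} evidence tag.
clean <- sub("^SUBCELLULAR LOCATION: *", "", loc)
clean <- gsub(" *\\{[^}]*\\}", "", clean)

# Split off the free-text note, if there is one, and strip trailing dots.
has_note <- grepl("Note=", clean, fixed = TRUE)
note <- ifelse(has_note, sub("^.*Note=", "", clean), NA_character_)
note <- trimws(sub("[. ]+$", "", note))
locs <- sub(" *Note=.*$", "", clean)

# Split the remaining text on "." and ";" and drop empty pieces.
parts <- lapply(strsplit(locs, "[.;]"), function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})

result <- data.frame(
  primary_loc = vapply(parts, function(p) p[1], character(1)),
  other_loc   = vapply(parts, function(p) paste(p[-1], collapse = ";"), character(1)),
  note        = note,
  stringsAsFactors = FALSE
)
result
```

The primary location of the first string comes out as "Cytoplasm, cytosol" rather than the "Cytoplasm" in the hoped-for result, because nothing here splits on commas. Whether to keep the part after the comma is a decision this sketch leaves open.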

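This results in a table with one location_N column per location plus the note column (the original answer showed it as a screenshot, because the long notes did not print well).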
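
On the bonus question, one possible starting point is to flag the pieces that describe membrane topology rather than a location. The sketch below assumes that such descriptions end in "membrane protein", as "Single-pass membrane protein" and "Multi-pass membrane protein" do in the example data. That rule is a guess based on these three rows only, and the names `pieces`, `is_topology`, `secondary_loc` and `topology` are illustrative.

```r
# Location pieces for the first protein, after splitting on "." and ";".
pieces <- c("Cell membrane", "Membrane", "Single-pass membrane protein")

# Assumption: topology descriptions end in "membrane protein".
is_topology <- grepl("membrane protein$", pieces, ignore.case = TRUE)

# Skip the primary location (first piece), then keep the non-topology rest.
secondary_loc <- pieces[-1][!is_topology[-1]]
topology      <- pieces[is_topology]

secondary_loc
topology
```

Any topology term that does not end this way would have to be added to the pattern by hand.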

