r - Split mixed strings into columns in R
Problem description
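I am new to text mining analysis and R coding.
I have 200 genes stored as mixed strings. I want to split them and put the plain words (e.g., cadherins, orphan receptors) into one column, and the numbers (e.g., 2/3) and number+letter tokens (e.g., 7D, 7TM) into another column. I used strsplit to split the words. Any suggestions on how to parse them would be helpful.
Example: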
Genes <- c("7D cadherins", "7TM orphan receptors", "7TM orphan receptors RNA18S", "28S ribosomal RNAs RNA28S", "45S pre-ribosomal RNAs RNA45S", "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")
Expected result (2nd and 3rd columns):
7D cadherins cadherins 7D
7TM orphan receptors orphan receptors 7TM
18S ribosomal RNAs RNA18S ribosomal RNAs RNA18S 18S RNA18S
28S ribosomal RNAs RNA28S ribosomal RNAs RNA28S 28S RNA28S
45S pre-ribosomal RNAs RNA45S pre-ribosomal RNAs 45S RNA45S
5.8S ribosomal RNAs ribosomal RNAs 5.8S
Actin related protein 2/3 complex Actin related protein complex 2/3
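Solution
Use strsplit to split the names, grep to detect the words with or without digits, and paste to collapse the words back together. Put everything in a function to avoid repetition: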
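# invert = TRUE keeps the words without digits; invert = FALSE keeps those with digits
wordS <- function(x, invert = TRUE) {
  clean <- gsub('[[:space:]]+', ' ', x)   # collapse runs of whitespace into one space
  split <- strsplit(clean, ' ')           # list with one vector of words per gene name
  detec <- lapply(split, function(y) grep('[0-9]', y, invert = invert, value = TRUE))
  words <- sapply(detec, paste, collapse = ' ')   # join the selected words back together
  return(words)
}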
data.frame(
  Gene    = Genes,
  column2 = wordS(Genes),
  column3 = wordS(Genes, invert = FALSE)
)
                               Gene                       column2    column3
1                      7D cadherins                     cadherins         7D
2              7TM orphan receptors              orphan receptors        7TM
3       7TM orphan receptors RNA18S              orphan receptors 7TM RNA18S
4         28S ribosomal RNAs RNA28S                ribosomal RNAs 28S RNA28S
5     45S pre-ribosomal RNAs RNA45S            pre-ribosomal RNAs 45S RNA45S
6               5.8S ribosomal RNAs                ribosomal RNAs       5.8S
7 Actin related protein 2/3 complex Actin related protein complex        2/3
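To see what each step of the function contributes, the same pipeline can be traced on a single gene name. This is only a minimal sketch for illustration: the variable names gene, tokens, with_digit and without_digit are not part of the answer, and the input string is taken from the question.

```r
# Trace the three steps on one gene name from the question.
gene <- "Actin related protein 2/3 complex"

# Step 1: strsplit() returns a list with one character vector per input
# string, so [[1]] extracts the words of this single gene name.
tokens <- strsplit(gene, " ")[[1]]

# Step 2: grep() with value = TRUE returns the matching words themselves;
# adding invert = TRUE returns the words that do NOT contain a digit.
with_digit    <- grep("[0-9]", tokens, value = TRUE)
without_digit <- grep("[0-9]", tokens, value = TRUE, invert = TRUE)

# Step 3: paste() with collapse joins each group back into one string.
paste(without_digit, collapse = " ")  # "Actin related protein complex"
paste(with_digit, collapse = " ")     # "2/3"
```

When a name contains no word with a digit, grep returns an empty vector and paste collapses it to an empty string, so that row simply gets "" in the third column.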