r - Most declarative approach to extract data from strings
Problem description
I would like to extract data from strings.
Input:
d <- c("\n 98.000 € VB\n \n 99999\n K9mm999u",
"\n 9 € VB\n \n 89999\n Di9ß99",
"\n 900.000 €\n \n 89999\n Aich9ch",
"\n 979.000 € VB\n \n 98999\n Ni9999rg9",
"\n 979.000 €\n \n 99999\n F9lk99s99"
)
Desired Output:
[[1]]
[1] "99999" "K9mm999u"
[[2]]
[1] "89999" "Di9ß99"
[[3]]
[1] "89999" "Aich9ch"
[[4]]
[1] "98999" "Ni9999rg9"
[[5]]
[1] "99999" "F9lk99s99"
What I tried:
library(magrittr)
d %>% gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>%
trimws %>% strsplit(split = " ")
This works sometimes, but the strsplit(split = " ") step seems fragile: consecutive spaces produce empty strings in the result.
To put the question more generally, this is what I would currently go with:
filter <- function(result) result[sapply(result, nchar) > 0]
d %>% gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>%
trimws %>% strsplit(split = " ") %>% lapply(FUN = filter)
But is it really necessary to define this custom filter function?
Question:
What is the most declarative approach to yield the desired output?
Solution
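In the OP's code, the strsplit separator can be changed from a single space to \\s+, which matches one or more whitespace characters. After trimws, no empty strings are produced, so the custom filter is not needed.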
library(dplyr)
d %>%
gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>%
trimws %>%
strsplit(split = "\\s+")
#[[1]]
#[1] "99999" "K9mm999u"
#[[2]]
#[1] "89999" "Di9ß99"
#[[3]]
#[1] "89999" "Aich9ch"
#[[4]]
#[1] "98999" "Ni9999rg9"
#[[5]]
#[1] "99999" "F9lk99s99"
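To see why the separator matters, here is a minimal sketch on a made-up string (x is a hypothetical value, not part of the original data) with several spaces between the two fields. It also shows that, if the single-space split is kept, the built-in nzchar can stand in for the custom filter.

```r
# Hypothetical input: three spaces between the two fields
x <- "99999   K9mm999u"

# A literal single-space separator yields an empty string for each extra space
single <- strsplit(x, split = " ")[[1]]
print(single)

# "\\s+" treats a whole run of whitespace as one separator
run <- strsplit(x, split = "\\s+")[[1]]
print(run)

# Filter(nzchar, ...) drops empty strings without a user-defined helper
print(Filter(nzchar, single))
```

Changing the separator is probably the more declarative of the two, since it avoids creating the empty strings in the first place.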
Here is another option with read.table from base R: read 'd' into a data.frame, extract the first column, subset its elements with a recycled logical vector (dropping every first of three values, i.e. the price), reshape the rest into a two-column matrix, and split the rows into a list with asplit.
asplit(matrix(read.table(text = d, header = FALSE, fill = TRUE,
stringsAsFactors = FALSE)$V1[c(FALSE, TRUE, TRUE)], ncol = 2, byrow = TRUE), 1)
#[[1]]
#[1] "99999" "K9mm999u"
#[[2]]
#[1] "89999" "Di9ß99"
#[[3]]
#[1] "89999" "Aich9ch"
#[[4]]
#[1] "98999" "Ni9999rg9"
#[[5]]
#[1] "99999" "F9lk99s99"
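The one-liner above can be hard to follow, so here is the same idea broken into named steps. This is only a sketch: it uses the first two input strings, the intermediate names (tab, first_col, keep, m) are made up for illustration, and it assumes every input has exactly three non-blank lines (price, number, name).

```r
d <- c("\n 98.000 € VB\n \n 99999\n K9mm999u",
       "\n 9 € VB\n \n 89999\n Di9ß99")

# Each non-blank line becomes one row; whitespace-only lines are skipped
tab <- read.table(text = d, header = FALSE, fill = TRUE,
                  stringsAsFactors = FALSE)

# First field of every row: price, number, name, price, number, name, ...
first_col <- tab$V1

# The logical vector is recycled, dropping every first of three values
keep <- first_col[c(FALSE, TRUE, TRUE)]

# One row per input string, then one list element per row
m <- matrix(keep, ncol = 2, byrow = TRUE)
res <- asplit(m, 1)
print(res)
```

If an input had a different number of lines, the recycled index would go out of step, so this approach relies on the inputs being regular.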
Or with regmatches/gregexpr from base R: match either 5-digit numbers ([0-9]{5}) or (|) alphanumeric strings of 5 or more characters ([[:alnum:]]{5,}) and extract the matches into a list. The \\b refers to a word boundary.
regmatches(d, gregexpr("\\b[0-9]{5}\\b|\\b[[:alnum:]]{5,}\\b", d))
#[[1]]
#[1] "99999" "K9mm999u"
#[[2]]
#[1] "89999" "Di9ß99"
#[[3]]
#[1] "89999" "Aich9ch"
#[[4]]
#[1] "98999" "Ni9999rg9"
#[[5]]
#[1] "99999" "F9lk99s99"
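A short sketch of how the pattern behaves, using two made-up ASCII strings (x is hypothetical, not the original data). The prices in the original input are skipped only because the thousands separator breaks them into runs shorter than five characters, so a price written without a separator would presumably be picked up as well.

```r
# Hypothetical inputs: the second price has no thousands separator
x <- c("98.000 EUR VB 99999 K9mm999u",
       "98000 EUR 89999 Aich9ch")

pattern <- "\\b[0-9]{5}\\b|\\b[[:alnum:]]{5,}\\b"

# gregexpr() finds all match positions, regmatches() extracts the matched text
res <- regmatches(x, gregexpr(pattern, x))
print(res)
```

In the first string only the number and the name are returned; in the second, "98000" is matched too, which is worth keeping in mind if the price format can vary.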