Most declarative approach to extract data from strings

Problem description

I would like to extract data from strings.

Input:

d <- c("\n      98.000 € VB\n          \n      99999\n          K9mm999u", 
    "\n      9 € VB\n          \n      89999\n          Di9ß99", 
    "\n      900.000 €\n          \n      89999\n          Aich9ch", 
    "\n      979.000 € VB\n          \n      98999\n          Ni9999rg9", 
    "\n      979.000 €\n          \n      99999\n          F9lk99s99"
    )

Desired Output:

[[1]]
[1] "99999"    "K9mm999u"

[[2]]
[1] "89999"  "Di9ß99"

[[3]]
[1] "89999"   "Aich9ch"

[[4]]
[1] "98999"     "Ni9999rg9"

[[5]]
[1] "99999"     "F9lk99s99"

What I tried:

library(magrittr)
d %>% gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>% 
  trimws %>% strsplit(split = "          ")

This works sometimes, but the strsplit(split = "          ") part, with its hard-coded run of spaces, seems pretty fragile.

If it's OK, I would like to ask more generally:

I would go with:

filter <- function(result) result[sapply(result, nchar) > 0]
d %>% gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>% 
  trimws %>% strsplit(split = " ") %>% lapply(FUN = filter)

But is it really necessary to define this custom filter function?

Question:

What is the most declarative approach to yield the desired output?

Tags: r

Solution


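In the OP's code, the split pattern passed to strsplit can be changed to \\s+, which matches one or more whitespace characters. A whole run of spaces is then treated as a single separator, so no empty strings are produced and no custom filter is needed.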

library(dplyr)  # provides the %>% pipe (re-exported from magrittr)
d %>% 
   gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>% 
   trimws %>% 
   strsplit(split = "\\s+")
#[[1]]
#[1] "99999"    "K9mm999u"

#[[2]]
#[1] "89999"  "Di9ß99"

#[[3]]
#[1] "89999"   "Aich9ch"

#[[4]]
#[1] "98999"     "Ni9999rg9"

#[[5]]
#[1] "99999"     "F9lk99s99"
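
To make the difference between the two split patterns concrete, here is a minimal sketch. The string x is an assumption: it is built by hand to resemble one element after the gsub() and trimws() steps (two tokens separated by ten spaces), not taken from the pipeline output.

```r
# Hypothetical cleaned element: two tokens separated by a run of ten spaces.
x <- paste0("99999", strrep(" ", 10), "K9mm999u")

# Splitting on a single literal space leaves empty strings between the tokens,
# which is why the OP needed the custom filter function.
by_space <- strsplit(x, split = " ")[[1]]

# Splitting on the regex "\\s+" treats the whole run as one separator.
by_regex <- strsplit(x, split = "\\s+")[[1]]

print(length(by_space))
print(by_regex)
```

With the regex separator, only the two tokens remain, so the filtering step can be dropped.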

Here is another option with read.table from base R: read 'd' into a data.frame, extract the first column, subset its elements with a recycled logical vector (dropping every price entry), reshape the rest into a two-column matrix, and split it row-wise into a list with asplit

asplit(matrix(read.table(text = d, header = FALSE, fill = TRUE, 
  stringsAsFactors = FALSE)$V1[c(FALSE, TRUE, TRUE)], ncol = 2, byrow = TRUE), 1)
#[[1]]
#[1] "99999"    "K9mm999u"

#[[2]]
#[1] "89999"  "Di9ß99"

#[[3]]
#[1] "89999"   "Aich9ch"

#[[4]]
#[1] "98999"     "Ni9999rg9"

#[[5]]
#[1] "99999"     "F9lk99s99"
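
The one-liner above can be unpacked into its individual steps. This is only a sketch on a shortened, two-element version of the input; the intermediate names (tab, first_col, kept, m, res) are introduced here for illustration and are not part of the original answer.

```r
# Shortened copy of the input (two of the five strings).
d <- c("\n      98.000 € VB\n          \n      99999\n          K9mm999u",
       "\n      900.000 €\n          \n      89999\n          Aich9ch")

# Step 1: each non-blank line becomes a row; fill = TRUE pads the short rows.
tab <- read.table(text = d, header = FALSE, fill = TRUE,
                  stringsAsFactors = FALSE)

# Step 2: the first column cycles through price, code, name, price, code, name.
first_col <- tab$V1

# Step 3: c(FALSE, TRUE, TRUE) is recycled along the column,
# so every price entry is dropped.
kept <- first_col[c(FALSE, TRUE, TRUE)]

# Step 4: one row per input string, then split the rows into a list.
m <- matrix(kept, ncol = 2, byrow = TRUE)
res <- asplit(m, 1)
print(res)
```

Note that this relies on every input string having exactly three non-blank lines; if one of them has more or fewer, the recycled index will pick the wrong elements.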

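Or with regmatches/gregexpr from base R: match either a 5-digit number ([0-9]{5}) or (|) a run of 5 or more alphanumeric characters ([[:alnum:]]{5,}), and extract the matches into a list. The \\b denotes a word boundary, so the digits of a price such as 98.000, which the dot breaks into shorter pieces, are not matched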

regmatches(d, gregexpr("\\b[0-9]{5}\\b|\\b[[:alnum:]]{5,}\\b", d))
#[[1]]
#[1] "99999"    "K9mm999u"

#[[2]]
#[1] "89999"  "Di9ß99"

#[[3]]
#[1] "89999"   "Aich9ch"

#[[4]]
#[1] "98999"     "Ni9999rg9"

#[[5]]
#[1] "99999"     "F9lk99s99"
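
As a small illustration of how the pattern behaves, the sketch below applies it to the first input string only. The variable names (x, pattern, hits, digits_only) are chosen here for the example. Whether [[:alnum:]] matches non-ASCII letters such as ß depends on the locale, so that case is left out.

```r
# First element of the input.
x <- "\n      98.000 € VB\n          \n      99999\n          K9mm999u"

# Same pattern as above: a 5-digit word, or an alphanumeric word of 5+ characters.
pattern <- "\\b[0-9]{5}\\b|\\b[[:alnum:]]{5,}\\b"
hits <- regmatches(x, gregexpr(pattern, x))[[1]]

# The first alternative alone: the price is split by the dot into "98" and
# "000", neither of which is a 5-digit word, so only the code is found.
digits_only <- regmatches(x, gregexpr("\\b[0-9]{5}\\b", x))[[1]]

print(hits)
print(digits_only)
```

The approach depends on the wanted tokens being at least 5 characters long and on nothing else in the string reaching that length, so it is worth checking against the real data before relying on it.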
