I would like to extract data from strings.


d <- c("\n      98.000 € VB\n          \n      99999\n          K9mm999u", 
    "\n      9 € VB\n          \n      89999\n          Di9ß99", 
    "\n      900.000 €\n          \n      89999\n          Aich9ch", 
    "\n      979.000 € VB\n          \n      98999\n          Ni9999rg9", 
    "\n      979.000 €\n          \n      99999\n          F9lk99s99")

Desired Output:

[1] "99999"    "K9mm999u"

[1] "89999"  "Di9ß99"

[1] "89999"   "Aich9ch"

[1] "98999"     "Ni9999rg9"

[1] "99999"     "F9lk99s99"

What I tried:

library(magrittr)  # provides the %>% pipe
d %>% gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>% 
  trimws %>% strsplit(split = "          ")

This works in some cases, but the strsplit step seems fragile, because splitting on a hard-coded run of spaces depends on the exact amount of whitespace between the two values.

If it's OK, I would like to ask more generally:

I would go with:

filter <- function(result) result[sapply(result, nchar) > 0]
d %>% gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>% 
  trimws %>% strsplit(split = " ") %>% lapply(FUN = filter)

But is it really necessary to define this custom filter function?


What is the most declarative approach to yield the desired output?

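In the OP's code, the split pattern in strsplit can be changed to \\s+, which matches one or more whitespace characters, so no empty strings are produced and no filter function is needed: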

d %>% 
   gsub(pattern = ".*€|VB|\n", replacement = "", fixed = FALSE) %>% 
   trimws %>% 
   strsplit(split = "\\s+")
#[1] "99999"    "K9mm999u"

#[1] "89999"  "Di9ß99"

#[1] "89999"   "Aich9ch"

#[1] "98999"     "Ni9999rg9"

#[1] "99999"     "F9lk99s99"
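To illustrate why the single-space split needed the extra filter step, here is a minimal sketch on a made-up string x (not taken from d) that looks like one element after gsub and trimws. The variable names are only for illustration.

```r
# Hypothetical element: two values separated by a run of five spaces
x <- "99999     K9mm999u"

# Splitting on a single space yields an empty string between each pair of
# consecutive spaces, which is what the custom filter had to remove
by_space <- strsplit(x, split = " ")[[1]]

# Splitting on one or more whitespace characters consumes the whole run at once
by_ws <- strsplit(x, split = "\\s+")[[1]]

print(by_space)
print(by_ws)
```

Because \\s also matches tabs and newlines, the same pattern should keep working if the amount or kind of whitespace between the values changes.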

Here is another option with read.table from base R: read 'd' into a data.frame, extract the first column, subset its elements with a recycled logical vector (dropping the first of every three values, i.e. the price), and split the result into a list with asplit:

asplit(matrix(read.table(text = d, header = FALSE, fill = TRUE, 
  stringsAsFactors = FALSE)$V1[c(FALSE, TRUE, TRUE)], ncol = 2, byrow = TRUE), 1)
#[1] "99999"    "K9mm999u"

#[1] "89999"  "Di9ß99"

#[1] "89999"   "Aich9ch"

#[1] "98999"     "Ni9999rg9"

#[1] "99999"     "F9lk99s99"
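The recycling step may be easier to follow in isolation. Below is a small sketch on a hand-written vector v that stands in for the first column returned by read.table (the values are assumptions for illustration, not the actual parsed column). Note that asplit requires R 3.6.0 or later.

```r
# Hypothetical first column: price, postal code, name, repeated per record
v <- c("98.000", "99999", "K9mm999u",
       "9",      "89999", "Di9ss99")

# The logical index is recycled along v, so every first-of-three value
# (the price) is dropped and the other two are kept
kept <- v[c(FALSE, TRUE, TRUE)]

# Arrange the kept values as one row per record, then split by row
m <- matrix(kept, ncol = 2, byrow = TRUE)
pairs <- asplit(m, 1)

print(pairs)
```

This only works as long as every element of the input contributes exactly three non-blank lines; a record with a missing line would shift the pattern for everything after it.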

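Or with regmatches/gregexpr from base R: match either 5-digit numbers ([0-9]{5}) or (|) alphanumeric strings of 5 or more characters ([[:alnum:]]{5,}), and extract the matches into a list. The \\b denotes a word boundary. Note that this relies on the wanted values being at least 5 characters long and on the price parts being shorter than that.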

regmatches(d, gregexpr("\\b[0-9]{5}\\b|\\b[[:alnum:]]{5,}\\b", d))
#[1] "99999"    "K9mm999u"

#[1] "89999"  "Di9ß99"

#[1] "89999"   "Aich9ch"

#[1] "98999"     "Ni9999rg9"

#[1] "99999"     "F9lk99s99"
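As a side note, since [[:alnum:]] also covers digits, the first alternative appears to be subsumed by the second. A quick sketch on a made-up ASCII-only string x (with "EUR" standing in for the euro sign, so it is not an element of d) suggests that both patterns extract the same values:

```r
# Hypothetical input shaped like one element of d, ASCII only
x <- "\n      98.000 EUR VB\n          \n      99999\n          K9mm999u"

pat_full  <- "\\b[0-9]{5}\\b|\\b[[:alnum:]]{5,}\\b"  # pattern from the answer
pat_short <- "\\b[[:alnum:]]{5,}\\b"                 # second alternative only

res_full  <- regmatches(x, gregexpr(pat_full, x))[[1]]
res_short <- regmatches(x, gregexpr(pat_short, x))[[1]]

print(res_full)
print(identical(res_full, res_short))
```

Keeping the explicit [0-9]{5} alternative does no harm and documents the intent, so which form to use is mostly a matter of taste.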
