Excluding false matches from an unstructured date search - R

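Question

I have some highly unstructured date data that contains many errors. My current regex capture syntax does a great job of picking up all the dates, but it also picks up numbers that are not dates. Those numbers are usually followed by symbols or words (for example `psi`), which should help predict whether a number is a measurement or a date.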

uglydates = c(
  "05-01-2018 Worked on PP&E valve. Specimens are unusually active.",
  "55.2 psi containment pressure nominal.",
  "August 11, 2018 Personal Journal, I thought I would like being alone. I was wrong.",
  "34.1 PSI reported on containment unit 34. Loss of pressure, cause unknown.",
  "10 3/4 casing seems to have ruptured. Exterior has numerous punctures",
  "perhaps caused by a wild animal.",
  "1.06.19 Hearing chittering noises in the woods.",
  "Thursday, February 2, 2019 Returned to Bunker, Mr. Higglies is missing.",
  "Fri, February 3, 2019 through Sunday, February 5, 2019 Searched for Mr. Higglies",
  "Thursday, Feb 9, 19 What remained of Mr. Higglies found me...",
  "Bleeding profusely, returning to the silo.",
  "Friday, 2 27 19 - Have not been able to stop bleeding. Don't feel like eating.",
  "Leaving bunker in search of help.",
  "3 27 Can't walk any longer. Going to lie here for just a few minutes.")

library(dplyr)
library(stringr)

# Function for adding parentheses around text
par <- function(x) paste0("(",x,")")

months <- month.name  %>% paste(collapse= "|") %>% par
monab  <- month.abb  %>% paste(collapse= "|") %>% par
days    <- (Sys.Date() + (0:6)) %>% format("%A") %>% paste(collapse= "|") %>% par
dayab   <- (Sys.Date() + (0:6)) %>% format("%a") %>% paste(collapse= "|") %>% par
num <- "([1-9]|[0-3][0-9]|201[6-9])" # 1-9, 00-39, 2016-2019

daydate <- paste(days, dayab, months, monab, num, sep= "|") %>% par

sep <-"[/\\-\\s/\\.,]*" # separators

end <- "[\\s:\\-\\.\n$]" # Define possible end values

datematch  <- paste0("^(?i)(",daydate,sep,"){1,5}(",end,")")
#"^(?i)(((Wednesday|Thursday|Friday|Saturday|Sunday|Monday|Tuesday)|(Wed|Thu|Fri|Sat|Sun|Mon|Tue)|(January|February|March|April|May|June|July|August|September|October|November|December)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|([1-9]|[0-3][0-9]|201[6-9]))[/\\-\\s/\\.,]*){1,5}([\\s:\\-\\.\n$])"

uglydates %>% str_extract(datematch)
# [1] "05-01-2018 "                 "55.2 "                       "August 11, 2018 "           
# [4] "34.1 "                       "10 3/4 "                     NA                           
# [7] "1.06.19 "                    "Thursday, February 2, 2019 " "Fri, February 3, 2019 "     
# [10] "Thursday, Feb 9, 19 "        NA                            "Friday, 2 27 19 - "         
# [13] NA                            "3 27 "   

I tried the negative lookahead `(?!...)` syntax, but it does not negate everything I need it to (the whole captured string).

exclude = "(PSI|casing)"
datematch  <- paste0("^(?i)((",daydate,sep,"){1,5}(",end,"))(?!", exclude,")")
# "^(?i)((((Wednesday|Thursday|Friday|Saturday|Sunday|Monday|Tuesday)|(Wed|Thu|Fri|Sat|Sun|Mon|Tue)|(January|February|March|April|May|June|July|August|September|October|November|December)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|([1-9]|[0-3][0-9]|201[6-9]))[/\\-\\s/\\.,]*){1,5}([\\s:\\-\\.\n$]))(?!(PSI|casing))"

uglydates %>% str_extract(datematch)
# [1] "05-01-2018 "                 "55."                         "August 11, 2018 "           
# [4] "34."                         "10 "                         NA                           
# [7] "1.06.19 "                    "Thursday, February 2, 2019 " "Fri, February 3, 2019 "     
# [10] "Thursday, Feb 9, 19 "        NA                            "Friday, 2 27 19 - "         
# [13] NA                            "3 27 "                  

Tags: r, regex, regex-negation

Solution


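Your current negative lookahead only rejects the last optional group of the match: when the lookahead fails, the regex engine backtracks and gives up part of the match until the lookahead no longer sees the excluded text. This toy example shows the effect (see also the question "Regex with optional part and negative lookahead"):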

str_extract("0-0-0 psi", "((0[-]?)+)(?!\\spsi)")
#> [1] "0-0-"

Created on 2019-06-13 by the reprex package (v0.3.0)
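To make the backtracking visible, here is a minimal sketch built on the same toy string. The variable names (`x`, `greedy`, `with_la`, `rest`) are only illustrative. It compares the match with and without the lookahead, then checks what the lookahead sees right after the shortened match.

```r
library(stringr)

x <- "0-0-0 psi"

# Without the lookahead, the greedy group takes the whole number-like prefix.
greedy <- str_extract(x, "(0-?)+")

# With the lookahead, the engine drops the last repetition of the group
# so that the text following the match no longer starts with " psi".
with_la <- str_extract(x, "(0-?)+(?!\\spsi)")

# Text that follows the shortened match (x starts with the match,
# so the match length can be used as an offset).
rest <- str_sub(x, str_length(with_la) + 1)

greedy                        # "0-0-0"
with_la                       # "0-0-"
rest                          # "0 psi"
str_detect(rest, "^\\spsi")   # FALSE, so the negative lookahead succeeds here
```

The lookahead is satisfied at the shorter position, which is why a partial match comes back instead of `NA`.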

A simple solution is to replace `exclude` with:

exclude <- "(.*(PSI|casing))" 

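With the leading `.*`, the lookahead fails whenever `PSI` or `casing` appears anywhere later in the string, so backtracking to a shorter match no longer helps and the whole capture is rejected: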

uglydates = c(
    "05-01-2018 Worked on PP&E valve. Specimens are unusually active.",
    "55.2 psi containment pressure nominal.",
    "August 11, 2018 Personal Journal, I thought I would like being alone. I was wrong.",
    "34.1 PSI reported on containment unit 34. Loss of pressure, cause unknown.",
    "10 3/4 casing seems to have ruptured. Exterior has numerous punctures",
    "perhaps caused by a wild animal.",
    "1.06.19 Hearing chittering noises in the woods.",
    "Thursday, February 2, 2019 Returned to Bunker, Mr. Higglies is missing.",
    "Fri, February 3, 2019 through Sunday, February 5, 2019 Searched for Mr. Higglies",
    "Thursday, Feb 9, 19 What remained of Mr. Higglies found me...",
    "Bleeding profusely, returning to the silo.",
    "Friday, 2 27 19 - Have not been able to stop bleeding. Don't feel like eating.",
    "Leaving bunker in search of help.",
    "3 27 Can't walk any longer. Going to lie here for just a few minutes.")

library(dplyr)
library(stringr)

# Function for adding parentheses around text
par <- function(x) paste0("(",x,")")

months <- month.name  %>% paste(collapse= "|") %>% par
monab  <- month.abb  %>% paste(collapse= "|") %>% par
days    <- (Sys.Date() + (0:6)) %>% format("%A") %>% paste(collapse= "|") %>% par
dayab   <- (Sys.Date() + (0:6)) %>% format("%a") %>% paste(collapse= "|") %>% par
num <- "([1-9]|[0-3][0-9]|201[6-9])" # 1-9, 00-39, 2016-2019

daydate <- paste(days, dayab, months, monab, num, sep= "|") %>% par

sep <-"[/\\-\\s/\\.,]*" # separators

end <- "[\\s:\\-\\.\n$]" # Define possible end values

exclude <- "(.*(PSI|casing))"
datematch  <- paste0("^(?i)((",daydate,sep,"){1,5}(",end,"))(?!", exclude,")")
# "^(?i)((((Wednesday|Thursday|Friday|Saturday|Sunday|Monday|Tuesday)|(Wed|Thu|Fri|Sat|Sun|Mon|Tue)|(January|February|March|April|May|June|July|August|September|October|November|December)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|([1-9]|[0-3][0-9]|201[6-9]))[/\\-\\s/\\.,]*){1,5}([\\s:\\-\\.\n$]))(?!(.*(PSI|casing)))"

uglydates %>% str_extract(datematch)
#>  [1] "05-01-2018 "                 NA                           
#>  [3] "August 11, 2018 "            NA                           
#>  [5] NA                            NA                           
#>  [7] "1.06.19 "                    "Thursday, February 2, 2019 "
#>  [9] "Fri, February 3, 2019 "      "Thursday, Feb 9, 19 "       
#> [11] NA                            "Friday, 2 27 19 - "         
#> [13] NA                            "3 27 "

Created on 2019-06-13 by the reprex package (v0.3.0)
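One side effect of the `.*` form is that it also rejects a line whose date is real but whose text mentions an excluded word further along. The sketch below shows this on a made-up vector (`toy` is hypothetical, not part of the original data) and compares it with an atomic group `(?>...)`. The atomic group is a different technique from the one used in this answer: it stops the engine from backtracking into the date part, so the lookahead is only checked directly after the full match. It assumes the ICU regex engine behind stringr, which supports atomic groups.

```r
library(stringr)

# Hypothetical toy lines: a measurement, a plain entry, and an entry
# that mentions the excluded word later in the sentence.
toy <- c("0-0-0 psi", "0-0-0 note", "0-0-0 checked psi gauge")

# `.*` inside the lookahead: any later "psi" rejects the whole line.
wide <- str_extract(toy, "^((0-?)+)(?!.*psi)")

# Atomic group: no backtracking into the matched prefix, and the
# lookahead only inspects the text immediately after it.
atomic <- str_extract(toy, "^(?>(0-?)+)(?!\\spsi)")

wide     # NA      "0-0-0" NA
atomic   # NA      "0-0-0" "0-0-0"
```

Which behavior is preferable depends on the data: the `.*` form is the simpler change to the existing pattern, while the atomic group keeps entries that only mention the excluded word in passing.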
