How to extract text using delimiters when some delimiters are missing

Problem description

I am trying to extract text from a semi-structured text document based on the headings it contains.

Input

Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"

The desired output here is

Order     Subject Name           Grade  Report           Conclusion
1223442   History Bilbo Johnson   Bad   Need to complete  Dud

I can achieve this with the following (messy but working) function:

library(magrittr)  # provides the %>% pipe used inside Extractor
dataframeIn<-data.frame(Column,stringsAsFactors=FALSE)
delim<-c("Order","Subject","Name","Grade","Report","Conclusion")


Extractor <- function(dataframeIn, Column, delim) {
  dataframeInForLater<-dataframeIn
  ColumnForLater<-Column
  Column <- rlang::sym(Column)
  dataframeIn <- data.frame(dataframeIn)
  dataframeIn<-dataframeIn %>%
    tidyr::separate(!!Column, into = c("added_name",delim),
                                          sep = paste(delim, collapse = "|"),
                    extra = "drop", fill = "right")
  names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)

  dataframeIn<-data.frame(dataframeIn)
  #Add the original column back in so have the original reference
  dataframeIn<-cbind(dataframeInForLater[,ColumnForLater],dataframeIn)
  dataframeIn<-data.frame(dataframeIn)
  return(dataframeIn)
}

Extractor(dataframeIn, "Column", delim)

However, sometimes a delimiter is missing, for example

Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud

In that case, the desired output is

Order     Subject Name           Grade  Conclusion
1223442   History Bilbo Johnson   Bad    Dud

But the actual output becomes:

 Order   Subject            Name   Grade Report Conclusion
:1223442  :History   Bilbo Johnson  : Bad    : Dud       <NA>

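How can I account for missing delimiters, given that the ones that are present always appear in the same order (this includes delimiters missing from the middle of the text as well as from the end, as in the example above)?

Tags: r

Solution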


We can do the following (this is only the text extraction; I leave building the output to you):

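library(stringr)

# The two inputs from the question; delim is the delimiter vector defined there
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"

Extractor <- function(x, delim) {
  # One pattern per delimiter: the delimiter, an optional colon, then a lazy
  # capture that stops at the next delimiter or at the end of the string
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  # Column 2 of the match matrix holds the first capture group
  trimws(str_match(x, pattern)[, 2])
}
Extractor(Column1, delim)
# [1] "1223442"          "History"          "Bilbo Johnson"    "Bad"              "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442"       "History"       "Bilbo Johnson" "Bad"           NA              "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA              "History"       "Bilbo Johnson" NA              NA              NA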

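Because missing delimiters come back as NA, it is clear which delimiters are present and which are not.

The way it works in your case is that we have a vector of patterns, one per delimiter: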
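The answer stops at text extraction, so here is a minimal sketch of one way to turn the extracted vector into the one-row table the question asks for. `ExtractToRow` and its `drop_missing` argument are hypothetical names introduced only for this illustration; the `Extractor` function is repeated from the answer so the block runs on its own.

```r
library(stringr)

# Same extractor as in the answer, repeated so this block is self-contained.
Extractor <- function(x, delim) {
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  trimws(str_match(x, pattern)[, 2])
}

# Hypothetical helper (not part of the original answer): build a one-row
# data frame from one input string, optionally dropping absent delimiters.
ExtractToRow <- function(x, delim, drop_missing = TRUE) {
  # Name each extracted value after the delimiter it belongs to
  values <- setNames(Extractor(x, delim), delim)
  # Absent delimiters are NA; drop them if the caller wants only found columns
  if (drop_missing) values <- values[!is.na(values)]
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
}

delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"

row2 <- ExtractToRow(Column2, delim)
print(row2)
```

With `drop_missing = FALSE` every delimiter keeps its column and the absent ones hold NA, which is probably the more convenient shape if rows from several documents are to be stacked with `rbind`.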

pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"     
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"   
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"      
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"     
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"    
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"

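Then str_match nicely extracts the (.*?) part into the second column of its output, and we strip any surrounding whitespace with trimws. Note also that we use lazy matching in (.*?) so that it does not match too much.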
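To make the role of the lazy quantifier concrete, the sketch below runs one pattern in a lazy and a greedy variant on a shortened input. The string `x` and the reduced stop list `stops` are made up for this illustration; only `str_match` and `trimws` come from the answer.

```r
library(stringr)

# Shortened, illustrative input and stop list (assumptions, not from the answer)
x <- "Order:1223442 Subject:History Name Bilbo Johnson"
stops <- "(Subject|Name|$)"

# Lazy (.*?) captures as little as possible, so it stops at the first
# delimiter that follows "Order:".
lazy <- str_match(x, paste0("Order:{0,1}(.*?)", stops))

# Greedy (.*) captures as much as possible; because "$" is one of the
# alternatives, it runs all the way to the end of the string.
greedy <- str_match(x, paste0("Order:{0,1}(.*)", stops))

# str_match returns a matrix: column 1 is the whole match and each
# following column is one capture group, so column 2 is the value we want.
trimws(lazy[, 2])
# [1] "1223442"
trimws(greedy[, 2])
# [1] "1223442 Subject:History Name Bilbo Johnson"
```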

