Replacing characters with \n in a regular expression, then converting the text to a quanteda corpus

Question

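I have some texts that I have run through OCR. The OCR inserted many newlines (\n) where they do not belong, and it also missed many newlines that should be there.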

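I want to remove the existing newlines and replace them with spaces, then replace a specific character sequence in the original text with newlines, and finally convert the document to a corpus in quanteda.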

I can create a basic corpus, but the problem is that I cannot split it into paragraphs. If I use
corpus_reshape(corps, to = "paragraphs", use_docvars = TRUE), it does not break up the document.

If I use corpus_segment(corps, pattern = "\n"), I get an error.

rm(list=ls(all=TRUE))
library(quanteda)
library(readtext)

# Here is a sample Text
sample <- "Hello my name is Christ-
ina. 50 Sometimes we get some we-


irdness

Hello my name is Michael, 
sometimes we get some weird,


 and odd, results-- 50 I want to replace the 
 50s
"



# Removing the existing breaks
sample <- gsub("\n", " ", sample)
sample <- gsub(" {2,}", " ", sample)
# Adding new breaks
sample <- gsub("50", "\n", sample)

# I can create a corpus
corps <- corpus(sample, compress = FALSE)
summary(corps, 1)

# But I can't change to paragraphs
corp_para <- corpus_reshape(corps, to ="paragraphs", use_docvars = TRUE)
summary(corp_para, 1)

corp_segmented <-  corpus_segment(corps, pattern = "\n")

# The \n characters are in both documents.... 
corp_para$documents$texts
sample

Tags: r, regex, gsub, quanteda

Answer


I recommend using regular expression replacement to clean your text before making it into a corpus. The trick with your text is to figure out where you want to remove newlines and where you want to keep them. I'm guessing from your question that you want to replace the occurrences of "50" with newlines, but also probably join the words split by a hyphen and a newline. You probably also want to keep two newlines between texts?

Many users prefer the simpler interface of the stringr package, but I've always tended to use stringi (on which stringr is built) instead. It allows for vectorized replacement, so you can feed it a vector of patterns to match, and the replacements, in one function call.

library("stringi")

sample2 <- stri_replace_all_regex(sample, c("\\-\\n+", "\\n+", "50"), c("", "\n", "\n"),
  vectorize_all = FALSE
)
cat(sample2)
## Hello my name is Christina. 
##  Sometimes we get some weirdness
## Hello my name is Michael, 
## sometimes we get some weird,
##  and odd, results-- 
##  I want to replace the 
##  
## s

Here, you match "\\n" as a regular expression pattern but use just "\n" as the (literal) replacement.
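To make the difference between the two spellings concrete, here is a small base-R sketch. It does not use stringi, and the variable names are only illustrative; it shows that "\\n" is a two-character string that the regex engine reads as "newline", while "\n" is the newline character itself.

```r
# "\\n" is two characters (a backslash and an "n") that a regex engine
# interprets as "match a newline"; "\n" is a single newline character.
regex_newline   <- "\\n"
literal_newline <- "\n"

print(nchar(regex_newline))    # number of characters in the regex spelling
print(nchar(literal_newline))  # number of characters in the literal newline

text <- "line one\nline two"

# Match the newline through the regex spelling ...
via_regex <- gsub(regex_newline, " ", text)
# ... and through the literal character, with regex interpretation turned off.
via_literal <- gsub(literal_newline, " ", text, fixed = TRUE)

print(identical(via_regex, via_literal))
```

Both calls find the same newline; only the way the pattern is written differs.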

There are two newlines before the last "s" in the replaced text because a) there was already one at the end of the line before " 50s" and b) we added one by replacing 50 with a new "\n".
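If it helps to see where that second newline comes from, the sequential replacement can be approximated in base R. This is only a sketch: replace_in_order() is a hypothetical helper written for this illustration (it is not part of stringi or quanteda), and I am assuming that a loop over gsub() behaves like vectorize_all = FALSE for these simple patterns.

```r
# Hypothetical helper: apply each pattern/replacement pair in order,
# feeding the result of one step into the next.
replace_in_order <- function(x, patterns, replacements) {
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x, perl = TRUE)
  }
  x
}

# Only the tail of the sample text, to keep the example short.
tail_of_sample <- "replace the \n 50s\n"

cleaned <- replace_in_order(
  tail_of_sample,
  patterns     = c("-\\n+", "\\n+", "50"),
  replacements = c("", "\n", "\n")
)

# The newline that was already in the text survives the second step,
# and the third step adds another one where "50" used to be.
print(cleaned)
```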

Now you can construct a corpus with quanteda::corpus(sample2).
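If what you ultimately want is one document per line, one possible approach (a sketch under my own assumptions, not the only way to do it) is to split the cleaned string yourself and hand the resulting character vector to corpus(). The string below is the cleaned text from above written out literally so that the example is self-contained, and the quanteda call is guarded so that the base-R part runs even where quanteda is not installed.

```r
# The cleaned text from the replacement step, written out literally.
sample2 <- paste0(
  "Hello my name is Christina. \n Sometimes we get some weirdness\n",
  "Hello my name is Michael, \nsometimes we get some weird,\n",
  " and odd, results-- \n I want to replace the \n \ns\n"
)

# Split on the literal newline, trim surrounding whitespace,
# and drop pieces that are empty after trimming.
paragraphs <- trimws(strsplit(sample2, "\n", fixed = TRUE)[[1]])
paragraphs <- paragraphs[nzchar(paragraphs)]
print(paragraphs)

# Assumption: each remaining piece should become its own document.
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp_lines <- quanteda::corpus(paragraphs)
  print(quanteda::ndoc(corp_lines))
}
```

Whether a single newline is the right unit to split on depends on how your real OCR output is laid out, so treat the delimiter as something to adjust.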

