r - Replacing characters with \n using a regular expression, then converting the text to a quanteda corpus
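Question
I have some text that I ran through OCR. The OCR inserted many newlines (\n) where there should not be any, and it also missed many newlines that should be there.
I want to remove the existing newlines and replace them with spaces, then replace a specific marker in the original text with newlines, and finally convert the document to a corpus in quanteda.
I can create a basic corpus, but the problem is that I cannot split it into paragraphs. If I use
corpus_reshape(corps, to = "paragraphs", use_docvars = TRUE) it does not break up the document.
If I use corpus_segment(corps, pattern = "\n")
I get an error.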
rm(list=ls(all=TRUE))
library(quanteda)
library(readtext)
# Here is a sample Text
sample <- "Hello my name is Christ-
ina. 50 Sometimes we get some we-
irdness
Hello my name is Michael,
sometimes we get some weird,
and odd, results-- 50 I want to replace the
50s
"
# Removing the existing breaks
sample <- gsub("\n", " ", sample)
sample <- gsub(" {2,}", " ", sample)
# Adding new breaks
sample <- gsub("50", "\n", sample)
# I can create a corpus
corps <- corpus(sample, compress = FALSE)
summary(corps, 1)
# But I can't change to paragraphs
corp_para <- corpus_reshape(corps, to = "paragraphs", use_docvars = TRUE)
summary(corp_para, 1)
corp_segmented <- corpus_segment(corps, pattern = "\n")
# The \n characters are in both documents....
as.character(corp_para) # in older quanteda versions: corp_para$documents$texts
sample
Answer
I recommend using regular expression replacement to clean your text before making it into a corpus. The trick with your text is to figure out where you want to remove newlines and where you want to keep them. I'm guessing from your question that you want to replace the occurrences of "50" with newlines, but probably also join the words that were split by a hyphen and a newline. You probably also want to keep two newlines between texts?
Many users prefer the simpler interface of the stringr package, but I've always tended to use stringi (on which stringr is built) instead. It allows for vectorized replacement, so you can feed it a vector of patterns to match, and the replacements, in one function call.
library("stringi")
sample2 <- stri_replace_all_regex(
  sample,
  c("\\-\\n+", "\\n+", "50"),
  c("", "\n", "\n"),
  vectorize_all = FALSE
)
cat(sample2)
## Hello my name is Christina.
## Sometimes we get some weirdness
## Hello my name is Michael,
## sometimes we get some weird,
## and odd, results--
## I want to replace the
##
## s
Here, you match "\\n" as a regular expression pattern but use just "\n" as the (literal) replacement.
There are two newlines before the last "s" in the replaced text because a) there was already one just before the "50" in "50s" and b) we added one by replacing 50 with a new "\n".
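To make the role of vectorize_all more concrete, here is a minimal sketch on a made-up string (the names x, patterns, replacements, separate and chained are illustrative only, not part of the original answer). With the default vectorize_all = TRUE, each pattern is applied to its own copy of the input; with vectorize_all = FALSE, the patterns are applied one after another to the same string, which is what the cleaning step above relies on.

```r
library("stringi")

# Hypothetical toy input: a hyphenated line break and a "50" marker
x <- "we-\nird 50 text"
patterns <- c("-\\n", "50")
replacements <- c("", "\n")

# Default (vectorize_all = TRUE): one result per pattern/replacement pair,
# each computed from the untouched input
separate <- stri_replace_all_regex(x, patterns, replacements)

# vectorize_all = FALSE: all replacements applied in order to the same string
chained <- stri_replace_all_regex(x, patterns, replacements,
                                  vectorize_all = FALSE)

length(separate)
chained
```

Because the replacements are chained, their order matters: the hyphen-plus-newline pattern has to come before the general newline pattern, otherwise the hyphenated words would no longer be matched.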
Now you can construct a corpus with quanteda::corpus(sample2).
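If the goal is one document per line rather than a single document, one possible approach (an assumption on my part, not something the steps above require) is to split the cleaned string on "\n" in base R and pass the resulting character vector to corpus(), which creates one document per element. The sketch below hard-codes the cleaned text so that it runs on its own; pieces and corp are illustrative names, and the corpus step is skipped if quanteda is not installed.

```r
# Cleaned text as produced by the replacement step above (hard-coded here)
sample2 <- paste0(
  "Hello my name is Christina. \n Sometimes we get some weirdness\n",
  "Hello my name is Michael,\nsometimes we get some weird,\n",
  "and odd, results-- \n I want to replace the\n\ns\n"
)

# Split on literal newlines, trim stray spaces, drop empty pieces
pieces <- trimws(strsplit(sample2, "\n", fixed = TRUE)[[1]])
pieces <- pieces[nzchar(pieces)]

# Each element of a character vector becomes one document in the corpus
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp <- quanteda::corpus(pieces)
  print(quanteda::ndoc(corp))
}
```

Splitting before building the corpus avoids depending on how corpus_reshape() or corpus_segment() interpret newlines, at the cost of losing the link back to a single parent document.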