r - R - NLP - Text cleaning
Problem description
I'm new to text mining and am currently stuck on this pattern:
pattern = c(
"<f0><U+009F><U+0098><U+00AD>",
"<f0><U+009F><U+0099><U+008F>",
"<f0><U+009F><U+008F><U+00BF> ",
"<f0><U+009F><U+0098><U+0082>",
" <f0><U+009F><U+00A4><U+00B7>",
" <f0><U+009F><U+008F><U+00BD><U+200D><U+2640><U+FE0F>\r\nBody",
" <f0><U+009F><U+00A4><U+00A3>",
" <f0><U+009F><U+0099><U+0084> ",
" <f0><U+009F><U+0099><U+0084>",
" <f0><U+009F><U+0099><U+0083>",
"<f0><U+009F><U+0098><U+00B4>",
"Hello")
I only want to keep pattern = "Hello" and exclude all the other text.
I tried the following, but it failed right away:
gsub(c, "<f0><U+00F><U+[0-9]><U+[a-zA-Z0-9]>*, replacement = "")
So I tried to break it down:
a = gsub(c, pattern = "<f0>", replacement = "")
-> Result: <f0>
is dropped, so that's a good sign, but when I run the next step
gsub(a, pattern = "<U+009F>", replacement = "")
-> Result: <U+009F>
is still there. Do you have any ideas? I'd appreciate any suggestions. Thanks in advance!
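Solution
Here are two ways to clean the text. The question gives no criterion that would allow "Body" to be removed, so both approaches keep it.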
x <- pattern # to avoid ambiguity in function parameters
# by finding words of at least two letters (so not 'a' or 'I' either)
words <- unlist(regmatches(x, gregexpr("\\b[[:alpha:]]{2,}\\b", x, perl=TRUE)))
words
#[1] "Body" "Hello"
# by removing unwanted characters and character sequences
cleaned <- gsub("(<[^>]{0,}>|\\r|\\n)", "", x, perl=TRUE)
# and removing leading and trailing spaces
cleaned <- gsub("^ {1,}| {1,}$", "", cleaned, perl=TRUE)
cleaned[cleaned != ""]
#[1] "Body" "Hello"
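As for why the step-by-step attempt in the question stalled on "<U+009F>": in a regular expression `+` is a quantifier, so `U+` means "one or more U" and the literal plus sign is never matched. The following is a minimal sketch of that, not the only way to do it. It assumes the strings literally contain the characters `<U+009F>` rather than the raw bytes that notation stands for. The three-element vector `x` and the whitelist `keep` are hypothetical names introduced only for illustration.

```r
# Assumption: the strings contain the literal text "<U+009F>",
# not the raw bytes that this notation stands for.
x <- c("<f0><U+009F><U+0098><U+00AD>", " <f0><U+009F><U+0099><U+0084> ", "Hello")

# As a regex, "U+" means "one or more U", so the literal "+" never matches
# and the input comes back unchanged.
unchanged <- gsub("<U+009F>", "", x)
identical(unchanged, x)
#[1] TRUE

# fixed = TRUE treats the pattern as a plain string ...
step_fixed <- gsub("<U+009F>", "", x, fixed = TRUE)

# ... and escaping the plus sign achieves the same with a regex.
step_escaped <- gsub("<U\\+009F>", "", x)
identical(step_fixed, step_escaped)
#[1] TRUE

# One regex for every "<f0>" and "<U+XXXX>" token, then trim whitespace.
cleaned <- trimws(gsub("<f0>|<U\\+[0-9A-Fa-f]{4}>", "", x))

# Hypothetical whitelist: keep only the strings that are actually wanted.
keep <- c("Hello")
cleaned[cleaned %in% keep]
#[1] "Hello"
```

If the wanted words are known in advance, a whitelist like `keep` is one way to supply the missing criterion for dropping "Body". Otherwise a rule for what counts as unwanted text still has to be chosen.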