How to remove non-UTF-8 characters from text

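Question

I need help removing non-UTF-8 characters from my word cloud. This is my code so far. I have tried gsub and removeWords, but the characters still show up in my word cloud, and I don't know how to get rid of them. Any help would be appreciated. Thank you for your time.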


txt <- readLines("11-0.txt")
corpus = VCorpus(VectorSource(txt))
gsub("’","‘","",txt)

corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","â€","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))

Edit: here is my iconv version

txt <- readLines("11-0.txt")
Encoding(txt) <- "latin1"
iconv(txt, "latin1", "ASCII", sub="")

corpus = VCorpus(VectorSource(txt))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
title(main="Alice in Wonderland word cloud",font.main=1,cex.main =1.5)

Tags: r, special-characters, word-cloud

Answer

The signature of gsub is:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

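I'm not sure what you were trying to do with

gsub("’","‘","",txt)

but that line probably doesn't do what you intended: the third positional argument is x, so the empty string is what gets searched, and txt ends up in the ignore.case slot.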
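For illustration, here is a minimal sketch of a gsub call with the arguments in the positions the signature expects. The sample text and the variable name `cleaned` are made up for this example, and the character class assumes the unwanted symbols really are the two curly quotes.

```r
# Hypothetical sample text containing curly quotes (U+2018 and U+2019)
txt <- c("\u2018Curiouser and curiouser!\u2019 cried Alice", "plain line")

# pattern first, replacement second, the text itself third;
# gsub() returns a new vector, so the result has to be stored
cleaned <- gsub("[\u2018\u2019]", "", txt)

print(cleaned)
```

Both characters go into a single pattern because gsub accepts only one pattern and one replacement per call.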

There is also a previous SO question on gsub and non-ASCII symbols.

Edit:

Suggested solution using iconv

Remove all non-ASCII characters:

txt <- "’xxx‘"

iconv(txt, "latin1", "ASCII", sub="")

Returns:

[1] "xxx"    
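One thing worth noting about the iconv version in the question: the result of iconv is never stored, so the corpus is still built from the unconverted text. A minimal sketch of the fix, with a made-up vector standing in for the lines read from the file:

```r
# Hypothetical stand-in for the lines returned by readLines()
txt <- c("\u2019xxx\u2018", "Alice was beginning to get very tired")

# iconv() does not modify txt in place, so assign the result back
txt <- iconv(txt, "latin1", "ASCII", sub = "")

print(txt)
```

With the cleaned vector assigned back to txt, the rest of the pipeline (VCorpus(VectorSource(txt)) and the tm_map steps) should run on the ASCII-only text.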
