Encoding problem when reading an R object

Question

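I am reading an R object with readRDS. It should have two columns, a year and a string. For most rows the string is fine, but some show a strange white blob, some appear to contain a character vector with escaped special characters, and some contain special characters such as â.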

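I think this is an encoding problem in the original data (which is not mine), but I am not sure what the blob is or what causes the character vector and the escaping. I realize the cause is probably the original data, but I would like to understand what I am seeing so that I can investigate.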

I am using macOS 10.14.6.

Any ideas are welcome.

The original data is here. I extracted some rows with strange characters using the following.

library(dplyr)    # %>%, select(), filter()
library(stringr)  # str_detect()

data <- readRDS("all_speech.rds") %>%
    select(year, speech) %>%
    filter(str_detect(speech, "â"))

str(hansardOrig)
'data.frame':   2286324 obs. of  2 variables:
 $ year  : num  1979 1979 1979 1979 1979 ...
 $ speech: chr  "Mr. Speaker ...

Added

sample <- data %>% mutate(speech = substr(speech, 1, 200)) 
dput(head(sample))
structure(list(year = c(1982, 1982, 1982, 1984, 1986, 1986), 
    speech = c("With this it will be convenient to take amendment No. 112, in title, line 10, leave out 'section 163 1) of’.\n", 
    "I am not so much surprised as astonished by the amendment. It would create tremendous problems. Police officers have a vital role in visiting places of entertainment—without a warrant—particularly in ", 
    "I note the hon. Gentleman's desire to retire there.\nMy right hon. Friend mentioned that we are setting up a pilot scheme with three experimental homes. They will be in adapted, domestic-style, buildin", 
    "The British forces in the Lebanon had their headquarters at Haddâsse. From that position they would have been totally unable to help British nationals in west Beirut. They are better able to help, thr", 
    "We know that soon more cars will be manufactured in the United Kingdom, as the hon. Member for Edinburgh, Central Mr. Fletcher) wishes.\nhirdly, the decision will have a domino effect—that American phr", 
    "I beg to move,\nThat leave be given to bring in a Bill to make illegal the display of pictures of naked or partially naked women in sexually provocative poses in newspapers.\nThis is a simple but import"
    )), row.names = c(NA, 6L), class = "data.frame")

[Screenshot: the data as displayed in the RStudio viewer]

Tags: r, character-encoding

Answer


You have a hard problem ahead of you. The sample you showed has inconsistent encodings, so it will be difficult to repair.

The first entry of sample$speech displays like this on my Mac:

> sample$speech[1]
[1] "With this it will be convenient to take amendment No. 112, in title,
line 10, leave out 'section 163 1) of’.\n"

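This looks mostly fine, but the ’ characters near the end look like the UTF-8 encoding of a directional quote "’" interpreted in the WINDOWS-1252 encoding. I can fix that with this code: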

> iconv(sample$speech[1], from="utf-8", to="WINDOWS-1252")
[1] "With this it will be convenient to take amendment No. 112, in title,
line 10, leave out 'section 163 1) of’.\n"
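
The link between a directional quote and ’ can be reproduced from scratch. The following is a minimal sketch, assuming the platform's iconv() recognises the name "WINDOWS-1252"; the variable names are illustrative only.

```r
# U+2019 RIGHT SINGLE QUOTATION MARK, written as an escape so the script is ASCII-only
right_quote <- "\u2019"

# Its UTF-8 encoding is the three bytes e2 80 99
bytes <- charToRaw(enc2utf8(right_quote))
print(bytes)

# Misread those bytes as WINDOWS-1252: each byte becomes its own character
mojibake <- iconv(rawToChar(bytes), from = "WINDOWS-1252", to = "UTF-8")
print(utf8ToInt(mojibake))  # code points of the three separate characters

# Undo the damage: write the characters back out as WINDOWS-1252 bytes,
# then mark the resulting bytes as UTF-8
repaired <- iconv(mojibake, from = "UTF-8", to = "WINDOWS-1252")
Encoding(repaired) <- "UTF-8"
print(identical(charToRaw(repaired), bytes))
```

The repair step is the same iconv() call used on sample$speech[1]: it is the misreading run in reverse.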

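However, this messes up the second entry. Its dashes are already correctly encoded, so the conversion turns them into hex 97 bytes, which are not legal in the native UTF-8 encoding on the Mac: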

> sample$speech[2]
[1] "I am not so much surprised as astonished by the amendment. It would
create tremendous problems. Police officers have a vital role in visiting
places of entertainment—without a warrant—particularly in "
> iconv(sample$speech[2], from="utf-8", to="WINDOWS-1252")
[1] "I am not so much surprised as astonished by the amendment. It would
create tremendous problems. Police officers have a vital role in visiting
places of entertainment\x97without a warrant\x97particularly in "
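
One simple way to avoid damaging strings that are already correct is to keep a conversion only when its result is still valid UTF-8. The sketch below is not the approach used in the rest of this answer. `repair_if_valid` is a hypothetical helper, and it assumes that WINDOWS-1252 is the only wrong encoding involved and that the platform's iconv() knows that name.

```r
# Hypothetical helper: undo "UTF-8 read as `wrong`" only where the result is valid UTF-8
repair_if_valid <- function(x, wrong = "WINDOWS-1252") {
  x <- enc2utf8(x)
  # Write each string back out as bytes in the wrongly assumed encoding
  candidate <- iconv(x, from = "UTF-8", to = wrong, mark = FALSE)
  # Keep a candidate only if it converted at all and its bytes form valid UTF-8
  ok <- !is.na(candidate) & validUTF8(candidate)
  Encoding(candidate) <- "UTF-8"
  x[ok] <- candidate[ok]
  x
}

# Build test strings from escapes: a mis-decoded quote, a correct dash, a genuine a-circumflex
broken <- iconv(rawToChar(as.raw(c(0x6f, 0x66, 0xe2, 0x80, 0x99))),
                from = "WINDOWS-1252", to = "UTF-8")
result <- repair_if_valid(c(broken, "a\u2014b", "Hadd\u00e2sse"))
print(lapply(result, utf8ToInt))
```

A string that mixes correctly encoded characters with mis-decoded ones is left untouched by this check, so it would still need something finer-grained.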

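Various packages have functions that guess encodings and repair them, for example rvest::repair_encoding and stringi::stri_enc_detect, but I could not get them to work on your data. I wrote one myself, based on this idea: use utf8ToInt to convert each string to its Unicode code points, then look for strings that contain several high values in a row. sample$speech[1] looks like this: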

> utf8ToInt(sample$speech[1])
  [1]   87  105  116  104   32  116  104  105  115   32  105  116   32  119  105  108  108
 [18]   32   98  101   32   99  111  110  118  101  110  105  101  110  116   32  116  111
 [35]   32  116   97  107  101   32   97  109  101  110  100  109  101  110  116   32   78
 [52]  111   46   32   49   49   50   44   32  105  110   32  116  105  116  108  101   44
 [69]   32  108  105  110  101   32   49   48   44   32  108  101   97  118  101   32  111
 [86]  117  116   32   39  115  101   99  116  105  111  110   32   49   54   51   32   49
[103]   41   32  111  102  226 8364 8482   46   10

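The sequence 226 8364 8482 near the end is typical of a misinterpreted UTF-8 character. (The Wikipedia page describes the UTF-8 encoding in detail. Two-byte characters start with a byte from 192 to 223, three-byte characters with a byte from 224 to 239, and four-byte characters with a byte from 240 to 247. The bytes after the first are all in the range 128 to 191. The tricky part is working out how those high bytes will be displayed, because that depends on the encoding that was wrongly assumed.) Here is a quick and dirty function that tries every encoding known to iconv() and reports what it did: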

fixEncoding <- function(s, guess = iconvlist()) {
  # UTF-8 lead bytes for 2-, 3- and 4-byte sequences, and continuation bytes
  firstbytes <- list(as.raw(192:223), 
                     as.raw(224:239), as.raw(240:247))
  nextbytes <- as.raw(128:191)
  for (i in seq_along(s)) {
    str <- utf8ToInt(s[i])
    # Only strings with non-ASCII code points need attention
    if (any(str > 127)) {
      fixes <- c()
      encs <- c()
      for (g in guess) {
        # Positions of non-ASCII code points not yet explained by encoding g
        high <- which(str > 127)
        # Code points that the lead and continuation bytes become when read as g
        firsts <- lapply(firstbytes, function(s) utf8ToInt(iconv(rawToChar(s), from = g, to = "UTF-8", sub="")))
        nexts <- utf8ToInt(iconv(rawToChar(nextbytes), from = g, to = "UTF-8", sub = ""))
        # try is the number of continuation bytes that follow the lead byte
        for (try in 1:3) {
          starts <- high[str[high] %in% firsts[[try]]]
          starts <- starts[starts <= length(str) - try]
          for (hit in starts) {
            # A lead followed by the right number of continuations is explained
            if (str[hit+1] %in% nexts &&
                (try < 2 || str[hit+2] %in% nexts) &&
                (try < 3 || str[hit+3] %in% nexts)) 
              high <- setdiff(high, c(hit, hit + 1, 
                                    if (try > 1) hit + 2, 
                                    if (try > 2) hit + 3))
          }
        }
        # Every high code point belongs to a plausible sequence, so g is a candidate
        if (!length(high)) {
          fixes <- c(fixes, iconv(s[i], from = "UTF-8", to = g, mark = FALSE))
          encs <- c(encs, g)
        }
      }
      if (length(fixes)) {
        # Apply the fix only when all candidate encodings agree on the result
        if (length(unique(fixes)) == 1) {
          s[i] <- fixes[1]
          message("Fixed s[", i, "] using one of ", paste(encs, collapse=","), "\n")
        } else {
          warning("s[", i, "] has multiple possible fixes.")
          message("It could be")
          uniq <- unique(fixes)
          for (u in seq_along(uniq))
            message(paste(encs[fixes == uniq[u]], collapse = ","), "\n")
          message("Not fixed!\n")
        }
      }
    }
  }
  s
}
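
To make the lookup step concrete, the tables that the function builds for a single guess can be inspected on their own. This is a small sketch for the guess "WINDOWS-1252" only, assuming iconv() recognises that name; the variable names are illustrative, and the exact table contents can differ between iconv implementations.

```r
# What UTF-8 three-byte lead bytes (224..239) look like when read as WINDOWS-1252
three_byte_leads <- utf8ToInt(iconv(rawToChar(as.raw(224:239)),
                                    from = "WINDOWS-1252", to = "UTF-8", sub = ""))

# What UTF-8 continuation bytes (128..191) look like when read as WINDOWS-1252
continuations <- utf8ToInt(iconv(rawToChar(as.raw(128:191)),
                                 from = "WINDOWS-1252", to = "UTF-8", sub = ""))

# The suspicious run from sample$speech[1]
run <- c(226, 8364, 8482)

# The first value should be a lead and the other two continuations
print(run[1] %in% three_byte_leads)
print(all(run[2:3] %in% continuations))
```

A run that passes both checks is what lets the function treat the guessed encoding as a candidate for that string.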

When I tried it on your sample, I saw:

> fixed <- fixEncoding(sample$speech)
Fixed s[1] using one of CP1250,CP1252,CP1254,CP1256,CP1258,MS-ANSI,MS-ARAB,MS-EE,MS-TURK,WINDOWS-1250,WINDOWS-1252,WINDOWS-1254,WINDOWS-1256,WINDOWS-1258

You can turn off the message by calling it as

fixed <- suppressMessages(fixEncoding(sample$speech))

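The other problem in your original post was that some strings displayed as a single character. I think this is an RStudio bug. If I put too many characters into a single element of a data frame, the RStudio viewer cannot display it. For me the limit is around 10240 characters. This data frame will not display properly: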

d <- data.frame(x = paste(rep("a", 10241), collapse=""))

But any smaller number works. This is not an R problem: R displays that data frame in the console without trouble. Only View(d) fails, and only in RStudio.
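
If the viewer is the only problem, the affected rows can be located and a truncated copy viewed instead. This is a minimal sketch, assuming the limit of about 10240 characters observed above also applies on your machine; `viewer_limit` and `preview` are illustrative names.

```r
# Two rows: one short string and one just over the observed viewer limit
d <- data.frame(x = c("short", paste(rep("a", 10241), collapse = "")),
                stringsAsFactors = FALSE)

viewer_limit <- 10240  # approximate value observed above, not a documented constant

# Rows whose string is longer than the viewer seems able to show
too_long <- which(nchar(d$x, allowNA = TRUE) > viewer_limit)
print(too_long)

# A copy with every string cut to 200 characters, safe to open in the viewer
preview <- d
preview$x <- substr(preview$x, 1, 200)
# View(preview)  # uncomment when running interactively in RStudio
```

The full strings stay intact in `d`; only the copy passed to the viewer is shortened.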

