Problem with encoding of character strings when loading json files to RStudio under Windows 10

Problem description

I am trying to extract Tweets from JSON files and save them as RData, using RStudio version 1.2.5033 and streamR under Windows 10. However, Windows (and consequently RStudio and streamR) assumes that the input is encoded as Windows-1252 although it is actually UTF-8, which leads to serious encoding issues.

To replicate the problem, please use this json file with two fake Tweets, since I could not replicate the structure of the original JSON files within R. Note that this structure causes problems with the only solution I have found for the encoding issue (see below).

The code I used is the following:

df <- streamR::parseTweets("test.json")

The output I get with df$text is: '[1] "RT @bkabka:EikÃ¶ tÃ¤mÃ¤" "RT @bkabka:EspaÃ±a"'.

The output should be: '[1] "RT @bkabka:Eikö tämä" "RT @bkabka:España"'.

My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?

Since all this happens because the function wrongly assumes that the text is encoded as Windows-1252, one solution would be to go through the whole corpus and replace each wrongly interpreted special character with the correct one, for example using the table I found here. In my case, however, the corpus is very large, which makes this a suboptimal solution in the long run. Additionally, I would have no way to check whether all special characters were actually replaced correctly.
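
As an alternative to a replacement table, the misinterpretation can sometimes be undone in one step by reversing it: convert the garbled text back to the Windows-1252 bytes it was read from, then declare those bytes to be UTF-8. The following is only a sketch under assumptions. The sample string is hypothetical, the encoding name "windows-1252" must be known to the platform's iconv, and the round trip is not guaranteed to work for bytes that Windows-1252 leaves undefined, so the check described above is still needed.

```r
# Hypothetical example value: "España" after its UTF-8 bytes were
# misread as Windows-1252. It is written with \u escapes so that the
# snippet does not depend on the encoding of the script file.
garbled <- "RT @bkabka:Espa\u00c3\u00b1a"

# Step 1: map each character back to the single byte it had in
# Windows-1252. iconv() returns NA for strings it cannot convert.
repaired <- iconv(garbled, from = "UTF-8", to = "windows-1252")

# Step 2: declare that those bytes are UTF-8, which is what they
# were in the original file.
Encoding(repaired) <- "UTF-8"

print(repaired)
```

Strings for which iconv() returns NA could not be mapped back and would have to be handled separately.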


Some additional information:

Using rjson with the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the JSON files: it only extracts the first line:

lt <- rjson::fromJSON(file="test.json")

I guess it cannot extract the subsequent lines because it does not recognise the line break, which is an actual line break and not \n or any other character combination. Unfortunately, I cannot change the JSON files.
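
If the file really holds one complete JSON object per line, as the behaviour described above suggests, one possible workaround is to read the lines separately and parse each of them on its own. The sketch below assumes exactly that layout. The temporary file and its two records are hypothetical stand-ins for test.json, and whether the resulting strings display correctly on Windows has not been verified here.

```r
# Hypothetical stand-in for test.json: one JSON object per line.
# Special characters are written as JSON \u escapes to keep the
# example independent of the script file's encoding.
path <- tempfile(fileext = ".json")
writeLines(
  c('{"id":1,"text":"RT @bkabka:Eik\\u00f6 t\\u00e4m\\u00e4"}',
    '{"id":2,"text":"RT @bkabka:Espa\\u00f1a"}'),
  path
)

# Read the raw lines, declaring them as UTF-8, and drop empty lines.
json_lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
json_lines <- json_lines[nzchar(json_lines)]

# Parse every line as its own JSON document instead of passing the
# whole file to fromJSON(file = ...).
tweets <- lapply(json_lines, function(line) rjson::fromJSON(json_str = line))

# Collect one field from each parsed record.
texts <- vapply(tweets, function(tw) tw$text, character(1))
```

The result is a list with one element per line, which still has to be converted into a data frame by hand.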

If I am not mistaken, the JSON files were created by another person under macOS using streamR.

The same problem appears when using plain R instead of RStudio. The problem does not appear on macOS.

The problem is even more serious when using tweet2r, the only other package I am aware of that can extract Tweets from JSON files in R. tweet2r deletes specific special characters such as "¶", so the wrongly interpreted special characters can no longer be replaced with the correct ones.

Tags: r, json, character-encoding, windows-10, tweets

Solution


Thanks to MrFlick (see his comment), here is a solution using jsonlite that produces a very similar data frame structure and reads the encoding correctly:

df <- jsonlite::stream_in(file("~/../Downloads/test.json"))

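Some further information for those who are used to working with streamR and may run into similar problems with Tweets in the future: the data frames created by parseTweets and stream_in differ in two main ways:

  1. parseTweets does not extract data for broken Tweets; stream_in does. The data frame therefore has more rows when using stream_in, but contains the same Tweets.

  2. stream_in creates fewer variables, because some columns in the data frame are themselves data frames. This can cause problems if the data frame created by stream_in is used without further transformation. parseTweets does that transformation for you.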
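
Both differences can be reduced after reading the file. The sketch below uses jsonlite::flatten to turn nested data frame columns into ordinary columns, and then drops rows without text. It rests on assumptions: the temporary file and its two records are hypothetical, and treating a record without a text field as a "broken" Tweet is only a guess at what parseTweets skips.

```r
# Hypothetical sample: two line-delimited records. The second one has
# no "text" field and stands in for a "broken" Tweet.
path <- tempfile(fileext = ".json")
writeLines(
  c('{"text":"RT @bkabka:Espa\\u00f1a","user":{"id":1,"screen_name":"a"}}',
    '{"limit":{"track":5}}'),
  path
)

# Read the file as in the solution above.
df <- jsonlite::stream_in(file(path), verbose = FALSE)

# Nested data frame columns such as df$user become ordinary columns
# named user.id, user.screen_name, and so on.
flat <- jsonlite::flatten(df)

# Assumption: rows without any text are the extra rows that
# parseTweets would not have returned.
tweets_only <- flat[!is.na(flat$text), , drop = FALSE]
```

The function is called as jsonlite::flatten because other packages, for example purrr, export a function with the same name.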

