r - Problem with encoding of character strings when loading json files to RStudio under Windows 10
Problem description
I am trying to extract Tweets from json files and save them as RData under Windows 10, using RStudio version 1.2.5033 and streamR. However, Windows (and subsequently RStudio and streamR) assumes that the input is Windows-1252 although it is UTF-8, which leads to serious encoding issues.
To replicate the problem, please use this json file with two fake Tweets since I could not replicate the structure of the original json files within R. But this structure leads to issues with the only solution I found for the encoding issue (see below).
The code I used is the following:
df <- streamR::parseTweets("test.json")
The output I get with df$text is: '[1] "RT @bkabka:EikÃ¶ tÃ¤mÃ¤" "RT @bkabka:EspaÃ±a"'.
The output should be: '[1] "RT @bkabka:Eikö tämä" "RT @bkabka:España"'.
My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?
Since all this happens because the function wrongly assumes that the text is encoded with Windows-1252, one solution would be to go through the whole corpus and replace each wrongly interpreted special character with the correct one, for example using the table I found here. In my case, however, the corpus is very large, which makes this a poor solution in the long run. Additionally, I would have no way to check whether it actually replaced all special characters correctly.
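For text that has already been read the wrong way, a character-by-character table may not be needed. The following is a minimal sketch, assuming the damage is exactly "UTF-8 bytes interpreted as Windows-1252" and nothing else; the helper name fix_mojibake is hypothetical. It converts each string back to its Windows-1252 bytes with iconv and then declares those bytes to be UTF-8. If the bytes in memory are still intact and only the declared encoding is wrong, the Encoding(x) <- "UTF-8" step on its own may already be enough.

```r
# Hypothetical helper: undo "UTF-8 bytes read as Windows-1252" damage.
# Assumption: every character in x exists in Windows-1252. iconv() returns
# NA for elements it cannot convert, so check for NAs afterwards.
fix_mojibake <- function(x) {
  # Step 1: map each character back to the single byte it came from.
  bytes <- iconv(enc2utf8(x), from = "UTF-8", to = "CP1252")
  # Step 2: declare those bytes to be UTF-8 (no conversion happens here).
  Encoding(bytes) <- "UTF-8"
  bytes
}

# The damaged form of the second fake Tweet, written with \u escapes so the
# script itself does not depend on the encoding of the source file.
broken <- "RT @bkabka:Espa\u00c3\u00b1a"
fixed <- fix_mojibake(broken)
print(fixed)
```

Because the conversion is applied to the whole character vector at once, counting the NA values in the result gives a rough check of how many strings could not be repaired.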
Some additional information:
Using rjson and the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the json files, since it only extracts the first line:
lt <- rjson::fromJSON(file="test.json")
I guess it cannot extract the subsequent line because it does not recognise the line break which is an actual line break and not \n or any other character combination. Unfortunately, I do not have the possibility to change the json files.
The json files were created by another person under macOS using streamR - if I am not mistaken.
The same problem appears when using plain R instead of RStudio. The problem does not appear on macOS.
The problem is even more serious when using tweet2r, the only other package I am aware of that can extract Tweets from json files in R. Tweet2r deletes specific special characters such as "¶", so the wrongly interpreted special characters can no longer be replaced with the correct ones.
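Regarding the rjson attempt above: the file appears to hold one JSON object per line, and rjson::fromJSON parses a single JSON value per call, which would explain why only the first Tweet comes back. A minimal sketch, assuming that one-object-per-line layout and that rjson is installed, reads the lines explicitly as UTF-8 and parses each one separately. The temporary file and its two fake Tweets (including the lang field) are stand-ins for test.json, not its real structure.

```r
# Build a small one-object-per-line JSON file as a stand-in for test.json.
tmp <- tempfile(fileext = ".json")
fake <- c(
  '{"text":"RT @bkabka:Eik\u00f6 t\u00e4m\u00e4","lang":"fi"}',
  '{"text":"RT @bkabka:Espa\u00f1a","lang":"es"}'
)
# useBytes = TRUE writes the UTF-8 bytes unchanged, also on Windows.
writeLines(enc2utf8(fake), tmp, useBytes = TRUE)

# Read every line and mark it as UTF-8 instead of the native encoding.
lines <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
lines <- lines[nzchar(lines)]  # drop empty lines, if any

# Parse one JSON object per line; the result is a list with one entry per Tweet.
tweets <- lapply(lines, function(l) rjson::fromJSON(json_str = l))
length(tweets)
```

The result is a plain list rather than a data frame, so it would still need to be reshaped before it resembles the output of parseTweets.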
Solution
Thanks to MrFlick (see his comment), here is a solution using jsonlite. It produces a very similar data frame structure and reads the encoding correctly:
df <- jsonlite::stream_in(file("~/../Downloads/test.json"))
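Some further information for those who are used to working with streamR and may run into similar issues with Tweets in the future: there are two main differences between the data frames created by parseTweets and stream_in:
parseTweets does not extract data for broken Tweets; stream_in does. The data frame therefore has more rows when using stream_in, but contains the same Tweets.
stream_in creates fewer variables, because some columns in the data frame are themselves data frames. This can cause problems when a data frame created with stream_in is used without further transformation. parseTweets does this transformation for you.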
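The nested columns can be expanded with jsonlite::flatten. The sketch below assumes jsonlite is installed and uses two made-up Tweets with a nested user object; the field names are illustrative and not the full Tweet structure.

```r
# Two fake Tweets, one JSON object per line, with a nested "user" object.
# Non-ASCII characters are given as JSON \u escapes to keep the file ASCII-only.
tmp <- tempfile(fileext = ".json")
writeLines(c(
  '{"text":"RT @bkabka:Eik\\u00f6 t\\u00e4m\\u00e4","user":{"screen_name":"a","followers_count":1}}',
  '{"text":"RT @bkabka:Espa\\u00f1a","user":{"screen_name":"b","followers_count":2}}'
), tmp)

df <- jsonlite::stream_in(file(tmp), verbose = FALSE)
is.data.frame(df$user)  # the "user" column is itself a data frame

# flatten() turns nested data frame columns into ordinary columns
# whose names are prefixed with the parent column name.
flat <- jsonlite::flatten(df)
names(flat)
```

The flattened column names (for example user.screen_name) differ from those produced by parseTweets, so code written against parseTweets output would still need its column names adjusted.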