data.table not reading characters appropriately

Problem description

I have the following tibble:

> a

# A tibble: 1 x 1
  Page                                             
  <chr>                                            
1 勒布朗·詹姆斯_zh.wikipedia.org_desktop_all-agents

>  dput(a)
structure(list(Page = "<U+52D2><U+5E03><U+6717>·<U+8A79><U+59C6><U+65AF>_zh.wikipedia.org_desktop_all-agents"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

When I convert it to a data.table, the characters are displayed as escape codes:

b <- as.data.table(a)

>  b

                                                                                    Page
1: <U+52D2><U+5E03><U+6717>·<U+8A79><U+59C6><U+65AF>_zh.wikipedia.org_desktop_all-agents

I get this data frame from a .csv file, where these Chinese characters only display correctly when I use read_csv. With fread, even if I set encoding = 'UTF-8', it doesn't work. How can I overcome this problem with data.table?
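To separate a reading problem from a display problem, a small round trip can help. The sketch below is only an illustration: it writes a one-row UTF-8 file to a temporary path (the real .csv is not shown in the question, so this file is a stand-in), reads it back with fread and encoding = "UTF-8", and compares the stored value with the original string instead of relying on how the table prints.

```r
library(data.table)

# Build the string from \u escapes so the script does not depend on
# the encoding of the source file or on the session locale.
page <- "\u52d2\u5e03\u6717\u00b7\u8a79\u59c6\u65af_zh.wikipedia.org_desktop_all-agents"

# Hypothetical stand-in for the real .csv: header plus one row, written as raw UTF-8 bytes.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(c("Page", page), con, useBytes = TRUE)
close(con)

# Read it back, declaring the file's encoding explicitly.
b <- fread(path, sep = ",", header = TRUE, encoding = "UTF-8")

# Inspect the stored value rather than the printed table.
read_ok  <- identical(b$Page, page)   # same characters as the original?
declared <- Encoding(b$Page)          # encoding mark on the column

unlink(path)
```

If read_ok is TRUE, fread has read the characters correctly and any <U+xxxx> codes come from printing the table, not from the stored data.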

Here is my sessionInfo():

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_3.0.3      readr_1.3.1       data.table_1.13.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       rstudioapi_0.11  knitr_1.29       magrittr_1.5     hms_0.5.3        R6_2.4.1        
 [7] rlang_0.4.7      fansi_0.4.1      tools_4.0.2      xfun_0.16        tinytex_0.25     utf8_1.1.4      
[13] cli_2.0.2        htmltools_0.5.0  ellipsis_0.3.1   yaml_2.2.1       digest_0.6.25    assertthat_0.2.1
[19] lifecycle_0.2.0  crayon_1.3.4     vctrs_0.3.2      glue_1.4.1       evaluate_0.14    rmarkdown_2.3   
[25] compiler_4.0.2   pillar_1.4.6     pkgconfig_2.0.3 

Update:

If I print the element alone, it shows correctly.

> b[[1]]
[1] "勒布朗·詹姆斯_zh.wikipedia.org_desktop_all-agents"
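Since the column prints correctly on its own, one plausible reading is that the stored data are intact and only the table display is affected. This is an assumption, not something the question confirms; the session runs in a Windows-1252 locale, which cannot represent these characters. A minimal sketch for checking the stored values independently of any print method, using a data.table built from the same string:

```r
library(data.table)

# Same string as in the question, written with \u escapes.
page <- "\u52d2\u5e03\u6717\u00b7\u8a79\u59c6\u65af_zh.wikipedia.org_desktop_all-agents"
b <- data.table(Page = page)

# Unicode code points of the first three characters, read from the stored string.
first_points <- utf8ToInt(b[[1]])[1:3]

# Encoding mark and character count of the stored value.
mark    <- Encoding(b[[1]])
n_chars <- nchar(b[[1]], type = "chars")

# Compare against the code points that appear in the <U+xxxx> display.
points_match <- all(first_points == c(0x52D2, 0x5E03, 0x6717))
```

If points_match is TRUE, the <U+52D2><U+5E03><U+6717> text in the printed table is just an escaped rendering of characters that are stored correctly.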

Tags: r, data.table

Solution
