fastTextR encoding

Problem description

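I am having trouble with how fastTextR reads the words in the pretrained model "cc.es.300.bin", which I downloaded from their website (https://fasttext.cc/docs/en/crawl-vectors.html).

I think the problem is that, when I load the model, I cannot tell R that the encoding should be "UTF-8" rather than "Latin1" or something else. That is, I can load the Spanish model and get garbled words, like this: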

model <- ft_load("cc.es.300.bin")  

But I cannot do this:

model <- ft_load("cc.es.300.bin", encoding="UTF-8") 

as is possible with xlsx files, for example:

model <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding="UTF-8")

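I have tried: changing the language and encoding in Windows; reopening and saving the .R file with UTF-8 encoding; changing the locale to Spanish with Sys.setlocale("LC_ALL", "Spanish"). Nothing worked.

Any help is greatly appreciated. Regards,

Tags: r, encoding, utf-8, fasttext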

Solution


The "readr" library helped me:

install.packages("readr")
library(readr)
guess_encoding(ft_words(model))

|                                                                                        |   0%
# A tibble: 2 x 2
  encoding  confidence
  <chr>          <dbl>
1 UTF-8           1   
2 Shift_JIS       0.31
parse_character(ft_words(model), locale=locale(encoding="UTF-8"))
   [1] "de"              ","               "."               "la"              "y"              
   [6] "en"              "que"             "el"              "</s>"            "a"              
  [11] "los"             ":"               "\""              "del"             "un"             
  [16] ")"               "se"              "con"             "por"             "las"            
  [21] "("               "para"            "una"             "es"              "no"             
  [26] "su"              "al"              "como"            "lo"              "/"              
  [31] "más"             "El"              "o"               "'"               "La"             
  [36] "!"               "|"               "?"               "me"              "En"             
  [41] "..."             "-"               "sus"             "este"            "pero"           
  [46] "ha"              "esta"            ";"               "“"               "_"              
  [51] "”"               "si"              "sobre"           "¿"               "fue"            
  [56] "son"             "le"              "muy"             "ser"             "ya"             
  [61] "tu"              "todo"            "1"               "entre"           "te"             
  [66] "mi"              "Los"             "%"               "sin"             "también"
...

instead of

   [1] "de"               ","                "."                "la"              
   [5] "y"                "en"               "que"              "el"              
   [9] "</s>"             "a"                "los"              ":"               
  [13] "\""               "del"              "un"               ")"               
  [17] "se"               "con"              "por"              "las"             
  [21] "("                "para"             "una"              "es"              
  [25] "no"               "su"               "al"               "como"            
  [29] "lo"               "/"                "mÃ¡s"             "El"              
  [33] "o"                "'"                "La"               "!"               
  [37] "|"                "?"                "me"               "En"              
  [41] "..."              "-"                "sus"              "este"            
  [45] "pero"             "ha"               "esta"             ";"               
  [49] "â€œ"              "_"                "â€\u009d"         "si"              
  [53] "sobre"            "Â¿"               "fue"              "son"             
  [57] "le"               "muy"              "ser"              "ya"   
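
The difference between the two listings comes down to how the same bytes are interpreted. The following is a minimal base-R sketch of that mechanism, not fastTextR's actual internals: it assumes the model hands back UTF-8 bytes without an encoding mark, and the variable `word` is a hypothetical stand-in for one element of ft_words(model).

```r
# Hypothetical stand-in for one element of ft_words(model):
# the UTF-8 bytes of "más" with no declared encoding (an assumption).
word <- rawToChar(as.raw(c(0x6d, 0xc3, 0xa1, 0x73)))

# Reading those bytes as Latin-1 reproduces the garbled form.
as_latin1 <- word
Encoding(as_latin1) <- "latin1"
garbled <- enc2utf8(as_latin1)   # "mÃ¡s"

# Declaring the same bytes as UTF-8 recovers the intended word.
fixed <- word
Encoding(fixed) <- "UTF-8"       # "más"

print(garbled)
print(fixed)
```

Under that assumption, marking the strings with Encoding(x) <- "UTF-8" should have the same effect as the parse_character() call above, without the readr dependency.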

However, it does not seem to help when I use the function that returns the nearest neighbors:

parse_character(ft_nearest_neighbors(model, "pera", k = 10L), locale=locale(encoding="UTF-8"))
Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws) : 
  is.character(x) is not TRUE
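
The error suggests that ft_nearest_neighbors returns a numeric vector, with the words stored in its names, so parse_character() has nothing of type character to work on. A possible workaround is to fix the names only. The sketch below simulates such a result in base R; `nn` and the helper `mark_names_utf8` are hypothetical, and it assumes the names are UTF-8 bytes without an encoding mark.

```r
# Hypothetical stand-in for ft_nearest_neighbors(model, "pera", k = 10L):
# a named numeric vector whose names are raw UTF-8 bytes with no encoding mark.
raw_pina <- rawToChar(as.raw(c(0x70, 0x69, 0xc3, 0xb1, 0x61)))  # bytes of "piña"
nn <- c(0.6112964, 0.5707002)
names(nn) <- c("ciruela", raw_pina)

# Hypothetical helper: declare the names to be UTF-8, leaving the scores alone.
mark_names_utf8 <- function(x) {
  nm <- names(x)
  Encoding(nm) <- "UTF-8"  # only marks the strings; the bytes are unchanged
  names(x) <- nm
  x
}

nn_fixed <- mark_names_utf8(nn)
is.character(names(nn_fixed))  # TRUE, whereas is.character(nn) is FALSE
Encoding(names(nn_fixed))      # "unknown" "UTF-8" (pure-ASCII names stay "unknown")
```

With a real model, the same helper would be applied to the result of ft_nearest_neighbors(); passing names(nn) to parse_character() with the UTF-8 locale should work as well.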

However (note piÃ±a instead of piña):

ft_nearest_neighbors(model, "pera", k = 10L)
 limonera   ciruela   manzana mandarina     piÃ±a     fruta   sandÃ­a   compota    sandia     fresa 
0.6326169 0.6112964 0.6079050 0.5713655 0.5707002 0.5576053 0.5557024 0.5526152 0.5485740 0.5437940 

Now, what does help is enc2utf8 (although the characters in the output still look odd):

ft_nearest_neighbors(model,enc2utf8("piña"), k = 10L)
  sandÃ­a    papaya    sandia    ananÃ¡  plÃ¡tano   ananÃ¡s     fruta    limÃ³n mandarina maracuyÃ¡
0.6763531 0.6571828 0.6365163 0.6341625 0.6205474 0.6205293 0.6137358 0.6037553 0.6032383 0.5941805

enc2utf8 also helps if you want to get a single word vector:

piña <- as.vector(ft_word_vectors(model, enc2utf8("piña")))
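
A likely reason enc2utf8 helps with the query word: in a Latin-1 Windows session, a "piña" typed in the script is stored with ñ as a single byte, while the model's vocabulary appears to be stored as UTF-8, so the lookup compares different byte sequences. The base-R sketch below only shows the byte-level conversion; the variable names are illustrative and the Latin-1 session is an assumption.

```r
# "piña" as a Latin-1 session would store it: ñ is the single byte 0xf1.
typed <- rawToChar(as.raw(c(0x70, 0x69, 0xf1, 0x61)))
Encoding(typed) <- "latin1"

# enc2utf8() re-encodes the string, so ñ becomes the two bytes 0xc3 0xb1.
converted <- enc2utf8(typed)

charToRaw(typed)      # 70 69 f1 61
charToRaw(converted)  # 70 69 c3 b1 61
Encoding(converted)   # "UTF-8"
```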

