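r - FasttextR encoding
Problem description
FasttextR garbles words from the pre-trained model "cc.es.300.bin", which I downloaded from their website (https://fasttext.cc/docs/en/crawl-vectors.html).
I think the problem is that when I load the model, I cannot tell R that the encoding should be "UTF-8" rather than "Latin1" or something else. That is, I can load the Spanish model and get the words wrong, like this: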
model <- ft_load("cc.es.300.bin")
But I cannot do this:
model <- ft_load("cc.es.300.bin", encoding="UTF-8")
the way one can with an xlsx file, for example:
model <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding="UTF-8")
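I have tried: changing the language and encoding in Windows; reopening and saving the .R file with UTF-8 encoding; changing the locale to Spanish with Sys.setlocale("LC_ALL", "Spanish"). Nothing worked.
Any help would be much appreciated. Regards,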
Solution
The "readr" library helped me
install.packages("readr")
library(readr)
guess_encoding(ft_words(model))
| | 0%
# A tibble: 2 x 2
encoding confidence
<chr> <dbl>
1 UTF-8 1
2 Shift_JIS 0.31
parse_character(ft_words(model), locale=locale(encoding="UTF-8"))
[1] "de" "," "." "la" "y"
[6] "en" "que" "el" "</s>" "a"
[11] "los" ":" "\"" "del" "un"
[16] ")" "se" "con" "por" "las"
[21] "(" "para" "una" "es" "no"
[26] "su" "al" "como" "lo" "/"
[31] "más" "El" "o" "'" "La"
[36] "!" "|" "?" "me" "En"
[41] "..." "-" "sus" "este" "pero"
[46] "ha" "esta" ";" "“" "_"
[51] "”" "si" "sobre" "¿" "fue"
[56] "son" "le" "muy" "ser" "ya"
[61] "tu" "todo" "1" "entre" "te"
[66] "mi" "Los" "%" "sin" "también"
...
instead of
[1] "de" "," "." "la"
[5] "y" "en" "que" "el"
[9] "</s>" "a" "los" ":"
[13] "\"" "del" "un" ")"
[17] "se" "con" "por" "las"
[21] "(" "para" "una" "es"
[25] "no" "su" "al" "como"
[29] "lo" "/" "más" "El"
[33] "o" "'" "La" "!"
[37] "|" "?" "me" "En"
[41] "..." "-" "sus" "este"
[45] "pero" "ha" "esta" ";"
[49] "“" "_" "â€\u009d" "si"
[53] "sobre" "¿" "fue" "son"
[57] "le" "muy" "ser" "ya"
However, it does not seem to help when I use the function that gets the nearest neighbors
parse_character(ft_nearest_neighbors(model, "pera", k = 10L), locale=locale(encoding="UTF-8"))
Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws) :
is.character(x) is not TRUE
But the accented characters in the result names come back mis-encoded (note sandÃa instead of sandía)
ft_nearest_neighbors(model, "pera", k = 10L)
limonera ciruela manzana mandarina piña fruta sandÃa compota sandia fresa
0.6326169 0.6112964 0.6079050 0.5713655 0.5707002 0.5576053 0.5557024 0.5526152 0.5485740 0.5437940
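The parse_character() error seems to arise because ft_nearest_neighbors() returns a numeric vector with the words stored as its names (this is inferred from the printed output above, not checked against the package documentation). A minimal sketch in base R, assuming the names already hold valid UTF-8 bytes that are simply not marked as UTF-8, is to declare the encoding of the names. The helper fix_names_utf8 and the simulated vector nn are made up for illustration:

```r
# Hypothetical helper: mark the names of a vector as UTF-8 without changing any bytes.
fix_names_utf8 <- function(x) {
  nms <- names(x)
  Encoding(nms) <- "UTF-8"   # only sets the encoding mark
  names(x) <- nms
  x
}

# Simulated stand-in for the ft_nearest_neighbors() result: a named numeric
# vector whose name holds the UTF-8 bytes of "piña" with no encoding mark.
raw_name <- rawToChar(as.raw(c(0x70, 0x69, 0xc3, 0xb1, 0x61)))
nn <- c(0.5707002)
names(nn) <- raw_name
Encoding(names(nn))      # "unknown"

fixed <- fix_names_utf8(nn)
Encoding(names(fixed))   # "UTF-8"
print(fixed)
```

With the real model the call would presumably be fix_names_utf8(ft_nearest_neighbors(model, "pera", k = 10L)); that call is untested here.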
Now, what does help is enc2utf8 (although the characters in the output still look odd)
ft_nearest_neighbors(model,enc2utf8("piña"), k = 10L)
sandÃa papaya sandia ananá plátano ananás fruta limón mandarina maracuyá
0.6763531 0.6571828 0.6365163 0.6341625 0.6205474 0.6205293 0.6137358 0.6037553 0.6032383 0.5941805
enc2utf8 also helps if you want to get a single word vector
piña <- as.vector(ft_word_vectors(model, enc2utf8("piña")))
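A likely reason enc2utf8 helps: on Windows, a string typed in the console or a script is often stored in the native encoding (for example Latin-1), where ñ is the single byte f1, while a UTF-8 vocabulary encodes ñ as the two bytes c3 b1, so the lookup key would not match. The sketch below uses base R only and builds the Latin-1 string by hand (an assumption about what the session actually holds) to show the byte-level difference:

```r
# "piña" as it would be stored in a Latin-1 session (assumed for illustration).
latin1_word <- rawToChar(as.raw(c(0x70, 0x69, 0xf1, 0x61)))
Encoding(latin1_word) <- "latin1"

# enc2utf8() re-encodes the string, so ñ becomes the two bytes c3 b1.
utf8_word <- enc2utf8(latin1_word)

charToRaw(latin1_word)   # 70 69 f1 61
charToRaw(utf8_word)     # 70 69 c3 b1 61
Encoding(utf8_word)      # "UTF-8"
```

If this is what is going on, wrapping every query word in enc2utf8() before passing it to the model, as in the calls above, is the consistent fix on the input side.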