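python - Pretrained fastText model returns gibberish for out-of-vocabulary words
Question
I'm having trouble with a pretrained fastText .bin model (downloaded from https://fasttext.cc/docs/en/crawl-vectors.html). Calling most_similar on an in-vocabulary word returns sensible results. However, calling most_similar on an out-of-vocabulary word that differs from it by a single character returns gibberish.
My question: is this a property of the model, or am I using it incorrectly?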
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('cc.en.300.bin')
model.most_similar("universitet")
[('Universitet', 0.8522759675979614),
('högskolan', 0.677900493144989),
('Högskola', 0.6725144386291504),
('högskola', 0.6724666357040405),
('Högskolan', 0.6600401997566223),
('Universitetet', 0.6519213318824768),
('Høgskolen', 0.647462010383606),
('Universiteti', 0.6399329900741577),
('forskning', 0.617483377456665),
('språk', 0.6172543168067932)]
model.most_similar("universitett")
[('ESTATERETAILCONSUMERPHONESCARSBIKESAPPSINTERNETTABLETSCOMPUTERSSOCIETYPOLITICSLAWCRIMEENVIRONMENTSCIENCEARTSCELEBRITIESSPORTSSPECIALSFIRST',
0.47905537486076355),
('Wikipedia-Page-Suzannah-B-Troy-6-yrs-after-Misogynist-Cyber-Vandalism-Censorship-via-Deletion-on-a-page-about-Censorship-Wikipedia-Agrees-to-retur',
0.47733378410339355),
('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAkKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
0.474983274936676),
('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
0.47364047169685364),
('crescendosexibloguerobateyabsorbersexiindesignabledinerolatifundiosexibrezarcularsutesexirapoplinbrezarcorrentosoVd.lazadareflejoreglafeministabrezarchuzasexiouttiqueblogueroin',
0.47090965509414673),
('QQFZAAEACwAAAAAGQASAAAIjgAJCBQIoGDBgQgTKiwooGHDgwshDgTgsOLDhAAGaAQwUYBBhx85EtS4cWLGjR5JSjxZkgDFkwwLohTJUqTLlANiwvQ4seVNjwwfBoVokKjFo0Jlksz506NFiklZtoQKFSjIoktLVv1YsahSn1WP0vzq02VYoAjJMsVYVKHZrDbdupW6Vq5cunHtRjQoMCAAIfkECRQABAAsCQADAAQABAAACAsABQgkILCgwYEBAQAh',
0.46747487783432007),
('записиТелепрограммаVikerraadioOtseEsilehtJärelkuulamineSaatekavaPodcastidRaadioteaterRaadio',
0.4659830331802368),
('deblogueroreflejoantecedentesexitlacuachebateysuteindesignableabsorbersexilatifundiosexibrezarsutemultiétnicosexiplinrapobrezarcorrentosoVd.lazadafisiochillidomabrezarsico-chuzaoutcolodrablogueroin',
0.46159273386001587),
('2OtseEsilehtJärelkuulamineSaatedPodcastidKlassikaraadioOtseEsilehtJärelkuulamineSaatekavaPodcastidRaadio',
0.4609595537185669),
('leilighetEiendomstypeSelveierleilighetPlass', 0.4550461769104004)]
Answer
If I remember correctly, "gensim.models.wrappers" has been deprecated. Try using
from gensim.models.fasttext import FastText
Source: https://radimrehurek.com/gensim/models/wrappers/fasttext.html
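As a minimal sketch of what the non-deprecated loading path might look like, assuming gensim 4.x is installed and cc.en.300.bin sits in the working directory: load_facebook_vectors and key_to_index are gensim names, while load_vectors and describe_query are hypothetical helpers introduced here for illustration. The sketch only reports whether a query word is in the vocabulary before listing its neighbours; it does not by itself show that the gibberish neighbours go away.

```python
def load_vectors(model_path):
    """Load a Facebook-format fastText .bin file and return its word vectors."""
    # Imported lazily so the helpers below can be used without gensim installed.
    # Assumption: gensim 4.x, where load_facebook_vectors replaces the old wrapper.
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(str(model_path))


def describe_query(wv, word, topn=10):
    """Return (in_vocabulary, neighbours) for a query word.

    For fastText vectors, `word in wv` is not a useful vocabulary test because
    out-of-vocabulary words still get a vector built from character n-grams,
    so the check is made against the key_to_index mapping instead.
    """
    in_vocabulary = word in wv.key_to_index
    neighbours = wv.most_similar(word, topn=topn)
    return in_vocabulary, neighbours


if __name__ == "__main__":
    # Assumed file name, taken from the question; adjust to the local path.
    wv = load_vectors("cc.en.300.bin")
    for query in ("universitet", "universitett"):
        in_vocabulary, neighbours = describe_query(wv, query)
        print(query, "in vocabulary:", in_vocabulary)
        for neighbour, score in neighbours:
            # Truncate very long tokens so the listing stays readable.
            print("   ", neighbour[:60], round(score, 4))
```

Printing the in-vocabulary flag next to the neighbours makes it easier to tell whether odd results come from a word whose vector was assembled purely from subword n-grams.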