word2vec - Unable to load glove.6B.300d.txt
Problem description
I am trying to load GloVe vectors with the following code:
en_model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)
Unexpectedly, I get the following error:
  File "/home/k/Desktop/Work/Vector explorer/word2vec-explorer/vec_test_loader.py", line 55, in make_model
    en_model = KeyedVectors.load_word2vec_format(model_path, binary=is_bin)
  File "/home/k/.local/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 1119, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/home/k/.local/lib/python3.5/site-packages/gensim/models/utils_any2vec.py", line 175, in _load_word2vec_format
    vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
  File "/home/k/.local/lib/python3.5/site-packages/gensim/models/utils_any2vec.py", line 175, in <genexpr>
    vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
ValueError: invalid literal for int() with base 10: 'the'
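Can anyone help?
Solution
Gensim needs more information about the file at model_path. A raw GloVe file starts directly with the first word vector, so gensim tries to parse 'the' as the vocabulary size and raises the ValueError. We have to prepend a first line containing two numbers: the first is the number of words in the vocabulary and the second is the dimension of the word embeddings, like this: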
101 300
the 1.0 2.1 -1.3 ...
I 1.1 0.2 -0.3 ...
.
.
.
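As a rough illustration of what that header line contains, here is a minimal sketch that derives it from the file itself instead of hard-coding the numbers. The function name glove_header is hypothetical (it is not part of gensim), and the sketch assumes a UTF-8 text file with one space-separated vector per line and no spaces inside the words themselves.

```python
def glove_header(path, encoding="utf-8"):
    """Return the 'vocab_size vector_size' line expected by the word2vec text format.

    Hypothetical helper: assumes one vector per line, fields separated by
    single spaces, and tokens that contain no spaces.
    """
    vocab_size = 0
    vector_size = None
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            if vector_size is None:
                # The first field is the word; the rest are vector components.
                vector_size = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    if vector_size is None:
        raise ValueError("file contains no vectors")
    return "{} {}".format(vocab_size, vector_size)
```

For a file with 101 lines of 300 values each, as in the example above, this would return the string 101 300.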
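You can try a one-line command like this (adjust the file names to match your GloVe file), then load the output file with load_word2vec_format: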
python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
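If you are on gensim 4.0 or newer (the traceback above comes from an older release), the conversion step may not be needed at all: load_word2vec_format accepts a no_header argument for text files. The sketch below assumes such a version is installed. The wrapper name load_glove_without_header is only illustrative.

```python
def load_glove_without_header(path):
    """Load a header-less GloVe text file. Assumes gensim 4.0 or newer."""
    # Imported here so the requirement on gensim is explicit at the call site.
    from gensim.models import KeyedVectors

    # no_header=True makes gensim scan the file once to infer the vocabulary
    # size and vector dimension instead of reading them from the first line.
    return KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)
```

This costs one extra pass over the file at load time, but avoids writing a second copy of a large vector file to disk.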
Or you can use my code below as a reference:
import shutil
from sys import platform

import gensim


def getFileLineNums(filename):
    # Count the lines (one word vector per line) in the GloVe file.
    with open(filename, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)


def prepend_line(infile, outfile, line):
    # Write the header line, then bulk-copy the rest of the file.
    with open(infile, 'r', encoding='utf-8') as old:
        with open(outfile, 'w', encoding='utf-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)


def prepend_slow(infile, outfile, line):
    # Same result as prepend_line, but copies one line at a time.
    with open(infile, 'r', encoding='utf-8') as fin:
        with open(outfile, 'w', encoding='utf-8') as fout:
            fout.write(line + "\n")
            for row in fin:
                fout.write(row)


def load(filename):
    num_lines = getFileLineNums(filename)
    gensim_file = 'glove_model.txt'
    # 300 is the vector dimension of this file; change it for other GloVe files.
    gensim_first_line = "{} {}".format(num_lines, 300)
    # Prepends the line.
    if platform == "linux" or platform == "linux2":
        prepend_line(filename, gensim_file, gensim_first_line)
    else:
        prepend_slow(filename, gensim_file, gensim_first_line)
    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file)
    return model


# your_model_path is the path to the original GloVe text file.
model = load(your_model_path)