A Summary of gensim word2vec Usage

foghorn 2022-02-13 17:59

Common APIs

  • gensim.models.Word2Vec(sentences, min_count, workers)
  • gensim.models.word2vec.Word2Vec(sentences, min_count, workers) (the same class, exposed at both import paths)

Word2Vec parameters

  • sentences: the training corpus; must be an iterable whose elements are lists of tokens
  • min_count: words with a total frequency lower than this value are ignored
  • max_vocab_size: caps the vocabulary size during building, to prevent memory overflow
  • size: dimensionality of the word vectors (renamed vector_size in gensim 4.x)
  • alpha: initial learning rate; it decays linearly as training progresses
  • min_alpha: the floor the learning rate decays to
  • window: maximum distance between the current word and the predicted word (the sliding-window size)
  • sg: training algorithm (0: CBOW; 1: skip-gram)
  • hs: chooses between Word2Vec's two training objectives. If 1, hierarchical softmax is used; if 0 and the number of negative samples (negative) is greater than 0, negative sampling is used. The default is 0, i.e. negative sampling.
  • iter: number of iterations (epochs) over the corpus (renamed epochs in gensim 4.x)
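
A minimal sketch tying these parameters together, assuming gensim 3.x (under gensim 4.x, size becomes vector_size and iter becomes epochs); the toy corpus is purely illustrative:

from gensim.models import word2vec

# A toy in-memory corpus: an iterable of token lists.
sentences = [['我', '爱', '自然语言处理'], ['我', '爱', '机器学习']]

model = word2vec.Word2Vec(
    sentences,
    size=64,       # word-vector dimensionality
    window=3,      # sliding-window size
    min_count=1,   # keep every word, even singletons
    sg=1,          # 1: skip-gram; 0: CBOW
    hs=0,          # 0 with negative > 0: negative sampling
    negative=5,    # number of negative samples
    iter=5,        # training epochs
    workers=4,     # parallel worker threads
)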

Loading a corpus

Building it yourself

In the simplest case the corpus is a plain Python list of tokenized sentences, e.g. sentences = [['ab', 'ba'], ['sheu', 'dhudhi', 'hdush'], ..., []]

Loading a single-file corpus

Use LineSentence(); the file must already be tokenized, with one sentence per line and tokens separated by whitespace.
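
A minimal sketch, assuming a tokenized file at the hypothetical path corpus.txt:

from gensim.models import word2vec

# Each line of corpus.txt is one sentence, tokens separated by spaces.
sentences = word2vec.LineSentence('corpus.txt')
model = word2vec.Word2Vec(sentences, min_count=1)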

Loading all corpus files in a directory

Use PathLineSentence(); the files must likewise already be tokenized.
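
Likewise, a sketch assuming a hypothetical directory corpus_dir that contains several such tokenized files:

from gensim.models import word2vec

# PathLineSentence streams every file under the directory in turn.
sentences = word2vec.PathLineSentence('corpus_dir')
model = word2vec.Word2Vec(sentences, min_count=1)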

Custom corpus class

class MySentence:
    def __init__(self, data_path, max_line=None):
        self.data_path = data_path
        self.max_line = max_line

    def __iter__(self):
        # Word2Vec iterates over the corpus several times (once to build the
        # vocabulary, then once per epoch), so the line counter must be local
        # to each pass; storing it on the instance would exhaust the iterator
        # after the first pass whenever max_line is set.
        cur_line = 0
        with open(self.data_path, 'r', encoding='utf-8') as f:
            for line in f:
                if self.max_line is not None and cur_line >= self.max_line:
                    return
                cur_line += 1
                yield line.strip('\n').split()

The code above defines a MySentence class whose instances are iterable, so an instance can be passed directly to Word2Vec() as the corpus.

Training a model

from gensim.models import word2vec

ms = MySentence(data_path)  # data_path points to a tokenized corpus file
model = word2vec.Word2Vec(ms, hs=1, min_count=1, window=3, size=64)

Suppose you want to continue training a model that has already been trained. First load it:

model = word2vec.Word2Vec.load(model_path)

Then train on the additional corpus; recent gensim versions require total_examples and epochs to be passed explicitly:

model.train(other_sentence, total_examples=model.corpus_count, epochs=model.epochs)
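
If other_sentence contains words that never appeared in the original training corpus, the vocabulary has to be expanded before training. A sketch following gensim's incremental-training pattern:

# Merge the new words into the existing vocabulary, then train on the new corpus.
model.build_vocab(other_sentence, update=True)
model.train(other_sentence, total_examples=model.corpus_count, epochs=model.epochs)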

Saving a model

  • model.save(model_name): saves the full model; training can be resumed later
  • model.wv.save_word2vec_format(model_name): saves only the word vectors in the C word2vec format; training cannot be resumed
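
A quick sketch with hypothetical file names:

model.save('w2v.model')                                 # full model; can be trained further
model.wv.save_word2vec_format('w2v.txt')                # vectors only, C text format
model.wv.save_word2vec_format('w2v.bin', binary=True)   # vectors only, C binary format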

Loading a model

Method 1:

model = word2vec.Word2Vec.load(model_path)

Method 2:

model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)  # C text format

model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)  # C binary format

Note that method 2 loads only the word vectors (a KeyedVectors object), so the result can be queried but not trained further.

Getting a word vector

word_vec = model.wv[word]
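
A small sketch of the lookup API, assuming a trained model as above; membership can be tested first to avoid a KeyError on out-of-vocabulary words, and the same object also answers similarity queries:

word = '侯亮平'  # any token that appears in the training corpus
if word in model.wv:
    vec = model.wv[word]                        # numpy array of dimension `size`
    print(model.wv.most_similar(word, topn=5))  # nearest neighbours by cosine similarity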

Full example

import jieba
from gensim.models import word2vec


def cut_words():
    # Tell jieba to keep these character names from "In the Name of the
    # People" as single tokens instead of splitting them.
    names = ['沙瑞金', '田国富', '高育良', '侯亮平', '钟小艾', '陈岩石',
             '欧阳菁', '易学习', '王大路', '蔡成功', '孙连城', '季昌明',
             '丁义珍', '郑西坡', '赵东来', '高小琴', '赵瑞龙', '林华华',
             '陆亦可', '刘新建', '刘庆祝']
    for name in names:
        jieba.suggest_freq(name, True)

    # Segment the raw novel and write it back out with tokens separated by spaces.
    with open('./in_the_name_of_people.txt', 'r', encoding='utf-8') as f:
        document = f.read()
        document_cut = jieba.cut(document)
        result = ' '.join(document_cut)

        with open('./in_the_name_of_people_segment.txt', 'w', encoding='utf-8') as f2:
            f2.write(result)

    print('ok')


class MySentence:
    def __init__(self, data_path, max_line=None):
        self.data_path = data_path
        self.max_line = max_line

    def __iter__(self):
        # Word2Vec iterates over the corpus several times, so the line
        # counter must be local to each pass (see the earlier discussion).
        cur_line = 0
        with open(self.data_path, 'r', encoding='utf-8') as f:
            for line in f:
                if self.max_line is not None and cur_line >= self.max_line:
                    return
                cur_line += 1
                yield line.strip('\n').split()


def word_embedding():
    ms = MySentence('./in_the_name_of_people_segment.txt')
    model = word2vec.Word2Vec(ms, hs=1, min_count=1, window=3, size=64)
    model.save('./name_of_people_wv.model')
    print('ok')


def load_model():
    model = word2vec.Word2Vec.load('./name_of_people_wv.model')
    words = ['侯亮平',  '蓦地', '睁开眼睛', '。',  '大厅',  '突起',  '一阵',  '骚动', '许多',  '人',  '拥向',  '不同',  '的',  '登机口']

    for word in words:
        print(model.wv[word])


if __name__ == "__main__":
    cut_words()       # produce the segmented corpus first
    word_embedding()  # train and save the model
    load_model()      # reload the model and print some vectors
