How do I update a .mm (Matrix Market) file with new documents (corpus)?

Problem description

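I am looking for a way to update an existing corpus with new documents using gensim. I built a dictionary from my existing corpus, converted the corpus to bag-of-words vectors with it, serialized the result to a .mm file, and saved it to local disk. Now I want to update that .mm file with new documents, so that I keep an up-to-date representation of the corpus and can use it for document similarity on unseen data. How should I do this? What is the correct way to update the corpus? I know that I can add documents to the dictionary, but not to the .mm file.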

from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import STOPWORDS

tweets = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time', 'survey'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

dictionary = corpora.Dictionary(tweets)
dictionary.save('tweets.dict')  # store the dictionary, for future reference

dictionary = corpora.Dictionary.load('tweets.dict')
print(f'Length of previous dict = {len(dictionary)}, tokens = {dictionary.token2id}')
raw_corpus = [dictionary.doc2bow(t) for t in tweets]
corpora.MmCorpus.serialize('tweets.mm', raw_corpus)  # store to disk
print("Save the vectorized corpus as a .mm file")

corpus = corpora.MmCorpus('tweets.mm') # loading saved .mm file
print(corpus)

new_docs = [
    ["user", "response", "system"],
    ["trees", "minor", "surveys"]
]

# how to add this new_docs corpus to tweets.mm

Can tweets.mm be updated in place? And is doing so even recommended?

Tags: python, nlp, gensim, word2vec, similarity

Solution


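There is no direct way to update a .mm corpus on disk. Instead, I recommend rebuilding the corpus from the documents, extending tweets with new_docs. This way you make sure that the dictionary (the word-to-id mapping) does not go out of sync with the corpus.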
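To illustrate why appending to the file is not enough, here is a minimal, dependency-free sketch of the Matrix Market coordinate layout. The file starts with a banner line followed by a size line (documents, terms, non-zero entries), so rows appended at the end leave that size line stale. The helpers `write_mm` and `read_header` are hypothetical names for this illustration and are not part of gensim; gensim's own writer may lay out the file slightly differently.

```python
import os
import tempfile


def write_mm(path, bows, num_terms):
    """Write bag-of-words vectors in Matrix Market coordinate format (toy writer)."""
    nnz = sum(len(bow) for bow in bows)
    with open(path, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        # Size line: number of documents, number of terms, number of non-zero entries
        f.write(f"{len(bows)} {num_terms} {nnz}\n")
        for doc_no, bow in enumerate(bows, start=1):
            for term_id, weight in bow:
                # Matrix Market indices are 1-based
                f.write(f"{doc_no} {term_id + 1} {weight}\n")


def read_header(path):
    """Return (num_docs, num_terms, num_nnz) from the size line."""
    with open(path) as f:
        f.readline()  # skip the banner line
        return tuple(int(x) for x in f.readline().split())


bows = [[(0, 1), (1, 1)], [(1, 2)]]
path = os.path.join(tempfile.mkdtemp(), "toy.mm")
write_mm(path, bows, num_terms=2)
header_before = read_header(path)

# Naively append a third document that uses a new term id
with open(path, "a") as f:
    f.write("3 3 1\n")
header_after_append = read_header(path)  # the size line was not updated

# Rewriting the whole file keeps the size line and the entries consistent
write_mm(path, bows + [[(2, 1)]], num_terms=3)
header_after_rewrite = read_header(path)

print(header_before, header_after_append, header_after_rewrite)
```

After the naive append the size line still describes two documents and two terms, which is the kind of inconsistency that rewriting the whole corpus avoids.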

I would create the following function to handle the update:

import itertools

from gensim import corpora


def update_corpus(tweets, new_docs, dict_path, mm_path='tweets.mm'):
    # Load the saved dictionary and extend it with the tokens of the new documents
    dictionary = corpora.Dictionary.load(dict_path)
    print(f'Length of previous dict = {len(dictionary)}, tokens = {dictionary.token2id}')
    dictionary.add_documents(new_docs)
    dictionary.save(dict_path)
    print(f'Length of updated dict = {len(dictionary)}, tokens = {dictionary.token2id}')
    # Re-vectorize the old and new documents together with the updated dictionary
    full_corpus = itertools.chain(tweets, new_docs)
    raw_corpus = [dictionary.doc2bow(t) for t in full_corpus]
    corpora.MmCorpus.serialize(mm_path, raw_corpus)  # overwrite the .mm file on disk
    print("Saved the vectorized corpus as a .mm file")

Note that there is no need to load the dictionary right after creating and saving it, so you can remove this line from your code:

dictionary = corpora.Dictionary.load('tweets.dict')
