Is there a preferred way to save and load an h2o word2vec model in Python?

Problem Description

I trained a word2vec model with the Python h2o package. Is there a simple way to save that word2vec model and load it back later for use?

I have tried the h2o.save_model() and h2o.load_model() functions, but without success. With that approach I get errors such as:

ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://localhost:54321/99/Models.bin/)

water.exceptions.H2OIllegalArgumentException
[1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: dir of function: importModel:

I am using the same version of h2o to train and reload the model, so the issue outlined in this question does not apply: "Can't import binay h2o model with h2o.loadModel() function: 412 Precondition Failed"
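For reference, here is a minimal sketch of how the client and cluster versions can be compared programmatically (check_h2o_versions is an illustrative helper, not part of the h2o API; h2o.cluster().version requires a running cluster started with h2o.init()):

```python
def check_h2o_versions():
    """Compare the local h2o client version with the running cluster's
    version; a mismatch between the two is one known cause of 412 errors
    when loading models."""
    import h2o  # imported inside so the helper can be defined without a cluster
    client_version = h2o.__version__
    cluster_version = h2o.cluster().version  # requires a prior h2o.init()
    if client_version != cluster_version:
        print(f"Version mismatch: client {client_version}, cluster {cluster_version}")
    return client_version, cluster_version
```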

Does anyone have any insight into how to save and load an h2o word2vec model?

My example code, including some of the important snippets:

import h2o
from h2o.estimators import H2OWord2vecEstimator

df['text'] = df['text'].ascharacter()
  
# Break text into sequence of words
words = tokenize(df["text"])
    
# Initializing h2o
print('Initializing h2o.')
h2o.init(ip=h2o_ip, port=h2o_port, min_mem_size=h2o_min_memory) 
   
# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
w2v_model.train(training_frame=words)
    
    
# Calculate a vector for each row
word_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")

# Save model to path
wv_path = '/models/wordvec/'
model_path = h2o.save_model(model=w2v_model, path=wv_path, force=True)

# Load model in later script
w2v_model = h2o.load_model(model_path)
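Since the 412 error complains about the model path, a quick stdlib pre-check that the target directory exists and is readable/writable can help narrow things down. This is a hypothetical helper, not part of the h2o API, and it only checks the local process; the H2O server process must also be able to access the same path:

```python
import os

def check_model_dir(path):
    """Verify that a model save/load directory exists and is
    readable and writable by the current process. Note: the H2O
    server process must be able to access the same path as well."""
    abs_path = os.path.abspath(path)
    if not os.path.isdir(abs_path):
        raise FileNotFoundError(f"Not a directory: {abs_path}")
    if not os.access(abs_path, os.R_OK | os.W_OK):
        raise PermissionError(f"No read/write access: {abs_path}")
    return abs_path

# Example: an absolute path such as '/models/wordvec/' often fails simply
# because the root-level directory does not exist or is not writable.
```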

Tags: python, word2vec, h2o

Solution


It sounds like there may be an access issue with the directory you are trying to read from. I just tested the w2v example from the documentation on H2O 3.30.0.1, and it works fine:

import h2o
from h2o.estimators import H2OWord2vecEstimator

h2o.init()

job_titles = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv",
                             col_names = ["category", "jobtitle"],
                             col_types = ["string", "string"],
                             header = 1)
STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can",
              "lines","re","what","there","all","we","one","the",
              "a","an","of","or","in","for","by","on","but","is",
              "in","a","not","with","as","was","if","they","are",
              "this","and","it","have","from","at","my","be","by",
              "not","that","to","from","com","org","like","likes",
              "so"]

# Make the 'tokenize' function:
def tokenize(sentences, stop_word = STOP_WORDS):
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()),:]
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]",invert=True,output_logical=True),:]
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~ tokenized_words.isin(stop_word)),:]
    return tokenized_words

# Break job titles into a sequence of words:
words = tokenize(job_titles["jobtitle"])

# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)

w2v_model.train(training_frame=words)

# Save model
wv_path = 'models/'
model_path = h2o.save_model(model=w2v_model, path=wv_path, force=True)

# Load model
w2v_model2 = h2o.load_model(model_path)
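To confirm the round trip worked, you can query the reloaded model, for example with find_synonyms. Sketched below as an illustrative helper, since actually running it needs a live cluster and a trained model; the probe word is just an example:

```python
def verify_reloaded_model(model, probe_word="teacher", count=5):
    """Sanity-check a reloaded word2vec model by asking for synonyms
    of a word that appeared in the training data."""
    synonyms = model.find_synonyms(probe_word, count=count)
    print(synonyms)
    return synonyms

# Usage (with a live cluster and the reloaded model from above):
# verify_reloaded_model(w2v_model2, probe_word="teacher")
```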
