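python - Is there a preferred way to save and load an h2o word2vec model in Python?
Problem description
I trained a word2vec model with the h2o package in Python. Is there a simple way to save that word2vec model and load it back later for use?
I have tried the h2o.save_model() and h2o.load_model() functions, without success. With that approach I get an error such as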
ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://localhost:54321/99/Models.bin/)
water.exceptions.H2OIllegalArgumentException
[1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: dir of function: importModel:
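I am using the same version of h2o to train and to reload the model, so the issue described in this question does not apply: Can't import binay h2o model with h2o.loadModel() function: 412 Precondition Failed
Does anyone have any insight into how to save and load an h2o word2vec model?
My sample code, with the important snippets: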
import h2o
from h2o.estimators import H2OWord2vecEstimator
# Initializing h2o (the cluster must be running before any H2OFrame operation)
print('Initializing h2o.')
h2o.init(ip=h2o_ip, port=h2o_port, min_mem_size=h2o_min_memory)
# df is an H2OFrame with a 'text' column
df['text'] = df['text'].ascharacter()
# Break text into sequence of words
words = tokenize(df["text"])
# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
w2v_model.train(training_frame=words)
# Calculate a vector for each row
word_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
#Save model to path
wv_path = '/models/wordvec/'
model_path = h2o.save_model(model = w2v_model, path= wv_path ,force=True)
# Load model in later script
w2v_model = h2o.load_model(model_path)
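Solution
It sounds like there may be an access problem with the directory you are trying to read from. I just tested this on H2O 3.30.0.1, following the w2v example in the documentation, and it worked fine: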
import h2o
from h2o.estimators import H2OWord2vecEstimator
h2o.init()
job_titles = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv",
                             col_names = ["category", "jobtitle"],
                             col_types = ["string", "string"],
                             header = 1)
STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can",
"lines","re","what","there","all","we","one","the",
"a","an","of","or","in","for","by","on","but","is",
"in","a","not","with","as","was","if","they","are",
"this","and","it","have","from","at","my","be","by",
"not","that","to","from","com","org","like","likes",
"so"]
# Make the 'tokenize' function:
def tokenize(sentences, stop_word = STOP_WORDS):
    # Split on non-word characters and lowercase every token
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    # Keep tokens of at least 2 characters; NA rows mark sentence boundaries
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()),:]
    # Drop tokens that contain digits
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]",invert=True,output_logical=True),:]
    # Drop stop words
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~ tokenized_words.isin(stop_word)),:]
    return tokenized_words
# Break job titles into a sequence of words:
words = tokenize(job_titles["jobtitle"])
# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
w2v_model.train(training_frame=words)
#Save model
wv_path = 'models/'
model_path = h2o.save_model(model = w2v_model, path= wv_path ,force=True)
#Load Model
w2v_model2 = h2o.load_model(model_path)
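If the directory is the suspect, it may help to confirm it exists and is writable before calling h2o.save_model(). The following is only a minimal sketch: ensure_writable_dir and save_and_reload are hypothetical helper names, not part of h2o. It also assumes the H2O cluster runs on the same machine and under the same user as the Python script, because the model file is written by the H2O server rather than by the Python client. An absolute path such as '/models/wordvec/' would fail this check for a user without write access to the filesystem root.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create `path` if needed, confirm this process can write to it,
    and return its absolute form. Raises OSError if either step fails."""
    abs_path = os.path.abspath(path)
    os.makedirs(abs_path, exist_ok=True)
    # Creating and removing a temporary file tests write access directly.
    with tempfile.NamedTemporaryFile(dir=abs_path):
        pass
    return abs_path


def save_and_reload(model, path):
    """Hypothetical helper: save an H2O model under `path`, then load it back."""
    import h2o  # imported lazily so the directory check works without h2o

    target = ensure_writable_dir(path)
    model_path = h2o.save_model(model=model, path=target, force=True)
    return h2o.load_model(model_path)
```

With a trained model this could be called as w2v_model2 = save_and_reload(w2v_model, 'models/'). If the cluster runs on a different host, the check above says nothing about the server's filesystem, and the path has to be valid there instead.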