python - Is there a method to save a dict of LabelEncoders for inference?
Problem description
I am trying to build an inference pipeline. It consists of two parts: monthly ML model training on tabular order metadata from previous years, and daily inference on the new orders taken that day. There are several string categorical columns I want to include in my model, which I converted into integers with LabelEncoder. How can I make sure the daily inference dataset is converted into the same categories during data preprocessing? Should I save the dictionary of the LabelEncoder and apply the mapping to my inference dataset? Thanks.
Solution
Normally you would just serialize your LabelEncoder. You can use either the pickle or the joblib module (I would recommend the latter). Code:
import joblib
# At training time: save the fitted encoder
joblib.dump(label_encoder, 'label_encoder.joblib')
# At inference time: load the same fitted encoder
label_encoder = joblib.load('label_encoder.joblib')
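Since you asked about a dict, I assume you may mean packing the LabelEncoders into a dictionary, one per column, which is something I often do with data frames. For example: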
import pandas
from collections import defaultdict
from sklearn import preprocessing
df = pandas.DataFrame({
'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
'New_York']
})
d = defaultdict(preprocessing.LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
fit
fit now holds the encoded data. We can reverse the encoding with:
fit.apply(lambda x: d[x.name].inverse_transform(x))
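If what you want is the plain label-to-integer mapping rather than the encoder object, it can be read off a fitted encoder. This is a minimal sketch: the variable names le and mapping are my own, and it relies on the fact that LabelEncoder stores its sorted labels in classes_ and uses each label's position there as its code.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'])

# classes_ holds the sorted unique labels; a label's index is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

A mapping like this is easy to inspect or store as JSON, but saving the encoder itself keeps transform and inverse_transform available at inference time.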
To serialize the dictionary of LabelEncoders, follow the same route as for the single encoder:
joblib.dump(d, 'label_encoder_dict.joblib')
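At inference time the saved encoders should be loaded and applied with transform, not fit_transform, so that the new data gets the same integer codes as the training data. The following is a sketch under my own assumptions: the column values, the temporary file path, and the names train, new_orders and loaded are illustrative, and I use a plain dict with an explicit loop over columns instead of the defaultdict/apply pattern above.

```python
import os
import tempfile

import joblib
import pandas
from sklearn import preprocessing

# Training time: fit one LabelEncoder per categorical column.
train = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego'],
})
encoders = {col: preprocessing.LabelEncoder().fit(train[col])
            for col in train.columns}

# Save the whole dict (a temporary directory is used here for illustration).
path = os.path.join(tempfile.mkdtemp(), 'label_encoder_dict.joblib')
joblib.dump(encoders, path)

# Inference time: load the fitted encoders and only call transform.
loaded = joblib.load(path)
new_orders = pandas.DataFrame({
    'pets': ['dog', 'monkey'],
    'location': ['New_York', 'San_Diego'],
})
new_encoded = pandas.DataFrame(
    {col: loaded[col].transform(new_orders[col]) for col in new_orders.columns},
    index=new_orders.index,
)
print(new_encoded)

# A label that was not seen during fitting makes transform raise ValueError.
unseen_error = None
try:
    loaded['pets'].transform(['parrot'])
except ValueError as exc:
    unseen_error = exc
print('unseen label rejected:', unseen_error is not None)
```

Because transform rejects labels it did not see during fitting, the daily job needs some policy for new categories, for example retraining the encoders in the monthly run or filtering such rows before encoding.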