Is there a method to save a dict of LabelEncoders for inference?

Problem description

I am trying to build an inference pipeline. It consists of two parts: monthly ML model training on tabular order metadata from previous years, and daily inference on the new orders taken that day. There are several string categorical columns I want to include in my model, which I converted to integers with LabelEncoder. How can I make sure the daily inference dataset is converted into the same categories during preprocessing? Should I save the LabelEncoder's dictionary and apply that mapping to my inference dataset? Thanks.

Tags: python, machine-learning, data-science, inference

Solution


Typically you would serialize your LabelEncoder as follows. You can use either the pickle or the joblib module (I recommend the latter):

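import joblib

# At training time: save the already fitted encoder to disk.
joblib.dump(label_encoder, 'label_encoder.joblib')
# At inference time: load the same fitted encoder back.
label_encoder = joblib.load('label_encoder.joblib')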
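
To check that a reloaded encoder reproduces the training-time mapping, here is a minimal round-trip sketch. The sample labels, the temporary directory, and the name restored are assumptions made for illustration; they are not part of the original answer.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Fit an encoder on some illustrative training labels (made-up data).
label_encoder = LabelEncoder()
label_encoder.fit(['cat', 'dog', 'monkey', 'dog'])

# Save to a temporary location so the sketch leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), 'label_encoder.joblib')
joblib.dump(label_encoder, path)

# Later, in the inference job: load the encoder and reuse its mapping.
restored = joblib.load(path)
codes = restored.transform(['dog', 'cat'])

print(restored.classes_.tolist())  # ['cat', 'dog', 'monkey']
print(codes.tolist())              # [1, 0]
```

The integer assigned to each label is its position in the sorted classes_ array, so the loaded encoder yields the same codes as the one used for training.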

Since you asked about a dict, I assume you mean packing several LabelEncoders into a dictionary, which is something I often do with data frames. For example:

import pandas
from collections import defaultdict
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

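# One LabelEncoder per column, created on first access by column name.
d = defaultdict(preprocessing.LabelEncoder)
# Fit each column's encoder and replace the strings with integer codes.
fit = df.apply(lambda x: d[x.name].fit_transform(x))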

fit now holds the encoded data. We can reverse the encoding with:

fit.apply(lambda x: d[x.name].inverse_transform(x))

To serialize your dictionary of LabelEncoders, follow the same route as for a single encoder:

joblib.dump(d, 'label_encoder_dict.joblib')
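
On the inference side, the saved dictionary can be loaded and applied with transform (not fit_transform), so the daily data gets the codes learned during training. The following is a minimal sketch under stated assumptions: the sample frames, the temporary path, and the names train_df, new_df, and encoders are illustrative. Converting the defaultdict to a plain dict before saving is my own choice here, so that an unexpected column name raises a KeyError instead of silently creating an unfitted encoder.

```python
import os
import tempfile
from collections import defaultdict

import joblib
import pandas
from sklearn import preprocessing

# --- Monthly training job (illustrative data) ---
train_df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego'],
})
d = defaultdict(preprocessing.LabelEncoder)
train_encoded = train_df.apply(lambda x: d[x.name].fit_transform(x))

# Save as a plain dict (assumption: we want a KeyError for unknown columns).
path = os.path.join(tempfile.mkdtemp(), 'label_encoder_dict.joblib')
joblib.dump(dict(d), path)

# --- Daily inference job ---
encoders = joblib.load(path)
new_df = pandas.DataFrame({
    'pets': ['dog', 'monkey'],
    'location': ['New_York', 'San_Diego'],
})
# transform only: reuse the fitted mapping, never refit on inference data.
new_encoded = new_df.apply(lambda x: encoders[x.name].transform(x))

print(new_encoded['pets'].tolist())      # [1, 2]
print(new_encoded['location'].tolist())  # [0, 1]
```

Note that LabelEncoder.transform raises a ValueError when it meets a label it did not see during fitting, so categories that first appear in the daily data need to be handled before encoding, for example by mapping them to a placeholder value that was included in the training data.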
