SageMaker Pipelines - Processing step - SKLearn - missing SKLearn extension

Problem description

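I'm looking to automate SageMaker Pipelines so that they can build, train, and deploy models across environments. I'm not a data scientist and this field is new to me, so the struggle is real!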

I have set up a pipeline that builds the code correctly, but when it reaches preprocessing, the step fails with the error: no module named 'sklearn extensions'

The preprocess.py script is as follows:

from numpy import nan
from sagemaker_sklearn_extension.externals import Header
from sagemaker_sklearn_extension.impute import RobustImputer
from sagemaker_sklearn_extension.preprocessing import NALabelEncoder
from sagemaker_sklearn_extension.preprocessing import RobustStandardScaler
from sagemaker_sklearn_extension.preprocessing import ThresholdOneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Given a list of column names and target column name, Header can return the index
# for given column name
HEADER = Header(
   column_names=[
       '1', '2', '3', '4', '5',
       '6', '7'
   ],
   target_column_name='6'
)


def build_feature_transform():
   """ Returns the model definition representing feature processing."""

   # These features can be parsed as numeric.

   numeric = HEADER.as_feature_indices(
       ['1', '2', '3', '4']
   )

   # These features contain a relatively small number of unique items.

   categorical = HEADER.as_feature_indices(
       ['1', '2', '3', '4']
   )

   numeric_processors = Pipeline(
       steps=[
           (
               'robustimputer',
               RobustImputer(strategy='constant', fill_values=nan)
           )
       ]
   )

   categorical_processors = Pipeline(
       steps=[('thresholdonehotencoder', ThresholdOneHotEncoder(threshold=8))]
   )

   column_transformer = ColumnTransformer(
       transformers=[
           ('numeric_processing', numeric_processors, numeric),
           ('categorical_processing', categorical_processors, categorical),
       ]
   )

   return Pipeline(
       steps=[
           ('column_transformer', column_transformer),
           ('robuststandardscaler', RobustStandardScaler()),
       ]
   )


def build_label_transform():
   """Returns the model definition representing label processing."""

   return NALabelEncoder()

This is the pipeline.py script that invokes the processing step:

   # Processing step for feature engineering
   sklearn_processor = SKLearnProcessor(
       framework_version="0.23-1",
       instance_type=processing_instance_type,
       instance_count=processing_instance_count,
       base_job_name=f"{base_job_prefix}/sklearn-job-preprocess",
       sagemaker_session=sagemaker_session,
       role=role,
   )
   step_process = ProcessingStep(
       name="PreprocessJobData",
       processor=sklearn_processor,
       outputs=[
           ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
           ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
           ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
       ],
       code=os.path.join(BASE_DIR, "preprocess.py"),
       job_arguments=["--input-data", input_data],
   )

Any help would be greatly appreciated!

Tags: scikit-learn, amazon-sagemaker

Solution


First, you can install these extensions with pip, as described in the package's installation instructions.

However, to use the I/O functionality in the externals module, you also need to install mlio, which is only available through conda.
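One way to apply the pip part of this inside a processing job is to install the package from within the script itself, before the sagemaker_sklearn_extension imports run. The following is only a rough sketch under assumptions: the distribution name `sagemaker-scikit-learn-extension` should be checked against the project's README, the helper names `pip_install_command` and `ensure_installed` are made up for illustration, and the container needs network access to a package index. It does not install mlio, so it does not address the conda-only dependency mentioned above.

```python
import importlib.util
import subprocess
import sys

# Assumed PyPI distribution name; verify it against the project's README.
PACKAGE = "sagemaker-scikit-learn-extension"
# Top-level module name, taken from the imports in preprocess.py.
MODULE = "sagemaker_sklearn_extension"


def pip_install_command(package):
    """Build a pip command that targets the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(module, package, runner=subprocess.check_call):
    """Install `package` only if `module` is not importable.

    Returns True if the installer was invoked, False if the module was
    already available. `runner` is injectable so the logic can be
    exercised without touching the network.
    """
    if importlib.util.find_spec(module) is not None:
        return False
    runner(pip_install_command(package))
    # Make the freshly installed package visible to the import system.
    importlib.invalidate_caches()
    return True
```

In preprocess.py this would be called as `ensure_installed(MODULE, PACKAGE)` at the very top, with the `from sagemaker_sklearn_extension ...` imports placed after the call. Installing at run time adds startup latency to every job; a custom container image with the dependencies baked in is the usual alternative when that matters.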

