scikit-learn - SageMaker Pipelines - ProcessingStep - SKLearn - missing SKLearn extensions
Problem description
I am looking to automate SageMaker Pipelines so that it can build, train, and deploy models across environments. I am not a data scientist and this area is new to me, so the struggle is real!
I have set up a pipeline that builds the code correctly, but when it comes time to preprocess, the step fails with the error no module named 'sklearn extensions'.
The preprocess.py script is as follows:
from numpy import nan
from sagemaker_sklearn_extension.externals import Header
from sagemaker_sklearn_extension.impute import RobustImputer
from sagemaker_sklearn_extension.preprocessing import NALabelEncoder
from sagemaker_sklearn_extension.preprocessing import RobustStandardScaler
from sagemaker_sklearn_extension.preprocessing import ThresholdOneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Given a list of column names and target column name, Header can return the
# index for a given column name
HEADER = Header(
    column_names=[
        '1', '2', '3', '4', '5',
        '6', '7'
    ],
    target_column_name='6'
)


def build_feature_transform():
    """Returns the model definition representing feature processing."""
    # These features can be parsed as numeric.
    numeric = HEADER.as_feature_indices(['1', '2', '3', '4'])

    # These features contain a relatively small number of unique items.
    categorical = HEADER.as_feature_indices(['1', '2', '3', '4'])

    numeric_processors = Pipeline(
        steps=[
            ('robustimputer', RobustImputer(strategy='constant', fill_values=nan))
        ]
    )

    categorical_processors = Pipeline(
        steps=[('thresholdonehotencoder', ThresholdOneHotEncoder(threshold=8))]
    )

    column_transformer = ColumnTransformer(
        transformers=[
            ('numeric_processing', numeric_processors, numeric),
            ('categorical_processing', categorical_processors, categorical)
        ]
    )

    return Pipeline(
        steps=[
            ('column_transformer', column_transformer),
            ('robuststandardscaler', RobustStandardScaler())
        ]
    )


def build_label_transform():
    """Returns the model definition representing label processing."""
    return NALabelEncoder()
This is the pipeline.py script that invokes the processing step:
# processing step for feature engineering
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/sklearn-job-preprocess",
    sagemaker_session=sagemaker_session,
    role=role,
)
step_process = ProcessingStep(
    name="PreprocessJobData",
    processor=sklearn_processor,
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code=os.path.join(BASE_DIR, "preprocess.py"),
    job_arguments=["--input-data", input_data],
)
Any help would be greatly appreciated!
Solution
First, you can install these extensions from pip as described here. However, to use the I/O functionality in the externals module, you also need to install mlio, which is available only through conda.
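A minimal environment-setup sketch; the package and channel names below are taken from the sagemaker-scikit-learn-extension project's published install instructions, so verify them against the current docs:

```shell
# Install the SageMaker scikit-learn extensions from PyPI
pip install sagemaker-scikit-learn-extension

# mlio (required by sagemaker_sklearn_extension.externals for I/O) is
# distributed through conda only -- channels assumed, verify before use
conda install -c mlio -c conda-forge mlio-py
```

Note that the container launched by SKLearnProcessor will not have these packages preinstalled, so installing them on your local machine alone will not fix the ProcessingStep; if you cannot run these installs inside that container, building a custom processing image with the dependencies baked in is the more reliable route.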