ML.NET issues using vb.net and the SDCA trainer

Problem description

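I'm trying to develop a simple cost prediction engine that estimates the price of oxy-fuel cut steel parts from geometric information and our historical data. The utility is part of a larger application written in vb.net, so I'm forced to use that language.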

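All the information I have found about ML.NET is based on C#, and as far as I can tell the vb.net usage is not quite the same, so adapting it to this language is becoming a nightmare. It even looks as if the vb.net flavour is missing some trainers, lacks some features and is less well supported.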

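First, since this is a numeric regression problem, I assumed the SDCA trainer was the best option, so that is the route I took. I fed the system some "fictitious" data: I used Excel to generate "logical" (almost linear!) costs from random inputs for 1000 parts. I would expect any regression predictor to handle this test data very accurately, or at least far better than the real data it will eventually see.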

Here is my simplified code, which builds and trains the model from a .csv file and then tests it with 4 of the inputs used for training:

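'Requires the Microsoft.ML NuGet package
Imports Microsoft.ML
Imports Microsoft.ML.Data
Imports System.Windows.Forms   'MessageBox

Public Class CShapeCostPrediction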

    Public Class CShapeInput
        <ColumnName("AgeFrom1990"), LoadColumn(0)>
        Public Property AgeFrom1990 As Single

        <ColumnName("Area"), LoadColumn(1)>
        Public Property Area As Single

        <ColumnName("RectangularArea"), LoadColumn(2)>
        Public Property RectangularArea As Single

        <ColumnName("Thickness"), LoadColumn(3)>
        Public Property Thickness As Single

        <ColumnName("Perimeter"), LoadColumn(4)>
        Public Property Perimeter As Single

        <ColumnName("Cuts"), LoadColumn(5)>
        Public Property Cuts As Single

        <ColumnName("Cost"), LoadColumn(6)>
        Public Property CostReal As Single

        Public Sub New()
        End Sub

        'For testing
        Public Sub New(sAge As Single, sArea As Single, sRectArea As Single, sThick As Single, sPerim As Single, sCuts As Single)
            AgeFrom1990 = sAge
            Area = sArea
            RectangularArea = sRectArea
            Thickness = sThick
            Perimeter = sPerim
            Cuts = sCuts
        End Sub

    End Class

    Public Class CShapeOutput
        Public Property Score As Single
    End Class

    'Shared members
    Public Shared Context As MLContext
    Public Shared PredictionEngine As PredictionEngine(Of CShapeInput, CShapeOutput)

    'Main simplified testing workflow
    Public Shared Function Testing() As Boolean
        Context = New MLContext()

        Dim oTrainingDataView As IDataView = Context.Data.LoadFromTextFile(Of CShapeInput)(path:="D:\ShapeInfo.csv",
                                                                                    hasHeader:=True,
                                                                                    separatorChar:=CChar(";"),
                                                                                    allowQuoting:=True, allowSparse:=False)

        'Normalization. Reportedly required for SDCA trainer
        Dim oNormalize As EstimatorChain(Of Transforms.NormalizingTransformer) = Context.Transforms.NormalizeMeanVariance("AgeFrom1990").
                                                                                    Append(Context.Transforms.NormalizeMeanVariance("Area")).
                                                                                    Append(Context.Transforms.NormalizeMeanVariance("RectangularArea")).
                                                                                    Append(Context.Transforms.NormalizeMeanVariance("Thickness")).
                                                                                    Append(Context.Transforms.NormalizeMeanVariance("Perimeter")).
                                                                                    Append(Context.Transforms.NormalizeMeanVariance("Cuts"))
        'Concatenate to features
        Dim oConcatenate As EstimatorChain(Of ColumnConcatenatingTransformer) = oNormalize.Append(Context.Transforms.Concatenate("Features", "AgeFrom1990", "Area", "RectangularArea", "Thickness", "Perimeter", "Cuts"))

        'Trainer to predict a label from a feature
        Dim oTrainer As Trainers.SdcaRegressionTrainer = Context.Regression.Trainers.Sdca(labelColumnName:="Cost", featureColumnName:="Features")

        Dim oTrainingPipeline As IEstimator(Of ITransformer) = oConcatenate.Append(oTrainer)

        Dim oTrainedModel As ITransformer = oTrainingPipeline.Fit(oTrainingDataView)   'Too fast!?

        Dim oCrossValidationResults As IEnumerable(Of TrainCatalogBase.CrossValidationResult(Of RegressionMetrics)) = Context.Regression.CrossValidate(oTrainingDataView, oTrainingPipeline, numberOfFolds:=5, labelColumnName:="Cost")

        'Get some metrics and show them
        Dim dRSQuared As Double = 0.0
        Dim dRootMeanSquaredError As Double = 0.0
        For Each oCVResult As TrainCatalogBase.CrossValidationResult(Of RegressionMetrics) In oCrossValidationResults
            dRSQuared += oCVResult.Metrics.RSquared
            dRootMeanSquaredError += oCVResult.Metrics.RootMeanSquaredError
        Next
        Dim dCount As Double = CDbl(oCrossValidationResults.LongCount)
        dRSQuared /= dCount
        dRootMeanSquaredError /= dCount
        MessageBox.Show(String.Format("R-Squared: {0:0.000}" & Environment.NewLine() & "Root Mean Squared Error (RMSE): {1:0.000}", dRSQuared, dRootMeanSquaredError))

        'Model saving for later use
        If IO.File.Exists("D:\ShapeModel.zip") Then
            IO.File.Delete("D:\ShapeModel.zip")
        End If
        Context.Model.Save(oTrainedModel, oTrainingDataView.Schema, "D:\ShapeModel.zip")

        'Build prediction engine
        PredictionEngine = Context.Model.CreatePredictionEngine(Of CShapeInput, CShapeOutput)(oTrainedModel)

        'Some testing using some of the same values in the feeding data
        Dim oTestInputs As New List(Of CShapeInput)
        oTestInputs.Add(New CShapeInput(26, 0.553079716, 1.624771712, 47, 4.905492266, 3))     'Cost = 193.42
        oTestInputs.Add(New CShapeInput(40, 0.006435867, 0.018295898, 12, 0.495820115, 4))     'Cost = 0.60
        oTestInputs.Add(New CShapeInput(26, 0.948809904, 3.598203278, 96, 7.049619315, 8))     'Cost = 703.96
        oTestInputs.Add(New CShapeInput(5, 0.814014957, 1.391183561, 10, 3.985410019, 3))     'Cost = 56.71

        'Predict
        Dim oTestOutputs As New List(Of CShapeOutput)
        oTestOutputs.Add(PredictionEngine.Predict(oTestInputs(0)))
        oTestOutputs.Add(PredictionEngine.Predict(oTestInputs(1)))
        oTestOutputs.Add(PredictionEngine.Predict(oTestInputs(2)))
        oTestOutputs.Add(PredictionEngine.Predict(oTestInputs(3)))

        MessageBox.Show(String.Format("Cost 1: {0:0.000}" & Environment.NewLine() & "Cost 2: {1:0.000}" & Environment.NewLine() & "Cost 3: {2:0.000}" & Environment.NewLine() & "Cost 4: {3:0.000}", oTestOutputs(0).Score, oTestOutputs(1).Score, oTestOutputs(2).Score, oTestOutputs(3).Score))

        Return True
    End Function
End Class

First rows of the input data (the test file has 1000 rows):

AgeFrom1990;Area;RectangularArea;Thickness;Perimeter;Cuts;Cost
26.000;0.553;1.625;47.000;4.905;3.000;193.425
23.000;0.198;0.351;33.000;3.520;7.000;48.176
5.000;0.740;2.981;55.000;4.727;6.000;310.574
39.000;0.110;0.182;41.000;1.263;4.000;32.389
40.000;0.111;0.557;27.000;1.890;1.000;23.167
15.000;0.635;0.826;51.000;3.589;1.000;218.191
18.000;0.763;0.994;89.000;5.638;9.000;482.146
36.000;0.095;0.143;87.000;1.455;7.000;60.164
15.000;0.942;1.404;50.000;4.037;1.000;319.190
34.000;0.124;0.189;17.000;2.205;6.000;15.295
35.000;0.679;3.285;63.000;6.535;5.000;335.729
18.000;0.240;1.060;17.000;2.123;3.000;31.298

The metrics obtained are: R-squared: 0.857, RMSE: 63.41

However, the predictions are far from correct (expected / obtained):

Test 1: 193.4 / 192.8 (first row of the .csv file; works fine)

Test 2: 0.6 / -156.7 (negative!)

Test 3: 555.9 / 703.9

Test 4: 56.7 / 128.5
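To put a single number on the gap, the four expected / obtained pairs listed above can be folded into one error figure. The following is a minimal sketch and not part of the original code: the names `ErrorCheck`, `Rmse` and `RmseOfQuotedTests` are made up for illustration, and the inputs are simply the rounded values quoted above.

```vbnet
Imports System

Public Module ErrorCheck
    ' Root mean squared error between expected and predicted values.
    ' Hypothetical helper for illustration; not an ML.NET API.
    Public Function Rmse(expected As Double(), predicted As Double()) As Double
        If expected.Length <> predicted.Length OrElse expected.Length = 0 Then
            Throw New ArgumentException("Arrays must be non-empty and of equal length.")
        End If
        Dim sumSquares As Double = 0.0
        For i As Integer = 0 To expected.Length - 1
            Dim diff As Double = expected(i) - predicted(i)
            sumSquares += diff * diff
        Next
        Return Math.Sqrt(sumSquares / expected.Length)
    End Function

    ' RMSE over the four expected / obtained pairs quoted in the post.
    Public Function RmseOfQuotedTests() As Double
        Dim expected As Double() = {193.4, 0.6, 555.9, 56.7}
        Dim obtained As Double() = {192.8, -156.7, 703.9, 128.5}
        Return Rmse(expected, obtained)
    End Function
End Module
```

On these four rows the error comes out at roughly 113.8, well above the cross-validated RMSE of 63.41, although four rows are far too few to draw firm conclusions from.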

The only accurately predicted result is the first row of the .csv file, so I'm not sure whether the whole file is actually being read.

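Also, the Fit step is very fast, taking about 1-2 seconds, whereas the metrics evaluation (cross-validation) takes noticeably longer. That seems odd to me, since I assumed that fitting 1000 inputs would need some processing. I get the same predictions if I skip the cross-validation step, so it doesn't seem to be mandatory at all.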
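One likely contributor to the timing difference is what 5-fold cross-validation does: it fits the pipeline once per fold on the rows outside that fold, independently of the earlier `Fit` call, which would also explain why skipping it leaves the predictions unchanged. The sketch below only does the bookkeeping for a k-fold split. `FoldPlan` and `TrainTestSizes` are hypothetical names, and ML.NET's own splitting is randomized, so its fold sizes need not be exactly equal.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module FoldPlan
    ' Returns (training rows, test rows) for each fold of a k-fold split.
    ' Illustrative arithmetic only; not how ML.NET partitions data internally.
    Public Function TrainTestSizes(rowCount As Integer, folds As Integer) As List(Of Tuple(Of Integer, Integer))
        If folds < 2 OrElse rowCount < folds Then
            Throw New ArgumentException("Need at least 2 folds and one row per fold.")
        End If
        Dim sizes As New List(Of Tuple(Of Integer, Integer))
        Dim baseSize As Integer = rowCount \ folds
        Dim remainder As Integer = rowCount Mod folds
        For fold As Integer = 0 To folds - 1
            ' The first "remainder" folds get one extra test row.
            Dim testRows As Integer = baseSize + If(fold < remainder, 1, 0)
            sizes.Add(Tuple.Create(rowCount - testRows, testRows))
        Next
        Return sizes
    End Function
End Module
```

For `TrainTestSizes(1000, 5)` every fold trains on 800 rows and evaluates on 200, so cross-validation amounts to five fits plus five evaluations, against the single fit performed by `Fit`.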

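Honestly, my knowledge of all this is really rudimentary, and my code is the result of copying and adapting C# snippets from different sources, so I'm sure it is far from acceptable.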

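For example, I'm not confident about the way the normalization and the column concatenation are done, with each result appended to the previous one and stored in a variable of a different return type. All the information I found on this workflow is written in C#, which skips the explicit data types in a more direct way.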
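For what it's worth, VB.NET can skip the explicit types much as C# does with `var`: with `Option Infer On`, `Dim x = expression` takes the type of the expression. The sketch below rebuilds the same pipeline that way, assuming the Microsoft.ML package is referenced. It also concatenates first and normalizes the resulting `Features` vector in a single step, on the assumption that `NormalizeMeanVariance` handles each slot of a vector column separately; that is a variation on the code above, not a confirmed fix for the prediction problem. `PipelineSketch` and `BuildPipeline` are illustrative names.

```vbnet
Option Strict On
Option Infer On

Imports Microsoft.ML

Public Module PipelineSketch
    ' Builds concatenate -> normalize -> SDCA without naming intermediate types.
    Public Function BuildPipeline(context As MLContext) As IEstimator(Of ITransformer)
        ' Inferred as an estimator chain; the generic type is never spelled out.
        Dim oFeatures = context.Transforms.Concatenate("Features",
                "AgeFrom1990", "Area", "RectangularArea", "Thickness", "Perimeter", "Cuts").
            Append(context.Transforms.NormalizeMeanVariance("Features"))

        Dim oTrainer = context.Regression.Trainers.Sdca(labelColumnName:="Cost", featureColumnName:="Features")

        ' A chain ending in a trainer can be returned as IEstimator(Of ITransformer),
        ' the same type the original code uses for oTrainingPipeline.
        Return oFeatures.Append(oTrainer)
    End Function
End Module
```

The returned estimator can be passed to `Fit` and `CrossValidate` exactly like `oTrainingPipeline` in the code above.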

Any information would be greatly appreciated, as I haven't managed to find anything.

Thanks in advance!

Tags: vb.net, ml.net
