Is there a way to use VarVector to represent raw data in ML.NET K-means clustering?

Problem description

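I want to use ML.NET K-means clustering on some "raw" vectors that I generate in memory by processing another dataset. I would like to be able to choose the length of the vectors at runtime. All vectors in a given model will have the same length, but that length may vary from model to model as I try different clustering approaches.

I use the following code: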

public class MyVector
{
   [VectorType]
   public float[] Values;
}

void Train()
{

    var vectorSize = GetVectorSizeFromUser();

    // Placeholder: process the dataset to create an array of MyVector, each with 'vectorSize' values.
    MyVector[] vectors = CreateVectors(vectorSize); // hypothetical helper, not shown

    var mlContext = new MLContext();

    string featuresColumnName = "Features";
    var pipeline = mlContext
        .Transforms
        .Concatenate(featuresColumnName, nameof(MyVector.Values))
        .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));

    var trainingData = mlContext.Data.LoadFromEnumerable(vectors);

    Console.WriteLine("Training...");
    var model = pipeline.Fit(trainingData);
}

The problem is that when I try to train, I get this exception:

Schema mismatch for feature column 'Features': expected Vector<Single>, got VarVector<Single> (Parameter 'inputSchema')

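For any given value of vectorSize (say 20), I can avoid this by using [VectorType(20)], but the key point is that I don't want to depend on a specific compile-time value. Is there a way to use dynamically sized data for this kind of training?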
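
For reference, the fixed-size variant mentioned above might look like the following minimal sketch. The class name MyVector20 is hypothetical, and 20 is just the example value from the question, baked in at compile time.

```csharp
using Microsoft.ML.Data;

// Hypothetical fixed-size variant: the length (20) is a compile-time constant,
// so the column is loaded as a known-size vector instead of a VarVector.
public class MyVector20
{
    [VectorType(20)]
    public float[] Values;
}
```

This avoids the exception, but every different vector length would need its own class or a recompile, which is exactly what the question is trying to avoid.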

I can imagine various nasty workarounds, including building the data view dynamically with dummy columns, but I'm hoping there is a simpler way.

Tags: c#, ml.net

Solution


Thanks to Jon for finding a link to an issue that contained the required information. The trick is to override the SchemaDefinition at runtime:

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class MyVector
{
   // No [VectorType] attribute is needed here, since the column type is overridden in the custom schema below.
   public float[] Values;
}

void Train()
{

    var vectorSize = GetVectorSizeFromUser();

    // Placeholder: process the dataset to create an array of MyVector, each with 'vectorSize' values.
    MyVector[] vectors = CreateVectors(vectorSize); // hypothetical helper, not shown

    var mlContext = new MLContext();

    string featuresColumnName = "Features";
    var pipeline = mlContext
        .Transforms
        .Concatenate(featuresColumnName, nameof(MyVector.Values))
        .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));

    //create a custom schema-definition that overrides the type for the Values field...  
    var schemaDef = SchemaDefinition.Create(typeof(MyVector));
    schemaDef[nameof(MyVector.Values)].ColumnType 
                  = new VectorDataViewType(NumberDataViewType.Single, vectorSize);

    //use that schema definition when creating the training dataview  
    var trainingData = mlContext.Data.LoadFromEnumerable(vectors,schemaDef);

    Console.WriteLine("Training...");
    var model = pipeline.Fit(trainingData);

    //Note that the schema-definition must also be supplied when creating the prediction engine...

    var predictor = mlContext
                    .Model
                    .CreatePredictionEngine<MyVector,ClusterPrediction>(model, 
                                          inputSchemaDefinition: schemaDef);

    //now we can use the engine to predict which cluster a vector belongs to...
    // Any MyVector with 'vectorSize' values works; the first training vector is used here as an example.
    var prediction = predictor.Predict(vectors[0]);
}
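
The code above refers to a ClusterPrediction type that is not shown. A minimal sketch of what it might look like, assuming the usual output columns of the ML.NET K-means trainer (PredictedLabel and Score); the field names are illustrative, not taken from the original answer:

```csharp
using Microsoft.ML.Data;

// Assumed output class for the prediction engine (not part of the original answer).
public class ClusterPrediction
{
    // ID of the cluster the input vector was assigned to.
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    // One score per cluster: the distance measure from the input to each centroid.
    [ColumnName("Score")]
    public float[] Distances;
}
```

With numberOfClusters: 3 as in the pipeline above, Distances would be expected to hold three entries, one per centroid.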
