How do I get the vocabulary for TF-IDF bag-of-words weights in ML.NET?

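Question

The ML.NET documentation shows how to use context.Transforms.Text.ProduceWordBags to get a bag of words. The method takes one of the Transforms.Text.NgramExtractingEstimator.WeightingCriteria values as a parameter, so it is possible to request TfIdf weighting. The simplest example is: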

// Get a small dataset as an IEnumerable and then read it as an ML.NET data set.
var ml = new MLContext();
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.Data.LoadFromEnumerable(data);

// Build unigram bags from the Review text column, weighted by TF-IDF.
var pipeline = ml.Transforms.Text.ProduceWordBags("bags", "Review", ngramLength: 1, weighting: Transforms.Text.NgramExtractingEstimator.WeightingCriteria.TfIdf);

var transformer = pipeline.Fit(trainData);
var transformed_data = transformer.Transform(trainData);

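That's all well and good, but how do I get the actual results out of transformed_data?

I did some digging in the debugger, but I'm still confused about what is actually happening here.

First, running the pipeline adds three extra columns to transformed_data, all named bags.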
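
The same column list can be printed without the debugger. Below is a minimal sketch, assuming the Microsoft.ML package is referenced; SchemaDump is a hypothetical helper name, not part of ML.NET.

```csharp
using System;
using Microsoft.ML;

// Hypothetical helper (not an ML.NET type): lists every column of a data view.
public static class SchemaDump
{
    public static void Print(IDataView view)
    {
        // Enumerating the schema also yields hidden columns, i.e. earlier
        // columns that a later column with the same name has shadowed.
        foreach (DataViewSchema.Column column in view.Schema)
        {
            Console.WriteLine($"{column.Index}: {column.Name} ({column.Type}) hidden={column.IsHidden}");
        }
    }
}
```

Calling SchemaDump.Print(transformed_data) should list each bags column together with its type and whether it is hidden.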

Previewing the data, I can see what is in those columns. To make things clearer, here is what GetTopicsData returns, which is what we are running the transform on:

animals birds cats dogs fish horse
horse birds house fish duck cats
car truck driver bus pickup
car truck driver bus pickup horse

That's exactly what I'm seeing in the very first bags column, typed as Vector<string>.

Moving on to the second bags column, typed as Vector<Key<UInt32, 0-12>> (no idea what 0-12 is here btw.).

This one has a KeyValues annotation on it, and it looks like for each row it maps the words to indexes in a global Vocabulary array.

The Vocabulary array is part of Annotations.

So that's promising. You'd think the last bags column, typed as Vector<Single, 13>, would have the weights for each of the words! Unfortunately, that's not what I'm seeing. First of all, the same Vocabulary array is present in Annotations.

And the values in the rows are 1/0, which is not what TfIdf should return.

So to me this looks more like "does word i from the vocabulary appear in the current row" than the word's TfIdf weight, which is what I'm trying to get.
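
For reference, here is a minimal sketch of reading the vocabulary and the per-row values of the last bags column in code rather than in the debugger. It assumes the Microsoft.ML package is referenced. Doc and BagReader are hypothetical names introduced for illustration, with Doc standing in for SampleTopicsData, and the four rows are the ones listed above. It only pairs each slot name with the value stored in that slot; it does not explain why those values come out as 1/0.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

// Hypothetical stand-in for SampleTopicsData; only the text column is needed.
public class Doc
{
    public string Review { get; set; }
}

public static class BagReader
{
    // Returns the slot names (the vocabulary) attached to a vector column.
    public static string[] Vocabulary(IDataView view, string column)
    {
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        // Looking a column up by name returns the visible (last) one.
        view.Schema[column].GetSlotNames(ref slotNames);
        return slotNames.DenseValues().Select(s => s.ToString()).ToArray();
    }

    public static void Main()
    {
        var ml = new MLContext();
        var docs = new[]
        {
            new Doc { Review = "animals birds cats dogs fish horse" },
            new Doc { Review = "horse birds house fish duck cats" },
            new Doc { Review = "car truck driver bus pickup" },
            new Doc { Review = "car truck driver bus pickup horse" },
        };
        IDataView trainData = ml.Data.LoadFromEnumerable(docs);

        var pipeline = ml.Transforms.Text.ProduceWordBags(
            "bags", "Review", ngramLength: 1,
            weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf);
        IDataView transformed = pipeline.Fit(trainData).Transform(trainData);

        string[] vocabulary = Vocabulary(transformed, "bags");
        foreach (float[] weights in transformed.GetColumn<float[]>("bags"))
        {
            // Pair each slot name with the value stored in the same slot.
            var pairs = vocabulary.Zip(weights, (word, weight) => $"{word}={weight}");
            Console.WriteLine(string.Join(", ", pairs));
        }
    }
}
```

Each printed line corresponds to one input row, so the output can be compared directly against the weights expected from TfIdf.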

Tags: c#, tf-idf, ml.net
