c# - How do I get the vocabulary with TF-IDF word-bag weights in ML.NET?
Question
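The ML.NET documentation shows how to use context.Transforms.Text.ProduceWordBags to get a bag of words. The method takes a Transforms.Text.NgramExtractingEstimator.WeightingCriteria value as one of its parameters, so it is possible to request TfIdf weighting. The simplest example would be: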
// Entry point for all ML.NET operations.
var ml = new MLContext();
// Get a small dataset as an IEnumerable and then read it as a ML.NET data set.
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.Data.LoadFromEnumerable(data);
// Name of the text column that gets turned into a word bag.
string review = nameof(SamplesUtils.DatasetUtils.SampleTopicsData.Review);
var pipeline = ml.Transforms.Text.ProduceWordBags("bags", review, ngramLength: 1, weighting: Transforms.Text.NgramExtractingEstimator.WeightingCriteria.TfIdf);
var transformer = pipeline.Fit(trainData);
var transformed_data = transformer.Transform(trainData);
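That's all well and good, but how do I get the actual results out of transformed_data?
I did some digging in the debugger, but I'm still confused about what's actually happening here.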
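First, running the pipeline adds three extra columns to transformed_data.
After previewing the data I can see what's in these columns. To make things clearer, this is what GetTopicsData returns, and it is what we're running the transform on: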
animals birds cats dogs fish horse
horse birds house fish duck cats
car truck driver bus pickup
car truck driver bus pickup horse
That's exactly what I'm seeing in the very first bags column, typed as Vector<string>.
Moving on to the second bags column, typed as Vector<Key<UInt32, 0-12>> (no idea what 0-12 is here, by the way). This one has a KeyValues annotation on it, and it looks like for each row it maps the words to indexes in a global Vocabulary array. The Vocabulary array is part of Annotations.
So that's promising. You'd think the last bags column, typed as Vector<Single, 13>, would have the weights for each of the words! Unfortunately, that's not what I'm seeing. First of all, the same Vocabulary array is present in Annotations. And the values in the rows are all 1 or 0, which is not what TfIdf should return.
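For comparison, here is what the textbook definition would give on the four rows above. This is only a sketch: it assumes the unsmoothed form with a natural logarithm, and the exact variant ML.NET implements is not confirmed here and may differ. Every word occurs at most once per row, so tf = 1 and the weight reduces to the idf term.

```latex
\begin{align*}
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d)\cdot \ln\frac{N}{\mathrm{df}(t)}, \qquad N = 4 \\
\text{horse (3 of 4 rows):}\quad & 1 \cdot \ln\tfrac{4}{3} \approx 0.288 \\
\text{car (2 of 4 rows):}\quad & 1 \cdot \ln\tfrac{4}{2} \approx 0.693 \\
\text{animals (1 of 4 rows):}\quad & 1 \cdot \ln\tfrac{4}{1} \approx 1.386
\end{align*}
```

Under that assumption the weights would differ from word to word, which a column of plain 1 and 0 values does not show.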
So to me this looks more like "is the word at vocabulary index i present in the current row" than its TfIdf weight, which is what I'm trying to get.
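A minimal sketch of reading the final bags column outside the debugger, pairing each non-zero slot with its word from the slot-names annotation. The Doc class and the WordBagInspector helper are hypothetical names introduced for illustration (Doc stands in for SampleTopicsData), and the sketch assumes the Microsoft.ML NuGet package is referenced. It only shows where the numbers live and makes no claim about which values come out.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

// Hypothetical stand-in for SampleTopicsData; only the text column is needed.
public class Doc
{
    public string Review { get; set; }
}

public static class WordBagInspector
{
    // Fits the word-bag transform and returns, for each row, the non-zero
    // (word, weight) pairs of the final "bags" column.
    public static List<List<KeyValuePair<string, float>>> GetWeights(
        IEnumerable<Doc> docs, out string[] vocabulary)
    {
        var ml = new MLContext();
        IDataView data = ml.Data.LoadFromEnumerable(docs);

        var pipeline = ml.Transforms.Text.ProduceWordBags(
            "bags", "Review", ngramLength: 1,
            weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf);
        IDataView transformed = pipeline.Fit(data).Transform(data);

        // Looking a column up by name returns the last column with that name,
        // which is the visible Vector<Single> one. Its slot names hold one
        // word per vector slot.
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        transformed.Schema["bags"].GetSlotNames(ref slotNames);
        var names = new List<string>();
        foreach (ReadOnlyMemory<char> name in slotNames.DenseValues())
            names.Add(name.ToString());
        vocabulary = names.ToArray();

        var rows = new List<List<KeyValuePair<string, float>>>();
        foreach (VBuffer<float> row in transformed.GetColumn<VBuffer<float>>("bags"))
        {
            var pairs = new List<KeyValuePair<string, float>>();
            // Items() yields (slot index, value) pairs; zeros are skipped
            // explicitly in case the buffer is stored densely.
            foreach (KeyValuePair<int, float> item in row.Items())
            {
                if (item.Value != 0f)
                    pairs.Add(new KeyValuePair<string, float>(vocabulary[item.Key], item.Value));
            }
            rows.Add(pairs);
        }
        return rows;
    }
}
```

Calling WordBagInspector.GetWeights(docs, out var vocabulary) with the four rows above and printing each pair is a quick way to compare the stored values against what the debugger preview shows.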