dask - TF-IDF calculation in Dask
Problem description
Apache Spark comes with a package for TF-IDF calculations that I find quite handy: https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Is there an equivalent, or some other way to do this, in Dask? If so, can it also be done on a horizontally scaled Dask deployment (i.e., a cluster with multiple GPUs)?
Solution
This was also asked on the dask gitter, with the following reply by @stsievert :
counting/hashing vectorizer are similar. They’re in Dask-ML and are the same as TFIDF without the normalization/function.
I think this would be a good github issue/feature request.
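To make the "normalization/function" part of that reply concrete, here is a minimal pure-Python sketch of the step a counting vectorizer leaves out: turning raw term counts into TF-IDF weights. The helper name `tfidf_from_counts` is hypothetical, and the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization is an assumption (one common variant), not something the reply specifies.

```python
import math
from collections import Counter


def tfidf_from_counts(count_rows):
    """Turn per-document term counts into L2-normalized TF-IDF weights.

    count_rows: list of dicts mapping term -> raw count, one dict per document
    (roughly what a counting vectorizer would give you, one row per document).
    """
    n_docs = len(count_rows)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for row in count_rows:
        doc_freq.update(row.keys())

    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

    weighted_rows = []
    for row in count_rows:
        weighted = {term: count * idf[term] for term, count in row.items()}
        # L2-normalize each document so that rows are comparable.
        norm = math.sqrt(sum(w * w for w in weighted.values())) or 1.0
        weighted_rows.append({term: w / norm for term, w in weighted.items()})
    return weighted_rows


docs = ["dask scales python", "spark scales jvm", "dask and spark"]
counts = [Counter(doc.split()) for doc in docs]
weights = tfidf_from_counts(counts)
```

Terms that occur in fewer documents ("python" here) end up with a larger weight than terms shared across documents ("dask", "scales"), which is the effect that plain counting or hashing does not give you.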
See the Dask-ML API documentation for HashingVectorizer.