TF-IDF calculation in Dask

Problem description

Apache Spark comes with a package for TF-IDF calculations that I find quite handy: https://spark.apache.org/docs/latest/mllib-feature-extraction.html

Is there an equivalent, or a way to do this with Dask? If so, can it also be done on a horizontally scaled Dask deployment (i.e., a cluster with multiple GPUs)?

Tags: dask

Solution


This was also asked on the Dask Gitter, where @stsievert replied:

counting/hashing vectorizer are similar. They’re in Dask-ML and are the same as TFIDF without the normalization/function.

I think this would be a good github issue/feature request.
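To make the remark about "TFIDF without the normalization" concrete, here is a minimal pure-Python sketch of what has to be layered on top of raw term counts. It is only an illustration: the toy documents and variable names are made up, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization is one common convention (the one scikit-learn uses by default), not something specified in the reply above.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "dask scales python",
    "spark scales jvm",
    "dask and spark",
]

# Step 1: raw term counts per document.
# This is the part a counting vectorizer already provides.
counts = [Counter(doc.split()) for doc in docs]

# Step 2: document frequency, i.e. how many documents contain each term.
df = Counter(term for c in counts for term in c)

# Step 3: smoothed inverse document frequency (assumed convention).
n_docs = len(docs)
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}


def tfidf_row(term_counts):
    """Weight counts by IDF, then scale the row to unit L2 norm."""
    weighted = {t: n * idf[t] for t, n in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weighted.values()))
    return {t: w / norm for t, w in weighted.items()}


# Step 4: the missing piece, applied to every document.
tfidf = [tfidf_row(c) for c in counts]
```

Steps 2 to 4 are the part that needs a pass over the whole corpus (the document frequencies), which is why it does not come for free with a stateless hashing vectorizer.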

The API reference for HashingVectorizer is in the Dask-ML documentation, under dask_ml.feature_extraction.text.HashingVectorizer.

