How to shuffle a sparse vector in Spark using Scala

Problem description

I have a sparse vector in Spark and I want to randomly shuffle (reorder) its contents. The vector is actually a tf-idf vector, and I want to reorder it so that in my new dataset the features appear in a different order. Is there a way to do this in Scala? Here is the code I use to generate the tf-idf vectors:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(data).cache()
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(wordsData)
val featurizedData = cvModel.transform(wordsData).cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData).cache()

Tags: scala, apache-spark, apache-spark-mllib

Solution


Maybe this is useful -

Load the test data

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)
df.printSchema()

    /**
      * +---------------------+
      * |features             |
      * +---------------------+
      * |(5,[1,3],[1.0,7.0])  |
      * |[2.0,0.0,3.0,4.0,5.0]|
      * |[4.0,0.0,0.0,6.0,7.0]|
      * +---------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      */

Shuffle the vector

import scala.collection.mutable

// Densify the vector, shuffle its values, and rebuild a dense vector
val shuffleVector = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray)
)

val p = df.withColumn("shuffled_vector", shuffleVector($"features"))
p.show(false)
p.printSchema()

    /**
      * +---------------------+---------------------+
      * |features             |shuffled_vector      |
      * +---------------------+---------------------+
      * |(5,[1,3],[1.0,7.0])  |[1.0,0.0,0.0,0.0,7.0]|
      * |[2.0,0.0,3.0,4.0,5.0]|[0.0,3.0,2.0,5.0,4.0]|
      * |[4.0,0.0,0.0,6.0,7.0]|[4.0,7.0,6.0,0.0,0.0]|
      * +---------------------+---------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      * |-- shuffled_vector: vector (nullable = true)
      */

You could also wrap the above udf in a custom Transformer and put it into a Pipeline, as sketched below.

Be sure to use import org.apache.spark.ml.linalg._
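
For illustration, here is a minimal sketch of such a Transformer, assuming Spark ML's usual Params pattern; the class name VectorShuffler and its parameter names are made up for this example, and persistence (DefaultParamsWritable) is left out:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical Transformer wrapping the shuffle udf (names are illustrative)
class VectorShuffler(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("vectorShuffler"))

  final val inputCol = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Same idea as the udf above: densify the vector and shuffle its values
  private val shuffle = udf((vector: Vector) =>
    Vectors.dense(scala.util.Random.shuffle(vector.toArray.toSeq).toArray)
  )

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), shuffle(col($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    schema.add(StructField($(outputCol), VectorType, nullable = true))

  override def copy(extra: ParamMap): VectorShuffler = defaultCopy(extra)
}

It can then be used like any other stage, e.g. new VectorShuffler().setInputCol("features").setOutputCol("shuffled_vector"), either called directly via transform or appended as the last stage of a Pipeline.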

Update-1: Convert the shuffled vector to a sparse vector

val shuffleVectorToSparse = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray).toSparse
)

val p1 = df.withColumn("shuffled_vector", shuffleVectorToSparse($"features"))
p1.show(false)
p1.printSchema()

    /**
      * +---------------------+-------------------------------+
      * |features             |shuffled_vector                |
      * +---------------------+-------------------------------+
      * |(5,[1,3],[1.0,7.0])  |(5,[0,3],[1.0,7.0])            |
      * |[2.0,0.0,3.0,4.0,5.0]|(5,[1,2,3,4],[5.0,3.0,2.0,4.0])|
      * |[4.0,0.0,0.0,6.0,7.0]|(5,[1,3,4],[7.0,4.0,6.0])      |
      * +---------------------+-------------------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      * |-- shuffled_vector: vector (nullable = true)
      */
