How to convert Spark DataFrame rows to sparse vectors to make a RowMatrix object

Problem description

The Spark SVD code example looks like this:

val data = Array(
   Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
   Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
   Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))

val rows = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(rows)

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(5, computeU = true)

Now the question is how to build the RDD of sparse vectors (RDD[Vector]) that the SVD function expects. The data looks like this, and is the serialized form of Vectors.sparse:

{"type":0,"size":205209,"indices":[24119,32380,201090],"values":[1.8138314440983385,1.6036455249478836,1.3787660101958308]}
{"type":0,"size":205209,"indices":[24119,32380,176747,201090],"values":[5.441494332295015,3.207291049895767,3.2043056252302478,2.7575320203916616]}
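An aside on the format: these lines appear to be the standard JSON serialization that spark.mllib uses for vectors ("type":0 marks a sparse vector, "type":1 a dense one), which can be parsed back into a Vector with Vectors.fromJson without any Spark context. A minimal sketch with a smaller vector:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Parse one JSON-serialized sparse vector ("type":0) back into an mllib Vector.
val v = Vectors.fromJson("""{"type":0,"size":5,"indices":[1,3],"values":[1.0,7.0]}""")

// v.size is 5; v(3) is 7.0; unspecified indices are 0.0.
```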

So far I have tried this:

val rows = df_raw_features.select("raw_features").rdd.map(Vectors.sparse).map(Row(_))

and I got this error:

[error] /home/lujunchen/project/spark_code/src/main/scala/svd_feature_engineer.scala:39:71: type mismatch;
[error]  found   : (size: Int, elements: Iterable[(Integer, Double)])org.apache.spark.mllib.linalg.Vector <and> (size: Int, elements: Seq[(Int, Double)])org.apache.spark.mllib.linalg.Vector <and> (size: Int, indices: Array[Int], values: Array[Double])org.apache.spark.mllib.linalg.Vector
[error]  required: org.apache.spark.sql.Row => ?
[error]     val rows = df_raw_features.select("raw_features").rdd.map(Vectors.sparse).map(Row(_))

Tags: apache-spark, apache-spark-mllib

Solution
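No answer was recorded here, but the error message itself points at the fix: Vectors.sparse is an overloaded factory method, so the compiler cannot turn it into the Row => ? function that map requires, and wrapping the result back into Row(_) would give an RDD[Row] when RowMatrix needs an RDD[Vector]. A hedged sketch of one way to do it, assuming the raw_features column holds the JSON strings shown in the question (the sample DataFrame below is a stand-in for df_raw_features, and k is lowered to 2 to match the tiny example):

```scala
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("svd-sketch").getOrCreate()
import spark.implicits._

// Stand-in for df_raw_features: one JSON-serialized sparse vector per row.
val df_raw_features = Seq(
  """{"type":0,"size":5,"indices":[1,3],"values":[1.0,7.0]}""",
  """{"type":0,"size":5,"indices":[0,4],"values":[2.0,5.0]}"""
).toDF("raw_features")

// Row => Vector: pull the JSON string out of each Row, then parse it.
// Note there is no trailing .map(Row(_)) -- RowMatrix wants RDD[Vector], not RDD[Row].
val rows = df_raw_features
  .select("raw_features")
  .rdd
  .map(row => Vectors.fromJson(row.getString(0)))

val mat: RowMatrix = new RowMatrix(rows)
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true)
```

If raw_features is instead stored as a spark.ml VectorUDT column rather than a JSON string, the conversion is Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector](0)), since RowMatrix is an mllib class and does not accept ml vectors directly.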
