Want to subtract each value in a row using values from a separate DF

Problem description

As the title says, I want to subtract the column's mean from each value of a specific column.

Here is my code attempt:

val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))


val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
  .withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()

Tags: scala, apache-spark

Solution


You can aggregate the means of the columns you're interested in (rating1, rating2) either with Spark's DataFrame functions or with a plain SQL query against the DataFrame. (The original attempt fails because test.select("avgX").collect() returns an Array[Row], which Spark cannot use as a column literal; you need to extract the scalar value first.)

import org.apache.spark.sql.functions.{avg, col}

val moviePairs = spark.createDataFrame(
  Seq(
    ("Moonlight", 7, 8),
    ("Lord Of The Drinks", 10, 1),
    ("The Disaster Artist", 3, 5),
    ("Airplane!", 7, 9),
    ("2001", 5, 1)
  )
).toDF("movie", "rating1", "rating2")

// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()

// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()

val subMean = moviePairs
  .withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
  .withColumn("meanDeltaY", col("rating2") - means.getDouble(1))

subMean.show()

Output for the test input DataFrame moviePairs (complete with the usual double-precision loss, which you can see here):

+-------------------+-------+-------+-------------------+-------------------+
|              movie|rating1|rating2|         meanDeltaX|         meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
|          Moonlight|      7|      8| 0.5999999999999996|                3.2|
| Lord Of The Drinks|     10|      1| 3.5999999999999996|               -3.8|
|The Disaster Artist|      3|      5|-3.4000000000000004|0.20000000000000018|
|          Airplane!|      7|      9| 0.5999999999999996|                4.2|
|               2001|      5|      1|-1.4000000000000004|               -3.8|
+-------------------+-------+-------+-------------------+-------------------+
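
If the double-precision noise in the deltas bothers you, a minimal sketch (my addition, assuming two decimal places are enough for display; subMeanRounded is an illustrative name) is to round the computed columns with Spark's round function:

import org.apache.spark.sql.functions.round

// round the deltas to two decimals to hide the floating-point artifacts
val subMeanRounded = subMean
  .withColumn("meanDeltaX", round(col("meanDeltaX"), 2))
  .withColumn("meanDeltaY", round(col("meanDeltaY"), 2))

subMeanRounded.show()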

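Alternatively, if you'd rather not pull the means onto the driver with head() at all, here is a sketch (not from the original answer; the avgX/avgY aliases and the subMeanJoin name are illustrative) that cross-joins the single-row aggregate back onto the data, so the subtraction stays a pure DataFrame operation:

// the aggregate produces exactly one row, so the cross join simply
// appends avgX/avgY to every row without changing the row count
val meansDF = moviePairs.agg(avg("rating1").alias("avgX"), avg("rating2").alias("avgY"))

val subMeanJoin = moviePairs
  .crossJoin(meansDF)
  .withColumn("meanDeltaX", col("rating1") - col("avgX"))
  .withColumn("meanDeltaY", col("rating2") - col("avgY"))
  .drop("avgX", "avgY")

subMeanJoin.show()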