How to subtract one Scala Spark DataFrame from another (normalize to the mean)

Problem description

I have two Spark DataFrames:

df1 with 80 columns

CO01...CO80

+----+----+
|CO01|CO02|
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.70|0.87|
|1.90|0.64|
+----+----+

and df2, also with 80 columns

avg(CO01)...avg(CO80)

which holds the mean of each column:

+------------------+------------------+
|         avg(CO01)|         avg(CO02)|
+------------------+------------------+
|2.6185106382978716|1.0080985915492937|
+------------------+------------------+

How do I subtract df2 from df1 to get the corresponding values?

I'm looking for a solution that doesn't require listing all the columns.

P.S.

In pandas this can be done simply with:

df2 = df1 - df1.mean()

Tags: scala, pandas, apache-spark

Solution

Here is what you can do:

scala> val df = spark.sparkContext.parallelize(List(
     | (2.06,0.56),
     | (1.96,0.72),
     | (1.70,0.87),
     | (1.90,0.64))).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: double, c2: double]

scala>

scala> def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)
subMean: (mean: Double)org.apache.spark.sql.expressions.UserDefinedFunction

scala>

scala> val result = df.columns.foldLeft(df)( (df, col) =>
     | { val avg = df.select(mean(col)).first().getAs[Double](0);
     | df.withColumn(col, subMean(avg)(df(col)))
     | })
result: org.apache.spark.sql.DataFrame = [c1: double, c2: double]

scala>

scala> result.show(10, false)
+---------------------+---------------------+
|c1                   |c2                   |
+---------------------+---------------------+
|0.15500000000000025  |-0.13749999999999996 |
|0.05500000000000016  |0.022499999999999964 |
|-0.20499999999999985 |0.1725               |
|-0.004999999999999893|-0.057499999999999996|
+---------------------+---------------------+

Hope this helps!

Note that this works for any number of columns, as long as every column in the DataFrame is of numeric type.
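One caveat of the `foldLeft` approach above is that it launches a separate aggregation job per column (80 jobs for 80 columns) and wraps the subtraction in a UDF, which Catalyst cannot optimize. A sketch of an alternative, using only the standard Spark API, computes all means in a single job and subtracts with built-in `Column` arithmetic:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

val spark = SparkSession.builder.master("local[*]").appName("demean").getOrCreate()
import spark.implicits._

val df = Seq((2.06, 0.56), (1.96, 0.72), (1.70, 0.87), (1.90, 0.64)).toDF("c1", "c2")

// One aggregation job computes every column mean; the result is a single Row.
val means = df.select(df.columns.map(c => avg(col(c)).as(c)): _*).first()

// Subtract each column's mean with plain Column arithmetic -- no UDF needed,
// so the expressions stay visible to the Catalyst optimizer.
val result = df.select(df.columns.zipWithIndex.map { case (name, i) =>
  (col(name) - means.getDouble(i)).as(name)
}: _*)

result.show(false)
```

This produces the same values as the `foldLeft` version (e.g. 2.06 − 1.905 = 0.155 in `c1`), but scans the data once for the means and once for the subtraction, regardless of the number of columns.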

