Unable to understand fold() behavior in Spark

Question

I am new to Spark. I ran the following Spark program:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FoldFunction").master("local").getOrCreate()
val data = spark.sparkContext.parallelize(List(("Maths", 10), ("English", 10), ("Social", 10), ("Science", 10)))
val extraMarks = ("extra", 10)
// Sum the marks, starting from extraMarks as the zero value
val foldedData = data.fold(extraMarks) { (acc, marks) =>
  val add = acc._2 + marks._2
  ("total", add)
}

println(foldedData)

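According to my analysis, the code should add 10 extra marks to the total of 40, giving (total,50). But the answer I get is (total,60).

Can anyone explain whether my analysis is correct?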

Tags: apache-spark

Answer

The API documentation says the following:

 * @param zeroValue the initial value for the accumulated result of each partition for the `op`
 *                  operator, and also the initial value for the combine results from different
 *                  partitions for the `op` operator - this will typically be the neutral
 *                  element (e.g. `Nil` for list concatenation or `0` for summation)
 * @param op an operator used to both accumulate results within a partition and combine results
 *           from different partitions
 */
def fold(zeroValue: T)(op: (T, T) => T): T

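Usually zeroValue is set to 0 or Nil.

But your zeroValue is ("extra", 10), and it is added a second time when the partition results are combined at the end. That is why you get (total,60).

Let's go through it step by step.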

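Inside the partition, the fold starts from zeroValue:

acc (extra,10), marks (Maths,10): 10 + 10 = 20, i.e. (total, 20)
acc (total,20), marks (English,10): 20 + 10 = 30, i.e. (total, 30)
acc (total,30), marks (Social,10): 30 + 10 = 40, i.e. (total, 40)
acc (total,40), marks (Science,10): 40 + 10 = 50, i.e. (total, 50)

In the final combine step, zeroValue is applied once more:

zeroValue (extra,10), folded (total,50): 10 + 50 = 60, i.e. (total, 60)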
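
The two stages described above can be reproduced without a cluster. The following is a minimal sketch in plain Scala collections, not Spark's actual implementation: simulateFold and FoldSketch are hypothetical names introduced here, and the partitions are passed in explicitly as a Seq of Seqs so that the effect of the partition count is visible.

```scala
object FoldSketch {
  type Marks = (String, Int)

  // Same operator as in the question: ignore the labels, add the numbers.
  val op: (Marks, Marks) => Marks = (acc, marks) => ("total", acc._2 + marks._2)

  // Hypothetical helper (not a Spark API): mimics the two stages of RDD.fold.
  def simulateFold(partitions: Seq[Seq[Marks]], zeroValue: Marks): Marks = {
    // Stage 1: every partition is folded on its own, starting from zeroValue.
    val perPartition = partitions.map(_.foldLeft(zeroValue)(op))
    // Stage 2: the partition results are merged, again starting from zeroValue.
    perPartition.foldLeft(zeroValue)(op)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("Maths", 10), ("English", 10), ("Social", 10), ("Science", 10))
    val extraMarks = ("extra", 10)

    // One partition, as with master("local") in the question.
    println(simulateFold(Seq(data), extraMarks))
    // Two partitions: zeroValue is now applied three times.
    println(simulateFold(data.grouped(2).toSeq, extraMarks))
    // Neutral zero value: the result no longer depends on the partitioning.
    println(simulateFold(data.grouped(2).toSeq, ("extra", 0)))
  }
}
```

In this sketch a non-neutral zeroValue is counted once per partition plus once for the final merge, so the result changes with the number of partitions. If the intent is 40 marks plus 10 extra, one option is to fold with ("extra", 0) and add the extra 10 to the result afterwards.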

