Spark DataFrames: adding a conditional column to a DataFrame

Problem description

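I want to add a conditional column Flag to DataFrame A. Flag should be 1 when both of the following conditions are met, and 0 otherwise:

  1. num in DataFrame A falls between numStart and numEnd of a row in DataFrame B.

  2. If the condition above is met, include is 1 for that row.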
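To make the rule concrete, here is a minimal sketch in plain Scala (no Spark), using the sample rows from this question. The names NumRange, FlagRule and flag are hypothetical and exist only for illustration; the sketch also assumes that both bounds of a range are inclusive.

```scala
// Hypothetical case class mirroring one row of DataFrame B.
final case class NumRange(numStart: Int, numEnd: Int, include: Int)

object FlagRule {
  // Flag is 1 only if some range contains num (bounds inclusive, an assumption)
  // and that range has include == 1; otherwise Flag is 0.
  def flag(num: Int, ranges: Seq[NumRange]): Int =
    if (ranges.exists(r => num >= r.numStart && num <= r.numEnd && r.include == 1)) 1
    else 0

  def main(args: Array[String]): Unit = {
    // Sample rows of DataFrame B from the question.
    val ranges = Seq(
      NumRange(0, 200, 1),
      NumRange(250, 1050, 0),
      NumRange(2000, 3000, 1),
      NumRange(10001, 15001, 1)
    )
    // The num values of DataFrame A from the question.
    Seq(1275, 145, 2678, 6578, 1001).foreach { n =>
      println(s"num=$n Flag=${flag(n, ranges)}")
    }
  }
}
```

For the sample data this yields the same Flag values as the expected output: a num outside every range (1275, 6578) and a num inside a range with include = 0 (1001) both get 0.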

DataFrame A (a very large DataFrame with millions of rows):

+----+------+-----+------------------------+
|num |food  |price|timestamp               |
+----+------+-----+------------------------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|
|1001|taco  |2.59 |2018-07-21T01:00:07.961Z|
+----+------+-----+------------------------+

DataFrame B (a very small DataFrame with only 100 rows):

+----------+-----------+-------+
|numStart  |numEnd     |include|
+----------+-----------+-------+
|0         |200        |1      |
|250       |1050       |0      |
|2000      |3000       |1      |
|10001     |15001      |1      |
+----------+-----------+-------+

Expected output:

+----+------+-----+------------------------+----------+
|num |food  |price|timestamp               |Flag      |
+----+------+-----+------------------------+----------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|0         |
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|1         |
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|1         |
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|0         |
|1001|taco  |2.59 |2018-07-21T01:00:07.961Z|0         |
+----+------+-----+------------------------+----------+

Tags: scala, apache-spark, dataframe, apache-spark-sql, conditional

Solution


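You can left-join dfB to dfA on the condition described in (1), then use withColumn and the coalesce function to build a Flag column that "defaults" to 0:

  • Records for which a match was found use the include value of the matching dfB record.
  • Records with no match have include = null. Per your requirement these should get Flag = 0, so we use coalesce, which returns the default value lit(0) when include is null.

Finally, drop the dfB columns you are not interested in: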

import org.apache.spark.sql.functions._
import spark.implicits._ // assuming "spark" is your SparkSession

dfA.join(dfB, $"num".between($"numStart", $"numEnd"), "left")
  .withColumn("Flag", coalesce($"include", lit(0)))
  .drop(dfB.columns: _*)
  .show()

// +----+------+-----+--------------------+----+
// | num|  food|price|           timestamp|Flag|
// +----+------+-----+--------------------+----+
// |1275|tomato| 1.99|2018-07-21T00:00:...|   0|
// | 145|carrot| 0.45|2018-07-21T00:00:...|   1|
// |2678| apple| 0.99|2018-07-21T01:00:...|   1|
// |6578|banana| 1.29|2018-07-20T01:11:...|   0|
// |1001|  taco| 2.59|2018-07-21T01:00:...|   0|
// +----+------+-----+--------------------+----+
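Because dfB is described as very small and dfA as very large, one possible variation is to add a broadcast hint to the same join. The sketch below is self-contained and only illustrative: the object name BroadcastFlagSketch, the helper withFlag, and the local SparkSession are assumptions, not part of the original answer. It also assumes that dfB fits comfortably in memory on each executor and that the ranges in dfB do not overlap (with overlapping ranges, a left join returns one output row per matching range). Depending on the spark.sql.autoBroadcastJoinThreshold setting, Spark may already broadcast a table this small without the hint.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, coalesce, col, lit}

object BroadcastFlagSketch {
  // Same left join + coalesce as in the answer, with dfB wrapped in a broadcast hint.
  def withFlag(dfA: DataFrame, dfB: DataFrame): DataFrame =
    dfA
      .join(broadcast(dfB), col("num").between(col("numStart"), col("numEnd")), "left")
      .withColumn("Flag", coalesce(col("include"), lit(0))) // no match -> include is null -> 0
      .drop(dfB.columns: _*)                                // remove numStart, numEnd, include

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in a real job a SparkSession usually already exists.
    val spark = SparkSession.builder().master("local[*]").appName("flag-sketch").getOrCreate()
    import spark.implicits._

    // A few columns of the sample data from the question.
    val dfA = Seq(
      (1275, "tomato"), (145, "carrot"), (2678, "apple"), (6578, "banana"), (1001, "taco")
    ).toDF("num", "food")
    val dfB = Seq(
      (0, 200, 1), (250, 1050, 0), (2000, 3000, 1), (10001, 15001, 1)
    ).toDF("numStart", "numEnd", "include")

    withFlag(dfA, dfB).show()
    spark.stop()
  }
}
```

Whether the hint actually helps depends on the data and cluster, so it is worth comparing the physical plans with explain() before and after adding it.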
