Adding a new column of Map datatype to a Spark DataFrame in Scala

Problem description

I am able to create a new DataFrame in which one column has the Map datatype.

val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()), 
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1  |Visa     |1              |[]               |
|2  |MC       |2              |[]               |
+---+---------+---------------+-----------------+

Now I want to create a new column of the same type as card_type_details. I am trying to use Spark's withColumn method to add this new column.

inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)

+---+---------+---------------+-----------------+----+
|id |card_type|number_of_cards|card_type_details|tmp |
+---+---------+---------------+-----------------+----+
|1  |Visa     |1              |[]               |null|
|2  |MC       |2              |[]               |null|
+---+---------+---------------+-----------------+----+

When I check the schema, the two columns have the same type, but their values differ.

scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
 |-- id: integer (nullable = false)
 |-- card_type: string (nullable = true)
 |-- number_of_cards: integer (nullable = false)
 |-- card_type_details: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = false)
 |-- tmp: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

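I am not sure whether I am adding the new column correctly. The problem arises when I apply the .isEmpty method to the tmp column: I get a NullPointerException.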

scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
     | var output_map = Map[String, Int]()
     | if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
     | else {output_map = card_type_details }
     | output_map
     | })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction

scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value   |
+---+---------+---------------+-----------------+--------+
|1  |Visa     |1              |[]               |[0 -> 1]|
|2  |MC       |2              |[]               |[0 -> 1]|
+---+---------+---------------+-----------------+--------+

scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)

Caused by: java.lang.NullPointerException
  at $anonfun$checkValue$1.apply(<console>:28)
  at $anonfun$checkValue$1.apply(<console>:26)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)

How can I add a new column that has the same value as the card_type_details column?

Tags: scala, apache-spark, apache-spark-sql

Solution


To add a column tmp with the same value as card_type_details, simply do:

inputDF2.withColumn("tmp", col("card_type_details"))

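The NullPointerException occurs because lit(null) cast to a map type yields a null value, not an empty map, so the UDF ends up calling .isEmpty on null. If your goal is to add a column holding an empty map and avoid the NullPointerException, the solution is:

inputDF2.withColumn("tmp", typedLit(Map.empty[String, Int]))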
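
If the column can still contain nulls (for example, because it comes from upstream data and is not built with typedLit), the UDF itself can be made null-safe. The following is only a sketch under some assumptions: the object name NullSafeMapColumn, the helper defaultIfMissing, and the local SparkSession settings are made up for illustration and are not part of the original question. It wraps the argument in Option so that a null map is treated like an empty one, and also shows replacing nulls with an empty map via coalesce.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit, typedLit, udf}

object NullSafeMapColumn {
  // Hypothetical helper: a null or empty map is replaced by the default
  // entry "0" -> 1, mirroring the checkValue UDF from the question.
  def defaultIfMissing(m: Map[String, Int]): Map[String, Int] =
    Option(m).filter(_.nonEmpty).getOrElse(Map("0" -> 1))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation; in spark-shell
    // the `spark` value already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-safe-map")
      .getOrCreate()
    import spark.implicits._

    val inputDF2 = Seq(
      (1, "Visa", 1, Map[String, Int]()),
      (2, "MC", 2, Map[String, Int]())
    ).toDF("id", "card_type", "number_of_cards", "card_type_details")

    val checkValue = udf(defaultIfMissing _)

    // Variant 1: the UDF tolerates the null map produced by the cast.
    inputDF2
      .withColumn("tmp", lit(null).cast("map<string,int>"))
      .withColumn("value", checkValue(col("tmp")))
      .show(false)

    // Variant 2: replace nulls with an empty map before further processing.
    inputDF2
      .withColumn("tmp", lit(null).cast("map<string,int>"))
      .withColumn("tmp", coalesce(col("tmp"), typedLit(Map.empty[String, Int])))
      .show(false)

    spark.stop()
  }
}
```

Which variant fits better depends on whether a null should stay distinguishable from an empty map elsewhere in the pipeline; the coalesce variant erases that distinction, while the null-safe UDF leaves the tmp column untouched.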
