apache-spark - How to modify a dataframe in-place so that its ArrayType column can't be null (nullable = false and containsNull = false)?
Problem description
Take the following example dataframe:
val df = Seq(Seq("xxx")).toDF("a")
Schema:
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
How can I modify df in-place so that the resulting dataframe is not nullable anywhere, i.e. has the following schema:
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
I understand that I can create another dataframe that enforces a non-nullable schema, as described in "Change nullable property of column in spark dataframe":
spark.createDataFrame(df.rdd, StructType(StructField("a", ArrayType(StringType, false), false) :: Nil))
However, this is not an option under Structured Streaming, so I am looking for some kind of in-place modification.
Solution
One way to achieve this is with a UserDefinedFunction (UDF).
// Problem setup
val df = Seq(Seq("xxx")).toDF("a")
df.printSchema
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
On to the solution:
import org.apache.spark.sql.types.{ArrayType, StringType}
import org.apache.spark.sql.functions.{udf, col}
// We define a sub schema with the appropriate data type and null condition
val subSchema = ArrayType(StringType, containsNull = false)
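// We create an identity UDF that returns its input under this sub schema,
// while marking the output of the UDF as non-nullable.
// Note: on Spark 3.x, udf(f, dataType) is deprecated and disabled by default;
// it requires spark.sql.legacy.allowUntypedScalaUDF=true.
val applyNonNullableSchemaUdf = udf((x: Seq[String]) => x, subSchema).asNonNullable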
// We apply the UDF
val newSchemaDF = df.withColumn("a", applyNonNullableSchemaUdf(col("a")))
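And there you have it. Note that asNonNullable only changes the declared schema; it does not validate the data, so this assumes the column really contains no null arrays or null elements.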
// Check new schema
newSchemaDF.printSchema
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
// Check that it actually works
newSchemaDF.show
+-----+
|    a|
+-----+
|[xxx]|
+-----+
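Instead of reading the printed schema, the same check can be done programmatically. The following is a minimal, self-contained sketch rather than part of the original answer: the object name NonNullableArrayCheck and the local SparkSession settings are illustrative assumptions, and the spark.sql.legacy.allowUntypedScalaUDF setting is included on the assumption that the code may run on Spark 3.x, where udf(f, dataType) is disabled by default.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Hypothetical wrapper object, used only to make the sketch runnable on its own.
object NonNullableArrayCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("non-nullable-array-check")
      // Assumption: only needed on Spark 3.x, where the untyped
      // udf(f, dataType) overload is rejected unless this flag is set.
      .config("spark.sql.legacy.allowUntypedScalaUDF", "true")
      .getOrCreate()
    import spark.implicits._

    // Same setup as in the answer above.
    val df = Seq(Seq("xxx")).toDF("a")
    val subSchema = ArrayType(StringType, containsNull = false)
    val applyNonNullableSchemaUdf =
      udf((x: Seq[String]) => x, subSchema).asNonNullable()
    val newSchemaDF = df.withColumn("a", applyNonNullableSchemaUdf(col("a")))

    // Inspect the schema directly instead of printing it.
    val field = newSchemaDF.schema("a")
    assert(!field.nullable, "column a should be declared non-nullable")
    assert(field.dataType == subSchema, "elements should be declared non-null")

    spark.stop()
  }
}
```

These assertions only inspect the declared schema. They do not prove that the underlying data is free of nulls, which remains the caller's responsibility.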