Split a single String column into multiple columns in Spark-Scala

Problem description

I have a DataFrame:

+----+--------------------------+
|city|Types                     |
+----+--------------------------+
|BNG |school                    |
|HYD |school,restaurant         |
|MUM |school,restaurant,hospital|
+----+--------------------------+

I want to split the Types column on ',' into multiple columns.

The problem is that the number of values per row is not fixed, so I don't know how to do this.

I have seen a related question for PySpark, but I want a solution in Spark-Scala, not PySpark.

Any help is appreciated.

Thanks in advance.

Tags: scala, apache-spark, apache-spark-sql

Solution


One way to deal with the irregular number of values in the column is to adjust the representation: instead of positional columns, add one boolean column per type.

For example:

import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  ("BNG", "school"),
  ("HYD", "school,res"),
  ("MUM", "school,res,hos")
).toDF("city", "types")

+----+--------------+
|city|         types|
+----+--------------+
| BNG|        school|
| HYD|    school,res|
| MUM|school,res,hos|
+----+--------------+

val result = data
  .withColumn("isSchool", array_contains(split(col("types"), ","), "school"))
  .withColumn("isRes", array_contains(split(col("types"), ","), "res"))
  .withColumn("isHos", array_contains(split(col("types"), ","), "hos"))

result.show()

+----+--------------+--------+-----+-----+
|city|         types|isSchool|isRes|isHos|
+----+--------------+--------+-----+-----+
| BNG|        school|    true|false|false|
| HYD|    school,res|    true| true|false|
| MUM|school,res,hos|    true| true| true|
+----+--------------+--------+-----+-----+
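Since the question says the set of types is not fixed, the three hard-coded columns above can be generated at runtime instead: collect the distinct type values and fold them into the DataFrame, one boolean column each. This is a sketch that assumes the distinct set is small enough to collect to the driver; the column-name prefix "is" is just a convention from the example above:

import org.apache.spark.sql.functions._

// Collect the distinct type values (driver-side; assumes a small set).
val allTypes = data
  .select(explode(split(col("types"), ",")).as("t"))
  .distinct()
  .collect()
  .map(_.getString(0))

// Fold over the distinct values, adding one boolean column per type.
val wide = allTypes.foldLeft(data) { (df, t) =>
  df.withColumn(s"is${t.capitalize}", array_contains(split(col("types"), ","), t))
}

wide.show()

This produces the same shape of output as the hard-coded version, but adapts automatically when new types appear in the data.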
