How to reduce multiple string values to predefined categories in a column

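Problem description

I want to reduce the values of a specific DataFrame column to a set of predefined categories, based on which pattern each value matches.

Example: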

  val df = spark.createDataFrame(Seq(
  (1, "apple"),
  (2, "banana"),
  (3, "avocado"),
  (4, "potato"))).toDF("Id", "category")

Id  category
1   apple
2   banana
3   avocado
4   potato

Expected output:

  val df_reduced = spark.createDataFrame(Seq(
  (1, "fruit"),
  (2, "fruit"),
  (3, "vegetable"),
  (4, "vegetable"))).toDF("Id", "category")

Id  category
1   fruit
2   fruit
3   vegetable
4   vegetable

This is the solution I came up with:

df.withColumn("category", when(col("category") === "apple", regexp_replace(col("category"), "apple", "fruit"))
              .otherwise(when(col("category") === "banana", regexp_replace(col("category"), "banana", "fruit"))
              .otherwise(when(col("category") === "avocado", regexp_replace(col("category"), "avocado", "vegetable"))
              .otherwise(when(col("category") === "potato", regexp_replace(col("category"), "potato", "vegetable"))
                         ))))
.show

I don't really like this nested when-otherwise approach, so I'm wondering: is there a better, more idiomatic solution for this task?

Tags: scala, apache-spark, apache-spark-sql

Solution

I think a map combined with a udf, as shown below, should help:

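import org.apache.spark.sql.functions._

// Lookup table from raw value to category; keys are matched case-sensitively
val map = Map("Apple" -> "fruit", "Mango" -> "fruit", "potato" -> "vegetable", "avocado" -> "vegetable", "Banana" -> "fruit")

// Values missing from the map are passed through unchanged
val replaceUDF = udf((name: String) => map.getOrElse(name, name))
val outputdf = df.withColumn("new_category", replaceUDF(col("category")))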

Sample output (for a five-row input whose values match the map keys above):

+---+--------+------------+
| Id|category|new_category|
+---+--------+------------+
|  1|   Apple|       fruit|
|  2|  Banana|       fruit|
|  3|  potato|   vegetable|
|  4| avocado|   vegetable|
|  5|   Mango|       fruit|
+---+--------+------------+
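
Since the question asks for something flatter than hand-nested when-otherwise calls, here is a minimal sketch of an alternative that avoids a udf and relies only on built-in column functions. It is not part of the original answer. The `groups` map, the extra "kiwi" row, and the local `SparkSession` are assumptions made for illustration; the idea is to fold a category-to-values map into a single conditional expression using `isin`.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("category-mapping").getOrCreate()

val df = spark.createDataFrame(Seq(
  (1, "apple"),
  (2, "banana"),
  (3, "avocado"),
  (4, "potato"),
  (5, "kiwi") // hypothetical extra row that belongs to no group
)).toDF("Id", "category")

// Hypothetical grouping: target category -> raw values that belong to it
val groups: Map[String, Seq[String]] = Map(
  "fruit"     -> Seq("apple", "banana"),
  "vegetable" -> Seq("avocado", "potato")
)

// Fold the groups into one conditional expression;
// values that match no group fall back to the original column value
val reducedCategory: Column = groups.foldLeft(col("category")) {
  case (fallback, (label, values)) =>
    when(col("category").isin(values: _*), label).otherwise(fallback)
}

val reduced = df.withColumn("category", reducedCategory)
reduced.orderBy("Id").show()
```

The expression is still a chain of when-otherwise calls under the hood, but it is generated from data, so adding a category means adding a map entry rather than another level of nesting. Staying with built-in functions also keeps the logic visible to Spark's optimizer, which a udf generally is not.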
