Pattern-matching strings in a column of a Spark DataFrame

Problem description

I have a column in a Spark DataFrame. I need to pick out only the strings in it that contain "xyz" and store them in a new column.

Input (only the fields in colB that contain xyz are needed)
colA       colB
A          bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656      
B          xyz:4462915,xyz:4462917,xyz:4462918                        

Required Output
colA       colB                                                       colC
A          bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656      xyz:3089656
B          xyz:4462915,xyz:4462917,xyz:4462918                        xyz:4462915,xyz:4462917,xyz:4462918

I have 100k rows, so I cannot use groupBy on colA with collect_list. Could you help me get the required output?

Tags: scala, dataframe, apache-spark

Solution


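If you are using Spark 2.4+, you can split colB on commas and then filter the resulting array with the built-in higher-order function filter, written as a SQL expression. Note that LIKE 'xyz%' keeps the entries that start with xyz, which is the case for every matching entry in the sample data.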

import org.apache.spark.sql.functions._

import spark.implicits._
val df = Seq(
  ("A", "bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656"),
  ("B", "xyz:4462915,xyz:4462917,xyz:4462918")
).toDF("colA", "colB")

val newDF = df.withColumn("split", split($"colB", ","))
  .selectExpr("*", "filter(split, x -> x LIKE 'xyz%' ) filteredB")
  .withColumn("colC", concat_ws(",", $"filteredB"))
  .drop("split", "filteredB")

newDF.show(false)

Output:

+----+-----------------------------------------------------+-----------------------------------+
|colA|colB                                                 |colC                               |
+----+-----------------------------------------------------+-----------------------------------+
|A   |bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656|xyz:3089656                        |
|B   |xyz:4462915,xyz:4462917,xyz:4462918                  |xyz:4462915,xyz:4462917,xyz:4462918|
+----+-----------------------------------------------------+-----------------------------------+
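
The per-row logic above (split on commas, keep the entries that start with xyz, join them back) can also be sketched in plain Scala, which may help on Spark versions older than 2.4 where the filter expression is not available. This is only a minimal sketch: the object name XyzFilter, the helper keepByPrefix, and the choice to return an empty string for a null cell are assumptions made for illustration and are not part of the original answer.

```scala
object XyzFilter {
  // Hypothetical helper: keeps only the comma-separated entries of `value`
  // that start with `prefix`, and joins them back with commas.
  // Assumption: a null cell is mapped to an empty string.
  def keepByPrefix(value: String, prefix: String = "xyz"): String =
    Option(value)
      .map(_.split(",").map(_.trim).filter(_.startsWith(prefix)).mkString(","))
      .getOrElse("")
}
```

To apply it to the DataFrame, the helper could be wrapped with udf from org.apache.spark.sql.functions, for example val keepXyz = udf((s: String) => XyzFilter.keepByPrefix(s)), followed by df.withColumn("colC", keepXyz($"colB")). Where the built-in expression is available it is usually preferable, since Spark cannot optimize through a UDF.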
