Converting Array[Array[String]] to a string in a column using Scala and Spark

Problem description

This is my DataFrame:

+----------+--------------------+--------------------+
|    NewsId|             newsArr|            transArr|
+----------+--------------------+--------------------+
|        26|[Republicans, Sto...|[[R, IH0, P, AH1,...|
|        29|[ISIS, Claims, Re...|[[AY1, S, AH0], [...|
|       474|[Concert, for, Tr...|[[K, AA1, N, S, E...|
|       964|[How, a, Fractiou...|[[HH, AW1], [AH0]...|
|      1677|[Review:, ‘Kong:,...|[[n/a], [n/a], [S...|
|      1697|[The, Rice-Size, ...|[[DH, AH0], [n/a]...|
|      1806|[Populists, Appea...|[[P, AA1, Y, AH0,...|
|      1950|[Uber, Board, Sta...|[[Y, UW1, B, ER0]...|
|      2040|[Health, Bill’s, ...|[[HH, EH1, L, TH]...|
|      2214|[Unmasking, the, ...|[[n/a], [DH, AH0]...|

I want to turn each cell of the "transArr" column into a string like this:

+----------+--------------------+--------------+
|    NewsId|             newsArr|      transArr|
+----------+--------------------+--------------+
|        26|[Republicans, Sto...|R IH0 P AH1...|
|        29|[ISIS, Claims, Re...|AY1 S AH0...  |
|       474|[Concert, for, Tr...|K AA1 N S E...|
|       964|[How, a, Fractiou...|HH AW1 AH0... |
|      1677|[Review:, ‘Kong:,...|n/a n/a S...  |
|      1697|[The, Rice-Size, ...|DH AH0 n/a... |
|      1806|[Populists, Appea...|P AA1 Y AH0...|
|      1950|[Uber, Board, Sta...|Y UW1 B ER0...|
|      2040|[Health, Bill’s, ...|HH EH1 L TH...|
|      2214|[Unmasking, the, ...|n/a DH AH0... |

Is there a relatively simple solution?

Tags: arrays, scala, dataframe, apache-spark

Solution


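Use concat_ws together with flatten: flatten collapses the array of arrays into a single array of strings, and concat_ws joins the elements of that array with a separator. Note that flatten is available from Spark 2.4 onward. Check the code below.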

scala> df.printSchema
root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

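scala> import org.apache.spark.sql.functions.{concat_ws, flatten}

scala> // flatten merges the inner arrays into one array; concat_ws then joins its elements with " "
scala> df.withColumn("flatten", concat_ws(" ", flatten($"data"))).show(false)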

+------------+-------+
|data        |flatten|
+------------+-------+
|[[abc, cdf]]|abc cdf|
+------------+-------+
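
Applied to the DataFrame from the question, the same idea might look like the sketch below. It is a minimal, self-contained example rather than a definitive implementation: the object name FlattenTransArr, the helper flattenToString, and the sample rows are made up for illustration, and it assumes Spark 2.4 or later running in local mode. Only the column names NewsId, newsArr and transArr come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, flatten}

object FlattenTransArr {
  // Hypothetical helper: overwrites the given array<array<string>> column
  // with a single space-separated string.
  def flattenToString(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, concat_ws(" ", flatten(col(column))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-example")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows that only mimic the shape of the question's data.
    val df = Seq(
      (26, Seq("Republicans", "Stop"), Seq(Seq("R", "IH0"), Seq("S", "T", "AA1", "P"))),
      (1677, Seq("Review:", "Kong"), Seq(Seq("n/a"), Seq("K", "AO1", "NG")))
    ).toDF("NewsId", "newsArr", "transArr")

    // Reusing the column name replaces transArr in place instead of adding a new column.
    flattenToString(df, "transArr").show(false)

    spark.stop()
  }
}
```

Passing the existing column name to withColumn overwrites transArr, which matches the desired output in the question; pass a different name if the original nested array should be kept as well.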
