A better way to concatenate multiple columns?

Problem description

I have 30 columns. 26 of them are named after the letters of the alphabet, and I want to combine those 26 columns into a single column holding one string.

price  dateCreate  volume  country  A  B  C  D  E ..... Z
19     20190501    25      US       1  2  5  6  19      30
49     20190502    30      US       5  4  5  0  34      50

I want this:

price  dateCreate  volume  country  new_col
19     20190501    25      US       "1,2,5,6,19,....30"
49     20190502    30      US       "5,4,5,0,34,50"

I know I can do something like this:

df.withColumn("new_col", concat($"A", $"B", ...$"Z"))

But the next time I run into this problem, I'd like to know an easier way to concatenate many columns. Is there one?

Tags: scala, apache-spark

Solution


Just apply the following to however many columns you want to concatenate:

import org.apache.spark.sql.functions._
import spark.implicits._   // both are already in scope in spark-shell

val df = Seq(
  (19, 20190501, 24, "US", 1, 2, 5, 6, 19),
  (49, 20190502, 30, "US", 5, 4, 5, 0, 34)
).toDF("price", "dateCreate", "volume", "country", "A", "B", "C", "D", "E")

// every column after the first four is one of the letter columns to concatenate
val exprs = df.columns.drop(4).map(col)

df.select($"price", $"dateCreate", $"volume", $"country",
  concat_ws(",", array(exprs: _*)).as("new_col")).show()


+-----+----------+------+-------+----------+
|price|dateCreate|volume|country|   new_col|
+-----+----------+------+-------+----------+
|   19|  20190501|    24|     US|1,2,5,6,19|
|   49|  20190502|    30|     US|5,4,5,0,34|
+-----+----------+------+-------+----------+
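
If you would rather not spell out the leading columns either, the whole select list can be built dynamically as well. A minimal sketch of that variant (not part of the original answer), assuming the first four columns are always the ones to keep; it uses the varargs form of concat_ws, so the array wrapper is not needed:

// variant sketch: build the entire select list from df.columns
val keep  = df.columns.take(4).map(col)   // price, dateCreate, volume, country
val toCat = df.columns.drop(4).map(col)   // A, B, C, ... (everything else)

// concat_ws(sep, cols: _*) joins the columns directly, casting them to strings
df.select((keep :+ concat_ws(",", toCat: _*).as("new_col")): _*).show()

This produces the same table as above and keeps working unchanged if more letter columns are added later.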

For completeness, here is the PySpark equivalent:

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [[19, 20190501, 24, "US", 1, 2, 5, 6, 19],
     [49, 20190502, 30, "US", 5, 4, 5, 0, 34]],
    ["price", "dateCreate", "volume", "country", "A", "B", "C", "D", "E"])

# every column after the first four is one of the letter columns to concatenate
exprs = df.columns[4:]

df.select("price", "dateCreate", "volume", "country",
          F.concat_ws(",", F.array(*exprs)).alias("new_col")).show()
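
As on the Scala side, F.concat_ws also accepts the columns directly as varargs (F.concat_ws(",", *exprs)), so the F.array wrapper can be dropped here as well if you prefer.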
