Apache Spark: concatenating multiple rows into lists in a single row

Problem description

I need to create a table (Hive table / Spark DataFrame) from a source table that stores each user's data in separate rows, collecting every N rows into lists in a single row (N = 4 in the example below).

User table:
Schema:  userid: string | transactiondate:string | charges: string
----|------------|-------| 
123 | 2017-09-01 | 20.00 | 
124 | 2017-09-01 | 30.00 | 
125 | 2017-09-01 | 20.00 | 
126 | 2017-09-01 | 30.00 | 
456 | 2017-09-01 | 20.00 | 
457 | 2017-09-01 | 30.00 | 
458 | 2017-09-01 | 20.00 | 
459 | 2017-09-01 | 30.00 | 

The output table should be:

User table:
Schema:  userid: string | transactiondate:string | charges: string 
------------------|-----------------------------------------------|-------------------------
[123,124,125,126] | [2017-09-01,2017-09-01,2017-09-01,2017-09-01] | [20.00,30.00,20.00,30.00]
[456,457,458,459] | [2017-09-01,2017-09-01,2017-09-01,2017-09-01] | [20.00,30.00,20.00,30.00]

Tags: scala, apache-spark, apache-spark-sql

Solution

You need a key to group the rows on. The answer derives an id column with a window function and groups by it: each row gets bucket id = floor((row_number - 1) / N), so every N = 4 consecutive rows in window order share the same id.
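For a self-contained run, here is a minimal sketch that rebuilds the source df from the sample rows above (hypothetical: the SparkSession name spark and the local master are assumptions; the question declares all three columns as strings, although the printed result further down suggests the answerer's own df used timestamp and double columns):

import org.apache.spark.sql.SparkSession

// Assumed setup, for illustration only.
val spark = SparkSession.builder().appName("rows-to-lists").master("local[*]").getOrCreate()
import spark.implicits._

// Source table from the question, all columns as strings.
val df = Seq(
  ("123", "2017-09-01", "20.00"),
  ("124", "2017-09-01", "30.00"),
  ("125", "2017-09-01", "20.00"),
  ("126", "2017-09-01", "30.00"),
  ("456", "2017-09-01", "20.00"),
  ("457", "2017-09-01", "30.00"),
  ("458", "2017-09-01", "20.00"),
  ("459", "2017-09-01", "30.00")
).toDF("userid", "transactiondate", "charges")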

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val N = 4
// Build a collect_list aggregation for every column of the source table.
val agg_list = df.columns.map(c => collect_list(c).as(c))
// Window that fixes a global row order; row_number is taken over it.
val w = Window.orderBy("transactiondate", "userid")

// Bucket id = floor((row_number - 1) / N): every N consecutive rows share an id.
df.withColumn("id", ((row_number.over(w) - 1) / N).cast("int"))
  .groupBy("id")
  .agg(agg_list.head, agg_list.tail: _*)
  .drop("id").show(false)

The result is:

+--------------------+------------------------------------------------------------------------------------+------------------------+
|userid              |transactiondate                                                                     |charges                 |
+--------------------+------------------------------------------------------------------------------------+------------------------+
|[123, 124, 125, 126]|[2017-09-01 00:00:00, 2017-09-01 00:00:00, 2017-09-01 00:00:00, 2017-09-01 00:00:00]|[20.0, 30.0, 20.0, 30.0]|
|[456, 457, 458, 459]|[2017-09-01 00:00:00, 2017-09-01 00:00:00, 2017-09-01 00:00:00, 2017-09-01 00:00:00]|[20.0, 30.0, 20.0, 30.0]|
+--------------------+------------------------------------------------------------------------------------+------------------------+
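One caveat worth noting: collect_list runs after the groupBy shuffle, so the order of elements inside each list is not guaranteed to follow the window order. A hedged variant (the helper column name rn is an assumption; it reuses df, w, N, and the imports from above) pins the order by collecting (rn, value) structs, sorting each array, and then stripping the index:

// Keep the row number alongside each value, sort the collected structs
// by rn (the first struct field), then extract just the value field.
val withRn = df
  .withColumn("rn", row_number.over(w))
  .withColumn("id", ((col("rn") - 1) / N).cast("int"))

val ordered_aggs = df.columns.map { c =>
  sort_array(collect_list(struct(col("rn"), col(c)))).getField(c).as(c)
}

withRn.groupBy("id")
  .agg(ordered_aggs.head, ordered_aggs.tail: _*)
  .drop("id")
  .show(false)

This produces the same shape of output while making the in-list ordering deterministic.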
