How to turn a pyspark df into nested arrays using group by

Problem description

I have a dataframe that looks like this:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql.functions import to_date

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("id1", StringType(), True),
                          StructField("id2", StringType(), True),
                          StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True),
                          StructField("col3", FloatType(), True),
                          StructField("col4", StringType(), True)])
TEST_data = [('2020-08-01', 'aa', '123', 3, 5, 0.5, 'a'),
             ('2020-08-02', 'aa', '123', -1, 4, 0.0, 'b'),
             ('2020-08-01', 'bb', '456', -1, 3, 0.0, 'b'),
             ('2020-08-02', 'bb', '456', -1, 3, 0.0, 'd'),
             ('2020-08-03', 'bb', '456', -1, 4, 0.0, 'a'),
             ('2020-08-01', 'aa', '456', -1, 5, 0.0, 'z'),
             ('2020-08-02', 'aa', '456', -1, 5, 0.0, 'g')]
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df = TEST_df.withColumn("date", to_date("date", 'yyyy-MM-dd'))
TEST_df.show()

+----------+---+---+----+----+----+----+
|      date|id1|id2|col1|col2|col3|col4|
+----------+---+---+----+----+----+----+
|2020-08-01| aa|123|   3|   5| 0.5|   a|
|2020-08-02| aa|123|  -1|   4| 0.0|   b|
|2020-08-01| bb|456|  -1|   3| 0.0|   b|
|2020-08-02| bb|456|  -1|   3| 0.0|   d|
|2020-08-03| bb|456|  -1|   4| 0.0|   a|
|2020-08-01| aa|456|  -1|   5| 0.0|   z|
|2020-08-02| aa|456|  -1|   5| 0.0|   g|
+----------+---+---+----+----+----+----+

I basically want to group this dataframe by id1 and id2 so that it ends up looking like this. The result set would look something like this:

|id1|id2|             col1                |           col2                 |
---------------------------------------------------------------------------
|aa |123|[ [2020-08-01,3],[2020-08-02,-1] ]| [[2020-08-01,5],[2020-08-02,4]]
|bb |456|[ [2020-08-01,-1],[2020-08-02,-1].........

and so on... Could you also show how to use those arrays to do some really simple computation? For example, on id1 = 'aa', id2 = '123', compute col1 + col2, keeping the values aligned by date. The computed column would then look like [ [2020-08-01, 8], [2020-08-02, 3] ].

Tags: python, pyspark, pyspark-dataframes

Solution


In pyspark you cannot build an array holding a date together with the sum of col1 and col2, because dates and integers are different types. What you can do instead is build a map from date to sum, then group by, and see whether that fits your needs:

from pyspark.sql.functions import create_map, col, collect_list

TEST_df.withColumn("dateColSum", create_map(col("date"), col("col1") + col("col2"))) \
    .groupBy("id1", "id2") \
    .agg(collect_list("dateColSum")) \
    .show(truncate=False)
+---+---+---------------------------------------------------------+
|id1|id2|collect_list(dateColSum)                                 |
+---+---+---------------------------------------------------------+
|aa |456|[[2020-08-01 -> 4], [2020-08-02 -> 4]]                   |
|aa |123|[[2020-08-01 -> 8], [2020-08-02 -> 3]]                   |
|bb |456|[[2020-08-01 -> 2], [2020-08-02 -> 2], [2020-08-03 -> 3]]|
+---+---+---------------------------------------------------------+

