Spark DataFrame Partitioning

Problem Description

Currently, I have one DataFrame. I want to split it into several independent DataFrames and then process them in turn.

The Spark DataFrame looks like this:

+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+

Then I want to split it into four DataFrames:

+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+

What should I do?

In other words, if I don't split the original DataFrame, how can I operate on these four groups within the original DataFrame?

Tags: dataframe, apache-spark, apache-spark-sql

Solution

You can achieve this using NTILE.

NTILE is a Spark SQL analytic function. It divides an ordered dataset into the number of buckets indicated by expr and assigns the appropriate bucket number to each row. Buckets are numbered from 1 to expr. For each partition, the expr value must resolve to a positive constant.

Each row will then carry a bucket number (the NTILE value). You can now use a filter on that number to create each of the datasets specified in the ntile function.

Here is some dummy code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val w = Window.orderBy("sum_val")
val resultDF = x.orderBy("sum_val")
  .select(x("id"), x("sum_val"), ntile(4).over(w).as("ntile"))

+---+-------+-----+
| id|sum_val|ntile|
+---+-------+-----+
|  B|      3|    1|
|  E|      4|    1|
|  H|      4|    2|
|  D|     14|    2|
|  A|     14|    3|
|  F|     30|    3|
|  C|     34|    4|
+---+-------+-----+
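Tying this back to the question: once every row carries an ntile bucket number, filtering on that number yields the separate DataFrames. Below is a minimal sketch, assuming the question's DataFrame is bound to a variable named df (the variable and bucket column names are illustrative, not from the original answer):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}

// Assign a bucket number (1..4) to every row of the original DataFrame.
val w = Window.orderBy("id", "data_identifier_method")
val bucketed = df.withColumn("bucket", ntile(4).over(w))

// Filter on the bucket number to obtain the four independent DataFrames.
val parts = (1 to 4).map(i => bucketed.filter(col("bucket") === i).drop("bucket"))

// Alternatively, since the desired split is by (id, data_identifier_method),
// you can filter on the distinct key values directly, without ntile:
val keys = df.select("id", "data_identifier_method").distinct().collect()
val partsByKey = keys.map { row =>
  df.filter(col("id") === row.getString(0) &&
            col("data_identifier_method") === row.getString(1))
}

Note that a Window with no partitionBy clause moves all rows into a single partition, which Spark will warn about on large datasets; in that case the key-based filter above (or aggregating with groupBy on the original DataFrame, which also answers the "without splitting" variant of the question) is usually preferable.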
