dataframe - Spark DataFrame partitioning
Problem description
I currently have a DataFrame. I want to split it into several independent DataFrames and then process each of them in turn.
The Spark DataFrame looks like this:
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
I then want to split it into four DataFrames:
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
How should I do this?
In other words, if I don't split the original DataFrame, how can I operate on each of these four groups within the original DataFrame?
Solution
You can use NTILE to achieve this.
NTILE is a Spark SQL analytic function. It divides an ordered dataset into the number of buckets indicated by expr and assigns the appropriate bucket number to each row. Buckets are numbered from 1 through expr, and the expr value must resolve to a positive constant for each window partition.
Each row then carries a bucket number (the NTILE value), and you can filter on that column to create each of the datasets specified in the NTILE call.
Here is some dummy code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val w = Window.orderBy("sum_val")
val resultDF = x.orderBy("sum_val").select(x("id"), x("sum_val"), ntile(4).over(w).alias("ntile"))
+---+-------+-----+
| id|sum_val|ntile|
+---+-------+-----+
| B| 3| 1|
| E| 4| 1|
| H| 4| 2|
| D| 14| 2|
| A| 14| 3|
| F| 30| 3|
| C| 34| 4|
+---+-------+-----+
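Applied to the question's data, here is a minimal sketch of the same idea. It assumes the original DataFrame is named df and that process is the caller's own routine; it also swaps ntile for dense_rank, since dense_rank guarantees that every row with the same (id, data_identifier_method) pair lands in the same bucket, whereas ntile only balances bucket sizes:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

// Number each (id, data_identifier_method) group from 1 upward.
val w = Window.orderBy("id", "data_identifier_method")
val numbered = df.withColumn("grp", dense_rank().over(w))

// Filter out each group as its own DataFrame and process it in turn.
// (With the sample data there are four groups, hence 1 to 4.)
val parts = (1 to 4).map(i => numbered.filter($"grp" === i).drop("grp"))
parts.foreach(part => process(part))
```

Note that a window with orderBy but no partitionBy pulls all rows onto a single executor, which is fine for small data but does not scale; for large data, collecting the distinct key pairs and filtering on each is a safer pattern.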