scala - 至少对于获取的批量数据,Spark-sql Pivoting 无法按预期工作
问题描述
透视在大多数情况下都不能正常工作,即增加源表记录。
source_df
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
|model_family_id|classification_type|classification_value|benchmark_type_code| data_date|data_item_code|data_item_value_numeric|data_item_value_string|fiscal_year|fiscal_quarter| create_date|last_update_date|create_user_txt|update_user_txt|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
| 1| COUNTRY| HKG| MEAN|2017-12-31 00:00:00| CREDITSCORE| 13| bb-| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| OBS_CNT|2017-12-31 00:00:00| CREDITSCORE| 649| aa| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| OBS_CNT_CA|2017-12-31 00:00:00| CREDITSCORE| 649| null| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_0|2017-12-31 00:00:00| CREDITSCORE| 3| aa| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_10|2017-12-31 00:00:00| CREDITSCORE| 8| bbb+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_100|2017-12-31 00:00:00| CREDITSCORE| 23| d| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_25|2017-12-31 00:00:00| CREDITSCORE| 11| bb+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_50|2017-12-31 00:00:00| CREDITSCORE| 14| b+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_75|2017-12-31 00:00:00| CREDITSCORE| 15| b| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_90|2017-12-31 00:00:00| CREDITSCORE| 17| ccc+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
我试过下面的代码
val pivot_df = source_df.groupBy("model_family_id","classification_type","classification_value" ,"data_item_code","data_date","fiscal_year","fiscal_quarter" , "create_user_txt", "create_date")
.pivot("benchmark_type_code" ,
Seq("mean","obs_cnt","obs_cnt_ca","percentile_0","percentile_10","percentile_25","percentile_50","percentile_75","percentile_90","percentile_100")
)
.agg( first(
when( col("data_item_code") === "CREDITSCORE" , col("data_item_value_string"))
.otherwise(col("data_item_value_numeric"))
)
)
我的结果低于结果,不确定我的代码有什么问题。
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
|model_family_id|classification_type|classification_value|data_item_code| data_date|fiscal_year|fiscal_quarter|create_user_txt| create_date|mean|obs_cnt|obs_cnt_ca|percentile_0|percentile_10|percentile_25|percentile_50|percentile_75|percentile_90|percentile_100|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
| 1| COUNTRY| HKG| CREDITSCORE|2017-12-31 00:00:00| 2017| 4| LOAD|2018-03-31 14:04:18|null| null| null| null| null| null| null| null| null| null|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
我尝试在数据透视函数中不使用 Seq 列。但它仍然没有像预期的那样旋转,请帮忙???
2) 在 when 子句中,如果透视列是 $"benchmark_type_code" === 'OBS_CNT' | 'OBS_CNT' 那么它应该需要 $data_item_value_numeric 。如何做到这一点?
解决方案
我不确定您的 spark 版本是 2.X。我的软件版本如下: spark==>2.2.1 scala==>2.11 根据上面,我得到了正确答案:
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
|model_family_id|classification_type|classification_value|data_item_code| data_date|fiscal_year|fiscal_quarter|create_user_txt| create_date|MEAN|OBS_CNT|OBS_CNT_CA|PERCENTILE_0|PERCENTILE_10|PERCENTILE_100|PERCENTILE_25|PERCENTILE_50|PERCENTILE_75|PERCENTILE_90|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
| 1| COUNTRY| HKG| CREDITSCORE|2017-12-31 00:00:00| 2017| 4| LOAD|2018-03-31 14:04:18| bb-| aa| | aa| bbb+| d| bb+| b+| b| ccc+|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
这是我的代码,你可以试试
import spark.implicits._
source_df
.groupBy($"model_family_id",$"classification_type",$"classification_value",$"data_item_code",$"data_date",$"fiscal_year",$"fiscal_quarter",$"create_user_txt",$"create_date")
.pivot("benchmark_type_code")
.agg(
first(
when($"data_item_code"==="CREDITSCORE", $"data_item_value_string")
.otherwise($"data_item_value_numeric")
)
).show()
推荐阅读
- turn - 如何使用转服务器?
- javascript - 编写一个程序,要求用户输入成对的数字,直到他们输入“退出”。使用函数添加数字
- powershell - PSReadLine 在向上箭头上一个命令执行后打破向下箭头历史
- javascript - 如何将生成的变量从一个 html 文件传递到另一个
- python - 如何解析用户输入的参数?Python
- sql - 此子选择的 LINQ 等效项
- netcdf - 平均多个 .nc 文件时,Netcdf 平均时间变量
- html - 我做的事情不是让sidenav模式根据屏幕宽度动态切换吗?
- angular - 尝试将 DatePicker 导入 app.module.ts 时出现离子错误
- java - 在 Windows 上使用 Ubuntu 的子系统进行“make”会产生 java 错误