sql - 如何在特定时间范围内从火花数据帧中选择最大值
问题描述
我有一些货物的记录。container_no
具有、origin
、destination
和列的记录shipment_dt
集volume
。
有多个记录相同container_no
,并且有可能同一容器已在不同日期发货,但如果shipment_dt
落在 10 天的时间跨度内,则检查origin
,container_no
如果所有记录具有不同的来源,则删除所有记录具有相同的 container_no 并且在 10 天的时间跨度内,否则选择交易量最高的记录。
我已经写了一个查询来获取十天的时间范围,但是不知道如何比较起点和终点并获取最大量的记录。
用于创建数据框的示例输入查询:
val Input_DF = spark.sql("""
SELECT '12345' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-10' AS shipment_dt , 20 as volume UNION
SELECT '12345' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-10' AS shipment_dt , 30 as volume UNION
SELECT '12345' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-12' AS shipment_dt , 10 as volume UNION
SELECT '12345' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-25' AS shipment_dt , 20 as volume UNION
SELECT '12345' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-26' AS shipment_dt , 10 as volume UNION
SELECT '12346' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-15' AS shipment_dt , 20 as volume UNION
SELECT '12346' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-16' AS shipment_dt , 20 as volume UNION
SELECT '12346' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-17' AS shipment_dt , 50 as volume UNION
SELECT '12347' AS container_no , 'India' AS origin ,'China' AS destination , '2020-10-18' AS shipment_dt , 20 as volume UNION
SELECT '12347' AS container_no , 'Nepal' AS origin ,'China' AS destination , '2020-10-19' AS shipment_dt , 21 as volume""")
Input_DF.createOrReplaceTempView("Input_DF")
查询以创建 10 天的数据时间范围:
val output_df = spark.sql("""
SELECT
B.* ,
CASE
WHEN from_prev BETWEEN 0 AND 9
THEN 1
ELSE 0
END AS recent ,
floor(from_first / 10 ) AS recent_group
FROM
(
SELECT
A.*,
NVL(DATEDIFF(shipment_dt,FIRST(shipment_dt) over(partition BY container_no
ORDER BY shipment_dt ASC)) ,0) AS from_first,
NVL(DATEDIFF(shipment_dt,lag(shipment_dt,1) over(partition BY container_no
ORDER BY shipment_dt ASC)) ,0) from_prev
FROM
Input_DF A) B
ORDER BY
container_no,
shipment_dt""")
在示例输入屏幕截图中,我添加了一个额外的列来解释一行与具有相同 container_no 但日期不同的另一记录的不同之处。提前致谢。
解决方案
推荐阅读
- r - 如何在 R 中对独立性(按比例)进行统计双边检验?
- linux - gunzip 模块实际上是否默认包含在 Nginx 中?
- delphi - 使用月历 Delphi 计算天数
- pyomo - 如何在 Pyomo 模型中按数字查找约束?
- browser - P5.JS Sketch 不会在浏览器中加载
- python - model_evaluation_utils 混淆矩阵出错
- scala - Spark:对象不可序列化
- python - 为什么我得到退出代码 -1 却没有在 IDE 中显示任何错误?
- python - 如何从 PDF(Python PDFminer)中提取指数?
- reactjs - 未触发 useEffect