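apache-spark - Selecting the nth row after orderBy in a PySpark DataFrame
Question
I want to select the second row for each name. I use orderBy to sort by name and then by purchase datetime/timestamp. What matters is that I get the second purchase (by datetime) for each name.
Here is the data used to build the DataFrame: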
from datetime import datetime

data = [
('George', datetime(2020, 3, 24, 3, 19, 58), datetime(2018, 2, 24, 3, 22, 55)),
('Andrew', datetime(2019, 12, 12, 17, 21, 30), datetime(2019, 7, 21, 2, 14, 22)),
('Micheal', datetime(2018, 11, 22, 13, 29, 40), datetime(2018, 5, 17, 8, 10, 19)),
('Maggie', datetime(2019, 2, 8, 3, 31, 23), datetime(2019, 5, 19, 6, 11, 33)),
('Ravi', datetime(2019, 1, 1, 4, 19, 47), datetime(2019, 1, 1, 4, 22, 55)),
('Xien', datetime(2020, 3, 2, 4, 33, 51), datetime(2020, 5, 21, 7, 11, 50)),
('George', datetime(2020, 3, 24, 3, 19, 58), datetime(2020, 3, 24, 3, 22, 45)),
('Andrew', datetime(2019, 12, 12, 17, 21, 30), datetime(2019, 9, 19, 1, 14, 11)),
('Micheal', datetime(2018, 11, 22, 13, 29, 40), datetime(2018, 8, 19, 7, 11, 37)),
('Maggie', datetime(2019, 2, 8, 3, 31, 23), datetime(2018, 2, 19, 6, 11, 42)),
('Ravi', datetime(2019, 1, 1, 4, 19, 47), datetime(2019, 1, 1, 4, 22, 17)),
('Xien', datetime(2020, 3, 2, 4, 33, 51), datetime(2020, 6, 21, 7, 11, 11)),
('George', datetime(2020, 3, 24, 3, 19, 58), datetime(2020, 4, 24, 3, 22, 54)),
('Andrew', datetime(2019, 12, 12, 17, 21, 30), datetime(2019, 8, 30, 3, 12, 41)),
('Micheal', datetime(2018, 11, 22, 13, 29, 40), datetime(2017, 5, 17, 8, 10, 38)),
('Maggie', datetime(2019, 2, 8, 3, 31, 23), datetime(2020, 3, 19, 6, 11, 12)),
('Ravi', datetime(2019, 1, 1, 4, 19, 47), datetime(2018, 2, 1, 4, 22, 24)),
('Xien', datetime(2020, 3, 2, 4, 33, 51), datetime(2018, 9, 21, 7, 11, 41)),
]
df = sqlContext.createDataFrame(data, ['name', 'trial_start', 'purchase'])
df.show(truncate=False)
I sort the data by name, then by purchase:
df.orderBy("name","purchase").show()
which produces:
+-------+-------------------+-------------------+
|   name|        trial_start|           purchase|
+-------+-------------------+-------------------+
| Andrew|2019-12-12 22:21:30|2019-07-21 06:14:22|
| Andrew|2019-12-12 22:21:30|2019-08-30 07:12:41|
| Andrew|2019-12-12 22:21:30|2019-09-19 05:14:11|
| George|2020-03-24 07:19:58|2018-02-24 08:22:55|
| George|2020-03-24 07:19:58|2020-03-24 07:22:45|
| George|2020-03-24 07:19:58|2020-04-24 07:22:54|
| Maggie|2019-02-08 08:31:23|2018-02-19 11:11:42|
| Maggie|2019-02-08 08:31:23|2019-05-19 10:11:33|
| Maggie|2019-02-08 08:31:23|2020-03-19 10:11:12|
|Micheal|2018-11-22 18:29:40|2017-05-17 12:10:38|
|Micheal|2018-11-22 18:29:40|2018-05-17 12:10:19|
|Micheal|2018-11-22 18:29:40|2018-08-19 11:11:37|
|   Ravi|2019-01-01 09:19:47|2018-02-01 09:22:24|
|   Ravi|2019-01-01 09:19:47|2019-01-01 09:22:17|
|   Ravi|2019-01-01 09:19:47|2019-01-01 09:22:55|
|   Xien|2020-03-02 09:33:51|2018-09-21 11:11:41|
|   Xien|2020-03-02 09:33:51|2020-05-21 11:11:50|
|   Xien|2020-03-02 09:33:51|2020-06-21 11:11:11|
+-------+-------------------+-------------------+
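How can I get the second row for each name? In pandas this is easy: I can just use nth. I have been looking at SQL but have not found a solution. Any suggestions are appreciated.
The output I am looking for is: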
+-------+-------------------+-------------------+
|   name|        trial_start|           purchase|
+-------+-------------------+-------------------+
| Andrew|2019-12-12 22:21:30|2019-08-30 07:12:41|
| George|2020-03-24 07:19:58|2020-03-24 07:22:45|
| Maggie|2019-02-08 08:31:23|2019-05-19 10:11:33|
|Micheal|2018-11-22 18:29:40|2018-05-17 12:10:19|
|   Ravi|2019-01-01 09:19:47|2019-01-01 09:22:17|
|   Xien|2020-03-02 09:33:51|2020-05-21 11:11:50|
+-------+-------------------+-------------------+
Solution
Try the window function row_number(): number the rows within each name, ordered by purchase, then keep only the rows whose row number is 2.
Example:
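from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
# Number the rows within each name, ordered by purchase time
w = Window.partitionBy("name").orderBy(col("purchase"))
# Keep the second purchase per name, then drop the helper column
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 2).drop("rn").show()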
SQL API:
df.createOrReplaceTempView("tmp")
# Treat backquoted names as regexes so that `(rn)?+.+` matches every column except rn
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql("select `(rn)?+.+` from (select *, row_number() over (partition by name order by purchase) rn from tmp) e where rn = 2").show()
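If it helps to see what the window is doing, here is a minimal plain-Python sketch of the same idea without Spark: sort by name and purchase, number the rows inside each name, and keep row 2. The helper name nth_row_per_group, the reduced two-column rows, and the extra name "Solo" are made up for illustration and are not part of the original data or of any Spark API.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def nth_row_per_group(rows, key_index, order_index, n):
    """Return the n-th row (1-based) of each group, ordered by the given column."""
    # Sorting by (group key, ordering column) mirrors partitionBy + orderBy.
    ordered = sorted(rows, key=itemgetter(key_index, order_index))
    result = []
    for _, group in groupby(ordered, key=itemgetter(key_index)):
        # enumerate(..., start=1) plays the role of row_number().
        for rn, row in enumerate(group, start=1):
            if rn == n:
                result.append(row)
                break
    return result


# A reduced (name, purchase) sample taken from the question's data.
rows = [
    ("George", datetime(2018, 2, 24, 3, 22, 55)),
    ("George", datetime(2020, 4, 24, 3, 22, 54)),
    ("George", datetime(2020, 3, 24, 3, 22, 45)),
    ("Ravi", datetime(2019, 1, 1, 4, 22, 55)),
    ("Ravi", datetime(2019, 1, 1, 4, 22, 17)),
    ("Ravi", datetime(2018, 2, 1, 4, 22, 24)),
    ("Solo", datetime(2020, 1, 1)),  # hypothetical name with a single purchase
]

second = nth_row_per_group(rows, key_index=0, order_index=1, n=2)
for name, purchase in second:
    print(name, purchase)
```

A name with fewer than n rows simply produces no output row, which is also what the rn = 2 filter does in the Spark versions above.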