pyspark - 多列上具有滞后条件的行号
问题描述
我想创建按 ACCOUNT、NAME 和 TYPE 分区的行号。
我尝试了密集的排名和行号。但是,我需要包含任何这些列中的更改的所有初始记录
df = spark.createDataFrame(
[
('20190910', 'A1', 'Linda', 'b2c'),
('20190911', 'A1', 'Tom', 'consultant'),
('20190912', 'A1', 'John', 'b2c'),
('20190913', 'A1', 'Tom', 'consultant'),
('20190914', 'A1', 'Tom', 'consultant'),
('20190915', 'A1', 'Linda', 'consultant'),
('20190916', 'A1', 'Linda', 'b2c'),
('20190917', 'B1', 'John', 'b2c'),
('20190916', 'B1', 'John', 'consultant'),
('20190910', 'B1', 'Linda', 'b2c'),
('20190911', 'B1', 'John', 'b2c'),
('20190915', 'C1', 'John', 'consultant'),
('20190916', 'C1', 'Linda', 'consultant'),
('20190917', 'C1', 'John', 'b2c'),
('20190916', 'C1', 'RJohn', 'consultant'),
('20190910', 'C1', 'Tom', 'b2c'),
('20190911', 'C1', 'John', 'b2c'),
],
['Event_date', 'account', 'name', 'type']
)
预期结果:
活动日期 | 帐户 | 姓名 | 类型 | 行号 |
---|---|---|---|---|
20190910 | A1 | 琳达 | b2c | 1 |
20190911 | A1 | 汤姆 | 顾问 | 1 |
20190912 | A1 | 约翰 | b2c | 1 |
20190913 | A1 | 汤姆 | 顾问 | 2 |
20190914 | A1 | 汤姆 | 顾问 | 3 |
20190915 | A1 | 琳达 | 顾问 | 1 |
20190916 | A1 | 琳达 | b2c | 2 |
20190917 | B1 | 约翰 | b2c | 1 |
20190916 | B1 | 约翰 | 顾问 | 1 |
20190910 | B1 | 琳达 | b2c | 2 |
20190911 | B1 | 约翰 | b2c | 3 |
20190915 | C1 | 约翰 | 顾问 | 1 |
20190916 | C1 | 琳达 | 顾问 | 1 |
20190917 | C1 | 约翰 | b2c | 1 |
20190916 | C1 | 约翰 | 顾问 | 2 |
20190910 | C1 | 汤姆 | b2c | 1 |
20190911 | C1 | 约翰 | b2c | 2 |
解决方案
您可以创建 a并按,Window
对其进行分区account
,然后对其进行分区。name
type
row_number
例子:
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
("20190910", "A1", "Linda", "b2c"),
("20190911", "A1", "Tom", "consultant"),
("20190912", "A1", "John", "b2c"),
("20190913", "A1", "Tom", "consultant"),
("20190914", "A1", "Tom", "consultant"),
("20190915", "A1", "Linda", "consultant"),
("20190916", "A1", "Linda", "b2c"),
("20190917", "B1", "John", "b2c"),
("20190916", "B1", "John", "consultant"),
("20190910", "B1", "Linda", "b2c"),
("20190911", "B1", "John", "b2c"),
("20190915", "C1", "John", "consultant"),
("20190916", "C1", "Linda", "consultant"),
("20190917", "C1", "John", "b2c"),
("20190916", "C1", "RJohn", "consultant"),
("20190910", "C1", "Tom", "b2c"),
("20190911", "C1", "John", "b2c"),
],
["Event_date", "account", "name", "type"],
)
w = Window.partitionBy("account", "name", "type").orderBy("Event_date")
df = df.withColumn("row_number", F.row_number().over(w)).orderBy("Event_date")
结果:
+----------+-------+-----+----------+----------+
|Event_date|account|name |type |row_number|
+----------+-------+-----+----------+----------+
|20190912 |A1 |John |b2c |1 |
|20190911 |A1 |Tom |consultant|1 |
|20190913 |A1 |Tom |consultant|2 |
|20190914 |A1 |Tom |consultant|3 |
|20190915 |A1 |Linda|consultant|1 |
|20190910 |A1 |Linda|b2c |1 |
|20190916 |A1 |Linda|b2c |2 |
|20190911 |B1 |John |b2c |1 |
|20190916 |B1 |John |consultant|1 |
|20190917 |B1 |John |b2c |2 |
|20190910 |B1 |Linda|b2c |1 |
|20190910 |C1 |Tom |b2c |1 |
|20190915 |C1 |John |consultant|1 |
|20190916 |C1 |RJohn|consultant|1 |
|20190911 |C1 |John |b2c |1 |
|20190916 |C1 |Linda|consultant|1 |
|20190917 |C1 |John |b2c |2 |
+----------+-------+-----+----------+----------+
它与您的预期结果不完全相同,因为它是由Event_date
and排序的account
。
您的预期输出似乎不一致。请再次检查数字,尤其是对于
B1
. 也在RJohn
输入数据中。
推荐阅读
- emacs - 限制 .dir-locals.el 搜索遍历子目录
- amazon-web-services - AWS CLI - aws ec2 describe-instances 为每个 EC2 实例检索密钥对
- javascript - 在 Phaser 3 中控制动画循环
- java - 多租户配置:事务上的 StaleObjectStateException (hibernate + spring-data-jpa)
- sql - 通过 Excel VBA 进行 SQL 查询的复杂 JOIN 子句中的语法错误
- odoo - 通过控制器在记录中创建日志注释
- c - 我可以比较 2 个 SDL_Surface(它们是否相同)
- android-studio - android模拟器启动(android studio)上的BSOD
- javascript - 如何在不使用箭头函数的情况下编写此函数?
- python - 当没有 NaN 时开始时间序列数据