首页 > 解决方案 > 多列上具有滞后条件的行号

问题描述

我想创建按 ACCOUNT、NAME 和 TYPE 分区的行号。

我尝试了密集的排名和行号。但是,我需要包含任何这些列中的更改的所有初始记录

    df = spark.createDataFrame(
       [
    ('20190910', 'A1', 'Linda', 'b2c'),
    ('20190911', 'A1', 'Tom', 'consultant'),
    ('20190912', 'A1', 'John', 'b2c'),
    ('20190913', 'A1', 'Tom', 'consultant'),
    ('20190914', 'A1', 'Tom', 'consultant'),
    ('20190915', 'A1', 'Linda', 'consultant'),
    ('20190916', 'A1', 'Linda', 'b2c'),
    ('20190917', 'B1', 'John', 'b2c'),
    ('20190916', 'B1', 'John', 'consultant'),
    ('20190910', 'B1', 'Linda', 'b2c'),
    ('20190911', 'B1', 'John', 'b2c'),
    ('20190915', 'C1', 'John', 'consultant'),
    ('20190916', 'C1', 'Linda', 'consultant'),
    ('20190917', 'C1', 'John', 'b2c'),
    ('20190916', 'C1', 'RJohn', 'consultant'),
    ('20190910', 'C1', 'Tom', 'b2c'),
    ('20190911', 'C1', 'John', 'b2c'),
     ],
    ['Event_date', 'account', 'name', 'type']
     )

预期结果:

活动日期 帐户 姓名 类型 行号
20190910 A1 琳达 b2c 1
20190911 A1 汤姆 顾问 1
20190912 A1 约翰 b2c 1
20190913 A1 汤姆 顾问 2
20190914 A1 汤姆 顾问 3
20190915 A1 琳达 顾问 1
20190916 A1 琳达 b2c 2
20190917 B1 约翰 b2c 1
20190916 B1 约翰 顾问 1
20190910 B1 琳达 b2c 2
20190911 B1 约翰 b2c 3
20190915 C1 约翰 顾问 1
20190916 C1 琳达 顾问 1
20190917 C1 约翰 b2c 1
20190916 C1 约翰 顾问 2
20190910 C1 汤姆 b2c 1
20190911 C1 约翰 b2c 2

标签: pysparklagrow-number

解决方案


您可以创建 a并按,Window对其进行分区account,然后对其进行分区。nametyperow_number

例子:

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("20190910", "A1", "Linda", "b2c"),
        ("20190911", "A1", "Tom", "consultant"),
        ("20190912", "A1", "John", "b2c"),
        ("20190913", "A1", "Tom", "consultant"),
        ("20190914", "A1", "Tom", "consultant"),
        ("20190915", "A1", "Linda", "consultant"),
        ("20190916", "A1", "Linda", "b2c"),
        ("20190917", "B1", "John", "b2c"),
        ("20190916", "B1", "John", "consultant"),
        ("20190910", "B1", "Linda", "b2c"),
        ("20190911", "B1", "John", "b2c"),
        ("20190915", "C1", "John", "consultant"),
        ("20190916", "C1", "Linda", "consultant"),
        ("20190917", "C1", "John", "b2c"),
        ("20190916", "C1", "RJohn", "consultant"),
        ("20190910", "C1", "Tom", "b2c"),
        ("20190911", "C1", "John", "b2c"),
    ],
    ["Event_date", "account", "name", "type"],
)
w = Window.partitionBy("account", "name", "type").orderBy("Event_date")
df = df.withColumn("row_number", F.row_number().over(w)).orderBy("Event_date")

结果:

+----------+-------+-----+----------+----------+                                
|Event_date|account|name |type      |row_number|
+----------+-------+-----+----------+----------+
|20190912  |A1     |John |b2c       |1         |
|20190911  |A1     |Tom  |consultant|1         |
|20190913  |A1     |Tom  |consultant|2         |
|20190914  |A1     |Tom  |consultant|3         |
|20190915  |A1     |Linda|consultant|1         |
|20190910  |A1     |Linda|b2c       |1         |
|20190916  |A1     |Linda|b2c       |2         |
|20190911  |B1     |John |b2c       |1         |
|20190916  |B1     |John |consultant|1         |
|20190917  |B1     |John |b2c       |2         |
|20190910  |B1     |Linda|b2c       |1         |
|20190910  |C1     |Tom  |b2c       |1         |
|20190915  |C1     |John |consultant|1         |
|20190916  |C1     |RJohn|consultant|1         |
|20190911  |C1     |John |b2c       |1         |
|20190916  |C1     |Linda|consultant|1         |
|20190917  |C1     |John |b2c       |2         |
+----------+-------+-----+----------+----------+

它与您的预期结果不完全相同,因为它是由Event_dateand排序的account

您的预期输出似乎不一致。请再次检查数字,尤其是对于B1. 也在RJohn输入数据中。


推荐阅读