Only show rows in a table when something changed from the previous row

Problem description

I have a table with a lot of records (6+ million), but for each ID most rows are identical.

Example:

Row  Date        ID  Col1  Col2  Col3  Col4  Col5
1    01-01-2021  1   a     b     c     d     e
2    02-01-2021  1   a     b     c     d     x
3    03-01-2021  1   a     b     c     d     x
4    04-01-2021  1   a     b     c     d     x
5    01-01-2021  2   a     b     c     d     e
6    02-01-2021  2   a     b     x     d     e
7    03-01-2021  2   a     b     x     d     e
8    01-01-2021  3   a     b     c     d     e
9    02-01-2021  3   a     b     c     d     e
10   03-01-2021  3   a     b     c     d     e

To save space, but also to make querying easier, I want to create a table that only keeps rows where something changed compared to the previous row (apart from the date).

For the table above, that means I only want to keep rows 1, 2, 5, 6 and 8, so the result would look like this:

Row  Date        ID  Col1  Col2  Col3  Col4  Col5
1    01-01-2021  1   a     b     c     d     e
2    02-01-2021  1   a     b     c     d     x
5    01-01-2021  2   a     b     c     d     e
6    02-01-2021  2   a     b     x     d     e
8    01-01-2021  3   a     b     c     d     e

I am trying to do this in PySpark and have had some success using LAG with a single column, but I am running into trouble when there are more columns (there are about 20 in my table). I would prefer to do this in PySpark, but a working version in SQL or Python would also work!
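
For reference, a single-column change check with LAG looks roughly like this (a sketch, using Col5 as the example column and assuming the table above is loaded as a DataFrame df):

from pyspark.sql import functions as F, Window

# Keep rows where Col5 differs from the previous row of the same ID;
# the first row per ID has no previous row, so lag() returns null there.
w = Window.partitionBy('ID').orderBy('Date')
changed = df.filter(
    F.coalesce(F.lag('Col5').over(w) != F.col('Col5'), F.lit(True))
)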

I was wondering if there are better ways to do this.

Tags: python, sql, apache-spark, pyspark, apache-spark-sql

Solution


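First, a minimal sketch to recreate the example data (assuming an active SparkSession named spark; dates are kept as dd-MM-yyyy strings for simplicity):

data = [
    (1, '01-01-2021', 1, 'a', 'b', 'c', 'd', 'e'),
    (2, '02-01-2021', 1, 'a', 'b', 'c', 'd', 'x'),
    (3, '03-01-2021', 1, 'a', 'b', 'c', 'd', 'x'),
    (4, '04-01-2021', 1, 'a', 'b', 'c', 'd', 'x'),
    (5, '01-01-2021', 2, 'a', 'b', 'c', 'd', 'e'),
    (6, '02-01-2021', 2, 'a', 'b', 'x', 'd', 'e'),
    (7, '03-01-2021', 2, 'a', 'b', 'x', 'd', 'e'),
    (8, '01-01-2021', 3, 'a', 'b', 'c', 'd', 'e'),
    (9, '02-01-2021', 3, 'a', 'b', 'c', 'd', 'e'),
    (10, '03-01-2021', 3, 'a', 'b', 'c', 'd', 'e'),
]
# NB: string dates in dd-MM-yyyy only sort correctly within a single month;
# for real data, parse Date with to_date() before ordering by it.
df = spark.createDataFrame(
    data, ['Row', 'Date', 'ID', 'Col1', 'Col2', 'Col3', 'Col4', 'Col5']
)
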
You can create a lagged array column of all columns of interest and compare it to the current row, then do a filter:

from pyspark.sql import functions as F, Window

cols = df.columns[3:]  # the value columns: everything after Row, Date and ID
w = Window.partitionBy('ID').orderBy('Date')

df2 = df.withColumn(
    'diff',
    F.coalesce(
        # pack the value columns into an array and compare it to the lagged array
        F.lag(F.array(*cols)).over(w) != F.array(*cols),
        F.lit(True)    # take care of first row where the lag is null
    )
).filter('diff').drop('diff')

df2.show()
+---+----------+---+----+----+----+----+----+
|Row|      Date| ID|Col1|Col2|Col3|Col4|Col5|
+---+----------+---+----+----+----+----+----+
|  1|01-01-2021|  1|   a|   b|   c|   d|   e|
|  2|02-01-2021|  1|   a|   b|   c|   d|   x|
|  5|01-01-2021|  2|   a|   b|   c|   d|   e|
|  6|02-01-2021|  2|   a|   b|   x|   d|   e|
|  8|01-01-2021|  3|   a|   b|   c|   d|   e|
+---+----------+---+----+----+----+----+----+
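
Since you mentioned that a SQL version would also work, here is a sketch of the same logic in Spark SQL (the temporary view name t is arbitrary, and the five value columns are listed explicitly):

df.createOrReplaceTempView('t')

df2_sql = spark.sql("""
    SELECT Row, Date, ID, Col1, Col2, Col3, Col4, Col5
    FROM (
        SELECT *,
               LAG(ARRAY(Col1, Col2, Col3, Col4, Col5))
                   OVER (PARTITION BY ID ORDER BY Date) AS prev
        FROM t
    ) s
    WHERE prev IS NULL OR ARRAY(Col1, Col2, Col3, Col4, Col5) <> prev
""")
df2_sql.show()

The IS NULL check plays the same role as the coalesce above: it keeps the first row of every ID, where LAG has nothing to compare against.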
