How to write a condition in PySpark that keeps a null Score only when an ID has no other rows, else removes nulls

Problem description

Condition: for each ID, drop the rows with a null Score unless all of that ID's Scores are null (as with BBB below).

Input:

ID Score
AAA High
AAA Mid
AAA None
BBB None

Desired output:

ID Score
AAA High
AAA Mid
BBB None

I'm having difficulty writing the if condition in PySpark. Is there another way to tackle this problem?
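
For reproducibility, here is a minimal sketch of building the input DataFrame (the variable name df matches the answer below; column names are taken from the table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the input table; Python None becomes a SQL null
df = spark.createDataFrame(
    [('AAA', 'High'), ('AAA', 'Mid'), ('AAA', None), ('BBB', None)],
    ['ID', 'Score'],
)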

Tags: python, apache-spark, pyspark, apache-spark-sql

Solution

You can add a per-ID flag indicating whether all of that ID's scores are null, then keep the rows where the score is not null or where the flag is true (i.e. every score for that ID is null):

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'flag',
    # isNull() yields booleans; min() over the ID partition is True
    # only when every Score in that partition is null
    F.min(F.col('Score').isNull()).over(Window.partitionBy('ID'))
).filter('flag or Score is not null').drop('flag')

df2.show()
+---+-----+
| ID|Score|
+---+-----+
|BBB| null|
|AAA| High|
|AAA|  Mid|
+---+-----+
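
Since the question asks for other ways: an equivalent sketch using F.count, which skips nulls, so a per-ID count of zero means every Score for that ID is null (the column name non_null_cnt is my own):

from pyspark.sql import functions as F, Window

# count('Score') ignores nulls, so a zero count means the ID has only null Scores
df3 = df.withColumn(
    'non_null_cnt',
    F.count('Score').over(Window.partitionBy('ID'))
).filter('Score is not null or non_null_cnt = 0').drop('non_null_cnt')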
