python - How to apply a condition in PySpark to keep null only if an ID has no other Score, else remove nulls
Problem Description
Condition:
- If an ID has a Score of 'High' or 'Mid' -> remove its None rows
- If an ID only has Score None -> keep the None row
Input:
| ID  | Score |
|-----|-------|
| AAA | High  |
| AAA | Mid   |
| AAA | None  |
| BBB | None  |
Desired output:
| ID  | Score |
|-----|-------|
| AAA | High  |
| AAA | Mid   |
| BBB | None  |
I'm having difficulty writing the if condition in PySpark. Is there another way to tackle this problem?
Solution
You can add a flag indicating whether all scores for an ID are null, then keep the rows where the score is not null or the flag is true (i.e. all of that ID's scores are null):
```python
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'flag',
    F.min(F.col('Score').isNull()).over(Window.partitionBy('ID'))
).filter('flag or Score is not null').drop('flag')

df2.show()
```

```
+---+-----+
| ID|Score|
+---+-----+
|BBB| null|
|AAA| High|
|AAA| Mid|
+---+-----+
```
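The window trick works because `F.min` over booleans returns true only when every `Score` in the partition is null. If it helps to sanity-check the keep/drop rule before running it on Spark, the same logic can be sketched in plain Python (the `filter_scores` helper and sample rows below are illustrative, not part of the original answer):

```python
from collections import defaultdict

def filter_scores(rows):
    """Drop None scores, except for IDs whose scores are all None."""
    # First pass: does this ID have only None scores? (mirrors the window flag)
    all_null = defaultdict(lambda: True)
    for id_, score in rows:
        all_null[id_] = all_null[id_] and score is None
    # Second pass: keep non-null scores, or any row whose ID is all-null
    return [(i, s) for i, s in rows if s is not None or all_null[i]]

rows = [('AAA', 'High'), ('AAA', 'Mid'), ('AAA', None), ('BBB', None)]
print(filter_scores(rows))  # [('AAA', 'High'), ('AAA', 'Mid'), ('BBB', None)]
```

This mirrors the two-step structure of the Spark answer: compute a per-ID "all null" flag, then filter on `score is not None or flag`.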