python - Add new column in Pyspark dataframe based on where condition on other column
Problem description
I have a Pyspark data frame as follows:
+------------+--------------+-------------------+
| package_id | location     | package_scan_code |
+------------+--------------+-------------------+
| 123        | Denver       | 05                |
| 123        | LosAngeles   | 03                |
| 123        | Dallas       | 09                |
| 123        | Vail         | 02                |
| 456        | Jacksonville | 05                |
| 456        | Nashville    | 09                |
| 456        | Memphis      | 03                |
+------------+--------------+-------------------+
"package_scan_code" 03 represents the origin of the package.
I want to add a column "origin" to this dataframe such that for each package (identified by "package_id"), the values in the newly added origin column would be the same location that corresponds to "package_scan_code" 03.
In the above case, there are two unique packages 123 and 456, and they have origins as LosAngeles and Memphis respectively (corresponding to package_scan_code 03).
So I want my output to be as follows:
+------------+--------------+-------------------+------------+
| package_id | location     | package_scan_code | origin     |
+------------+--------------+-------------------+------------+
| 123        | Denver       | 05                | LosAngeles |
| 123        | LosAngeles   | 03                | LosAngeles |
| 123        | Dallas       | 09                | LosAngeles |
| 123        | Vail         | 02                | LosAngeles |
| 456        | Jacksonville | 05                | Memphis    |
| 456        | Nashville    | 09                | Memphis    |
| 456        | Memphis      | 03                | Memphis    |
+------------+--------------+-------------------+------------+
How can I achieve this in PySpark? I tried the .withColumn method, but I could not get the condition right.
Solution
Filter the data frame by package_scan_code == '03' and then join back with the original data frame:
# Keep only the origin scans, rename location to origin, then join back on package_id.
(df.filter(df.package_scan_code == '03')
   .selectExpr('package_id', 'location as origin')
   .join(df, ['package_id'], how='right')
   .show())
+----------+----------+------------+-----------------+
|package_id| origin| location|package_scan_code|
+----------+----------+------------+-----------------+
| 123|LosAngeles| Denver| 05|
| 123|LosAngeles| LosAngeles| 03|
| 123|LosAngeles| Dallas| 09|
| 123|LosAngeles| Vail| 02|
| 456| Memphis|Jacksonville| 05|
| 456| Memphis| Nashville| 09|
| 456| Memphis| Memphis| 03|
+----------+----------+------------+-----------------+
Note: the join approach assumes there is at most one row with package_scan_code equal to 03 per package_id. Otherwise the join repeats that package's rows once per matching origin, and you need to rethink how origin should be defined.