首页 > 解决方案 > 确定 pyspark DataFrame 行值是否存在于其他列中

问题描述

我正在使用 pyspark 中的数据框,如果数据框的其他列中存在值,则需要逐行评估。例如,给定这个数据框:

东风

+---------+--------------+-------+-------+-------+
|Subject  |SubjectTotal  |TypeA  |TypeB  |TypeC  |
+---------+--------------+-------+-------+-------+
|Subject1 |10            |5      |3      |2      |
+---------+--------------+-------+-------+-------+
|Subject2 |15            |0      |15     |0      |
+---------+--------------+-------+-------+-------+
|Subject3 |5             |0      |0      |5      |
+---------+--------------+-------+-------+-------+

作为输出,我需要确定哪个 Type 具有 100% 的SubjectTotal. 所以我的输出看起来像这样:

df_output

+---------+--------------+
|Subject  |Type          |
+---------+--------------+
|Subject2 |TypeB         |
+---------+--------------+
|Subject3 |TypeC         |
+---------+--------------+

甚至可能吗?

谢谢!

标签: dataframeapache-sparkpysparkapache-spark-sql

解决方案


哟可以尝试在 SQL 中使用when().otherwise()PySpark SQL 函数或case语句

import pyspark.sql.functions as F
df = spark.createDataFrame(
[
  ("Subject1", 10, 5, 3, 2),
  ("Subject2", 15, 0, 15, 0),
  ("Subject3", 5, 0, 0, 5)
],
("subject", "subjectTotal", "TypeA", "TypeB", "TypeC"))
df.show()

+--------+------------+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC|
+--------+------------+-----+-----+-----+
|Subject1|          10|    5|    3|    2|
|Subject2|          15|    0|   15|    0|
|Subject3|           5|    0|    0|    5|
+--------+------------+-----+-----+-----+

df.withColumn("Type", F.
              when(F.col("subjectTotal") == F.col("TypeA"), "TypeA").
             when(F.col("subjectTotal") == F.col("TypeB"), "TypeB").
             when(F.col("subjectTotal") == F.col("TypeC"), "TypeC").
             otherwise(None)).show()

+--------+------------+-----+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC| Type|
+--------+------------+-----+-----+-----+-----+
|Subject1|          10|    5|    3|    2| null|
|Subject2|          15|    0|   15|    0|TypeB|
|Subject3|           5|    0|    0|    5|TypeC|
+--------+------------+-----+-----+-----+-----+

推荐阅读