Spark DataFrame cannot compare null values

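Problem description

Hi everyone, I have two DataFrames. I compare their values and, based on the comparison, assign a value in a new DataFrame. Every scenario works except the null comparison: if the value is null in both DataFrames, the result should be 'Varified', but I get 'Not Varified'. Below are my DataFrames, the code I am using, and the resulting DataFrame.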

    scala> df1.show()
    +---+-----+---+--------+------+-------+
    | id| name|age|lastname|  city|country|
    +---+-----+---+--------+------+-------+
    |  1|rohan| 26|  sharma|mumbai|  india|
    |  2|rohan| 26|  sharma|  null|  india|
    |  3|rohan| 26|    null|mumbai|  india|
    |  4|rohan| 26|  sharma|mumbai|  india|
    +---+-----+---+--------+------+-------+
    scala> df2.show()
    +----+------+-----+----------+------+---------+
    |o_id|o_name|o_age|o_lastname|o_city|o_country|
    +----+------+-----+----------+------+---------+
    |   1| rohan|   26|    sharma|mumbai|    india|
    |   2| rohan|   26|    sharma|  null|    india|
    |   3| rohan|   26|    sharma|mumbai|    india|
    |   4| rohan|   26|      null|mumbai|    india|
    +----+------+-----+----------+------+---------+

    val df3 = df1.join(df2, df1("id") === df2("o_id"))
      .withColumn("result", when(
        df1("name") === df2("o_name") &&
        df1("age") === df2("o_age") &&
        df1("lastname") === df2("o_lastname") &&
        df1("city") === df2("o_city") &&
        df1("country") === df2("o_country"), "Varified")
      .otherwise("Not Varified"))
    // Call show() separately so that df3 stays a DataFrame (show() returns Unit).
    df3.show()

    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    | id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    |  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
    |  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|Not Varified|
    |  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
    |  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

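I expect id '2' to show 'Varified' as well. But because city is null in both DataFrames, it shows 'Not Varified'. Can someone guide me on how to modify my df3 query so that it also handles nulls and id '2' shows 'Varified' in the result column?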

Tags: scala, apache-spark, apache-spark-sql, pyspark-dataframes

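Solution

Use <=> instead of ===. With ===, comparing null to null evaluates to null rather than true, so the when condition is not satisfied and the row falls through to otherwise. The <=> operator is null-safe equality: it returns true when both sides are null and false when only one side is null.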

    val df3 = df1.join(df2, df1("id") === df2("o_id"))
      .withColumn("result", when(
        df1("name") <=> df2("o_name") &&
        df1("age") <=> df2("o_age") &&
        df1("lastname") <=> df2("o_lastname") &&
        df1("city") <=> df2("o_city") &&
        df1("country") <=> df2("o_country"), "Varified")
      .otherwise("Not Varified"))
    // Call show() separately so that df3 stays a DataFrame (show() returns Unit).
    df3.show()

A quick check that <=> treats two nulls as equal:

    spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show

Result

    +-----------------+
    |(city1 <=> city2)|
    +-----------------+
    |             true|
    +-----------------+
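When many columns have to be compared, the null-safe condition can also be built from a list of column pairs instead of being written out by hand. The following is a minimal sketch, not the original answer's code. It assumes a local SparkSession and rebuilds the sample data from the question with Option for the nullable columns. The names columnPairs and allMatch are hypothetical helpers introduced only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-safe-compare")
  .getOrCreate()
import spark.implicits._

// Sample data mirroring the question; Option models the nullable columns.
val df1 = Seq(
  (1, "rohan", 26, Option("sharma"), Option("mumbai"), "india"),
  (2, "rohan", 26, Option("sharma"), Option.empty[String], "india"),
  (3, "rohan", 26, Option.empty[String], Option("mumbai"), "india"),
  (4, "rohan", 26, Option("sharma"), Option("mumbai"), "india")
).toDF("id", "name", "age", "lastname", "city", "country")

val df2 = Seq(
  (1, "rohan", 26, Option("sharma"), Option("mumbai"), "india"),
  (2, "rohan", 26, Option("sharma"), Option.empty[String], "india"),
  (3, "rohan", 26, Option("sharma"), Option("mumbai"), "india"),
  (4, "rohan", 26, Option.empty[String], Option("mumbai"), "india")
).toDF("o_id", "o_name", "o_age", "o_lastname", "o_city", "o_country")

// Hypothetical helper: the column pairs that must match.
val columnPairs = Seq(
  "name" -> "o_name",
  "age" -> "o_age",
  "lastname" -> "o_lastname",
  "city" -> "o_city",
  "country" -> "o_country"
)

// Combine one null-safe comparison per pair into a single boolean Column.
val allMatch = columnPairs
  .map { case (left, right) => df1(left) <=> df2(right) }
  .reduce(_ && _)

val df3 = df1.join(df2, df1("id") === df2("o_id"))
  .withColumn("result", when(allMatch, "Varified").otherwise("Not Varified"))

df3.select("id", "result").orderBy("id").show()
```

With this data, ids 1 and 2 come out as 'Varified', while ids 3 and 4 stay 'Not Varified' because lastname is null on only one side. The method form df1(left).eqNullSafe(df2(right)) is equivalent to <=> and may read better to people unfamiliar with the operator.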
