How to do arithmetic operations on a dataframe in PySpark?

Problem description

I need to verify that the code I have written is correct. To do that, I have to implement the following formula:

(nvl(units_inflow,0)- nvl(units_inflow_can,0)-nvl(units_outflow,0)+nvl(units_outflow_can,0))*nav_value

This code is in Oracle SQL, and I need to do the same thing in PySpark. So far, in place of the nvl used in the code above, I have used fill() in PySpark to replace null values with 0.

My t3 dataframe has these 5 columns, namely:

["units_inflow","units_inflow_can","units_outflow","units_outflow_can","nav_value"]

The code I have written so far is:

t3= t3.na.fill(value=0,subset=["units_inflow","units_inflow_can","units_outflow","units_outflow_can"])
z = t3.select("units_inflow").groupby().sum().show()

y = t3.select("units_inflow_can").groupby().sum().show()

x = t3.select("units_outflow").groupby().sum().show()

w = t3.select("units_outflow_can").groupby().sum().show()

u = t3.select("nav_value").groupby().sum().collect()

print(u)

However, after doing all of this I could not get the output. I think I have gone wrong somewhere in converting the code. Taking the sum output of each column, I did the arithmetic separately on a calculator to check the result.

Tags: apache-spark, pyspark, apache-spark-sql

Solution


Oracle's nvl function is the same as coalesce, so you can keep the formula as it is and simply replace the nvl calls:

from pyspark.sql import functions as F

t3.select(
    (
        F.coalesce(F.col("units_inflow"), F.lit(0)) -
        F.coalesce(F.col("units_inflow_can"), F.lit(0)) -
        F.coalesce(F.col("units_outflow"), F.lit(0)) +
        F.coalesce(F.col("units_outflow_can"), F.lit(0))
    ) * F.col("nav_value")
).show()
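
If you also want a single total to compare against the sums you computed on the calculator, you can aggregate the same expression. The sketch below is one way to do it; the delta_value alias is just an illustrative name, not something from the original question:

from pyspark.sql import functions as F

# Per-row formula, identical to the select above
expr = (
    F.coalesce(F.col("units_inflow"), F.lit(0)) -
    F.coalesce(F.col("units_inflow_can"), F.lit(0)) -
    F.coalesce(F.col("units_outflow"), F.lit(0)) +
    F.coalesce(F.col("units_outflow_can"), F.lit(0))
) * F.col("nav_value")

# Sum the per-row results into one number for cross-checking
t3.agg(F.sum(expr).alias("delta_value")).show()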

Alternatively, the same formula can be written as a SQL expression:

t3.select(
    F.expr("""(
            coalesce(units_inflow, 0) - coalesce(units_inflow_can, 0) -
            coalesce(units_outflow, 0) + coalesce(units_outflow_can, 0)
           ) * nav_value
    """)
).show()
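
To sanity-check that coalesce behaves like nvl, a small self-contained example can help; the sample rows below are invented for illustration and are not from the original data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data with some nulls (values are made up for illustration only)
t3 = spark.createDataFrame(
    [(10.0, None, 2.0, None, 1.5),
     (None, 1.0, None, 0.5, 2.0)],
    ["units_inflow", "units_inflow_can", "units_outflow", "units_outflow_can", "nav_value"],
)

t3.select(
    F.expr("""(
            coalesce(units_inflow, 0) - coalesce(units_inflow_can, 0) -
            coalesce(units_outflow, 0) + coalesce(units_outflow_can, 0)
           ) * nav_value""").alias("result")
).show()
# Expected: (10 - 0 - 2 + 0) * 1.5 = 12.0 and (0 - 1 - 0 + 0.5) * 2.0 = -1.0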
