apache-spark - How do I perform arithmetic operations on a dataframe in PySpark?
Question
I need to verify that the code I wrote is correct. To do so, I have to implement the following formula:
(nvl(units_inflow,0) - nvl(units_inflow_can,0) - nvl(units_outflow,0) + nvl(units_outflow_can,0)) * nav_value
This formula is Oracle SQL, and I need to do the same thing in PySpark. So far, in place of the nvl used above, I have used fill() in PySpark to replace null values with 0.
My t3 dataframe has these 5 columns:
["units_inflow","units_inflow_can","units_outflow","units_outflow_can","nav_value"]
The code I have written so far is:
t3 = t3.na.fill(value=0, subset=["units_inflow", "units_inflow_can", "units_outflow", "units_outflow_can"])
z = t3.select("units_inflow").groupby().sum().show()
y = t3.select("units_inflow_can").groupby().sum().show()
x = t3.select("units_outflow").groupby().sum().show()
w = t3.select("units_outflow_can").groupby().sum().show()
u = t3.select("nav_value").groupby().sum().collect()
print(u)
After doing all this, however, I am unable to get the output. I think I went wrong somewhere while converting the code. Taking the sum shown for each column, I performed the arithmetic separately on a calculator.
Solution
Oracle's nvl function is the same as coalesce, so you can keep the formula unchanged and simply replace the nvl calls:
from pyspark.sql import functions as F

t3.select(
    (
        F.coalesce(F.col("units_inflow"), F.lit(0)) -
        F.coalesce(F.col("units_inflow_can"), F.lit(0)) -
        F.coalesce(F.col("units_outflow"), F.lit(0)) +
        F.coalesce(F.col("units_outflow_can"), F.lit(0))
    ) * F.col("nav_value")
).show()
Or, using a SQL expression:
t3.select(
    F.expr("""(
        coalesce(units_inflow, 0) - coalesce(units_inflow_can, 0) -
        coalesce(units_outflow, 0) + coalesce(units_outflow_can, 0)
    ) * nav_value""")
).show()
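To sanity-check the formula itself without a Spark session, the same nvl/coalesce arithmetic can be reproduced in plain Python on a couple of sample rows (the values below are made up purely for illustration):

```python
def nvl(x, default=0):
    # Mimics Oracle NVL / Spark coalesce(col, lit(0)):
    # fall back to the default when the value is null (None)
    return default if x is None else x

rows = [
    # (units_inflow, units_inflow_can, units_outflow, units_outflow_can, nav_value)
    (100.0, None, 30.0, 10.0, 2.0),
    (None, 5.0, None, None, 3.0),
]

results = [
    (nvl(ui) - nvl(uic) - nvl(uo) + nvl(uoc)) * nav
    for ui, uic, uo, uoc, nav in rows
]
print(results)  # [160.0, -15.0]
```

Each element matches what the PySpark select above would compute per row, which makes it easy to compare against the calculator result.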