Why does this Spark SQL query return null for the sum of a column?

Problem description

My input:

+-------------------+------------------+-----------------+
|    TransactionDate|Product Major Code|Gross Trade Sales|
+-------------------+------------------+-----------------+
|2017-09-30 00:00:00|                 A|            100.0|
|2017-06-30 00:00:00|                 B|            200.0|
|2017-06-30 00:00:00|                 C|            300.0|
+-------------------+------------------+-----------------+

My code:

df.registerTempTable("tmp")
df2=spark.sql("SELECT TransactionDate,'Product Major Code', sum('Gross Trade Sales') FROM tmp GROUP BY TransactionDate,'Product Major Code'")
spark.catalog.dropTempView('tmp')

My output:

+-------------------+------------------+--------------------------------------+
|    TransactionDate|Product Major Code|sum(CAST(Gross Trade Sales AS DOUBLE))|
+-------------------+------------------+--------------------------------------+
|2017-09-30 00:00:00|Product Major Code|                                  null|
|2017-06-30 00:00:00|Product Major Code|                                  null|
+-------------------+------------------+--------------------------------------+

Does anyone know why it is not grouping by Product Major Code and summing Gross Trade Sales correctly?

Update:

In the end I went with PaulIT's answer below, since it is more elegant and there is no need to worry about backticks:

import pyspark.sql.functions as f

df.groupBy(f.col("TransactionDate"), f.col("Product Major Code")) \
  .agg(f.sum(f.col("Gross Trade Sales"))) \
  .show()
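Because f.col() takes the column name directly as a string, names containing spaces need no quoting at all in the DataFrame API.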

Tags: pyspark, pyspark-sql

Solution

In Spark SQL, single quotes delimit string literals, not identifiers. 'Product Major Code' in the query is therefore the constant string "Product Major Code" (which is why that literal appears in every output row), and sum('Gross Trade Sales') sums the literal string "Gross Trade Sales" cast to DOUBLE. Under Spark's default settings that cast yields null, so the sum is null. Column names containing spaces must be quoted with backticks instead.
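A minimal sketch of the corrected query, assuming the same DataFrame df and SparkSession spark as above (the output alias `Total Gross Trade Sales` is mine):

# Under Spark's default (non-ANSI) settings, casting a non-numeric string
# to DOUBLE yields null, which is where the null sums came from:
spark.sql("SELECT CAST('Gross Trade Sales' AS DOUBLE) AS v").show()  # prints null

# Quote the column names with backticks so they parse as identifiers:
df.createOrReplaceTempView("tmp")  # registerTempTable is deprecated since Spark 2.0
df2 = spark.sql("""
    SELECT TransactionDate,
           `Product Major Code`,
           sum(`Gross Trade Sales`) AS `Total Gross Trade Sales`
    FROM tmp
    GROUP BY TransactionDate, `Product Major Code`
""")
df2.show()
spark.catalog.dropTempView("tmp")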

