How to calculate the weekly average from a table in pyspark?

Problem description

My table looks like this:

+--------+-----+----+----------+
|  grDate|Items|rate|  unitName|
+--------+-----+----+----------+
|20131222|  SOY|34.7|Samastipur|
|20131223|  SOY|34.7|Samastipur|
|20131224|  SOY|34.7|Samastipur|
|20131225|  SOY|34.0|Samastipur|
|20131225|  SOY|34.7|Samastipur|
|20131227|  SOY|34.7|Samastipur|
|20131228|  SOY|34.7|Samastipur|
|20131229|  SOY|34.7|Samastipur|
|20131230|  SOY|34.6|Samastipur|
|20131230|  SOY|34.7|Samastipur|
|20131231|  SOY|34.7|Samastipur|
|20140101|  SOY|34.6|Samastipur|
|20140102|  SOY|34.6|Samastipur|
|20140103|  SOY|34.6|Samastipur|
|20140106|  SOY|34.6|Samastipur|
|20140107|  SOY|34.6|Samastipur|
|20140110|  SOY|33.9|Samastipur|
|20140111|  SOY|33.9|Samastipur|
|20140112|  SOY|33.9|Samastipur|
|20140113|  SOY|33.9|Samastipur|
+--------+-----+----+----------+
only showing top 20 rows

What I want is to compute the average of the rate column per week and create a new table with those values.

I am trying to run this query:

spjsoya.createOrReplaceTempView("spjsoya")
spark.sql("SELECT grDate , (sum(total)/(WEEK(MAX(rate)) - WEEK(MIN(rate))+1)) AS rate FROM spjsoya  WHERE YEAR(grDate)='2013' GROUP BY grDate").show()

But it gives me the error:

pyspark.sql.utils.AnalysisException: u"Undefined function: 'WEEK'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 29"

How can I write a correct query to compute the average per week?

Tags: python, sql, apache-spark, pyspark

Solution


There is no WEEK function in Spark SQL, which is why the query fails; instead, parse grDate into a date and use weekofyear. Here is code for your problem:

import pyspark.sql.functions as F

spjsoya.withColumn(
    # parse the yyyyMMdd string in grDate into a timestamp string
    'date_str',
    F.from_unixtime(
        F.unix_timestamp('grDate', 'yyyyMMdd')
    )
).withColumn(
    # week number within the year
    "week_num",
    F.weekofyear(F.col("date_str"))
).withColumn(
    # month is not used in the grouping, but kept for reference
    "month",
    F.month(F.col("date_str"))
).withColumn(
    # keep the year so weeks from 2013 and 2014 are not mixed together
    "year",
    F.year(F.col("date_str"))
).groupby(
    "year",
    "week_num"
).agg(
    # weekly average of the rate column
    F.avg("rate").alias("average_rate")
).orderBy(
    "year",
    "week_num"
).show()
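
If you would rather stay in Spark SQL instead of the DataFrame API, the same aggregation can be expressed with the built-in weekofyear and year functions. This is a minimal sketch (not part of the original answer), assuming the spjsoya temp view registered in the question and Spark 2.2+ for to_date with a format argument; on older versions you can substitute from_unixtime(unix_timestamp(...)) as in the DataFrame code above:

# Spark SQL sketch of the same weekly-average aggregation
spark.sql("""
    SELECT year(to_date(grDate, 'yyyyMMdd'))       AS year,
           weekofyear(to_date(grDate, 'yyyyMMdd')) AS week_num,
           avg(rate)                               AS average_rate
    FROM spjsoya
    GROUP BY year(to_date(grDate, 'yyyyMMdd')),
             weekofyear(to_date(grDate, 'yyyyMMdd'))
    ORDER BY year, week_num
""").show()

To materialize the result as a new table rather than just displaying it, you could assign the result to a DataFrame and then, for example, call createOrReplaceTempView or write.saveAsTable on it (any table name you pass there is your own choice, e.g. "weekly_rates").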
