How to do a lag over a window partition in pyspark, similar to SparkR example code

Problem description

I am trying to implement something in pyspark that is similar to the following SparkR code.

df <- createDataFrame(mtcars)
# Partition by am (transmission) and order by hp (horsepower)
ws <- orderBy(windowPartitionBy("am"), "hp")
# Lag mpg values by 1 row on the partition-and-ordered table
out <- select(df, over(lag(df$mpg), ws), df$mpg, df$hp, df$am)

Does anyone know how to do this on a pyspark dataframe?

Tags: python, dataframe, pyspark, apache-spark-sql, sparkr

Solution


from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import lag

spark = SparkSession.builder.getOrCreate()

# Create a small example dataframe
data = [("A", 10), ("B", 20), ("A", 30), ("C", 15)]
columns = ["Name", "Number"]
df = spark.createDataFrame(data, columns)

# Define the window: one partition per Name, rows ordered by Number
win = Window.partitionBy("Name").orderBy("Number")

# lag("Number", 1, -5) returns the previous Number within the partition;
# the first row of each partition has no previous row and gets the default -5
df_lag = df.withColumn("lag", lag("Number", 1, -5).over(win))
df_lag.show()
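
To mirror the original SparkR snippet more directly, the same pattern can be applied to the mtcars columns. pyspark does not ship with mtcars, so the sketch below builds a small stand-in dataframe by hand (the mtcars_df name and its rows are just for illustration); the window spec itself matches the SparkR example: partition by am, order by hp, lag mpg by one row.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import lag

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for a few mtcars rows: (mpg, hp, am)
mtcars_df = spark.createDataFrame(
    [(21.0, 110, 1), (22.8, 93, 1), (21.4, 110, 0), (18.7, 175, 0)],
    ["mpg", "hp", "am"],
)

# Partition by am (transmission) and order by hp (horsepower),
# then lag mpg by 1 row, as in the SparkR example
ws = Window.partitionBy("am").orderBy("hp")
out = mtcars_df.select(lag("mpg").over(ws).alias("lag_mpg"), "mpg", "hp", "am")
out.show()

Using select here reproduces the column layout of the SparkR over(lag(df$mpg), ws) call; withColumn, as in the answer above, is equivalent when you want to keep all existing columns and append the lagged one.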
