'DataFrame' object is not callable in pyspark

Problem

I want the names of the employees whose salary is higher than their department's average salary, in PySpark.

filt = df3.select('SALARY','Dept_name','First_name','Last_name')
# fails: filt('SALARY') calls the DataFrame itself -> TypeError: 'DataFrame' object is not callable
filt.filter(filt('SALARY').geq(filt.groupBy('Dept_name').agg(F.mean('SALARY')))).show()
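The error comes from `filt('SALARY')`: round parentheses *call* an object, they don't select a column (use `filt['SALARY']` or `F.col('SALARY')` for that). A minimal sketch, with no Spark dependency, of why Python raises this message for any object that does not define `__call__`:

```python
class FakeDataFrame:
    """Stands in for a pyspark DataFrame: no __call__ defined."""
    pass

obj = FakeDataFrame()

try:
    obj('SALARY')          # same mistake as filt('SALARY')
except TypeError as e:
    msg = str(e)
    print(msg)             # "... object is not callable"
```

The same fix applies in the question's code: replace `filt('SALARY')` with `F.col('SALARY')` (and compare against a per-department average rather than an aggregated DataFrame, as the solution below does).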

Tags: pyspark, apache-spark-sql

Solution


Create a sample dataframe:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

data=  [[200,'Marketing','Jane','Smith'],
        [140,'Marketing','Jerry','Soreky'],  
        [120,'Marketing','Justin','Sauren'],
        [170,'Sales','Joe','Statham'],
        [190,'Sales','Jeremy','Sage'],
        [220,'Sales','Jay','Sawyer']]
columns= ['SALARY','Dept_name','First_name','Last_name']
df= spark.createDataFrame(data,columns)

df.show()


+------+---------+----------+---------+
|SALARY|Dept_name|First_name|Last_name|
+------+---------+----------+---------+
|   200|Marketing|      Jane|    Smith|
|   140|Marketing|     Jerry|   Soreky|
|   120|Marketing|    Justin|   Sauren|
|   170|    Sales|       Joe|  Statham|
|   190|    Sales|    Jeremy|     Sage|
|   220|    Sales|       Jay|   Sawyer|
+------+---------+----------+---------+

Build the query to retrieve the people whose salary is above their department average:

w=Window().partitionBy("Dept_name")
df.withColumn("Average_Salary", F.avg("SALARY").over(w))\
  .filter(F.col("SALARY")>F.col("Average_Salary"))\
  .select("SALARY","Dept_name","First_name","Last_name")\
  .show()

+------+---------+----------+---------+
|SALARY|Dept_name|First_name|Last_name|
+------+---------+----------+---------+
|   220|    Sales|       Jay|   Sawyer|
|   200|Marketing|      Jane|    Smith|
+------+---------+----------+---------+
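To make the window logic concrete: `F.avg("SALARY").over(w)` attaches each department's mean salary to every row, and the filter keeps rows above it. A plain-Python sketch of the same per-department computation, using the sample data from above (no Spark required):

```python
from collections import defaultdict

# Same rows as the sample dataframe: (SALARY, Dept_name, First_name, Last_name)
data = [
    (200, 'Marketing', 'Jane', 'Smith'),
    (140, 'Marketing', 'Jerry', 'Soreky'),
    (120, 'Marketing', 'Justin', 'Sauren'),
    (170, 'Sales', 'Joe', 'Statham'),
    (190, 'Sales', 'Jeremy', 'Sage'),
    (220, 'Sales', 'Jay', 'Sawyer'),
]

# "partitionBy Dept_name" + avg: accumulate sum and count per department
totals = defaultdict(lambda: [0, 0])   # dept -> [salary_sum, row_count]
for salary, dept, *_ in data:
    totals[dept][0] += salary
    totals[dept][1] += 1
avg = {d: s / c for d, (s, c) in totals.items()}

# The filter step: keep rows whose SALARY exceeds their department's average
above = [row for row in data if row[0] > avg[row[1]]]
print(above)   # Jane (Marketing) and Jay (Sales), matching the Spark output
```

Marketing averages 460/3 ≈ 153.33 and Sales 580/3 ≈ 193.33, so only Jane (200) and Jay (220) survive the filter, consistent with the table above.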
