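apache-spark - How do I use UDFs with pandas on PySpark groupby?
Problem description
I am struggling to use pandas UDFs with pandas on PySpark (the pandas API on Spark). Can you help me understand how this can be achieved? Here is my attempt: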
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark import pandas as ps

spark = SparkSession.builder.getOrCreate()
df = ps.DataFrame({'A': 'a a b'.split(),
                   'B': [1, 2, 3],
                   'C': [4, 6, 5]}, columns=['A', 'B', 'C'])

@pandas_udf('float')
def agg_a(x):
    return (x**2).mean()

@pandas_udf('float')
def agg_b(x):
    return x.mean()

spark.udf.register('agg_a_', agg_a)
spark.udf.register('agg_b_', agg_b)
df_means = df.groupby('A')
dfout = df_means.agg({'B': 'agg_a_', 'C': 'agg_b_'})
This raises an exception that I find hard to understand:
AnalysisException: expression 'B' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Aggregate [__index_level_0__#14], [__index_level_0__#14, agg_a_(B#2L) AS B#15, agg_b_(C#3L) AS C#16]
+- Project [A#1 AS __index_level_0__#14, A#1, B#2L, C#3L]
+- Project [__index_level_0__#0L, A#1, B#2L, C#3L, monotonically_increasing_id() AS __natural_order__#8L]
+- LogicalRDD [__index_level_0__#0L, A#1, B#2L, C#3L], false
I tried using udf instead of pandas_udf, but it failed with the same exception.
I also tried a groupby with a UDF on a single column only, but that failed as well:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark import pandas as ps

spark = SparkSession.builder.getOrCreate()
df = ps.DataFrame({'A': 'a a b'.split(),
                   'B': [1, 2, 3],
                   'C': [4, 6, 5]}, columns=['A', 'B', 'C'])

@udf('float')
def agg_a(x):
    return (x**2).mean()

@udf('float')
def agg_b(x):
    return x.mean()

spark.udf.register('agg_a_', agg_a)
spark.udf.register('agg_b_', agg_b)
df_means = df.groupby('A')['B']
dfout = df_means.agg('agg_a_')
Output:
PandasNotImplementedError: The method `pd.groupby.GroupBy.agg()` is not implemented yet.
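I suspect this message is misleading: groupby does work if I use a built-in function such as "min" or "max" instead of a UDF.
I also tried applying a single UDF without specifying a different one per column, but that failed too: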
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark import pandas as ps

spark = SparkSession.builder.getOrCreate()
df = ps.DataFrame({'A': 'a a b'.split(),
                   'B': [1, 2, 3],
                   'C': [4, 6, 5]}, columns=['A', 'B', 'C'])

@udf('float')
def agg_a(x):
    return (x**2).mean()

@udf('float')
def agg_b(x):
    return x.mean()

spark.udf.register('agg_a_', agg_a)
spark.udf.register('agg_b_', agg_b)
df_means = df.groupby('A')
dfout = df_means.agg('agg_a_')
Output:
AnalysisException: expression 'B' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Aggregate [__index_level_0__#14], [__index_level_0__#14, agg_a_(B#2L) AS B#15, agg_a_(C#3L) AS C#16]
+- Project [A#1 AS __index_level_0__#14, A#1, B#2L, C#3L]
+- Project [__index_level_0__#0L, A#1, B#2L, C#3L, monotonically_increasing_id() AS __natural_order__#8L]
+- LogicalRDD [__index_level_0__#0L, A#1, B#2L, C#3L], false
Solution
According to the GroupedData.agg documentation, you need to define your pandas_udf with a PandasUDFType. For an aggregation, that is PandasUDFType.GROUPED_AGG.
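from pyspark.sql.functions import pandas_udf, PandasUDFType

# Each UDF receives one group's column as a pandas Series and returns a scalar.
@pandas_udf('float', PandasUDFType.GROUPED_AGG)
def agg_a(x):
    return (x**2).mean()

@pandas_udf('float', PandasUDFType.GROUPED_AGG)
def agg_b(x):
    return x.mean()

spark.udf.register('agg_a_', agg_a)
spark.udf.register('agg_b_', agg_b)

# df must be a Spark DataFrame here (.show() is a Spark DataFrame method);
# a pandas-on-Spark DataFrame can be converted first with df.to_spark().
df.groupby('A').agg({'B': 'agg_a_', 'C': 'agg_b_'}).show()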
# +---+---------+---------+
# | A|agg_a_(B)|agg_b_(C)|
# +---+---------+---------+
# | b| 9.0| 5.0|
# | a| 2.5| 5.0|
# +---+---------+---------+
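As a complement, here is a minimal sketch of the same aggregation written with Python type hints, which Spark 3.0+ can use to infer the UDF kind in place of PandasUDFType: a function annotated as pd.Series -> float is treated as a grouped aggregate, whereas an unannotated @pandas_udf('float') falls back to a scalar UDF, which likely explains the AnalysisException in the question. The names mean_of_squares, plain_mean and aggregate_with_spark are hypothetical, and the 'double' return type is a choice made for this sketch rather than something taken from the answer. It assumes the starting point is a pandas-on-Spark DataFrame that is converted with to_spark().

```python
import pandas as pd


def mean_of_squares(x: pd.Series) -> float:
    """Mean of the squared values of one group (hypothetical helper name)."""
    return float((x ** 2).mean())


def plain_mean(x: pd.Series) -> float:
    """Mean of the values of one group (hypothetical helper name)."""
    return float(x.mean())


def aggregate_with_spark():
    """Apply both aggregations per value of 'A' and return a Spark DataFrame."""
    # Imported lazily so the two plain functions above stay usable without Spark.
    from pyspark import pandas as ps
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    SparkSession.builder.getOrCreate()
    psdf = ps.DataFrame({'A': 'a a b'.split(),
                         'B': [1, 2, 3],
                         'C': [4, 6, 5]}, columns=['A', 'B', 'C'])

    # Assumption: the grouped-aggregate UDFs are applied to a Spark DataFrame,
    # so the pandas-on-Spark DataFrame is converted first.
    sdf = psdf.to_spark()

    # The Series -> float hints let pandas_udf infer a grouped-aggregate UDF.
    agg_a = pandas_udf(mean_of_squares, 'double')
    agg_b = pandas_udf(plain_mean, 'double')

    return sdf.groupBy('A').agg(agg_a('B').alias('B'), agg_b('C').alias('C'))
```

Calling aggregate_with_spark().show() in an environment with Spark and PyArrow installed should print one row per value of A. Keeping the aggregation logic in plain pandas functions also means it can be checked on a pandas Series without starting a Spark session.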