How to do multiple arithmetic operations in a PySpark aggregation

Problem description

Recently I tried doing just a single arithmetic operation in a PySpark aggregation. Here is the pandas code:

packetmonthly=packet.groupby(['year','month','msisdn']).apply(lambda s: pd.Series({ 
    "packet_sum": s.amount.sum(),
    "packet_avg": s.amount.mean()
})).reset_index()

and the PySpark code:

from pyspark.sql.functions import max as pyspark_max, min as pyspark_min, sum as pyspark_sum, avg

fd_packet = fd_packetpurchase \
    .groupBy('year', 'month', 'msisdn') \
    .agg(pyspark_min('amount').alias('packet_min'),
         pyspark_max('amount').alias('packet_max'),
         avg('amount').alias('packet_avg'),
         pyspark_sum('amount').alias('packet_sum'))

Here is the pandas code I need to translate:

datadaily=profile[profile.month.isin([12,1])].groupby(['year','month','day','msisdn']).apply(lambda s: pd.Series({ 
    "totrev_sum": (s["voice_revenue"]+s["sms_revenue"]+s["dta_revenue"]+s["vas_revenue"]).sum(),
    "data_yield_sum": (s["dta_revenue"].sum()/s["data_usage"].sum())
})).reset_index()

How can I do this in PySpark?

Tags: python, pandas, pyspark

Solution
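
In Spark the whole lambda collapses into column expressions inside agg(). A minimal sketch of the translation, assuming profile is already a Spark DataFrame with the same column names as in the pandas version: add the four revenue columns row by row inside sum() to get totrev_sum, and divide the two already-aggregated sums to get data_yield_sum.

from pyspark.sql.functions import col, sum as pyspark_sum

datadaily = profile \
    .filter(col('month').isin(12, 1)) \
    .groupBy('year', 'month', 'day', 'msisdn') \
    .agg(
        # per-row total revenue, summed within each group:
        # matches (s[a] + s[b] + s[c] + s[d]).sum() in pandas
        pyspark_sum(col('voice_revenue') + col('sms_revenue')
                    + col('dta_revenue') + col('vas_revenue')).alias('totrev_sum'),
        # ratio of two group sums: matches s[a].sum() / s[b].sum() in pandas
        (pyspark_sum('dta_revenue') / pyspark_sum('data_usage')).alias('data_yield_sum'))

The key point is that sum() accepts an arbitrary column expression, so the addition happens per row before aggregation, while the division is applied to the two finished sums; both therefore reproduce the pandas arithmetic. No reset_index() equivalent is needed, since groupBy().agg() already returns the grouping keys as ordinary columns.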

