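pyspark - PySpark: replacing a row-wise loop with groupby?

Problem description

I want to generate features by running a set of operations over a group of columns in a DataFrame. My DataFrame looks like this: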

root
 |-- CreatedOn: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- Industry: string (nullable = true)
 |-- region: string (nullable = true)
 |-- Customer: string (nullable = true)

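For example: the number of times an ID and region were used in the past 3/2/1 months. To compute this, I have to scan the whole DataFrame up to the current record. Current logic: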

    1. for i in df.collect() - Row-wise collect.
    2. Filter the data 3 months before this row.
    3. Generate features.

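The code works, but because it loops row by row it takes more than 10 hours to run. Is there a way to replace the row-wise operation in PySpark, since it does not use the parallelism PySpark provides? Something like groupby?

Sample data: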

S.No  ID     CreatedOn     Industry        Region
1     ERP    05thMay2020   Communications  USA
2     ERP    28thSept2020  Communications  USA
3     ERP    15thOct2020   Communications  Europe
4     ERP    15thNov2020   Communications  Europe
5     Cloud  1stDec2020    Insurance       Europe

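Consider record #4. Feature 1 (Count_3monthsRegion): I want to see how many times ERP was used in Europe in the past 3 months (with respect to CreatedOn). The answer is 1. (Record #2 is also ERP, but it is in a different region.)

Feature 2 (Count_3monthsIndustry): I want to see how many times ERP was used in Communications in the past 3 months (with respect to CreatedOn). The answer is 2.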
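To pin down the two feature definitions, here is a minimal plain-Python reference. It is not the asker's code and is not meant to scale. It assumes that CreatedOn has already been parsed into `datetime.date` values, that "past 3 months" means the three calendar months before the row's date with the row itself excluded, and that rows are matched on ID plus Region (or ID plus Industry). The names `months_back` and `rolling_count` are invented for this sketch.

```python
import calendar
from datetime import date

# Sample rows from the question: (S.No, ID, CreatedOn, Industry, Region).
ROWS = [
    (1, "ERP", date(2020, 5, 5), "Communications", "USA"),
    (2, "ERP", date(2020, 9, 28), "Communications", "USA"),
    (3, "ERP", date(2020, 10, 15), "Communications", "Europe"),
    (4, "ERP", date(2020, 11, 15), "Communications", "Europe"),
    (5, "Cloud", date(2020, 12, 1), "Insurance", "Europe"),
]


def months_back(d, n):
    """Return the date n calendar months before d, clipping the day if needed."""
    year, month0 = divmod(d.year * 12 + d.month - 1 - n, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def rolling_count(rows, key_index, months=3):
    """For each row, count other rows with the same ID and the same value at
    key_index whose date falls in [row date - months, row date)."""
    counts = []
    for row in rows:
        start = months_back(row[2], months)
        counts.append(sum(
            1 for other in rows
            if other[1] == row[1]
            and other[key_index] == row[key_index]
            and start <= other[2] < row[2]
        ))
    return counts


region_counts = rolling_count(ROWS, key_index=4)    # Count_3monthsRegion
industry_counts = rolling_count(ROWS, key_index=3)  # Count_3monthsIndustry
print(region_counts)    # [0, 0, 0, 1, 0]
print(industry_counts)  # [0, 0, 1, 2, 0]
```

This check is quadratic in the number of rows, so it is only useful for validating a distributed version on a small sample.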

Expected output:

S.No  ID     CreatedOn     Industry        Region  Count_3monthsRegion  Count_3monthsIndustry
1     ERP    05thMay2020   Communications  USA     0                    0
2     ERP    28thSept2020  Communications  USA     0                    0
3     ERP    15thOct2020   Communications  Europe  0                    1
4     ERP    15thNov2020   Communications  Europe  1                    2
5     Cloud  1stDec2020    Insurance       Europe  0                    0
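One possible way to get counts like these without `collect()` is a range-based window: partition by ID plus Region (or ID plus Industry), order by a numeric day index, and count the rows in a frame that ends one day before the current row. The sketch below is an outline under stated assumptions rather than a verified solution. It approximates "3 months" as 90 days, parses the non-standard CreatedOn strings with a small Python UDF, and uses the column names from the sample data. `parse_created_on`, `add_rolling_counts`, `WINDOW_DAYS`, `created_date` and `day_index` are names invented here.

```python
import re
from datetime import date

# Month lookup by three-letter prefix, so "Sept" resolves like "Sep".
_MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
_PATTERN = re.compile(r"^(\d{1,2})(?:st|nd|rd|th)?([A-Za-z]+)(\d{4})$")

WINDOW_DAYS = 90  # assumption: "3 months" approximated as 90 days


def parse_created_on(text):
    """Turn strings such as '28thSept2020' into a date; None if unparseable."""
    match = _PATTERN.match(text.strip()) if text else None
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = _MONTHS.get(month_name[:3].lower())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:  # e.g. day 31 in a 30-day month
        return None


def add_rolling_counts(df):
    """Add Count_3monthsRegion and Count_3monthsIndustry to a Spark DataFrame."""
    # Imported here so the parsing helper above stays usable without Spark.
    from pyspark.sql import Window, functions as F
    from pyspark.sql.types import DateType

    to_date_udf = F.udf(parse_created_on, DateType())
    df = df.withColumn("created_date", to_date_udf(F.col("CreatedOn")))
    # Range frames need a numeric ordering column: days since 1970-01-01.
    df = df.withColumn(
        "day_index",
        F.datediff(F.col("created_date"), F.to_date(F.lit("1970-01-01"))),
    )

    def frame(*keys):
        # Rows from 90 days before the current row up to the previous day.
        return (Window.partitionBy(*keys)
                .orderBy("day_index")
                .rangeBetween(-WINDOW_DAYS, -1))

    return (
        df.withColumn("Count_3monthsRegion",
                      F.count(F.lit(1)).over(frame("ID", "Region")))
          .withColumn("Count_3monthsIndustry",
                      F.count(F.lit(1)).over(frame("ID", "Industry")))
          .drop("day_index")
    )
```

With a DataFrame `df` holding the columns above, `add_rolling_counts(df).show()` would display the two new columns. Rows sharing the current row's date are excluded by the `-1` upper bound, and unparseable CreatedOn values become null dates. If exact calendar months matter, a self-join on ID and Region with a date-range condition built from `F.add_months` may be a closer fit than the 90-day frame.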

Tags: pyspark, pyspark-dataframes
