pyspark - Create a new column that flags customers

Problem description

My goal is to aggregate by customerID (count), create a new column, and flag customers who frequently return items. How can I do this? (Using Databricks, pyspark)

train.select("itemID","customerID","returnShipment").show(10)
+------+----------+--------------+
|itemID|customerID|returnShipment|
+------+----------+--------------+
|   186|       794|             0|
|    71|       794|             1|
|    71|       794|             1|
|    32|       850|             1|
|    32|       850|             1|
|    57|       850|             1|
|     2|       850|             1|
|   259|       850|             1|
|   603|       850|             1|
|   259|       850|             1|
+------+----------+--------------+

Tags: pyspark

You can define a threshold and then compare it against the sum of returnShipment for each customerID:

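from pyspark.sql import functions as F

# Customers with more than `threshold` returned shipments are flagged.
threshold = 5

# `train` is the DataFrame from the question.
train.groupBy("customerID") \
    .sum("returnShipment") \
    .withColumn("mark", F.col("sum(returnShipment)") > threshold) \
    .show()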
