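apache-spark - PySpark DataFrame: convert two columns into a new column of tuples based on the value of a third column
Problem description
As the title says, I have a PySpark DataFrame in which I need to convert two columns into new columns, each holding a list of tuples, based on the value of a third column. The transformation reduces (flattens) the DataFrame by a key, here the product ID, so that the result has one row per key.
The DataFrame has hundreds of millions of rows and 37 million unique product IDs, so I need a way to do the transformation on the Spark cluster without bringing any data back to the driver (Jupyter in this case).
Here is an excerpt of my DataFrame for just one product: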
+-----------+-------------------+-------------+--------+----------+---------------+
| product_id|      purchase_date|days_warranty|store_id|year_month|       category|
+-----------+-------------------+-------------+--------+----------+---------------+
|02147465400|2017-05-16 00:00:00|           30|     205|   2017-05|     CATEGORY A|
|02147465400|2017-04-15 00:00:00|           30|     205|   2017-04|     CATEGORY A|
|02147465400|2018-07-11 00:00:00|           30|     205|   2018-07|     CATEGORY A|
|02147465400|2017-06-14 00:00:00|           30|     205|   2017-06|     CATEGORY A|
|02147465400|2017-03-16 00:00:00|           30|     205|   2017-03|     CATEGORY A|
|02147465400|2017-08-14 00:00:00|           30|     205|   2017-08|     CATEGORY A|
|02147465400|2017-09-12 00:00:00|           30|     205|   2017-09|     CATEGORY A|
|02147465400|2017-01-21 00:00:00|           30|     205|   2017-01|     CATEGORY A|
|02147465400|2018-08-14 00:00:00|           30|     205|   2018-08|     CATEGORY A|
|02147465400|2018-08-23 00:00:00|           30|     205|   2018-08|     CATEGORY A|
|02147465400|2017-10-11 00:00:00|           30|     205|   2017-10|     CATEGORY A|
|02147465400|2017-12-12 00:00:00|           30|     205|   2017-12|     CATEGORY A|
|02147465400|2017-02-15 00:00:00|           30|     205|   2017-02|     CATEGORY A|
|02147465400|2018-04-12 00:00:00|           30|     205|   2018-04|     CATEGORY A|
|02147465400|2018-03-12 00:00:00|           30|     205|   2018-03|     CATEGORY A|
|02147465400|2018-05-15 00:00:00|           30|     205|   2018-05|     CATEGORY A|
|02147465400|2018-02-12 00:00:00|           30|     205|   2018-02|     CATEGORY A|
|02147465400|2018-06-14 00:00:00|           30|     205|   2018-06|     CATEGORY A|
|02147465400|2018-01-11 00:00:00|           30|     205|   2018-01|     CATEGORY A|
|02147465400|2017-07-20 00:00:00|           30|     205|   2017-07|     CATEGORY A|
|02147465400|2017-11-11 00:00:00|           30|     205|   2017-11|     CATEGORY A|
|02147465400|2017-01-05 00:00:00|           90|     205|   2017-01|     CATEGORY B|
|02147465400|2017-01-21 00:00:00|           90|     205|   2017-01|     CATEGORY B|
|02147465400|2017-10-09 00:00:00|           90|     205|   2017-10|     CATEGORY B|
|02147465400|2018-07-11 00:00:00|           90|     205|   2018-07|     CATEGORY B|
|02147465400|2017-04-16 00:00:00|           90|     205|   2017-04|     CATEGORY B|
|02147465400|2018-09-16 00:00:00|           90|     205|   2018-09|     CATEGORY B|
|02147465400|2018-04-14 00:00:00|           90|     205|   2018-04|     CATEGORY B|
|02147465400|2018-01-12 00:00:00|           90|     205|   2018-01|     CATEGORY B|
|02147465400|2017-07-15 00:00:00|           90|     205|   2017-07|     CATEGORY B|
+-----------+-------------------+-------------+--------+----------+---------------+
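Here is the desired result DataFrame: one row per product, where the purchase_date and days_warranty columns of the original rows are combined into arrays of tuples, placed in new columns named after the values of the category column: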
+-----------+----------------------------+----------------------------+
| product_id|                  CATEGORY A|                  CATEGORY B|
+-----------+----------------------------+----------------------------+
|02147465400| [ (2017-05-16 00:00:00,30),| [ (2017-01-05 00:00:00,90),|
|           |   (2017-04-15 00:00:00,30),|   (2017-01-21 00:00:00,90),|
|           |   (2018-07-11 00:00:00,30),|   (2017-10-09 00:00:00,90),|
|           |   (2017-06-14 00:00:00,30),|   (2018-07-11 00:00:00,90),|
|           |   (2017-03-16 00:00:00,30),|   (2017-04-16 00:00:00,90),|
|           |   (2017-08-14 00:00:00,30),|   (2018-09-16 00:00:00,90),|
|           |   (2017-09-12 00:00:00,30),|   (2018-04-14 00:00:00,90),|
|           |   (2017-01-21 00:00:00,30),|   (2018-01-12 00:00:00,90),|
|           |   (2018-08-14 00:00:00,30),|   (2017-07-15 00:00:00,90) |
|           |   (2018-08-23 00:00:00,30),| ]                          |
|           |   (2017-10-11 00:00:00,30),|                            |
|           |   (2017-12-12 00:00:00,30),|                            |
|           |   (2017-02-15 00:00:00,30),|                            |
|           |   (2018-04-12 00:00:00,30),|                            |
|           |   (2018-03-12 00:00:00,30),|                            |
|           |   (2018-05-15 00:00:00,30),|                            |
|           |   (2018-02-12 00:00:00,30),|                            |
|           |   (2018-06-14 00:00:00,30),|                            |
|           |   (2018-01-11 00:00:00,30),|                            |
|           |   (2017-07-20 00:00:00,30) |                            |
|           | ]                          |                            |
+-----------+----------------------------+----------------------------+
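Solution
If you run into performance issues with pivot, the approach below is another way to solve the same problem. It gives you more control, because a for loop splits the job into one stage per category. On each iteration, the new data for the current category is joined into acc_df, which holds the accumulated results.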
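from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Return type of the UDF: an array of (p_date, d_warranty) structs
schema = ArrayType(
    StructType([
        StructField("p_date", StringType(), False),
        StructField("d_warranty", StringType(), False)
    ])
)
# tuple_list (shown further down) must already be defined when this line runs
tuple_list_udf = udf(tuple_list, schema)

buf_size = 5  # if you get an OOM error, decrease this to persist more often
# collect() brings only the distinct category values to the driver, not the data
categories = df.select("category").distinct().collect()
# empty df for the accumulated results; replaced by the first category's result
acc_df = spark.createDataFrame(sc.emptyRDD(), df.schema)

for idx, c in enumerate(categories):
    col_name = c[0].replace(" ", "_")  # spark complains about column names containing spaces
    cat_df = df.where(df["category"] == c[0]) \
        .groupBy("product_id") \
        .agg(
            F.collect_list(F.col("purchase_date")).alias("p_date"),
            F.collect_list(F.col("days_warranty")).alias("d_warranty")) \
        .withColumn(col_name, tuple_list_udf(F.col("p_date"), F.col("d_warranty"))) \
        .drop("p_date", "d_warranty")

    if idx == 0:
        acc_df = cat_df
    else:
        # default inner join: a product missing from any category is dropped;
        # pass "outer" as a third argument to join() to keep such products
        acc_df = acc_df \
            .join(cat_df.alias("cat_df"), "product_id") \
            .drop(F.col("cat_df.product_id"))

    # persist every buf_size iterations; the parentheses matter, because
    # idx + 1 % buf_size parses as idx + (1 % buf_size)
    if (idx + 1) % buf_size == 0:
        acc_df = acc_df.persist()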
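The function tuple_list builds the list of tuples from the collected purchase_date and days_warranty values.
def tuple_list(pdl, dwl):
    # pair the i-th purchase date with the i-th warranty value
    return list(zip(pdl, dwl))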
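To see what tuple_list returns without a cluster, here is a minimal plain-Python sketch. The two input lists are illustrative stand-ins for what collect_list would hand to the UDF for a single product; the pairing is only meaningful if both lists list the rows in the same order.

```python
def tuple_list(pdl, dwl):
    # Pair the i-th purchase date with the i-th warranty value.
    return list(zip(pdl, dwl))


# Illustrative stand-ins for the two collected lists of one product.
dates = ["2017-01-05 00:00:00", "2017-01-21 00:00:00"]
warranties = ["90", "90"]

pairs = tuple_list(dates, warranties)
print(pairs)
```

Note that zip stops at the shorter input, so a length mismatch between the two lists would silently drop the trailing values.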
The resulting acc_df looks like this:
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|product_id |CATEGORY_B |CATEGORY_A |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|02147465400|[[2017-04-16 00:00:00, 90], [2018-09-16 00:00:00, 90], [2017-10-09 00:00:00, 90], [2018-01-12 00:00:00, 90], [2018-07-11 00:00:00, 90], [2017-01-21 00:00:00, 90], [2018-04-14 00:00:00, 90], [2017-01-05 00:00:00, 90], [2017-07-15 00:00:00, 90]]|[[2017-06-14 00:00:00, 30], [2018-08-14 00:00:00, 30], [2018-01-11 00:00:00, 30], [2018-04-12 00:00:00, 30], [2017-10-11 00:00:00, 30], [2017-05-16 00:00:00, 30], [2018-05-15 00:00:00, 30], [2017-04-15 00:00:00, 30], [2017-02-15 00:00:00, 30], [2018-02-12 00:00:00, 30], [2017-01-21 00:00:00, 30], [2018-07-11 00:00:00, 30], [2018-06-14 00:00:00, 30], [2017-03-16 00:00:00, 30], [2017-07-20 00:00:00, 30], [2018-08-23 00:00:00, 30], [2017-09-12 00:00:00, 30], [2018-03-12 00:00:00, 30], [2017-12-12 00:00:00, 30], [2017-08-14 00:00:00, 30], [2017-11-11 00:00:00, 30]]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+