Flatten a dataframe with an array column

Problem description

Suppose I have a PySpark dataframe whose df.printSchema() output is:

root
 |-- shop_id: int (nullable = false)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: int (nullable = false)

How can I transform it into:

root
 |-- shop_id: int (nullable = false)
 |-- item_id: int (nullable = false)

In other words, each shop_id should be "attached" to each of its item_id entries, and the resulting pairs flattened into a single stream of rows.

A more concrete illustration: the input

[
   {
      "shop_id":42,
      "items":[{"item_id":101}, {"item_id":102}]
   },
   {
      "shop_id":43,
      "items":[{"item_id":203}]
   }
]

should become:

[
   {"shop_id":42,"item_id":101},
   {"shop_id":42,"item_id":102},
   {"shop_id":43,"item_id":203}
]
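
For comparison, the same flattening over plain Python dicts is just a nested list comprehension. This is only an illustration of the intended semantics (it assumes the input list above is bound to a variable named data), not part of the Spark solution:

rows = [
    {"shop_id": shop["shop_id"], "item_id": item["item_id"]}
    for shop in data            # one pass over the shops
    for item in shop["items"]   # one output row per item
]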

Tags: python, pyspark

Solution

tl;dr

from pyspark.sql import functions as F

df.select('shop_id', F.explode('items.item_id').alias('item_id'))
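
An equivalent formulation, in case you prefer to explode the array of structs first and project the field afterwards (both produce the same result):

# Explode each struct into its own row, then pull item_id out of it.
df.select('shop_id', F.explode('items').alias('item')) \
  .select('shop_id', 'item.item_id')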

Test

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

spark_session = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('shop_id', IntegerType()),
    StructField('items', ArrayType(
        StructType([
            StructField('item_id', IntegerType()),
        ])
    ))
])

data = [
   {
      "shop_id":42,
      "items":[{"item_id":101}, {"item_id":102}]
   },
   {
      "shop_id":43,
      "items":[{"item_id":203}]
   }
]

df = spark_session.createDataFrame(data, schema)

df.printSchema()

root
 |-- shop_id: integer (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_id: integer (nullable = true)

df = df.select('shop_id', F.explode('items.item_id').alias('item_id'))
df.printSchema()

root
 |-- shop_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)

df.collect()

[Row(shop_id=42, item_id=101),
 Row(shop_id=42, item_id=102),
 Row(shop_id=43, item_id=203)]
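
One caveat worth knowing: F.explode drops any shop whose items array is empty or null. If such rows should survive with a null item_id instead, F.explode_outer does exactly that:

# explode_outer keeps rows whose array is null or empty,
# emitting a null item_id instead of dropping the shop.
df.select('shop_id', F.explode_outer('items.item_id').alias('item_id'))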
