Flattening a nested JSON file in PySpark with level-qualified column names

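Problem description

I wrote PySpark code that explodes a nested JSON file, but I want each resulting column name to carry its nesting level. This is the original schema: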

 |-- City: string (nullable = true)
 |-- RecordNumber: integer (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Adress: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- House: string (nullable = true)
 |    |    |-- Street: string (nullable = true)
 |    |    |-- Appartement: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- number: string (nullable = true)
 |    |    |    |    |-- level: string (nullable = true)

This code explodes my DataFrame:

import pyspark.sql.functions as psf

# Explode each nested array into rows, then expand the resulting struct into columns
df = (
    df.select("City", "State", "RecordNumber", "ZipCodeType", "Zipcode", psf.explode("Adress").alias("Adress"))
    .select("City", "State", "RecordNumber", "ZipCodeType", "Zipcode", "Adress.*")
    .select("City", "State", "RecordNumber", "ZipCodeType", "Zipcode", "House", "Street", psf.explode("Appartement").alias("Appartement"))
    .select("City", "State", "RecordNumber", "ZipCodeType", "Zipcode", "House", "Street", "Appartement.*")
)
df.show()

This is the schema afterwards:

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- RecordNumber: integer (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- House: string (nullable = true)
 |-- Street: string (nullable = true)
 |-- number: string (nullable = true)
 |-- level: string (nullable = true)

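I want to update this code so that, for example, House is shown as Adress.House, and likewise Adress.Street, Adress.Appartement.number and Adress.Appartement.level, as if I were only renaming the columns in the schema.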

Tags: python, apache-spark, pyspark

Solution
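One possible approach, offered as a sketch rather than a definitive answer: derive the dotted names from the original (nested) schema and apply them as aliases to the flattened DataFrame. The helpers `leaf_name_mapping` and `rename_flattened` below are hypothetical names introduced for illustration. The sketch assumes the schema dict has the shape returned by `df.schema.jsonValue()` and that leaf names (House, Street, number, level) are unique across the whole schema.

```python
def leaf_name_mapping(schema_json, prefix=""):
    """Map each leaf field name to its dotted path, e.g. 'House' -> 'Adress.House'.

    `schema_json` is assumed to be a struct description shaped like the
    result of `df.schema.jsonValue()`. Leaf names are assumed to be unique.
    """
    mapping = {}
    for field in schema_json["fields"]:
        path = prefix + field["name"]
        dtype = field["type"]
        # Arrays wrap their element type; unwrap them to reach the struct inside.
        while isinstance(dtype, dict) and dtype.get("type") == "array":
            dtype = dtype["elementType"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            # Nested struct: recurse, extending the dotted prefix.
            mapping.update(leaf_name_mapping(dtype, path + "."))
        else:
            # Leaf column: record its fully qualified name.
            mapping[field["name"]] = path
    return mapping


def rename_flattened(flat_df, mapping):
    """Alias every column of an already-flattened DataFrame to its dotted name."""
    return flat_df.select(
        [flat_df[c].alias(mapping.get(c, c)) for c in flat_df.columns]
    )
```

To use it, capture `original_schema = df.schema.jsonValue()` before running the explode chain, then call `rename_flattened(df, leaf_name_mapping(original_schema))` on the flattened result. Because the new names contain dots, later references to them need backticks, for example df.select("`Adress.House`"); otherwise Spark interprets the dot as struct field access.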

