PySpark: Cast a NullType field to string inside a struct-type column

Problem description

I have a DataFrame with the following schema. The translation_version field under translations --> languages (no, pt, ...) is of type null. I want to cast every translation_version to string. There are 17 languages under translations.

root
|-- translations: struct (nullable = true)
|    |-- no: struct (nullable = true)
|    |    |-- Description: string (nullable = true)
|    |    |-- class: string (nullable = true)
|    |    |-- description: string (nullable = true)
|    |    |-- translation_version: null (nullable = true) // Want to cast as string
|    |-- pt: struct (nullable = true)
|    |    |-- Description: string (nullable = true)
|    |    |-- class: string (nullable = true)
|    |    |-- description: string (nullable = true)
|    |    |-- translation_version: null (nullable = true)
|    |-- fr: struct (nullable = true)
|    |    |-- Description: string (nullable = true)
|    |    |-- class: string (nullable = true)
|    |    |-- description: string (nullable = true)
|    |    |-- translation_version: null (nullable = true)

I tried df = df.na.fill('null'), but it did not change anything. I also tried casting with the following code:

df = df.withColumn("translations", F.col("translations").cast("struct<struct<translation_version: string>>"))

But this returned the following error:

pyspark.sql.utils.ParseException: u"\nmismatched input '<' expecting ':'(line 1, pos 13)\n\n== SQL ==\nstruct<struct<translation_version: string>>\n-------------^^^\n"

Any idea how to cast translation_version to string for all languages?

Tags: apache-spark, pyspark, aws-glue

Solution


This should do the trick:

from pyspark.sql.functions import col, struct
from pyspark.sql.types import StructType, StructField, StringType

schema_ = StructType([StructField("Description",StringType(),True),
                      StructField("class",StringType(),True),
                      StructField("description",StringType(),True),
                      StructField("translation_version",StringType(),True)
                     ]
                    )

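df_1 = (
    df
    # Promote each language struct (fr, pt, no, ...) to a top-level column.
    .select("translations.*")
    # Cast each language struct to schema_ and rebuild the translations struct.
    .withColumn("translations", struct(
        col("fr").cast(schema_).alias("fr"),
        col("pt").cast(schema_).alias("pt"),
        col("no").cast(schema_).alias("no")
               )
               )
    # Drop the temporary top-level language columns.
    .drop("fr", "pt", "no")
)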
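
The ParseException in the question arises because every field inside a DDL struct type needs a name before its type, and struct<struct<...>> names neither the language field nor the outer one. With 17 languages, writing the full type string by hand is tedious, so a minimal sketch of generating it is shown here. The helper struct_ddl and the variable names are hypothetical, the language list is an assumption (only three of the 17 codes appear in the question), and field names are backtick-quoted so that names such as class or no are not read as keywords.

```python
def struct_ddl(fields):
    """Build a DDL struct type string from (name, type) pairs.

    Names are backtick-quoted so they are parsed as plain identifiers.
    """
    body = ",".join(f"`{name}`:{dtype}" for name, dtype in fields)
    return f"struct<{body}>"


# Field names copied from the schema in the question.
inner_fields = [
    ("Description", "string"),
    ("class", "string"),
    ("description", "string"),
    ("translation_version", "string"),  # NullType in the original schema
]

# Assumption: extend this list to all 17 language codes, in schema order.
# With a live DataFrame the order could instead be read from
# df.schema["translations"].dataType.fieldNames().
languages = ["no", "pt", "fr"]

inner_ddl = struct_ddl(inner_fields)
translations_ddl = struct_ddl([(lang, inner_ddl) for lang in languages])

# Hypothetical usage against the DataFrame from the question:
# df = df.withColumn("translations", F.col("translations").cast(translations_ddl))
```

A struct-to-struct cast matches fields by position rather than by name, so the languages and inner fields should be listed in the same order as in the existing schema. Unlike the select-based approach, casting the column in place would also keep any other columns of df.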
