首页 > 解决方案 > 在 Pyspark Dataframe 的一个字段中将(年、月、日、小时、分钟、秒)的嵌套值分解为日期时间类型

问题描述

START_Time我正在尝试将嵌套字段转换为一个 DATETIME 类型的字段,当我使用explode 函数时出现错误:由于数据类型不匹配,无法解析“explode( )”

我拥有的数据:

 |-- MODEL: string (nullable = true)
 |-- START_Time: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- WEIGHT: string (nullable = true)
 |-- REGISTRED: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- TOTAL: string (nullable = true)

我希望得到的结果:使用 START_TIME 和 REGISTRED 作为 DATE 类型

+---------+------------------+----------+-----------------+---------+
|MODEL    |   START_Time     | WEIGHT   |REGISTED         |TOTAL    |
+---------+------------------+----------+-----------------+---------+
|.........| yy-mm-dd-hh-mm-ss| WEIGHT   |yy-mm-dd-hh-mm-ss|TOTAL    |

我努力了 :

df.withColumn('START_Time', concat(col('START_Time.year'), lit('-'), .....)

但是当嵌套字段中有空值时,它会进入 (-----) 并得到我:

+---------+------------------+----------+-----------------+---------+
|MODEL    |   START_Time     | WEIGHT   |REGISTED         |TOTAL    |
+---------+------------------+----------+-----------------+---------+
|value    | -----            | value    | -----           |value    |

标签: dataframepyspark

解决方案


连接后,您可以将整个列转换为timestamp类型,Spark 将为您处理丢失(和无效)的数据并null返回

from pyspark.sql import functions as F

(df
    .withColumn('raw_string_date', F
        .concat(
            F.col('START_TIME.year'),
            F.lit('-'),
            F.col('START_TIME.month'),
            F.lit('-'),
            F.col('START_TIME.day'),
            F.lit(' '),
            F.col('START_TIME.hour'),
            F.lit(':'),
            F.col('START_TIME.minute'),
            F.lit(':'),
            F.col('START_TIME.second'),
        )
    )
    .withColumn('date_type', F.col('raw_string_date').cast('timestamp'))
    .show(10, False)
)

# +------------------------------------+---------------+-------------------+
# |START_TIME                          |raw_string_date|date_type          |
# +------------------------------------+---------------+-------------------+
# |{1, 2, 3, 4, 5, 2021}               |2021-4-1 2:3:5 |2021-04-01 02:03:05|
# |{, , , , , }                        |-- ::          |null               |
# |{null, null, null, null, null, null}|null           |null               |
# +------------------------------------+---------------+-------------------+


推荐阅读