首页 > 解决方案 > PySpark DateTime 函数返回空值

问题描述

我正在阅读一些芝加哥犯罪数据,并且需要使用内置的 pyspark 日期时间函数来创建月份和年份列。我遵循了文档并尝试了几种方法,但都没有成功。

我导入以下内容。

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql.window import Window

我的架构显示日期,并且它具有字符串值。

df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Case_Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary_Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location_Description: string (nullable = true)
 |-- Arrest: boolean (nullable = true)
 |-- Domestic: boolean (nullable = true)
 |-- District: integer (nullable = true)
 |-- Community_Area: integer (nullable = true)
 |-- FBI_Code: string (nullable = true)
 |-- X_Coordinate: integer (nullable = true)
 |-- Y_Coordinate: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Updated_On: string (nullable = true)
 |-- Location: string (nullable = true)

查看值,数据列中没有空值。

df.select('Date').show()
+--------------+
|          Date|
+--------------+
|9/5/2015 13:30|
|9/4/2015 11:30|
| 9/1/2018 0:01|
|9/5/2015 12:45|
|9/5/2015 13:00|
|9/5/2015 10:55|
|9/4/2015 18:00|
|9/5/2015 13:00|
|9/5/2015 11:30|
| 5/1/2016 0:25|
|9/5/2015 14:00|
|9/5/2015 11:00|
| 9/5/2015 3:00|
|9/5/2015 12:50|
|9/3/2015 13:00|
|9/5/2015 11:45|
|9/5/2015 13:30|
| 7/8/2015 0:00|
| 9/5/2015 9:55|
|9/5/2015 12:35|
+--------------+
only showing top 20 rows

然后我调用以下命令但得到所有空值。

 df2 = df.withColumn("Date", unix_timestamp("Date", "MM/dd/yyyy hh:mm"))
    df2.select("Date").show()
    +----------+
    |      Date|
    +----------+
    |      null|
    |1441384200|
    |      null|
    |1441431900|
    |      null|
    |1441468500|
    |      null|
    |      null|
    |1441470600|
    |      null|
    |      null|
    |1441468800|
    |1441440000|
    |1441432200|
    |      null|
    |1441471500|
    |      null|
    |      null|
    |1441464900|
    |1441431300|
    +----------+
    only showing top 20 rows

df2 = df.withColumn("Date", df.Date.cast(types.TimestampType()))
df2.select("Date").show()
+----+
|Date|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
only showing top 20 rows

我只想使用日期列来创建月份和年份。

### Get Month from date in pyspark
from pyspark.sql.functions import month, year
#df = df.withColumn("Date", df.Date.cast(types.TimestampType()))
#df = df.withColumn("Date", unix_timestamp("Date", "MM/dd/yyyy"))
df = df.withColumn('Year', year(df['Date']))
df = df.withColumn('Month', month(df['Date']))

In: df.select('Month').distinct().collect()
Out: [Row(Month=None)]

标签: pyspark

解决方案


好的,首先 - 一些可重现的例子会很好..

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('stack').getOrCreate()
data = [("9/5/2015 13:30"),("9/4/2015 11:30"), ("9/1/2018 0:01"), ("9/5/2015 12:45"),("9/5/2015 13:00"), ("9/5/2015 10:55"), ("9/4/2015 18:00"), ("9/5/2015 13:00")]
from pyspark.sql.types import *

dummy_df = spark.createDataFrame(data, StringType()).toDF('Datetime')
dummy_df.show()
dummy_df.printSchema()

这将为您提供输出:

+--------------+
|      Datetime|
+--------------+
|9/5/2015 13:30|
|9/4/2015 11:30|
| 9/1/2018 0:01|
|9/5/2015 12:45|
|9/5/2015 13:00|
|9/5/2015 10:55|
|9/4/2015 18:00|
|9/5/2015 13:00|
+--------------+
root
 |-- Datetime: string (nullable = true)

现在,如前所述,请注意日期列的格式并遵循https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html的说明

from pyspark.sql.functions import regexp_replace, month, year, col, to_date

#simple regex, watch out for the white space
dummy_df = dummy_df.withColumn('Date', regexp_replace('Datetime', '( \d+:\d+)', ''))
#transform the date-column
dummy_df = dummy_df.withColumn('Date', to_date(col('Date'), "M/d/y"))\ 
        .withColumn('Month', month(col('Date')))\
        .withColumn('Year', year(col('Date')))
dummy_df.show()
dummy_df.printSchema()

最后我们得到:

+--------------+----------+-----+----+
|      Datetime|      Date|Month|Year|
+--------------+----------+-----+----+
|9/5/2015 13:30|2015-09-05|    9|2015|
|9/4/2015 11:30|2015-09-04|    9|2015|
| 9/1/2018 0:01|2018-09-01|    9|2018|
|9/5/2015 12:45|2015-09-05|    9|2015|
|9/5/2015 13:00|2015-09-05|    9|2015|
|9/5/2015 10:55|2015-09-05|    9|2015|
|9/4/2015 18:00|2015-09-04|    9|2015|
|9/5/2015 13:00|2015-09-05|    9|2015|
+--------------+----------+-----+----+

root
 |-- Datetime: string (nullable = true)
 |-- Date: date (nullable = true)
 |-- Month: integer (nullable = true)
 |-- Year: integer (nullable = true)

推荐阅读