pyspark - PySpark DateTime 函数返回空值
问题描述
我正在阅读一些芝加哥犯罪数据,并且需要使用内置的 pyspark 日期时间函数来创建月份和年份列。我遵循了文档并尝试了几种方法,但都没有成功。
我导入以下内容。
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql.window import Window
我的架构显示日期,并且它具有字符串值。
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- Case_Number: string (nullable = true)
|-- Date: string (nullable = true)
|-- Block: string (nullable = true)
|-- IUCR: string (nullable = true)
|-- Primary_Type: string (nullable = true)
|-- Description: string (nullable = true)
|-- Location_Description: string (nullable = true)
|-- Arrest: boolean (nullable = true)
|-- Domestic: boolean (nullable = true)
|-- District: integer (nullable = true)
|-- Community_Area: integer (nullable = true)
|-- FBI_Code: string (nullable = true)
|-- X_Coordinate: integer (nullable = true)
|-- Y_Coordinate: integer (nullable = true)
|-- Year: integer (nullable = true)
|-- Updated_On: string (nullable = true)
|-- Location: string (nullable = true)
查看值,数据列中没有空值。
df.select('Date').show()
+--------------+
| Date|
+--------------+
|9/5/2015 13:30|
|9/4/2015 11:30|
| 9/1/2018 0:01|
|9/5/2015 12:45|
|9/5/2015 13:00|
|9/5/2015 10:55|
|9/4/2015 18:00|
|9/5/2015 13:00|
|9/5/2015 11:30|
| 5/1/2016 0:25|
|9/5/2015 14:00|
|9/5/2015 11:00|
| 9/5/2015 3:00|
|9/5/2015 12:50|
|9/3/2015 13:00|
|9/5/2015 11:45|
|9/5/2015 13:30|
| 7/8/2015 0:00|
| 9/5/2015 9:55|
|9/5/2015 12:35|
+--------------+
only showing top 20 rows
然后我调用以下命令但得到所有空值。
df2 = df.withColumn("Date", unix_timestamp("Date", "MM/dd/yyyy hh:mm"))
df2.select("Date").show()
+----------+
| Date|
+----------+
| null|
|1441384200|
| null|
|1441431900|
| null|
|1441468500|
| null|
| null|
|1441470600|
| null|
| null|
|1441468800|
|1441440000|
|1441432200|
| null|
|1441471500|
| null|
| null|
|1441464900|
|1441431300|
+----------+
only showing top 20 rows
df2 = df.withColumn("Date", df.Date.cast(types.TimestampType()))
df2.select("Date").show()
+----+
|Date|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
only showing top 20 rows
我只想使用日期列来创建月份和年份。
### Get Month from date in pyspark
from pyspark.sql.functions import month, year
#df = df.withColumn("Date", df.Date.cast(types.TimestampType()))
#df = df.withColumn("Date", unix_timestamp("Date", "MM/dd/yyyy"))
df = df.withColumn('Year', year(df['Date']))
df = df.withColumn('Month', month(df['Date']))
In: df.select('Month').distinct().collect()
Out: [Row(Month=None)]
解决方案
好的,首先 - 一些可重现的例子会很好..
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('stack').getOrCreate()
data = [("9/5/2015 13:30"),("9/4/2015 11:30"), ("9/1/2018 0:01"), ("9/5/2015 12:45"),("9/5/2015 13:00"), ("9/5/2015 10:55"), ("9/4/2015 18:00"), ("9/5/2015 13:00")]
from pyspark.sql.types import *
dummy_df = spark.createDataFrame(data, StringType()).toDF('Datetime')
dummy_df.show()
dummy_df.printSchema()
这将为您提供输出:
+--------------+
| Datetime|
+--------------+
|9/5/2015 13:30|
|9/4/2015 11:30|
| 9/1/2018 0:01|
|9/5/2015 12:45|
|9/5/2015 13:00|
|9/5/2015 10:55|
|9/4/2015 18:00|
|9/5/2015 13:00|
+--------------+
root
|-- Datetime: string (nullable = true)
现在,如前所述,请注意日期列的格式并遵循https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html的说明
from pyspark.sql.functions import regexp_replace, month, year, col, to_date
#simple regex, watch out for the white space
dummy_df = dummy_df.withColumn('Date', regexp_replace('Datetime', '( \d+:\d+)', ''))
#transform the date-column
dummy_df = dummy_df.withColumn('Date', to_date(col('Date'), "M/d/y"))\
.withColumn('Month', month(col('Date')))\
.withColumn('Year', year(col('Date')))
dummy_df.show()
dummy_df.printSchema()
最后我们得到:
+--------------+----------+-----+----+
| Datetime| Date|Month|Year|
+--------------+----------+-----+----+
|9/5/2015 13:30|2015-09-05| 9|2015|
|9/4/2015 11:30|2015-09-04| 9|2015|
| 9/1/2018 0:01|2018-09-01| 9|2018|
|9/5/2015 12:45|2015-09-05| 9|2015|
|9/5/2015 13:00|2015-09-05| 9|2015|
|9/5/2015 10:55|2015-09-05| 9|2015|
|9/4/2015 18:00|2015-09-04| 9|2015|
|9/5/2015 13:00|2015-09-05| 9|2015|
+--------------+----------+-----+----+
root
|-- Datetime: string (nullable = true)
|-- Date: date (nullable = true)
|-- Month: integer (nullable = true)
|-- Year: integer (nullable = true)
推荐阅读
- postscript - 覆盖 postscript /setcmykcolor
- mysql - 我应该索引具有低“真”基数 MySQL 的布尔字段吗?
- python - 用漂亮的汤,grequests 进行异步 Webscraping
- c++ - 如何读取 LZF 压缩的 HDF5 数据集?
- javascript - 反应 setstate 嵌套对象问题
- asp.net-mvc - DisplayFormat DataFormatString 不适用于 Razor 视图中的十进制值
- curl - 如何使用 curl 从 Google Cloud Storage 下载文件
- mysql - Mysql加入,计数和求和?
- javascript - 如何在redux中存储某个状态的多个副本
- java - 不传参数删除db中包含父子关系的多个表的所有数据【Spring boot】