python - Pyspark - 将时间戳传递给 udf
问题描述
我正在尝试根据下面的时间戳检查一个条件,它给我一个错误。谁能指出我在这里做错了什么-
timestamp1 = pd.to_datetime('2018-02-14 12:09:36.0')
timestamp2 = pd.to_datetime('2018-02-14 12:10:00.0')
def check_formula(timestamp2, timestamp1, interval):
if ((timestamp2-timestamp1)<=datetime.timedelta(minutes=(interval/2))):
return True
else:
return False
chck_formula = udf(check_formula, BooleanType())
ts= chck_formula(timestamp2, timestamp1, 5)
print(ts)
以下是我得到的错误 -
An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.sql.Timestamp]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
解决方案
在 spark 中,无论我们做什么,我们都需要使用rdd
or dataframe
。所以你只能申请udf
其中任何一个。所以你需要改变你申请的方式udf
。
这里有两种方法:-
from pyspark.sql import functions as F
import datetime
df = sqlContext.createDataFrame([
['2018-02-14 12:09:36.0', '2018-02-14 12:10:00.0'],
], ["t1", "t2"])
interval = 5
df.withColumn("check", F.datediff(F.col("t2"),F.col("t1")) <= datetime.timedelta(minutes=(interval/2)).total_seconds()).show(truncate=False)
+---------------------+---------------------+-----+
|t1 |t2 |check|
+---------------------+---------------------+-----+
|2018-02-14 12:09:36.0|2018-02-14 12:10:00.0|true |
+---------------------+---------------------+-----+
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import BooleanType
def check_formula(timestamp2, timestamp1, interval):
if ((timestamp2-timestamp1)<=datetime.timedelta(minutes=(interval/2))):
return True
else:
return False
chck_formula = udf(check_formula, BooleanType())
df.withColumn("check", chck_formula(F.from_utc_timestamp(F.col("t2"), "PST"), F.from_utc_timestamp(F.col("t1"), "PST"), F.lit(5))).show(truncate=False)
+---------------------+---------------------+-----+
|t1 |t2 |check|
+---------------------+---------------------+-----+
|2018-02-14 12:09:36.0|2018-02-14 12:10:00.0|true |
+---------------------+---------------------+-----+
推荐阅读
- mongodb - 如何在 Spring Boot 中禁用 GridFS MD5 计算?
- c# - ASP.NET MVC Web 应用程序无法通过 Arvixe 托管的 IIS 服务器上的 SMTP 发送电子邮件
- r - R topicmodels包:我们做LDA时如何识别Beta(eta)的参数?
- php - 路由 [admin.settings.edit] 未定义 laravel
- c# - 只要按下按钮,在 C# WPF 应用程序中是否有可能做某事?
- python - 无法使用 Mongoengine ImageField 将图像上传到 MEDIA_ROOT
- sqlite - 如果使用 WHERE 子句不存在值,则创建表合并两个表设置默认值
- c# - 祖先绑定在 ListView 中只工作一次
- perl - Mojo::useragent SSL 失败
- python - Pandas Dataframe groupby + agg + lambda + unique 抛出 ValueError