Limit the occurrences of a value in a column with pyspark

Problem description

I would like to limit the occurrences of a given value in a column with pyspark. I tried:

table = table.filter(countDistinct(date_format(table['stamp'], 'yyyy-MM-dd')) == 4)

But it does not work, because I get this error:

An error occurred while calling o110.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: count(distinct date_format(cast(input[13, string, true] as timestamp) etc.

Do you have any other ideas?

Tags: apache-spark, filter, count, pyspark

Solution


It would really help if you could add a sample and the expected output to the question. It is not clear why you are using countDistinct if you want to check the occurrences of values: countDistinct is an aggregate function, so Spark cannot evaluate it row by row inside filter, which is exactly what the UnsupportedOperationException is complaining about. You should rather use count within a groupBy statement.
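For reference, a minimal sketch, reusing the table and stamp column from the question, of where an aggregate such as countDistinct can legitimately appear (the distinct_days alias is just illustrative):

from pyspark.sql import functions as F

# Aggregates are evaluated over groups of rows, so countDistinct belongs
# in agg() (or after a groupBy), not inside filter()
table.agg(
    F.countDistinct(F.date_format(table["stamp"], "yyyy-MM-dd")).alias("distinct_days")
).show()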

Nevertheless, this snippet might help you with the actual task:

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data: 2018-09-01 12:00 appears 4 times, the other two timestamps 3 times each
df_new = spark.createDataFrame([
    (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)),
    (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)),
    (1, datetime.datetime(2018, 9, 2, 13)), (1, datetime.datetime(2018, 9, 2, 13)),
    (1, datetime.datetime(2018, 9, 2, 13)), (2, datetime.datetime(2018, 9, 1, 13)),
    (2, datetime.datetime(2018, 9, 1, 13)), (2, datetime.datetime(2018, 9, 1, 13)),
], ("id", "time"))

# Count how often each timestamp occurs and join the counts back onto the original rows
occurences_df = df_new.groupBy("time").count().withColumnRenamed("time", "count_time")
df_new.join(occurences_df, df_new["time"] == occurences_df["count_time"], how="left").show()

Output:

+---+-------------------+-------------------+-----+
| id|               time|         count_time|count|
+---+-------------------+-------------------+-----+
|  1|2018-09-01 12:00:00|2018-09-01 12:00:00|    4|
|  1|2018-09-01 12:00:00|2018-09-01 12:00:00|    4|
|  1|2018-09-01 12:00:00|2018-09-01 12:00:00|    4|
|  1|2018-09-01 12:00:00|2018-09-01 12:00:00|    4|
|  2|2018-09-01 13:00:00|2018-09-01 13:00:00|    3|
|  2|2018-09-01 13:00:00|2018-09-01 13:00:00|    3|
|  2|2018-09-01 13:00:00|2018-09-01 13:00:00|    3|
|  1|2018-09-02 13:00:00|2018-09-02 13:00:00|    3|
|  1|2018-09-02 13:00:00|2018-09-02 13:00:00|    3|
|  1|2018-09-02 13:00:00|2018-09-02 13:00:00|    3|
+---+-------------------+-------------------+-----+

Then you can filter on the count column to keep only the rows with the desired number of occurrences.
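For example, a minimal sketch of that final filtering step, building on df_new and occurences_df from above (the threshold of 4 matches the value in the original question):

from pyspark.sql import functions as F

# Keep only the rows whose timestamp occurs exactly 4 times,
# then drop the helper columns that were only needed for counting
(df_new
 .join(occurences_df, df_new["time"] == occurences_df["count_time"], how="left")
 .filter(F.col("count") == 4)
 .drop("count_time", "count")
 .show())

If you would rather avoid the join, a window function such as F.count("*").over(Window.partitionBy("time")) gives the same per-row counts in a single pass.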

