Finding a specific string in Spark SQL / PySpark

Problem description

I am trying to find an exact string match in a column of an employee DataFrame.

Employee  days_present
Alex      1,2,11,23,
John      21,23,25,28

I need to find the employees who were present on day 2, based on the days_present column. Expected output: Alex

Here is what I tried:

    df = spark.sql("select * from employee where days_present RLIKE '2'")
    df.show()

This returns both Alex & John
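
The likely cause is that RLIKE performs a partial regex match, so the pattern '2' is also found inside 21, 23, 25 and 28. The following is a minimal plain-Python sketch of that behavior, using `re.search` as a stand-in for RLIKE (an assumption for illustration only; Spark itself uses Java regular expressions). The anchored pattern `(^|,)2(,|$)` is not from the original post; it is one possible way to match the whole token 2 between comma delimiters.

```python
import re

# Sample rows copied from the question (note Alex's trailing comma).
rows = {"Alex": "1,2,11,23,", "John": "21,23,25,28"}

# Unanchored pattern: "2" is found anywhere, including inside 21, 23, 25, 28.
loose = [name for name, days in rows.items() if re.search(r"2", days)]

# Pattern anchored on the comma delimiters (or string start/end):
# only the complete token "2" matches.
strict = [name for name, days in rows.items() if re.search(r"(^|,)2(,|$)", days)]

print(loose)
print(strict)
```

Here `loose` contains both employees and `strict` contains only Alex. The same anchored pattern should be usable in the RLIKE clause above, although that has not been verified against the original data.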

I would also like to know who was present on both day 2 and day 11; in that case the expected output is only Alex.

Tags: pandas, apache-spark, pyspark-sql

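Solution

We can use the array_intersect function, available from Spark 2.4 onward: split days_present into an array, intersect it with the array of required days, and keep the rows where the size of the intersection is >= 2 (the number of required days).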

Example:

df.show()
+--------+------------+
|Employee|days_present|
+--------+------------+
|    Alex|   1,2,11,23|
|    John| 21,23,25,28|
+--------+------------+
#DataFrame[Employee: string, days_present: string]

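from pyspark.sql.functions import array, array_intersect, col, lit, size, split

# Split the comma-separated string into an array, intersect it with the
# required days ("2" and "11"), and keep rows where both days are present.
df.withColumn("tmp",split(col("days_present"),",")).\
withColumn("intersect",array_intersect(col("tmp"),array(lit("2"),lit("11")))).\
filter(size("intersect") >= 2).\
drop("tmp","intersect").\
show()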

#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#|    Alex|   1,2,11,23|
#+--------+------------+

In spark-sql:

df.createOrReplaceTempView("tmp")

spark.sql("""select Employee, days_present from (select *, size(array_intersect(split(days_present, ','), array('2', '11'))) as matched from tmp) e where matched >= 2""").show()

#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#|    Alex|   1,2,11,23|
#+--------+------------+
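
The threshold 2 in both versions is tied to the number of required days. As a rough plain-Python sketch of the same split / intersect / size logic, generalized to any list of days, consider the following. The helper name `present_on_all` is hypothetical, and dropping empty tokens (such as the one produced by a trailing comma) is an added assumption, not part of the Spark code above.

```python
def present_on_all(days_present, required):
    """Return True if every day in `required` appears as a whole token."""
    # Equivalent of split(days_present, ","), ignoring empty tokens.
    tokens = {t.strip() for t in days_present.split(",") if t.strip()}
    wanted = set(required)
    # Equivalent of size(array_intersect(...)) >= number of required days.
    return len(tokens & wanted) >= len(wanted)


rows = [("Alex", "1,2,11,23"), ("John", "21,23,25,28")]
required = ["2", "11"]

matches = [name for name, days in rows if present_on_all(days, required)]
print(matches)
```

With `required = ["2"]` the same check covers the single-day case from the question, since "2" is compared against whole tokens rather than substrings.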
