PySpark filter on an array of structs

Problem description

I have a PySpark dataframe with the following schema:

root
 |-- channelGrouping: string (nullable = false)
 |-- clientId: string (nullable = false)
 |-- customDimensions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- index: long (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to get all records where the customDimensions entry with index 2 has a value matching the regex r'b'.

Sample input

+--------+---------------+------------------------+
|clientId|channelGrouping|customDimensions        |
+--------+---------------+------------------------+
|123     |alpha          |[[1, a], [2, b]]        |
|456     |beta           |[[1, a], [2, b], [3, c]]|
|789     |gama           |[[1, a], [2, a], [3, c]]|
+--------+---------------+------------------------+

Expected output:

+--------+---------------+------------------------+
|clientId|channelGrouping|customDimensions        |
+--------+---------------+------------------------+
|123     |alpha          |[[1, a], [2, b]]        |
|456     |beta           |[[1, a], [2, b], [3, c]]|
+--------+---------------+------------------------+

In BigQuery, this can be achieved with the following query:

select * from some_table where REGEXP_CONTAINS((SELECT value FROM UNNEST(customDimensions) WHERE index=2), r'b')

Tags: python, dataframe, apache-spark, pyspark

Solution

I managed to write a solution:

from pyspark.sql.functions import col, element_at, map_from_entries

df = spark.read.json("some_path")

# Turn the array of (index, value) structs into a map, look up key 2,
# and keep only rows whose value matches the regex 'b'.
df1 = df.where(element_at(map_from_entries(col("customDimensions")), 2).rlike('b'))
