PySpark: Check if a value in a column is like a key in a dict

Problem description

I want to take a dictionary of keywords and check a column in a PySpark DataFrame to see whether any of those keywords occur; if one does, return the corresponding value from the dictionary in a new column.

The problem looks like this:

myDict = {
'price': 'Pricing Issue',
'support': 'Support Issue',
'android': 'Left for Competitor'
}

df = sc.parallelize([
    ('1', 'Needed better Support'),
    ('2', 'Better value from android'),
    ('3', 'Price was to expensive'),
    ('4', 'Support problems')
]).toDF(['id', 'reason'])

+---+-------------------------+
|id |reason                   |
+---+-------------------------+
|1  |Needed better Support    |
|2  |Better value from android|
|3  |Price was to expensive   |
|4  |Support problems         |
+---+-------------------------+

The end result I am looking for is:

+---+-------------------------+-------------------+
|id |reason                   |new_reason         |
+---+-------------------------+-------------------+
|1  |Needed better Support    |Support Issue      |
|2  |Better value from android|Left for Competitor|
|3  |Price was to expensive   |Pricing Issue      |
|4  |Support problems         |Support Issue      |
+---+-------------------------+-------------------+

What is the best way to build an efficient function for this in PySpark?

Tags: apache-spark, pyspark, apache-spark-sql

Solution


You can use `when` expressions to check whether the `reason` column matches the dict keys, and generate those expressions dynamically with Python's `functools.reduce` function by folding over `myDict.keys()`:

from functools import reduce
from pyspark.sql import functions as F

df2 = df.withColumn(
    "new_reason",
    # Fold over the dict keys, chaining one when() clause per keyword.
    # Passing the functions module F as the initial value works because
    # the first call in the fold is F.when(...), which returns a Column.
    reduce(
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F
    )
)

df2.show(truncate=False)
#+---+-------------------------+-------------------+
#|id |reason                   |new_reason         |
#+---+-------------------------+-------------------+
#|1  |Needed better Support    |Support Issue      |
#|2  |Better value from android|Left for Competitor|
#|3  |Price was to expensive   |Pricing Issue      |
#|4  |Support problems         |Support Issue      |
#+---+-------------------------+-------------------+
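The chained `when` expression matches each keyword case-insensitively and as a whole word via `rlike(r"\b...\b")`; rows that match no keyword get `null` (you can chain `.otherwise(...)` onto the reduced expression for a default label). As a sanity check, the same matching logic can be sketched in plain Python with `re` (a hypothetical helper for illustration, not part of the Spark job):

```python
import re

myDict = {
    'price': 'Pricing Issue',
    'support': 'Support Issue',
    'android': 'Left for Competitor',
}

def classify(reason):
    # Mirror the Spark expression: lowercase the text and look for each
    # keyword as a whole word; first match wins, no match returns None
    # (the equivalent of null in the Spark column).
    for keyword, label in myDict.items():
        if re.search(rf"\b{keyword}\b", reason.lower()):
            return label
    return None

print(classify("Needed better Support"))    # Support Issue
print(classify("Price was to expensive"))   # Pricing Issue
print(classify("Cancelled my account"))     # None
```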
