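apache-spark - PySpark: check whether a value in a column matches a key in a dict
Problem description
I want to take a dictionary of keywords and check a column in a PySpark DataFrame to see whether any keyword is present; if one is, return the corresponding dictionary value in a new column.
The setup looks like this: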
myDict = {
    'price': 'Pricing Issue',
    'support': 'Support Issue',
    'android': 'Left for Competitor'
}
df = sc.parallelize([
    ('1', 'Needed better Support'),
    ('2', 'Better value from android'),
    ('3', 'Price was to expensive'),
    ('4', 'Support problems')
]).toDF(['id', 'reason'])
+---+-------------------------+
|id |reason                   |
+---+-------------------------+
|1  |Needed better Support    |
|2  |Better value from android|
|3  |Price was to expensive   |
|4  |Support problems         |
+---+-------------------------+
The end result I am looking for is:
+---+-------------------------+-------------------+
|id |reason                   |new_reason         |
+---+-------------------------+-------------------+
|1  |Needed better Support    |Support Issue      |
|2  |Better value from android|Left for Competitor|
|3  |Price was to expensive   |Pricing Issue      |
|4  |Support problems         |Support Issue      |
+---+-------------------------+-------------------+
What is the best way to build an efficient function for this in PySpark?
Solution
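You can use when expressions to check whether the reason column matches any of the dict keys. The chained when expression can be generated dynamically with Python's functools.reduce function by passing it the list of keys, myDict.keys():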
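from functools import reduce
from pyspark.sql import functions as F

# Fold the keys into one chained expression: F.when(c1, v1).when(c2, v2)...
# The first matching key wins; rows that match no key get null,
# because the chain has no otherwise().
df2 = df.withColumn(
    "new_reason",
    reduce(
        # Case-insensitive, whole-word match of key k against the reason column
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F  # initial value: the functions module, so the first call is F.when(...)
    )
)
df2.show(truncate=False)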
#+---+-------------------------+-------------------+
#|id |reason |new_reason |
#+---+-------------------------+-------------------+
#|1 |Needed better Support |Support Issue |
#|2 |Better value from android|Left for Competitor|
#|3 |Price was to expensive |Pricing Issue |
#|4 |Support problems |Support Issue |
#+---+-------------------------+-------------------+
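To sanity-check the expected labels without a Spark session, the same rule can be mirrored in plain Python. This is only a sketch, not part of the original answer: label_reason is a hypothetical helper name, and the re.escape call is an added precaution that the Spark expression above does not have (it makes no difference for these three keys). It assumes Python's \b behaves like Spark's Java regex \b for this ASCII text.

```python
import re

myDict = {
    "price": "Pricing Issue",
    "support": "Support Issue",
    "android": "Left for Competitor",
}

def label_reason(reason, mapping):
    """Hypothetical helper: return the value of the first key that appears
    in `reason` as a whole word (case-insensitive), or None if no key matches."""
    text = reason.lower()
    for key, label in mapping.items():
        # re.escape is an addition here; the Spark version interpolates the key as-is
        if re.search(rf"\b{re.escape(key.lower())}\b", text):
            return label
    return None  # mirrors the null produced by a when chain without otherwise()

rows = [
    ("1", "Needed better Support"),
    ("2", "Better value from android"),
    ("3", "Price was to expensive"),
    ("4", "Support problems"),
]

labels = {row_id: label_reason(reason, myDict) for row_id, reason in rows}
for row_id, reason in rows:
    print(row_id, reason, labels[row_id], sep=" | ")
```

Because the keys are tried in dictionary order and the first hit is returned, a reason that contains several keywords gets the label of the earliest key, which matches how the chained when expression is evaluated.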