python - Pyspark regexp_replace with list elements does not replace the string
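Problem description
I am trying to replace strings in a dataframe column using regexp_replace. I have to apply the regex patterns to all records in the dataframe column, but the strings are not replaced as expected.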
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark import sql
from pyspark.sql.functions import regexp_replace, col
import re

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)

df = sc.parallelize([('2345', 'ADVANCED by John'),
                     ('2398', 'ADVANCED by ADVANCE'),
                     ('2328', 'Verified by somerandomtext'),
                     ('3983', 'Double Checked by Marsha')]).toDF(['ID', "Notes"])

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

for i in range(len(reg_patterns)):
    res_split = re.findall(r"[^/]+", reg_patterns[i])
    res_split[0]
    df = df.withColumn('NotesUPD', regexp_replace(col('Notes'), res_split[0], res_split[1]))

df.show()
Output:
+----+--------------------+--------------------+
|  ID|               Notes|            NotesUPD|
+----+--------------------+--------------------+
|2345|    ADVANCED by John|    ADVANCED by John|
|2398| ADVANCED by ADVANCE| ADVANCED by ADVANCE|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
Expected Output:
+----+--------------------+--------------------+
|  ID|               Notes|            NotesUPD|
+----+--------------------+--------------------+
|2345|    ADVANCED by John|         ADV by John|
|2398| ADVANCED by ADVANCE|          ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
Solution
You should write a udf function and loop over reg_patterns, as follows:
import re
from pyspark.sql import functions as f
from pyspark.sql import types as t

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

def replaceUdf(column):
    # Leave null values untouched instead of failing on None.replace
    if column is None:
        return None
    for rule in reg_patterns:
        # "ADVANCED|ADVANCE/ADV/" -> ["ADVANCED|ADVANCE", "ADV"]
        res_split = re.findall(r"[^/]+", rule)
        # Replace each alternative as a literal substring, in the order listed
        for x in res_split[0].split("|"):
            column = column.replace(x, res_split[1])
    return column

reg_replaceUdf = f.udf(replaceUdf, t.StringType())
df = df.withColumn('NotesUPD', reg_replaceUdf(f.col('Notes')))
df.show()
You should get:
+----+--------------------+--------------------+
|  ID|               Notes|            NotesUPD|
+----+--------------------+--------------------+
|2345|    ADVANCED by John|         ADV by John|
|2398| ADVANCED by ADVANCE|          ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
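A note on why the loop in the question has no visible effect: every iteration calls withColumn('NotesUPD', ...) on col('Notes'), so the second pattern rebuilds NotesUPD from the original text and discards the result of the first. As a possible alternative to the UDF, here is a minimal sketch that chains the built-in regexp_replace calls instead. The helper names parse_rules, preview and apply_rules are made up for this example. The \b word boundaries are an added assumption, namely that the abbreviations should only match whole words, so that AS does not fire inside a longer word; drop them to match the original patterns exactly.

```python
import re
from functools import reduce

# Same rule strings as in the question: "alt1|alt2/replacement/"
reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]


def parse_rules(patterns):
    """Split each 'pattern/replacement/' string into a (regex, replacement) pair."""
    rules = []
    for rule in patterns:
        pattern, replacement = re.findall(r"[^/]+", rule)[:2]
        # Assumption: wrap in \b so a short alternative only matches a whole word
        rules.append((r"\b(?:" + pattern + r")\b", replacement))
    return rules


def preview(text, rules):
    """Apply the rules to a plain string with Python's re module (no Spark needed)."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def apply_rules(df, source_col, target_col, rules):
    """Chain one regexp_replace per rule so each call builds on the previous result."""
    from pyspark.sql import functions as F  # imported lazily; requires PySpark

    expr = reduce(
        lambda c, rule: F.regexp_replace(c, rule[0], rule[1]),
        rules,
        F.col(source_col),
    )
    return df.withColumn(target_col, expr)
```

With the df from the question, df = apply_rules(df, 'Notes', 'NotesUPD', parse_rules(reg_patterns)) followed by df.show() should give the same NotesUPD values as the UDF version for this sample data. preview is only a convenience for checking a rule on a plain string before running it in Spark; Python and Java regex syntax agree on the constructs used here, but not on everything.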