首页 > 解决方案 > Pyspark:选择部分字符串(文件路径)列值

问题描述

Pyspark:拆分并选择部分字符串列值

如何从 spark DF 的列中选择第 4 个(左起)反斜杠之后的字符或文件路径?

pyspark 列的示例行:

\\D\Dev\johnny\Desktop\TEST
\\D\Dev\matt\Desktop\TEST\NEW
\\D\Dev\matt\Desktop\TEST\OLD\TEST
\\E\dev\peter\Desktop\RUN\SUBFOLDER\New
\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New
\\LE345\jskx\rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse

预期产出

johnny\Desktop\TEST
matt\Desktop\TEST\NEW
matt\Desktop\TEST\OLD\TEST
peter\Desktop\RUN\SUBFOLDER\New
ums\Desktop\RUN\SUBFOLDER\New
rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
vso\hsk\mwq\sdsf\kse

我之前的问题导致了这个新问题。感谢任何帮助。

标签: pythondataframepysparkapache-spark-sql

解决方案


regexp_replace您可以在例如使用正则表达式。

from pyspark.sql import functions as F

df = df.withColumn('sub_path',F.regexp_replace("path","^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\",""))

您也可以使用此解决方案更灵活,例如。

from pyspark.sql import functions as F
no_of_slashes=4 # number of slashes to consider here

# we build the regular expression by repeating `"[a-zA-Z0-9]+\\\\"`
# NB. We subtract 2 since we start with the frst 2 slashes
df = df.withColumn('sub_path',F.regexp_replace("path","^\\\\\\\\"+("[a-zA-Z0-9]+\\\\"*(no_of_slashes-2)),""))

让我知道这是否适合您。


推荐阅读