How do I split a string column in a Spark DataFrame?

Problem description

I have a DataFrame that looks like this:

|------------------------------------------------|---------|
|   path                                         |  content|
|------------------------------------------------|---------|
|    /root/path/main_folder1/folder1/path1.txt   |   Val 1 |
|------------------------------------------------|---------|
|    /root/path/main_folder1/folder2/path2.txt   |   Val 1 |
|------------------------------------------------|---------|
|    /root/path/main_folder1/folder2/path3.txt   |   Val 1 |
|------------------------------------------------|---------|

I want to split the values in the path column on "/" and keep only the part up to /root/path/main_folder1. The output I want is:

|------------------------------------------------|---------|---------------------------|
|   path                                         |  content|  root_path                |
|------------------------------------------------|---------|---------------------------|
|    /root/path/main_folder1/folder1/path1.txt   |   Val 1 |  /root/path/main_folder1  |
|------------------------------------------------|---------|---------------------------|
|    /root/path/main_folder1/folder2/path2.txt   |   Val 1 |  /root/path/main_folder1  |
|------------------------------------------------|---------|---------------------------|
|    /root/path/main_folder1/folder2/path3.txt   |   Val 1 |  /root/path/main_folder1  |
|------------------------------------------------|---------|---------------------------|

I know I have to use withColumn, split, and regexp_extract, but I can't figure out how to limit the output of regexp_extract.

What do I have to do to get the desired output?

Tags: python, apache-spark, pyspark, apache-spark-sql

Solution

You can use a regular expression to extract the first three directory levels.

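from pyspark.sql import functions as F
# Capture group 1 holds the first three "/segment" levels of the path.
df.withColumn("root_path", F.regexp_extract(F.col("path"), r"^((/\w*){3})", 1)) \
    .show(truncate=False)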

Output:

+-----------------------------------------+-------+-----------------------+
|path                                     |content|root_path              |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1  |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2  |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3  |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+
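
To see why the pattern stops after the third level, it can help to try it outside Spark first. The following is only a minimal sketch using Python's built-in re module, not part of the original answer. The helper name extract_root and the constant ROOT_PATTERN are made up for illustration, and Python's regex engine is assumed to behave like Spark's Java engine for these plain ASCII paths.

```python
import re

# Same pattern as in the answer: "/" followed by word characters, repeated three times.
ROOT_PATTERN = re.compile(r"^((/\w*){3})")


def extract_root(path: str) -> str:
    """Hypothetical helper: return capture group 1, or "" when nothing matches.

    Returning "" on a failed match mirrors what regexp_extract does in Spark.
    """
    match = ROOT_PATTERN.search(path)
    return match.group(1) if match else ""


paths = [
    "/root/path/main_folder1/folder1/path1.txt",
    "/root/path/main_folder1/folder2/path2.txt",
    "/root/path/main_folder1/folder2/path3.txt",
]

roots = [extract_root(p) for p in paths]
for path, root in zip(paths, roots):
    print(path, "->", root)
```

Note that \w matches only letters, digits, and underscores, so a directory name containing "-" or "." would be cut off at that character. If your paths may contain such names, replacing \w with [^/] in the pattern is one possible adjustment.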
