Remove specific leading zeros in PySpark

Problem description

I want to remove a specific number of leading zeros from a column in PySpark.

As you can see, I only want to remove the zero when the number of leading zeros is one. The output should then be:

+-----------+-----------------+
|subcategory|output           |
+-----------+-----------------+
|      00EEE|            00EEE|
|    0000EEE|           000EEE|
|       0EEE|              EEE| 
+-----------+-----------------+

Similarly, if I want to remove the zeros when the number of leading zeros is 2, then the output should be:

+-----------+-----------------+
|subcategory|output           |
+-----------+-----------------+
|      00EEE|              EEE|
|    0000EEE|           000EEE|
|       0EEE|             0EEE| 
+-----------+-----------------+

Is there any way to do this?

Tags: regex, apache-spark, pyspark, apache-spark-sql

Solution


I created a generic function that removes leading "0"s depending on the count you specify:

from pyspark.sql import functions as F

def remove_lead_zero(col, n):
    """
    col: name of the column you want to modify
    n: number of leading 0s to remove; they are stripped only when the value
       has exactly n leading zeros (i.e. the character after them is not '0')
    """
    return F.when(
        # matches only when the value starts with exactly n zeros followed by a non-zero character
        F.regexp_extract(col, "^0{{{n}}}[^0]".format(n=n), 0) != "",
        # drop the first n characters (SQL substring is 1-based, so start at n + 1)
        F.expr("substring({col}, {n}, length({col}))".format(col=col, n=n + 1))
    ).otherwise(F.col(col))
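
For reference, the DataFrame used below can be built from the question's sample data; this is just a minimal setup sketch (Spark session and the column name subcategory taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data taken from the question's subcategory column
df = spark.createDataFrame(
    [("00EEE",), ("0000EEE",), ("0EEE",)],
    ["subcategory"],
)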


df.withColumn("output", remove_lead_zero("subcategory", 2)).show()
+-----------+-------+
|subcategory| output|
+-----------+-------+
|      00EEE|    EEE|
|    0000EEE|0000EEE|
|       0EEE|   0EEE|
+-----------+-------+

df.withColumn("output", remove_lead_zero("subcategory", 1)).show()
+-----------+-------+
|subcategory| output|
+-----------+-------+
|      00EEE|  00EEE|
|    0000EEE|0000EEE|
|       0EEE|    EEE|
+-----------+-------+
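
As a side note, the same condition and the substring can be collapsed into a single regexp_replace call, since Spark uses Java regex and supports lookahead; this is only an alternative sketch, not the approach used above:

from pyspark.sql import functions as F

# strip exactly 2 leading zeros, but only when the character after them is not '0'
df.withColumn(
    "output",
    F.regexp_replace("subcategory", "^0{2}(?=[^0])", "")
).show()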
