首页 > 解决方案 > Optimized way for String length validation for the Pyspark

问题描述

I have the below code for validating the string length in pyspark . collect the result in two dataframe one with valid dataframe and the other with the data frame with invalid records .

def val_string(DfName,column,len,nullable):
  if(nullable=='no'):
    dt_valid   = DfName.where(DfName[column].cast("string").isNotNull())
    valid_len  = dt_valid.where(f.length(col(column)) <= len)
    invalid_len= dt_valid.where(f.length(col(column)) > len)
    invalid_len= invalid_len.withColumn("dataTypeValidationErrors", f.lit(column+'  '+'Length More than specified'))
    dt_invalid = DfName.where(DfName[column].cast("string").isNull())
    dt_invalid = dt_invalid.withColumn('dataTypeValidationErrors', f.lit(column+'  '+'Invalid Data for the Datatype'))
    dt_invalid = unionAll(dt_invalid,invalid_len)
    return valid_len,dt_invalid

For one column the validate is running fine . When this is running in loop for 100 columns the run time is way too high . its multiplying run timeexponentially. let me know if there is way to handle this .

标签: pyspark-dataframes

解决方案


sdf = sc.parallelize([[123,123], [456,456],[12345678,None],[None,1245678]]).toDF(["col1","col2"])
sdf.show()
+--------+-------+
|    col1|   col2|
+--------+-------+
|     123|    123|
|     456|    456|
|12345678|   null|
|    null|1245678|
+--------+-------+
length_dict = {"col1":5, "col2":3}
def val_length(col, length_dict=length_dict):
    return sf.length(col) <= sf.lit(length_dict[col])

sdf.select("*", *[val_length(i, length_dict).alias(i+"_length_val") for i in sdf.columns]).show()
+--------+-------+---------------+---------------+
|    col1|   col2|col1_length_val|col2_length_val|
+--------+-------+---------------+---------------+
|     123|    123|           true|           true|
|     456|    456|           true|           true|
|12345678|   null|          false|           null|
|    null|1245678|           null|          false|
+--------+-------+---------------+---------------+

推荐阅读