pyspark-dataframes - Optimized way to validate string length in PySpark
Problem description
I have the code below for validating string length in PySpark. It collects the result in two DataFrames: one with the valid records and the other with the invalid records.
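from pyspark.sql import functions as f
from pyspark.sql.functions import col

def val_string(DfName, column, max_len, nullable):
    # max_len replaces the original parameter name `len`, which shadowed the builtin
    if nullable == 'no':
        # Rows with a non-null value in the column
        dt_valid = DfName.where(DfName[column].cast("string").isNotNull())
        valid_len = dt_valid.where(f.length(col(column)) <= max_len)
        invalid_len = dt_valid.where(f.length(col(column)) > max_len)
        invalid_len = invalid_len.withColumn("dataTypeValidationErrors", f.lit(column + ' Length More than specified'))
        # Rows with a null value in the column
        dt_invalid = DfName.where(DfName[column].cast("string").isNull())
        dt_invalid = dt_invalid.withColumn('dataTypeValidationErrors', f.lit(column + ' Invalid Data for the Datatype'))
        # The original called an unshown unionAll helper; DataFrame.union is used here instead
        dt_invalid = dt_invalid.union(invalid_len)
        return valid_len, dt_invalid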
For a single column the validation runs fine. When it runs in a loop over 100 columns, the run time becomes far too high and appears to grow exponentially. Is there a way to handle this?
Solution
# sc is assumed to be an existing SparkContext (as in the pyspark shell)
sdf = sc.parallelize([[123, 123], [456, 456], [12345678, None], [None, 1245678]]).toDF(["col1", "col2"])
sdf.show()
+--------+-------+
|    col1|   col2|
+--------+-------+
|     123|    123|
|     456|    456|
|12345678|   null|
|    null|1245678|
+--------+-------+
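from pyspark.sql import functions as sf

length_dict = {"col1": 5, "col2": 3}  # maximum allowed length per column
def val_length(col, length_dict=length_dict):
    # true when the value fits, false when it is too long, null when the value is null
    return sf.length(col) <= sf.lit(length_dict[col])

sdf.select("*", *[val_length(i, length_dict).alias(i + "_length_val") for i in sdf.columns]).show()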
+--------+-------+---------------+---------------+
|    col1|   col2|col1_length_val|col2_length_val|
+--------+-------+---------------+---------------+
|     123|    123|           true|           true|
|     456|    456|           true|           true|
|12345678|   null|          false|           null|
|    null|1245678|           null|          false|
+--------+-------+---------------+---------------+