How to determine which columns I need, since ApplyMapping isn't case sensitive?

Problem description

I'm updating a PySpark script for a new database model and I've run into problems referencing and updating columns. PySpark apparently brings in all columns in uppercase. ApplyMapping is not case sensitive, but a left join with another table is, so I end up with multiple columns with the same name, one in uppercase and one in lowercase, and I want to pick between them with the SelectFields function.

I've tried using the exact (case-sensitive) names of the columns, but it always returns the same result.

I've tried printing the schema, but the only difference between the columns is the case.

testDF = testDF.join(test2DF, "COLUMN", how='left')

test3DF = test3DF.withColumn("column", test3DF['ATTRB'].substr(4, 4))

fullDF = testDF.join(test3DF, testDF['ID'] == test3DF['ID'])

.....

applymappingTest = ApplyMapping.apply(frame = fullDF, mappings = [
    ('ID', 'string', 'identifier', 'string'),
    ('column', 'string', 'myColumn', 'string'),
    ('COLUMN', 'string', 'someother', 'string')
    ], transformation_ctx = "applymappingTest")

......

selectfieldsTest = SelectFields.apply(frame = applymappingTest, paths = [
    "identifier",
    "myColumn",
], transformation_ctx = "selectfieldsTest")

Expected result: myColumn is the column whose name is in lowercase.

Actual result: myColumn is the column whose name is in uppercase.

Tags: python, pyspark-sql, aws-glue, pyspark-dataframes

Solution


You can set the caseSensitive option in applyMapping.

def applyMapping(
    mappings: Seq[Product4[String, String, String, String]],
    caseSensitive: Boolean = true,
    transformationContext: String = "",
    callSite: CallSite = CallSite("Not provided", ""),
    stageThreshold: Long = 0,
    totalThreshold: Long = 0
): DynamicFrame

https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping
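
If you are calling the transform from Python, where it is not obvious how to pass that caseSensitive flag, one workaround is to disambiguate the clashing columns yourself before the mapping: reference each one through the DataFrame it came from and alias it to a name that differs by more than case. This is a minimal sketch against the question's frames, not the documented API; glueContext, COLUMN_UPPER and column_lower are assumed or hypothetical names:

from awsglue.dynamicframe import DynamicFrame

# Reference each column through its parent DataFrame, then alias the
# pair to names that no longer collide when case is ignored.
fullDF = testDF.join(test3DF, testDF['ID'] == test3DF['ID']).select(
    testDF['ID'],
    testDF['COLUMN'].alias('COLUMN_UPPER'),
    test3DF['column'].alias('column_lower'),
)

# ApplyMapping operates on DynamicFrames, so convert the DataFrame back
# (glueContext is assumed to be the job's GlueContext instance).
fullDyF = DynamicFrame.fromDF(fullDF, glueContext, "fullDyF")

applymappingTest = ApplyMapping.apply(frame = fullDyF, mappings = [
    ('ID', 'string', 'identifier', 'string'),
    ('column_lower', 'string', 'myColumn', 'string'),
    ('COLUMN_UPPER', 'string', 'someother', 'string')
    ], transformation_ctx = "applymappingTest")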

By the way, if I remember correctly, applyMapping is case sensitive by default, but Spark SQL is case insensitive by default. To enable case sensitivity in Spark SQL you can use: spark_session.sql('set spark.sql.caseSensitive=true')
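
Applied to the question's frames, that would look roughly like the following; a minimal sketch assuming an existing SparkSession named spark_session and the fullDF built above:

# Make Spark SQL treat names that differ only in case as distinct.
spark_session.sql('set spark.sql.caseSensitive=true')

# With case sensitivity on, each name resolves to exactly one column,
# so 'column' now picks the lowercase one instead of being ambiguous.
fullDF.select('column').printSchema()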

