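apache-spark - Reading data from S3 when partitions have unequal columns
Problem description
I have partitioned data in S3 where each partition has a different number of columns, as shown below. When I read the data in pyspark and try to print the schema, I only see the columns that are common to all partitions, not all of them. What is the best way to read all the columns and rename a few of them?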
aws s3 ls s3://my-bkt/test_data/
PRE occ_dt=20210426/
PRE occ_dt=20210428/
PRE occ_dt=20210429/
PRE occ_dt=20210430/
PRE occ_dt=20210503/
PRE occ_dt=20210504/
spark.read.parquet("s3://my-bkt/test_data/").printSchema()
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
# When I read partition 20210504
spark.read.parquet("s3://my-bkt/test_data/occ_dt=20210504/").printSchema()
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
|-- cust_activity: string (nullable = true)
|-- map_api__592rtddvid: string (nullable = true)
# When I read partition 20210503
spark.read.parquet("s3://my-bkt/test_data/occ_dt=20210503/").printSchema()
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_4js3nnju8572d93: string (nullable = true)
|-- map_api_58943h64u47v: string (nullable = true)
|-- map_api__58943h6220dh: string (nullable = true)
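As shown above, partitions 20210503 and 20210504 have more fields than the other partitions. When I read the S3 bucket to get the schema, only the fields common to all partitions are shown. I would like every field to be returned when I read the S3 location, as in the expected result below.
Expected output:
spark.read.parquet("s3://my-bkt/test_data/").printSchema()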
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
|-- map_api_4js3nnju8572d93: string (nullable = true)
|-- map_api_58943h64u47v: string (nullable = true)
|-- map_api__58943h6220dh: string (nullable = true)
|-- cust_activity: string (nullable = true)
|-- map_api__592rtddvid: string (nullable = true)
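Thanks in advance!!
Solution
Add the mergeSchema option when reading. Schema merging is off by default for Parquet, so without it Spark picks the schema from a summary file or a single data file rather than combining the schemas of all partitions.
spark.read.option("mergeSchema", "true").parquet("s3://my-bkt/test_data/").printSchema()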
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
|-- map_api_4js3nnju8572d93: string (nullable = true)
|-- map_api_58943h64u47v: string (nullable = true)
|-- map_api__58943h6220dh: string (nullable = true)
|-- cust_activity: string (nullable = true)
|-- map_api__592rtddvid: string (nullable = true)
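The question also asks about renaming a few columns, which the answer above does not cover. Here is a minimal sketch of one way to combine the merged read with a rename step. It assumes an existing SparkSession is passed in as `spark`. The helper names `renamed` and `read_all_columns` and the new column names in `RENAMES` are hypothetical, chosen only for illustration.

```python
def renamed(columns, renames):
    """Return the column names with `renames` applied; other names pass through."""
    return [renames.get(name, name) for name in columns]


def read_all_columns(spark, path, renames):
    """Read every partition under `path` with schema merging, then rename columns.

    `spark` is assumed to be an existing SparkSession.
    """
    df = spark.read.option("mergeSchema", "true").parquet(path)
    # toDF replaces all column names positionally in a single call.
    return df.toDF(*renamed(df.columns, renames))


# Hypothetical mapping: the target names are made up for this example.
RENAMES = {
    "cust_activity": "customer_activity",
    "map_api_has_join": "has_join",
}
```

It could be called as read_all_columns(spark, "s3://my-bkt/test_data/", RENAMES). Rows from partitions that lack a given column come back with null in that column. Schema merging has to inspect the files in every partition, so it can be noticeably slower on large datasets. To enable it for every Parquet read in a session instead of per read, set spark.sql.parquet.mergeSchema to true.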