apache-spark - How to bucket tables using AWS Glue and Spark SQL?
Problem description
I am trying to run this query on AWS Glue
CREATE TABLE bucketing_example
USING parquet
CLUSTERED BY (id) INTO 2 BUCKETS
LOCATION 's3://my-bucket/bucketing_example'
AS SELECT * FROM (
VALUES(1, 'red'),
(2, 'orange'),
(5, 'yellow'),
(10, 'green'),
(11, 'blue'),
(12, 'indigo'),
(20, 'violet'))
AS Colors(id, value)
and I am getting the following exception:
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
at org.apache.hadoop.fs.Path.<init>(Path.java:175)
at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.stringToURI(ExternalCatalogUtils.scala:236)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:343)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:339)
at scala.Option.map(Option.scala:146)
Also, I tried to run a similar Spark SQL query against a bucketed table that was created using Athena (still backed by Glue).
Although a DESCRIBE EXTENDED
on the table reveals the bucket column, the Exchange nodes on both sides of the JOIN remain in the plan.
Does bucketing work with Glue and Spark SQL?
Solution
"To take advantage of bucketed tables within Athena, you must create your data files using Apache Hive, because Athena does not support the Apache Spark bucketing format."
See also: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
Hive bucketed table write support, targeted for version 3.2.0: https://issues.apache.org/jira/browse/SPARK-19256
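The incompatibility comes down to bucket layout: both engines assign a row to bucket hash(key) % numBuckets, but Spark hashes with Murmur3 while Hive uses its own hash (for primitive ints, simply the value itself), so the same key can land in differently numbered bucket files. The toy sketch below illustrates the mismatch; the "Spark" hash here is a stand-in bit mixer, not the real Murmur3:

```python
# Toy illustration of why Hive-written and Spark-written buckets don't line
# up: two different hash functions over the same keys produce different
# bucket assignments. fake_spark_bucket is a stand-in mixer, NOT Murmur3.

NUM_BUCKETS = 2

def fake_spark_bucket(key: int) -> int:
    # stand-in for Spark's Murmur3-based hash (illustrative only)
    return (key ^ (key >> 1)) % NUM_BUCKETS

def fake_hive_bucket(key: int) -> int:
    # for primitive ints, Hive's hash really is the value itself
    return key % NUM_BUCKETS

ids = [1, 2, 5, 10, 11, 12, 20]
spark_layout = {i: fake_spark_bucket(i) for i in ids}
hive_layout = {i: fake_hive_bucket(i) for i in ids}

# Keys whose bucket number differs between the two layouts: a reader that
# assumes one layout would open the wrong bucket file for these keys.
disagreements = [i for i in ids if spark_layout[i] != hive_layout[i]]
print(disagreements)  # → [2, 10, 11]
```

This is why Athena (which expects Hive's layout) cannot safely prune or join on buckets written by Spark, and vice versa, until SPARK-19256 lands.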