How to bucket tables using AWS Glue and Spark SQL?

Problem description

I am trying to run this query on AWS Glue:

CREATE TABLE bucketing_example
  USING parquet
  CLUSTERED BY (id) INTO 2 BUCKETS
  LOCATION 's3://my-bucket/bucketing_example'
  AS SELECT * FROM (
   VALUES(1, 'red'),
         (2, 'orange'),
         (5, 'yellow'),
         (10, 'green'),
         (11, 'blue'),
         (12, 'indigo'),
         (20, 'violet'))
   AS Colors(id, value)

and I am getting the following exception:

java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
  at org.apache.hadoop.fs.Path.<init>(Path.java:175)
  at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.stringToURI(ExternalCatalogUtils.scala:236)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:343)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:339)
  at scala.Option.map(Option.scala:146)

Also, I tried to run a similar Spark SQL query against a bucketed table that was created using Athena (still backed by Glue).

Although DESCRIBE EXTENDED on the table reveals the bucket column, the Exchanges on both arms of the JOIN remain in the plan.

Does bucketing work with Glue and Spark SQL?

Tags: apache-spark, aws-glue

Solution


"To take advantage of bucketed tables within Athena, you must create your data files using Apache Hive, because Athena does not support the Apache Spark bucketing format."

See also: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

Hive bucketed write support (target version 3.2.0): https://issues.apache.org/jira/browse/SPARK-19256
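For Spark-to-Spark use (i.e. when Athena/Hive compatibility is not required), Spark's native bucketing can be written through the DataFrameWriter API instead of the CTAS statement above. A minimal sketch, assuming an existing SparkSession named `spark`; the table name and S3 path are illustrative placeholders matching the question:

```scala
// Sketch only: writes a Spark-bucketed table. The bucketing spec is recorded
// in the Spark metastore format, which Hive and Athena do NOT recognize.
import spark.implicits._

val colors = Seq(
  (1, "red"), (2, "orange"), (5, "yellow"), (10, "green"),
  (11, "blue"), (12, "indigo"), (20, "violet")
).toDF("id", "value")

colors.write
  .format("parquet")
  .bucketBy(2, "id")   // bucketBy only works with saveAsTable, not save()
  .sortBy("id")
  .option("path", "s3://my-bucket/bucketing_example")
  .saveAsTable("bucketing_example")
```

A subsequent Spark join on `id` between two tables bucketed this way (same bucket count and column) can avoid the shuffle Exchange, but the files remain unreadable as bucketed data from Athena, per the quote above.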

