首页 > 解决方案 > AWS Glue not deleting or deprecating tables generated over now removed S3 data

问题描述

Due to user error, our S3 directory over which a Glue crawler ran routinely became flooded with .csv files. When Glue ran over the S3 directory- it created a table for each of the 200,000+ csv files. I ran a script that deleted the .csv files shortly after (S3 bucket has versioning enabled), and re-ran the Glue crawler with the following settings:

Schema updates in the data store    Update the table definition in the data catalog.
Inherit schema from table   Update all new and existing partitions with metadata from the table.
Object deletion in the data store   Delete tables and partitions from the data catalog.

Within the cloudwatch logs- it's updating the tables matching the remaining data, but it's not deleting any of the tables generated from those .csv files. According to it's configuration log on Cloudwatch- it should be able to do so.

INFO : Crawler configured with Configuration 
{
    "Version": 1,
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}
 and SchemaChangePolicy 
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"

I should include there is another crawler that is set to crawl over the S3 bucket, but it's not been run in over a year, so I doubt that could be a point of conflict.

I'm are stumped on what could be the issue; as it stands, I can write a script to pattern match the existing tables and drop those with a csv in their name or delete and rebuild the database by having Glue re crawl S3, but if possible- I'd much rather Glue drops the tables itself after identifying they point to no files within S3 itself.

标签: amazon-web-servicesamazon-s3aws-glueamazon-athena

解决方案


我目前正在采用编写脚本来删除 Athena 创建的表的方法。从 Athena 查询生成的所有文件都有 49 个字符长,_结果文件有 5 个字符,元数据有 6个字符,对于结果查询结果和查询元_数据,通常遵循以 a 结尾的格式。_csv_csv_metadata

我得到了我的数据库中所有表名的列表,过滤它只包括那些长度为 49 个字符、以 a 结尾_csv_metadata并且其中有六个字符的表名_。我正在迭代每个字符串并删除它们在数据库中的相应表。对于以 结尾的结果查询_csv,我_csv_metadata将截断截断字符串的尾随九个字符_metadata

如果我要对此进行改进,我还将查询该表并确保其中没有数据并匹配某些列名定义。


推荐阅读