amazon-web-services - AWS Glue not deleting or deprecating tables generated over now removed S3 data
问题描述
Due to user error, our S3 directory over which a Glue crawler ran routinely became flooded with .csv files. When Glue ran over the S3 directory- it created a table for each of the 200,000+ csv files. I ran a script that deleted the .csv files shortly after (S3 bucket has versioning enabled), and re-ran the Glue crawler with the following settings:
Schema updates in the data store Update the table definition in the data catalog.
Inherit schema from table Update all new and existing partitions with metadata from the table.
Object deletion in the data store Delete tables and partitions from the data catalog.
Within the cloudwatch logs- it's updating the tables matching the remaining data, but it's not deleting any of the tables generated from those .csv files. According to it's configuration log on Cloudwatch- it should be able to do so.
INFO : Crawler configured with Configuration
{
"Version": 1,
"Grouping": {
"TableGroupingPolicy": "CombineCompatibleSchemas"
}
}
and SchemaChangePolicy
{
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "DELETE_FROM_DATABASE"
I should include there is another crawler that is set to crawl over the S3 bucket, but it's not been run in over a year, so I doubt that could be a point of conflict.
I'm are stumped on what could be the issue; as it stands, I can write a script to pattern match the existing tables and drop those with a csv in their name or delete and rebuild the database by having Glue re crawl S3, but if possible- I'd much rather Glue drops the tables itself after identifying they point to no files within S3 itself.
解决方案
我目前正在采用编写脚本来删除 Athena 创建的表的方法。从 Athena 查询生成的所有文件都有 49 个字符长,_
结果文件有 5 个字符,元数据有 6个字符,对于结果查询结果和查询元_
数据,通常遵循以 a 结尾的格式。_csv
_csv_metadata
我得到了我的数据库中所有表名的列表,过滤它只包括那些长度为 49 个字符、以 a 结尾_csv_metadata
并且其中有六个字符的表名_
。我正在迭代每个字符串并删除它们在数据库中的相应表。对于以 结尾的结果查询_csv
,我_csv_metadata
将截断截断字符串的尾随九个字符_metadata
。
如果我要对此进行改进,我还将查询该表并确保其中没有数据并匹配某些列名定义。
推荐阅读
- javascript - 发送数据并保存在文本字段中
- django - Django注册组:组匹配查询不存在
- c# - Windows 窗体 RDLC:将项目更新为包引用格式后缺少程序集
- java - 使用正则表达式在java中拆分字符串
- swift - 如何将单元格固定在 Tableview 中的选择上?
- http - Set-Cookie 的哪一部分应该发送到服务器?
- r - 试图运行这个广义线性模型来发现多个随机效应对蟋蟀质量的影响,我得到了以下错误
- agora.io - 从 Web SDK 到 iOS SDK 的视频通话
- html - 身体标签看到了相同类型的元素
- git - 未跟踪的忽略文件仍然出现在 git status 中