amazon-web-services - How do I get the Glue crawler event state?
Problem description
I am following this document, https://aws.amazon.com/premiumsupport/knowledge-center/start-glue-job-run-end/, to set up a lambda that is triggered automatically when the crawler finishes. The event pattern I set up in CloudWatch is:
{
  "detail": {
    "crawlerName": [
      "reddit_movie"
    ],
    "state": [
      "Succeeded"
    ]
  },
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "source": [
    "aws.glue"
  ]
}
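I added a lambda function as the target of this rule in CloudWatch.
I triggered the crawler manually, but it does not trigger the lambda after it finishes. In the crawler log I can see: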
04:36:28
[6c8450a5-970a-4190-bd2b-829a82d67fdf] INFO : Table redditmovies_bb008c32d0d970f0465f47490123f749 in database video has been updated with new schema
04:36:30
[6c8450a5-970a-4190-bd2b-829a82d67fdf] BENCHMARK : Finished writing to Catalog
04:37:37
[6c8450a5-970a-4190-bd2b-829a82d67fdf] BENCHMARK : Crawler has finished running and is in state READY
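Does the log above mean the crawler completed successfully? How can I find out why the crawler did not trigger the lambda function?
And how do I debug this issue? Which log should I look at?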
Solution
The following works:
CloudWatch Events rule -
{
  "source": [
    "aws.glue"
  ],
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "detail": {
    "state": [
      "Succeeded"
    ]
  }
}
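Note that this rule drops the crawlerName filter from the question, so it fires for every crawler that succeeds. One thing worth checking when a rule never fires is whether the crawlerName in the pattern matches the crawler's name exactly. Below is a rough offline sketch of that comparison. It is not EventBridge's real matching logic: it only handles nested objects and lists of exact values, which is all the two patterns here use. The names matches and sample_event, the crawler name some_other_crawler, and the trimmed event shape are made up for illustration; real events carry more fields.

```python
def matches(pattern, event):
    """Return True if `event` satisfies `pattern`.

    Simplified assumption: a pattern value is either a nested dict
    or a list of allowed exact values. Real EventBridge patterns
    support more operators (prefix, anything-but, and so on).
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the matching sub-dict.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must be one of the listed values.
            return False
    return True


# Pattern from the question: restricted to a single crawler name.
question_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Crawler State Change"],
    "detail": {"crawlerName": ["reddit_movie"], "state": ["Succeeded"]},
}

# Pattern from the answer: any crawler that reaches Succeeded.
answer_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Crawler State Change"],
    "detail": {"state": ["Succeeded"]},
}


def sample_event(crawler_name, state="Succeeded"):
    """Build a trimmed, hypothetical crawler state-change event."""
    return {
        "source": "aws.glue",
        "detail-type": "Glue Crawler State Change",
        "detail": {"crawlerName": crawler_name, "state": state},
    }


for name in ("reddit_movie", "some_other_crawler"):
    event = sample_event(name)
    print(name, matches(question_pattern, event), matches(answer_pattern, event))
```

To check a pattern against the real matcher rather than this approximation, EventBridge also offers a TestEventPattern API.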
Sample lambda -
import boto3

# Glue client, created once and reused across invocations
glue = boto3.client('glue')


def lambda_handler(event, context):
    try:
        if event and 'detail' in event and event['detail'] and 'crawlerName' in event['detail']:
            crawler_name = event['detail']['crawlerName']
            print('Received event from crawlerName - {0}'.format(crawler_name))
            # Look up the crawler to find the database it writes to
            crawler = glue.get_crawler(Name=crawler_name)
            print('Received crawler from glue - {0}'.format(str(crawler)))
            database = crawler['Crawler']['DatabaseName']
    except Exception as e:
        print('Error handling events from crawler. Details - {0}'.format(e))
        raise e
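
On the question of whether the crawler actually succeeded: the READY state in the log only says the crawler is idle again. The get_crawler response used in the handler above also carries a LastCrawl block with the outcome of the most recent run. Here is a minimal sketch that reads it. The helper name describe_last_crawl and the FakeGlue stand-in are hypothetical, added only so the example runs without AWS credentials; the stubbed response is trimmed to the fields being read.

```python
def describe_last_crawl(glue_client, crawler_name):
    """Return (state, last_status, error_message) for a crawler.

    `glue_client` is expected to behave like boto3.client('glue').
    """
    crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
    last_crawl = crawler.get("LastCrawl", {})
    return (
        crawler.get("State"),            # READY, RUNNING or STOPPING
        last_crawl.get("Status"),        # SUCCEEDED, CANCELLED or FAILED
        last_crawl.get("ErrorMessage"),  # absent when the run succeeded
    )


class FakeGlue:
    """Hypothetical stand-in for the Glue client, for offline runs."""

    def get_crawler(self, Name):
        # Trimmed response: only the fields read above are included.
        return {
            "Crawler": {
                "Name": Name,
                "State": "READY",
                "LastCrawl": {"Status": "SUCCEEDED"},
            }
        }


state, status, error = describe_last_crawl(FakeGlue(), "reddit_movie")
print(state, status, error)
```

Against a real account, pass boto3.client('glue') instead of FakeGlue(). If the status is SUCCEEDED but the lambda never ran, the next places to look are the rule itself (pattern and target) and the lambda's own CloudWatch log group.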