How do I get the AWS Glue crawler event state?

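Problem description

I am following this document, https://aws.amazon.com/premiumsupport/knowledge-center/start-glue-job-run-end/, to set up a Lambda function that is triggered automatically when a crawler finishes. The event pattern I set in CloudWatch is: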

{
  "detail": {
    "crawlerName": [
      "reddit_movie"
    ],
    "state": [
      "Succeeded"
    ]
  },
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "source": [
    "aws.glue"
  ]
}

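I added a Lambda function as the target of this rule in CloudWatch.

I triggered the crawler manually, but it does not trigger the Lambda function after it finishes. In the crawler log I can see: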

04:36:28
[6c8450a5-970a-4190-bd2b-829a82d67fdf] INFO : Table redditmovies_bb008c32d0d970f0465f47490123f749 in database video has been updated with new schema

04:36:30
[6c8450a5-970a-4190-bd2b-829a82d67fdf] BENCHMARK : Finished writing to Catalog

04:37:37
[6c8450a5-970a-4190-bd2b-829a82d67fdf] BENCHMARK : Crawler has finished running and is in state READY

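Do the logs above mean that the crawler completed successfully? How can I find out why the crawler did not trigger the Lambda function?

And how do I debug this issue? Which log should I look at?

Tags: amazon-web-services, aws-lambda, amazon-cloudwatch, aws-glue

Solution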


The following works:

CloudWatch Events rule:

{
  "source": [
    "aws.glue"
  ],
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "detail": {
    "state": [
      "Succeeded"
    ]
  }
}
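To see how this rule differs from the pattern in the question, it can help to check sample events against both patterns offline. The following is a minimal sketch, not the real CloudWatch Events matching engine: it assumes only the exact-value matching used in these two patterns (each list holds allowed values, nested objects are matched recursively) and ignores every other operator. The `matches` helper and the sample events are hypothetical and hand-written for illustration, not captured from a real crawler run.

```python
def matches(pattern, event):
    """Return True if `event` satisfies `pattern` (exact-value matching only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # A list of allowed values: the event value must equal one of them.
            if actual not in expected:
                return False
    return True


# Pattern from the answer: any crawler that reaches the Succeeded state.
broad_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Crawler State Change"],
    "detail": {"state": ["Succeeded"]},
}

# Pattern from the question: additionally restricted to one crawler name.
narrow_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Crawler State Change"],
    "detail": {"crawlerName": ["reddit_movie"], "state": ["Succeeded"]},
}

# Hand-written sample events (illustrative values only).
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Crawler State Change",
    "detail": {"crawlerName": "reddit_movie", "state": "Succeeded"},
}
other_crawler_event = {
    "source": "aws.glue",
    "detail-type": "Glue Crawler State Change",
    "detail": {"crawlerName": "some_other_crawler", "state": "Succeeded"},
}

for name, pattern in [("broad", broad_pattern), ("narrow", narrow_pattern)]:
    print(name, matches(pattern, sample_event), matches(pattern, other_crawler_event))
```

Under these assumptions the only difference between the two patterns is the `crawlerName` filter, so if the narrower pattern never fires, the crawler name in the rule is worth comparing character by character with the name shown in the Glue console.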

Sample Lambda function:

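import boto3

# Create the Glue client once, outside the handler, so warm invocations reuse it.
glue = boto3.client('glue')


def lambda_handler(event, context):
    try:
        # Only handle events that carry a crawler name in their detail section.
        if event and 'detail' in event and event['detail'] and 'crawlerName' in event['detail']:
            crawler_name = event['detail']['crawlerName']
            print('Received event from crawlerName - {0}'.format(crawler_name))

            # Look up the crawler definition to find its target database.
            crawler = glue.get_crawler(Name=crawler_name)
            print('Received crawler from glue - {0}'.format(str(crawler)))

            database = crawler['Crawler']['DatabaseName']
    except Exception as e:
        print('Error handling events from crawler. Details - {0}'.format(e))
        raise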
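Because the rule above does not filter on `crawlerName`, the function is invoked for every crawler in the account that succeeds. If only some crawlers matter, one option is to filter inside the handler. The sketch below is an assumption layered on top of the answer, not part of it: `ALLOWED_CRAWLERS` and `crawler_name_if_allowed` are hypothetical names, and the handler only logs and returns a small dictionary instead of calling Glue.

```python
# Hypothetical allow-list; replace with the crawlers you care about.
ALLOWED_CRAWLERS = {"reddit_movie"}


def crawler_name_if_allowed(event, allowed=ALLOWED_CRAWLERS):
    """Return the crawler name from a Glue crawler event, or None to skip it."""
    detail = (event or {}).get("detail") or {}
    name = detail.get("crawlerName")
    # Accept only successful runs of crawlers on the allow-list.
    if detail.get("state") == "Succeeded" and name in allowed:
        return name
    return None


def lambda_handler(event, context):
    name = crawler_name_if_allowed(event)
    if name is None:
        # Log the full event so skipped invocations are visible in the function's logs.
        print("Ignoring event: {0}".format(event))
        return {"handled": False}
    print("Crawler {0} succeeded".format(name))
    return {"handled": True, "crawlerName": name}
```

Printing the whole event on the skip path also helps with debugging: if the function's log stream shows no entry at all after a crawler run, the rule never invoked the function, which points at the rule or its target rather than at the handler code.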

Screenshot: adding the crawler CloudWatch Events rule.

