docker - Airflow on Docker Swarm does not start unless it is in debug mode
Problem description
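I am deploying Airflow 2.0.1 across multiple EC2 instances using Docker Swarm. The AWS manager node runs the webserver, the scheduler, and three workers. I use Redis as the message broker with the Celery executor, and Flower as a monitoring tool. There are 2 additional worker nodes, each running one worker.
I am having a problem with the scheduler. Even after 20 minutes, the default health check has not succeeded, even though the health check is only a small ping to the webserver. Instead, the container stays in (health: starting) mode until the health check kills the scheduler with SIGTERM (signal 15).
This means that the workers, which depend on the scheduler, fail one after another. All of this happens while the scheduler is actually working fine and doing its job: tasks and DAGs are being executed.
Strangely, the health check works if the environment variable AIRFLOW__LOGGING__LOGGING_LEVEL is set to DEBUG, but not if it is set to INFO. I came across this behavior while trying to debug the problem.
This is annoying because DEBUG logs take up a lot of disk space, and it is obviously not the desired behavior.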
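For context, the Airflow webserver exposes a /health endpoint that returns a small JSON document with a status for the metadatabase and for the scheduler. The sketch below assumes that payload shape and uses a hypothetical helper name, scheduler_is_healthy, to show how such a response could be interpreted. It is not the health check that Docker ran in this deployment, and the sample payload is made up for illustration.

```python
import json


def scheduler_is_healthy(payload: str) -> bool:
    """Return True if a /health JSON payload reports a healthy scheduler."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        # Anything that is not JSON (for example an HTML error page) is unhealthy.
        return False
    if not isinstance(data, dict):
        return False
    scheduler = data.get("scheduler")
    if not isinstance(scheduler, dict):
        return False
    return scheduler.get("status") == "healthy"


# Illustrative payload only; the timestamp is not taken from the question.
SAMPLE = json.dumps({
    "metadatabase": {"status": "healthy"},
    "scheduler": {
        "status": "unhealthy",
        "latest_scheduler_heartbeat": "2021-03-01T12:00:00+00:00",
    },
})

if __name__ == "__main__":
    print(scheduler_is_healthy(SAMPLE))
```

The payload could be fetched with urllib.request.urlopen from the webserver's /health URL. Comparing the result under DEBUG and under INFO might show whether the scheduler heartbeat itself changes with the log level.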
My setup is as follows:
airflow.env:
PYTHONPATH=/opt/airflow/
AIRFLOW_UID=1000
AIRFLOW_GID=0
AIRFLOW_HOME=/opt/airflow/
AIRFLOW__CORE__AIRFLOW_HOME=/opt/airflow/
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__ENABLE_XCOM_PICKLING=true
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
AIRFLOW__CORE__FERNET_KEY=################
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=true
AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW__CORE__PLUGINS_FOLDER=/plugins/
AIRFLOW__CORE__PARALLELISM=128
AIRFLOW__CORE__DAG_CONCURRENCY=32
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=1
AIRFLOW__WEBSERVER__DAG_DEFAULT_VIEW=graph
AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC=30
AIRFLOW__WEBSERVER__HIDE_PAUSED_DAGS_BY_DEFAULT=true
AIRFLOW__WEBSERVER__PAGE_SIZE=1000
AIRFLOW__WEBSERVER__NAVBAR_COLOR='#75eade'
AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT=false
AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG
CELERY_ACKS_LATE=true
CELERY_WORKER_MAX_TASKS_PER_CHILD=500
C_FORCE_ROOT=true
AIRFLOW__CORE__REMOTE_LOGGING=true
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://airflow-logs-docker/production_vm/
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=aws_s3
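Airflow treats environment variables named AIRFLOW__{SECTION}__{KEY} as overrides for the matching option in airflow.cfg. As a rough aid for reviewing a file like the one above, the following sketch groups such entries by section. The function name parse_airflow_env and the shortened ENV_LINES sample are hypothetical and only illustrate the naming convention.

```python
from typing import Dict, Iterable, Tuple


def parse_airflow_env(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Map (section, key) -> value for AIRFLOW__SECTION__KEY entries."""
    settings = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        parts = name.split("__", 2)
        # Only names shaped like AIRFLOW__<SECTION>__<KEY> are config overrides;
        # plain variables such as AIRFLOW_HOME or CELERY_ACKS_LATE are skipped.
        if len(parts) == 3 and parts[0] == "AIRFLOW":
            settings[(parts[1].lower(), parts[2].lower())] = value
    return settings


# Shortened sample taken from the env file above.
ENV_LINES = [
    "AIRFLOW_HOME=/opt/airflow/",
    "AIRFLOW__CORE__EXECUTOR=CeleryExecutor",
    "AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG",
    "AIRFLOW__CORE__REMOTE_LOGGING=true",
    "CELERY_ACKS_LATE=true",
]

if __name__ == "__main__":
    for (section, key), value in sorted(parse_airflow_env(ENV_LINES).items()):
        print(f"[{section}] {key} = {value}")
```

Listing the sections this way may help spot options that sit under an older section name. As far as I know, Airflow 2.x documents the remote logging options under the [logging] section rather than [core].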
docker-compose.yaml:
version: '3.7'
services:
  postgres:
    image: postgres:13
    env_file:
      - ./config/postgres_prod.env
    ports:
      - 5432:5432
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-d", "postgres", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always
    depends_on: []
    deploy:
      placement:
        constraints: [ node.role == manager ]
  redis:
    image: redis:latest
    env_file:
      - ./config/postgres_prod.env
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always
    depends_on: []
    deploy:
      placement:
        constraints: [ node.role == manager ]
  airflow-webserver:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - ./:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == manager ]
  airflow-scheduler:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - ./:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: scheduler
    restart: always
    depends_on:
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == manager ]
  airflow-worker1:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - ./:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: celery worker
    restart: always
    ports:
      - 8791:8080
    depends_on:
      - airflow-scheduler
      - airflow-webserver
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == manager ]
  airflow-worker2:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - ./:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: celery worker
    restart: always
    ports:
      - 8792:8080
    depends_on:
      - airflow-scheduler
      - airflow-webserver
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == manager ]
  airflow-worker3:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - ./:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: celery worker
    restart: always
    ports:
      - 8793:8080
    depends_on:
      - airflow-scheduler
      - airflow-webserver
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == manager ]
  airflow-worker4:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - ./:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: celery worker
    restart: always
    ports:
      - 8794:8080
    depends_on:
      - airflow-scheduler
      - airflow-webserver
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == manager ]
  airflow-worker-pt1:
    image: localhost:5000/myadmin/airflow-ommax
    build:
      context: /home/ubuntu/ommax_etl
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - /home/ubuntu/ommax_etl/:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: celery worker -q airflow_pt
    restart: always
    ports:
      - 8795:8080
    depends_on:
      - airflow-scheduler
      - airflow-webserver
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == worker ]
  airflow-worker-pt2:
    image: localhost:5000/myadmin/airflow-ommax
    build:
      context: /home/ubuntu/ommax_etl
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - /home/ubuntu/ommax_etl/:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: celery worker -q watchhawk
    restart: always
    ports:
      - 8796:8080
    depends_on:
      - airflow-scheduler
      - airflow-webserver
      - airflow-init
    deploy:
      placement:
        constraints: [ node.role == worker ]
  airflow-init:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
      - ./config/init.env
    volumes:
      - ./:/opt/airflow
    # user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: version
    depends_on:
      - postgres
      - redis
    deploy:
      placement:
        constraints: [ node.role == manager ]
  flower:
    image: airflow-ommax
    build:
      context: .
      dockerfile: Dockerfile
    env_file:
      - ./config/airflow.env
      - ./config/postgres_prod.env
    volumes:
      - ./:/opt/airflow
    user: "${AIRFLOW_UID:-1000}:${AIRFLOW_GID:-0}"
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on: []
    deploy:
      placement:
        constraints: [ node.role == manager ]
  selenium-chrome:
    image: selenium/standalone-chrome:latest
    ports:
      - 4444:4444
    deploy:
      placement:
        constraints: [ node.role == worker ]
    depends_on: []
volumes:
  postgres-db-volume:
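The healthcheck blocks in the compose file are defined by interval, timeout, and retries. A rough way to reason about how long a container can keep failing before Docker marks it unhealthy is sketched below. It assumes that each probe starts one interval after the previous probe finished and that the container becomes unhealthy after that many consecutive failures. The HealthCheck class and its method names are hypothetical, and the numbers are estimates rather than measurements from this cluster.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Subset of Docker healthcheck settings, all durations in seconds."""
    interval_s: float
    timeout_s: float
    retries: int
    start_period_s: float = 0.0  # failures inside this window are not counted

    def fastest_unhealthy_s(self) -> float:
        # Every probe fails immediately, so only the waits between probes add up.
        return self.start_period_s + self.retries * self.interval_s

    def slowest_unhealthy_s(self) -> float:
        # Every probe hangs until its timeout before the next wait begins.
        return self.start_period_s + self.retries * (self.interval_s + self.timeout_s)


# Values copied from the airflow-webserver and redis services above.
WEBSERVER = HealthCheck(interval_s=10, timeout_s=10, retries=5)
REDIS = HealthCheck(interval_s=5, timeout_s=30, retries=50)

if __name__ == "__main__":
    for name, check in (("airflow-webserver", WEBSERVER), ("redis", REDIS)):
        print(name, check.fastest_unhealthy_s(), check.slowest_unhealthy_s())
```

Under these assumptions the webserver check would give up after roughly one to two minutes of consecutive failures. The airflow-scheduler service defines no healthcheck of its own in this file, so whatever check applies to it is presumably defined elsewhere, for example in the image.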
Dockerfile:
FROM apache/airflow:2.0.1-python3.7
COPY config/requirements.txt /tmp/
RUN mkdir -p /home/airflow/.cache/zeep
RUN chmod -R 777 /home/airflow/.cache/zeep
RUN mkdir -p /home/airflow/.wdm
RUN chmod -R 777 /home/airflow/.wdm
RUN pip install -r /tmp/requirements.txt
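Solution
I did some scanning of the source code, and the only real implementation I can see that depends on the log level is in worker.py.
What log level do you set for AIRFLOW__LOGGING__LOGGING_LEVEL when it is not DEBUG?
This is the code snippet I am looking at. Does anything like this show up anywhere?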
try:
    loglevel = mlevel(loglevel)
except KeyError:  # pragma: no cover
    self.die('Unknown level {0!r}. Please use one of {1}.'.format(
        loglevel, '|'.join(l for l in LOG_LEVELS if isinstance(l, string_t))))
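The snippet above only fails when the level name is not recognized. The stand-alone sketch below imitates that lookup with the standard logging module. The names resolve_level and describe are hypothetical stand-ins, not the helpers used in the snippet, and the LOG_LEVELS table here is a simplified assumption.

```python
import logging

# Simplified name -> number table; an assumption, not the library's own mapping.
LOG_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_level(level):
    """Turn a level name into its numeric value; pass integers through."""
    if isinstance(level, int):
        return level
    # Raises KeyError for names that are not in the table.
    return LOG_LEVELS[str(level).upper()]


def describe(level):
    """Return the numeric level, or the error text for an unknown name."""
    try:
        return resolve_level(level)
    except KeyError:
        return "Unknown level {0!r}. Please use one of {1}.".format(
            level, "|".join(LOG_LEVELS))


if __name__ == "__main__":
    for candidate in ("DEBUG", "info", "verbose"):
        print(candidate, "->", describe(candidate))
```

In this sketch both DEBUG and INFO resolve without error, which suggests that a lookup of this kind would only matter if the configured value were something other than a standard level name.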