Airflow alerts
Infrastructure Submissions:
- [YES] Have you added an explanation of what your changes do and why you'd like us to include them?
- [NO] I have updated the documentation accordingly.
- [YES] I have added tests to cover my changes.
- [YES/NO/NA] All new and existing tests passed.
- [YES] I have formatted the terraform code. (`terraform fmt -recursive && go fmt ./...`)
Current Behavior or Linked Issues
Currently, when there is an issue with the Airflow infrastructure or service, we find out about it only from logs or while someone is using the Airflow service. There are no monitoring alerts for Airflow, so our response to infra or service issues is delayed. The alerts added here are a step toward fixing that.
Does this introduce a breaking change?
- [NO]
Other information
The following alerts have been added.
# Airflow Component Host Count Alert #
Triggers when the host count of an Airflow component drops below the required count for 2 or more consecutive breaches, evaluated every five minutes over a five-minute period.
- airflow-scheduler-host-count-alert: threshold count 1
- airflow-web-host-count-alert: threshold count 2
- airflow-worker-host-count-alert: threshold count 1
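
As a rough illustration of the breach rule described above, here is a minimal Go sketch. It assumes that each datapoint is one five-minute evaluation, that a breach means a value strictly below the threshold, and that the breaches must occur in back-to-back evaluations. The function name `breachesBelow` and the sample values are hypothetical and are not part of this change.

```go
package main

import "fmt"

// breachesBelow reports whether the most recent `required` datapoints are all
// strictly below `threshold`. Each datapoint stands for one five-minute
// evaluation. Hypothetical helper, for illustration only.
func breachesBelow(datapoints []float64, threshold float64, required int) bool {
	if required <= 0 || len(datapoints) < required {
		return false
	}
	for _, v := range datapoints[len(datapoints)-required:] {
		if v >= threshold {
			return false
		}
	}
	return true
}

func main() {
	// Made-up web host counts, oldest first.
	fmt.Println(breachesBelow([]float64{2, 2, 1, 1}, 2, 2)) // two breaches in a row
	fmt.Println(breachesBelow([]float64{2, 1, 2, 1}, 2, 2)) // breaches are not back to back
}
```

With these inputs the first call reports a breach and the second does not, because a healthy evaluation separates the two low readings.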
# Airflow Component CPU Usage Alert #
Triggers when the CPU usage of an Airflow component rises above the threshold limit for 3 or more consecutive breaches, evaluated every five minutes over a five-minute period.
- airflow-scheduler-CPU-Usage-alert: threshold limit 80%
- airflow-web-CPU-Usage-alert: threshold limit 80%
- airflow-worker-CPU-Usage-alert: threshold limit 80%
- airflow-postgres-CPU-Usage-alert: threshold limit 85%
# Airflow Component Memory Usage Alert #
Triggers when the memory usage of an Airflow component rises above the threshold limit for 3 or more consecutive breaches, evaluated every five minutes over a five-minute period.
- airflow-scheduler-memory-usage-alert: threshold limit 80%
- airflow-web-memory-usage-alert: threshold limit 80%
- airflow-worker-memory-usage-alert: threshold limit 80%
- airflow-Redis-memory-usage-alert: threshold limit 80%
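
For reviewers who want the CPU and memory limits in one place, the sketch below collects them in a Go map and checks a single reading against its limit. The alert names and percentages are taken from this description. The map `usageThresholds`, the function `exceeds`, and the sample readings are hypothetical, and the sketch assumes a breach means usage strictly above the limit.

```go
package main

import "fmt"

// usageThresholds mirrors the percentage limits listed in this description.
var usageThresholds = map[string]float64{
	"airflow-scheduler-CPU-Usage-alert":    80,
	"airflow-web-CPU-Usage-alert":          80,
	"airflow-worker-CPU-Usage-alert":       80,
	"airflow-postgres-CPU-Usage-alert":     85,
	"airflow-scheduler-memory-usage-alert": 80,
	"airflow-web-memory-usage-alert":       80,
	"airflow-worker-memory-usage-alert":    80,
	"airflow-Redis-memory-usage-alert":     80,
}

// exceeds reports whether a single usage reading (in percent) is strictly
// above the limit for the named alert. Hypothetical helper, for illustration.
func exceeds(alert string, usagePercent float64) (bool, error) {
	limit, ok := usageThresholds[alert]
	if !ok {
		return false, fmt.Errorf("unknown alert %q", alert)
	}
	return usagePercent > limit, nil
}

func main() {
	// The same made-up reading of 83% breaches the scheduler limit (80%)
	// but not the postgres limit (85%).
	for _, name := range []string{
		"airflow-scheduler-CPU-Usage-alert",
		"airflow-postgres-CPU-Usage-alert",
	} {
		breached, err := exceeds(name, 83)
		fmt.Println(name, breached, err)
	}
}
```

A single reading above the limit is only one breach; the alert itself still depends on the number of breaches stated for each alert group.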
# Airflow Service Alerts #
- airflow-service-error-rate-alert: triggers when the 5xx error rate rises above the threshold limit of 20, evaluated every 5 minutes over a 10-minute period.
- airflow-scheduler-heartbeat-alert: triggers when the scheduler heartbeat drops below the threshold limit of 2 for 2 or more consecutive breaches, evaluated every 5 minutes over a 5-minute period.
- airflow-dag-processor-timeout-alert: triggers when DAG processing timeouts occur for 2 or more consecutive breaches, evaluated every 5 minutes over a 5-minute period.
- airflow-import-errors-alert: triggers when DAG import errors occur for 2 or more consecutive breaches, evaluated every 5 minutes over a 5-minute period.