Fixed bugs in the Airflow monitoring dashboards and alerts
Infrastructure Submissions:
- [YES] Have you added an explanation of what your changes do and why you'd like us to include them?
- [NA] I have updated the documentation accordingly.
- [NA] I have added tests to cover my changes.
- [YES] All new and existing tests passed.
- [YES] I have formatted the terraform code. (terraform fmt -recursive && go fmt ./...)
Current Behavior or Linked Issues
Noted bugs in the Airflow dashboards:
- The granularity of data points was fixed at 15 min, so applying a time range of 1 hour returned only 3-4 data points instead of 60.
- In all 3 dashboards, the DatapartitionId filter name is changed to ClusterName.
- The charts on the service dashboard were split by Metric Name instead of Cluster Name. They are now split by Cluster Name.
- In the dags dashboard, the data points were split by Metric Name. They are now split by dagName/TaskId where applicable.
Noted bugs in the Airflow alerts:
- The granularity of metrics in alert queries was set to 5 min in some alerts that expect more data points. It is changed to 1 min, and to 30 sec in the Host-count alerts.
- Changed the aggregation type of the import-error alert to Max.
- Changed the aggregation type of the Error Rate alert to Sum.
Does this introduce a breaking change?
- [NO]
MR Guidelines
- Paste the TF Plan for the MR.
- The Pre-Merge pipeline should be run before merging. (Azure team)
- Does a module exist for the new resource?
- Is there a new variable added in the MR? (Don’t use library variables; use locals.)