[ALERT] airflow-webserver has been in a non-ready state for longer than 15 minutes.
Issue Description: In a multi-node environment where Airflow is installed, a critical alert may be fired with the following message:
- airflow-webserver has been in a non-ready state for longer than 15 minutes
Root Cause: Multiple replicas of the scheduler are created by default, but they are not needed and cause instability. See the 'Root Cause' section for more details.
Diagnosing: An alert fired in Rancher is received, similar to the following:
[ALERT]
alertname = KubePodNotReady
namespace = airflow
pod = airflow-webserver-67485b9d9-p75cl
prometheus = cattle-monitoring-system/rancher-monitoring-prometheus
severity = critical
Annotations
message = Pod airflow/airflow-webserver-67485b9d9-p75cl has been in a non-ready state for longer than 15 minutes.
runbook_url = https://docs.uipath.com/automation-suite/docs/alert-runbooks
summary = Pod has been in a non-ready state for more than 15 minutes.
You may also see an alert concerning Process Mining, similar to the following:
[1] Firing
Labels
alertname = ProcessMiningPurgeDeletedAppsDagWarning
container = statsd
dag_id = purge_deleted_apps
endpoint = statsd-scrape
instance = 10.42.9.131:9102
job = airflow-statsd
namespace = airflow
pod = airflow-statsd-bb67f9d78-7whgh
prometheus = cattle-monitoring-system/rancher-monitoring-prometheus
service = airflow-statsd
severity = warning
Annotations
description = Process Mining purge deleted apps dag warning
message = Process Mining purge deleted apps dag failed in the last hour
severity_level = warning
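To confirm the state of the pod reported in the alert, you can inspect the airflow namespace directly. The commands below are standard kubectl checks, shown as a sketch; substitute the pod name from your own alert:
- kubectl get pods -n airflow
- kubectl describe pod airflow-webserver-67485b9d9-p75cl -n airflow
The describe output lists the pod's events and readiness conditions, which help confirm why the pod is not ready.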
Root Cause: In Apache Airflow, the scheduler is responsible for orchestrating the execution of tasks on a trigger or schedule. The scheduler is aware of the state of the tasks in active DAG runs, and decides which tasks can be run by triggering their execution.
In a multi-node setup with multiple scheduler replicas, certain types of operations can cause issues, such as race conditions, that result in instability. Process Mining, which involves the ingestion and analysis of event logs to model processes within an organization, is a resource-intensive task that involves multiple steps, potentially spread across multiple tasks and DAGs in Airflow. For example, one DAG associated with Process Mining purges all apps that have been marked for deletion after 30 days.
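As an illustration, if the Airflow CLI is available inside the scheduler pod, recent runs of this DAG can be listed as follows (a sketch; the deployment name and namespace match the ones used in the Resolution section):
- kubectl exec -n airflow deployment/airflow-scheduler -- airflow dags list-runs -d purge_deleted_apps
Failed or stuck runs of purge_deleted_apps in this output line up with the ProcessMiningPurgeDeletedAppsDagWarning alert shown above.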
Here are some general reasons why multiple scheduler replicas can cause issues (a command to check the current replica count follows this list):
- Task Conflicts: With more than one scheduler instance, it is possible that two schedulers try to schedule the same task simultaneously. While Airflow is designed to handle some level of concurrency, having multiple schedulers can push this to its limit and result in unexpected behavior or conflicts, such as double-triggering tasks.
- Resource Contention: Multiple schedulers can also lead to resource contention, where they compete for resources such as CPU, memory, or database connections. In a Process Mining workflow, which is likely computationally intensive, this contention can lead to performance bottlenecks or even failures.
- Complexity: The architecture itself becomes more complex with multiple schedulers, making issues harder to debug. The interdependencies of tasks in Process Mining can make this complexity unmanageable.
- Database Locks: The schedulers read from and write to the backend database. Multiple schedulers may cause more frequent database locks, leading to delays and timeouts, which is particularly problematic for time-sensitive tasks such as data ingestion in Process Mining.
- Inconsistent State: If the schedulers are not perfectly in sync, they could create an inconsistent state for the DAG runs, which would be problematic for tasks that have to be executed in a specific order, a common requirement in Process Mining workflows.
- External Dependencies: As Process Mining relies on external dependencies (e.g., other services or databases), multiple schedulers can complicate the orchestration logic. This can lead to race conditions and other unforeseen issues.
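Before applying the resolution, you can confirm how many scheduler replicas are currently configured. A minimal check, assuming the airflow namespace and the airflow-scheduler deployment name used in the Resolution section:
- kubectl get deployment airflow-scheduler -n airflow -o jsonpath='{.spec.replicas}'
A value greater than 1 indicates that multiple scheduler replicas are configured.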
Resolution: Scale down to a single scheduler instance, as multiple replicas are not needed for this use case and lead to instability issues.
- Patch the airflow-scheduler deployment to set the replica count to 1:
- kubectl patch deployment airflow-scheduler -n airflow --patch '{"spec": {"replicas": 1}}'
- Check the airflow-scheduler deployment status:
- kubectl get deployment airflow-scheduler -n airflow
- Observe 1/1 pods in the READY column after the new pod has started successfully.
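Once the scheduler has been scaled down, the airflow-webserver pod should eventually return to a Ready state. As an illustrative follow-up check (standard kubectl commands, shown as a sketch):
- kubectl rollout status deployment/airflow-scheduler -n airflow
- kubectl get pods -n airflow -w
The KubePodNotReady alert in Rancher should resolve once the airflow-webserver pod reports Ready.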