This is the second post in a series exploring long-running service workloads in YARN. See the introductory post for background.
Long-running services deployed on YARN are by definition expected to run for a long period of time—in many cases forever. Services such as Apache™ HBase, Apache Accumulo and Apache Storm can be run on YARN to provide a layer of services to end users, and they usually have a central master running in conjunction with an ApplicationMaster (AM). The AM is a potential single point of failure for all of these services.
In this blog post, we will discuss a few strategies for making long-running services on YARN more fault-tolerant to AM failures.
The way things were
Before HDP 2.2, YARN already supported a basic mechanism to restart the AM after a crash or failure. Together, the ResourceManager and the NodeManagers running the AMs are responsible for automatically detecting an AM crash and restarting the AM. However, this approach has a couple of significant limitations:
- Loss of work-in-progress: When an AM crashes, all the containers it launched are forcefully killed along with it. So all the work being done by the existing containers is lost if and when an AM dies.
- Limited AM retries: The number of times an AM can be restarted is limited by a user-specified value at application-submission time [via the API ApplicationSubmissionContext#setMaxAppAttempts(int maxAppAttempts); see the sketch after this list]. This value is also capped by a configurable platform-level global limit, yarn.resourcemanager.am.max-attempts. The final value of maxAppAttempts determines how many AM instances can be created over the lifetime of a particular application. Once those AM attempts are exhausted, the application is deemed a failure and will not be restarted, which is an unacceptable outcome for long-running services.
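For reference, here is a minimal sketch of where this limit is set at submission time. It assumes the standard YarnClient submission flow; the class name and the value of 4 are placeholders, not recommendations.

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MaxAttemptsExample {
  public static void main(String[] args) throws Exception {
    // Standard YarnClient bootstrap.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();

    // Allow up to 4 AM attempts for this application; the effective value is
    // still capped by the cluster-wide yarn.resourcemanager.am.max-attempts.
    appContext.setMaxAppAttempts(4);

    // ... set the AM container launch context, resources, queue, etc., then:
    // yarnClient.submitApplication(appContext);
  }
}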
Let’s take a look at solutions available now with HDP 2.2 that can overcome these limitations.
1. Container-preserving AM restart
The way to prevent the loss of work-in-progress across an AM crash is to implement container-preserving AM restart. This allows the AM to be restarted without killing its associated containers, and then to rebind/reconnect the restarted AM with its previously running containers. All the related work for container-preserving AM restart is tracked by the Apache Hadoop YARN JIRA ticket YARN-1489 (note that the ticket is still open to address a few minor enhancements).
Here’s how it works:
- When an AM shuts down, the ResourceManager (RM) no longer kills the running containers that belong to the current application-attempt. Instead, it keeps them running.
- The new AM is notified of the previous AM’s container locations when it boots up and re-registers with the RM.
- Containers managed by the previous AM (and any clients that were talking to it) then communicate with the RM to learn the location of the new AM, completing the transfer of the containers' running state to the newly created AM (a client-side sketch follows this list).
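As an illustration of the client side of that hand-off, a client could ask the RM for a fresh ApplicationReport and reconnect to whichever AM attempt is currently registered. The class and method below are hypothetical; the YarnClient and ApplicationReport calls are standard YARN client APIs.

import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class AmLocator {
  // Ask the RM for the current application report and return the address of
  // whatever AM attempt is registered right now (the restarted one, after a crash).
  public static String locateCurrentAm(YarnClient yarnClient, ApplicationId appId)
      throws YarnException, IOException {
    ApplicationReport report = yarnClient.getApplicationReport(appId);
    return report.getHost() + ":" + report.getRpcPort();
  }
}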
In this solution, note that all the non-running containers (such as any reserved containers and acquired-but-not-yet-launched containers) are still killed by the RM, and any outstanding container requests are cleared as well. While this may sound like a limitation, it is a necessary result of an underlying focus on scalability. Even today, the RM does not persistently record information about all containers; it simply learns the status of running containers from the NodeManagers' (NMs) registration. In the event of an RM crash and reboot, information about those not-yet-running containers cannot be retrieved unless the RM explicitly persisted the state of each and every container in the system, which is a design anti-goal for RM scalability.
Control flow for container-preserving AM restart
Let’s dig a little deeper into the mechanism for notifying the newly-started AM about the previous running containers. This happens through the ApplicationMasterService registration call.
When a new AM is re-launched, it first re-registers with the RM. The RM's registration response includes a list of the containers that were running before the previous AM went down. From this list, the new AM learns the locations of the previously running containers and can map them back to framework-level tasks such as HBase RegionServers, Storm Supervisors, and so on.
In a secure environment, the AM also requires the relevant NMTokens in order to communicate with the corresponding NMs. In this case, the RM also re-creates the relevant NMTokens and sends them across to the AM via the registration call along with the list of previous running containers.
Note that the containers themselves are still not notified of the new AM location. This is left as an application-level detail.
How to use container-preserving AM restart
To use this feature, application writers need to be aware of a few APIs:
- Enable work-preserving AM restart for a particular application
API: ApplicationSubmissionContext#setKeepContainersAcrossApplicationAttempts(keepContainers)
This API sets a flag indicating whether YARN should keep containers across application attempts. If the flag is true, running containers will not be killed when an AM crashes and restarts; these containers are retrieved by the new application attempt upon AM registration via the API ApplicationMasterProtocol#registerApplicationMaster(RegisterApplicationMasterRequest). The three APIs in this list are combined in the sketch that follows the list.
- Get the list of running containers from previous application attempts
API: List<Container> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts()
- Get the list of necessary NMTokens from previous application attempts
API: List<NMToken> RegisterApplicationMasterResponse#getNMTokensFromPreviousAttempts()
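Below is a minimal sketch combining the three APIs above. It uses the AMRMClient convenience wrapper rather than the raw ApplicationMasterProtocol call; the class name, host name, and port are placeholders.

import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.NMToken;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerPreservingAm {

  // Client side: opt in to container-preserving AM restart at submission time.
  static void enableKeepContainers(ApplicationSubmissionContext appContext) {
    appContext.setKeepContainersAcrossApplicationAttempts(true);
  }

  // AM side: on (re-)registration, recover the containers left running by the
  // previous attempt plus the NMTokens needed to talk to their NodeManagers.
  static void registerAndRecover() throws Exception {
    AMRMClient<ContainerRequest> amRmClient = AMRMClient.createAMRMClient();
    amRmClient.init(new YarnConfiguration());
    amRmClient.start();

    RegisterApplicationMasterResponse response =
        amRmClient.registerApplicationMaster("am-host", 0, "");

    List<Container> previousContainers = response.getContainersFromPreviousAttempts();
    List<NMToken> previousNmTokens = response.getNMTokensFromPreviousAttempts();

    // Re-map previousContainers onto framework-level tasks (RegionServers,
    // Supervisors, ...) instead of requesting fresh containers for them.
  }
}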
About Container IDs
YARN container IDs have always been derived from the ID of the application attempt that originally allocated the container. Even though these transferred containers are now managed by the newly-created AM (with a new/different Application Attempt ID), the container ID is still tied to the ID of the application attempt that originated the container.
2. Addressing limited AM retries
The limited number of AM restarts makes sense for short-lived applications like MapReduce jobs, which typically run to completion within hours or days. If the AMs of short-lived applications keep failing for whatever reason, it is better to fail the application after some number of attempts instead of retrying it forever. However, that logic just doesn't work for long-running services, especially those designed to run forever.
Sometimes an AM failure is caused by the restart (during an upgrade) or decommissioning of the NM on which the AM container is running. In cases like these, the AM ideally should not be penalized, but it is, because each of those conditions still counts as an AM failure in the eyes of the ResourceManager.
Because the retry count for the AMs (maxAppAttempts) is pre-configured and does not reset after application submission, the cumulative number of attempts will eventually cross the restart threshold, given enough time. Once that happens, the RM will mark the application as failed and shut it down, which is a bad end-state for long-running services.
YARN in HDP 2.2 addresses these limitations through two related efforts:
- Isolation of real AM failures from failures induced by platform-level events.
- Resetting the AM retry count based on a policy.
Isolation of real AM failures
The idea here is to stop counting events that are not really the AM's fault against its retry limit, by separating true AM failures from failures induced by hardware or YARN-level issues. Events like NM restarts/failures, AM container preemption, etc., are no longer counted towards the limited number of AM retries. This work is tracked by the Apache JIRA YARN-614.
When an AM fails, the RM will check the exit status for the AM container. If the exit status matches any of the following statuses, the AM failure count will not be increased:
- ContainerExitStatus.PREEMPTED: the AM container itself was preempted.
- ContainerExitStatus.ABORTED: the RM aborted the allocation of the AM container, for example because of a DNS hiccup.
- ContainerExitStatus.DISKS_FAILED: the NM running the AM container failed the container because of disk failures on that node.
- ContainerExitStatus.KILLED_BY_RESOURCEMANAGER: the NM running the AM container had to resync with the RM because of state drift, so the RM killed all of its containers.
Because attempts that end with the above exit statuses are not counted against the AM retry limit, application resiliency is significantly improved.
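The check above happens inside the RM, but the same exit-status constants are visible to applications for their own containers. Purely as an illustration (this is not part of the RM logic described above), an AM could apply a similar test to decide whether the loss of a worker container should count against its own retry budget:

import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class ExitStatusCheck {
  // Returns true when a completed container was lost to a platform-level event
  // (preemption, NM disk failure, RM-triggered kill) rather than to an
  // application-level failure.
  static boolean isPlatformInduced(ContainerStatus status) {
    switch (status.getExitStatus()) {
      case ContainerExitStatus.PREEMPTED:
      case ContainerExitStatus.ABORTED:
      case ContainerExitStatus.DISKS_FAILED:
      case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
        return true;
      default:
        return false;
    }
  }
}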
AM retry windows
Another way to improve fault tolerance is a policy for resetting the AM retry count. We have introduced a new user-specifiable parameter, attempt_failures_validity_interval, which defines an AM-retry window: the RM counts only the AM failures that happened during the last time window of size attempt_failures_validity_interval. Only if the failure count reaches the configured maxAppAttempts limit within that window will the application be failed; otherwise, a new attempt will be launched. This parameter accepts values in milliseconds and can be set in the ApplicationSubmissionContext during application submission. The related work is tracked at YARN-611.
Every application can have its own attempt_failures_validity_interval, and the RM will use it to determine if a new AM needs to be launched whenever an AM crashes.
How to use AM retry windows
To use this feature, application writers need to specify attempt_failures_validity_interval in their application-submission code using the
API: ApplicationSubmissionContext#setAttemptFailuresValidityInterval(attemptFailuresValidityInterval).
The default value for this parameter is -1, which preserves the previous behavior. When attemptFailuresValidityInterval is set to a value greater than 0 milliseconds, the RM applies the sliding failure-count window described above.
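Here is a minimal sketch, assuming the submission context is already in hand; the 24-hour window is only an example value.

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public class RetryWindowExample {
  static void configureRetryWindow(ApplicationSubmissionContext appContext) {
    // Count AM failures only within a sliding 24-hour window (value in milliseconds).
    long windowMs = TimeUnit.HOURS.toMillis(24);
    appContext.setAttemptFailuresValidityInterval(windowMs);
  }
}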
Conclusion
This blog post discussed some of the key YARN efforts delivered in HDP 2.2 that enable reliable, fault-tolerant management of long-running services. We hope these features make it easier to bring your service workloads to YARN clusters.