Bringing down an entire application is easy. All it takes is the failure of a single service and the entire set of services that make up the application can come crashing down like a house of cards. Just one minor error from a non-critical service can be disastrous to the entire application.
There are, of course, many ways to prevent dependent services from failing. However, adding extra resiliency in non-critical services also adds complexity and cost, and sometimes it is not needed.
Looking at the figure below, what happens if Service D is not critical to the running of Service A? Why should Service A fail simply because Service D has failed? Why should Service D have a high resiliency if highly critical Service A can survive without it?
How do you know when a service dependency link is critical and when it isn’t? Service tiers are one way to help manage this.
What are service tiers?
A service tier is simply a label associated with a service that indicates how critical a service is to the operation of your business. Service tiers let you distinguish between services that are mission-critical, and those that are useful and helpful but not essential.
By comparing service tier levels of dependent services, you can determine which service dependencies are your most sensitive and which are less important.
Assigning service tier labels to services
All services in your system—no matter how big or how small—should be assigned a service tier. The following sections outline an example scale that I use. You can use it as is or adjust it to suit your particular business needs.
Tier-1 services are the most critical services in your system. A service is considered Tier 1 if a failure of that service will result in a significant impact to customers or to the company’s bottom line.
The following are some examples of Tier-1 services:
- Login service. A service that lets users log in to your system.
- Credit card processor. A service that handles customer payments.
- Permission service. A service that tells you what features a given user may have access to.
- Order-accepting service. A service that lets customers purchase a product on your website.
A Tier-1 service failure is a serious concern to your company.
A Tier-2 service is one that is important to your business but less critical than a Tier-1 service. A failure in a Tier-2 service can noticeably and meaningfully degrade the customer experience but does not completely prevent your customer from interacting with your system.
Tier-2 services are also services that affect your backend business processes in significant ways, but might not be directly noticeable to your customers. The following are some examples of Tier-2 services:
- Search service. A service that provides a search function on your website.
- Order fulfillment service. A service that makes it possible for your warehouse to process an order for shipment to a customer.
A failure of a Tier-2 service will have a negative customer impact but does not represent a complete system failure.
A Tier-3 service is one that can have minor, unnoticeable or difficult-to-notice customer impact, or have limited effects on your business and systems.
The following are some examples of Tier-3 services:
- Customer-icon service. A service that displays a customer icon or avatar on a website page.
- Recommendations service. A service that displays alternate products a customer may be interested in based on what they are currently viewing.
- Message of the day service. A service that displays alerts or messages to customers at the top of the webpage.
Customers may or may not even notice that a Tier-3 service is failing.
A Tier-4 service is a service that, when it fails, causes no significant effect on the customer experience and does not significantly affect the customer’s business or finances.
The following are some examples of Tier-4 services:
- Sales report generator service. A service that generates a weekly sales report. Although the sales report is important, a short-term failure of the generator service will not have a significant impact.
- Marketing email sending service. A service that generates emails sent regularly to your customers. If this service is down for a period of time, email generation might be delayed, but that will typically not significantly affect you or your customers.
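The example scale above can be captured as a simple label attached to each service. The following is a minimal sketch, not a prescribed implementation: the names `Tier`, `Service`, and `is_more_critical` are hypothetical, and the two sample services are taken from the examples above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Service tier label; a lower number means a more critical service."""
    TIER_1 = 1
    TIER_2 = 2
    TIER_3 = 3
    TIER_4 = 4


@dataclass(frozen=True)
class Service:
    """A service and its tier label (hypothetical structure for illustration)."""
    name: str
    tier: Tier


def is_more_critical(a: Service, b: Service) -> bool:
    """Return True if service a is more critical than service b."""
    return a.tier < b.tier


login = Service("login", Tier.TIER_1)
customer_icon = Service("customer-icon", Tier.TIER_3)

print(is_more_critical(login, customer_icon))   # True
print(is_more_critical(customer_icon, login))   # False
```

Using an integer-backed enum keeps the comparison between two tiers a plain numeric comparison.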
How to use service tiers
Service tiers affect two aspects of your system: the required responsiveness to problems and the dependencies between services.
The service tier of a service determines how quickly a problem with that service should be addressed. Of course, the more severe a problem is, the faster it should be addressed. In general, though, the lower the service tier number, the more important the problem is likely to be and the faster it should be addressed. A low-to-medium severity Tier-1 problem is likely more important and impactful than a high severity Tier-4 problem.
This difference in responsiveness between higher-importance services (lower service tier numbers) and lower-importance services affects your dependency map between services and the assumptions you can make about your service dependencies.
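One way to picture this is a triage queue ordered by tier first and severity second. This is only a sketch under assumptions: the `Problem` record, the 1-to-3 severity scale, and the strict "tier before severity" ordering are all hypothetical simplifications of the general rule described here, not a fixed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """An open problem on a service (hypothetical record for illustration)."""
    service: str
    tier: int      # 1 = most critical, 4 = least critical
    severity: int  # assumed scale: 1 = low, 2 = medium, 3 = high


def triage_order(problems):
    """Sort by tier number (lowest first), then by severity (highest first)."""
    return sorted(problems, key=lambda p: (p.tier, -p.severity))


problems = [
    Problem("marketing-email", tier=4, severity=3),
    Problem("login", tier=1, severity=1),
    Problem("search", tier=2, severity=2),
]

for p in triage_order(problems):
    print(p.service)
# Prints login, then search, then marketing-email.
```

Here the low severity login problem is handled before the high severity marketing email problem because login is a Tier-1 service.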
If a Tier-4 (low priority) service calls a Tier-1 (high priority) service, it is probably safe for the Tier-4 service to assume that the Tier-1 service will always respond. If for some reason it does not respond, it is typically acceptable for the Tier-4 service to simply fail as well. After all, if a Tier-1 service for your application is down, significant effort will immediately go into resolving that problem, and the fact that a Tier-4 service is also down will be of little consequence. Consider the case where your web application is down because users cannot log in (a Tier-1 service problem). How concerning will it be that the day’s marketing emails might be delayed a bit (a Tier-4 service problem)?
But the reverse is not true. If a Tier-1 service depends on a Tier-4 service, that Tier-1 service must have contingency and failover recovery plans for when the Tier-4 service is down. After all, you don’t want a Tier-1 service to fail simply because a much lower priority Tier-4 service is not functioning. For example, you do not want your web application to fail simply because it cannot display the customer’s avatar in the corner of every page. Instead, you want it to recover gracefully by not displaying the avatar while otherwise continuing to work normally.
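The avatar case might look something like the sketch below. Everything here is hypothetical: `fetch_avatar_url` stands in for a call to the lower-tier avatar service (and is hard-coded to fail for the demonstration), and `render_page` stands in for the critical page-rendering path.

```python
class AvatarServiceError(Exception):
    """Raised when the (hypothetical) avatar service cannot be reached."""


def fetch_avatar_url(user_id: str) -> str:
    """Stand-in for a call to the avatar service; here it always fails."""
    raise AvatarServiceError("avatar service unavailable")


def render_page(user_id: str, fetch=fetch_avatar_url) -> dict:
    """Build the page for a user, omitting the avatar if it cannot be fetched."""
    try:
        avatar_url = fetch(user_id)
    except Exception:
        # The avatar is not essential: any failure of the lower-tier
        # dependency is absorbed here so the page still renders.
        avatar_url = None
    return {"user_id": user_id, "avatar_url": avatar_url, "status": "ok"}


page = render_page("user-123")
print(page["status"], page["avatar_url"])  # ok None
```

The key design choice is that the failure is contained at the call site, so the higher-priority caller never propagates the lower-priority failure.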
Take a look at the figure below. In this figure, we assigned service tiers to each service. Given the rules described above, note that we need additional resiliency added between Service A and Service D because Service A is a higher priority service (Tier 1) than is Service D (Tier 3). Therefore, Service A needs to protect itself from Service D failures, given Service D is lower priority.
Now look at Service B. Service B also depends on Service D, but in this case, according to our rules above, Service B does not need the additional resiliency between it and Service D. This is because Service B is a lower priority service (Tier 4) than Service D (Tier 3). So, it’s more acceptable for Service B to suffer an outage at a time when Service D is unavailable. Service D, in this example, is more important.
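The rule applied to Services A, B, and D can be sketched as a small check over the dependency map. The tier values and the two dependency links come from the description above; the dictionary layout and the function name `edges_needing_resiliency` are assumptions made for illustration.

```python
# Tier assignments described in the text (1 = most critical).
tiers = {"A": 1, "B": 4, "D": 3}

# (caller, callee) pairs: both A and B depend on D.
dependencies = [("A", "D"), ("B", "D")]


def edges_needing_resiliency(tiers, dependencies):
    """Return links where the caller is more critical than the service it calls."""
    return [
        (caller, callee)
        for caller, callee in dependencies
        if tiers[caller] < tiers[callee]
    ]


print(edges_needing_resiliency(tiers, dependencies))  # [('A', 'D')]
```

Only the A-to-D link is flagged, because Service A (Tier 1) is more critical than Service D (Tier 3), while Service B (Tier 4) is less critical than Service D.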
By careful analysis of your services and proper tier assignments, you can determine where to focus your development, testing, and resiliency efforts for inter-service dependencies, prioritizing the most critical and most vulnerable interfaces first, without over-investing in less-critical interfaces.
Service tiers are labels
Service tiers simply provide a “labeling” system that gives you information on the importance of every service in your system. You can use that label to determine problem escalation policies, procedures, and prioritizations.
But you can also use that label to determine the amount and type of backoff and recovery necessary when one service cannot make a call to a dependent service. What you do and how you respond depends on whether you are calling a higher- or lower-tier service.
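As a rough sketch of that idea, a call wrapper could choose its behavior by comparing the two tier labels. The function `call_dependency`, its parameters, and the retry and delay values are all hypothetical. Treating equal tiers the same as calling a more critical service is also an assumption, since that case is not covered here.

```python
import time


def call_dependency(call, caller_tier, callee_tier, fallback=None,
                    retries=2, base_delay=0.1, sleep=time.sleep):
    """Call a dependency, choosing failure handling from the two tiers.

    Tier numbers follow the article's scale: 1 is most critical.
    """
    if caller_tier >= callee_tier:
        # The dependency is at least as critical as the caller:
        # no extra protection, so any failure propagates to the caller.
        return call()

    # The caller is more critical than its dependency: retry with
    # exponential backoff, then fall back instead of failing.
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    return fallback


def broken_recommendations():
    """Stand-in for a failing lower-tier service call."""
    raise RuntimeError("recommendations service is down")


# A Tier-1 caller survives the failure of a Tier-3 dependency.
result = call_dependency(broken_recommendations, caller_tier=1,
                         callee_tier=3, fallback=[], base_delay=0.01)
print(result)  # []
```

If the same failing call were made with `caller_tier=4` and `callee_tier=1`, the `RuntimeError` would simply propagate, which matches the idea that a low-tier service may fail when a high-tier dependency is down.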