Is Your Provider Down? New Relic Can Help!

On October 22, 2012, Amazon Web Services had a significant event in their US-East region. The net result was that the websites of many companies were down or otherwise negatively impacted.

Even though I work at New Relic, I'm also a customer: I run a couple of personal websites that were affected by the event. And New Relic helped me quickly figure out what was going on and what I could do about it.

It All Started with a Page
My first indication that there was a problem came from a page sent to my iPhone. At 10:46 am, I got the following message:

“ALRT #212 on New Relic: Alert on nimbus secure opened”

New Relic Alert page from PagerDuty

The page came from PagerDuty (another great support tool), which I have set up to receive alert notifications from New Relic. Within minutes of the start of the event, I was engaged.
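New Relic handled the alerting hand-off for me, but if you want to wire up something similar by hand, here is a minimal sketch that triggers a PagerDuty incident through its Events API v2. (That API postdates this incident, and the routing key below is a placeholder.)

```python
# Sketch: trigger a PagerDuty incident via the Events API v2.
# The routing key is a placeholder; severity values follow PagerDuty's enum.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def trigger_alert(summary, source, routing_key="YOUR_ROUTING_KEY"):
    payload = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # e.g. "Alert on nimbus secure opened"
            "source": source,      # the reporting app or host
            "severity": "critical",
        },
    }
    resp = requests.post(EVENTS_URL, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()
```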

What Was Going On?
So, what was up? I logged into New Relic and immediately found that my applications were in distress:

Applications alert screenshot

Neither application was serving traffic. I'd already received an alert on one of them, and the second was about to follow. Obviously, something was very, very wrong.

I clicked on the first application. This confirmed no traffic was going through. My Apdex score had dropped to zero:

Apdex Score
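If you haven't worked with Apdex before: it's a standard score that buckets requests as satisfied, tolerating, or frustrated against a response-time threshold T. A minimal sketch of the computation (the threshold here is illustrative, not my app's actual setting):

```python
# Standard Apdex formula: (satisfied + tolerating / 2) / total.
# satisfied: response <= T; tolerating: T < response <= 4T; rest: frustrated.
def apdex(response_times_sec, t=0.5):
    if not response_times_sec:
        return 0.0  # no traffic at all charts as a zero score
    satisfied = sum(1 for r in response_times_sec if r <= t)
    tolerating = sum(1 for r in response_times_sec if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times_sec)

print(apdex([0.2, 0.4, 1.1, 3.0]))  # 0.625: two satisfied, one tolerating
```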

Given this, my next step was to look at my application servers:

Recent Server Events

I noticed two things here. First, half of my servers were impacted, all within a single AWS availability zone; that number grew as the event went on and spread. Second, my servers were reporting that disk I/O had climbed to 100%. I clicked on ip-10-1-0-101, the primary web server for my distressed application, and got the following charts:

Load Average

CPU Usage
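The signature in those two charts is a classic one: load average climbing while actual CPU usage stays low, because processes are blocked on disk rather than burning cycles. A quick way to confirm it from a shell on the box, sketched here with psutil (my assumption; any host agent exposes the same counters, and iowait is Linux-only):

```python
# Sketch: check whether high load is CPU-bound or I/O-bound (Linux only).
import os
import psutil

load1, load5, load15 = os.getloadavg()
cpu = psutil.cpu_times_percent(interval=1)

print(f"load averages: {load1:.1f} {load5:.1f} {load15:.1f}")
print(f"cpu: user={cpu.user}% system={cpu.system}% iowait={cpu.iowait}%")

# Rule of thumb: load far above the core count with low user/system time
# but high iowait means processes are stuck waiting on disk.
```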

From these charts, I saw that processes were piling up waiting on disk I/O, which appeared to be badly backlogged. Looking further, I saw this:

Disk I/O Utilization & Network I/O

I saw that I was no longer sending any data over the network to my disks (in AWS, disks are network-attached storage called EBS volumes). Obviously, Amazon’s EBS service was having problems. This is a common cause of outages on AWS, so it wasn’t much of a surprise to me. I went to the AWS status page and confirmed they were aware of the problem and working on a resolution.

AWS Current Status
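Incidentally, you don’t have to keep refreshing that page: AWS publishes per-service RSS feeds from its status site. Here’s a sketch that polls the EBS feed for US-East (the feed URL pattern is my assumption based on the links on the status page; verify it for your region):

```python
# Sketch: poll an AWS status RSS feed for recent items.
# The feed URL pattern is an assumption; check the status page for yours.
import itertools
import urllib.request
import xml.etree.ElementTree as ET

FEED = "http://status.aws.amazon.com/rss/ebs-us-east-1.rss"

def latest_status_items(limit=3):
    with urllib.request.urlopen(FEED, timeout=5) as resp:
        tree = ET.parse(resp)
    for item in itertools.islice(tree.iter("item"), limit):
        yield item.findtext("pubDate"), item.findtext("title")

for when, title in latest_status_items():
    print(when, "-", title)
```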

Now What …
You can see that within a few minutes of the problem occurring, and with just a couple of minutes spent reviewing existing New Relic charts, I was able to determine exactly what the problem was, exactly which servers were impacted, and which availability zone they were in. (Amazon never tells you which availability zones are impacted, only whether it’s one or more.)
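(How did I know which zone each server was in? New Relic showed it, but you can also ask from the box itself: EC2’s instance metadata service reports the instance’s availability zone. A minimal sketch:)

```python
# Sketch: ask the EC2 instance metadata service which AZ this box is in.
# 169.254.169.254 is the standard metadata address; only works on EC2.
# This is an IMDSv1-style call; newer instances may require an IMDSv2 token.
import urllib.request

METADATA_URL = ("http://169.254.169.254/latest/meta-data/"
                "placement/availability-zone")

def my_availability_zone(timeout=2):
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode()  # e.g. "us-east-1a"

print(my_availability_zone())
```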

As a result, I knew exactly what my options were. For example, I could have launched new servers from backup images in other AWS availability zones (avoiding the problematic one), or routed traffic to standby servers I had set up with other backup service providers.
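As a sketch of that first option using today’s boto3 SDK (in 2012 the tool of choice was boto; the image ID, instance type, and zone below are placeholders):

```python
# Sketch: launch a replacement instance from a backup AMI in a healthy AZ.
# ImageId, InstanceType, and the zone are placeholders for illustration.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # placeholder backup image
    InstanceType="m1.small",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},  # avoid the impacted zone
)
print("launched:", resp["Instances"][0]["InstanceId"])
```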

Summary
In this particular case, the problem was a large-scale AWS outage, and it was difficult to recover on my own because so much was impacted. However, this sort of problem also occurs on a much smaller scale with AWS and similar providers: disk volumes go bad, servers go down, I/O gets backlogged, and so on. The point is that I was able to diagnose the problem in just a couple of minutes using New Relic, and could quickly switch my focus to helping my customers and working around the issue. Using only the tools and data that AWS (or another service provider) gives you, it’s a significantly harder, and more time-consuming, problem to solve.

Have you run into a similar situation? Let us know in the comments below.

Lee Atchison is the Senior Director, Cloud Architecture at New Relic. For the last eight years he has helped design and build a solid service-based product architecture that scaled from startup to high traffic public enterprise. Lee has 32 years of industry experience, including seven years as a Senior Manager at Amazon.com, and has consulted with leading organizations on how to modernize their application architectures and transform their organizations at scale. He is the author of the O’Reilly book Architecting for Scale and author of the blog Lee@Scale.
