On October 22, 2012, Amazon Web Services experienced a significant event in its US-East region. The net result was that the websites of many companies were down or otherwise degraded.
Even though I work at New Relic, I run a couple of personal websites that were affected by the event. And New Relic helped me quickly figure out what was going on and what I could do about it.
It All Started with a Page
My first indication that there was a problem came from a page sent to my iPhone. At 10:46 am, I got the following message:
“ALRT #212 on New Relic: Alert on nimbus secure opened”
The page came from PagerDuty (another great support tool), which I have set up to receive event notifications from New Relic. Within minutes of the start of the event, I was engaged.
What Was Going On?
So, what was up? I logged into New Relic and immediately found that my applications were in distress:
Neither application was serving traffic. I’d already received an alert on one of them, and the second was about to follow. Obviously, something was very, very wrong.
I clicked on the first application. This confirmed no traffic was going through. My Apdex score had dropped to zero:
Given this, my next step was to look at my application servers:
I noticed two things here. First, half of my servers were being impacted, all within a single AWS availability zone. That number went up as the event went on and spread. Second, my servers were reporting that disk I/O utilization had gone to 100%. I clicked on ip-10-1-0-101, the primary web server for my distressed application, and got the following charts:
From these, I saw that processes were backing up waiting on disk I/O. Looking further, I saw this:
I saw that I was no longer sending data over the network to my disks (in AWS, disks are network-attached volumes on remote servers, called EBS volumes). Clearly, Amazon’s EBS service was having problems. This is a common cause of outages on AWS, so it wasn’t much of a surprise to me. I went to the Amazon status page and confirmed they were aware of the problem and working on a resolution.
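The disk-saturation signal New Relic was charting can also be spot-checked directly on a Linux server. As a minimal sketch, the script below samples `/proc/diskstats` twice and computes the share of the interval each device spent busy with I/O, the same figure iostat reports as %util (Linux-only; field layout per the kernel’s iostats documentation):

```python
#!/usr/bin/env python
# Minimal sketch: estimate per-device disk utilization from /proc/diskstats.
# A saturated EBS volume shows up here at (or pinned near) 100% busy.
import time

def parse_diskstats(text):
    """Map device name -> io_ticks (ms spent doing I/O since boot)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue
        # fields: major minor name ... ; fields[12] is "ms spent doing I/O"
        stats[fields[2]] = int(fields[12])
    return stats

def utilization(before, after, interval_ms):
    """Percent of the interval each device spent busy with I/O."""
    return {dev: 100.0 * (after[dev] - before[dev]) / interval_ms
            for dev in after if dev in before}

if __name__ == "__main__":
    with open("/proc/diskstats") as f:
        first = parse_diskstats(f.read())
    time.sleep(1)
    with open("/proc/diskstats") as f:
        second = parse_diskstats(f.read())
    for dev, util in sorted(utilization(first, second, 1000).items()):
        print("%-10s %5.1f%% busy" % (dev, util))
```

This is the command-line analogue of the chart above: a device whose io_ticks counter advances by the full sampling interval was doing I/O the entire time.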
Now What …
Within a few minutes of the problem starting, and with just a couple of minutes spent reviewing existing New Relic charts, I was able to determine exactly what the problem was, which servers were impacted, and in which availability zone. (Amazon never tells you which availability zones are affected, just whether it’s one or more.)
As a result, I knew exactly what my options were. For example, I could have deployed new servers from backup images into other AWS availability zones (avoiding the problematic one), or routed traffic to servers I had set up at other backup service providers.
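The first of those options boils down to a simple planning step: exclude the impaired zone, pick a healthy one, and build the launch request. A minimal sketch, where the AMI ID and instance type are hypothetical placeholders and the resulting dict is what you would hand to an EC2 launch call (e.g. boto3’s `run_instances`, along with MinCount/MaxCount):

```python
# Sketch: plan a replacement instance in a healthy availability zone,
# avoiding the zone that monitoring showed as impaired.

def plan_replacement(all_zones, impaired_zones, ami_id, instance_type):
    """Return EC2 launch parameters placed in the first healthy zone."""
    healthy = [z for z in all_zones if z not in impaired_zones]
    if not healthy:
        raise RuntimeError("no healthy availability zone to fail over to")
    return {
        "ImageId": ami_id,              # backup image of the distressed server
        "InstanceType": instance_type,
        "Placement": {"AvailabilityZone": healthy[0]},
    }

if __name__ == "__main__":
    plan = plan_replacement(
        ["us-east-1a", "us-east-1b", "us-east-1c"],
        ["us-east-1a"],                 # zone identified as impacted above
        "ami-00000000",                 # hypothetical backup AMI ID
        "m1.small",
    )
    print(plan["Placement"]["AvailabilityZone"])
```

The point is less the code than the input: knowing *which* zone to exclude is exactly the information the monitoring data supplied.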
In this particular case, the problem came from a large-scale outage on AWS, and it was difficult to self-recover since so much was impacted. However, this sort of problem occurs on a much smaller scale at other times with AWS and similar providers. Disk volumes go bad, servers go down, I/O gets backlogged, etc. The point is that I was able to diagnose the problem in just a couple of minutes using New Relic, and could quickly switch my focus to helping my customers and working around the issue. With only the tools and data that AWS (or other service providers) give me, it would have been a significantly harder and more time-consuming problem to solve.
Have you run into a similar situation? Let us know in the comments below.