The New Relic lightning talks at AWS re:Invent 2018 gave attendees a bevy of best practices for making monitoring a critical part of their AWS cloud journeys. But perfecting a truly dialed-in, impactful monitoring strategy can be tricky, especially when you’re dealing with performance issues at crunch time.
Expert advice on success with New Relic in the AWS cloud
To help smooth the path, Matt van Zanten, a DevOps engineer at AWS consulting partner and systems integrator Onica, was on hand at one of New Relic’s two re:Invent 2018 booths to deliver a lightning talk: “AWS Best Practices From the Largest Pure AWS SI in North America.” Matt, a veteran consultant who works with e-commerce companies to support their cloud-native applications, told attendees that he and his team use out-of-the-box metrics gathered by Amazon CloudWatch to monitor system performance from an outside-in view, alerting on key metrics like CPU, latency, and HTTP 5xx errors. These metrics, Matt said, helped his team track system health and react accordingly.
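That outside-in approach can be sketched as a small set of CloudWatch alarm definitions. The thresholds, alarm names, and load-balancer namespace below are hypothetical, and the dicts simply mirror the keyword arguments that boto3’s `put_metric_alarm()` accepts; the actual AWS call is left out so the sketch runs without credentials:

```python
# Sketch of outside-in alarms on the three key metrics Matt named: CPU,
# latency, and HTTP 5xx errors. Thresholds are hypothetical placeholders.

def make_alarm(name, namespace, metric, threshold, comparison, periods=3):
    """Build a CloudWatch alarm definition (not yet sent to AWS)."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Average",
        "Period": 60,                  # evaluate one-minute datapoints
        "EvaluationPeriods": periods,  # must breach N periods in a row
        "Threshold": threshold,
        "ComparisonOperator": comparison,
    }

ALARMS = [
    make_alarm("api-high-cpu", "AWS/EC2", "CPUUtilization",
               80.0, "GreaterThanThreshold"),
    make_alarm("api-high-latency", "AWS/ApplicationELB", "TargetResponseTime",
               1.0, "GreaterThanThreshold"),
    make_alarm("api-5xx-errors", "AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count",
               10.0, "GreaterThanOrEqualToThreshold"),
]

# With credentials configured, each definition could be sent with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
for alarm in ALARMS:
    print(alarm["AlarmName"], alarm["MetricName"])
```

Requiring several consecutive breaching periods (`EvaluationPeriods`) is one common way to keep a momentary blip from paging the on-call engineer.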
But alerts alone, Matt added, don’t tell you why an application is suddenly performing poorly. This is where New Relic truly shines—and Matt delved into three critical best practices for operating natively in the AWS cloud with New Relic at your side:
- Detect and solve anomalies with real-time application data.
- Uncover transactions and queries that require tuning.
- Improve monitoring and alerting from within your applications.
Detecting and resolving anomalies with real-time application data
Matt opened his discussion with an anecdote: While working on a consulting project, he discovered that his client’s Java application couldn’t handle more than 2,000 concurrent user requests. Past that point, the app’s database tables deadlocked and the app crashed.
Matt used New Relic APM to instrument and monitor the troublesome Java app in real time, and found a more nuanced issue: evidence of increased database transaction latency on top of the application’s (expected) latency.
Once he saw the database transaction latency, Matt attacked the problem by asking two key questions: First, which queries were requested most frequently? Second, which were the slowest to complete? The answers to these questions, Matt said, helped him determine which queries—and which portions of the app’s underlying code—would need to be optimized to fully resolve the latency issues.
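Matt’s two questions amount to grouping query traces by query text and ranking by call count and average duration. A minimal sketch over a hypothetical query log (the query strings and timings are made up, standing in for what an APM tool would report):

```python
# Answer Matt's two questions against a sample log of completed database
# queries: which query runs most often, and which is slowest on average?
from collections import defaultdict

# (query, duration_ms) pairs -- hypothetical sample data
QUERY_LOG = [
    ("SELECT * FROM orders WHERE user_id = ?", 12.0),
    ("SELECT * FROM orders WHERE user_id = ?", 15.0),
    ("SELECT * FROM orders WHERE user_id = ?", 11.0),
    ("UPDATE inventory SET qty = qty - 1 WHERE sku = ?", 480.0),
    ("UPDATE inventory SET qty = qty - 1 WHERE sku = ?", 520.0),
    ("SELECT name FROM products WHERE sku = ?", 3.0),
]

def query_stats(log):
    """Group by query text; return {query: (call_count, avg_duration_ms)}."""
    grouped = defaultdict(list)
    for query, ms in log:
        grouped[query].append(ms)
    return {q: (len(ds), sum(ds) / len(ds)) for q, ds in grouped.items()}

stats = query_stats(QUERY_LOG)
most_frequent = max(stats, key=lambda q: stats[q][0])  # question 1
slowest = max(stats, key=lambda q: stats[q][1])        # question 2
print(most_frequent)
print(slowest)
```

Note that the two rankings often disagree, as here: the most frequent query is a fast read, while the slowest is an infrequent write, and each points at a different optimization target in the underlying code.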
Uncovering transactions and queries that require tuning
Matt used another story involving a client to illustrate the second of his AWS monitoring best practices. In this case, Matt and his team were finishing testing and maintenance work to prepare a client for a big Black Friday and Cyber Monday push. Just as the team was wrapping up its work, however, the system threw them a curveball: previously stable CPU metrics suddenly began jumping around and increasing. This was a big problem, since backend API CPU usage directly correlated to the app’s overall latency—and app latency, of course, can have huge revenue-killing potential during the busy holiday shopping season.
After analyzing the relevant CPU and latency metrics, Matt said, the team also looked at request counts to see if increased traffic was causing the problem. CloudWatch’s RequestCount metric typically goes up during the day and down overnight, Matt said, but since that metric showed no changes, he quickly eliminated increased traffic as a reason for the change in CPU utilization.
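Because RequestCount follows a daily cycle, the natural comparison is against the same hours on a previous day rather than against a flat threshold. A minimal sketch of that check, with hypothetical hourly counts:

```python
# Rule traffic in or out as a cause: compare today's hourly request counts
# against the same hours from a baseline day. Sample numbers are hypothetical.

def traffic_changed(today, baseline, tolerance=0.20):
    """True if today's counts deviate from the baseline by more than
    `tolerance` (as a fraction) on average across the window."""
    ratios = [abs(t - b) / b for t, b in zip(today, baseline)]
    return sum(ratios) / len(ratios) > tolerance

baseline_counts = [1200, 1350, 1500, 1480, 1300]  # same hours, previous day
today_counts    = [1180, 1400, 1460, 1510, 1290]  # roughly unchanged

# Traffic looks flat, so the rising CPU must have another cause.
print(traffic_changed(today_counts, baseline_counts))  # prints False
```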
After a quick examination of the New Relic metrics, however, Matt’s team found their CPU utilization culprit: Java garbage-collection processes running rampant on the API cluster.
Since the client’s product dataset had grown dramatically during the previous few weeks, Matt surmised that the API servers could no longer cache the entire dataset in memory. That, in turn, caused the Java app to run its garbage-collection processes continuously to free up memory and make room for other data.
With Black Friday and Cyber Monday fast approaching, Matt said, the team moved quickly to provision a larger Amazon EC2 instance class, giving the containers enough memory to cache the application’s entire product dataset. The team updated its AWS CloudFormation template, tested it, and rolled the changes out to production.
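The sizing decision itself can be sketched as simple arithmetic: find the smallest instance class whose memory holds the full dataset plus JVM headroom. The instance memory figures below match the published r5 family sizes, but the dataset sizes and the 1.5x headroom factor are assumptions for illustration, not what Matt’s team actually used:

```python
# Sketch of the sizing decision: pick the smallest EC2 instance class whose
# memory can hold the whole product dataset plus JVM/GC headroom.

INSTANCE_MEMORY_GIB = {   # ordered smallest to largest (r5 family sizes)
    "r5.large":    16,
    "r5.xlarge":   32,
    "r5.2xlarge":  64,
    "r5.4xlarge": 128,
}

def pick_instance(dataset_gib, headroom_factor=1.5):
    """Return the first instance class with room for dataset * headroom."""
    needed = dataset_gib * headroom_factor
    for name, mem in INSTANCE_MEMORY_GIB.items():
        if mem >= needed:
            return name
    raise ValueError("dataset too large for any listed instance class")

# A dataset that grows from 8 GiB to 30 GiB no longer fits the small class:
print(pick_instance(8))    # r5.large
print(pick_instance(30))   # needs 45 GiB of room -> r5.2xlarge
```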
After the production rollout, the team went back into New Relic to check on the app’s garbage-collection performance. The results, Matt said, proved that the team had solved the problem: The application was once again caching its full product dataset, its garbage-collection processes were no longer running rampant, and CPU utilization was back to normal.
Improving monitoring and alerting from within an application
Finally, Matt noted, once a team uses New Relic monitoring data to solve a problem, it’s important to continue monitoring the relevant metrics—and to alert the team if the pattern resurfaces.
Matt cited an example involving irregular latency issues with some transactions on a client’s server. His team used New Relic APM to monitor and analyze the transaction data, and identified a third-party API as the source of the latency issues. The team, Matt said, could then implement code changes to fix the problem. Just as important, the team now knew that it would have to be more vigilant about similar issues in the future. The team addressed the issue by setting an alert that triggers whenever the system shows a consistent and significant change in service latency—combining proactive protection with efficient use of the team’s resources.
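The key property of that alert is requiring a change that is both significant (well above baseline) and consistent (sustained across consecutive samples), so a single outlier doesn’t page anyone. A minimal sketch of such a condition, with hypothetical thresholds and latency samples:

```python
# Alert only on a consistent, significant latency change versus baseline,
# not on a single spike. Thresholds and sample data are hypothetical.
from statistics import mean

def latency_alert(samples, baseline_ms, pct=0.5, consecutive=3):
    """True if the last `consecutive` samples all exceed baseline by `pct`."""
    threshold = baseline_ms * (1 + pct)
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(s > threshold for s in recent)

baseline = mean([110, 95, 102, 99, 105])  # ~102 ms of normal service latency

print(latency_alert([100, 98, 310, 104, 101], baseline))  # single spike: False
print(latency_alert([100, 240, 260, 255], baseline))      # sustained: True
```

Hosted monitoring tools typically express the same idea declaratively (a baseline or threshold condition over a sliding window); the sketch just makes the logic explicit.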
New Relic monitoring—a faster path to success with AWS
These stories illustrate how critical it can be to protect an AWS cloud investment with monitoring tools that can quickly get your team to the root cause of a performance issue and suggest a practical solution. This is why so many teams turn to New Relic to instrument and monitor their AWS cloud-native apps. The results, in terms of customer experience and revenue impact, speak for themselves.