Our good friend Frank Ober of Intel Corp recently had a conversation with some colleagues about the DevOps movement and the advantages to both development and IT Ops teams of using today’s more agile APM tools. We successfully cajoled Frank into sharing his insights with you here. Without further ado…
“Effective Triage of Applications with Today’s APM,” by Frank Ober, Intel Corp.
Fast, effective triage of applications requires visibility into the segments of an application that can cause problems, and this is what modern monitoring can do today. Your monitoring tool should have an application-centric mentality. Until now, though, most such tools have started from the infrastructure up, and have therefore provided minimal value to the most important role inside Intel: the developer. How many developers really depend on large systems management tools today? That’s where application byte code instrumentation comes into play as the core of finding problem areas fast!
Let me provide three brief but meaty examples…
1. Hung Thread. Let’s say a hung thread creates an app bottleneck. That is a very tough problem to find in a log, because the thread never gets to the log writer. But with a graphical stack trace of a production app, you are in business in minutes, and you know this is the right way to go about fixing it because you understand what a thread dump gets you. This modern, “2011” approach is light and agile. It helps you diagnose the root cause and get to the code owner fast, because you understand the namespaces concept and can tell your code from the app server or web server vendor’s code.
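To make the thread dump idea concrete, here is a minimal sketch in Python (chosen only for brevity; a JVM or .NET tool would gather the same data through byte code instrumentation). It walks the live stack of every thread and keeps only the frames that belong to your own namespace. The function name `thread_dump` and the parameter `own_namespace` are hypothetical, not part of any APM product.

```python
import sys
import threading


def thread_dump(own_namespace):
    """Map each live thread name to the stack frames that belong to our code.

    own_namespace is a module name (or package prefix) that we treat as
    "our" code; frames from other modules, such as the standard library
    or a vendor's server code, are left out of the report.
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    report = {}
    # sys._current_frames() returns the topmost frame of every thread.
    for ident, frame in sys._current_frames().items():
        own_frames = []
        while frame is not None:
            module = frame.f_globals.get("__name__", "")
            if module == own_namespace or module.startswith(own_namespace + "."):
                own_frames.append(
                    f"{module}.{frame.f_code.co_name}:{frame.f_lineno}"
                )
            frame = frame.f_back  # walk outward toward the thread's entry point
        report[names.get(ident, str(ident))] = own_frames
    return report
```

A thread that is stuck shows the same innermost frame dump after dump, and the filtered view points at the line in your code that made the blocking call rather than at the vendor frames underneath it.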
2. Slow Memory Leak. You’re not sure what is wrong, only that app performance seriously degrades over time. So without visibility into the app itself, operators cycle the web server once a week just to “make it go away.” But what’s the real cost to the company of the sluggish user response, and the bad mojo this presents? It’s bad for the end user and it’s bad for the brand.
In comes memory analysis that your mother understands. Sure, to see exactly what is bad in the code you still have to look at the memory objects in your namespace, with either a coding toolset or a SQL profiler. But I am talking about efficient triage here, not every last step. A memory profile of the VM sitting behind the web server is far more interesting to look at directly than high-level memory metrics from your MoM (big systems management consoles, or manager of managers in the IT data center).
3. Transactional Health. Nightmare code never opens itself up in the log files, or waves a flag over the cube farm, saying “Hey, excuse me, can you please re-write me, please!!!” Monitoring at the transactional level that works for both developers and operators (DevOps) is worth every dime of your money and lets you move away from mediocre. Let’s say a web transaction is iterating and killing the user experience, but your users have come to expect bad performance from your site, so they never report it. So how do you find it? App-centric monitoring that includes transactional health and drill-down lets you look at pieces of the web page to see which page component is getting you down. If you have a Masthead Component, for example, you can determine that its buffer flushes in 42 milliseconds even under a stress test. Is anyone at your company doing that level of refactoring or improvement? It’s possible and it’s highly recommended.
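The component-level drill-down described above can be sketched as a small timer that records how long each piece of a page takes within a transaction. This is a minimal sketch under stated assumptions: the class `TransactionTimer`, its methods, and the component name `product_list` are hypothetical, and a real APM agent would inject such timing through instrumentation rather than explicit calls in your code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TransactionTimer:
    """Collect per-component timings, in milliseconds, for page renders."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so timings can be simulated
        self.samples = defaultdict(list)  # component name -> list of ms

    @contextmanager
    def component(self, name):
        """Time the enclosed block and record it under the given name."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.samples[name].append(elapsed_ms)

    def slowest(self):
        """Return (component, worst observed ms) pairs, slowest first."""
        worst = {name: max(values) for name, values in self.samples.items()}
        return sorted(worst.items(), key=lambda item: item[1], reverse=True)
```

Wrapping each component’s rendering in `with timer.component("masthead"):` and checking `timer.slowest()` after a stress run shows which component is dragging the transaction down, even when no user ever files a complaint.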
Easy end-to-end triage means fixes at the component level happen so much sooner, because root cause determination is actually possible. Then you make this all proactive through emailed weekly reports; that’s how you stay on top of it. And by the way, the app data is always being collected, so no transactions slip through uncaught. That’s the difference between a production-worthy tool and what we have traditionally called “profilers.”
This is APM today.