One recent Saturday I had my plans with a friend go up in flames due to the realities of how our favorite buzzwords like ‘cloud’ and ‘DevOps’ play out in the wild. I was peeved, so when I came home that night I wrote up a description of what happened. The following is that post, after removing what I now realize was an almost awe-inspiring number of curse words.
Today I had plans to meet a friend who works for a Fortune 500 software vendor. We were going to spend the day together doing things that make us sound super awesome, like shopping at Fry’s and playing tons of Overwatch while drinking low-quality beer (my wife was out of town).
Then, this morning I received the following texts:
We eventually ended up meeting around 6 p.m. at a local bar to partake of some of the most delectable queso dip the gods have ever bestowed on humankind. We finished eating around 7:10 p.m., and promptly went stereotypical Millennial by pulling out our phones to check work emails and try to boost our social-media egos by posting low-production-value food pics.
Unfortunately, my friend saw there was yet another incident that required his immediate attention. He then dragged out his embarrassingly large laptop directly at our table and began to troubleshoot the issue on the spot. Bear in mind, we had already paid the bill at this point, but he was polite enough to let me ask probing questions throughout the process, so I stuck around.
Troubleshooting in the real world
My friend spent the first 10 minutes battling with the VPN required to access his company’s home-baked diagnostic tools. Eventually he got some form of access but was greeted with a NullRefException on login, which prevented him from doing any further debugging. Frustrated, he noted that the tool rarely provided value anyway, as it was only a step up from parsing log files directly.
Next, he re-evaluated the ticket information, went back and forth with some coworkers, and used his personal domain knowledge of the system to deduce what the issue might be (that is, he guessed). Amazingly, a second issue cropped up during this debugging process, so he sent his guess to another coworker and put the first issue on the back burner in order to deal with the new one.
The second issue appeared to result from “some jackass” doing a deploy on a Saturday evening. Since my friend was on call, he—and not said jackass—was responsible for babysitting it.
The company’s on-call rotation puts one developer on call for one week out of roughly every eight. That developer is responsible for around 50 services—only 5 of which the developer typically works in. This meant my friend had to lug a pager and laptop with him 24/7 for a week straight every two months, in order to be able to debug services for which he had no domain knowledge and no common debugging framework. Worse, all the other developers felt no direct accountability for any issues, since it wasn’t their unlucky week. And again, the few tools he had available to aid this process were buggy, were supported with zero or hidden documentation, and ultimately were less valuable than just doing it the old-school way of logging in directly to the machines.
Over the course of the next 90 minutes, my friend restarted his machine twice, logged into three different VPNs, and sifted through countless log files. At that point he finally threw in the towel and canceled the rest of our night’s plans so he could go home and dig into this. When I told him how long he’d been working on the issue, his somber response was “Wow, really? I haven’t made that much progress.”
An all-too-common problem
This situation isn’t uncommon. My friend told me he typically spends his entire on-call work week addressing low-priority issues, and gets an average of five after-hours calls every rotation (the worst was eight in one week). Other friends at other major software companies echo the same sad stories of long nights and lost weekends.
Making things even worse is the thankless nature of the job. As “pager rotations” are an accepted part of the job and our industry’s culture, no one commends people for taking the weekend call, but they certainly do chastise you for missing it.
So, what’s really going on here? More important, what can we do about it and how can New Relic help? These three points can help put things in perspective:
1. Security that gets in the way isn’t doing its job
My friend’s issues were exacerbated by cumbersome security measures. He literally was unable to access a VM because he mistyped his password once! He couldn’t double hop on the VPN, so he was required to continuously keep track of and connect to multiple isolated networks—one by one—to get tiny pieces of a massive puzzle.
This issue persisted for longer than it needed to because antiquated security measures were prioritized above empowering employees to do their job and fix customers’ issues. Congratulations—your data is so locked down your employees can’t access it when needed. You’ve won the “I Suck at Business” award!
2. Company “innovation” initiatives often don’t address the real issues
I keep hearing about the importance of “Digital Transformation,” but many times the real problem isn’t transitioning to digital, it’s dealing with the fallout from your crappy, rushed digital business implementation from years ago. Too often, top-down edicts to move to the cloud or containers or DevOps are merely half-baked promises of a “silver bullet” to solve years of technical debt, rushed implementations, and bad decisions.
My friend complained that his company’s big central pushes to “modernize” felt like lip service. Too often, they lacked quantifiable measures of success (or failure) and ended up being abandoned halfway through in favor of projects that delivered direct, differentiated value. In many cases, that left things in worse condition than before the effort started.
3. Execs may not see the problems
My friend stressed how concerned he was about the quality of his company’s products and how difficult they were to maintain. But he felt that top brass often refused to acknowledge the problem, either out of ignorance or willful disregard.
Sure, in the real world stuff has to get done and sometimes doing it quickly and poorly is the only viable approach. Unfortunately, though, it’s people like my friend who lose their Saturdays because of it. The people on the ground pick up the slack, feel the pain, and desperately need the solution. CTOs and IT directors and other bigwigs need to understand that meeting their numbers ultimately boils down to empowering boots on the ground to do their jobs well.
How New Relic can help
New Relic is already dedicated to solving these exact kinds of problems. I literally (and a bit smugly) showed off New Relic on my phone while my friend was fumbling to debug his issues. “Oh, check it out, here is every exception thrown in production by our main web app. Whoa, our transaction times have gotten a lot faster in the last year! Man, that database query looks complex, good thing it’s performing so well and we don’t have to drill into it.”
If my friend had had New Relic, he would have gotten his Saturday back. His customers would have had a better user experience, faster. His company would likely have had higher revenue and his CTO/IT Director would have had improved metrics to present to shareholders.
New Relic provides value on a lot of levels, but we will never lose sight of the people we have always served so well: the “fingers on keyboards” folks who support the business on their shoulders!