Guest author Vasken Hauri is Vice President, Platforms and Systems, for 10up, which creates finely crafted websites and tools for content creators, helping clients like Microsoft, Time, ESPN, and Adobe create a better web experience.
Creating great content for the web is seldom easy, and slow publishing tools can make an already difficult job even more frustrating. That’s where the team at 10up comes in, working to help the often underserved, underrepresented, and overlooked group of content creators who build the websites we all read and enjoy every day.
Case in point: When content creators at a major 10up client, a beverage retailer, complained about slow response from the WordPress content management system (CMS) behind the company’s news and community site, it was the kind of problem that doesn’t always get addressed. The issue didn’t directly affect the company’s main e-commerce platform, or even the frontend of the community site, so most people viewing the content would never know about it. The small group of people adding content to the CMS is often forgotten because they’re not the consumers and they’re not the revenue drivers—they’re just the poor people who are forced to wait two minutes for a page to save.
It’s actually a bigger deal than most people realize. WordPress developers often stuff a ton of things into the admin dashboard and don’t necessarily spend a lot of time thinking about how all the extras will affect the process of saving posts. They think to themselves, “Oh, saving posts is quick. It only takes a minute.” Well, if you’re editing 40 pieces of content a day, and you edit each of them twice, that’s over an hour that you’re sitting there waiting. Multiply that by a 50-person editorial group and you’ve created a huge time-waster.
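The back-of-the-envelope math above checks out in a few lines of Python. The figures come straight from the scenario described, not from measured data:

```python
minutes_per_save = 1            # "it only takes a minute"
saves_per_editor = 40 * 2       # 40 pieces of content, each edited twice
editors = 50                    # size of the editorial group

wait_per_editor = saves_per_editor * minutes_per_save   # 80 minutes: over an hour
team_wait_hours = wait_per_editor * editors / 60        # roughly 66.7 hours per day

print(wait_per_editor, round(team_wait_hours, 1))
```

At these rates, the editorial team as a whole loses well over sixty hours a day just waiting on saves.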
In response, frustrated content creators may start flipping back and forth among multiple tabs, and maybe they forget to make an edit. They start to make mistakes. That makes it harder for our client to credibly tell its story of community and corporate responsibility.
But with the help of New Relic, we were able to pinpoint the issue and solve it in a matter of hours rather than days. It really did make the content creators’ lives better, and it saves the company money: workers no longer spend half an hour a day waiting for articles to save, and can spend that time writing the next one.
Seeing these performance problems get solved makes the content creators happy, turning them from CMS detractors into big advocates. Now they say, “WordPress is great! 10up is great! We want to work with them more.” So it doesn’t just help the client and viewers of the site; it helps our bottom line, too.
How 10up fixed the problem
The WordPress portion of the site in question runs in a very locked-down environment used only for publishing: it serves API requests from the React frontend, which pulls in stories from the CMS where they are created.
When the client’s content team first notified us of the issue, just by looking at this one New Relic screen, within two minutes we were able to say, “Okay, it’s slow and it’s this weird sawtooth pattern where something resets itself every week. In between, it gets slower, and slower, and slower.”
We started to dig in and realized the time was going to web external calls. We noticed that multiple external requests were happening per page load, and that sawtooth pattern was there in our ElasticPress hosted service. So it could have been related to ElasticPress:
But then we clicked over to the next service, which was Amazon Web Services, and we saw that same sawtooth pattern, regardless of the service:
In this screenshot, you can see that the sawtooth patterns matched one another: external requests got slower and slower regardless of whether they went to our service or to a completely separate Amazon storage service. At that point, we knew there was an issue with the underlying cURL commands.
Looking at individual stack traces from when cURL was running slow, we determined that the API was sending out so many simultaneous requests that the cURL requests were starting to queue up. As more page-load requests came in, they would start, reach their cURL request, and then have to wait for the cURL commands from previous page loads to complete before cURL would execute. We were able to immediately identify the bottleneck: cURL could execute only four simultaneous requests at a time.
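The effect of that four-slot limit can be sketched with a simple queueing model in Python. The pool size of four comes from what we observed; the request counts and timings below are made up purely for illustration:

```python
import math

def time_to_drain(num_requests, pool_size, seconds_per_request):
    """Requests run in waves of at most pool_size; each wave takes
    seconds_per_request, so later requests wait on earlier waves."""
    waves = math.ceil(num_requests / pool_size)
    return waves * seconds_per_request

# With only 4 slots, 20 queued cURL requests take 5x as long as 4 would:
print(time_to_drain(4, 4, 0.5))    # 0.5 s: one wave, no queueing
print(time_to_drain(20, 4, 0.5))   # 2.5 s: five waves, most requests wait
```

This is why the slowdown compounded under load: every page load past the fourth concurrent one spent most of its time waiting in line, not executing.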
10up solved the problem by looking at the application logic and determining that far too many API requests were firing. We were able to cache the content that actually lived in Amazon S3 storage much more aggressively in the app, which is why the screenshot below shows garbage collection dropping off as the API requests were reduced. Once we pushed the final deployment, almost no time was spent on those external requests, because we had greatly reduced their number.
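A minimal sketch of the kind of in-app caching that cuts down external requests might look like the Python below. The `TTLCache` class and `fetch_from_s3` stand-in are hypothetical illustrations, not 10up’s actual implementation:

```python
import time

class TTLCache:
    """Tiny time-based cache: repeated reads within the TTL window are
    served from memory instead of firing an external request."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry and entry[0] > now:
            return entry[1]               # hit: no external request
        value = fetch(key)                # miss: one external request
        self._store[key] = (now + self.ttl, value)
        return value

# Hypothetical stand-in for the real S3 fetch; it counts how often it runs.
external_calls = []
def fetch_from_s3(key):
    external_calls.append(key)
    return f"content for {key}"

cache = TTLCache(ttl_seconds=300)
first = cache.get_or_fetch("story-42", fetch_from_s3)
second = cache.get_or_fetch("story-42", fetch_from_s3)  # served from cache
print(len(external_calls))  # 1: the second lookup never touched S3
```

With content cached like this, most page loads never enter the cURL queue at all, which is what made the sawtooth pattern disappear.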
Without New Relic, we probably would have given up
Without New Relic, we would have spent hours debugging code and changing things. Instead, in a nondestructive way, we went through this in an hour of real time, and we were able to go back to the team and say, “Here’s the issue. We’ve upgraded cURL on your servers as a precaution. That didn’t really seem to solve it, but why don’t you go ahead and try to reduce the number of simultaneous API calls and see if that helps?” They did, and it was all good.
After we were able to fix the problem thanks to New Relic, everything looks just peachy! There’s absolutely no hint of that sawtooth pattern. In fact, within about a day, the garbage collection for the React app catches up and all of that web external time is completely gone. That was really great to see. Without New Relic, we probably never would have found it. We probably would have given up.
How 10up met New Relic
10up was able to make a quick fix like this because New Relic was already on board. I joined 10up some six years ago, just as I was getting up to speed with New Relic at my previous job. There, I used a plug-in that some coworkers had written to add WordPress-specific data to New Relic, and we quickly realized that, while our staging and local data sets are usually tiny, most of our performance problems happen with large data sets and lots of traffic on the site. New Relic offered the unique value proposition of monitoring performance and getting detailed information without having to turn on a bunch of debug logging and totally mess up the servers.
That was enough for 10up engineering leadership to say, “We need to have this. We need to start running this on all of our production sites. We need to encourage our customers to use it whenever possible so that they can troubleshoot.” Today, 10up relies on New Relic for debugging performance issues in code, and for figuring out why a site is suddenly slow, or slow only in a particular section of the site.