Guest author Zack Kendall is a software engineer at Agilysys.

Agilysys, a leading provider of hospitality software, has successfully made the transition from green screen to a new generation of Software-as-a-Service (SaaS) hospitality solutions. To monitor our production cloud services, we rely on New Relic. Here’s the story of how we used New Relic to track down the cause of a misbehaving microservice.

First, some background. The New Relic Java agent provides a wealth of information without making our developers hand-implement the instrumentation. We’ve also added custom attributes that tag transactions by customer and by other dimensions that help us analyze our business. New Relic has become a key part of our alerting infrastructure and a way for us to monitor the minute-by-minute and day-to-day performance of our cloud services. We deploy weekly, and New Relic lets us know if things take a turn for the worse before any issues get out of hand.
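Adding those attributes is a one-liner per attribute with the agent API. Here’s a minimal sketch; the attribute names and the controller class are purely illustrative, not our actual code:

import com.newrelic.api.agent.NewRelic;
import com.newrelic.api.agent.Trace;

public class ReservationController {

   // @Trace ensures this method runs inside a transaction; in a typical
   // servlet or Spring controller the agent starts one automatically.
   @Trace(dispatcher = true)
   public void handleRequest(String customerId, String propertyId) {
      // Hypothetical attribute names -- pick whatever dimensions you want
      // to slice and filter by later.
      NewRelic.addCustomParameter("customerId", customerId);
      NewRelic.addCustomParameter("propertyId", propertyId);

      // ... actual request handling ...
   }
}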

Issues with our authentication microservice

One of the key microservices in our SaaS application handles authentication. This service gets called frequently inside other services’ transactions, so the performance of the authentication service directly impacts the latency of other services’ APIs. Unfortunately, as our business and customer usage grew, the performance of the authentication service degraded.

[Chart: web transaction percentiles for the authentication service]

As the latency grew to alarmingly high levels, we worked hard to figure out what was causing the problem. New Relic APM showed us that the app was making a lot of database calls. We rewrote some of our algorithms to fetch data from our Mongo database using more bulk operations. We also wrote local integration tests for this service that tested the performance, so we could show that the algorithm changes had an effect. Unfortunately, when those changes made it to production, New Relic revealed only marginal improvements.
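To give a sense of the kind of rewrite, here’s a minimal sketch of swapping per-ID lookups for a single bulk query with Spring Data’s MongoTemplate. The AuthContext entity and field names are hypothetical stand-ins, not our actual data model:

import java.util.ArrayList;
import java.util.List;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public class ContextRepository {

   private final MongoTemplate mongoTemplate;

   public ContextRepository(MongoTemplate mongoTemplate) {
      this.mongoTemplate = mongoTemplate;
   }

   // Before: one round trip per ID, so N database calls per transaction.
   public List<AuthContext> loadOneByOne(List<String> ids) {
      List<AuthContext> results = new ArrayList<AuthContext>();
      for (String id : ids) {
         results.add(mongoTemplate.findById(id, AuthContext.class));
      }
      return results;
   }

   // After: a single $in query fetches the whole batch in one round trip.
   public List<AuthContext> loadInBulk(List<String> ids) {
      Query query = Query.query(Criteria.where("_id").in(ids));
      return mongoTemplate.find(query, AuthContext.class);
   }
}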

Hard to reproduce the problem

The big challenge was that we couldn’t locally reproduce the magnitude of latency we saw in production. We wrote JMeter tests to hammer away at the service to try to make a visible crack locally, but the authentication service performed snappily in these tests.

One possibility was that our server hardware just wasn’t good enough; after all, New Relic showed our CPUs struggling, RAM filling up, and our Java virtual machine (JVM) garbage collectors firing off at an uncomfortably high frequency. We upgraded the class of cloud servers we used, but unfortunately that also delivered only marginal gains.

From the beginning we played with New Relic features to try to shed more light on the issue. We had used the X-Ray thread profiler to show a breakdown of where time was being spent. However, the hotspot in our code was just a simple Java method that made a couple of simple database calls. We didn’t see it as special or unique, and we were unable to make it lag when testing in a debuggable environment. Furthermore, the real hotspot was deep in the Spring Framework code, which we used everywhere—including places that weren’t having performance problems—so some of the team were skeptical of these results. (Sorry, New Relic, we should have known better.)

[Screenshot: New Relic thread profile showing the hotspot deep in Spring Framework code]

At some point in banging our heads against the wall, we noticed that the _class field in one of our MongoDB collections had many different package values. This field is created and used by the Spring MongoDB tooling to identify which Java class to instantiate from a database document. Our interest was really piqued when we saw that the field contained package names from other services and followed naming patterns we no longer used. We used this information to adjust the data in our local performance testing.
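For readers who haven’t run into it, _class is the metadata Spring Data MongoDB writes on every mapped document so it can instantiate the right Java type when reading it back. A hypothetical example (the entity and collection names are made up):

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.mapping.Document;

// Saving an instance of this class through MongoTemplate produces a document like:
//   { "_id": "...", "customerId": "...", "_class": "com.example.auth.AuthContext" }
@Document(collection = "contexts")
public class AuthContext {

   private String id;
   private String customerId;

   public static void saveExample(MongoTemplate mongoTemplate) {
      mongoTemplate.save(new AuthContext());
   }
}

When a different service writes to the same collection with its own entity class, or when documents outlive a package rename, the stored _class values stop matching anything on the reader’s class path, which is exactly the pattern we had just stumbled onto.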

Bingo!

We were finally able to reproduce the slowness, and we could toggle it on and off based on the data in this field. This finding matched the behavior we had observed in production perfectly. But we had already tried many things without success, so we wanted to be completely certain that this field really was the crux of the performance problem.

To confirm, we wrote isolated tests using the Java Microbenchmark Harness (JMH). The results came back in spades:

Benchmark                                             Mode  Cnt    Score    Error  Units
MongoPackageBenchmark.getContextsBadClass             avgt   15  153.173 ± 48.580  ms/op
MongoPackageBenchmark.getContextsMatchingClass        avgt   15    5.740 ±  1.631  ms/op

JMH’s stack profiler also provided a thread-state breakdown and the hotspot behind it:

   ....[Thread state distributions]...................................................
   55.1%         BLOCKED
   25.8%         TIMED_WAITING
   16.5%         RUNNABLE
    2.5%         WAITING
   ....[Thread state: BLOCKED].........................................................
   55.1% 100.0% java.lang.ClassLoader.loadClass
    0.0%   0.0% java.util.zip.ZipFile.getEntry

You can see that a significant amount of time is spent blocking, which causes high latency. These results correlate with the hotspot found in the New Relic X-Ray profiler. Looks like we should have trusted it from the beginning!

With this information we could debug the Spring code and identify the mechanism causing the performance issue. When a document is requested from the database, the Java class resolved from the value in the _class field is cached, but only if the class loader actually finds it. Otherwise, every single document in every single query triggers a fresh search of the class path. That class-loader and I/O work is a blocking operation, and doing it constantly in a multi-threaded environment causes serious delays.
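Our benchmark exercised the full query path against seeded data, but the core cost can be distilled into a few lines of JMH. The sketch below is illustrative rather than our actual benchmark (the missing class name is made up): a lookup that fails is never cached, so every call re-scans the class path and contends on the class-loading locks, which is exactly the BLOCKED time the profilers reported.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(16)                     // mimic concurrent requests hitting the service
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class ClassLookupBenchmark {

   @Benchmark
   public Object unresolvableClass() {
      // Stands in for a _class value from another service: the lookup fails,
      // nothing is cached, and the next call pays the full price again.
      try {
         return Class.forName("com.example.other.LegacyProfileContext");
      } catch (ClassNotFoundException e) {
         return e;
      }
   }

   @Benchmark
   public Object resolvableClass() {
      // Stands in for a matching _class value: once loaded, the class loader
      // returns the already-loaded class without touching the class path.
      try {
         return Class.forName("java.lang.String");
      } catch (ClassNotFoundException e) {
         return e;
      }
   }
}

The @Threads setting matters here: the point is to reproduce the multi-threaded blocking the profilers reported, not just the raw cost of a single failed lookup.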

In the Spring snippet below, you can see one of the several places where the Spring code caches type information, but only when the resolved type is not null. The resolved type remains null whenever the _class value doesn’t match a class visible to the Java class loader, so the expensive lookup is retried every single time.

public class SimpleTypeInformationMapper implements TypeInformationMapper {
   ...
   private final Map<String, ClassTypeInformation<?>> CACHE
      = new ConcurrentHashMap<String, ClassTypeInformation<?>>();
   ...
   public ClassTypeInformation<?> resolveTypeFrom(Object alias) {
      ...
      ClassTypeInformation<?> information = CACHE.get(value);
      ...
      // Only a successful resolution is cached; when the class can't be
      // loaded, information stays null and nothing is ever stored for
      // that alias, so the next document repeats the class-path search.
      if (information != null) {
         CACHE.put(value, information);
      }
      ...
   }
}
The takeaway: trust New Relic

We were originally unable to reproduce the latency issue in our authentication microservice because we had bootstrapped our testing environment with documents whose _class values matched classes available to the authentication service. In production, however, most of the data had been created by another service using a different Java class, and older documents still carried _class values from previous iterations of that class under different package names. The result was many different _class values, none of which resolved to a class on the authentication service’s class path.

After making this discovery, it was relatively easy to provide a fix that drastically improved our authentication service’s performance, and thereby reduced the transaction latency of our other services as well.
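I won’t walk through our exact change here, but if you run into the same wall, Spring Data MongoDB gives you a couple of levers. The sketch below is illustrative rather than a copy of our fix; the bean, alias, and entity names are hypothetical, and exact class names vary a bit between Spring Data versions:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.annotation.TypeAlias;
import org.springframework.data.mongodb.MongoDbFactory;
import org.springframework.data.mongodb.core.convert.DefaultDbRefResolver;
import org.springframework.data.mongodb.core.convert.DefaultMongoTypeMapper;
import org.springframework.data.mongodb.core.convert.MappingMongoConverter;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.core.mapping.MongoMappingContext;

// Option 1: store a stable alias in _class instead of a package-qualified class
// name, so refactors and cross-service writes no longer break type resolution.
@Document(collection = "contexts")
@TypeAlias("authContext")
class AuthContext {
   private String id;
   private String customerId;
}

// Option 2: if you never read documents polymorphically, stop writing and
// reading type information altogether by configuring a null type key.
@Configuration
class MongoConfig {

   @Bean
   public MappingMongoConverter mappingMongoConverter(MongoDbFactory factory,
                                                      MongoMappingContext context) {
      MappingMongoConverter converter =
         new MappingMongoConverter(new DefaultDbRefResolver(factory), context);
      converter.setTypeMapper(new DefaultMongoTypeMapper(null));
      return converter;
   }
}

In both cases the goal is the same: make sure a read never falls into the uncached failure path shown above.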

[Chart: web transaction percentiles after the fix]

We used a lot of tools in this journey, but New Relic was always there as an excellent way to monitor our applications and our efforts to improve them. New Relic’s out-of-the-box instrumentation provided endless value, and in the end, the product earned even more of our trust.

 

Zack Kendall is a software engineer at Agilysys. He is addicted to analyzing and fixing software performance problems.

Interested in writing for New Relic Blog? Send us a pitch!