Building a Transaction Tracer for Node.js the Hard Way*

It’s with a fair amount of pride (and relief! So much relief!) that I see New Relic for Node.js released into the wider world. It’s a project to which I’ve devoted more or less my entire life since April of 2012, and represents the most significant technical achievement of my career. I hope it’s so simple to use that you’ll never need to know how much work we put into writing it, but it’s had an involved (maybe “difficult” is a better word) journey into reality.

New Relic’s agents do many things, and we solved a whole bunch of hard problems to bring New Relic to Node, but I’m going to write a little about the hardest, and most interesting part, of writing the module: the asynchronous execution tracer that gathers all of the metrics and slow execution traces from your applications. To the best of my knowledge, New Relic’s 100% JavaScript transaction tracer is a first for Node. What follows is a little technical, but I think it’s pretty interesting.

Persistence and failure

The tracer in New Relic for Node.js 1.0 is actually the fourth attempt we made at building a transaction tracer. We had to fail a few times before we came up with something that actually worked. Won’t you take a look at my failures with me?

    • First attempt: The prototype I built the module on was something that Saxon D’Aubin, one of New Relic’s principal engineers, had spent a couple months hacking together. It built on a piece (called hookit.js) of Adam Crabtree’s very useful trycatch error-handling module, and used that and some tricks involving writing values directly onto the JavaScript call stack. The prototype addressed one of the key problems of building an async transaction tracer, which is propagating trace state across asynchronous calls, but didn’t work with Node’s concurrency model. By which I mean that it didn’t actually trace transactions. Welp.
    • Second attempt: It had taken a while for me to prove to my satisfaction that the first tracer wasn’t going to work (and, in fact, couldn’t work), so my first impulse was to sit down and write up from first principles a set of test cases that any real solution would have to handle. I used that to write a new version based on a shared state and taking advantage of Node’s single-threaded model to do things that would probably make experienced developers of multithreaded code throw up. This version of the tracer actually handled concurrent requests. Great! However, it only worked when all of the modules used by a request were wrapped by our instrumentation, which basically describes no real-world Node applications. Not great!
    • Third attempt: So I’d learned that we needed both a way to asynchronously propagate state and a global context for sharing that state that would work with Node’s concurrency model. Node actually has an API that does both those things – the built-in domain module, which is meant to simplify asynchronous error-handling, long a pain point in Node. This actually allowed us to trace execution through real applications, but the module already used domains for its error tracer (a piece of code that’s at least as complicated as the transaction tracer, although less exotic), and the conflation of concerns combined with the conflicting requirements of the execution and error tracers meant that using domains for both things was going to be really hard to get working correctly, and then be a nightmare to maintain.

(Eventual) Success

I had finally identified all of the tracer’s requirements: it needed to propagate asynchronous state, it couldn’t interfere with error handling, and it was most likely going to use a shared context. What I wanted was more or less a less error-focused version of domains. I was faced with the unappealing idea of monkeypatching much of the runtime to make it work, but I had a pretty good idea of what the API would need to look like, and previous attempts had left me with a decent bed of tests.

Fortunately, New Relic’s efforts had caught the attention of the Node community, and around this time I got help from Isaac Schlueter, Daniel Shaw, and, through the Node Firm, the incredibly talented Tim Caswell. Their help was instrumental in turning my sketch of an API into something that actually works.

It took a fair amount of grappling with generic, abstract concepts to even come up with a name, but we ended up calling the thing we built “continuation-local storage” (CLS). The resulting API was basically the exact thing I’d been groping for this entire time – a simple API that allows the New Relic module to set and read values from a shared context, where the values are tied to a particular chain of asynchronous function calls. Because the API is only about key-value pairs, it’s not especially specific to New Relic, and if you have a need for something similar, you should give it a look. I think it’s likely to be pretty useful in a number of places.

At Isaac’s suggestion, once we (by which I mean mostly Tim) had the API stable – and I’d gotten New Relic’s tracer running on top of it, which was both straightforward and amazingly gratifying after over a year of false starts – I opened a pull request against Node. Response to the pull request was surprisingly muted, but Trevor Norris, an engineer employed by Mozilla to improve Node’s performance, objected and said that domains had been enough of a performance regression and that there was no way he was allowing another feature like that into core. We talked about it on IRC and he said that he thought it wouldn’t be too tough to come up with a simpler API that did what I needed and didn’t affect performance.

Two months of heroic effort later, Trevor had put his own pull request together. I pretty much owe him a truckload of beer for his work, which is excellent. The asyncListener API he came up with is exactly as advertised: simple, and with no performance cost when it’s not in use. Tim and I were able to adapt CLS to use the asyncListener API, and I eventually ended up splitting the glue logic Tim had written out into a polyfill that mirrors Trevor’s API for use in versions of Node that don’t support it natively. In the process I polished it considerably, and my new teammate Jacob Groundwater has given it some much-needed review and polish.

Today

This tracer looks like it will go the distance. Amazingly enough, this is only a high-level summary of what went into producing the asynchronous tracer (for instance, I didn’t mention all of the complications around binding Node’s EventEmitters onto CLS contexts). The important parts are that it’s gathering trace data reliably, Trevor’s very close to getting the asyncListener added to Node’s master branch (which means it will be included in Node 0.12), and New Relic for Node is ready to ship.

Something that should be clear is that even though I’ve put a lot of myself into making this happen, it was not by any means something I could have done all by myself. This is true of most significant software engineering projects, but it’s especially true of this one, where pretty much every step of the way was an experiment. In addition to all of the people mentioned above, it wouldn’t have been possible without a huge amount of support by my colleagues here at New Relic and everyone who participated in the beta. We put a lot of effort into making things that are easy for you to use. If this module helps you understand your code better, it will totally have been worth the effort.

*there is no easy way

forrest@newrelic.com'

View posts by .

Interested in writing for New Relic Blog? Send us a pitch!