Imagine for a moment that you have inherited a legacy software system. The system has been passed around multiple teams over a long period of time, each adding their own modifications. You were not part of the team that built the system, you don’t entirely understand what it is intended to do, and you cannot comprehend the implementation. You know that it has some serious flaws, but it is stable and has been running without incident for a long time.
Now, imagine that system suffers a catastrophic failure.
In the aftermath of a critical incident, the mandate is given that such a failure cannot be allowed to happen again. It is up to you and your team to implement the changes necessary to meet this expectation. What do you do?
For software engineers, the automatic response to this situation is “Rewrite the entire thing in your new favorite technology!” Industry best practice, however, makes it clear that such a rewrite is often a certifiable Bad Idea. (Joel Spolsky, CEO and co-founder of Stack Overflow, spoke to this point in “Things You Should Never Do, Part 1,” which examines the downfall of Netscape.) Even so, most software developers can be seduced by this approach. I have learned from hard experience that Joel is right.
Most of the time.
There absolutely are some circumstances in which the full rewrite is the right thing to do. But how do we decide when the investment and risk are outweighed by the benefit? Simply put, the rewrite is the best possible solution when it is the only solution because the system you are replacing cannot be fixed.
A real New Relic example
New Relic faced this situation with our Insights Custom Event API (ICEI). This customer-facing edge-component is a key piece of our data-ingestion tier. Customers who need to submit data into the New Relic Database (NRDB) outside of an agent use the ICEI to send event payloads to us, which get stored in NRDB so they can be queried in New Relic Insights.
Unfortunately, the ICEI exhibited all of the issues mentioned above: it was built by multiple teams over multiple years, it employed plenty of custom code, and it had bugs related to the information it gave to customers when something went awry.
That may sound like a pretty good justification for a rewrite. But it is not enough. If all we were concerned about was comprehension, expertise, and bug fixes, we could spend the time and make it work without a full rewrite.
We also needed to consider the root cause of the critical failure. In this case, the custom memory model used by the ICEI to handle the variable payload sizes contained a concurrency bug so hard to replicate and diagnose that it took five engineers an entire day to figure it out. Even after we crafted a fix for the specific bug, we had very high confidence that the system was harboring other issues.
As we analyzed the implementation and technologies, it became clear that refactoring the existing system to guarantee that we would not experience another similar failure was effectively impossible. The request-handling framework combined with the custom memory manager would never be capable of providing a reliable enough solution.
We also examined the ICEI implementation against the real-world use cases the system was required to support. This revealed that the system we had was not the system we needed. The scope and requirements for the system had shifted significantly, but the implementation had never been revisited to reflect those changes.
So by then we knew that we needed a different system, designed from the ground up to address the specific requirements we had identified. At that point, we should just have gone heads-down and rebuilt this thing, right?
Not quite. Why not? Well, not all of the implementation in the ICEI was limited by the problems we’d identified. In fact, the core behavior (parsing JSON payloads into event batches and dispatching them to Kafka for insertion into NRDB) worked fine. The issues we had were with the code surrounding that behavior. In addition, the JSON parsing logic in use was very permissive, and some customers relied on that permissiveness.
We determined that this segment of the application should be preserved and reused to make sure we didn’t break customers’ applications. In the end, the decision was made to keep this behavior as-is, and replace all of the problematic code that deals with request processing.
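To make that split concrete, here is a minimal sketch in plain Java of one way to put the preserved behavior behind narrow interfaces so that new request-processing code can call it without touching its internals. Every name here (PayloadParser, BatchDispatcher, IngestService, StandInLegacyParser) is hypothetical and not taken from the ICEI, and the stand-in parser just splits lines rather than parsing JSON.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical seam around the preserved parsing logic: payload text in, event batch out. */
interface PayloadParser {
    List<String> parse(String body);
}

/** Hypothetical seam around the preserved dispatch step (Kafka in the real system). */
interface BatchDispatcher {
    void dispatch(List<String> batch);
}

/** New request-processing code. It depends only on the two seams above. */
final class IngestService {
    private final PayloadParser parser;
    private final BatchDispatcher dispatcher;

    IngestService(PayloadParser parser, BatchDispatcher dispatcher) {
        this.parser = parser;
        this.dispatcher = dispatcher;
    }

    /** Parses the body with the preserved logic, dispatches the batch, returns the event count. */
    int ingest(String body) {
        List<String> batch = parser.parse(body);
        dispatcher.dispatch(batch);
        return batch.size();
    }
}

/** Stand-in for the legacy parser. NOT real JSON parsing: one event per non-blank line. */
final class StandInLegacyParser implements PayloadParser {
    @Override
    public List<String> parse(String body) {
        List<String> events = new ArrayList<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                events.add(trimmed);
            }
        }
        return events;
    }
}
```

The point of the seams is that the legacy parser can be dropped in unchanged, and its permissive behavior can be pinned down with tests before any of the surrounding code is replaced.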
What we did
We selected Ratpack as our request-handling framework. Ratpack uses Netty under the hood, and provides a clean, well-implemented workflow that enforces proper concurrency when dealing with requests and payload bodies. This addresses the fundamental concerns we had with ICEI.
Other teams at New Relic had already experienced success using Ratpack, so we had in-house knowledge and proven usage of the framework. At the end of the request-handler chain, the event batches are submitted to NRDB with reused logic that is already proven at scale. At any point in the chain, if a customer-driven error occurs (bad JSON, bad event data, etc.), a proper response is sent back to the customer indicating the type of failure and providing contextually relevant information about the source of the error.
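The shape of such a chain can be sketched in plain Java. This is not Ratpack's API and not the actual ICEI replacement: the class names, the stage checks, the error kinds, and the 200/400 status codes are all assumptions made for illustration. Each stage either passes the payload along or raises a customer-facing error, which the chain turns into a response that names the type of failure.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Raised by a stage when the customer's request is at fault (hypothetical type). */
final class ClientError extends RuntimeException {
    final String kind;

    ClientError(String kind, String detail) {
        super(detail);
        this.kind = kind;
    }
}

/** Minimal response value: a status code plus a message for the customer. */
final class Response {
    final int status;
    final String body;

    Response(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

/** Runs the stages in order; the first ClientError short-circuits into an error response. */
final class HandlerChain {
    private final List<UnaryOperator<String>> stages;

    HandlerChain(List<UnaryOperator<String>> stages) {
        this.stages = stages;
    }

    Response handle(String payload) {
        String current = payload;
        try {
            for (UnaryOperator<String> stage : stages) {
                current = stage.apply(current);
            }
            return new Response(200, "accepted");
        } catch (ClientError e) {
            // Report the failure type and a contextual detail (status code is an assumption).
            return new Response(400, e.kind + ": " + e.getMessage());
        }
    }
}

/** Illustrative stages; the checks are deliberately simplistic placeholders. */
final class Stages {
    static String requireBody(String payload) {
        if (payload == null || payload.trim().isEmpty()) {
            throw new ClientError("empty_payload", "request body was empty");
        }
        return payload;
    }

    static String requireEventType(String payload) {
        if (!payload.contains("eventType")) {
            throw new ClientError("bad_event_data", "event is missing an eventType attribute");
        }
        return payload;
    }
}
```

Keeping error reporting in one place, at the edge of the chain, means each stage only has to describe what went wrong, and the final stage can hand the validated payload to the reused submission logic.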
Ratpack also provides a testing framework that allows for much deeper unit tests to validate the system. And the efficiency gains from moving to an asynchronous request-handling framework dramatically lower operating costs by reducing the hardware requirements of the service. We decided to invest a small amount of effort up front to prove that the new architecture would address the core problems in the system, and then validate the implementation. This gave us an early checkpoint at which we could abandon the experiment if it didn’t fit our use case.
The full rewrite is usually the wrong thing to do, but not always. Sometimes, it is the only reasonable solution. However, just because you can rewrite all of a system doesn’t mean you should.
It pays to look deeply into the requirements and implementation of the system in question and first determine which aspects absolutely must be replaced, and which are good enough to reuse. This approach reduces risk by preserving as much of the known and validated implementation as possible, while focusing effort on the things that absolutely have to be changed.
Sometimes, the rewrite is the right answer.
“But wait! What happened?”
Worry not, kind reader. This is but the first of a series of blog posts addressing how the Insights team replaced the ICEI, the things we learned on the way, and the process we used to pull it all together.
Don’t miss the other installments of the “Designing for Scale” series:
- Designing for Scale: Part 2—Building What You Need
- Designing for Scale: Part 3—Scaling Under Stress
- Designing for Scale: Part 4—Deployment Surprises