Preparing for the Unexpected: Revisiting the Fallacies of Distributed Computing in the Cloud – Network Reliability Isn’t Guaranteed

It’s only the second week of the New Year, and we’re already tired of all this 2012 apocalypse hoopla – with 11 more months yet to go. We got our fill of doomsday predictions last year; forecasting a global catastrophe is kind of tedious at this point. But if there’s any benefit to these prophecies, it’s that they underscore the importance of disaster planning.

As we all know too well, things can go wrong unexpectedly. Being prepared is often the difference between inconvenience and calamity. That’s why we’re such big fans of the Fallacies of Distributed Computing, the list of erroneous assumptions L. Peter Deutsch canonized in 1994 to help programmers build stable, efficient distributed systems.

New Relic engineer Brian Doll explored the subject last year, and the relevance of the Fallacies in today’s cloud environments continues to be a hot topic. With questions persisting about their significance to web app development, we thought the Fallacies deserved another look. Let’s start with a quick review of the list:

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

We’ll dedicate a series of posts over the next few weeks to examining each one in detail and discussing some of the conflicting viewpoints. First up is the fallacy of network reliability.

Anyone who uses technology knows there’s no such thing as a 100% reliable network. Things break. Mistakes are made. Hardware and software fail. Network communications get interrupted. Nefarious forces attack. And then, of course, there’s plain old human error. All of these eventualities are still applicable nearly 20 years after the Fallacies were first established. Some are probably even more prevalent in mobile development, arguably our most popular modern platform.

The key for developers is recognizing that you simply can’t take network reliability for granted. Assume there will be a breakdown – most likely at the worst possible moment – and you’ll be better prepared to work around it. Thankfully, the universality of this predicament has resulted in several tried and true solutions.

The easy answer used to be augmenting your infrastructure with redundant hardware and software, provided the risk of network failure justified the added cost. This is still an appropriate tactic, to be sure. But while the importance of scaling up infrastructure hasn’t changed, for many development teams the process certainly has. Their applications and the data that feeds them live in the cloud now, instead of on proprietary servers in their own datacenters. The contemporary cloud model is no doubt extremely efficient, but it also adds another variable to the network reliability equation.

When it comes to software specifically, problems with external network communications can be mitigated with reliable messaging, or by verifying message integrity and prioritizing messages by importance. However, not everyone believes these safeguards are critical to consider in today’s web environments.
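To make the integrity and prioritization ideas concrete, here is a minimal Python sketch. The names `PriorityOutbox`, `checksum` and `is_intact` are our own illustrations, not part of any particular messaging library, and we assume that a lower number means a more important message. A plain SHA-256 digest like this catches accidental corruption in transit; it does not protect against deliberate tampering.

```python
import hashlib
import heapq
import itertools


def checksum(payload: bytes) -> str:
    """Return a SHA-256 hex digest used to detect accidental corruption."""
    return hashlib.sha256(payload).hexdigest()


def is_intact(payload: bytes, digest: str) -> bool:
    """Check that a received payload still matches the digest sent with it."""
    return checksum(payload) == digest


class PriorityOutbox:
    """Hypothetical outbox that hands out the most important message first."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so messages with equal priority keep insertion order.
        self._counter = itertools.count()

    def put(self, payload: bytes, priority: int) -> None:
        # Assumption: lower number = higher importance.
        entry = (priority, next(self._counter), payload, checksum(payload))
        heapq.heappush(self._heap, entry)

    def get(self):
        """Pop the most important message as a (payload, digest) pair."""
        _priority, _order, payload, digest = heapq.heappop(self._heap)
        return payload, digest
```

When the link is flaky and only a few messages get through, the outbox sends the critical ones first, and the receiver can call `is_intact` on each one before acting on it.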

In his 2009 reassessment of the Fallacies, Tim Bray (former director of web technologies at Sun) posits that they pose less of a risk for developers of web applications. He argues that building things on web technologies “lets you get away with believing some of [the Fallacies]” because they’re being addressed at the infrastructure level and are controlled by HTTP standards. His specific rebuttal to the fallacy of network reliability hinges on the fact that “connections are brief” and idempotent commands like GET, PUT and DELETE can simply be repeated until successful.
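As a rough illustration of the retry approach, consider the Python sketch below. It is our own, not code from Tim’s post: `retry_idempotent` and its parameters are hypothetical names, and `send` stands in for whatever function actually performs the request. We also cap the number of attempts and back off between them, which is a deliberate departure from literally repeating “until successful.”

```python
import time

# Methods the post treats as idempotent.
IDEMPOTENT_METHODS = {"GET", "PUT", "DELETE"}


def retry_idempotent(method, send, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call send(), retrying on ConnectionError only for idempotent methods.

    `send` is a zero-argument callable that performs the request (assumed).
    `sleep` is injectable so the backoff can be observed without waiting.
    """
    if method.upper() not in IDEMPOTENT_METHODS:
        # Repeating a non-idempotent request (e.g. POST) could apply it
        # twice, so it gets exactly one attempt.
        return send()

    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The safety of this pattern rests entirely on the server honoring idempotency for those methods, which is an assumption rather than something the client can verify.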

Brian debated this point with Tim last year, countering that repetition of the same request can significantly degrade performance. His recommendation for web developers is instead to “build an app that can function at reduced capacity when a given service is offline.” Doing so yields a better user experience and reduces headaches for administrators.
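Here is one way the “reduced capacity” idea might look in Python. This is a sketch under our own assumptions: `fetch_with_fallback`, `render_product_page` and the recommendations service are hypothetical examples, not anything from Brian’s post.

```python
def fetch_with_fallback(fetch, fallback):
    """Return (value, degraded): the live result, or the fallback on failure.

    `fetch` is a zero-argument callable that talks to a remote service.
    """
    try:
        return fetch(), False
    except (ConnectionError, TimeoutError):
        # The dependency is unreachable: carry on with a safe default.
        return fallback, True


def render_product_page(product, fetch_recommendations):
    """Build page data even when the (hypothetical) recommendations service is down."""
    recommendations, degraded = fetch_with_fallback(fetch_recommendations, fallback=[])
    return {
        "product": product,
        "recommendations": recommendations,
        # The flag lets the UI hide the widget and lets monitoring count outages.
        "degraded": degraded,
    }
```

The page still renders when the optional service is offline; the user simply sees fewer features instead of an error, and nobody hammers the failing service with repeated requests.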

There’s no dispute that some web standards do provide guidelines that account for the Fallacies of Distributed Computing. Yet, what makes the fallacy of network reliability still valid in web-based systems is primarily the human element – although there are rules in place, it’s unwise to assume everyone is following them. Other technology bloggers have written about this as well, cautioning that Bray’s perspective is “overly optimistic” because “the fact that web standards cover some of the Fallacies doesn’t mean that those who build web applications comply with the standards.”

Ultimately, we think there are two critical questions for any developer inclined to ignore the inevitable network failure. How certain are you that every GET, PUT and DELETE handler in the system is idempotent, and thus compliant with the HTTP standard? More importantly, do you want to risk finding out the hard way?
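One low-cost way to start answering the first question is to test it. The Python sketch below is a simplification of our own: it assumes each handler can be modeled as a function that takes the server state as a dict and returns the new state, and the handler names are made up for illustration. Passing this check for one starting state is evidence, not proof.

```python
import copy


def is_idempotent(handler, state):
    """Apply the handler once and twice; idempotent means the results match."""
    once = handler(copy.deepcopy(state))
    twice = handler(copy.deepcopy(once))
    return once == twice


def put_title(state):
    """Well-behaved PUT: setting the same title again changes nothing."""
    state["title"] = "Hello"
    return state


def get_with_view_counter(state):
    """A GET that quietly bumps a counter, so every retry changes the state."""
    state["views"] = state.get("views", 0) + 1
    return state
```

With these two handlers, `is_idempotent(put_title, {})` holds while `is_idempotent(get_with_view_counter, {})` does not, which is exactly the kind of surprise a blind retry loop would trip over.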

Next up: Why changing users’ expectations doesn’t eliminate the consequences of latency.
