What Nonlinear Dynamics Teach Us About App Stability

Nonlinear dynamics is one of the most useful methods I have found for building stable software applications. However, many engineers I speak to are not familiar with this technique, so I took it upon myself to spread the word. Once you’ve gone through my brief intro and a few examples below, you should be able to apply the most basic principles to the software engineering problems you face regularly (and find out whether you want to learn more).

You probably never learned this in school

Most engineers are familiar with performance and scalability. Formally and self-educated engineers alike are usually exposed to computer science and software engineering principles at some point in their career. For example, most engineers faced with a problem like sorting a list will consider time complexity, also known as O(n) or Big-O, when designing and implementing a solution.

Likewise, most engineers are familiar with software quality. Software engineering practices like XP, agile, or test-driven development are commonly discussed, and most engineers will have heard of these techniques, be familiar with these ideas, and be prepared to discuss them or find out more when designing a development process for a new team.

However, techniques to design and build software with stability, resiliency, and durability tend to be less widely known. Whole system design, ergodic theory, nonlinear dynamics, ruin theory, and extreme value theory are rarely included in formal school curriculum, and few self-educated engineers I’ve spoken to have run across these ideas on their own. Many software engineers are casually familiar with chaos theory, but I’ve met none who know how to apply chaos theory to day-to-day software problem-solving.

I’ve found these techniques to be simple to use, yet very powerful in avoiding problems, saving time in diagnosis, and creating simpler solutions. The disciplines above have a rich literature, which you can explore if you find these ideas to be interesting. I can barely scratch the surface in this post, but I can give you a taste of how these techniques work in the context of a few simple examples.

Setting phasers to stun

Most people think of Star Trek at the mention of Phase Space. In reality, it’s a conceptual or graphical tool for thinking about system state. State space is in fact another name for phase space.

While most will recognize time series graphs that show a single measurable quantity over time, phase space graphs look very different. Time is not an axis on the graph, but instead is an input that isn’t shown directly on the graph. For any given time, the other measured quantities define a location in phase space.

For example, if we decide the relevant quantities to measure in a web application are memory, CPU load, and number of concurrent users, these form a three dimensional phase space. At any point in time, we can measure the memory, CPU load, and number of concurrent users and generate three specific numbers. At a different point in time, we can repeat the measurement. Each measurement produces a plot in this three dimensional phase space. We can also animate the point by varying time and watching the point move through three-dimensional phase space.

You get to pick the dimensions in phase space. They can be whatever you want, and you can have as many as you want, within a few guidelines. For reasons we’ll see later, you should pick measurements that display similar behavior in similar state. For example, an increasing counter like ‘number of visitors’ is not as helpful as a rate counter of ‘visitors in the last minute’. We might expect the application to slow down with an increasing number of visitors. However, a visitor number isn’t very helpful in predicting application performance.

When choosing the number of dimensions to include you should generally pick the specifics that matter in the particular problem at hand. For web applications, you might use server response time, CPU usage, memory usage, database connections used, concurrent users, etc.

Constructing a thought experiment

Now that we have an idea about what phase space is, we can construct thought experiments to determine application stability. The general idea is to pick two nearby points in phase space, imagine time ticking forward a bit, then guess whether these points will be closer together or further apart. If the points come closer together, the application is stable. If the points go further apart, the application is unstable.

This may make more sense in relation to the common idiom of ‘the straw that broke the camel’s back’. If you haven’t heard of this before, a camel is loaded with one package after another. Finally, a single straw is added after which the camel is unable to stand and collapses. In phase space, we may consider carrying capacity on one axis and pack weight on a second. For small weights, there is a one-to-one relationship between these quantities. However, at some point the carrying capacity decreases to zero. These points are divergent in phase space, since one leads to a sudden decrease and the other does not.

In more formal literature on complex systems, this concept is usually known as a Lyapunov exponent. Stable systems with a restoring force have a negative exponent, while chaotic systems with a divergent force have a positive exponent.

Here’s a more familiar example from a web application that many people may have faced. For example, an application that allows people to sign up for an email newsletter. One design for such an application writes them to the database one at a time on web form submit using a pool of twenty connections.

In phase space, we could consider the number of concurrent users versus the database write time. Consider the write time with twenty concurrent users versus the write time for twenty-one concurrent users. Twenty concurrent users can obtain a connection immediately from the pool. However the twenty-first user will not be able to obtain a connection.

If we think about this case in the dynamic context, the application will be stable if and only if there is some point where adding an additional user does not increase the response time. We can choose where this limit is. If we want a highly responsive site, we can fail immediately and return an error page if a connection is not available. Or we could wait for a connection for up to five seconds. Both of these cases result in stable systems; both are relatively easy to implement. Conversely, I have yet to see stable behavior from the out-of-the-box configuration of any connection pool I’ve ever used. Hopefully the value of these simple ideas is readily apparent from this example.

Putting it all together

As a more complex example, I once worked on a project where a web server handled two fairly distinct types of traffic: small requests and large requests. The existing system used non-blocking I/O with a number of worker threads. This graph shows a traffic mix of primarily small requests. The system is very stable using 200 threads:

Nonlinear Dynamics 1

The graph below shows a traffic mix of primarily large requests. The system is stable in this mode using 100 threads:

Nonlinear Dynamics 2

However, the same fixed number won’t work for both types of traffic. Using a limit of 100 threads was stable, but had poor performance when handling small request traffic. Using a limit of 200 threads rapidly ran the server out of memory when encountering a large traffic mix.

Without system dynamics, this might have been a difficult problem to solve. To use system dynamics, we start by mentally constructing a phase space that includes relevant aspects of the system. This might be a phase space using two variables (average request size and the number of threads). Using actual data above or reasoning from general principles, we can predict a trapezoidal shape where the system is stable that looks like this:

Nonlinear Dynamics 3

In the “unstable” red area above is where the server will run out of CPU or memory. This isn’t the possible state space. For example, we could have considered a state space with three additional dimensions: memory, CPU, and network I/O. Although this may be difficult to think about, we could construct a five-dimensional state space and consider the tradeoffs between different dimensions. Any similar model should result in the same conclusions, namely that a fixed number of threads won’t work. It isn’t possible to create a stable system that effectively uses the server hardware with a fixed limit.

At any request size, processing additional requests was going to become harmful at some point. In phase space, the application needs to realize it is approaching the red area above and move back to the green area. This is the ‘straw that the broke the camel’s back’ scenario discussed above. We need a trick for avoiding the unstable area. The key insight here is that closing a connection quickly and returning an error requires less work than attempting to handle the request. (As a general rule, your error/abnormal behavior must perform better than your typical behavior.)

Implementing this solution requires measuring the environment with respect to limits and holding a few threads in reserve to return errors once we reach one of these limits. At the limit, adding another request does not add to the thread count. In phase space, the application senses it is approaching the red area in phase space and halts its progress. This solution doesn’t require guessing the number of correct threads, so we can apply it proactively in the absence of data. Even better, this solution works independently of request size. Best of all, this solution also handles traffic spikes effectively in the rare case when a thousand concurrent requests come in at the same time.

I found this to be a quite counterintuitive conclusion at first, because the job of the web server is to handle requests correctly. How can converting threads doing real work to threads that only return errors actually improve the application? However, from a dynamics perspective, we can see the goal of doing as much work as possible runs contrary to stability. Realizing this allows us to choose a point where we refuse to service requests and consciously introduce stability.

The implementation of these ideas required very little time and code and greatly improved resilience across different workloads. I did this by allocating an effectively-infinite 2048 threads and dedicating 10 threads to returning errors when a reserved amount of memory and CPU is reached. Here is a graph of the updated system handling a time-varying traffic profile and dynamically ranging between 92 to 397 threads while remaining stable.

Nonlinear Dynamics 4

Conclusion

Hopefully, I have shown at least a hint nonlinear dynamic’s promise in helping you build stable software. I’ve found it to provide a very unique perspective and a good source of design ideas for solving difficult problems or proactively improving application stability. If you want to find out more about these topics, here are a few pages in Wikipedia to get you started:

dcopeland@newrelic.com'

View posts by .

Interested in writing for New Relic Blog? Send us a pitch!