It’s Friday at 5:30pm…
See if you can relate to this situation: the week is winding down on Friday afternoon and the team identifies a bug that needs to be patched or hot-fixed. A build is ready to be created, and just like that, your build server crashes. What do you do? Do you have redundant servers you can switch over to? Better yet, do you even know what’s on your build server, or how to recreate it?
Take a deep breath
While you’re thinking about that, a little history first. Our virtual infrastructure had a humble beginning. I wish I could say that we rocked it right out of the gate, but that would just be unrealistic. When a company or product is young, quick decisions often have to be made, and time does not always allow for the ‘best’ solution. Much like the product, however, the infrastructure must follow suit as the team and its processes grow and mature. While it is true that having a first-class product is paramount, equally important is the infrastructure that supports everything required to create, test, and deploy it. It’s hard to be efficient in this game otherwise.
On the Windows team we currently support two main products: the Windows .NET Agent and the Windows Server Monitor. We have a vast infrastructure for the systems required to build, test and deploy both of these products. For testing alone, we have a myriad of virtual machines replicating many of our supported platforms.
For example, every time we release a new version of the Windows Server Monitor (WSM), each release candidate is installed on nearly a dozen different types of VMs for pre-release qualification. Beyond the test servers, there are database servers, build machines, and file servers. It quickly adds up. If these systems were not maintained, or could not be easily recreated, debugging issues and recovering from failures would become quite complicated.
We’ve come a long way in a little over a year. This time a year ago, our team’s virtual infrastructure consisted of a handful of old, largely unmaintained VMs. They existed in unknown states, with unknown software and frameworks… You get the picture: lots of unknowns.
If you allow the quality of your infrastructure to lapse, this can lead to fear: fear of modifying, fear of experimenting, and fear of using. An increase in fear leads to a lack of trust. When you can’t trust something, you are much less likely to use it.
One of our old build servers was so temperamental that we were afraid to install Windows updates, because we could not be certain it would restart cleanly if an update required it. The answer to “What’s on this box?” was anyone’s guess. It could not be recreated easily, and worse still, the documentation was very out of date. Just as in software development, where sometimes it’s time to select ‘Project->New’, it was time we tore things down and started from scratch, intelligently.
Now, close to 25 VMs are used on a daily basis, helping deliver New Relic-quality software to our customers. Nearly all of them have either snapshots or backing scripts that can be used to restore the configuration to a desired state. A developer can freely experiment with our testing VMs to debug customer issues or prototype changes. They can do whatever they want with a VM, even trash it, and we can restore it to its usable form in a minimal amount of time.
Many of our previous VMs were quite bloated, as they were used to do many, many things, serving as combined testing, database, performance and file servers. By purpose-building our VMs, we improved performance because we greatly reduced the number of unnecessary applications and tools on each one.
I’m not preaching a specific toolset or framework for maintaining your infrastructure. There are tools like Chef or Puppet, to name a couple. Most of our Windows-based environment is configured using PowerShell scripts; find what works best for your team. The point is that you should be able to maintain your infrastructure and respond in the event of a disaster.
Start with your build server
First, find out what is required to recreate your build system. Start with a bare OS and identify the easy, known items like frameworks and toolsets. Visualize the process in your head and think about the different phases of Continuous Integration/Deployment you may support. Ask yourself, “What do I do to build, what do I use to test, what do I use to release?” This should help you identify many of the critical components and tools.
If possible, work in parallel with your existing build server. Once you’ve set up the pieces you remember, try to build. If it fails, investigate what is missing, and when you find it, script it if possible or document it. Repeat this process until everything is accounted for, scripted (preferably), or documented. I’d also recommend repeating this process over time for any of your other VMs that are not trivial to recreate. Consider having another member of the team follow the process to see how easily someone else could step in during your absence.
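The "try to build, find what is missing, script it, repeat" loop can be backed by a small check that you rerun after each pass. The following is only a minimal sketch in Python (our own environment uses PowerShell, so treat the language as a stand-in), and the tool names in `REQUIRED_TOOLS` are hypothetical placeholders rather than our actual build requirements.

```python
import shutil

# Hypothetical manifest: command-line tools this build is assumed to need.
# Replace these names with whatever your own build, test, and release steps call.
REQUIRED_TOOLS = ["git", "msbuild", "nuget"]


def find_missing_tools(required, which=shutil.which):
    """Return the tools from `required` that cannot be found on PATH."""
    return [tool for tool in required if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Still missing (script or document these next):")
        for tool in missing:
            print("  - " + tool)
    else:
        print("Every tool in the manifest was found on PATH.")
```

Each time a build failure reveals another dependency, add it to the manifest. Over time the manifest becomes a checked, living answer to “What’s on this box?”, at least for anything that shows up on the PATH.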
Tying it all together
Your build server probably mirrors much of what a developer workstation looks like. Consider adapting your script(s) to support the configuration of local workstations. This should drastically improve the onboarding process for new hires and also give existing developers a way to start with a ‘clean slate’ on their own workstations. As an added bonus, once your build server is mostly scripted, it will be that much easier to spawn multiple build servers to scale your capacity.
The quality of your infrastructure should evolve over time, like everything else you do. Assume that, as with most of the software you use professionally and personally, downtime will never come at a convenient time. Be better prepared for it by investing some time in your infrastructure. Doing so will give you a greater sense of security for when, not if, trouble rears its ugly head. When it’s done, share the info and get others involved. Do not let the process become tribal knowledge!
You might have heard me say before, “Treat [x] like you would your code”. Treat your infrastructure like you would your code! New features and bug fixes are great, but who cares if you can’t reliably get your software out the door?
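One way to share a script between build servers and workstations is to keep a common base list and add role-specific extras on top of it. The sketch below assumes that approach. The component names and the two roles are hypothetical placeholders for illustration, not an inventory of what we install.

```python
# Hypothetical component lists; the names are placeholders, not a real inventory.
BASE_COMPONENTS = ["dotnet-sdk", "git", "build-tools"]
ROLE_EXTRAS = {
    "build-server": ["ci-agent"],
    "workstation": ["ide", "debugger"],
}


def components_for(role):
    """Return the ordered install list for a role: shared base first, then extras."""
    if role not in ROLE_EXTRAS:
        raise ValueError("unknown role: " + role)
    return BASE_COMPONENTS + ROLE_EXTRAS[role]


if __name__ == "__main__":
    for role in sorted(ROLE_EXTRAS):
        print(role + ": " + ", ".join(components_for(role)))
```

Because both roles draw from the same base list, a dependency added for the build server reaches new workstations as well, and adding a second build server means running the same role again.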