At New Relic, we didn’t want experimental features waking us up in the middle of the night. Going straight from a staging environment to production is a bigger step than your feature might be ready for, especially if you’d rather not be paged while you sleep. To that end, we created a feature flag that ensured we were only testing in production when we could reasonably support our experimental features. Here’s how we got there, and why it will save your precious Z’s.
You need to test in production
So you have an awesome, new — albeit very experimental — feature that you’d like to test out. You’ve written automated tests, so you’re fairly confident, but you need to test your feature in a production setting to build real confidence. You create a feature toggle that’s read from the database per-request. You can just flip a switch, turn it on, and watch your traffic. See any errors? Flip it right back off, fix the problem, and re-enable it. Repeat ad infinitum.
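As a rough illustration of that kind of toggle, here is a minimal sketch in plain Ruby. The names `InMemoryFlagStore`, `FeatureToggle`, and `:new_sso` are hypothetical, and the Hash-backed store is only a stand-in for whatever database table your application would actually query. The point it shows is that the flag is looked up on every call, so a flip takes effect on the very next request.

```ruby
# Hypothetical stand-in for a database table of flags. A real application
# would query its own datastore here instead of a Hash.
class InMemoryFlagStore
  def initialize
    @rows = {}
  end

  def write(name, enabled)
    @rows[name.to_s] = enabled
  end

  def read(name)
    @rows.fetch(name.to_s, false) # unknown flags default to off
  end
end

class FeatureToggle
  def initialize(store)
    @store = store
  end

  # Reads the flag on every call (no caching), so flipping the switch
  # takes effect on the next request.
  def enabled?(name)
    @store.read(name) == true
  end
end

store  = InMemoryFlagStore.new
toggle = FeatureToggle.new(store)

store.write(:new_sso, true)   # flip it on
puts toggle.enabled?(:new_sso) # prints "true"

store.write(:new_sso, false)  # see errors? flip it right back off
puts toggle.enabled?(:new_sso) # prints "false"
```

Reading per request trades a little lookup cost for the ability to turn a feature off without a deploy, which is what makes the flip-fix-repeat loop practical.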
However, going from automated testing to production (even with testing in a staging environment) can be a big leap with experimental features. What if, for whatever reason, errors don’t start coming in until a while after you’ve turned the feature on? Nobody wants to get woken up by alerts at three in the morning to disable a broken feature, but, of course, this is simply the nature of running a SaaS application. Still … we wanted to narrow the gap between our staging environment and our production environment and reduce the probability that we would get called after hours for experiments.
Take authentication, for instance; it’s one of the most critical parts of a web application. If users can’t log into your application, it may as well be down. Last year at New Relic, we began work on our own single sign-on (SSO) solution to provide a more unified login experience for our users who will be on both RPM and our upcoming Software Analytics project. We were confident in launching our SSO feature, but we wanted to be available to react quickly if any issues occurred, and we believed that it might take a while for potential issues to surface.
Just turn it off at night
Rather than get notified of issues in the middle of the night when alertness is questionable, we thought, why don’t we just disable the service when we go home for the night? This ended up being a great idea for us; if something with the new SSO solution were to break, we would already be at work, awake and alert.
We could react very quickly and with our heads in the game.
We could disable the feature flag ourselves before going home for the day and flip it back on in the morning, but we’re programmers, and we automate things. Instead, we created a simple feature flag that restricts access to the feature to weekdays between 10:00AM and 4:00PM Pacific Time. Here’s a basic example in Ruby (plus ActiveSupport):
```ruby
require 'active_support/core_ext/time'

class BusinessHoursFeature
  # This feature is only available between the hours of 10am and 4pm
  def enabled?
    Time.use_zone('Pacific Time (US & Canada)') do
      now = Time.zone.now
      am, pm = Time.zone.parse('10:00'), Time.zone.parse('16:00')
      weekday = !(now.saturday? || now.sunday?)
      now.between?(am, pm) && weekday
    end
  end
end

module Features
  class NewRelicSingleSignOn < BusinessHoursFeature; end
end

Features::NewRelicSingleSignOn.new.enabled?
```
We decided to set the feature to enable itself at 10:00AM to give ourselves time to settle in at the office or arrive a bit late. Disabling the feature at 4:00PM would give us time to address any possible issues that could arise from falling back to our old authentication method. This ended up being an additional benefit of this feature flag: gaining confidence in deploying and rolling back the feature. In our case, we discovered added complexity due to the change of state resulting from the aforementioned fallback. If your feature involves complicated state changes, make sure you test turning your feature off and on again!
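One way to rehearse those off-and-on transitions without waiting for the clock is to inject the time source. The sketch below is a hypothetical variant, not the implementation we ran: `ClockedBusinessHoursFeature` and `login_path` are illustrative names, it uses plain Ruby rather than ActiveSupport, and it assumes the injected clock already returns a time in the desired local zone. It also treats the window as half-open, so the feature is off at exactly 4:00PM.

```ruby
# Hypothetical variant of a business-hours flag with an injectable clock.
# Assumption: the clock returns a Time already expressed in the local zone
# you care about (no time zone conversion happens here).
class ClockedBusinessHoursFeature
  OPEN_HOUR  = 10 # 10:00AM
  CLOSE_HOUR = 16 # 4:00PM

  def initialize(clock = -> { Time.now })
    @clock = clock
  end

  def enabled?
    now = @clock.call
    weekday = !(now.saturday? || now.sunday?)
    weekday && now.hour >= OPEN_HOUR && now.hour < CLOSE_HOUR
  end
end

# Hypothetical call site: fall back to the old path whenever the flag is off.
def login_path(feature)
  feature.enabled? ? :new_sso : :legacy_login
end

# Step a fake clock across both boundaries (2024-01-08 is a Monday).
times = [
  Time.new(2024, 1, 8, 9, 59),  # just before opening
  Time.new(2024, 1, 8, 10, 0),  # window opens
  Time.new(2024, 1, 8, 16, 0),  # window closes
]

times.each do |t|
  feature = ClockedBusinessHoursFeature.new(-> { t })
  puts "#{t.strftime('%a %H:%M')} -> #{login_path(feature)}"
end
# prints:
#   Mon 09:59 -> legacy_login
#   Mon 10:00 -> new_sso
#   Mon 16:00 -> legacy_login
```

With the clock injected, a test suite can walk a session through the fallback at 4:00PM and back again at 10:00AM, which is where the state-change complexity tends to hide.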
How it paid off
Thanks to this business hours feature flag, we greatly improved our ability to support the new SSO solution. Whereas we would typically have gone directly from a staging environment to our production environment, adding an interim step of production during the daytime gave us ample time to safely learn how to better support our new authentication method. We were able to determine that we had adequate reporting, and we became well practiced in quickly triaging and fixing issues. When problems arose (and they did), we were able to react quickly, without needing to be on call for an experimental feature. And we’ve never slept better.
Have you come up with other sorts of cool feature flags? Let me know over here in our Community Forum.