Performance tools like New Relic make it easy to find performance bottlenecks. But we’ve found that performance will often slowly degrade unless you’re paying a lot of attention, especially if you have a lot of people working in a particular codebase.
This list is a bunch of ideas for you to consider and experiment with. You can use these ideas to make sure your team is organized to keep your site fast.
1. Introduce a ‘Performance Hero’ role
A common problem in software teams is that nobody is paying attention to each deployment, or spending time addressing performance problems. If you have several people on a team, you can get a diffusion of responsibility, where everyone assumes someone else will take care of monitoring performance.
Many teams at New Relic have rotating “Hero” roles. The idea behind these roles is that we like to have people focus, as much as possible, on one thing at a time. Having a rotating role to handle things like support escalations, or monitoring and fixing performance issues, means that everyone else gets to focus more on their projects. So one thing we’ve used to address performance is to have a rotating Performance Hero.
So how does a performance hero role work? When you are the performance hero, you don’t work on other projects or bugs, and you just focus on performance issues for the week. You…
- look for performance regressions in each deployment. We record each deployment to New Relic, which gives you a handy report on the changes resulting from each deployment.
- look for bottlenecks. Fix whatever you can within the week, suggest bigger projects if necessary.
- fix bottlenecks. You work with other engineers on performance issues in their code. And you tune and address any bottlenecks you can during the week.
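To make the first item concrete, here is a minimal sketch of what a regression check around a deployment could look like. It is not how the New Relic deployment report works. The function name `regressed`, the 10% tolerance, and the sample response times are all assumptions chosen for illustration.

```python
from statistics import mean


def regressed(before, after, tolerance=0.10):
    """Return True if mean response time rose by more than `tolerance`.

    `before` and `after` are response times in seconds, sampled before and
    after a deployment. `tolerance` is a fraction (0.10 means 10%).
    All names and defaults here are hypothetical.
    """
    baseline = mean(before)
    return mean(after) > baseline * (1 + tolerance)


# Made-up samples: roughly 0.3s before the deploy, 0.4s after.
before_deploy = [0.30, 0.32, 0.29, 0.31]
after_deploy = [0.40, 0.41, 0.39, 0.42]

if regressed(before_deploy, after_deploy):
    print("Possible regression: investigate this deployment")
```

A mean is the simplest comparison. In practice you might prefer a percentile, or the Apdex score, since a few slow outliers can skew an average.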
We have experimented with this role a fair amount, and a couple of things that haven’t worked well for us are:
- rotating the role daily — this didn’t allow enough time to actually fix anything.
- monitoring performance, but not actively coding and working on performance issues — we found that the Hero focused on their project work too much and didn’t really address performance issues adequately.
There were some challenges in rotating weekly. Some things take longer than a week to fix, and there is a switching cost, so sometimes we would extend the role to two weeks. We also found pairing in this role to be pretty useful; some of our best sessions were paired. But there were also advantages to rotating: our site has been tuned for several years, so a lot of the low-hanging fruit has already been picked, and each engineer brings a different approach to what is left.
Philosophically, a nice part of the performance hero role is that it spreads the message that performance is everyone’s job.
2. Set up Dashboards and Custom Dashboards
Having a large dashboard in a public space helps raise visibility of performance issues and errors. I can’t tell you how many times people have walked by our big monitor and pointed out a site issue that we wouldn’t have otherwise noticed.
You can set up Custom Dashboards to include the charts you care about most, so everything is on one page. Or you can use something like the TabCarousel extension on Chrome to rotate between pages.
For those of you with iPads, download the new iPad app, and you can set up a little dashboard next to your workstation with the site’s performance on display throughout the day. The main page of the iPad app changes colors, and having a green screen is a sign everything is fine. If things turn yellow or red, investigate. Or have a Hero who gets the iPad each week.
3. Make monitoring a part of every feature
We are obsessed with dashboards and metrics, so we do a lot of custom metrics which report into their own custom dashboards. We have a checklist item for each feature which is to evaluate how we’ll know whether the feature is doing well and performing well.
4. Download the iOS app, and set up push notifications
If you’ve set up your alerts and turned off those that aren’t useful, you can receive push notifications when there are problems by using the iOS app. Make sure you set error thresholds, and take a few minutes to understand Apdex alerts, since they’re a good measure of the customer’s actual experience on your site.
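If Apdex is new to you, the score is simple to compute by hand. You pick a target response time T. Requests that finish within T count as satisfied, requests that take between T and 4T count as tolerating, and anything slower counts as frustrated. The score is (satisfied + tolerating / 2) / total. Here is a minimal sketch of that formula; the function name, the sample timings, and the 0.5 second threshold are assumptions for illustration, not values from our own setup.

```python
def apdex(response_times, t):
    """Compute an Apdex score for response times (seconds) against target t.

    satisfied:  response time <= t
    tolerating: t < response time <= 4 * t
    frustrated: everything slower (counts as zero)
    """
    if not response_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical samples: 2 satisfied, 2 tolerating, 1 frustrated.
samples = [0.2, 0.4, 0.6, 1.5, 3.0]
print(f"Apdex: {apdex(samples, t=0.5):.2f}")  # prints "Apdex: 0.60"
```

A score of 1.0 means every request was satisfied, and 0.0 means every request was frustrated, so an alert threshold on Apdex is really a threshold on how many users are having a bad time.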
5. Report on performance each week
At New Relic, we report on our projects once a week to the rest of the company. This lets anyone follow projects they are interested in. We’ve found that reporting on performance weekly forces us to evaluate any drop in performance at least once a week. That’s not ideal, but it does give you some visibility and forces you to pay attention.
6. Ask someone to report on performance at every team meeting
When performance is bad, have someone report to the team each week on how the site is performing. We reported on the End User Apdex, overall and for four key transactions (more on those in a second).
Celebrate improvements. Share knowledge of what worked, and what didn’t.
7. Set up Key Transactions and Alerting
Not every page on your site is equally important. We chose four pages we care the most about, and created Key Transactions for them. Key Transactions allow you to track and alert on the performance of a single transaction on your site. Depending on the size of your team, you may want to set up key transactions for key pages, and make the notifications go to individual teams. This helps ensure that they’re acted on.
The new Alert Policies are great for this. They let you set up a notification group for a team, and assign apps or key transaction alerts to go to that team. Plus, since you can set alerting thresholds for just that page, you can make it notify the team when alerts fire off.
We talk a lot about alert noise at New Relic. It’s really important to turn off alerts that aren’t actionable or useful.
8. Hire or train a Performance engineer
When companies grow large enough, a very common pattern is to hire Performance or Scalability Engineers.
The responsibilities for these engineers vary greatly from job to job, and from company to company. In some companies, they act sort of like QA engineers, monitoring performance and working with engineers to fix performance issues. In some companies, it’s all about tuning existing code. In others, it’s all about architecture.
If you’d like more concrete examples, we have two positions open right now at New Relic that highlight how different these positions can be from job to job. (Oh, and you should apply for them if they sound interesting!)
The mission for the first job is to make our site screaming fast, by any means necessary. This is a Rails-focused position, and we expect this to be a combination of:
- coding and rewriting code to improve performance.
- writing tools and improving our product to make it easier for the team to see the performance impact of their work.
- helping others to get better at performance.
- monitoring the performance and helping point out performance regressions and making sure engineers fix those issues.
As you can see, it’s sort of like a full-time Performance Hero role.
The second is Java-focused. The mission? Scale our data collection so we can handle hundreds of millions of connections per minute. This job is almost completely code- and architecture-focused. We receive an insane amount of traffic, so this position requires a lot of experience with high-throughput systems, and some knowledge of distributed systems.
P.S. If these sorts of experiments sound like the type of thing you think about, and you’re a manager with a lot of experience with high-throughput systems, we also have a manager position open.