Introducing dPerf – A Distributed Low Impact CPU Profiler for iOS

A few months ago, we launched an awesome new product – New Relic for Mobile Apps. As with all the products we develop, we wanted it to have a very low footprint on the apps it monitors. We needed a way to measure the impact our Mobile SDK has on our customers’ apps and ensure this impact was as close to zero as possible.

Existing tools weren’t a good fit. Our actual profile kept getting lost in the noise of tools like Instruments.app. We quickly realized we needed something extremely lightweight. And since our Mobile SDK runs on many different types of devices, we knew we needed a tool that could be distributed easily across a device test lab (such as the one our friends at AppThwack have built).

With those goals in mind, we developed dPerf – a small on-device profiling framework. Metrics collected by the profiler are pushed to a Node.js service with JSON over REST and are stored in a MongoDB collection. The Node service then provides a view that renders the profile data with Google Charts.

And here’s the best part: New Relic is releasing dPerf as an open source project! It can be found in the dPerf New Relic repository on GitHub. Read on to learn more about how it works.

The New Relic iOS SDK
Before we dive into the technical details of dPerf, let’s talk a little bit about the New Relic iOS SDK.

Our approach to application monitoring has two main phases: instrumentation and reporting. The instrumentation phase collects data about network activities, such as NSURLConnection delegate method calls, into a memory buffer. Then once per minute, the reporting phase sends the contents of the memory buffer to the New Relic server.

It’s important to capture profiles for both phases of execution. This is a pretty easy task. (The profiler just needs a sane clock and to run for more than one minute.) If we know the start times for the application, profiler, agent and agent reporting phase, we can easily identify samples from the different phases.
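
For example, a minimal sketch of that bookkeeping might look like the following. The enum and function names here are illustrative, not taken from the dPerf source:

[code language="objc"]
#import <Foundation/Foundation.h>

// Hypothetical sketch: classify a sample by comparing its timestamp
// with the known phase start times.
typedef NS_ENUM(NSInteger, DPPhase) {
    DPPhaseAppStartup,        // before the agent starts
    DPPhaseInstrumentation,   // agent collecting into the memory buffer
    DPPhaseReporting          // agent sending the buffer to New Relic
};

static DPPhase DPPhaseForSample(NSTimeInterval sampleTime,
                                NSTimeInterval agentStartTime,
                                NSTimeInterval reportingStartTime)
{
    if (sampleTime < agentStartTime)     return DPPhaseAppStartup;
    if (sampleTime < reportingStartTime) return DPPhaseInstrumentation;
    return DPPhaseReporting;
}
[/code]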

Profiling Strategy
The general strategy for profiling a system is universal: sample the system at regular intervals. Any overhead incurred by the sampling process itself should be calculated and corrected by obtaining an average profile for the sampling process over a large number of runs, and subtracting this time from the samples.

The sampling interval is a critical parameter for a sampling process. Sampling theory (specifically, the Nyquist–Shannon sampling theorem) states that we can only accurately sample a process operating at a rate of R by sampling it at a minimum rate of 2R. This minimum rate, 2R, is known as the Nyquist rate. For example, to produce an accurate profile of a system performing an operation once per second, we must sample it twice per second. Sampling at a lower rate misses detail and introduces statistical noise.

Ideally, we would sample as fast as possible, but higher sampling rates consume more system resources, which defeats the purpose of a lightweight profiling system. The critical task, then, is identifying a sampling rate that captures all relevant information while minimizing the impact of the sampling process itself.

We’ve tested sampling rates between 10 and 100 samples per second, and found that a rate between 10 and 50 samples per second is enough for most apps that do a lot of network activity. Sampling at a higher rate reveals some interesting ‘jitter’ from external processes such as push notifications, SMS and the like.

In this project, the sampled value is the total CPU time used by the process across all threads, which is an ever-increasing value. By computing the changes between values, we can establish the relative CPU used between sampling periods.
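
As a sketch, the per-period calculation is just a subtraction. Here, _previousCpuTime and _samples are assumed instance variables, and cpuTime is the sampling method shown later:

[code language="objc"]
// Hypothetical sketch: cumulative CPU time only ever increases, so the
// CPU consumed during a sampling period is the difference between
// consecutive samples. _previousCpuTime and _samples are assumed ivars.
- (void)sample
{
    double current = [self cpuTime];   // cumulative CPU seconds, all threads
    double usedThisPeriod = current - _previousCpuTime;
    _previousCpuTime = current;
    [_samples addObject:@(usedThisPeriod)];
}
[/code]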

What About Instruments.app?
Instruments.app is an amazing application. In the majority of cases it’s more than adequate for profiling applications: its memory allocation tracking and reference cycle graphs are great tools for memory profiling. Its value is much more limited for CPU profiling, however, especially on-device.

Instruments.app is great if you want to aggregate the time spent in a particular method, context switch or system call. Unfortunately, it’s also resource intensive. The performance of method calls will still be accurate relative to one another, but the profile isn’t at all indicative of application performance on-device. Additionally, some handy tools, such as the Dispatch instrument, aren’t available when you attach to a process on-device.

Also, Instruments.app doesn’t make it easy to export raw metric data. Let’s say you’re trying to analyze dozens of profiled runs across dozens of devices. You want a simple and automated solution to compare the results.

On-Device Profiling Client
The profiler client has two main components: the periodic timer and the value function that’s being sampled.

Creating Fast Timers
iOS provides a few options for creating periodic timers. The NSTimer class is the most straightforward way to create a timer, and it works reasonably well for long intervals, such as a task that runs every second or more. For sub-second timers, though, NSTimer becomes less than ideal. According to Apple’s documentation, NSTimer resolution is limited to around 50 – 100ms, and timers operating at this magnitude can experience a large amount of variability in when they actually fire.
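
For comparison, a sketch of the NSTimer version of a sampling timer looks like this (the _sampleRate ivar and the sample method are assumptions for the sketch):

[code language="objc"]
// Straightforward, but at sub-second intervals the firing times drift.
_timer = [NSTimer scheduledTimerWithTimeInterval:(1.0 / _sampleRate)
                                          target:self
                                        selector:@selector(sample)
                                        userInfo:nil
                                         repeats:YES];
[/code]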

The better choice for high resolution timing is to use Grand Central Dispatch (GCD) source timers. These timers are extremely accurate and add a fairly small amount of system load even at intervals around 10ms (100 samples per second). Dispatch source timers also fire at more regular intervals and have less variability than NSTimers. If the application doesn’t require precise timing, you can specify a ‘leeway’ argument when you create dispatch source timers to allow for more variability. For example, a timer can be created with an interval of one minute and leeway of five seconds. In this case, the timer will fire every 60 +/- 5 seconds.

Apple’s Concurrency Programming Guide provides an example of creating dispatch source timers:

[code language="objc"]
dispatch_source_t CreateDispatchTimer(uint64_t interval,
                                      uint64_t leeway,
                                      dispatch_queue_t queue,
                                      dispatch_block_t block)
{
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER,
                                                     0, 0, queue);
    if (timer)
    {
        dispatch_source_set_timer(timer, dispatch_walltime(NULL, 0), interval, leeway);
        dispatch_source_set_event_handler(timer, block);
        dispatch_resume(timer);
    }
    return timer;
}
[/code]

Using this timer is straightforward:

[code language="objc"]
- (void)start
{
    _dispatchTimer = CreateDispatchTimer((1.0 / _sampleRate) * NSEC_PER_SEC,
                                         0,
                                         dispatch_get_main_queue(),
                                         ^{
                                             [self sample];
                                         });
}
[/code]

This method creates a dispatch timer that fires _sampleRate times per second. We want it to be as accurate as possible, so a leeway of zero is given. The timer operates on the main queue and invokes a block that calls the ‘sample’ method.

Sampling CPU Usage
Now that we have a fast and reliable timer, we can turn our attention toward interesting values to sample. We’re currently interested in application speed, so we need a good source for CPU utilization data.

iOS provides a great deal of low-level information through the Mach kernel task API, which is defined in mach/task.h. Apple provides little documentation for these functions on iOS, but the same interfaces exist in other Mach-based kernels, including the XNU kernel that underpins Darwin and OS X. These functions behave in more or less the same way across Mach kernels, so documentation and code examples using this API are highly portable to iOS.

To find the aggregate CPU utilization for the application, we perform the following actions:

1. Call mach_task_self to obtain the current task.

2. Call task_info to retrieve basic info about the task including CPU time from terminated threads.

3. Call task_threads to retrieve an array of active threads.

4. For each thread, call thread_info to retrieve its accumulated CPU time, and add it to the total.

Since these are low level kernel interfaces, take care when you allocate and de-allocate memory for the various structures.

The method below is a slightly modified version of a method shared in a Stack Overflow answer:

[code language="objc"]
#import <assert.h>
#import <mach/mach.h>

- (double)cpuTime
{
    task_t task;
    kern_return_t error;
    mach_msg_type_number_t count;
    thread_array_t thread_table;
    thread_basic_info_t thi;
    thread_basic_info_data_t thi_data;
    mach_msg_type_number_t table_size;
    unsigned table_array_size;
    struct task_basic_info ti;
    double total_time;

    task = mach_task_self();
    count = TASK_BASIC_INFO_COUNT;
    error = task_info(task, TASK_BASIC_INFO, (task_info_t)&ti, &count);
    if (error != KERN_SUCCESS) {
        return -1;
    }
    {
        unsigned i;

        //
        // the following times are for threads which have already terminated and gone away.
        //
        total_time = ti.user_time.seconds + ti.user_time.microseconds * 1e-6;
        total_time += ti.system_time.seconds + ti.system_time.microseconds * 1e-6;

        error = task_threads(task, &thread_table, &table_size);

        //
        // failed to retrieve thread list: we can't proceed any further.
        //
        if (error != KERN_SUCCESS) {
            error = mach_port_deallocate(mach_task_self(), task);
            assert(error == KERN_SUCCESS);
            return -1;
        }

        thi = &thi_data;

        // size of the out-of-line port array returned by task_threads
        // (sizeof(thread_t), not sizeof(thread_array_t), which is a pointer type)
        table_array_size = table_size * sizeof(thread_t);

        //
        // for each active thread, add up thread time
        //
        for (i = 0; i < table_size; ++i) {
            count = THREAD_BASIC_INFO_COUNT;
            error = thread_info(thread_table[i], THREAD_BASIC_INFO, (thread_info_t)thi, &count);

            //
            // if the thread_info call fails, clean up and fail hard.
            // partial results are probably useless.
            //
            if (error != KERN_SUCCESS) {
                for (; i < table_size; ++i) {
                    error = mach_port_deallocate(mach_task_self(), thread_table[i]);
                    assert(error == KERN_SUCCESS);
                }

                error = vm_deallocate(mach_task_self(), (vm_offset_t)thread_table, table_array_size);
                assert(error == KERN_SUCCESS);

                error = mach_port_deallocate(mach_task_self(), task);
                assert(error == KERN_SUCCESS);

                return -1;
            }

            //
            // otherwise, accumulate & continue.
            //
            if ((thi->flags & TH_FLAGS_IDLE) == 0) {
                total_time += thi->user_time.seconds + thi->user_time.microseconds * 1e-6;
                total_time += thi->system_time.seconds + thi->system_time.microseconds * 1e-6;
            }

            error = mach_port_deallocate(mach_task_self(), thread_table[i]);
            assert(error == KERN_SUCCESS);
        }

        //
        // deallocate the thread table.
        //
        error = vm_deallocate(mach_task_self(), (vm_offset_t)thread_table, table_array_size);
        assert(error == KERN_SUCCESS);
    }
    if (task != mach_task_self()) {
        error = mach_port_deallocate(mach_task_self(), task);
        assert(error == KERN_SUCCESS);
    }
    return total_time;
}
[/code]

Who Profiles the Profiler?
Adding a profiler to an application incurs some unavoidable overhead, but we can correct for it with benchmarking: calculate the average time it takes to perform the profiling operations over repeated trials, then use that average to correct the sampled data. This technique should be familiar to many low-level and embedded programmers.

The strategy is simple:

1. Record a timestamp at the start of benchmarking.

2. Perform N trials of the benchmarked operation. This should be a tight loop that does nothing but execute the operation.

3. Record a timestamp at the end of benchmarking.

4. Calculate elapsed time by subtracting the start timestamp from the end timestamp.

5. Divide this elapsed time by N (number of trials) to obtain an average time per operation.

A generalized benchmarking method could be written like this:

[code language="objc"]
#import <mach/mach_time.h>

- (uint64_t)benchmarkWithTrials:(uint)trials andBlock:(dispatch_block_t)block
{
    // call the block first, warming up any method call table lookups
    block();

    uint64_t start = mach_absolute_time();
    for (uint t = 0; t < trials; t++) {
        block();
    }
    uint64_t elapsed = mach_absolute_time() - start;

    // mach_absolute_time() counts in CPU-dependent units;
    // convert to nanoseconds before averaging.
    mach_timebase_info_data_t timebase;
    mach_timebase_info(&timebase);
    return (elapsed * timebase.numer / timebase.denom) / trials;
}
[/code]

Fortunately, the function calls to retrieve the task and thread info are very fast. On an iPhone 5, gathering CPU information for all threads takes on average about 150 microseconds.
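
As a usage sketch, calibrating the sampling overhead with the benchmark method above might look like this:

[code language="objc"]
// Hypothetical calibration: measure the average cost of one -cpuTime
// call so it can later be subtracted from the profile.
uint64_t overheadNanos = [self benchmarkWithTrials:1000 andBlock:^{
    [self cpuTime];
}];
NSLog(@"average sampling overhead: %llu ns", overheadNanos);
[/code]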

Profiling Server
At this point, we have a tool that can record CPU usage efficiently at regular intervals. What should we do with this data?

We could have printed sample values to a console log for analysis, or even written a small UI to display the profile after the profiling session was over. But a goal of this project was to enable efficient profiling across a variety of devices, and scraping dozens of log files isn’t a scalable solution.

Instead, we wrote a basic client / server architecture. The profiling client buffers sampled data until the profiling session is completed. The data is then translated to JSON and sent to a RESTful Node service over HTTP. Along with the sample data, the JSON structure includes information about the device, a unique identifier for the profiling session, sampling rate and the name of the test performed.

This allows multiple devices to run a variety of profiling tests while reporting back to the same server. Decoupling the system in this way provides a lot of scalability and flexibility in the distribution and timing of tests.
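
To illustrate, the client-side upload might be sketched as follows; the field names and endpoint URL are assumptions for the example, not the exact dPerf wire format:

[code language="objc"]
// Hypothetical sketch of the upload; field names and the endpoint URL
// are illustrative, not the exact dPerf wire format.
NSDictionary *payload = @{
    @"sessionId":  [[NSUUID UUID] UUIDString],
    @"device":     [[UIDevice currentDevice] model],
    @"osVersion":  [[UIDevice currentDevice] systemVersion],
    @"sampleRate": @(_sampleRate),
    @"testName":   @"idle",
    @"samples":    _samples      // NSArray of NSNumber CPU deltas
};

NSData *json = [NSJSONSerialization dataWithJSONObject:payload options:0 error:NULL];
NSMutableURLRequest *request =
    [NSMutableURLRequest requestWithURL:[NSURL URLWithString:@"http://example.com/profiles"]];
request.HTTPMethod = @"POST";
[request setValue:@"application/json" forHTTPHeaderField:@"Content-Type"];
request.HTTPBody = json;
[NSURLConnection sendAsynchronousRequest:request
                                   queue:[NSOperationQueue mainQueue]
                       completionHandler:^(NSURLResponse *response, NSData *data, NSError *error) {
                           // a real client would check for errors and retry here
                       }];
[/code]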

Not including static content, the Node server is made up of only about 70 lines of code. It has two jobs:

* Store incoming data into a MongoDB collection.

* Serve that data up using Google Charts.

The server handles data posts from profilers and generates charts and data sets for analysis. The following modules are used to accomplish this:

* mongodb: the Node driver for MongoDB

* Express: an amazing web framework for REST and MVC mapping

* Jade: powerful templating for dynamic views

* Stylus and Nib: CSS preprocessing

* Google Charts: Google’s JavaScript library for easy chart generation


Distributed Testing with AppThwack
One of the major challenges of mobile development is ensuring compatibility with the wide variety of devices available on the market. iOS currently runs on 16 different Apple hardware platforms, and Android runs on several hundred. Trying to build a test lab to cover all these permutations, or to perform tests on each device by hand, would be both cost-prohibitive and time-consuming.

AppThwack is a young company with an innovative approach to mobile testing. They have an automated testing platform that deploys applications to hundreds of real devices, runs tests and returns all sorts of interesting information. Along with reporting test failures, AppThwack also gathers screenshots and log files from all the devices.

With the AppThwack platform, we were able to deploy dPerf to multiple devices easily. When each device finished its test, it sent the samples to our central test server.

Results
After all of the pieces were in place, it was time to run some tests. To establish a baseline, we first tested an application that was completely idle except for running the New Relic agent and the dPerf profiler. Technically, we also ran it without the New Relic agent to get a true baseline, but an empty chart is not very interesting.

[Chart: Idle test, iPhone, iOS 6.1.4]

This chart clearly shows the New Relic agent reporting after exactly one minute. It’s pretty fast, too!

When this same idle test was run on the AppThwack platform, the results were a bit different. One testing option AppThwack provides is to send several hundred random events during the testing period. This caused the ‘idle’ moments in the application to appear more jittery than what we saw in the first test, which essentially had no events being fired. We also saw some of the AppThwack initialization during application start, and the beginning of the random event stream. The New Relic reporting call was still visible when its timer fired after one minute.

[Chart: Idle test on AppThwack, iPhone, iOS 6.1.3]

Next, we checked the behavior of an application generating a bit of network traffic. This test performed an HTTP request to the highly useful httpstat.us site every second. We began with a plain old HTTP 200 OK response.

[Chart: 200 1rpm test, iPad]

We definitely saw a periodic process here, mixed in with the ‘noise’ from the random event stream. It’s interesting to note that even at this point in testing, the New Relic agent used so little CPU that it was indistinguishable from the other network traffic.

Ramping up the network traffic to five requests per second, we saw a familiar pattern. The chart looked essentially the same as the one-request-per-second test, but the periodic process was multiplied. The New Relic reporting call was somewhat visible here as a ‘thicker’ line around 60 seconds.

[Chart: 200 5rpm test, iPad]

Next we moved on to an application that used more CPU resources. The chart below shows an application that queries the U.S. Geological Survey website for a GeoJSON data set of all the magnitude 2.5 earthquakes that occurred in the last month. After each request completed, the GeoJSON was decoded. This process repeated every five seconds. The JSON payload was quite large — roughly 600 kilobytes — so this was also a good test of handling large HTTP responses.

[Chart: USGS parse test]

Here the periodic process was quite visible, and the ‘thickness’ of the spikes showed that the parsing took longer to complete, about a second or so.

The final verdict was that the New Relic SDK added only a tiny bit of overhead — about 10 – 20ms on average. And this overhead was only detectable in an application that was mostly idle, which is a rare case indeed.

Source Code
Again, if you’re interested in trying out dPerf for yourself, you can find the source for the profiling system in the dPerf New Relic repository on GitHub. We hope you find it as useful as we did. Feel free to use and enhance this tool to suit your specific needs. Got some cool changes that you want to share? Send us a pull request!

jared@newrelic.com

As an engineer on the New Relic Mobile Team, Jared has the pleasure of running Java on everything from ancient Android phones to giant data processing services.
