ElectricAccelerator Agent Metrics, part 1

Welcome to the Electric Cloud Blog! My name is Eric Melski, and I’m a Senior Software Engineer at Electric Cloud (employee #1, by a whisker). I’ve worked on every component of ElectricAccelerator, and I’m now Technical Lead for the product. I’m also the guy behind ElectricInsight. In this and future posts, I will take you on deep technical dives into Accelerator and Insight, with a focus on performance, scalability and analysis.

Understanding the performance of parallel, distributed systems can be difficult. Fortunately, Accelerator provides a wealth of data to facilitate performance analysis. In this post, the first in a series based on my presentation at the 2008 Customer Summit, I’ll show you how to collect agent performance metrics and how to use them to get a quick overview of the performance characteristics of your build.

Collecting Agent Metrics

Every agent in the cluster automatically tracks performance metrics for every build that it participates in, so you don’t have to do anything special to enable them. All you need to do is download them from the agents after your build has finished. Note that by default, the agent keeps performance data for the previous 100 builds that it participated in, so you don’t have to grab the metrics immediately after your build completes.

The easiest way to get the metrics from the cluster is via cmtool:
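For example, the invocation has roughly the following shape (the subcommand and flag names here are assumptions for illustration, not the documented cmtool interface; consult cmtool’s built-in help for the exact syntax):

```shell
# Hypothetical sketch -- subcommand and flag names are illustrative.
# Fetch raw agent metrics for build 1234 from the cluster manager "mycm".
cmtool --cm=mycm getAgentMetrics 1234 > 1234.agentraw
```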

This will write a dump of performance metrics from each agent that participated in the specified build to the file named buildId.agentraw. Next, you’ll want to use the (unsupported) agentsummary utility to produce an aggregate summary of the metrics across all agents:
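A sketch of that step (agentsummary is unsupported, so its actual options may differ; treat the file names and invocation as illustrative):

```shell
# Aggregate the raw per-agent metrics into a single summary.
# Invocation is a sketch; check the utility's usage text for specifics.
agentsummary 1234.agentraw > 1234.summary
```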

Now we have something that we can dig into.

Overall time usage

The first thing to look at in the agent performance metrics is the “Overall time usage” section, which divides the time used by the agents into several coarse-grained buckets, one per timer.

These timers correspond to the following activities:

Startup: Waiting for the first commands from emake after the connection is established.
Cmd setup: Preparing the execution context (environment, working directory, etc.) for a build command.
Ver updates: Processing file version updates from emake.
Command: Running build commands (compiles, links, etc.).
Emake request: Waiting for emake to service a request for file data or metadata.
A2A request: Waiting for file data transferred from other agents via P2P.
Command end: Waiting for instructions from emake after finishing execution of a build command.
Return: Reporting file usage to emake, including the contents of any files created.
Idle: Waiting for emake to provide more commands to run.
End: A special case of Idle time; the amount of time between the last build command and the end of the agent’s participation in the build.

For each timer, the metrics show:

  • The total time spent in the timer across all agents;
  • The portion of time spent in the timer as a percentage of the total time across all agents;
  • The number of times the timer was active;
  • The average amount of time the timer was active each time it was in use.

Note that not all timers are available in all versions of Accelerator.

Of these timers, Command is perhaps the most interesting, because it represents the actual work performed during your build. The other timers all correspond to inefficiency of some sort, whether due to the overhead introduced by running a distributed build across the network, or due to serializations in the architecture of the build itself.

One interesting factoid about the Command timer is that you can use its percentage to estimate the X factor, or speedup relative to a serial build. Simply multiply the Command percentage by the number of agents that participated in your build. For example, if the Command timer represents 65% of the total time, and your build used 8 agents, then the X factor is about 5.2x:
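The arithmetic for that example can be sketched in one line (65% Command time, 8 agents):

```shell
# Estimated X factor = Command percentage x number of agents.
# 0.65 * 8 = 5.2, i.e. roughly a 5.2x speedup over a serial build.
awk 'BEGIN { printf "%.1fx\n", 0.65 * 8 }'   # prints "5.2x"
```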

Two caveats about this value: first, it’s only an estimate. In my experience it seems to be reasonably accurate in the back-of-the-envelope sense, but it’s by no means perfect. Second, it only really works if the agents are running on hardware comparable to the system you would have used for a serial build. If the agents are slower or faster than the baseline host, that will skew the number correspondingly.

Are my values “good”?

Of course the whole point of looking at these timers is to determine if your build is performing well, or if there is room for improvement. Therefore, we need to establish some guidelines for evaluating the values in these timers. To that end, here are some very rough guides:

Timer           Good     Acceptable   Warning!
Command         > 60%    50% – 60%    < 50%
Emake request   < 10%    10% – 15%    > 15%
Idle/End        < 10%    10% – 20%    > 20%
Return          < 10%    10% – 20%    > 20%
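These thresholds are rough guides, but they are easy to script against. A minimal sketch for the Command row (the function name is mine, and it assumes an integer percentage):

```shell
# Classify a Command-timer percentage against the rough
# thresholds in the table above. Hypothetical helper; integer input.
classify_command_pct() {
  if [ "$1" -gt 60 ]; then
    echo "Good"
  elif [ "$1" -ge 50 ]; then
    echo "Acceptable"
  else
    echo "Warning!"
  fi
}

classify_command_pct 65   # prints "Good"
```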

Remember, Command time represents all the time spent doing the actual work in your build. Ideally, we’d like to see that at 100%, but of course that target is unattainable — if nothing else, there is always overhead introduced simply because we have to move data across the network. Still, it’s not uncommon to see builds with Command time representing 80% or more of the total time. Anything over 60% is pretty good, really. Less than 50% means there’s probably room for improvement. That’s not a guarantee that you can do better, of course. It may simply be the nature of the beast, or you may find that the cost of improving the performance outweighs the benefit. If you do see a low percentage for Command time, the thing to do is look at the other timers to determine where the time is going instead. That will in turn provide some guidance to the root cause of the problem.

For example, if the Emake request timer is high (greater than 15%), that indicates a problem with the communication between emake and the agents. The things to check next in this case are:

  • Network bandwidth between emake and the agents.
  • System load on the emake host.
  • Disk performance on the emake host.
  • Number of agents participating in the build (too many agents can cause emake to thrash).

Another possibility is that the Idle and End timers are high (more than 20% combined). This usually indicates that your build has significant serializations, or that it had a large number of conflicts. Use ElectricInsight to investigate further.

Finally, if the Return timer is high (greater than 20%), you’ll want to check:

  • Network bandwidth between emake and the agents.
  • Disk performance on the emake host.
  • Disk performance on the agent host.

It’s important to note that in general, any one metric will not exactly pinpoint a specific problem; rather, each metric should be thought of as a diagnostic tracer. The more tracers you combine, the more precise your diagnosis will be.

Next time

There is lots of other interesting data in the agent performance metrics which you can use to further analyze build performance. In my next post, we’ll look at the Usage records section.


Eric Melski

Eric Melski was part of the team that founded Electric Cloud and is now Chief Architect. Before Electric Cloud, he was a Software Engineer at Scriptics, Inc. and Interwoven. He holds a BS in Computer Science from the University of Wisconsin. Eric also writes about software development at http://blog.melski.net/.
