Harden Your Accelerator Cluster

ElectricAccelerator (EA) dramatically accelerates software builds and tests by safely parallelizing jobs across shared clusters of physical or cloud CPUs. Although EA is resistant to agent failures, there are steps you can take to harden your cluster. Software builds have joined the ranks of other “mission critical” services. If you want to keep your cluster highly available and remove single points of failure from every component of your Accelerator deployment, read on to discover how.

The main reason for hardening any infrastructure is to eliminate single points of failure. The components that make up a cluster include:

  1. Several Agents
  2. One or more Emake machines
  3. A Database
  4. A Cluster Manager

Agent

Accelerator Agents run build commands on behalf of “emake” or the Cluster Manager (which allocates agents to active emake instances), and a cluster can have anywhere from a few agents to hundreds. Losing one or two agents should do no more than slow your builds, unless you lose all the agents in a build class. Ensure that each class is distributed across multiple physical machines to reduce the risk of total failure. Provisioning new machines is simple using VMs and cloning, or Puppet/Chef-like tools.
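One way to check the distribution advice above is to audit your agent inventory. The following is a minimal sketch, assuming you can export a list of (agent name, build class, physical host) records. The record layout, the sample data and the function name `classes_at_risk` are hypothetical and are not part of ElectricAccelerator.

```python
from collections import defaultdict


def classes_at_risk(agents):
    """Return build classes whose agents all sit on one physical host.

    `agents` is an iterable of (agent_name, build_class, physical_host)
    tuples. This layout is an assumption made for illustration.
    """
    hosts_by_class = defaultdict(set)
    for _name, build_class, host in agents:
        hosts_by_class[build_class].add(host)
    # A class backed by a single host is lost entirely if that host fails.
    return sorted(c for c, hosts in hosts_by_class.items() if len(hosts) < 2)


# Hypothetical inventory: "linux" spans two hosts, "windows" only one.
inventory = [
    ("agent-1", "linux", "hostA"),
    ("agent-2", "linux", "hostB"),
    ("agent-3", "windows", "hostC"),
    ("agent-4", "windows", "hostC"),
]
print(classes_at_risk(inventory))
```

Any class the function reports would disappear completely if its one host went down, so those are the classes to spread out first.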

Emake machine

As with agents, you can have more than one emake machine: each developer can have their own emake client, or have access to a pool of emake clients from which to start builds. I use ElectricFlow (our end-to-end Continuous Delivery automation solution, built on top of ElectricCommander) to automate the command-line invocation of emake. I encourage you to use the same configuration to orchestrate your builds, and to create a pool of emake machines that Flow can pick from automatically. This lets you schedule a build to start on any one of the emake machines in your pool; once started, the build is distributed across the EA cluster. The pool keeps a single emake client failure from affecting a developer’s ability to run a build and keeps the build environment up and available, a common practice when using both Commander and Accelerator.
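Flow makes this selection for you, but the idea behind a pool is easy to illustrate. Here is a minimal sketch, assuming a list of emake host names and a health check that you supply; the host names and the function `pick_emake_host` are hypothetical and do not reflect how Flow is implemented.

```python
def pick_emake_host(pool, is_healthy):
    """Return the first host in `pool` that passes the health check.

    `is_healthy` is any callable taking a host name and returning a bool,
    for example a ping or SSH probe. Raises RuntimeError if none respond.
    """
    for host in pool:
        if is_healthy(host):
            return host
    raise RuntimeError("no emake machine in the pool is available")


# Hypothetical pool in which the first machine is down.
pool = ["emake-01", "emake-02", "emake-03"]
down = {"emake-01"}
print(pick_emake_host(pool, lambda host: host not in down))
```

As long as one machine in the pool answers, a build can still be started, which is the whole point of not tying developers to a single emake client.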

Database

I follow the best practice of running the database (DB) on its own server, which improves performance. Choosing a high-availability configuration for the database that stores ElectricAccelerator data is also advised. Consult your DB administrator or the documentation of your specific database for more details.

Cluster Manager

There is no specific HA configuration for this important component; however, there are best practices that help minimize your exposure to any single point of failure.

Harden your server

Treat your Cluster Manager (CM) as you would any mission critical IT server:

  1. Use your server only as a CM. Keep non-essential software to a minimum.
  2. Use a server-class machine with dual power supplies and error-correcting (ECC) RAM.
  3. Monitor the vitals of your server with a monitoring solution such as Nagios (or a similar product) so you are alerted to impending issues such as excessive CPU or memory utilization.
  4. Have plenty of disk space, and use RAID if possible.
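If you do not yet have a monitoring product in place, even a small script can watch the basics. The following is a minimal sketch using only the Python standard library, and it follows the Nagios plugin convention of status 0 for OK, 1 for WARNING and 2 for CRITICAL. The thresholds and function names are placeholders chosen for illustration, and `os.getloadavg` is available on Unix-like systems only.

```python
import os
import shutil

# Nagios plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(value, warn, crit):
    """Map a measurement to a Nagios-style status code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    """One-minute load average divided by the CPU count (Unix only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)


def check_server(path="/"):
    """Return the worst status across the disk and CPU-load checks."""
    # Thresholds are placeholders; tune them for your own server.
    disk = classify(disk_used_percent(path), warn=80, crit=90)
    load = classify(load_per_cpu(), warn=1.0, crit=2.0)
    return max(disk, load)


print(LABELS[check_server()])
```

A real Nagios plugin would also exit with the status code it computed; the sketch only prints the label so that it is easy to try by hand.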

Active-Passive

An active-passive configuration provides a fully redundant instance of each node. The passive node is brought online only when its associated primary node fails. This configuration typically requires more hardware than most other configurations.

The easiest way to create an active-passive configuration is to clone the VM or physical machine, keeping the same name and IP address. Be sure to keep the following files in sync if you modify them after the original cloning:

  • conf/passkey
  • conf/keystore
  • conf/database.properties
  • apache/conf/server.key
  • apache/conf/server.csr
  • apache/conf/server.crt
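A quick way to confirm that the files listed above still match is to compare checksums. Here is a minimal sketch, assuming both install directories are reachable from the machine running the script (for example, the standby's copy is mounted or has been copied locally). The function names are hypothetical, and the root directories are whatever your installation uses.

```python
import hashlib
from pathlib import Path

# Files named in the list above, relative to the install directory.
SYNC_FILES = [
    "conf/passkey",
    "conf/keystore",
    "conf/database.properties",
    "apache/conf/server.key",
    "apache/conf/server.csr",
    "apache/conf/server.crt",
]


def digest(path):
    """SHA-256 hex digest of a file, or None if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def out_of_sync(primary_root, standby_root, files=SYNC_FILES):
    """Return the relative paths whose contents differ between the roots.

    A file missing on one side only counts as a difference; a file
    missing on both sides is treated as in sync.
    """
    return [
        rel for rel in files
        if digest(Path(primary_root) / rel) != digest(Path(standby_root) / rel)
    ]
```

Calling `out_of_sync` with the two install roots returns an empty list when the standby is current, and otherwise names the files that need to be copied over.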

Have your second machine or VM on standby. When the first machine dies, simply fire up the second machine.

What happens during the transition

Of course, until the second server is active, you won’t be able to start new builds because there is no CM to submit them to.

However, builds that are already running will simply continue, although they will not be able to acquire new agents, because agent allocation is the CM’s job.

Builds that finish while the CM is down won’t be able to upload their statistics, but they are not otherwise affected.

When your second server is ready, new agents can be allocated again and in-flight builds will reconnect transparently.

While this configuration does not eliminate downtime completely, our goal is to minimize it as much as possible.

This is made possible because the database has been separated from the CM, ensuring that no data is lost.
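To know when the standby is ready to accept builds again, a script can poll it after you start it. The following is a minimal sketch, assuming the CM can be reached over TCP. The host name `cm.example.com` and port 443 in the usage comment are placeholders, and the function names are hypothetical.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_up(probe, attempts=30, delay=10.0, sleep=time.sleep):
    """Call `probe()` until it returns True; return the attempt number.

    Returns None if the service never answered within `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Usage (placeholder host and port, not executed here):
#   wait_until_up(lambda: tcp_probe("cm.example.com", 443))
```

An open port only shows that the server is listening, so treat a successful probe as a cue to verify the CM itself rather than as proof that it is fully operational.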

Conclusion

With these few steps, you can make your ElectricAccelerator cluster more resilient. If you have used such techniques or something different to harden your cluster, let me know so I can share those experiences with more customers.

Laurent Rochette

Laurent Rochette is a Professional Service Engineer with Electric Cloud. He trains customers on our products and helps them with deployments and consulting so that they can use our products effectively. Prior to joining Electric Cloud, Laurent served as an IT Architect at Mentor Graphics. Laurent holds a Master’s degree in Computer Science from the Grenoble Polytechnic Institute (INPG) in France.
