ElectricAccelerator dramatically accelerates software builds and tests by safely parallelizing jobs across shared clusters of physical or cloud CPUs. Although EA is resilient to agent failures, there are steps you can take to harden your cluster. Software builds have joined the ranks of other "mission critical" services. If you want to keep your cluster highly available and eliminate single points of failure across every component of your Accelerator deployment, read on to discover how.
The main reason to harden any infrastructure is to eliminate single points of failure. The components that comprise a cluster include:
- One or more Emake machines
- A Database
- A Cluster Manager
Accelerator agents, which run build commands on behalf of emake, can range from just a few to hundreds; the Cluster Manager allocates those agents to active emake instances. Losing one or two agents should not affect your cluster beyond slowing builds, unless you lose all the agents in a build class. Ensure that each class is distributed across multiple physical machines to reduce the risk of total failure. Provisioning new machines is simple using VM cloning or configuration-management tools like Puppet or Chef.
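As a rough illustration of spreading a build class across hosts, here is a dry-run sketch; the hostnames, class name, and installer path are assumptions for illustration, not part of the product:

```shell
#!/bin/sh
# Hypothetical sketch: spread the agents of one build class across several
# physical hosts rather than packing them onto one machine. Hostnames, the
# class name, and the installer path below are illustrative assumptions.

AGENT_HOSTS="agent-host-1 agent-host-2 agent-host-3"

provision_agents() {
    # Dry run: print the command a provisioning tool (or a plain SSH loop)
    # would execute on each host, one line per host.
    for host in $1; do
        echo "ssh $host /opt/ecloud/install-agent.sh --class linux-build"
    done
}

provision_agents "$AGENT_HOSTS"
```

In practice the same host list would live in your Puppet/Chef manifests, so replacing a dead agent host is just a matter of pointing the tool at a fresh machine.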
As with agents, each developer can have their own emake client or access to a pool of emake clients from which to start builds. I use ElectricFlow (our end-to-end Continuous Delivery automation solution, built on top of ElectricCommander) to automate the command-line invocation of emake. I encourage you to use that configuration to orchestrate your builds and create a pool of emake machines that Flow can pick from automatically. This lets you schedule a build to start on any emake machine in the pool; once started, the build is distributed across the EA cluster. A single emake client failure therefore cannot prevent developers from running builds, and the build environment stays up and available. This is a common practice when using Commander and Accelerator together.
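To make the pooling idea concrete, here is a minimal sketch of how an orchestrator might pick an emake host from a pool before dispatching a build. The pool hostnames and CM host are assumptions; `--emake-cm` is emake's option for naming the Cluster Manager, and the final command is echoed as a dry run rather than executed:

```shell
#!/bin/sh
# Sketch of how an orchestrator such as ElectricFlow might pick an emake
# client from a pool before starting a build. Pool hostnames and the CM
# host are illustrative assumptions; the command is echoed, not executed.

EMAKE_POOL="emake-host-1 emake-host-2 emake-host-3"

pick_emake_host() {
    # Round-robin on the build number so successive builds spread evenly
    # across the pool; a real orchestrator would also check availability.
    pool=$1
    build_num=$2
    set -- $pool
    idx=$(( build_num % $# + 1 ))
    eval echo "\${$idx}"
}

host=$(pick_emake_host "$EMAKE_POOL" 7)
echo "would run: ssh $host emake --emake-cm=cm-host:8030"
```

Round-robin is the simplest policy; ElectricFlow resource pools can additionally skip hosts that are offline or busy, which is exactly what removes the emake client as a single point of failure.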
I follow the best practice of running the database (DB) on its own server, which also enhances performance. Choosing a high-availability configuration for the database that holds the ElectricAccelerator data is also advised. Refer to your DB administrator or the documentation for your specific database for more details.
As mentioned earlier, there is no specific HA configuration for this important component; however, the following best practices help minimize your exposure to a single point of failure.
Harden your server
Treat your Cluster Manager (CM) as you would any mission critical IT server:
- Use your server only as a CM. Keep non-essential software to a minimum.
- Use a server-class machine with dual power units and error-correcting (ECC) RAM.
- Monitor your server's vitals with a monitoring solution such as Nagios (or a similar product) so you can be alerted to impending issues like excessive CPU or memory utilization.
- Provision plenty of disk space, and use RAID if possible.
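The monitoring point above can be sketched as a minimal check in the style of a Nagios plugin (exit code 0 for OK, 2 for CRITICAL). The monitored value and the 90% threshold are assumptions; adapt them to your Cluster Manager installation:

```shell
#!/bin/sh
# Minimal health-check sketch in the style of a Nagios plugin: exit code 0
# means OK, 2 means CRITICAL. The threshold and placeholder value are
# assumptions for illustration.

check_disk_usage() {
    # $1 = percent used, $2 = critical threshold
    if [ "$1" -ge "$2" ]; then
        echo "CRITICAL: disk ${1}% used (threshold ${2}%)"
        return 2
    fi
    echo "OK: disk ${1}% used"
    return 0
}

# A real plugin would measure the value, for example:
#   used=$(df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }')
used=42   # placeholder value so this sketch runs anywhere
check_disk_usage "$used" 90
```

Hook a script like this into Nagios (or a similar product) alongside CPU and memory checks so you hear about a failing CM before your developers do.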
An active-passive configuration provides a fully redundant instance of each node, which is brought online only when its associated primary node fails. This configuration typically requires more extra hardware than any other.
The easiest way to create an active-passive configuration is to clone the VM or physical machine, keeping the same name and IP address. Be sure to keep the Cluster Manager's configuration files in sync if you modify them after the original copy/cloning.
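A simple way to keep those files in sync is a periodic rsync from the primary to the standby. This is a sketch with illustrative local directories; with a remote standby the destination would be something like `standby-host:/path/to/conf` over SSH:

```shell
#!/bin/sh
# Sketch: keep the standby Cluster Manager's configuration in sync with
# the primary after you change it. Directory paths are illustrative.

sync_cm_conf() {
    src=$1
    dst=$2
    # -a preserves permissions and timestamps; --delete removes files that
    # no longer exist on the primary so the two copies stay identical.
    rsync -a --delete "$src"/ "$dst"/
}
```

Run it from cron (or after every deliberate configuration change) so the standby never drifts from the primary.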
Have your second machine or VM on standby. When the first machine dies, simply fire up the second machine.
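For libvirt-managed VMs, the failover step itself can be as small as two commands. This dry-run sketch echoes the commands for review rather than executing them; the VM names are assumptions:

```shell
#!/bin/sh
# Dry-run sketch of failing over to the standby Cluster Manager VM under
# libvirt: force off the dead primary (if still reachable), then start the
# standby clone, which carries the same hostname and IP. Names are
# illustrative assumptions.

failover_cm() {
    primary=$1
    standby=$2
    echo "virsh destroy $primary"   # hard power-off of the failed primary
    echo "virsh start $standby"     # bring the standby clone online
}

failover_cm cm-primary cm-standby
```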
What happens during the transition
Of course, until the second server is active, you won't be able to start new builds because there is no CM to submit them to.
However, builds that are already running will simply continue; they just will not be able to acquire new agents, because the CM handles agent allocation.
Builds that finish while the CM is down won't be able to upload their statistics, but they are otherwise unaffected.
When your second server is ready, new agents can be allocated and in-flight builds will connect transparently.
While this configuration does not eliminate downtime completely, our goal is to minimize it as much as possible.
This is made possible because the database has been separated from the CM, ensuring that no data is lost.
With these few steps, you can make your ElectricAccelerator cluster more resilient. If you have used such techniques or something different to harden your cluster, let me know so I can share those experiences with more customers.