This is a guest blog post by Gary Gruver, one of Electric Cloud’s strategic advisors. Gary is the Co-Author of Leading the Transformation, A Practical Approach to Large-Scale Agile Development, and Starting and Scaling DevOps in the Enterprise.
This is the third free chapter from my recent book “Starting and Scaling DevOps in the Enterprise”. You can read the first chapters here and here. You can also download your free copy of the complete book, here.
The book provides a concise framework for analyzing your delivery processes and optimizing them by implementing DevOps practices that will have the greatest immediate impact on the productivity of your organization. It covers the engineering, architectural, and leadership practices that are critical to achieving DevOps success. I hope this will be a helpful resource for you on your DevOps path!
Chapter 3: Optimizing The Basic Deployment Pipeline
Setting up your Deployment Pipeline (DP) and using DevOps practices for increasing its throughput while maintaining or improving quality is a journey that takes time for most large organizations. This approach, though, will provide a systematic method for addressing inefficiencies in your software development processes and improving those processes over time. We will look at the different types of work, different types of waste, and different metrics for highlighting inefficiencies. We will start there because it is important to put the different DevOps concepts, metrics, and practices into perspective so you can start your improvements where they will provide the biggest benefits and start driving positive momentum for your transformation.
The technical and cultural shifts associated with this will change how everyone works on a day-to-day basis. The goal is to get people to accept these cultural changes and embrace different ways of working. For example: As an Operations person, I have always logged into a server to debug and fix issues on the fly. Now I can log on to debug, but the fix is going to require updating and running the script. This is going to be slower at first and will feel unnatural to me, but the change means I know, as does everyone else, that the exact state of the server, with all changes, is under version control, and I can create new servers at will that are exactly the same. Short-term pain for long-term gain is going to be a hard sell for some people, but this is the type of cultural change that is required to truly transform your development processes.
Additionally, there are lots of breakthroughs coming from the field of DevOps that will help you address issues that have been plaguing your organization for years that were not very visible while operating at a low cadence. When you do one deployment a month, you don’t see the issues repeating enough to see a common cause that needs to be fixed. When you do a deployment each day, you see a pattern that reveals the things that need fixing. When you are deploying manually on a monthly basis, you can use brute force, which takes up a lot of time, requires a lot of energy, and creates a lot of frustration. When you deploy daily, you can no longer use brute force. You need to automate to improve frequency, and that automation allows you to fix repetitive issues.
As you look to address inefficiencies, it is important to understand that there are three different kinds of work with software that require different approaches to eliminate waste and improve efficiency. First, there is new and unique work, such as the new features, new applications, and new products that are the objective of the organization. Second, there is triage work that must be done to find the source of the issues that need to be fixed. Third, there is repetitive work, which includes creating an environment, building, deploying, configuring databases, configuring firewalls, and testing.
Since the new and unique work isn’t a repetitive task, it can’t be optimized the way you would a manufacturing process. In manufacturing, the product being built is constant so you can make process changes and measure the output to see if there was an improvement. With the new and unique part of software you can’t do that because you are changing both the product and the process at the same time. Therefore, you don’t know if the improvement was due to the process change or just a different outcome based on processing a different type or size of requirement. Instead the focus here should be on increasing the feedback so that people working on these new capabilities don’t waste time and energy on things that won’t work with changes other people are making, won’t work in production, or don’t meet the needs of the customer. Providing fast, high-quality feedback helps to minimize this waste. It starts with feedback in a production-like environment with their latest code working with everyone else’s latest code to ensure real-time resolution of those issues. Then, ideally, the feedback comes from the customer with code in production as soon as possible. Validating with the customer is done to address the fact that 50% of new software features are never used or do not meet their business intent. Removing this waste requires getting new features to the customers as fast as possible to enable finding which parts of the 50% are not meeting their business objective so the organization can quit wasting time on those efforts.
In large software organizations, triaging and localizing the source of the issue can consume a large amount of effort. Minimizing waste in this area requires minimizing the amount of triage required and then designing processes and approaches that localize the source of issues as quickly as possible when triage is required. DevOps approaches work to minimize the amount of triage required by automating repetitive tasks for consistency. DevOps approaches are also designed to improve the efficiency of the triage process by moving to smaller batch sizes, resulting in fewer changes needing to be investigated as potential sources of the issue.
The waste with repetitive work is different. DevOps moves to automate these repetitive tasks for three reasons. First, it addresses the obvious waste of doing something manually when it could be automated. Automation also enables the tasks to be run more frequently, which helps with batch sizes and thus the triage process. Second, it dramatically reduces the time associated with these manual tasks so that the feedback cycles are much shorter, which helps to reduce the waste for new and unique work. Third, because the automated tasks are executed the same way every time, it reduces the amount of triage required to find manual mistakes or inconsistencies across environments.
DevOps practices are designed to help address these sources of waste, but with so many different places that need to be improved in large organizations, it is important to understand where to start. The first step is documenting the current DP and starting to collect data to help target the bottlenecks in flow and the biggest sources of waste. In this chapter we will walk through each step of the basic DP and will review which metrics to collect to help you understand the magnitude of issues you have at each stage. Then, we will describe the DevOps approaches people have found effective for addressing the waste at that stage. Finally, we will highlight the cultural changes that are required to get people to accept working differently.
This approach should help illustrate why so many different people have different definitions of DevOps. It really depends what part of the elephant they are seeing. For any given organization, the constraint in flow may be the planning/requirements process, the development process, obtaining consistent environments, the testing process, or deploying code. Your view of the constraint also potentially depends on your role in the organization. While everything you are hearing about DevOps is typically valid, you can’t simply copy the rituals because it might not make sense for your organization. One organization’s bottleneck is not another organization’s bottleneck so you must focus on applying the principles!
Here we are talking about new and unique work, not repetitive work, so fixing it requires fast feedback and a focus on end-to-end cycle time for ultimate customer feedback.
For organizations trying to better understand the waste in the planning and requirements part of their DP, it is important to understand the data showing the inefficiencies. It may not be possible to collect all the data at first, but don’t let this stop you from starting your improvements. As with all of the metrics we describe, get as much data as you can to target issues and start your continuous improvement process. It is more important to start improving than it is to get a perfect view of your current issues. Ideally, though, you would want to know the answers to the following questions:
- What percentage of the organization’s capacity is spent on documenting requirements and planning?
- What is the amount of requirements inventory waiting for development, roughly, in terms of days of supply?
- What percentage of the requirements is reworked after being originally defined?
- What percentage of the delivered features is being used by customers and achieving the expected business results?
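The inventory and rework questions above reduce to simple arithmetic once you have even rough counts. Here is a minimal sketch, assuming you can pull a count of defined-but-unstarted requirements and an average completion rate from your tracking tool. The function names and the numbers are illustrative, not taken from any particular tool.

```python
def days_of_supply(inventory_items: int, completed_per_day: float) -> float:
    """Requirements inventory expressed as days of development throughput."""
    if completed_per_day <= 0:
        raise ValueError("completed_per_day must be positive")
    return inventory_items / completed_per_day


def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage."""
    return 100.0 * part / whole if whole else 0.0


# Illustrative numbers only (assumptions, not measured data).
backlog_items = 240      # requirements defined but not yet started
throughput = 2.0         # requirements completed per working day
reworked, defined = 30, 120

print(days_of_supply(backlog_items, throughput))  # 120.0
print(percent(reworked, defined))                 # 25.0
```

Even a rough number like this makes it easier to see whether requirements are being defined far ahead of the capacity to build them.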
Optimizing this part of the DP requires moving to a just-in-time approach to documenting and decomposing requirements only to the level required to support the required business decisions while limiting the commitment of long-term deliveries to a subset of the overall capacity. The focus here is to limit the inventory of requirements as much as possible. Ideally this would wait until the developer is ready to start working on the requirement before investing in defining the feature. This approach minimizes waste because effort is not exerted until you know for sure it is going to be developed. It also enables quick responsiveness to changes in the market because great new ideas don’t have to wait in line behind all the features that were previously defined.
While this is the ideal situation, it is not always possible because organizations frequently need a longer-range view of when things might happen in order to support different business decisions. For example, you might ask yourself, “Do I need to ramp up hiring to meet schedule, or should I build the manufacturing line because a product is going to be ready for a launch?” The problem is that most organizations create far more requirements inventory, much further into the future, than is needed to support their business decisions. They want to know exactly what features will be ready when using waterfall planning because that is what they do for every other part of the business. The problem is that this approach drives a lot of waste into the system and locks what should be your most flexible asset into a committed plan. Additionally, most organizations push their software teams to commit to 100% of their capacity, meaning they are not able to respond to changes in the marketplace or discoveries during development. This is a significant source of waste in a lot of organizations.
I have worked with one organization that moved to a more just-in-time approach for requirements and that has transformed their planning processes from taking 20% or more of their capacity to less than 5%. They eliminated waste and freed up 15% of the capacity of their organization to focus on creating value for the business. This was done by limiting long-term commitments of over a year to less than 50% of capacity and committing additional capacity in shorter timeframe horizons. The details of how this worked are in Chapter 5 of Leading the Transformation by Gary Gruver and Tommy Mouser. This was a big shift that freed up more capacity, and it also improved the speed of value through the system because new ideas could move quickly into development if they were of the highest priority instead of waiting in queue behind a lot of lower-priority ideas that were previously planned.
This move is a big cultural change for most organizations. It requires software/IT and business executives to think differently about how they manage software. They really need to change their focus from optimizing the system for accuracy in plans to optimizing it for throughput of value for the customer. They need to be clear about the business decisions they need to support and work with the organization to limit the investment in requirements just to the level of detail required to support those decisions.
For many organizations, like the one described in Chapter 2, the time it takes for Operations to create an environment for testing is one of the lengthiest steps in the DP. Additionally, the consistency between this testing environment and production is so lacking that it requires finding and fixing a whole new set of issues at each stage of testing in the DP. Creating these environments is one of the main repetitive tasks that can be documented, automated, and put under revision control. The objective here is to be able to quickly create environments that provide consistent results across the DP. This is done through a movement to infrastructure as code, which has the additional advantage of documenting everything about the environments so it is easier for different parts of the organization to track and collaborate on changes.
To better understand the impact environment issues are having on your DP, it would be helpful to have the following data:
- time from environment request to delivery
- how frequently new environments are required
- the percent of time environments need fixing before acceptance
- the percent of defects associated with code vs. environment vs. deployment vs. database vs. other at each stage in the DP
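The last metric in this list is mostly a tallying exercise. As a rough sketch, assuming each defect record can be tagged with the stage where it was found and its root-cause category (the record format and sample data below are hypothetical), the breakdown could be computed like this:

```python
from collections import Counter, defaultdict


def defect_breakdown(defects):
    """Return {stage: {source: percent}} from (stage, source) pairs."""
    by_stage = defaultdict(Counter)
    for stage, source in defects:
        by_stage[stage][source] += 1
    result = {}
    for stage, counts in by_stage.items():
        total = sum(counts.values())
        result[stage] = {src: 100.0 * n / total for src, n in counts.items()}
    return result


# Hypothetical sample records: (stage where found, root-cause category).
sample = [
    ("system test", "code"),
    ("system test", "environment"),
    ("system test", "environment"),
    ("system test", "deployment"),
]
breakdown = defect_breakdown(sample)
print(breakdown["system test"]["environment"])  # 50.0
```

If a stage shows most of its defects coming from environment or deployment rather than code, that is a signal the environments themselves are the place to start.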
One of the biggest improvements coming out of the DevOps movement concerns the speed and consistency of environments, deployments, and databases. This started with Continuous Delivery by Jez Humble and David Farley. They showed the value of infrastructure as code, where all parts of the environment are treated with the same rigor and controls as the application code. The process of automating the infrastructure and putting it under version control has some key advantages. First, the automation ensures consistency across different stages and different servers in the DP. Second, the automation supports the increased frequency that is required to drive to smaller batch sizes and more frequent deployments. Third, it provides working code that is a well-documented definition of the environments that everyone can collaborate on when changes are required to support new features.
Technical solutions in this space are quickly evolving because organizations are seeing that getting control of their environments provides many benefits. Smart engineers around the world are constantly inventing new ways to make this process easier and faster. Cloud capabilities, whether internal or external, tend to help a lot with speed and consistency. New scripting capabilities from Chef, Puppet, Ansible, and others help with getting all the changes in scripts under source control management. There have also been breakthroughs with containers that are helping with speed and consistency. The “how” in this space is evolving quickly because of the benefits the solutions are providing, but the “what” is a lot more consistent. For environments, you don’t want the speed of provisioning to be a bottleneck in your DP. You need to be able to ensure consistency of the environment, deployment process, and data across different stages of your DP. You need to be able to qualify infrastructure code changes efficiently so your infrastructure can move as quickly as your applications. Additionally, you need to be able to quickly and efficiently track everything that changes from one build and environment to the next.
Having Development and Operations collaborate on these scripts for the entire DP is essential. The environments across different stages of the DP are frequently different sizes and shapes, so often no one person understands how a configuration change in the development stage should be implemented in every stage through production. If you are going to change the infrastructure code, it has to work for every stage. If you don’t know how it should work in those stages, it forces necessary discussions. If you are changing it and breaking other stages without telling anyone, the SCM will find you out and the people managing the DP will provide appropriate feedback. Working together on this code is what forces the alignment between Development and Operations. Before this change, Development would tend to make a change to fix their environment so their code would work, but they wouldn’t bother to tell anyone or let people know that in order for their new feature to work, something would have to change in production. It was release engineering’s job to try and figure out everything that had changed and how to get it working in production. With the shift to infrastructure as code, it is everyone’s responsibility to work together and clearly document in working automation code all of the changes.
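To make the "it has to work for every stage" idea concrete, here is a deliberately simplified sketch. It assumes one shared base definition plus per-stage overrides, all kept in the same version-controlled file. The keys, stage names, and values are invented for illustration; real tools such as Chef, Puppet, or Ansible have their own formats.

```python
# One shared definition; stages differ only through explicit overrides.
BASE = {"os_packages": ["nginx"], "app_port": 8080, "instances": 1}
STAGE_OVERRIDES = {
    "dev": {},
    "test": {"instances": 2},
    "production": {"instances": 8},
}


def render(stage: str) -> dict:
    """Build the full environment definition for one stage of the DP."""
    if stage not in STAGE_OVERRIDES:
        raise KeyError(f"unknown stage: {stage}")
    return {**BASE, **STAGE_OVERRIDES[stage]}


def differing_keys(stage_a: str, stage_b: str) -> list:
    """List the settings that differ between two stages."""
    a, b = render(stage_a), render(stage_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


print(render("production")["instances"])     # 8
print(differing_keys("dev", "production"))   # ['instances']
```

Because a change to the base definition flows into every stage, anyone editing it has to think about production as well as development, which is exactly the conversation this approach is meant to force.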
This shift to infrastructure as code also has a big impact on the ITIL and auditing processes. Instead of the ITIL process of manually documenting the configuration of a change in a ticket, it is all documented in code that is under revision control in an SCM tool. The SCM is designed to make it easy to track any and all changes automatically. You can look at any server and see exactly what was changed, by whom, and when. Combine this with automated testing that can tell you when the system started failing, and you can quickly get to the change that caused the problem. This localization gets easier when tests run frequently enough that only a few changes land between one test run and the next.
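The localization step can be sketched in a few lines. This assumes every change, whether code, environment, deployment, or database, has a revision number in the SCM and that you know the last revision that passed and the first that failed. The `Change` record and the sample history are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    revision: int
    author: str
    kind: str  # e.g. "code", "environment", "deployment", "database"


def suspect_changes(changes, last_pass: int, first_fail: int):
    """Changes made after the last passing revision, up to and
    including the first failing one."""
    return [c for c in changes if last_pass < c.revision <= first_fail]


# Hypothetical history pulled from the SCM.
history = [
    Change(1, "ana", "code"),
    Change(2, "raj", "environment"),
    Change(3, "ana", "code"),
    Change(4, "lee", "deployment"),
    Change(5, "raj", "code"),
]
suspects = suspect_changes(history, last_pass=3, first_fail=5)
print([c.revision for c in suspects])  # [4, 5]
```

The shorter the gap between test runs, the shorter this list gets, which is the whole argument for small batches.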
Right now, the triage process takes a long time to sort through clues to find the change that caused the problem. It is hard to tell if it is a code, environment, deploy, data, or test problem, and currently the only thing under control for most organizations is code. Infrastructure as code changes that and puts everything under version control that is tracked. This eliminates server-to-server variability and enables version control of everything else. This means that the process for making the change and documenting the change are the same thing so you don’t have to look at the documentation of the change in one tool to see what was approved and then validate that it was really done in the other tool. You also don’t have to look at everything that was done in one tool and then go to the other tool to ensure it was documented. This is what they do during auditing. The other thing done during auditing is tracking to ensure everyone is following the manual processes every time–something that humans do very poorly, but computers do very well. When all this is automated, it meets the ITIL test of tracking all changes, and it makes auditing very easy. The problem is that the way DevOps is currently described to process and auditing teams makes them dig in their heels and block changes when instead they should be championing those changes. To avoid this resistance to these cultural changes, it is important to help the auditing team understand the benefits it will provide and include them in defining how the process will work. This will make it easier for them to audit, and they will know where to look for the data they require.
Using infrastructure as code across the DP also has the benefit of forcing cultural alignment between Development and Operations. When Development and Operations are using different tools and processes for creating environments, deploying code into those environments, and managing databases, they tend to find lots of issues releasing new code into production. This can lead to a great deal of animosity between Development and Operations. As they start using the same tools, and more specifically the same code, you will likely find that making the code work in all the different stages of the DP forces them to collaborate much more closely. They need to understand each other’s needs and the differences between the different stages much better. They also need to agree that any changes to the production environments start at the beginning of the DP and propagate through the system just like the application code. Over time, you will likely find that this working code is the forcing function that starts the cultural alignment between Development, Operations, and all the organizations in between. This is a big change for most large organizations. It requires that people quit logging in to servers and making manual changes. It requires an investment in creating automation for the infrastructure. It also requires everyone to use common tools, communicate about any infrastructure changes that are required, and document the changes with automated scripts. It requires much better communication across the different silos than exists in most organizations.
Organizations doing embedded development typically have a unique challenge with environments because the firmware/software systems are being developed in parallel with the actual product so there is very little, if any, product available for early testing. Additionally, even when the product is available, it is frequently difficult to fully automate the testing in the final product. These organizations need to invest in simulators to enable them to test the software portions of their code as frequently and cheaply as possible. They need to find or create a clean architectural interface between the software parts of their code and the low-level embedded firmware parts. Code is then written that can simulate this interface running on a blade server so they can test the software code without the final product. The same principle holds true for the low-level embedded firmware, but this testing frequently requires validating the interactions of this code with the custom hardware in the product. For this testing, they need to create emulators that support testing of the hardware and firmware together without the rest of the product.
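As a toy illustration of that clean architectural interface, the sketch below defines a small interface and a simulator that implements it, so the software-level logic can be tested without the product. Everything here is hypothetical: the motor interface, its methods, and the ramp logic are stand-ins chosen only to show the shape of the idea.

```python
from typing import Protocol


class MotorInterface(Protocol):
    """Hypothetical boundary between application software and firmware."""

    def set_speed(self, rpm: int) -> None: ...
    def read_speed(self) -> int: ...


class SimulatedMotor:
    """Simulator that stands in for the firmware side of the interface."""

    def __init__(self, max_rpm: int = 3000) -> None:
        self.max_rpm = max_rpm
        self._rpm = 0

    def set_speed(self, rpm: int) -> None:
        # Clamp to the range the (imagined) hardware supports.
        self._rpm = max(0, min(rpm, self.max_rpm))

    def read_speed(self) -> int:
        return self._rpm


def ramp_to(motor: MotorInterface, target: int, step: int = 500) -> int:
    """Software-level logic under test: ramp speed up in steps."""
    if step <= 0:
        raise ValueError("step must be positive")
    current = motor.read_speed()
    while current < target:
        motor.set_speed(min(current + step, target))
        new = motor.read_speed()
        if new == current:  # limit reached; stop instead of looping forever
            break
        current = new
    return current


print(ramp_to(SimulatedMotor(), 1200))              # 1200
print(ramp_to(SimulatedMotor(max_rpm=3000), 5000))  # 3000
```

The same `ramp_to` code would run unchanged against the real firmware binding, which is what makes the interface worth the investment.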
This investment in simulators and emulators is a big cultural shift for most embedded organizations. They typically have never invested to create these capabilities and instead just do big bang integrations late in the product lifecycle that don’t go well. Additionally, those that have created simulators or emulators have not invested in continually improving these capabilities to ensure they can catch more and more of the defects over time. These organizations need to make the cultural shift to more frequent test cycles just like any other DevOps organization, but they can’t do that if they don’t have test environments they can trust for finding code issues. If the organization is not committed to maintaining and improving these environments, the organization tends to lose trust and quit using them. When this happens, they end up missing a key tool for transforming how they do embedded software and firmware development.
The testing, debug, and defect fixing stage of the DP is a big source of inefficiencies for lots of organizations. To understand the magnitude of the problem for your DP, it would be helpful to have the following data:
- the time it takes to run the full set of testing
- the repeatability of the testing (false failures)
- the percent of defects found with unit tests, automated system tests, and manual tests
- the time it takes the release branch to meet production quality
- approval times
- batch sizes or release frequency at each stage
The time it takes for testing is frequently one of the biggest bottlenecks in flow in organizations starting on the DevOps journey. They depend on slow-running manual tests to find defects and support their release decisions. Removing or reducing this bottleneck is going to require moving to automated testing. This automated testing should include all aspects of testing required to release code into production: regression, new functionality, security, and performance. Operations should also work to add monitoring or other operational concerns to these testing environments to ensure issues are found and feedback is given to developers while they are writing code so they can learn and improve. Automating all the testing to run within hours instead of days and weeks is going to be a big change for most organizations. The tests need to be reliable and provide consistent results if they are going to be used for gating code. You should run them over and over again in random order against the same code to make sure they provide the same result each time and can be run in parallel on separate servers. Make sure the test automation framework is designed so the tests are maintainable and triageable. You are going to be running and maintaining thousands of automated tests running daily, and if you don’t think through how this is going to work at scale, you will end up dying under the weight of the test automation instead of reaping its benefits. This requires a well-designed automation framework that is going to require close collaboration between Development and QA.
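The "run them over and over in random order" check can be sketched as follows. This is a minimal illustration, assuming each test is a plain callable that raises `AssertionError` on failure. The two sample checks are contrived, and the sketch does not cover running in parallel on separate servers.

```python
import random


def check_repeatability(tests, runs=20, seed=0):
    """Run every test `runs` times in shuffled order against the same code
    and return the names of tests whose result was not the same each time."""
    rng = random.Random(seed)
    names = list(tests)
    outcomes = {name: set() for name in names}
    for _ in range(runs):
        rng.shuffle(names)
        for name in names:
            try:
                tests[name]()
                outcomes[name].add("pass")
            except AssertionError:
                outcomes[name].add("fail")
    return sorted(name for name, seen in outcomes.items() if len(seen) > 1)


calls = {"n": 0}


def stable_check():
    assert sum([1, 2, 3]) == 6


def every_third_call_fails():
    # Contrived stand-in for a test with a false-failure problem.
    calls["n"] += 1
    assert calls["n"] % 3 != 0


unstable = check_repeatability(
    {"stable_check": stable_check, "every_third_call_fails": every_third_call_fails}
)
print(unstable)  # ['every_third_call_fails']
```

Any test that shows up in this list should be fixed or pulled before it is allowed to gate code.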
It is important to make sure the tests are designed to make the triage process more efficient. It isn’t efficient from a triage perspective if the system tests are finding lots of environment or deployment issues. If this happens, you should start designing specific post-deployment tests to find and localize these issues quickly. Then once the post-deployment tests are in place, make sure they are passing and the environments are correct before starting any system testing. This approach improves the triage efficiency by separating code and infrastructure issues with the design of the testing process.
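One way to picture this separation is a stage runner that refuses to start system tests until the post-deployment checks pass. The sketch below assumes each check and test is a callable returning `True` or `False`. The check names are made up for illustration.

```python
def run_pipeline_stage(post_deploy_checks, system_tests):
    """Run post-deployment checks first; run system tests only if the
    environment and deployment look correct."""
    failed_checks = [name for name, check in post_deploy_checks.items() if not check()]
    if failed_checks:
        # Stop here: a failure at this point is an infrastructure issue.
        return {"status": "infrastructure_failure", "failed": failed_checks}
    failed_tests = [name for name, test in system_tests.items() if not test()]
    if failed_tests:
        return {"status": "code_failure", "failed": failed_tests}
    return {"status": "passed", "failed": []}


# Hypothetical checks: the database check fails, so system tests never run.
result = run_pipeline_stage(
    post_deploy_checks={"service_responds": lambda: True, "db_schema_current": lambda: False},
    system_tests={"checkout_flow": lambda: True},
)
print(result)  # {'status': 'infrastructure_failure', 'failed': ['db_schema_current']}
```

The status label tells the team where to start looking before anyone opens a log file.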
Automated testing and responding to feedback is going to be a big cultural shift for most organizations. The testing process is going to have to move from manually knowing how to test the applications to using leading edge programming skills to automate testing of the application. These are skills that don’t always exist in organizations that have traditionally done manual testing. Therefore, Development and the test organization are going to have to collaborate to design the test framework. Development is going to have to modify how they write code so that automated testing will be stable and maintainable. And probably the biggest change is to have the developers respond to test failures and keep build stability as their top priority.
If you can’t get this shift to happen, it probably doesn’t make sense to invest in building out complex DPs that won’t be used. The purpose of the automated testing is not to reduce the cost of testing, but to enable the tests to be run on a more frequent basis to provide feedback to developers in order to reduce waste in new and unique work. If they are not responding to this feedback, then it is not helping. Therefore, it is important to start this cultural shift as soon as possible. Don’t write a bunch of automated tests before you start using them to gate code. Instead, write a few automated build acceptance tests (BATs) that define a very minimal level of stability. Make sure everyone understands that keeping those tests passing on every build is job one. Watch this process very carefully. If it is primarily finding test issues, review and redesign your test framework. If it is primarily finding infrastructure issues, start designing post-deployment tests to ensure stability before running any system test looking for code issues. If it is primarily finding code issues, then you are on the right track and ready to start the cultural transformation of having the developers respond to feedback from the DP. The process of moving to automated tests gating code is going to be a big cultural shift, but it is probably one of the most important steps in changing how software is developed.
Testing more frequently on smaller batches of changes makes triage and debugging much easier and more efficient. The developers receive feedback while they are writing the code and engaged in that part of the design instead of weeks later when they have moved on to something else. This makes it much easier for them to learn from their mistakes and improve instead of just getting beat up for something they don’t even remember doing. Additionally, there are fewer changes in the code base between the failure and the last time it passed, so you can quickly localize the potential sources of the problem.
The focus for automated testing really needs to be on increasing the frequency of testing and ensuring the organization is quickly responding to failures. This should be the first step for two reasons. First, it starts getting developers to ensure the code they are writing is not breaking existing functionality. Second, and most importantly, it ensures that your test framework is maintainable and triageable before you waste time writing tests that won’t work over the long term.
I worked with one organization that was very proud of the fact that they had written over one thousand automated tests that they were running at the end of each release cycle. I pointed out that this was good, but to see the most value, they should start using them in the DP every day, gating builds where the developers were required to keep the builds green. They should also make sure they started with the best, most stable tests because if the red builds were frequently due to test issues instead of code issues, then the developers would get upset and disengage from the process. They spent several weeks trying to find reliable tests among the large number available. In the end, they found out that they had to throw out all the existing tests because they were not stable, maintainable, or triageable. Don’t make this same mistake! Start using your test automation as soon as possible. Have the first few tests gating code on your DP, and once you know you have a stable test framework, start adding more tests over time.
Once you have good test automation in place that is running in hours instead of days or weeks, the next step to enabling more frequent releases is getting and keeping trunk much closer to production-level quality. If you let lots of defects build up on trunk while you are waiting for the next batch release, then the bottleneck in your DP will be the amount of time and energy it takes to fix all the defects before releasing into production. The reality is that to do continuous deployment, trunk has to be kept at production levels of quality all the time. This is a long way off for most organizations, but the benefit of keeping trunk closer to production-level quality is worth the effort. It enables more frequent, smaller releases because there is not as big an effort to stabilize a release branch before going into production. It also helps with the localization of issues because it is easier to identify changes in quality when new code is integrated. Lastly, while you may still have some manual testing in place, keeping trunk stable ensures that your testers are as productive as possible because they are working on a stable build. This might not be your bottleneck if you start with a lot of manual testing because the developers can fix defects as quickly as the testers can find them. However, this starts to change as you add more automated tests. Watch for this shift, and be ready to move your focus as the bottleneck changes over time.
This transition to a more stable trunk is a journey that is going to take some time. Start with a small set of tests that will define the minimal level of stability that you will ever allow in your organization. These are your BATs. If these fail due to a change, then job one is fixing those test failures as quickly as possible. Even better, you should automatically block that change from reaching trunk. Then over time, you should work to improve the minimal level of stability allowed on trunk by farming your BATs. Have your QA organization help identify issues they frequently find in builds that impact their ability to manually test effectively. Create an automated test to catch each of these issues in real time. Add it to the BAT set, and never do any manual testing on a build until all the automated tests are passing. Look for major defects that are getting past the current BATs, and add a test to fill the hole. Look for long-running BATs that are not finding defects, and remove them so you have time to add more valuable tests. This is a constant process of farming the BATs that moves trunk closer to release quality over time.
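The pruning half of this farming process lends itself to a simple report. Here is a rough sketch, assuming you track each BAT's average run time and how many defects it caught over some review window. The record fields, threshold, and sample data are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatStats:
    name: str
    avg_seconds: float
    defects_found: int  # defects caught during the review window


def removal_candidates(stats, slow_threshold_seconds=60.0):
    """Long-running BATs that found no defects in the review window."""
    return [
        s.name
        for s in stats
        if s.avg_seconds >= slow_threshold_seconds and s.defects_found == 0
    ]


# Hypothetical data from a few weeks of gated builds.
window = [
    BatStats("login_smoke", 5.0, 3),
    BatStats("full_report_export", 240.0, 0),
    BatStats("checkout_smoke", 90.0, 2),
]
print(removal_candidates(window))  # ['full_report_export']
```

A report like this is only a starting point for discussion; a slow test that guards a critical path may still deserve its place.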
If you are going to release more frequently with smaller batches, this shift to keeping trunk stable and closer to release quality is required. It is also going to be a big shift for most organizations. Developers will need to bring code in without breaking existing functionality or exposing their code to customers until it is done and ready for release. Typically, organizations release by creating a release branch where they finalize and stabilize the code. Every project that is going to be in a release needs to have its code on trunk when the release branches. This code is typically brought in with the new features exposed to the customer, ready for final integration testing. For lots of organizations, the day the release branches is the most unstable day for trunk because developers are bringing in last-minute features that are not ready and have not been tested with the rest of the latest code. This is especially true for projects the business wants really badly. These projects tend to come in with the worst quality, which means every other project on the release has to wait until the really bad project is ready before the release branch can go to production. This type of behavior tends to lead to longer release branches and less frequent releases. To address this, the organization needs to start changing its definition of done. The code can and should be brought in but not exposed to the customer until it meets the new definition of done. If the organization is going to move to releasing more frequently, the new definition of done needs to include the following: all the stories are signed off, the automated testing is in place and passing, and there are no known open defects. This will be a big cultural shift that will take some time.
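One way to picture "on trunk but not exposed" is a toggle that only takes effect once the definition of done is met. This is a sketch under assumptions: the `Feature` record and `exposed_to_customers` function are hypothetical, and the three fields simply mirror the three criteria listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical record tracking the definition of done for a feature."""
    name: str
    stories_signed_off: bool
    automated_tests_passing: bool
    known_open_defects: int

    def is_done(self):
        """True only when all three definition-of-done criteria hold."""
        return (self.stories_signed_off
                and self.automated_tests_passing
                and self.known_open_defects == 0)


def exposed_to_customers(feature, toggle_on):
    """The code can sit on trunk either way; customers see it only when
    the toggle is on and the feature meets the definition of done."""
    return toggle_on and feature.is_done()
```

With this shape, a project the business wants badly can still merge early, but flipping its toggle has no effect until the stories, tests, and defect count are in order.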
The final step in this stage of the DP is the approval for moving into production. For some tightly regulated organizations, this requires manual approval by someone in the management chain, which can take days to get. For organizations that are well down the path to continuous deployment, this can be the biggest bottleneck in the flow of code. To remove this bottleneck, highly regulated organizations have the manager who was doing the manual approval work with testers to document the approval criteria as automated tests. For less regulated environments, having the developer take ownership and responsibility for quickly resolving any issues found in production can eliminate the management approval process.
There are lots of changes that can help improve the flow at this stage of the DP. The key is to make sure you are prioritizing improvements that will do the most to improve the flow. So, start with the bottleneck and fix it, then identify and fix the next bottleneck. This is the key to improving flow. If your test cycle is taking six weeks to run and your management approval takes a day, it does not make any sense to take on the political battle of convincing your organization that DevOps means it needs to let developers push code into production. If, on the other hand, testing takes hours, your trunk is always at production levels of quality, and your management approval takes days, then it makes sense to address the approval barriers that are slowing down the flow of code. It is important to understand the capabilities of your organization and the current bottlenecks before prioritizing the improvements.
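The prioritization rule above amounts to finding the slowest stage first. A minimal sketch, assuming you have measured elapsed hours per stage; the function name and the stage labels are hypothetical, and the "release" figure in the usage note is invented purely for illustration.

```python
def find_bottleneck(stage_hours):
    """Return the (stage, hours) pair with the longest elapsed time.

    `stage_hours` maps a stage name to the hours a change typically
    spends in that stage of the deployment pipeline.
    """
    if not stage_hours:
        raise ValueError("no stages measured")
    return max(stage_hours.items(), key=lambda item: item[1])
```

For the example in the text, `find_bottleneck({"testing": 6 * 7 * 24, "approval": 24, "release": 8})` points at testing, not approval, so the approval debate can wait.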
The next step in the basic DP is the release into production. Ideally, you would have found and fixed all the issues in the test stage so that this is a fairly automated and simple process. Realistically, this is not the case for most organizations. To better understand the source and magnitude of the issues at this stage, it is helpful to look at the following metrics:
- the time and effort required to deploy and release into production
- the number of issues found during release and their source (code, environment, deployment, test, data, etc…)
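Tracking the second metric can be as simple as tallying release issues by source. The sketch below assumes each issue is recorded as a hypothetical `(description, source)` pair, with the source drawn from the categories listed above.

```python
from collections import Counter


def issues_by_source(issues):
    """Count release issues per source, most frequent first.

    `issues` is an iterable of (description, source) pairs.
    """
    return Counter(source for _, source in issues).most_common()
```

The source at the top of the list is a reasonable place to start asking why those issues were not caught earlier in the DP.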
If you are going to release code into production with smaller, more frequent releases, you can’t have a long, drawn-out release process requiring lots of resources. Many organizations start with teams of Operations people deploying into a datacenter with run books and manual processes. This takes a lot of effort and is often plagued with manual errors and inconsistencies. DevOps addresses this by automating the release process as the final step in the DP. The process has been exercised and perfected during earlier stages in the DP, and production is just the last repeat of the process. This automation ensures consistency and greatly reduces the amount of time and people required for release.
The next big challenge a lot of organizations have during the release process is that they are finding issues that they did not discover earlier in the DP. It is important to understand the source of these issues so the team can start addressing the reasons they were not caught before release into production. As much as possible, you should be using the same tools, processes, and scripts in the test environment as in the production environment. The test environment is frequently a smaller version of production, so it is not exact, but as much as possible you should work to abstract those differences out of the common code that defines the environment, deploys the code, and configures the database. If you are finding a lot of issues associated with these pieces, start automating these processes and architect for as much common code across the DP as possible. Also, once you have this automation in place, any patches for production should start at the front end of the pipeline and flow through the process just like the application code.
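To illustrate abstracting environment differences out of common code, here is a sketch in which the same function produces the deployment steps for every environment and only a small spec differs. The `EnvironmentSpec` fields, host names, and step wording are all hypothetical; real tooling would execute steps instead of describing them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentSpec:
    """The only things allowed to differ between environments."""
    name: str
    web_servers: int
    db_host: str


def deployment_plan(spec, version):
    """Build the ordered steps for one environment.

    The logic is shared by test and production; size and host names
    come from the spec, so production is just another run of it.
    """
    steps = [f"provision {spec.web_servers} web servers for {spec.name}"]
    steps += [f"deploy {version} to {spec.name}-web-{i}"
              for i in range(1, spec.web_servers + 1)]
    steps.append(f"migrate database on {spec.db_host}")
    return steps
```

Because a smaller test environment exercises the same code path as production, a defect in the provisioning or migration logic has a chance to show up long before the launch call.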
Organizations with large, complex deployments also frequently struggle with the triage process during the launch call. A test will fail, but it is hard to tell if it is due to an environment, deployment, database, code, or test issue. The automated testing in the deployment process should be designed to help in this triage process. Instead of configuring the environments, deploying the code, configuring the database, and only then running and debugging system tests, you need to create post-deployment automated tests that can be run after the environments are configured to make sure they are correct server by server. Do the same thing for the deployment and database. Then after you have proven that those steps executed correctly, you can run the system tests to find any code issues that were not caught earlier in the DP. This structured DevOps approach really helps to streamline the triage process during code deployment and helps localize hard-to-find intermittent issues that only happen when a system test happens to hit the one server where the issue exists.
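The staged checking described above can be sketched as a loop that stops at the first stage with a failure and reports which targets failed. The function name and the shape of `stages` are assumptions made for this example; each check is any callable that returns `True` when its target looks correct.

```python
def first_failing_stage(stages):
    """Run check stages in order and stop at the first one that fails.

    `stages` is an ordered list of (stage_name, checks) pairs, where
    `checks` is a list of (target, check) pairs and `check()` returns
    True on success. Returns (stage_name, failed_targets), or None if
    every stage passed. Later stages are not run after a failure, so a
    broken server is never reported as a code issue.
    """
    for stage_name, checks in stages:
        failed = [target for target, check in checks if not check()]
        if failed:
            return stage_name, failed
    return None
```

A typical ordering would be environment checks per server, then deployment checks, then database checks, and only then the system tests, so a failure is already labeled with its likely source.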
Making these deployments into production work smoothly requires these technical changes, but mostly it requires everyone in the DP working together to optimize the system. This is why the DP is an essential part of DevOps transformations. If Operations continually sees issues during deployment, they need to work to design feedback mechanisms upstream in the DP so the issues are found and fixed during the testing process. If there are infrastructure issues found during deployment, Operations teams need to work with the Development teams to understand why the infrastructure-as-code approaches did not find and resolve these issues earlier in the DP. Additionally, the Operations team should be working with the test organization to ensure post-deployment tests are created to improve the efficiency and effectiveness of the triage process. These are all very different ways of working that these teams need to embrace over time if the DevOps transformation is going to be successful.
Operation and Monitoring
The final step is operating and monitoring the code to make sure it is working as expected in production. The primary metrics to monitor here are:
- issues found in production
- time to restore service
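As a small illustration of the second metric, the sketch below averages time to restore over a set of incidents. It assumes each incident is recorded as a hypothetical `(detected_at, restored_at)` pair of `datetime` values; how you define "detected" and "restored" is up to your organization.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average restore time over (detected_at, restored_at) pairs.

    Returns a timedelta, or None when there are no incidents.
    """
    if not incidents:
        return None
    total = sum((restored - detected for detected, restored in incidents),
                timedelta())
    return total / len(incidents)
```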
Some organizations are so busy fighting issues in production that they are not able to focus on creating new capabilities. Addressing production quality issues can be the biggest challenge for these organizations. In these situations, it is important to shift the discovery of these issues to earlier in the pipeline. The operational organization needs to work with the development organization to ensure their concerns and issues are being tested for and addressed earlier in the pipeline. This includes adding tests to address their concerns and adding monitoring that is catching issues in production to the test environments. As discussed in the release section, it also requires getting common tools and scripts for environments, deployments, and databases across the entire DP.
Implementing all these changes can help ensure you are catching most issues before launching into production. It does not necessarily help with the IE8 issue with Spanish localization discussed in Chapter 2. In that case, it would have just been too costly and time consuming to test every browser in every localization for every test case. Instead, the other significant change that website or SaaS type organizations that have complete control over their deployment processes tend to implement is to separate deployment from release by using approaches like feature toggles and canary releases. This enables new versions of the system to be released into production without new features being accessible to the customer, a pattern known as “dark launching.” This is done due to the realization that no matter how much you invest in testing, you still might not find everything. Additionally, the push to find everything can drive the testing cost and cycle times out of control. Instead, these organizations use a combination of automated testing in their DP and canary releases in production. Once the feature makes it through their DP, instead of releasing it to everyone at once, they do a canary release by giving access to a small percentage of customers and monitoring the performance to see if it is behaving as expected before releasing it to the entire customer base. This is not a license to avoid testing earlier in the pipeline, but it does enable organizations to limit the impact on the business from unforeseen issues while also taking a pragmatic approach to their automated testing.
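One common way to give a feature to "a small percentage of customers" is to hash each customer into a stable bucket. This is a sketch of that general technique, not a description of any specific organization's rollout system; the function name, the `feature:customer` key format, and the 100-bucket scheme are assumptions.

```python
import hashlib


def in_canary(customer_id, feature, percent):
    """Decide whether a customer is in the canary group for a feature.

    Hashing the feature name together with the customer ID gives each
    customer a stable bucket from 0 to 99. Customers whose bucket is
    below `percent` see the feature, so raising `percent` only ever
    adds customers and never removes ones who already had access.
    """
    key = f"{feature}:{customer_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < percent
```

Setting `percent` to 0 gives a dark launch, where the code is deployed but no customer sees the feature, and moving it to 100 completes the release once monitoring shows the canary group is behaving as expected.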
This simple construct of a DP with a single developer does a good job of introducing the concepts and shows how the DevOps changes can help to improve flow. The metrics are also very useful for targeting where to start improving the pipeline. It is important to look across all the metrics in the DP to ensure you start this work with the bottleneck and/or the biggest source of waste because transforming your development and deployment processes is going to take some time, and you want to start seeing the benefits of these changes as soon as possible. This can only occur if you start by focusing on the biggest issues for your organization. The metrics are intended to help identify these bottlenecks and waste in order to gain a common understanding of the issues across your organization so you can get everyone aligned on investing in the improvements that will add the most value out of the gate.
In the coming weeks, I will be sharing additional chapters and tips from the book.
Can’t wait? You can download your free copy now.
Gary Gruver is an experienced executive with a proven track record of transforming software development and delivery processes in large organizations, first as the R&D director of the HP LaserJet firmware group that completely transformed how they developed embedded firmware and then as VP of QA, Release, and Operations at Macy’s.com where he led the journey toward continuous delivery.
He now consults with large organizations and runs workshops to help them transform their software development and delivery processes. He is the co-author of Leading the Transformation: Applying Agile and DevOps Principles at Scale and A Practical Approach to Large-Scale Agile Development: How HP Transformed LaserJet FutureSmart Firmware.