This is a guest blog post by Gary Gruver, one of Electric Cloud’s strategic advisors. Gary is the Co-Author of Leading the Transformation, A Practical Approach to Large-Scale Agile Development, and Starting and Scaling DevOps in the Enterprise.
This is the second free chapter from my recent book “Starting and Scaling DevOps in the Enterprise“. You can read the first chapter here. You can also download your free copy of the complete book, here.
The book provides a concise framework for analyzing your delivery processes and optimizing them by implementing DevOps practices that will have the greatest immediate impact on the productivity of your organization. It covers both the engineering, architectural and leadership practices that are critical to achieving DevOps success. I hope this will be a helpful resource for you on your DevOps path!
Chapter 2: The Basic Deployment Pipeline
The Deployment Pipeline (DP) in a large organization can be a complex system to understand and improve. Therefore, it makes sense to start with a very basic view of the DP, to break the problem down into its simplest construct and then show how it scales and becomes more complex when you use it across big, complex organizations. Te most basic construct of the DP is the flow of a business idea to development by one developer through a test environment into production. This defines how value flows through software/IT organizations, which is the first step to understanding bottlenecks and waste in the system. Some people might be tempted to start the DP at the developer, but I tend to take it back to the flow from the business idea because we should not overlook the amount of requirements inventory and inefficiencies that waterfall planning and the annual budgeting process drive into most organizations.
The first step in the pipeline is communicating the business idea to the developer so they can create the new feature. Then, once the new feature is ready, the developer will need to test it to ensure that it is working as expected, that the new code has not broken any existing functionality, and that it has not introduced any security holes or impacted performance. This requires an environment that is representative of production. The code then needs to be deployed into the test environment and tested. Once the testing ensures the new code is working as expected and has not broken any other existing functionality, it can be deployed into production, tested, and released. The final step is monitoring the application in production to ensure it is working as expected. In this chapter, we will review each step in this process, highlighting the inefficiencies that frequently occur. Then, in Chapter 3, we will review the DevOps practices that were developed to help address those inefficiencies.
The first step in the DP is progressing from a business idea to work for the developer to create the new feature. This usually involves creating a requirement and planning the development to some extent. The first problem large organizations have with flow of value through their DP is that they tend to use waterfall planning. They do this because they use waterfall planning for every other part of their business so they just apply the same processes to software. Software, however, is unlike anything else most organizations manage in three ways. First, it is much harder to plan accurately because everything you are asking your teams to do represents something they are being asked to do it for the first time. Second, if software is developed correctly with a rigorous DP, it is relatively quick and inexpensive to change. Third, as an industry we are so poor at predicting our customers’ usage that over 50 of all software developed is never used or does not meet its business intent. Because of these unique characteristics of software, if you use waterfall planning, you end up locking in your most flexible and valuable asset in order to deliver features that won’t ever be used or won’t deliver the intended business results. You also use up a significant amount of your capacity planning instead of delivering real value to your business.
Organizations that use waterfall planning also tend to build up lots of requirements inventory in front of the developer. This inventory tends to slow down the flow of value and creates waste and inefficiencies in the process. As the Lean manufacturing efforts have clearly demonstrated, wherever you have excess inventory in the system tends to drive waste in terms of rework and expediting. If the organization has invested in creating the requirements well ahead of when they are needed, when the developer is ready to engage, the requirement frequently needs to be updated to answer any questions the developer might have and/or updated to respond to changes in the market. This creates waste and rework in the system.
The other challenge with having excess inventory of requirements in front of the developer is that as the marketplace evolves, the priorities should also evolve. This leads to the organization having to reprioritize the requirements on a regular basis or, in the worst case, sticking to a committed plan and delivering features that are less likely to meet the needs of the current market. If these organizations let the planning process lock them into committed plans, it creates waste by delivering lower value features. If the organizations reprioritize a large inventory of requirements, they will likely deprioritize requirements that the organization has invested a lot of time and energy in creating. Either way, excess requirements inventory leads to waste.
The next step is getting an environment where the new feature can be deployed and tested. The job of providing environments typically belongs to Operations, so they frequently lead this effort. In small organizations using the cloud, this can be very straightforward and easy. In large organizations using internal datacenters, this can be a very complex and timely process that requires working through extensive procurement and approval processes with lengthy handoffs between different parts of the organization. Getting an environment can start with long procurement cycles and major operational projects just to coordinate the work across the different server, storage, networking, and firewall teams in Operations. This is frequently one of the biggest pain points that cause organizations to start exploring DevOps.
There is one large organization that started their DevOps initiative by trying to understand how long it would take to get up Hello World! in an environment using their standard processes. They did this to understand where the biggest constraints were in their organization. They quit this experiment after 250 days even though they still did not have Hello World! up and running because they felt they had identified the biggest constraints. Next, they ran the same experiment in Amazon Web Services and showed it could be done in two hours. This experiment provided a good understanding of the issues in their organization and also provided a view of what was possible.
Testing and Defect Fixing
Once the environment is ready, the next step is deploying the code with the new feature into the test environment and ensuring it works as expected and does not break any existing functionality. This step should also ensure that there were no security or performance issues created by the new code. Tree issues typically plague traditional organizations at this stage in their DP: repeatability of test results, the time it takes to run the tests, and the time it takes to fix all the issues.
Repeatability of the results is a big source of inefficiency for most organizations. They waste time and energy debugging and trying to find code issues that end up being problems with the environment, the code deployment, or even the testing process. This makes it extremely difcult to determine when the code is ready to flow into production and requires a lot of extra triaging effort for the organization. Large, complex, tightly coupled organizations frequently spend more time setting up and debugging these environments than they do writing code for the new capabilities.
This testing is typically done with expensive and time-consuming manual tests that are not very repeatable. This is why it’s essential to automate your testing. The time it takes to run through a full cycle of manual testing delays the feedback to developers, which results in slow rework cycles, which reduces flow in the DP. The time and expense of these manual test cycles also forces organizations to batch lots of new features together into major releases, which slows the flow of value and makes the triage process more difficult and inefficient.
The next challenge in this step is the time and effort it takes to remove all the defects from the code in the test environment and to get the applications up to production level quality. In the beginning, the biggest constraint is typically the time it takes to run all the tests. When this takes weeks, the developers can typically keep up with fixing defects at the rate at which the testers are finding them. This changes once the organization moves to automation where all the testing can be run in hours, at which point the bottleneck tends to move toward the developers ability to fix all the defects and get the code to production levels of quality.
Once an organization gets good at providing environments or is just adding features to an application that already has environments set up, reaching production level quality is frequently one of the biggest challenges to releasing code on a more frequent basis. I have worked with organizations that have the release team leading large cross-organizational meetings to get applications tested, fixed, and ready for production. They meet every day to review the testing progress to see when it will be done so they are ready to release to production. They track all the defects and fxes so they can make sure the current builds have production level quality. Frequently, you see these teams working late on a Friday night to get the build ready for offshore testing over the weekend only to find out Saturday morning that all the offshore teams were testing with the wrong code or a bad deployment, or the environment was misconfigured in some way. This process can drive a large amount of work into the system and is so painful that many organizations choose to batch very large, less frequent releases to limit the pain.
Once all the code is ready, the next step is to deploy the code into production for testing and release to the customer. Production deployment is an Operations led effort, which is important because Operations doesn’t always take the lead in DevOps transformations, but when you use the construct of the DP to illustrate how things work, it becomes clear that Operations is essential to the transformation and should lead certain steps to increase efficiency in the process. It is during this step that organizations frequently see issues with the application for the first time during the release. It is often not clear if these issues are due to code, deployment, environments, testing, or something else altogether. Therefore, the deployment of large complex systems frequently requires large cross-organizational launch calls to support releases. Additionally, these deployment processes themselves can require lots of time and resources for manual implementations. The amount of time, effort, and angst associated with this process frequently pushes organizations into batching large amounts of change into less frequent releases.
Monitoring and Operations
Monitoring is typically another Operations-led effort since they own the tools that are used to monitor production. Frequently, the first place in the DP that monitoring is used is in production. This is problematic because when code is released to customers, developers haven’t been able to see potential problems clearly before the customer experience highlights it. If Operations works with Development to move monitoring up the pipeline, potential problems are caught earlier and before they impact the customer.
When code is finally released to the customers and monitored to ensure it is working as expected, then ideally there shouldn’t be any new issues caught with monitoring in production if all the performance and security testing was complete with good coverage. This is frequently not the case in reality. For example, I was part of one large release into production where we had done extensive testing going through a rigorous release process, only to have it immediately start crashing in production as a result of an issue we had never seen before. Every time we pointed customer traffic to the new code base, it would start running out of memory and crashing. After several tries and collecting some data, we had to spend several hours rolling back to the old version of the applications. We knew where the defect existed, but even as we tried debugging the issues, we couldn’t reproduce it in our test environments. After a while, we decided we couldn’t learn any more until we deployed into production and used monitoring to help locate the issue. We deployed again, and the monitoring showed us that we were running out of memory and crashing. This time the developers knew enough to collect more clues to help them identify the issue. It turns out a developer was fixing a bug that was not wrapping around a long line of text correctly. The command the developer had used worked fine in all our testing, but in production we realized that IE8 localized to Spanish had a defect that would turn this command into a floating point instead of an integer, causing a stack overflow. This was such a unique corner case, we would not have considered testing for it. Additionally, even if we had considered it, running all our testing on different browsers with different localizations would have become cost prohibitive. It is issues like this that remind us that the DP is not complete until the new code has been monitored in production and is behaving as expected.
Understanding and improving a complex DP in a large organization can be a complicated process. Therefore, it makes sense to start by exploring a very simple DP with one developer and understanding the associated challenges. This process starts with the business idea being communicated to the developer and ends with working code in production that meets the needs of the customer. There are lots of things that can and do go wrong in large organizations, and the DP provides a good framework for putting those issues in context. In this chapter, we introduced the concept and highlighted some typical problems. Chapter 3 will introduce the DevOps practices that are designed to address issues at each stage in the pipeline and provide some metrics that you can use to target improvements that will provide the biggest benefits.
In the coming weeks, I will be sharing additional chapters from the book.
Can’t wait? you can download your free copy now.
Gary Gruver is an experienced executive with a proven track record of transforming software development and delivery processes in large organizations, first as the R&D director of the HP LaserJet firmware group that completely transformed how they developed embedded firmware and then as VP of QA, Release, and Operations at Macy’s.com where he led the journey toward continuous delivery.
He now consults with large organizations and runs workshops to help them transform their software development and delivery processes. He is the co-author of Leading the Transformation: Applying Agile and DevOps Principles at Scale and A Practical Approach to Large-Scale Agile Development: How HP Transformed LaserJet FutureSmart Firmware.