Continuous Delivery with Broken Builds and a Clean Conscience
Have you thought about achieving continuous delivery in your project? And what about continuous delivery with a sometimes-broken build? It may sound dangerous, but it is not the end of the world, and it is more common in day-to-day projects than you might think. We agree that continuous delivery should happen with all-green build pipelines, and that is what we aim for. This article shares the experience of one particular project.
Think of a scenario in which around 5 builds go to production every day, even when the full build is broken; in which we deploy to production even though the development build is not green and not all tests are passing.
Let us consider a cloud infrastructure management project that receives about 10,000 unique visitors daily, with 5 teams developing and pushing code from two different locations. Keeping the build green on a project in which any developer can work on any part of the codebase is a big challenge, because every team has the autonomy to develop and deliver.
To start work on a story, we create a short-lived branch off master and a feature flag, which lets developers keep pushing code even when the functionality is not finished, since it stays hidden behind the flag until it is complete. When the story is done, we merge back to master, run the tests, and push it to production. It is important that this branch stays short-lived: if each team or feature ends up with a long-lived branch, that leads to a separate problem of merge hell.
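As a minimal sketch of the idea, assuming a simple in-house toggle read from an environment variable (the flag name and helper below are hypothetical, not the project's actual tooling), a feature flag can be as small as a guarded code path:

```python
# feature_flags.py - a hypothetical, minimal in-house toggle helper.
# Flags are read from an environment variable so a feature can be
# switched on or off per environment without redeploying.
import os

def is_enabled(flag_name: str) -> bool:
    """Return True if the named feature flag is switched on."""
    enabled = os.environ.get("ENABLED_FEATURES", "")
    return flag_name in {f.strip() for f in enabled.split(",") if f.strip()}

# Application code keeps the unfinished story hidden behind the flag,
# so half-built work can be merged to master and deployed safely.
def render_dashboard() -> str:
    if is_enabled("new_billing_panel"):   # hypothetical flag name
        return "dashboard with the new billing panel"
    return "legacy dashboard"

if __name__ == "__main__":
    # e.g. ENABLED_FEATURES="new_billing_panel" python feature_flags.py
    print(render_dashboard())
```

The point is simply that the unfinished code path ships dark; flipping the flag in one environment reveals the feature there without a new deploy.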
To make all of this happen we need a good, well-structured test suite. This suite is not the responsibility of any one team, but of all teams involved. A delivered story necessarily means unit and acceptance tests written and integrated. Thus, all new features should be covered by automated tests, which helps prevent regressions as new features are added.
One of the development teams in our project was composed only of testers. They were responsible for automating areas that were not covered in the development of stories, maintaining stable test suites and enhancing them, as well as implementing the so-called "sad paths" and "corner cases."
So, what were the environments we worked on? Well, there were four types:
- One was staging, an integration environment whose main objective was to find errors when integrating with external APIs.
- The pre-production environment was exactly like the production one; everything there should already be working perfectly.
- Remember the feature flags? While stories were being developed, the "in preview" environment was where we tested the stories that were still hidden in the pre-production environment.
- Finally, production was where the party happened. It was the environment in which the client used the application.
All commits merged to master go through the test suite. Through the fabulous integration radiator, the teams monitor the status of the integration. If everything passes, wonderful: it goes straight to production. If any suite fails, we investigate the causes and decide whether or not to deliver it as it is. If we decide that yes, it can be delivered, the broken build goes into production with the assurance that the failure comes from an unstable test or suite, from a piece of code that was not touched by the change, or simply from a test that is not significant enough to block delivery.
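As an illustration only (this is not the project's actual tooling, and every name below is hypothetical), that decision can be thought of as a small gate that tolerates a red build only when each failure is a known-flaky test or exercises code the change did not touch:

```python
# release_gate.py - an illustrative sketch of the "deliver on red?" decision.
KNOWN_FLAKY = {"test_payment_gateway_timeout"}   # hypothetical flaky test

def may_promote(failed_tests: set,
                tests_covering_changed_code: set) -> bool:
    """Allow promotion only if no real failure touches the changed code."""
    for test in failed_tests:
        if test in KNOWN_FLAKY:
            continue          # unstable test: tolerated
        if test not in tests_covering_changed_code:
            continue          # failure lies in code the change did not alter
        return False          # genuine failure in changed code blocks delivery
    return True

if __name__ == "__main__":
    failed = {"test_payment_gateway_timeout", "test_dns_lookup"}
    covering_change = {"test_create_instance", "test_resize_instance"}
    print("promote" if may_promote(failed, covering_change) else "block")
```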
Because the number of integrations soared as the teams grew, it became impractical to run the entire suite on every commit. So we created the concept of an integration "bus": every hour the bus collects its passengers (the pending commits), integrates them into a single package, and runs the test suite against it.
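A rough sketch of the bus, under the assumption that merged commits queue up somewhere and that `build_package` and `run_test_suite` are stand-ins for the real build and test steps:

```python
# integration_bus.py - an illustrative sketch of the hourly integration "bus";
# the helper functions are hypothetical stand-ins, not the project's tooling.
from typing import List

PENDING_COMMITS: List[str] = []          # commits waiting at the bus stop

def board(commit_sha: str) -> None:
    """A commit merged to master queues up for the next bus."""
    PENDING_COMMITS.append(commit_sha)

def build_package(commits: List[str]) -> str:
    """Stand-in for building one deployable package from a batch of commits."""
    return "package(" + "+".join(commits) + ")"

def run_test_suite(package: str) -> None:
    """Stand-in for running the full suite once against the batch."""
    print("running full suite against " + package)

def depart() -> None:
    """Called once an hour (e.g. by a scheduled trigger): collect every
    waiting passenger, integrate into a single package, run the suite."""
    if not PENDING_COMMITS:
        return
    passengers = list(PENDING_COMMITS)
    PENDING_COMMITS.clear()
    run_test_suite(build_package(passengers))

if __name__ == "__main__":
    board("a1b2c3")
    board("d4e5f6")
    depart()    # one hourly departure batches both commits into one suite run
```

Batching trades a little feedback latency for one suite run per hour instead of one per commit, which is what made the model workable as the number of teams grew.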
As we know, not everything is perfect; there are several challenges with this model of continuous delivery:
- suites that are too slow delay delivery;
- tests that fail randomly are useless because they are unreliable;
- dependence on external APIs causes failures in tests that are not part of the layer in development;
- features do break in production, and we need to find the root cause so that the problem does not repeat itself.
Some final thoughts on this continuous delivery approach. When acceptance tests take too long, it may be a symptom that your application is too big or that the deployment unit is not granular enough. Complex build pipelines are a sign that improvement is needed. Create feature flags: they make your job easier when a team needs to disable a feature in production. Keep your focus and mindset on tests, but don't be too rigid or religious about it; evaluate the errors and use your knowledge to make informed decisions.

DevOps, which focuses on environment readiness, and Release Engineering, which deals with the deployment pipeline throughout the development process, are not optional practices. There should ideally be a person or two dedicated to them, so that you can practice continuous delivery with fewer headaches, because it sometimes takes a dedicated set of eyes to keep track of Infrastructure Automation and Release Engineering health over a slightly longer duration. To ensure that this pair doesn't get stuck, and so that everyone on the team gets to understand and experience Release Engineering and working with production deployments, we recommend rotating individuals through this role over time.
Feature Toggles, combined with the ability to activate the corresponding test suites on demand, help ensure that both complete and incomplete features can go live without compromising quality.
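One way such on-demand activation could be wired up, sketched here with pytest and the hypothetical `is_enabled` helper from the earlier flag example (not a real library API of the project), is to skip a feature's tests in any environment where its toggle is off:

```python
# test_new_billing_panel.py - a sketch using pytest; `is_enabled` and
# `render_dashboard` are the hypothetical helpers from the earlier example.
import pytest

from feature_flags import is_enabled, render_dashboard

# The whole module is skipped unless the corresponding toggle is on, so an
# incomplete feature can be deployed dark without its tests failing the build.
pytestmark = pytest.mark.skipif(
    not is_enabled("new_billing_panel"),
    reason="new_billing_panel toggle is off in this environment",
)

def test_dashboard_shows_new_billing_panel():
    assert "new billing panel" in render_dashboard()
```

With this pattern, the feature's suite runs only in environments such as "in preview" where the toggle is switched on, and stays silent everywhere the feature is still hidden.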
How do you handle broken builds in your practice of continuous delivery? Contribute to the discussion in the comments.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.