Of all the Lean Startup techniques, Continuous Deployment is by far the most controversial. Continuous Deployment is a process by which software is released several times throughout the day — in minutes versus days, weeks, or months. Continuous Flow Manufacturing is a Lean technique that boosts productivity by rearranging manufacturing processes, so products are built end-to-end, one at a time (using single-piece flow), versus the more prevalent batch and queue approach.
Continuous Deployment is Continuous Flow applied to software. The goal of both is to eliminate waste. The biggest waste in manufacturing is created by having to transport products from one place to another. The biggest waste in software is created by waiting as software moves from one state to another: waiting to code, test, and deploy. Reducing or eliminating these waits leads to faster iterations, which are the key to success.
My transition to Continuous Deployment
Before adopting continuous deployment, I used to release software weekly (come rain or shine), which I viewed as pretty agile, disciplined, and aggressive. I identified the must-have code updates on Monday, the official code cutoff was on Thursday, and Friday was slated for the big release event. The release process took at least half a day and sometimes the whole day. Dedicating up to 20% of the week to releasing software is incredibly wasteful for a small team. This is not counting the ongoing coordination effort needed to prioritize the ever-changing release content for the week as new critical issues are discovered. Despite these challenges, I fought the temptation to move to a longer bi-weekly or monthly release cycle because I wanted to stay highly responsive to customers (something our customers repeatedly tell us they appreciate). Managing weekly releases got a lot harder once I started doing customer development. Spending more time outside the building meant less time for coding, testing, and deploying. Things started to slip. That is when I devised a set of work hacks to manage my schedule (described here), and it is also what drove me to adopt Continuous Deployment.
My transition from staged releases to continuous deployment took roughly two weeks. I read Eric Ries’ 5-step primer to getting started with Continuous Deployment and found that I already had many of the necessary pieces. Continuous integration, deployment scripts, monitoring, and alerting are all best practices for any release process — staged or continuous.
The fundamental challenge with Continuous Deployment is getting comfortable with releasing all the time.
Continuous deployment makes releases non-events, and checking in code is synonymous with triggering a release. On the one hand, this is the ultimate in customer responsiveness. On the other hand, it is scary as hell. With staged releases, time provides a (somewhat illusory) safety net. There is also comfort in sharing test responsibility with someone else (the QA team). No one wants to be solely responsible for bringing a production system down. For me, neither was a consideration. I didn’t have time or a QA team.
I took things easy at first: I made small changes and audited the release process maniacally. I started relying heavily on functional tests (over unit tests), which allowed me to test changes as a user would. I also identified a set of events that would indicate something going terribly wrong (e.g., no users on the system) and built real-time alerting around them (using Nagios/Ganglia). As we built confidence, we started committing bigger and multi-part changes, each time building up our suite of testing and monitoring scripts. After a few iterations, our fear level was actually lower than it used to be after a staged release. Because we were committing less code per release, we could correlate issues to a release with certainty.
These days, we never wonder whether unexpected errors could have been introduced by a large code merge (since there is no branching). We also rely on more testing and monitoring automation, which is far more robust and consistent than what we were doing before.
All that said, mistakes are still made, and we commit bad code now and then. None have taken the system down (not yet, anyway). Rather than seeing these as a shortcoming of the process, we view them as an opportunity to build up our Cluster Immune System. We try and follow a Five Whys approach to keep these errors from recurring. There is always some action to take: writing more tests, more monitoring, more alerts, more code, or more process.
Looking back, I struggled to balance the opposing pulls of “outside the building” versus “inside the building” activities. Adopting Continuous Deployment has allowed me to build “flow” into my day, which allows me to do both. But easier releases are not the only benefit of Continuous Deployment. Smaller releases lead to faster build/measure/learn loops. I’ve used these faster build/measure/learn loops to optimize my User Activation flow, delight customers with “near-instant” fixes to issues, and even eliminate features that no one was using.
While it is somewhat easier to continuously deploy web-based software, with a little discipline, desktop-based software too can be built to flow. Here’s how I implement continuous deployment for my desktop-based application (CloudFire).
My Continuous Deployment process
Don’t push features
If you’ve followed a customer discovery process, identified a problem worth solving, and built out your minimum viable product, DON’T keep adding features until you’ve validated the MVP, or more specifically, the unique value proposition of the MVP. Unneeded features are a waste and not only create more work but can needlessly complicate the product and prolong the “customer validation” phase.
Ideally, every new feature should be pulled by more than one customer before it shows up in a release.
Build in response to a signal from the “customer,” and otherwise rest or improve.
As a technologist, I, too, love to measure progress based on how much stuff I build. But instead of channeling all my energy toward building new features, I channel roughly 80% of it toward measuring and optimizing existing features. I am not advocating against adding features altogether. Users will naturally ask for more stuff, and your MVP, by definition, is minimal and needs more love. Just don’t push it.
Code in small batches
I’ve previously described my 2-hour blocks of maker time for maximizing my work “flow.” Before starting any maker activity, I clearly identify what needs to get done (the goal) and sketch out how it needs to get done (the design).
It is important to point out that the goal of the maker activity need not be a user-facing feature or a complete one. There is inherent value in committing incremental work into production to defuse future integration surprises. During the maker activity, I code, unit test, and create or update functional tests as needed. At the end of the maker activity, I check in code, which automatically triggers a build on a continuous integration server that runs through a battery of unit and functional tests. The artifacts created at the end of the build are installers for Mac and Windows (for new users), along with an Eclipse P2 repository (OSGi) for automatic software updates (for current users). The release process takes ~15 minutes and runs in the background.
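To make the check-in-to-release flow concrete, here is a minimal sketch in Python. It is not the actual CloudFire build configuration: the `Stage` and `run_pipeline` names are hypothetical, and each stage is a stand-in lambda where a real setup would call the test runner and packaging scripts. The one property it illustrates is that the pipeline stops at the first failing stage, so nothing is published unless every earlier step passed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One step of the build; `run` returns True on success (hypothetical type)."""
    name: str
    run: Callable[[], bool]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (overall_success, names_of_stages_that_completed).
    """
    completed: List[str] = []
    for stage in stages:
        if not stage.run():
            return False, completed
        completed.append(stage.name)
    return True, completed


# Stand-in stages mirroring the flow described above; the lambdas are
# placeholders for real test and packaging commands.
stages = [
    Stage("unit tests", lambda: True),
    Stage("functional tests", lambda: True),
    Stage("build installers", lambda: True),
    Stage("publish update repository", lambda: True),
]

ok, done = run_pipeline(stages)
print(ok, done)
```

Ordering the stages so that the cheapest checks run first keeps a bad commit from ever reaching the packaging and publishing steps.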
Prefer functional tests over unit tests whenever possible
I don’t believe in blindly writing unit tests to achieve 100% code coverage as reported by some tool. To do that, I would have to mock (simulate) too many critical components. I deem excessive unit testing a form of waste. Whenever possible, I rely on functional tests that verify user actions. I use Selenium, which lets me control the application on multiple browsers and OS platforms, just as a user would. One thing to be wary of is that functional tests run longer than unit tests and will gradually increase the release cycle time. Parallelizing tests across multiple test machines is one way to address this. I am not there yet, but Selenium Grid looks like a good option. So does Go Test It.
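As a rough illustration of why parallelization helps, the sketch below splits a suite across test machines using a simple greedy rule: take the longest test first and give it to the machine with the least work so far. This is a generic scheduling heuristic, not how Selenium Grid assigns work, and the test names and durations are made up for the example.

```python
import heapq
from typing import Dict, List


def shard_tests(durations: Dict[str, float], machines: int) -> List[List[str]]:
    """Assign tests to machines, longest first, always to the least-loaded machine."""
    if machines < 1:
        raise ValueError("need at least one machine")
    # Heap of (total_seconds_assigned, machine_index).
    heap = [(0.0, i) for i in range(machines)]
    shards: List[List[str]] = [[] for _ in range(machines)]
    # Sort by descending duration; break ties by name to keep the result stable.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return shards


# Hypothetical functional tests with durations in seconds.
durations = {"activation": 300, "upload": 120, "gallery": 100, "login": 60}
shards = shard_tests(durations, machines=2)
wall_time = max(sum(durations[t] for t in shard) for shard in shards)
print(shards, wall_time)
```

With these numbers, running everything on one machine takes 580 seconds, while two machines finish in 300 seconds, the length of the slowest single test.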
Always test the User Activation flow
After the integration tests are run and the software is packaged, I always verify my User Activation flow before going live. The user activation flow is the most critical path toward achieving initial user gratification or product/market fit. My user activation flow is automatically tested on both a Mac and a Windows machine.
Utilize automagic software updates
A major challenge with desktop-based (versus web-based) software is propagating software updates. Studies have shown that users find traditional software update dialogs annoying. To overcome this, I am using a software update strategy that works silently without ever interrupting the user, much like an appliance. Google Chrome utilizes a similar update process. The biggest risk with this approach is that users will find it Orwellian. So far, no one has complained, and many users like the auto-update feature. It helps that CloudFire, being a p2web app, runs headlessly with a browser-based UI.
This is how the software update process currently works:
At the end of each build, we push an Eclipse P2 repository (OSGi), which is a set of versioned plug-ins that make up the application. Because the application is composed of many small plug-ins, and because we commit small code batches, each software update is small and can be downloaded quickly.
Every time the user starts up the application, it checks for a new update and downloads and installs one if available. Depending on the type of update, it could take effect immediately or require an application restart. If an application restart is required, we wait until the next user-initiated relaunch of the application or trigger one silently when the system is idle.
If the application is already running, it periodically polls for new updates. If an update is found, it is also downloaded and installed in the background (as above) without interrupting the user.
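The decision logic in the steps above can be sketched as a small function. This is only an illustration in Python, not the Eclipse P2 API or the actual CloudFire updater: the `Update` type, the `Action` names, and the tuple-based version numbers are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    NONE = "none"                            # nothing newer is available
    APPLY_NOW = "apply_now"                  # update takes effect immediately
    RESTART_WHEN_IDLE = "restart_when_idle"  # silent restart while the system is idle
    WAIT_FOR_RELAUNCH = "wait_for_relaunch"  # applied at the next user-initiated launch


@dataclass
class Update:
    """Hypothetical description of an available update."""
    version: Tuple[int, ...]
    needs_restart: bool


def decide(installed: Tuple[int, ...], available: Optional[Update],
           system_idle: bool) -> Action:
    """Pick what to do after an update check, without ever prompting the user."""
    if available is None or available.version <= installed:
        return Action.NONE
    if not available.needs_restart:
        return Action.APPLY_NOW
    return Action.RESTART_WHEN_IDLE if system_idle else Action.WAIT_FOR_RELAUNCH


print(decide((1, 4, 2), Update((1, 4, 3), needs_restart=False), system_idle=False))
print(decide((1, 4, 2), Update((1, 5, 0), needs_restart=True), system_idle=True))
```

The same function can serve both the startup check and the periodic poll, since the only inputs are the installed version, the available update, and whether the system is idle.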
Alerts and monitoring
I use Nagios and Ganglia to implement both system and application-level monitoring and alerting on the overall health of the production cluster. Examples of things I monitor are the number of user activations, active users, and aggregate page hits to user galleries. Any out-of-the-norm dip in these numbers immediately alerts us (via Twitter/SMS) to a potential issue.
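One simple way to define an "out-of-the-norm dip" is to compare the latest reading against the average of recent readings. The Python sketch below shows that idea only; it is not a Nagios or Ganglia check, and the `is_abnormal_dip` name and the 50% threshold are assumptions chosen for illustration.

```python
from statistics import mean
from typing import Sequence


def is_abnormal_dip(history: Sequence[float], current: float,
                    max_drop: float = 0.5) -> bool:
    """Return True when `current` falls more than `max_drop` below the recent average.

    `history` holds recent readings of one metric (e.g. active users per interval).
    With no history or a zero baseline there is nothing to compare against.
    """
    if not history:
        return False
    baseline = mean(history)
    if baseline <= 0:
        return False
    return current < baseline * (1 - max_drop)


# Hypothetical readings of active users over the last three intervals.
recent_active_users = [100, 110, 90]
print(is_abnormal_dip(recent_active_users, 95))  # normal fluctuation
print(is_abnormal_dip(recent_active_users, 0))   # "no users on the system"
```

A fixed percentage is crude; metrics with strong daily cycles would need a baseline taken from the same time of day, but the alerting principle is the same.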
Application level diagnostics
Despite the best testing, defects still happen. More testing is not always the answer, as some defects are intermittent and a function of the end-user’s environment. It is virtually impossible to test all combinations of hardware, OS, browsers, and third-party apps (e.g., Norton anti-virus, Zone Alarm, etc.).
Relying on users to report errors doesn’t always work in practice. To compensate, we’ve had to build basic diagnostics into the application. They can notify both the user and us of unexpected errors and allow us to pull configuration information and logs remotely. We can also do remote rollbacks this way.
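As an illustration of what basic built-in diagnostics might collect, the sketch below bundles an unexpected error with environment details and recent log lines into a report that could be sent back for analysis. It uses only the Python standard library; the `build_error_report` name, the report fields, and the 50-line log limit are assumptions, not the actual CloudFire implementation.

```python
import json
import platform
import traceback
from typing import Dict, Iterable


def build_error_report(exc: BaseException, app_version: str,
                       recent_log_lines: Iterable[str]) -> Dict[str, object]:
    """Collect the error, the environment, and recent logs into one JSON-friendly dict."""
    return {
        "app_version": app_version,
        "os": platform.system(),
        "os_release": platform.release(),
        "error_type": type(exc).__name__,
        "message": str(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        # Keep only the tail of the log so reports stay small.
        "recent_logs": list(recent_log_lines)[-50:],
    }


try:
    raise ValueError("could not bind to port")  # stand-in for an unexpected error
except ValueError as error:
    report = build_error_report(error, "1.4.2", ["starting server", "binding port"])

payload = json.dumps(report)  # ready to upload or attach to a support request
print(report["error_type"], report["message"])
```

Capturing the OS and version alongside the stack trace matters here because many of these defects depend on the end user's environment rather than on the code alone.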
Tolerate unexpected errors exactly once
Unexpected errors provide the opportunity to learn and bulletproof a system early. Ignoring them or implementing quick-and-dirty patches inevitably leads to repeat errors, which are another form of waste. I try to follow a formalized Five Whys process (using our internal wiki) for every error. This forces us to stop, think, and fix the right problem(s).
My continuous deployment process is summarized below:
So why is Continuous Deployment so controversial?
Eric has addressed a lot of the objections already on his blog. One that I hear a lot is the belief that you need a massive team to pull off continuous deployment. I would argue that the earlier in the development cycle and the smaller the team, the easier it is to implement a continuous deployment process. If you are a start-up with an MVP, there is no better time to adopt a continuous deployment process than the present. You don’t yet have hundreds of customers, dozens of peers, or dozens of features. It is a lot easier to lay the groundwork now with time on your side.