Incremental danger
April 23, 2025
I'm all for incremental development. Incremental development is great. It reduces risk by dividing complex changes into small – less complex – parts. It allows for early delivery of value. It enables learning and discovery. It simplifies development by letting you focus on one part of the problem at a time. That's good, right?
Yes, it is, but there is an important caveat: you have to make sure the parts fit together – and that they actually solve the problem in the context where it needs to be solved!
Let's look at a simple example that was developed incrementally.
A simple problem
A team was in the process of containerizing their service. As a result, they had been generating container images at a relatively high rate. This was noticed by the administrator of the central image repository, and the team was asked to delete images they no longer needed and to institute some kind of policy to limit the number of images. The team decided to implement an automated clean-up job to delete obsolete images. Due to the team's (mostly) roll-forward strategy in case of problems, most old images would never be needed again and could be deleted. It was deemed enough to be able to roll back for 7 days or 10 images (a somewhat arbitrary number). These were the images they wanted to keep. Everything else could be deleted.
At the time, the rate of deployment varied between a few times a week and once every few weeks, tending towards the longer time scale[1]. Revisions (images) could spend a few days or in some cases weeks (for complex features) in the test environment before being deployed to production.
Let's follow the developer – call him George – tasked with implementing the clean-up job.
Increment 1
George implements the first version of the clean-up job this way:
- enumerate all images,
- put those younger than 7 days on a ‘keep list’,
- delete the rest.
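In Python, this first version might look something like the following sketch. The function name and the `(name, created_at)` pair representation are my own stand-ins; the actual registry API calls for listing and deleting images are left out:

```python
from datetime import datetime, timedelta, timezone

KEEP_DAYS = 7  # keep everything younger than this

def images_to_delete_v1(images, now=None):
    """Return the names of images to delete.

    `images` is a list of (name, created_at) pairs -- a stand-in
    for whatever the registry API actually returns.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=KEEP_DAYS)
    # everything younger than 7 days is the 'keep list';
    # everything else gets deleted
    return [name for name, created in images if created < cutoff]
```

Note that the cutoff is anchored to the time the job runs.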
This works fine, and the job is scheduled to run nightly.
Before reading further, please take a moment to think about this implementation. What do you think will happen given the workflow described above?
Time to think …
Do you see the problem?
George doesn't. He thinks everything is fine.
But then, after a few nights, the job deletes the image that is currently running in production because it is older than 7 days. Fortunately, the team notices this before it turns into a serious problem. If the runtime platform had decided it needed to pull the image again due to a restart, this could have resulted in an outage!
Increment 1.1
Regenerating the deleted image is easy – just run the CI pipeline again.
George then fixes the clean-up job by looking up which image is running in production, taking its timestamp, and calculating 7 days backwards from there instead of from the time the job is run.
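Sketching the fix under the same assumptions as before, the cutoff now moves with the production image's timestamp rather than with the wall clock:

```python
from datetime import timedelta

KEEP_DAYS = 7

def images_to_delete_v1_1(images, production_image):
    """Delete images more than 7 days older than the production image.

    The cutoff is anchored to the production image's timestamp, so the
    production image itself can never fall outside the window.
    """
    created_by_name = dict(images)
    cutoff = created_by_name[production_image] - timedelta(days=KEEP_DAYS)
    return [name for name, created in images if created < cutoff]
```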
This is a bit awkward to test because there are no longer any old images, but the code looks OK, so the fix is deployed, and production images are safe again.
Increment 2
George then implements keeping at least 10 images by skipping deletion entirely if the ‘keep list’ from above has fewer than 10 entries.
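Continuing the sketch from the previous increments, this adds a guard in front of the deletion:

```python
from datetime import timedelta

KEEP_DAYS = 7
MIN_KEEP = 10

def images_to_delete_v2(images, production_image):
    cutoff = dict(images)[production_image] - timedelta(days=KEEP_DAYS)
    # the 'keep list': everything within 7 days of the production image
    keep = [name for name, created in images if created >= cutoff]
    if len(keep) < MIN_KEEP:
        return []  # 'keep list' too short: skip deletion entirely
    return [name for name, created in images if created < cutoff]
```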
Again, before reading further, take a moment to think about this solution. What do you think will happen?
Time to think …
At first glance, this implementation does satisfy all constraints. The image running in production is preserved, images younger than 7 days are preserved, and at least 10 images are preserved. So it's good, then?
Except preserving images is not the point of the exercise. The team already preserved everything before the clean-up job was introduced – they never deleted anything. The main goal is to delete stuff.
How does this solution delete images? It will delete images older than 7 days if there are at least 10 images younger than 7 days[2]. This means the job will only delete anything if the team has deployed to production at least 10 images over the past 7 days, i.e., on average at least 2 deployments per day[3]. Given the actual deployment frequency of the team, the job will never actually delete anything.
Increment 2.1
This fact is discovered during a code review, and George fixes it by changing the logic. The job now walks backwards in time from the production image, adding images to the ‘keep list’ until it both covers 7 days and contains 10 images. All images older than that are deleted.
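Under the same assumptions as the earlier sketches, the corrected logic might look like this. I'm reading the rule as: keep walking backwards until both limits are exhausted, so that both guarantees (7 days of rollback history and at least 10 images) hold at once:

```python
from datetime import timedelta

KEEP_DAYS = 7
MIN_KEEP = 10

def images_to_delete_v2_1(images, production_image):
    prod_created = dict(images)[production_image]
    cutoff = prod_created - timedelta(days=KEEP_DAYS)
    # walk backwards in time, starting at the production image
    history = sorted((img for img in images if img[1] <= prod_created),
                     key=lambda img: img[1], reverse=True)
    keep = set()
    for name, created in history:
        # keep while we are within 7 days of the production image,
        # or still have fewer than 10 images on the 'keep list'
        if created >= cutoff or len(keep) < MIN_KEEP:
            keep.add(name)
    # assumption: images newer than the production image (e.g. still
    # in test) are left alone -- the article doesn't spell this out
    keep.update(name for name, created in images if created > prod_created)
    return [name for name, _ in images if name not in keep]
```

With the team's roughly weekly cadence, this keeps the production image plus nine predecessors and deletes everything older.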
Now the job actually deletes old images while preserving those it is supposed to preserve.
Retrospective
To recap, the team started off with a simple requirement: don't keep too much old stuff, but keep enough to be able to roll back in case of trouble. They formulated this in terms of two rough rules, and George went off to implement them incrementally.
The first increment deleted too much. The reason was a misunderstanding of the requirement: being able to roll back for 7 days after a production deployment versus keeping images younger than 7 days from the perspective of now. Realizing that there is a significant difference hinges on understanding the context the solution has to work in (the deployment frequency and the time between image generation and deployment to production). Focusing solely on the task of delivering the first increment, George forgot to consider this context – and with a bit of bad luck could have caused a production outage.
When taking on the 10-image ‘keep’ rule, George again focused solely on the current task and did not consider how his implementation would interact with the previous increment and the context (the team's deployment frequency). This time, his ‘solution’ failed to do anything at all.
Conclusion
This example shows how easy it is to lose your way if you consider only the step right in front of you. While incremental development allows you to focus on one small change at a time (and that is one of its best features), it does not mean you should work with blinders on. You still need to be aware of the bigger picture to ensure your one small change actually does what it is supposed to do – and doesn't do what it is not supposed to do.
This bigger picture includes
- what is already there (for the second increment, this is the first increment),
- the context in which the solution will operate (in George's example, the team's deployment process and cadence),
- the intent behind it all (here, to delete unnecessary stuff).
When I develop incrementally, I ask myself this question for each increment:
When this is integrated and running in the target context, will the solution as a whole support the original intent?
Hopefully, the answer is “yes”. But if it isn't, I can still celebrate, because I have caught a flaw before it could cause damage. And I now have the opportunity to rethink my approach and do better.