On planning, part 4: the critical path
November 12, 2024

This is the fourth post in my series about planning in the context of agile teams.

Part 1 is concerned with why you should plan at all.
Part 2 describes my iterative approach to end-to-end planning.
Part 3 shows how I do forwards and backwards planning.
Part 5 argues for more exploration.

In this post, I will explore the concept of the critical path.

It doesn't fit!

So I have come up with a rough plan using forwards and backwards planning, and I have tried to map it onto a timeline. Surprise, surprise! It turns out that the work does not fit neatly between the fixed points of my plan. What now? Obviously, we can play the game of ‘does this here task really take two days, can't we get it done in one?’, but this is not really interesting, so let me jump straight to the point where we realize that we simply cannot make it fit by arguing about our estimates. Enter the critical path.

The critical path

Critical path analysis is a project management method that helps you deal with time pressure by focusing on the tasks that are ‘most blocking’ for project completion. After the first plausibility check, I already have estimates for the duration of each step, and I also know the dependencies between steps. This allows me to identify the longest sequence of dependent steps, measured by total estimated duration, between the start and the goal^[1]. This sequence is called the critical path because any delay here will directly affect our chances of reaching our goal before the deadline. Tasks that are not on the critical path have some ‘wiggle room’: a delay there does not immediately threaten the deadline because the critical path, which runs in parallel, takes longer anyway.
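For readers who like to see this as code, here is a minimal Python sketch (3.9+, for `graphlib`) of how the longest sequence and the ‘wiggle room’ (slack) could be computed from estimates and dependencies. The function name, the shortened task names, and all durations (in days) are assumptions made up for illustration; they are not the real estimates from the project.

```python
from graphlib import TopologicalSorter


def critical_path(durations, depends_on):
    """Return (project_length, critical_tasks, slack) for a task graph.

    durations:  task -> estimated duration (any consistent unit, e.g. days)
    depends_on: task -> set of tasks that must be finished first
    """
    graph = {task: depends_on.get(task, set()) for task in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Forward pass: the earliest point at which each task can be finished.
    earliest_finish = {}
    for task in order:
        start = max((earliest_finish[d] for d in graph[task]), default=0)
        earliest_finish[task] = start + durations[task]
    project_length = max(earliest_finish.values())

    # Backward pass: the latest finish that does not delay the whole project.
    latest_finish = {task: project_length for task in durations}
    for task in reversed(order):
        latest_start = latest_finish[task] - durations[task]
        for dep in graph[task]:
            latest_finish[dep] = min(latest_finish[dep], latest_start)

    # Slack is the wiggle room; tasks with zero slack are critical.
    slack = {t: latest_finish[t] - earliest_finish[t] for t in durations}
    critical_tasks = [t for t in order if slack[t] == 0]
    return project_length, critical_tasks, slack


# Hypothetical durations in days, invented for this example.
durations = {
    "clarify": 5,
    "wait_for_access": 10,
    "explore": 3,
    "implement": 20,
    "communicate": 2,
    "let_services_test": 5,
}
depends_on = {
    "wait_for_access": {"clarify"},
    "explore": {"wait_for_access"},
    "implement": {"explore"},
    "let_services_test": {"implement", "communicate"},
}

length, path, slack = critical_path(durations, depends_on)
print(length, path, slack["communicate"])
```

With these invented numbers, the critical path is 43 days long and runs through every task except `communicate`, which has 36 days of slack. If a plan contains several equally long paths, `critical_tasks` lists the tasks of all of them.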

For the migration of the payment component to a new payment service provider (PSP) we did last year, the critical path identified by forwards and backwards planning looked roughly like this.

A diagram showing the following steps resulting from forwards planning, arranged from left to right, beginning at ‘now’: ‘clarify technical details’, ‘wait for access to test env’, ‘explore example scenarios manually’, ‘implement registration, auth, capture, refund’, ‘test’, where the latter two overlap; and the following steps resulting from backwards planning, arranged from right to left, beginning at ‘contract with old PSP ends’: ‘remove implementation for old PSP’, ‘disable old PSP’, ‘migrate/monitor/fix’, ‘deploy to prod’, ‘let services test’. The two series of steps are contained in a box labelled ‘Available time’ between ‘now’ and ‘contract with old PSP ends’ and have a large overlap. — This is a simplified view of our migration steps between now and the time when our contract with the old payment service provider will end. Obviously, ‘let services test’ should come after ‘test’ because we need to have a stable (well-tested) implementation first, before we ask all services that depend on the payment component to begin their integration tests.

As you can see, our critical path is much longer than the time between now and our deadline, so we have a problem. Fortunately, there are some tactics that can help. I'm going to show three of them: moving tasks beyond the critical path, risk mitigation by kicking off external tasks early, and decoupling work.

Moving tasks beyond the critical path (the easy part)

Often, when there is a deadline, not everything needs to be finished before the deadline. In our PSP example, the deadline stems from the contract with the old PSP running out. By the time the contract ends, all payments must be routed to the new PSP, but the clean-up work can happen after the deadline. So we can move this part off the critical path simply by refining our understanding of which tasks absolutely need to be done before the deadline to ensure success and which tasks can be postponed if necessary.

The same diagram as above, but with ‘remove implementation for old PSP’ moved out of the box beyond ‘contract with old PSP ends’, thus reducing, but not eliminating, the overlap. — We were able to move ‘remove implementation for old PSP’ beyond the deadline because it is not *really* critical.

This maneuver reduces the time pressure, but as you can see in the diagram, the project still does not fit into the available time frame.

Mitigate risk by kicking off external tasks early

If the project depends on external contributions, especially ones where we cannot be too sure of timely delivery, we can mitigate risk by ensuring that preparatory tasks are done early and the external tasks are not blocked. Then, the team can continue with work that does not depend on the external contribution. When the team is at the point where they need those results, chances are much higher that they will be ready.

In our example, there were two tasks which depended on third parties to collaborate with us. The first was providing access to the new PSP's test environment. When we first engaged with the PSP, they offered to conduct a few workshops to clarify the technical details and then give us access to their test environment. Recognizing that this would introduce an unwelcome delay before we could get our hands dirty^[2], we asked them to provide API access right away. They complied, speeding up onboarding significantly.

The second task not directly under our control was the testing that the services built on top of the payment component would need to do. We worked hard to keep the change as low-profile as possible, but nonetheless there were some integration tests that had to be done, and time had to be allocated for them. To minimize the risk of delay associated with this test phase, we started to engage with the service teams long before the tests were to take place. This did not shorten the test phase or allow us to bring it forward, but it ensured that the teams were ready to test when we needed them to.^[3] This engagement was done in parallel to the development work. Thus, even though it added work to the plan, it did not make the critical path longer.

In this version of the diagram, the first task, ‘clarify technical details’, has been split into ‘organize test env’, starting immediately, and ‘clarify API usage’, which is depicted in parallel to ‘organize test env’ and ‘wait for access to test env’. This shortens the critical path, but not enough to make it fit.
A new step ‘communicate changes and timeline to services’ has been added in front of ‘let services test’. — Organizing access to the test environment early allowed us to bring the whole plan forward. Communicating the change and our timeline to the services ensured that their test activities would start on time, mitigating the risk of delays at this late stage.

Decoupling work

Sometimes, tasks depend on parts of other tasks. Example: During our PSP migration, we needed to let our API consumers test their systems against our new implementation. This integration test depended on our new implementation being available in our integration test environment.

Looking at the plan as visualized above, it seems as if we had to have our implementation finished before our API consumers could begin with their tests. These two tasks are serialized on the critical path.

However, in reality our service apps only had a dependency on certain parts of the new implementation, namely registration of new payment methods and payment authorization. Background processes including deferred captures and refunds were decoupled from the direct customer experience. By splitting our implementation task cleverly, we had the relevant parts in our integration test environment earlier, so the rest of the implementation could proceed in parallel with the integration tests, saving us precious time. In fact, the integration tests were off the critical path entirely. At no point did we have to stop progress while waiting for the test results.

The long implementation and test tasks have been split into 4 separate tasks each. ‘let services test’ now follows implementation and test of registration and authorization, while capture and refund are implemented and tested in parallel. The latter tasks still overlap with ‘deploy to prod’ and ‘migrate/monitor/fix’. — By splitting the implementation and test tasks, we decoupled the integration tests of our services from implementation tasks that did not impact them, allowing implementation of those parts and integration tests to run in parallel.
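The effect of decoupling can be illustrated with a small Python sketch (3.9+) that compares a serialized plan with a split one. The helper `project_length`, the task names, and all durations (in days) are hypothetical and chosen only to show the mechanism; they do not reflect the real plan.

```python
from graphlib import TopologicalSorter


def project_length(durations, depends_on):
    """Earliest possible finish of the whole plan (length of the longest path)."""
    graph = {task: depends_on.get(task, set()) for task in durations}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[task]), default=0)
        finish[task] = start + durations[task]
    return max(finish.values())


# Serialized: services can only test once the whole implementation is done.
serialized = project_length(
    {"implement_all": 20, "let_services_test": 10, "deploy": 2},
    {
        "let_services_test": {"implement_all"},
        "deploy": {"let_services_test"},
    },
)

# Decoupled: services only need registration and authorization, so capture
# and refund are implemented while the integration tests are running.
decoupled = project_length(
    {
        "implement_registration_auth": 10,
        "implement_capture_refund": 10,
        "let_services_test": 10,
        "deploy": 2,
    },
    {
        "implement_capture_refund": {"implement_registration_auth"},
        "let_services_test": {"implement_registration_auth"},
        "deploy": {"let_services_test", "implement_capture_refund"},
    },
)

print(serialized, decoupled)
```

With these made-up numbers, the serialized plan takes 32 days and the decoupled plan 22. The total amount of work is the same; only the dependencies have changed, and the gain exists only if someone is actually available to work on the unblocked tasks in parallel.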

Moving tasks beyond the critical path (the hard part)

In a similar vein, we can sometimes move tasks beyond the critical path by investing a little extra effort to carve out and postpone uncritical parts.

In our PSP example, we would migrate users to the new PSP in batches, beginning some time before the deadline. At first glance, it seems we needed to have the complete set of functionality (registration, authorization, deferred payment (capture), and refund) implemented for the new PSP by this point in time. But actually, what really mattered were registration and authorization because these were part of the checkout flow in our apps and thus a direct part of the user experience. Payment capture, on the other hand, was a deferred batch process. If the capture functionality had not been ready at the start of the migration, most users would not even have noticed because all that would have happened was that their bank accounts would have been debited a few days later. So we split our implementation into critical and uncritical^[4] tasks and prioritized them in such a way that we could have gone live with a partial implementation.

Refund was less critical still. Firstly, we didn't need to refund before we could capture. Secondly, refunds are rare when compared to captures, so far fewer users would be impacted if refunds were delayed.

In this version of the plan, implementation and test of the capture function have been moved a little to the right and now proceed in parallel with the integration tests, deployment to production, and the migration phase. A new step ‘begin capturing’ has been added right before the old PSP is disabled. Implementation and test of the refund have been moved beyond the deadline ‘contract with old PSP ends’ and are now followed by another new step ‘begin refunding’. There are no steps left which overlap, but shouldn't. — Relaxing the requirement that capture and refund functionality must be finished before starting the migration finally made the plan seem plausible and even introduced some slack which would allow us to deal with unforeseen problems.

In the end, we did have capture and refund finished before we started the migration. Sometimes, not every risk you planned for actually materializes. But there were some minor things (nice-to-have features in the user interface and some hairy edge cases) that we did implement only after the migration had already started. So while the diagram above does not accurately depict our whole plan, it does show the kind of thinking that went into it.

Conclusion

Thinking about the critical path – the longest sequence of steps necessary to reach a goal – is a valuable exercise when dealing with deadlines because it highlights potential blockers and enables you to plan ahead and address them before they go ‘boom’.

Moving tasks beyond the critical path means realizing that not everything that needs to be done necessarily needs to be done before the deadline. In some cases, this may be obvious, in others less so. Sometimes, steps we initially think of as atomic are really composites of critical and less critical tasks, and splitting the latter out can make the difference between a doable project and a death march.^[5]

Mitigating risk by kicking off external tasks early entails identifying tasks that are not under our control but have the potential to derail our efforts when not done in time. Then we can arrange our plan so these tasks can start as early as possible and anything under our control that could possibly block them is out of the way.

Decoupling work means thinking hard about the real dependencies between tasks and, if necessary, splitting them so other tasks are not blocked by – for them – unnecessary activities. If well done, this enables parallelization, which can speed up a project dramatically.^[6]

My next (and final) post of the series on planning addresses exploration of risks and alternative paths.

Footnotes

[1] I'm going to use the start and the goal in the rest of my post for simplicity's sake, but a critical path can be identified between any two fixed points like intermediate milestones or multiple deadlines, and the following tactics will work just as well.
[2] There is no substitute for interacting directly with a new API.
[3] We also tested our changes rigorously ourselves before deploying them in the test environment accessible to the services. I mention this because, sadly, this does not seem to be the norm. Apparently, the service testers were used to broken releases and were thus understandably nervous about the short period of time we had given them to perform their tests. Our good preparation paid off, and we had positive test results in record time.
[4] Not that our background payment processing wasn't critical from a business perspective. It absolutely was. But it was not critical in the sense of the critical path. I.e., while we had to have it implemented before the project was done, and we certainly would have liked to have it finished in time for the migration start, this wasn't strictly necessary for the project to be successful.
[5] Note, however, that there is also a big difference between taking on technical debt to make a meaningful deadline and paying it off after the deadline has been successfully met on the one hand, and firefighting mode where we focus only on critical issues and never get to really finish anything because there is constant pressure with artificial deadlines on the other hand. The first is what I have in mind here. The latter leads to burn-out pretty quickly.
[6] Sometimes, splitting tasks takes extra effort, and we need to examine whether the benefits outweigh the costs. This is the case if (as in our example) the parallelization potential is large enough and if there are people available to take advantage of it. If nobody is actually taking on the tasks thus unblocked – congratulations! you have just made the project take longer.