Context & Problem
The client's platform had served the business well for eight years. At around 600,000 lines of C#, it handled everything: order processing, notifications, reporting, third-party integrations, and a public API. Four separate engineering teams committed to the same repository.
I joined as Delivery Lead when the situation had become unsustainable: deployments took three hours and required all four teams to coordinate a release window. A bug from Team A could block Teams B, C, and D from shipping. The business wanted to release features every two weeks; the architecture made that nearly impossible without significant risk.
The core ask was not "modernise the platform" — it was "let us ship independently."
Constraints
- No big-bang rewrite. The business had been burned by a previous rewrite attempt that ran 18 months and was eventually abandoned. Incremental was a hard requirement.
- Zero Kubernetes experience on the team. The operations team ran IIS on VMs; containers were new territory.
- Azure was the established cloud. AWS and GCP were off the table.
- A 12-month horizon. Within that window, the first independent deployment had to be running in production, not a demo.
Options Considered
| Approach | Pro | Con |
|---|---|---|
| Modular monolith | Lowest risk, same deployment model | Doesn't deliver independent deployments |
| Immediate full decomposition | Clean end state | 18+ months, high parallel running cost |
| Strangler fig onto AKS | Incremental, reversible, teams learn Kubernetes on real services | Distributed system complexity, requires investment in observability early |
We ruled out the modular monolith quickly — it was the right architectural improvement but the wrong business answer. Full decomposition in one pass had been tried before and failed. The strangler fig gave us a realistic roadmap that the business could see progressing.
Architecture Decision
We would extract services incrementally, running new services on AKS alongside the monolith. The monolith would continue to serve traffic; new services would own specific bounded contexts and receive traffic via Azure API Management. A feature-flag mechanism (Azure App Configuration) would let us route a percentage of traffic to new services without a hard cutover.
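The percentage-based routing can be illustrated with a minimal sketch. This is not how Azure App Configuration or API Management implement it; it only shows the underlying idea, assuming each request carries a stable key such as a customer ID. The names `rollout_bucket` and `route` are hypothetical.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a stable key (for example a customer ID) to a bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(key: str, percent_to_new_service: int) -> str:
    """Decide which backend serves this key at the current rollout percentage."""
    if not 0 <= percent_to_new_service <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # Hashing (rather than random choice) keeps each key on the same side
    # between requests, and raising the percentage only ever moves keys
    # from the monolith to the new service, never back.
    if rollout_bucket(key) < percent_to_new_service:
        return "new-service"
    return "monolith"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        print(percent, route("customer-42", percent))
```

Setting the percentage back to 0 sends every key to the monolith again, which is what makes the cutover reversible.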
The first two services targeted were deliberately low-stakes but high-visibility: the notification service (email, SMS, push) and the reporting service (PDF generation, data exports). Both were clearly bounded, had well-defined I/O, and were complained about constantly by the teams who owned them.
Implementation Highlights
CQRS for service internals. Each new service separated its read and write paths from day one. This made testing significantly easier, and the dedicated read path eliminated the N+1 query problems that plagued the monolith's reporting module.
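As a rough illustration of that read/write split, here is a minimal in-memory sketch. The production services were not written in Python, and every name here (`QueueReport`, `ReportCommandHandler`, `ReportQueries`, `ReportStore`) is hypothetical; the point is only that commands and queries go through separate objects that can be tested in isolation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueueReport:
    """Command: an intent to change state."""
    report_id: str
    owner: str


@dataclass(frozen=True)
class ReportSummary:
    """Read model: shaped for what the caller displays, nothing more."""
    report_id: str
    owner: str
    status: str


@dataclass
class ReportStore:
    """In-memory stand-in for the service's database."""
    rows: dict = field(default_factory=dict)


class ReportCommandHandler:
    """Write path: validates and applies commands, returns no data."""

    def __init__(self, store: ReportStore) -> None:
        self._store = store

    def handle(self, command: QueueReport) -> None:
        if command.report_id in self._store.rows:
            raise ValueError(f"report {command.report_id} already exists")
        self._store.rows[command.report_id] = {
            "owner": command.owner,
            "status": "queued",
        }


class ReportQueries:
    """Read path: never mutates state."""

    def __init__(self, store: ReportStore) -> None:
        self._store = store

    def summaries_for(self, owner: str) -> list:
        # One pass returns everything the caller needs, instead of one
        # follow-up lookup per row (the N+1 pattern).
        return [
            ReportSummary(report_id, row["owner"], row["status"])
            for report_id, row in self._store.rows.items()
            if row["owner"] == owner
        ]
```

The write side can be tested by asserting on stored state, and the read side by seeding the store directly, without running either through the other.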
Azure Service Bus for monolith-to-service events. Rather than calling new services synchronously from the monolith, we published domain events to Service Bus topics. New services subscribed independently. This meant the monolith didn't need to know which services existed — a critical decoupling point that let us extract services without modifying the monolith's core paths.
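The decoupling described above can be sketched with an in-memory stand-in for a topic. This is not the Azure Service Bus SDK, and `InMemoryTopic` and the event fields are hypothetical; the sketch only shows that the publisher never references its subscribers.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryTopic:
    """Stand-in for a message topic: each subscription gets its own copy of every event."""

    def __init__(self) -> None:
        self._subscriptions = defaultdict(list)

    def subscribe(self, subscription_name: str, handler: Handler) -> None:
        self._subscriptions[subscription_name].append(handler)

    def publish(self, event: dict) -> int:
        """Deliver the event to every handler and return the delivery count."""
        delivered = 0
        for handlers in self._subscriptions.values():
            for handler in handlers:
                handler(dict(event))  # copy, so subscribers cannot affect each other
                delivered += 1
        return delivered


if __name__ == "__main__":
    order_events = InMemoryTopic()
    notifications_inbox = []
    reporting_inbox = []

    # Each extracted service registers itself; the publisher is unchanged.
    order_events.subscribe("notifications", notifications_inbox.append)
    order_events.subscribe("reporting", reporting_inbox.append)

    # The monolith side only knows the topic, not who is listening.
    order_events.publish({"type": "OrderPlaced", "order_id": "A-1001"})
    print(notifications_inbox, reporting_inbox)
```

Adding a third subscriber is one more `subscribe` call on the service side; the publishing code does not change, which is the property that let us leave the monolith's core paths alone.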
Helm charts per service. Each service got its own Helm chart, its own values files per environment, and its own pipeline in Azure DevOps. A new service could be deployed without touching anything else. The DevOps team built a reusable pipeline template that reduced the setup cost of a new service to under two hours.
OpenTelemetry added in month four. This was a mistake we corrected mid-project. Distributed tracing should have been in from the start. Debugging failures that spanned the monolith and two services with only structured logs was painful. Once we added OpenTelemetry with Application Insights, root-cause time for cross-service issues dropped from hours to minutes.
Outcome
By the end of the 12-month engagement:
- Three services were running in production on AKS: notifications, reporting, and document generation.
- The notification and reporting teams had gone from one deploy per month (the monolith release window) to 10–15 deploys per week.
- The monolith continued to run without modification to its core paths — the business risk of the migration was contained.
- The fourth team had started extracting their first service independently, using the patterns the project established.
The most concrete result: a reporting bug fix that previously required a three-hour production window and sign-off from three teams now took 12 minutes from merge to production.
What I'd Do Differently
Observability from day one. We treated it as something we'd "add later." That decision cost us weeks of debugging time across the project. OpenTelemetry setup is now the first story in my project templates for any distributed system work.
Explicit SLAs between services before writing a line of code. We had only implicit assumptions about acceptable latency and error rates for inter-service calls. When the notification service had an outage, we had never defined what the monolith should do: retry, or fail open? It defaulted to synchronous failure. Defining these contracts upfront, even informally, would have prevented a production incident in month three.
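A contract like that can be as small as a few agreed numbers and one explicit decision. The sketch below is an assumption about what such a policy might look like, not what we shipped; `CallPolicy`, `call_with_policy`, and the default values are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CallPolicy:
    """The agreed behaviour when a dependency misbehaves."""
    max_attempts: int = 3
    base_delay_seconds: float = 0.2
    fail_open: bool = True  # True: the caller carries on without the dependency

    def __post_init__(self) -> None:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")


def call_with_policy(operation: Callable[[], object], policy: CallPolicy,
                     sleep: Callable[[float], None] = time.sleep):
    """Run operation under the policy and return (succeeded, result)."""
    last_error = None
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return True, operation()
        except Exception as error:  # a real policy would list retryable errors
            last_error = error
            if attempt < policy.max_attempts:
                # Exponential backoff: base, 2x base, 4x base, ...
                sleep(policy.base_delay_seconds * 2 ** (attempt - 1))
    if policy.fail_open:
        return False, None
    raise last_error
```

For notifications, failing open would probably have been the right default: an order should still complete if the confirmation email has to be sent later. Whatever the answer, the value is in writing it down per dependency before the first outage.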
A team working agreement on Kubernetes ownership. The operations team owned the infrastructure but weren't available for day-to-day Kubernetes questions. This created a bottleneck. On future engagements I'll push for a clear ownership model before the first service ships.