
Eric Crooks

Software Engineer. Developer Advocate. Open Source Contributor.


Pipeline Pipe Dreams: Architecting Elite Continuous Delivery Workflows

TLDR

Transforming an 8-hour, high-risk deployment cycle into a 1-hour, low-risk process required rethinking both the release strategy and the supporting infrastructure. We refactored the high-risk, difficult-to-reverse deployment into smaller, incremental changes behind feature flags, containerized and shipped through a fast continuous delivery pipeline with strict promotion criteria between environments. The new process meant tightening feedback loops, integrating smoke and regression tests, and standardizing deployment tooling and rollbacks. Each release became predictable, observable, and easily reversible, which dramatically reduced both perceived and actual risk. Over time, what had been a stressful, all-hands, 8-hour-plus event became a routine 1-hour deployment window aligned with continuous delivery principles—a process many teams I've seen consider unattainable.

Why the Change?

My first few MiLUMA deployments in 2022 still haunt me: writing runbooks with (what seemed to be) endless unnecessary steps, my confidence dropping with every step I wrote. I remember thinking, "Why are there so many steps? Has anyone (devs, product owners) ever questioned this process?" My colleague and I inherited the entire MiLUMA stack when the engineering teams left. Having worked mostly on the front end and back end, we had no idea how fragile and cumbersome the deployment process was until we exercised it ourselves. Deployments to Azure Container Instances (ACI), feature flags, environment variables, and promoting containers through the blue/green process all turned out to be problematic. Even the option to perform a rollback spiked anxiety, because it was a high-risk process that could lead to prolonged downtime and all-hands war rooms. Although the process worked, we knew it wouldn't scale and wouldn't support building a high-performing team. The process had to change so we could deploy and release with confidence. Not to mention doing so in less than 8 hours, a duration that should be labeled "... idek... just too long" for KPI purposes.

Problems and Solutions

Problems Faced

We faced the problems below; each is explained in more detail, along with its solution, in the following sections.

Problem 1: ACI Dynamic IP Assignment

Problem

Deploying to ACI was frustrating. Dynamic IPs were assigned to new containers on every deploy/redeploy, breaking upstream services that still pointed at the old IPs. To mitigate this, we kept the subnet pool small (nearly one IP per container), so a single redeployed container was likely to get the same IP back. This failed to scale, though: deploying two containers at once could leave them with swapped IPs, again breaking upstream services. We could've deployed containers one at a time, but that's not a high-performing operation. The mitigation only bought us time to find a proper solution. Others faced this problem too, some of them years after we did (see the Microsoft threads below).

A more frustrating problem with ACI's dynamic IP assignment was container IPs changing due to unexplained restarts. One cause (not the only one) is briefly explained in Azure's "Container had an isolated restart without explicit user input" documentation, which states, "... customers may experience restarts initiated by the ACI infrastructure due to maintenance events." We'd find out about the restarts only after being alerted that one or more containers were inaccessible because of mismatched IPs. Redeploying the containers was our temporary fix. It corrected the IPs, but it was a low-quality stopgap: redeploying took time, our customers experienced downtime, and we experienced more war rooms.

Solution

Our solution was a startup script in the container. The idea was this: when the container started, the script would read the container's IP and update upstream services via the Azure CLI. My colleague Rigo implemented this flawlessly (big ups!). This approach is similar to the answer in this thread; based on the date of that answer (a couple of years after our solution), it still seems viable.

This solution sounds simple when said succinctly, but at the time we didn't have a good lead. Stack Overflow didn't have anything relevant, nor did the Microsoft threads. All we had were some facts to work with: IPs kept changing, upstream services would break because of mismatched IPs, and we couldn't anticipate when the IPs would change. That had us thinking, "The upstream services need to be updated at the moment a container is deployed or redeployed, and we don't know when that moment is unless we're deploying manually." The only idea that made sense was to run a script on container startup that read the IP and sent it to upstream services. I remember us thinking, "... but how do we send the IP from inside the private container?" Using the Azure CLI didn't hit us immediately because we had never run automated Azure CLI commands against outside services from inside a private container. This was new territory for us. Huge shoutout to Rigo, though, for implementing a solid solution under the pressure of downtime and war rooms.
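To make the idea concrete, here's a minimal sketch of that kind of startup hook, not Rigo's actual script. The store name, the App Config key, and the choice of publishing the IP through `az appconfig kv set` are assumptions for illustration; the real script updated the upstream services our system actually used, and this sketch assumes the container is already authenticated to Azure (e.g., via a managed identity).

```python
#!/usr/bin/env python3
"""Illustrative container startup hook: publish this container's IP so upstream
services can pick it up. A sketch of the general approach, not the production
script; the store name, key, and use of `az appconfig kv set` are assumptions."""
import socket
import subprocess

APP_CONFIG_STORE = "example-appconfig-store"   # hypothetical store name
IP_KEY = "miluma/api/current-ip"               # hypothetical key read by upstream services


def current_private_ip() -> str:
    # Open a UDP socket toward a routable address to discover the local interface IP.
    # No packets are actually sent by connect() on a UDP socket.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]


def publish_ip(ip: str) -> None:
    # Requires the Azure CLI in the image and an authenticated session
    # (e.g., a managed identity or service principal login in the entrypoint).
    subprocess.run(
        ["az", "appconfig", "kv", "set",
         "--name", APP_CONFIG_STORE,
         "--key", IP_KEY,
         "--value", ip,
         "--yes"],
        check=True,
    )


if __name__ == "__main__":
    ip = current_private_ip()
    publish_ip(ip)
    print(f"Published container IP {ip} to {APP_CONFIG_STORE}/{IP_KEY}")
```

A script like this would run from the container's entrypoint before the main process starts, so every deploy, redeploy, or infrastructure-initiated restart re-publishes the current IP automatically.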

Problem 2: Feature Flag Implementation

Problem

The system was set up to use Azure App Configuration (App Config) as its feature flag store. However, it wasn't set up to request values from App Config efficiently: it fetched fresh values from App Config every time it needed to check a feature flag. No cache sat in front of App Config from which values could be pulled at a timed interval. Under high request volumes, App Config would start throttling us. Ultimately, the system became unusable because it could not fetch values from App Config. Based on the throttling configuration, the service would be temporarily unavailable, available for a few minutes, then unavailable again from the influx of requests, repeating for prolonged periods. I still remember the 429 Too Many Requests errors flooding the logs and looking at my colleague like, "I'm looking at you because I need you to help me," and I had the feeling he was looking at me with the same exact thought. Later he confirmed that's what he was thinking too. We had hit a huge blocker...

To fix this, stakeholders requested a rollback, so we did one. The 429 responses persisted. To make things worse, we experienced mismatched IPs, APIs pulling non-existent App Config values (killing containers' long-running processes), missing environment variables (preventing container startups), and overall a broken distributed system. Two more rollbacks followed. The system was finally up, but with missing features because we had rolled back a few container images. Complete nightmare. The nightmare didn't stop there because we still had to figure out how to go forward to get the already released features back in the app.

Solution

We resolved the App Config throttling by introducing a centralized cache layer that served as the single source of truth for config values across all services. The web app and microservices read from the cache, which absorbed the incoming requests and removed the rate-limit bottleneck that was causing service unavailability. You can think of this as a Proxy design pattern. Both on-prem and cloud-based services requested values from the cache rather than hitting App Config directly. This reduced config reads and latency, improved system resilience, and decreased overall App Config cost. The cache automatically synced with App Config at a set interval to stay up to date with feature flag values: make a change in App Config and it shows up in the cache in about a minute. This ensured consistent values across the distributed system, reduced the request volume to App Config, and improved the system's uptime. The cache also handled non-existent App Config values by falling back to a default set of configs we defined.
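Here's a stripped-down sketch of that caching idea in Python, assuming the azure-appconfiguration SDK, a 60-second refresh interval, and made-up keys and defaults. The real cache was a centralized layer shared by the web app and microservices rather than an in-process object, but the shape is the same: services read only from the cache, and the cache refreshes itself from App Config on a timer and falls back to defaults when a value is missing.

```python
"""Minimal sketch of a timed-refresh cache in front of Azure App Configuration.
Assumptions for illustration: the azure-appconfiguration SDK, a connection string
in an environment variable, a 60-second interval, and made-up keys/defaults."""
import os
import threading
import time

from azure.appconfiguration import AzureAppConfigurationClient

# Fallback values used when a key is missing or App Config is unreachable.
DEFAULTS = {
    "feature.new-report-engine": "false",   # hypothetical feature flag
    "api.timeout-seconds": "30",            # hypothetical setting
}

REFRESH_SECONDS = 60  # roughly the "about a minute" propagation mentioned above


class ConfigCache:
    def __init__(self, connection_string: str):
        self._client = AzureAppConfigurationClient.from_connection_string(connection_string)
        self._values = dict(DEFAULTS)
        self._lock = threading.Lock()

    def refresh(self) -> None:
        """Pull current values for known keys; keep the last known value (or the
        default) for any key that is missing or unreachable."""
        fresh = {}
        for key, default in DEFAULTS.items():
            try:
                setting = self._client.get_configuration_setting(key=key)
                fresh[key] = setting.value if setting.value is not None else default
            except Exception:
                fresh[key] = self._values.get(key, default)
        with self._lock:
            self._values = fresh

    def get(self, key: str) -> str:
        """Read from the in-memory cache; callers never hit App Config directly."""
        with self._lock:
            return self._values.get(key, DEFAULTS.get(key))

    def start_background_refresh(self) -> None:
        def loop():
            while True:
                self.refresh()
                time.sleep(REFRESH_SECONDS)
        threading.Thread(target=loop, daemon=True).start()


if __name__ == "__main__":
    cache = ConfigCache(os.environ["APP_CONFIG_CONNECTION_STRING"])
    cache.refresh()                  # populate once before first read
    cache.start_background_refresh()
    print(cache.get("feature.new-report-engine"))
```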

Dark Releases

This approach enabled us to dark release (aka dark launch) new features. This meant we could deploy new containers with new features, route traffic to those containers, and enable the features at the flip of a button (metaphorically speaking). This greatly improved the deployment and release strategy because we were able to deploy multiple times a day and release features on demand (flipping feature flags on). It also meant we could reverse a release very quickly (flipping feature flags off). Our setup enabled us to release new features—or roll them back—in about a minute. This operation was a high-performing, pivotal moment in MiLUMA's development.
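As a sketch of what "flipping the button" looked like at the code level, reusing the hypothetical ConfigCache and flag name from the sketch above: the new code path ships to production dark and only runs once the flag is turned on in App Config, and flipping it back off is the instant rollback.

```python
# Hypothetical example reusing the ConfigCache sketch above. The flag name and the
# report functions are made up; the point is that v2 ships dark and is enabled
# (or disabled again) purely by flipping the flag in App Config.

def render_report_v1(request):
    return {"engine": "v1", "data": request}   # existing behavior


def render_report_v2(request):
    return {"engine": "v2", "data": request}   # new, dark-released code path


def handle_report_request(cache, request):
    if cache.get("feature.new-report-engine") == "true":
        return render_report_v2(request)   # enabled on demand via the flag
    return render_report_v1(request)       # default until the flag is flipped on
```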

Problem 3: Shared Environment Variables

Problem

If you've worked with Infrastructure as Code (IaC) state management, you'll know that updating a shared variable triggers changes in every resource that references it. When two or more resources share the same environment variable, updating its value and re-running the IaC process may cause it to detect changes in all of those resources and attempt to recreate them, which is unnecessary if you only meant to update one. In our case, we intended to update one microservice's environment variable, and that cascaded into updating multiple microservices because they shared the same variable. This doesn't seem like a problem on the surface, but stacked on top of the problems above, it made deploying and releasing worse: more potential for downtime and rollbacks from failed releases.

Solution

Our solution was to decouple microservices as much as possible during deployments, making them truly independently deployable. This meant duplicating code in our IaC files (WET, not DRY) so that we could decouple shared environment variables. Having more IaC code was a good trade-off: we knew we could change a single set of related resources without causing another resource to be redeployed. Accepting a classic anti-pattern (repeated code) bought us decoupled infrastructure, and updating the same environment variable in multiple places was a trivial change. A really good trade-off that improved our high-performing operations.
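Our IaC wasn't Python, but the shape of the change is easy to show conceptually. In the "before" layout, several service definitions reference one shared variable, so editing it marks every service as changed in the IaC state; in the "after" layout, each service carries its own copy, so only the service you actually edit gets redeployed. Names and values here are made up.

```python
# Conceptual sketch only (our IaC was not Python); the names and values are made up.

# Before: one shared value referenced by several service definitions. Editing it
# causes the IaC process to see changes in every service that references it.
SHARED_ENV = {"UPSTREAM_API_URL": "https://api.example.internal"}

services_before = {
    "orders-api":  {"env": SHARED_ENV},
    "reports-api": {"env": SHARED_ENV},
}

# After: deliberate duplication (WET, not DRY). Editing orders-api's value no longer
# touches reports-api's definition, so only orders-api is redeployed.
services_after = {
    "orders-api":  {"env": {"UPSTREAM_API_URL": "https://api.example.internal"}},
    "reports-api": {"env": {"UPSTREAM_API_URL": "https://api.example.internal"}},
}
```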

Problem 4: Blue/Green CICD

Problem

The system's deployment strategy followed a blue/green model for production. Our container images were built in our CICD process. However, the images weren't truly blue/green-friendly because they weren't config-agnostic. Each image was tightly coupled to its environment-specific configs (e.g., application name, timeouts, log levels). This meant we had to build and maintain two separate container images if we wanted separate configs (one built and configured for the blue environment, one for the green environment). If we needed to make a fix, we had to build two images for that fix. Building two images per fix doesn't sound bad until you consider fix frequency, and we had more than one containerized API. We had to make fixes across multiple APIs and build two images every... damn... time. That grueling process could kill any software engineer's motivation indefinitely if they knew it applied to every fix that needed to go out. It was not a high-performing operation. The high-performing operation was (and still is, for any project) to build a single image that accepts dynamically provided configs.

Question: Would you rather build, tag, and push a new container image just to change the JWT signing secret it uses, OR change the secret outside the container and redeploy the container so it picks up the change? I'll always go with the latter. I think you should too. To each their own, though.

Solution

The target solution was reusable container images. We audited our CICD files and removed everything considered environment-specific: application name, timeout configs, and environment variables (ports, log levels, database URLs, and more). Once we ripped out all of those hard-coded values, we moved them into our IaC process so it could deploy container images with the variables we provided at deploy time. The trade-off: we decreased the build time for our container images and made them reusable, but we increased the number of changes we needed to make to our IaC state. A good trade-off. I'd rather make text changes than wait for containers to build (or build, error, rebuild, error, repeat... high-performing ops only, please).
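At the application level, "config-agnostic" boiled down to something like the sketch below: every environment-specific value is read from environment variables supplied at deploy time by the IaC process, so one image serves blue, green, and any other environment. The variable names and defaults are made up for illustration.

```python
"""Sketch of a config-agnostic service entrypoint. Nothing environment-specific is
baked into the image; the IaC process supplies these values as environment variables
at deploy time. Variable names and defaults are made up for illustration."""
import os

APP_NAME = os.environ.get("APP_NAME", "example-api")              # hypothetical
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")                   # hypothetical
TIMEOUT_SECONDS = int(os.environ.get("REQUEST_TIMEOUT_SECONDS", "30"))
DATABASE_URL = os.environ["DATABASE_URL"]              # required: fail fast if not provided
JWT_SIGNING_SECRET = os.environ["JWT_SIGNING_SECRET"]  # rotate outside the image, then redeploy

if __name__ == "__main__":
    print(f"Starting {APP_NAME} (log level {LOG_LEVEL}, timeout {TIMEOUT_SECONDS}s)")
    # ... start the web server with the values above ...
```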

End State

Once we had everything in place and had run through a few end-to-end deployments and releases, we tracked how long the process took across all of our microservices. The combined time (deploying and releasing all of them) was consistently about an hour, and deploying a single microservice was always under an hour. For our team (engineers, owners, stakeholders), it was a new frontier where releasing on demand became a world-class experience. No more all-hands deployments, no more war rooms, no more stress—the kind of experience software development should be for everyone involved.

I'm truly grateful to the MiLUMA team in Puerto Rico (Francisco Ramos and Kevin Avilla) for allowing my colleague and me to take on these monumental tasks and for supporting us along the way. I'm also truly grateful for InTwo's support as we navigated through the infrastructure.

continuous delivery · release on demand · distributed systems · containerization · iac