Engagement
Release-process remediation for production inference systems
Case Study
Deployment pipeline remediation for a production ML inference stack, converting manual release risk into executable promotion gates, canary verification, and automated rollback.
Outcome
Reduced bad-model recovery from multi-hour rollback to under five minutes
Mandate Structure
Representative Duration
10-14 weeks
Scope
Mandate Boundaries
The firm ran inference models in production as part of a live decision pipeline. Model research cadence was high. Deployment cadence was not. The interval between “model approved by research” and “model serving traffic” was measured in weeks, and that delay originated entirely inside the release process, not in any research or engineering blocker.
Deployments were manual. Promotion from staging to production happened when an engineer copied a container tag into a config file and re-deployed. Rollback was the reverse operation, contingent on someone being awake and correctly interpreting the signal that a rollback was necessary. Model-specific telemetry existed but was monitored in a dashboard separate from infrastructure telemetry, so nobody was watching both at the moment either started going wrong.
Recovery time from a bad model deployment was nominally measured in hours. When measured formally, the real number was worse than the firm’s internal assumption. That is the typical finding and should be assumed true of any firm that has not explicitly exercised the measurement.
This engagement was not about model quality. It was about the production control plane around the model. The release system, not the research system, was carrying the avoidable risk.
The issue was not engineering capability. The engineering team ran disciplined CI/CD for application services. The issue was that ML artifacts had never been brought under the same discipline, because ML artifacts originally entered the release process through researcher-to-engineer handoffs, and the release path had grown up around that shape without ever being formalized.
Three concrete gaps mattered:
A checklist existed, but nobody owned enforcing it. A canary existed, but without output comparison it was only a slow manual deploy. Rollback existed, but on the human availability curve rather than the detection curve.
Scope was narrow: the release surface. No changes to modeling, data pipelines, or feature infrastructure.
Harness was selected as the pipeline platform because the pipeline definition is a versioned, reviewable artifact. Changes to deployment pass through the same review process as changes to what gets deployed, which closes a class of drift where “how we deploy” diverges quietly from “how we documented we deploy.”
Promotion gates were defined as executable checks. Promotion required passing evaluation, latency, memory, and drift checks.
Any failure halted promotion automatically with no override path.
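A minimal sketch of what an executable gate set could look like, assuming the four check families named in this case study (evaluation, latency, memory, drift). The metric names, threshold values, and function names below are hypothetical illustrations, not the firm's actual configuration or any Harness API. The point it shows is structural: the promotion decision is computed from the checks alone, and the function signature offers no override argument.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


# Hypothetical limits for illustration; real values would come from the
# team's own service-level objectives.
LIMITS: Dict[str, float] = {
    "min_eval_score": 0.90,
    "max_p99_latency_ms": 50.0,
    "max_memory_mb": 4096.0,
    "max_drift_score": 0.10,
}

# (gate name, metric key, limit key, comparison that must hold to pass)
GATES = [
    ("evaluation", "eval_score", "min_eval_score", operator.ge),
    ("latency", "p99_latency_ms", "max_p99_latency_ms", operator.le),
    ("memory", "peak_memory_mb", "max_memory_mb", operator.le),
    ("drift", "drift_score", "max_drift_score", operator.le),
]


def run_gates(metrics: Dict[str, float],
              limits: Dict[str, float] = LIMITS) -> List[GateResult]:
    """Evaluate every gate. A missing metric counts as a failed gate."""
    results = []
    for name, metric_key, limit_key, holds in GATES:
        value = metrics.get(metric_key)
        limit = limits[limit_key]
        if value is None:
            results.append(GateResult(name, False, f"{metric_key} missing"))
        else:
            detail = f"{metric_key}={value} limit={limit}"
            results.append(GateResult(name, holds(value, limit), detail))
    return results


def can_promote(metrics: Dict[str, float],
                limits: Dict[str, float] = LIMITS) -> bool:
    """Promotion proceeds only if every gate passes; there is no override."""
    return all(result.passed for result in run_gates(metrics, limits))
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when its input is absent is advisory in practice.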
Canary verification was implemented as a first-class pipeline stage. New model versions received a defined traffic share, their predictions on overlapping inputs were compared to the incumbent, and divergence beyond tolerance halted promotion and triggered rollback.
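The comparison step can be sketched as follows, under stated assumptions: predictions are scalar scores keyed by a request identifier, "overlapping inputs" means requests scored by both versions, and divergence is measured as the share of overlapping requests whose scores differ by more than a per-prediction tolerance. The function names and both tolerance values are hypothetical; the case study does not specify the metric the firm used.

```python
from typing import Dict


def divergence_rate(incumbent: Dict[str, float],
                    candidate: Dict[str, float],
                    per_prediction_tol: float = 0.05) -> float:
    """Share of overlapping inputs where the two versions disagree."""
    shared = incumbent.keys() & candidate.keys()
    if not shared:
        # With nothing to compare, the canary has verified nothing.
        raise ValueError("no overlapping inputs to compare")
    diverged = sum(
        1 for request_id in shared
        if abs(incumbent[request_id] - candidate[request_id]) > per_prediction_tol
    )
    return diverged / len(shared)


def canary_verdict(incumbent: Dict[str, float],
                   candidate: Dict[str, float],
                   per_prediction_tol: float = 0.05,
                   max_divergence_rate: float = 0.01) -> str:
    """Return the pipeline outcome for the canary stage."""
    rate = divergence_rate(incumbent, candidate, per_prediction_tol)
    return "halt_and_rollback" if rate > max_divergence_rate else "continue"


# Usage: only request IDs seen by both versions are compared.
incumbent_scores = {"req-1": 0.50, "req-2": 0.20, "req-3": 0.90}
candidate_scores = {"req-1": 0.51, "req-2": 0.20, "req-3": 0.30, "req-4": 0.10}
print(canary_verdict(incumbent_scores, candidate_scores))
```

Raising on an empty overlap, rather than returning a rate of zero, keeps a canary that received no comparable traffic from being read as a clean pass.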
Rollback became a pipeline stage rather than a runbook. Defined telemetry signals such as prediction divergence, error rate, and latency fired rollback without human intervention. The rollback path was exercised on a recurring schedule in staging, not only when needed in production. A rollback path that has never been exercised is not a rollback path you own.
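As an illustration only, the trigger logic and the recurring staging drill might look like the sketch below. It assumes the three signals named above, a hypothetical `rollback` callable supplied by the pipeline, and invented threshold values. The 300-second budget mirrors the five-minute recovery figure cited in this case study; everything else is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Telemetry:
    prediction_divergence: float
    error_rate: float
    p99_latency_ms: float


# Hypothetical rollback thresholds, expressed in the same shape as a sample.
THRESHOLDS = Telemetry(prediction_divergence=0.01,
                       error_rate=0.02,
                       p99_latency_ms=50.0)

SIGNALS = ("prediction_divergence", "error_rate", "p99_latency_ms")


def breached_signals(sample: Telemetry,
                     limits: Telemetry = THRESHOLDS) -> List[str]:
    """Names of the signals whose value exceeds its threshold."""
    return [name for name in SIGNALS
            if getattr(sample, name) > getattr(limits, name)]


def evaluate_and_rollback(sample: Telemetry,
                          rollback: Callable[[], None],
                          limits: Telemetry = THRESHOLDS) -> List[str]:
    """Fire rollback with no human step if any signal is breached."""
    breached = breached_signals(sample, limits)
    if breached:
        rollback()
    return breached


def rollback_drill(rollback: Callable[[], None],
                   budget_seconds: float = 300.0) -> bool:
    """Exercise the rollback path and check it finished within budget."""
    start = time.monotonic()
    rollback()
    return time.monotonic() - start <= budget_seconds
```

Running `rollback_drill` on a schedule in staging turns the recovery-time claim into a measured number rather than an assumed one.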
Model telemetry was consolidated into the same monitoring surface as infrastructure telemetry, so the on-call engineer saw both at once.
Release Sequence
Evaluation, latency, memory, and drift checks became blocking release criteria rather than advisory review steps.
Overlapping traffic allowed incumbent-versus-candidate output comparison before broader promotion.
Prediction divergence, latency, and error signals triggered rollback as a pipeline outcome instead of a manual runbook.
The deployment pipeline for ML artifacts should meet the same discipline as the deployment pipeline for application code. The artifact is different; the controls are the same.
Promotion is executable, not advisory. A gate that is not executed is not a gate.
Canary deployments verify predictions, not just traffic. A canary without comparison is a slow manual deploy.
Rollback is a pipeline stage. If rollback requires a human to be awake, recovery time is bounded by sleep cycles.
Model telemetry should be monitored alongside infrastructure telemetry, on the same pane of glass, by the same on-call surface.
The discipline test is simple: can your pipeline roll back a production model in under five minutes without paging a human? If not, your operational risk is higher than your risk committee is pricing.
Verified Changes