Engagement
State-space audit and production test-harness remediation
Case Study
State management audit for a live execution system that reduced recurring incidents by introducing state-persistent testing, replayed market sessions, and runtime invariant controls.
Outcome
Cut recurring incident rate by shifting testing from code coverage to state coverage
Mandate Structure
Representative Duration
8-12 weeks
Scope
Mandate Boundaries
Risq operated a live execution system with high unit test coverage and a recurring pattern of production incidents. Individual incidents were small in impact but frequent, roughly one every two weeks, and each required manual intervention. The pattern was consistent: the test suite passed cleanly, the system was redeployed, and a similar-shaped incident recurred within the next few sessions.
Internal diagnosis had pointed at concurrency, external data feed reliability, and specific code paths. Each had been investigated and partially addressed without materially changing the incident rate. The rate resisted every remediation because every remediation targeted a different category of failure than the one actually producing the incidents.
The operational problem was not insufficient code coverage. It was that the system was being tested against fresh state while capital was being run against accumulated state.
The decisive diagnostic was a state-carry test. We snapshotted end-of-day production state from a non-production replica, loaded it into a staging environment, and replayed recorded market data from the following morning against the system. The system failed inside a single morning session.
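The shape of a state-carry test can be sketched in Python against a toy system. Everything named here is a hypothetical stand-in rather than the audited system: `ToySystem`, its event shapes, and in particular the session-scoped order ids, which are invented only to give the sketch a fault that appears under carried state and not under fresh state.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ToySystem:
    """Hypothetical stand-in for the execution system under test."""
    open_orders: dict = field(default_factory=dict)  # order id -> remaining qty
    position: int = 0

    def snapshot(self) -> str:
        # Serialize the full in-memory state, as an end-of-day snapshot would.
        return json.dumps(asdict(self))

    @classmethod
    def from_snapshot(cls, blob: str) -> "ToySystem":
        return cls(**json.loads(blob))

    def on_event(self, ev: dict) -> None:
        if ev["type"] == "new_order":
            if ev["id"] in self.open_orders:
                raise RuntimeError(f"duplicate order id {ev['id']}")
            self.open_orders[ev["id"]] = ev["qty"]
        elif ev["type"] == "fill":
            self.open_orders[ev["id"]] -= ev["qty"]
            self.position += ev["qty"]
            if self.open_orders[ev["id"]] == 0:
                del self.open_orders[ev["id"]]


def replay(system: ToySystem, events: list):
    """Feed recorded events in order; return (index, message) of the first failure."""
    for i, ev in enumerate(events):
        try:
            system.on_event(ev)
        except Exception as exc:
            return (i, str(exc))
    return None


# Day 1 ends with order "1" only partially filled, then state is snapshotted.
day1 = ToySystem()
for ev in [
    {"type": "new_order", "id": "1", "qty": 100},
    {"type": "fill", "id": "1", "qty": 60},
]:
    day1.on_event(ev)
eod_snapshot = day1.snapshot()

# Recorded events from the following morning (assumed: ids restart at "1").
morning = [
    {"type": "new_order", "id": "1", "qty": 50},
    {"type": "fill", "id": "1", "qty": 50},
]

fresh_result = replay(ToySystem(), morning)
carried_result = replay(ToySystem.from_snapshot(eod_snapshot), morning)
```

The same recorded morning passes against fresh state and fails on its first event against carried state, which is the asymmetry the state-carry test is designed to expose.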
The failure mode was not a concurrency bug or a data feed anomaly. The system behaved differently under accumulated state than under fresh state, and the test suite exercised fresh state exclusively. Code-path coverage was high. State-space coverage was not a measured dimension at all.
Three categories of state accumulation were implicated:
Partial fills were correct intra-session and divergent across session boundaries.
Reconciliation behaved correctly on a warm system and incorrectly against persisted pre-restart state.
Local order state failed to converge correctly with venue state after reconnect under specific timing, and the suite never simulated the disconnect path.
Unit test coverage was not increased. Coverage was already adequate for the logic being tested. A new test category was introduced alongside the existing suite.
A state-persistent test harness was implemented. Its primitives were:
Session begin, restart, end, and overnight transitions.
Tests were then written against this harness for the session-boundary and restart scenarios the existing suite had never exercised.
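One way the session transitions could be exposed to tests is sketched below. This is an illustration under stated assumptions, not the harness that was built: `SessionHarness` and its method names are hypothetical, and persisted state is modelled as a JSON string held in memory where a real harness would use the system's own persistence layer.

```python
import json
from typing import Callable


class SessionHarness:
    """Drives a system through session transitions while keeping persisted state."""

    def __init__(self, load: Callable[[str], dict], dump: Callable[[dict], str], initial: str):
        self._load = load
        self._dump = dump
        self._persisted = initial  # stands in for on-disk state
        self.system = None         # in-memory state; None between sessions
        self.transitions = []

    def begin(self) -> None:
        # Start a session from whatever state was last persisted.
        self.system = self._load(self._persisted)
        self.transitions.append("begin")

    def restart(self) -> None:
        # Mid-session process restart: persist, drop memory, reload.
        self._persisted = self._dump(self.system)
        self.system = self._load(self._persisted)
        self.transitions.append("restart")

    def end(self) -> None:
        # Close the session and keep only the persisted state.
        self._persisted = self._dump(self.system)
        self.system = None
        self.transitions.append("end")

    def overnight(self) -> None:
        # End today's session and begin tomorrow's on the carried state.
        self.end()
        self.begin()


# Usage: a partial fill must survive both a restart and an overnight boundary.
h = SessionHarness(load=json.loads, dump=json.dumps, initial='{"filled": {}}')
h.begin()
h.system["filled"]["A"] = 60    # partial fill before the restart
h.restart()
h.system["filled"]["A"] += 40   # remainder fills after the restart
h.overnight()
carried = h.system["filled"]["A"]
```

A test written this way starts from whatever the previous transition left behind, so it cannot quietly fall back to a clean slate.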
Test inputs were changed from synthetic to replayed. Recorded market data from representative days was stored and replayed against the harness. Synthetic inputs were retained for logic verification but removed from the state-sensitive path, because synthetic inputs do not reliably produce the pathological input sequences that cause state-accumulation failures.
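A minimal replay loop might look like the following, assuming recorded ticks are stored as CSV. The column names, the `load_session` and `replay_session` helpers, and the four sample rows are all made up for illustration; the rows include a repeated timestamp and a crossed quote to stand for the kind of sequence a simple synthetic generator is unlikely to emit.

```python
import csv
import io
from typing import Callable, Iterable, Iterator

# Illustrative recording only; not real market data.
RECORDED = """ts,symbol,bid,ask
09:30:00.000,XYZ,100.00,100.02
09:30:00.000,XYZ,100.01,100.02
09:30:00.150,XYZ,100.03,100.01
09:30:00.300,XYZ,100.00,100.02
"""


def load_session(fileobj) -> Iterator[dict]:
    """Yield ticks in exactly the order they were recorded."""
    for row in csv.DictReader(fileobj):
        yield {
            "ts": row["ts"],
            "symbol": row["symbol"],
            "bid": float(row["bid"]),
            "ask": float(row["ask"]),
        }


def replay_session(ticks: Iterable[dict], on_tick: Callable[[dict], None]) -> int:
    """Push every recorded tick into the handler; return how many were replayed."""
    count = 0
    for tick in ticks:
        on_tick(tick)
        count += 1
    return count


crossed = []


def handler(tick: dict) -> None:
    # A real handler would be the system under test; this one only records
    # crossed quotes (bid above ask) to show the replay preserves them.
    if tick["bid"] > tick["ask"]:
        crossed.append(tick["ts"])


replayed = replay_session(load_session(io.StringIO(RECORDED)), handler)
```

The replay deliberately does no sorting, de-duplication, or cleaning, since the irregularities in the recording are the point of using it.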
Continuous invariant checking was added in production. A named catalogue of invariants, including conservation of quantity, consistency between subsystems’ representations of the same position, and reconciliation agreement, was asserted on every state mutation. An invariant breach froze the affected subsystem and paged on-call. Freezing was chosen over auto-recovery because an invariant breach means the system’s model of reality is wrong, and wrong systems should not continue trading.
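The freeze-on-breach behaviour can be sketched as a guard around every state mutation. The `Subsystem` class, the state fields, and the `page` callback are hypothetical names chosen for the example; only the policy (check on every mutation, freeze and page on breach, no auto-recovery) comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subsystem:
    name: str
    state: dict
    invariants: list             # list of (invariant name, predicate over state)
    page: Callable[[str], None]  # stand-in for the on-call paging hook
    frozen: bool = False

    def mutate(self, fn: Callable[[dict], None]) -> None:
        # A frozen subsystem rejects all further mutations until an operator acts.
        if self.frozen:
            raise RuntimeError(f"{self.name} is frozen")
        fn(self.state)
        # Assert every invariant after every mutation.
        for inv_name, holds in self.invariants:
            if not holds(self.state):
                self.frozen = True
                self.page(f"{self.name}: invariant breached: {inv_name}")
                break


pages = []
book = Subsystem(
    name="order_book",
    state={"ordered": 100, "filled": 0, "remaining": 100},
    invariants=[
        # Conservation of quantity: nothing appears or disappears.
        ("quantity_conservation", lambda s: s["ordered"] == s["filled"] + s["remaining"]),
    ],
    page=pages.append,
)


def good_fill(s: dict) -> None:
    s["filled"] += 40
    s["remaining"] -= 40


def bad_fill(s: dict) -> None:
    s["filled"] += 10  # forgets to reduce remaining, breaking conservation


book.mutate(good_fill)  # invariant holds
book.mutate(bad_fill)   # invariant breached: subsystem freezes and pages

try:
    book.mutate(good_fill)
    rejected = False
except RuntimeError:
    rejected = True
```

Note that the guard does not roll back or repair the offending mutation. It only stops further changes, leaving the decision about what the state should be to the operator.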
An invariant catalogue document was produced and maintained. Each invariant had a name, description, implication of breach, and defined operator response. The catalogue was the artifact, not an afterthought.
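As a sketch of how such a catalogue could be kept machine-readable, the four fields can be modelled as a small record type. The three entry names follow the invariants listed above; the field wording, the `runbook_line` and `missing_entries` helpers, and every description and response string are illustrative assumptions, not the contents of the actual catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvariantEntry:
    name: str
    description: str
    breach_implication: str
    operator_response: str


# Illustrative entries only; the wording is assumed for the example.
CATALOGUE = {
    e.name: e
    for e in [
        InvariantEntry(
            name="quantity_conservation",
            description="Ordered quantity equals filled plus remaining plus cancelled.",
            breach_implication="Fills or cancels were lost or double-counted.",
            operator_response="Keep subsystem frozen; reconcile against venue records.",
        ),
        InvariantEntry(
            name="position_consistency",
            description="All subsystems report the same position for an instrument.",
            breach_implication="At least one subsystem holds a stale or corrupt view.",
            operator_response="Keep subsystem frozen; identify the divergent source.",
        ),
        InvariantEntry(
            name="reconciliation_agreement",
            description="Internal records agree with the reconciliation source.",
            breach_implication="Internal state has drifted from external truth.",
            operator_response="Keep subsystem frozen; rerun reconciliation manually.",
        ),
    ]
}


def runbook_line(name: str) -> str:
    """Render the line an operator would see when the named invariant is breached."""
    e = CATALOGUE[name]
    return f"[{e.name}] {e.breach_implication} -> {e.operator_response}"


def missing_entries(asserted_names) -> list:
    """Names asserted in code that have no catalogue entry."""
    return sorted(set(asserted_names) - set(CATALOGUE))


undocumented = missing_entries(["quantity_conservation", "venue_heartbeat"])
```

A check like `missing_entries` could run in CI so that no invariant is asserted in code without a documented operator response.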
State Harness
Snapshot real end-of-day state from a non-production replica instead of starting every test from a clean slate.
Load the snapshot in staging and replay recorded market data from the following session.
Check quantity conservation, reconciliation agreement, and subsystem consistency on every mutation.
Treat invariant failure as a state-trust problem and freeze the affected subsystem instead of auto-recovering blindly.
Live Harness Signals
Shadow uptime
153h healthy
The sprint report recorded the harness in a healthy state across the cutover eligibility window.
Auth / book faults
0×401 · 0×disabled_book
The live shadow run reported no authentication faults and no disabled-book events while components stayed green.
Workstreams
19 / 31 complete
The report showed 11 additional ready items and 1 blocked item, indicating a stable governance boundary rather than runtime instability.
Cutover clock
T+83.5h
The system stayed healthy beyond the T+72h eligibility threshold, which is the kind of persistence signal a state harness is meant to surface.
Unit test coverage and integration coverage do not measure state coverage. A system with excellent line coverage can still have a largely unexercised state space.
The diagnostic signal is blunt: if the test suite cannot survive a restart mid-session with open positions, it is not testing the system that capital actually runs on.
The remediation signal is equally clear: state-persistent testing is a distinct test category. It requires a session primitive, replayed inputs, and explicit invariant assertions. It is additive to unit and integration testing, not a replacement.
The operational signal is that invariants asserted in testing should also be asserted in production. Same catalogue. Same thresholds. Same responses.
Verified Changes