State Management Audit for Live Execution System

Engagement

State-space audit and production test-harness remediation

Representative Duration

8-12 weeks

Scope

Reproduce production failures with persisted state and replayed market sessions
Build a state-persistent harness with restart and overnight primitives
Add runtime invariant checks tied to operator response

Mandate Boundaries

No full execution-engine rewrite
No treatment of additional unit coverage as the primary remediation path

Situation

Risq operated a live execution system with high unit test coverage and a recurring pattern of production incidents. Individual incidents were small in impact but frequent, roughly one every two weeks, and each required manual intervention. The pattern was consistent: the test suite passed cleanly, the system was redeployed, and a similar-shaped incident recurred within the following sessions.

Internal diagnosis had pointed at concurrency, external data feed reliability, and specific code paths. Each had been investigated and partially addressed without materially changing the incident rate. The rate resisted every remediation because every remediation targeted a different category of failure than the one actually producing the incidents.

State Coverage Problem

Trading strategy and execution state surface: the hidden risk was not logic correctness in isolation but state behavior across restarts, session boundaries, and replayed market sessions.

The operational problem was not insufficient code coverage. It was that the system was being tested against fresh state while capital was being run against accumulated state.

Diagnosis

The decisive diagnostic was a state-carry test. We snapshotted end-of-day production state from a non-production replica, loaded it into a staging environment, and replayed recorded market data from the following morning against the system. The system failed inside a single morning session.
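A minimal Python sketch of the shape of such a state-carry test. The state layout, `toy_step`, and `overfilled` are hypothetical stand-ins for the real engine and its checks. Only the overall procedure (restore carried state, replay the next session, check after every event) is taken from the description above.

```python
import json


def snapshot(state: dict) -> str:
    """Serialize state so it can be carried from one environment to another."""
    return json.dumps(state, sort_keys=True)


def restore(blob: str) -> dict:
    """Rebuild state from a snapshot produced by snapshot()."""
    return json.loads(blob)


def toy_step(state: dict, event: dict) -> None:
    """Hypothetical engine step: apply one fill to its order and to the position."""
    order = state["orders"].setdefault(
        event["order_id"], {"qty": event["order_qty"], "filled": 0}
    )
    order["filled"] += event["fill_qty"]
    state["position"] += event["fill_qty"]


def overfilled(state: dict) -> list:
    """Hypothetical check: ids of orders filled beyond their quantity."""
    return [oid for oid, o in state["orders"].items() if o["filled"] > o["qty"]]


def state_carry_test(blob: str, events: list, step, check):
    """Restore carried state, replay events, and stop at the first failed check.

    Returns (event_index, problems) for the first failure, or None if the
    whole replay passes.
    """
    state = restore(blob)
    for i, event in enumerate(events):
        step(state, event)
        problems = check(state)
        if problems:
            return i, problems
    return None
```

With this toy engine, replaying a single 50-lot fill on a 100-lot order passes from an empty state but fails when the snapshot already carries 60 filled on the same order, which is the fresh-versus-accumulated gap in miniature.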

The failure was neither a concurrency bug nor a data feed anomaly. The system behaved differently under accumulated state than under fresh state, and the test suite exercised only fresh state. Code-path coverage was high; state-space coverage was not measured at all.

Three categories of state accumulation were implicated:

partial fills carried overnight
position reconciliation across restarts
order state surviving network partitions

Partial fills were correct intra-session and divergent across session boundaries. Reconciliation behaved correctly on a warm system and incorrectly against persisted pre-restart state. Local order state failed to converge correctly with venue state after reconnect under specific timing, and the suite never simulated the disconnect path.
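The disconnect path can be exercised with a check as small as the following Python sketch, which compares the local view of each order with the venue's view after a simulated reconnect. `OrderView` and `diff_after_reconnect` are hypothetical names, and the fields are assumptions about what a minimal order record might hold, not the system's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderView:
    """One side's view of an order (hypothetical minimal fields)."""
    order_id: str
    status: str       # e.g. "open", "filled", "cancelled"
    filled_qty: int


def diff_after_reconnect(local: dict, venue: dict) -> list:
    """List discrepancies between local and venue order views.

    Both arguments map order_id -> OrderView. An empty result means the two
    sides have converged.
    """
    issues = []
    for oid in sorted(set(local) | set(venue)):
        mine, theirs = local.get(oid), venue.get(oid)
        if theirs is None:
            issues.append(f"{oid}: known locally, unknown at venue")
        elif mine is None:
            issues.append(f"{oid}: live at venue, missing locally")
        elif (mine.status, mine.filled_qty) != (theirs.status, theirs.filled_qty):
            issues.append(
                f"{oid}: local {mine.status}/{mine.filled_qty} "
                f"vs venue {theirs.status}/{theirs.filled_qty}"
            )
    return issues
```

A harness test would drop the connection at a chosen point in the replay, let the venue side advance, reconnect, and assert that this list is empty once reconciliation has run.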

Intervention

Unit test coverage was not increased. Coverage was already adequate for the logic being tested. A new test category was introduced alongside the existing suite.

A state-persistent test harness was implemented. Its primitives were:

a session object with explicit begin, restart, end, and overnight transitions
a state snapshot mechanism that captured and restored the full system state across transitions

Tests were then written against this harness for the session-boundary and restart scenarios the existing suite had never produced.
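One way to picture the two primitives is the Python sketch below. It is illustrative only: `ToyEngine`, the factory signature, and the use of `copy.deepcopy` as the snapshot mechanism are assumptions, and a real harness would snapshot through the engine's actual persistence layer.

```python
import copy


class ToyEngine:
    """Hypothetical stand-in for the execution engine."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"position": 0}

    def fill(self, qty):
        self.state["position"] += qty


class Session:
    """Session primitive with begin, restart, end, and overnight transitions."""

    def __init__(self, engine_factory):
        self._factory = engine_factory  # callable: carried state (or None) -> engine
        self._carried = None            # last snapshot taken at a transition
        self.engine = None

    def _require(self, running):
        if (self.engine is not None) != running:
            raise RuntimeError("illegal session transition")

    def begin(self):
        """Start a session from whatever state was carried in."""
        self._require(running=False)
        self.engine = self._factory(copy.deepcopy(self._carried))

    def restart(self):
        """Simulate a mid-session process restart from persisted state."""
        self._require(running=True)
        self._carried = copy.deepcopy(self.engine.state)
        self.engine = self._factory(copy.deepcopy(self._carried))

    def end(self):
        """Close the session and keep its final state for the next one."""
        self._require(running=True)
        self._carried = copy.deepcopy(self.engine.state)
        self.engine = None

    def overnight(self):
        """End today's session and begin the next with carried state."""
        self.end()
        self.begin()
```

A test then reads as a script of transitions, for example `begin()`, some fills, `restart()`, more fills, `overnight()`, with assertions on engine state after each step.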

Test inputs were changed from synthetic to replayed. Recorded market data from representative days was stored and replayed against the harness. Synthetic inputs were retained for logic verification but removed from the state-sensitive path, because synthetic inputs do not reliably produce the pathological input sequences that cause state-accumulation failures.
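A replay driver for recorded data can be very small. The Python sketch below assumes recordings are stored as CSV with `ts`, `symbol`, and `price` columns; that layout and the function names are assumptions for illustration, not the format used in the engagement.

```python
import csv
import io


def read_ticks(fh):
    """Yield ticks from a CSV file object with ts, symbol, price columns."""
    for row in csv.DictReader(fh):
        yield {
            "ts": float(row["ts"]),
            "symbol": row["symbol"],
            "price": float(row["price"]),
        }


def replay(ticks, on_tick):
    """Feed ticks to on_tick in recorded order and return how many were sent.

    Raises ValueError if the recording goes backwards in time, since a
    reordered recording no longer reproduces the original session.
    """
    last_ts = float("-inf")
    count = 0
    for tick in ticks:
        if tick["ts"] < last_ts:
            raise ValueError(f"out-of-order tick at ts={tick['ts']}")
        last_ts = tick["ts"]
        on_tick(tick)
        count += 1
    return count


def replay_text(text, on_tick):
    """Convenience wrapper for replaying a recording held in a string."""
    return replay(read_ticks(io.StringIO(text)), on_tick)
```

The handler passed as `on_tick` would be the harness entry point that the live feed normally calls, so the engine sees the recorded sequence exactly as it arrived.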

Continuous invariant checking was added in production. A named catalogue of invariants, including conservation of quantity, consistency between subsystems’ representations of the same position, and reconciliation agreement, was asserted on every state mutation. An invariant breach froze the affected subsystem and paged on-call. Freezing was chosen over auto-recovery because an invariant breach means the system’s model of reality is wrong, and wrong systems should not continue trading.
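The freeze-and-page behaviour can be sketched in Python as a wrapper that checks every named invariant after each mutation. `GuardedSubsystem`, `SubsystemFrozen`, the order record fields, and the `page` callable are hypothetical; a production version would hook into the real state store and alerting system.

```python
class SubsystemFrozen(Exception):
    """Raised when a mutation is attempted on a frozen subsystem."""


class GuardedSubsystem:
    """Applies mutations and asserts every invariant after each one."""

    def __init__(self, name, state, invariants, page):
        self.name = name
        self.state = state
        self.invariants = invariants  # invariant name -> predicate(state) -> bool
        self.page = page              # callable that receives an alert string
        self.frozen = False

    def mutate(self, change):
        """Apply change(state); freeze and page if any invariant breaks."""
        if self.frozen:
            raise SubsystemFrozen(self.name)
        change(self.state)
        breached = [n for n, holds in self.invariants.items() if not holds(self.state)]
        if breached:
            # No auto-recovery: stop accepting mutations and alert on-call.
            self.frozen = True
            self.page(f"{self.name} frozen; breached: {', '.join(breached)}")


def quantity_conserved(order):
    """Conservation of quantity for a single order record (assumed fields)."""
    return order["ordered"] == order["filled"] + order["remaining"] + order["cancelled"]
```

Once frozen, the subsystem rejects every further mutation until an operator intervenes, which matches the choice of freezing over auto-recovery.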

An invariant catalogue document was produced and maintained. Each invariant had a name, description, implication of breach, and defined operator response. The catalogue was the artifact, not an afterthought.
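If the catalogue is also kept in machine-readable form, each entry can carry the same four fields. The Python sketch below is an assumption about how that might look; the three invariant names come from the text above, while the descriptions, implications, and responses are illustrative placeholders rather than the catalogue's actual wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    name: str
    description: str
    implication_of_breach: str
    operator_response: str


# Placeholder wording throughout; only the invariant names are from the source.
CATALOGUE = {
    entry.name: entry
    for entry in [
        CatalogueEntry(
            name="conservation_of_quantity",
            description="Ordered quantity equals filled plus remaining plus cancelled.",
            implication_of_breach="Order bookkeeping can no longer be trusted.",
            operator_response="Keep the subsystem frozen and rebuild order state from venue records.",
        ),
        CatalogueEntry(
            name="subsystem_position_consistency",
            description="All subsystems report the same quantity for the same position.",
            implication_of_breach="At least one subsystem holds a stale or corrupted position.",
            operator_response="Keep the subsystem frozen and identify which representation diverged.",
        ),
        CatalogueEntry(
            name="reconciliation_agreement",
            description="Local positions agree with the external record after reconciliation.",
            implication_of_breach="The system's model of its holdings is wrong.",
            operator_response="Keep the subsystem frozen and reconcile manually before resuming.",
        ),
    ]
}


def runbook_line(name: str) -> str:
    """Render the operator response for a breached invariant."""
    entry = CATALOGUE[name]
    return f"{entry.name}: {entry.operator_response}"
```

Keeping the entries as data lets the paging path attach the defined operator response to the alert, so on-call starts from the catalogue rather than from the stack trace.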

State-persistent execution test cycle

01 Persist state: Snapshot real end-of-day state from a non-production replica instead of starting every test from a clean slate.
02 Replay session: Load the snapshot in staging and replay recorded market data from the following session.
03 Assert invariants: Check quantity conservation, reconciliation agreement, and subsystem consistency on every mutation.
04 Freeze on breach: Treat invariant failure as a state-trust problem and freeze the affected subsystem instead of auto-recovering blindly.

Outcome

Representative runtime signals from the referenced shadow harness report:

Shadow uptime: 153h healthy. Sprint report recorded the harness in healthy state across the cutover eligibility window.
Auth / book faults: 0×401 · 0×disabled_book. The live shadow run reported no authentication faults and no disabled-book events while components stayed green.
Workstreams: 19 / 31 complete. The report showed 11 additional ready items and 1 blocked item, indicating a stable governance boundary rather than runtime instability.
Cutover clock: T+83.5h. The system stayed healthy beyond the T+72h eligibility threshold, which is the kind of persistence signal a state harness is meant to surface.

Incident rate over the six months following remediation fell to roughly one-fifth of the prior baseline.
Remaining incidents skewed toward genuinely novel failure modes rather than recurrences.
Mean time to diagnose incidents dropped substantially because the invariant catalogue pre-classified the failure type.
The state-persistent harness became a permanent part of the development process, required alongside unit tests before merge.

Applicable Pattern

Unit test coverage and integration coverage do not measure state coverage. A system with excellent line coverage can still have a largely unexercised state space.

The diagnostic signal is blunt: if the test suite cannot survive a restart mid-session with open positions, it is not testing the system that capital actually runs on.
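As a concrete version of that signal, the Python sketch below restarts a toy engine mid-session with an open position and checks that trading continues from the persisted state. `ToyEngine` and its JSON file format are hypothetical; the point is only that the restarted instance is built from what was written to disk, not from in-memory objects.

```python
import json
import os
import tempfile


class ToyEngine:
    """Hypothetical engine that persists positions to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.positions = {}
        if os.path.exists(path):
            with open(path) as fh:
                self.positions = json.load(fh)

    def fill(self, symbol, qty):
        self.positions[symbol] = self.positions.get(symbol, 0) + qty
        self._persist()

    def _persist(self):
        # Write to a temp file, then swap it in, so a crash cannot leave a torn file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self.positions, fh)
        os.replace(tmp, self.path)


def survives_mid_session_restart() -> bool:
    """Open a position, restart from disk, keep trading, and check the totals."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "state.json")
        engine = ToyEngine(path)
        engine.fill("ABC", 100)
        engine.fill("ABC", -30)
        before = dict(engine.positions)
        del engine                    # simulate the process going away
        restarted = ToyEngine(path)   # rebuilt only from persisted state
        restarted.fill("ABC", -20)
        return before == {"ABC": 70} and restarted.positions == {"ABC": 50}
```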

The remediation signal is equally clear: state-persistent testing is a distinct test category. It requires a session primitive, replayed inputs, and explicit invariant assertions. It is additive to unit and integration testing, not a replacement.

The operational signal is that invariants asserted in testing should also be asserted in production. Same catalogue. Same thresholds. Same responses.

Verified Changes

What changed in the system's operating model.

Introduced a state-persistent harness with restart and overnight session primitives
Replaced synthetic inputs in critical paths with replayed market data
Added production invariant assertions tied to explicit operator responses
