2026-05-15

Codex Pipeline Forensic Review

A blunt outside review of the Codex53 / Codex54 / Codex55 main traders and the CodexF / Codex54F / Codex55F fade lanes. Verified against runtime logs, Vanta execution logs, configs, and trade-diary outputs. No code edits, no service restarts — analysis only.

Window: 2026-04-05 → 2026-05-15
Lanes: 6
Records examined: ~400k runtime + ~215k Vanta
Mode: paper

Sections
  1. Executive Verdict
  2. Evidence
  3. Architecture Diagnosis
  4. Future Course
  5. Decision Framework
  6. Immediate Next Actions

Blunt up front

The current Codex pipeline is not a trading agent. It is a four-layer veto stack with an LLM at the top emitting mostly-FLAT opinions, a stalker that only sees pre-filtered near-price ideas, a suspicion gate that fires hard vetoes any time the model defaults to phase="unclear", and a Vanta layer that re-litigates everything with another participation gate and a 2–3 point staleness guard. The pre-May-5 version was a money-losing signal-spammer at a ~37% paper win rate. The post-May-5 version replaced it with a money-saving silence machine. Neither is the “cunning directional trader” the design called for.

1. Executive Verdict

Functional failure against the stated goal. The pipeline is partially viable for risk control and not viable for opportunity capture. It avoids the prior loss rate by simply not trading. The architecture is overweight on serial vetoes, underweight on directional planning, and the diary feedback loop only knows about kills — not about kills-that-should-not-have-been-kills. Continuing with this stack and merely “loosening filters” will not fix the conceptual problem.

2. Evidence

Volume and recency

Lane              Runtime lines  First event  Last event
Codex53 trader          182,329  2026-04-10   2026-05-15 11:21
Codex54 trader          189,279  2026-04-05   2026-05-15 11:21
Codex55 trader           26,953  2026-04-23   2026-05-15 11:21
Vanta Codex53           122,534               2026-05-15
Vanta Codex54            79,176               2026-05-15
Vanta Codex55            13,839               2026-05-15
Vanta CodexF                 73               2026-05-15
Vanta Codex54F               53               2026-05-15
Vanta Codex55F               21               2026-05-15

LLM decision distribution (full history)

Lane     Total decisions  FLAT %  LONG %  SHORT %
Codex53           90,140   90.6%    5.3%     4.1%
Codex54           88,358   90.6%    4.9%     4.5%
Codex55            8,128   89.6%    4.7%     5.8%

Roughly 9–10% of polls produce a directional opinion (9.4% on Codex53 and Codex54, 10.4% on Codex55). The prompt biases the model toward FLAT and tells it the diary lesson is that “persuasive entries tagged moved_too_soon have been toxic.” With diary memory drawn from a recent losing paper run, FLAT is the safer answer for the model on every poll.

Daily funnel — Codex53 (last week)

day        armed  trig  susp  badloc  sigval_fail  ati_sub  fills
2026-05-08      3     3     0      3            1        1      2
2026-05-11     49    18     0    381          260       10     10
2026-05-12    164    54     0    663          381       26     26
2026-05-13    138    41    14    911          385       21     21
2026-05-14    223     7    86   1095          398        4      4
2026-05-15    126     2    14    463          415        0      0

Daily funnel — Codex54 (last week)

day        armed  trig  susp  badloc  sigval_fail  ati_sub  fills
2026-05-08     37    14     0    195           81       10     19
2026-05-11     74    26     0    926          141       18     26
2026-05-12     80    35     0    585          132       12     12
2026-05-13     75    29     3    817          127       13     13
2026-05-14    109     4    34    835          163        1      1
2026-05-15     56     3    14    360          122        1      1

Daily funnel — Codex55 (last week)

day        armed  trig  susp  badloc  sigval_fail  ati_sub  fills
2026-05-11     31     3     0    138           60        2      2
2026-05-12     48     5     0    171           69        2      2
2026-05-13     51    11     1    229           88        4      4
2026-05-14     35     0     4    135           47        0      0
2026-05-15     18     0     0     79           43        0      0

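Over the last four days Codex53 armed roughly 125–225 theses per day (Codex54 about 55–110, Codex55 about 20–50). On 2026-05-14 and 2026-05-15 only about 2–5% of armed theses on Codex53 and Codex54 reached a stalker trigger, down from roughly 30–45% on 2026-05-11 through 2026-05-13. Codex55 is a smaller-volume clone of Codex54 and produced 0 triggers and 0 fills on the last two days. Codex55F (the fade lane for Codex55) cannot be fed because Codex55 produces almost no qualifying suspicion-blocks.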
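The funnel ratios can be recomputed directly from the tables. The sketch below does this for the Codex53 rows only; the tuple layout and the function name `conversion` are illustrative choices, not part of the pipeline.

```python
# Codex53 rows copied from the funnel table: (day, armed, trig, fills).
CODEX53_FUNNEL = [
    ("2026-05-08", 3, 3, 2),
    ("2026-05-11", 49, 18, 10),
    ("2026-05-12", 164, 54, 26),
    ("2026-05-13", 138, 41, 21),
    ("2026-05-14", 223, 7, 4),
    ("2026-05-15", 126, 2, 0),
]


def conversion(rows):
    """Return {day: (trigger_rate, fill_rate)} as fractions of armed theses."""
    out = {}
    for day, armed, trig, fills in rows:
        if armed == 0:
            out[day] = (0.0, 0.0)  # nothing armed, nothing to convert
        else:
            out[day] = (trig / armed, fills / armed)
    return out


rates = conversion(CODEX53_FUNNEL)
for day, (trigger_rate, fill_rate) in rates.items():
    print(f"{day}  trigger/armed={trigger_rate:6.1%}  fills/armed={fill_rate:6.1%}")
```

Running the same function over the Codex54 and Codex55 rows gives a per-lane view of where the trigger rate collapsed.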

Where attrition happens (Codex53 since 2026-05-08)

Filter stage                                       Count            What kills it
LLM FLAT                                           15,784 / 20,626  76.5% of polls; the model declines
Stalker bad-location (logged each poll)            3,516            Thesis alive but price not at value
signal_validation_failed total                     1,917            entry_too_far (1,074), generic_structure_not_enough (426),
                                                                    counter_bias (266), remembered_resistance/support (113),
                                                                    rr_below_min (38)
entry_stalk_suspicion_blocked                      114              Hard vetoes from suspicion gate
ny_no_entry_cutoff_standby                         1,312            NY cutoff 15:30–20:00 ET
Vanta participation_gate:decision_not_allow:BLOCK  1,053            ATLAS decision packet says don’t trade
Vanta codex_execution_guard_blocked at ≤3 pts      21               Price drifted from planned entry
Vanta stale_source_signal                          15               Source signal older than 45s
Reached ati_order_submitted                        ~86

Concrete blocked-trade examples (Codex53, 2026-05-14 ET)

00:06:41 SHORT 29526.75/29528.75→29517.0   risk=2  reward=9.75  RR=4.9
         vetoes=[unclear_setup_failed_cross_exam, no_defended_trade_location]

00:43:13 LONG  29578.25/29576.25→29588.75  risk=2  reward=10.5  RR=5.25
         vetoes=[unclear_setup_failed_cross_exam, no_defended_trade_location]

00:50:44 LONG  29583.5 /29581.5 →29593.0   risk=2  reward=9.5   RR=4.75
         vetoes=[no_defended_trade_location]

00:55:20 SHORT 29579.75/29581.75→29568.75  risk=2  reward=11.0  RR=5.5
         vetoes=[unclear_neutral_middle_requires_hard_evidence,
                 generic_middle_without_extra_confirmation,
                 unclear_setup_failed_cross_exam,
                 no_defended_trade_location]

These have ≥4.5 RR with tight 2-pt stops at the band/value zone. They were killed not because the geometry was bad but because the LLM defaulted phase to “unclear” and didn’t volunteer enough free-form trap-language or preferred-structure tags. The gate is taxing the vocabulary of the LLM, not the structure of the setup.
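The geometry of these examples can be checked independently of any LLM wording. The sketch below assumes the log format is entry/stop→target, which is inferred from the risk and reward figures printed on each line; `risk_reward` is a hypothetical helper, not a function in the codebase.

```python
def risk_reward(side, entry, stop, target):
    """Return (risk, reward, rr) in points for a LONG or SHORT plan."""
    if side == "LONG":
        risk, reward = entry - stop, target - entry
    elif side == "SHORT":
        risk, reward = stop - entry, entry - target
    else:
        raise ValueError(f"unknown side: {side!r}")
    if risk <= 0 or reward <= 0:
        raise ValueError("stop and target must sit on opposite sides of entry")
    return risk, reward, reward / risk


# Two of the blocked Codex53 candidates from 2026-05-14 ET.
print(risk_reward("SHORT", 29526.75, 29528.75, 29517.0))   # 00:06:41
print(risk_reward("LONG", 29578.25, 29576.25, 29588.75))   # 00:43:13
```

A check like this depends only on prices, which is the property the suspicion gate currently lacks.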

Fade lane evidence

Diary outcomes (full history)

Analyst              Closed trades  Target  Stop  Loss-exit  Profit-exit  Win rate
OpenClaw Codex (53)            465     154   203         86           21     37.7%
OpenClaw Codex54               355     111   139         87           17     36.2%
OpenClaw Codex55                18       0    11          5            2     11.1%

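Default MIN_RR = 1.15 with an RR ladder of 1.25–2.25. A 37% win rate needs an average realized reward-to-risk of about 1.7 (0.63 / 0.37) just to break even before costs, which is above MIN_RR and the lower rungs of the ladder. Unless most fills sat on the top rungs and reached their full targets, the book was net losing in any honest reckoning. The skepticism layer added in early May was a rational response to a real loss record, but it then went too far the other way.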
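The breakeven arithmetic can be sketched as follows. This is a simplified model that assumes every winner pays the full planned RR and every loser costs exactly 1R, with no fees or slippage; the diary's loss-exit and profit-exit columns suggest real outcomes are messier, so treat the numbers as an upper bound on what the ladder could deliver.

```python
def expectancy_r(win_rate, rr):
    """Expected R per trade if wins pay +rr and losses pay -1."""
    return win_rate * rr - (1.0 - win_rate)


def breakeven_rr(win_rate):
    """Reward-to-risk needed for zero expectancy at a given win rate."""
    return (1.0 - win_rate) / win_rate


def breakeven_win_rate(rr):
    """Win rate needed for zero expectancy at a given reward-to-risk."""
    return 1.0 / (1.0 + rr)


WIN_RATE = 0.37  # approximate paper win rate for Codex53 / Codex54
for rr in (1.15, 1.25, 2.25):  # MIN_RR and the two ends of the ladder
    print(f"RR={rr:.2f}  expectancy={expectancy_r(WIN_RATE, rr):+.3f}R  "
          f"breakeven win rate={breakeven_win_rate(rr):.1%}")
print(f"breakeven RR at {WIN_RATE:.0%} win rate: {breakeven_rr(WIN_RATE):.2f}")
```

Under these assumptions the expectancy is negative at MIN_RR and at the bottom of the ladder, and turns positive only once the realized RR exceeds roughly 1.7.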

The diary additionally has 487 trade / published_not_executed records, 336 trade / blocked, 98 thesis / skipped, and 26 thesis / blocked. None of these have post-hoc R-multiple reconstruction — the diary never asks “what would this rejected or unexecuted candidate have done if taken?”
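A post-hoc reconstruction of that kind could look like the sketch below. It assumes the candidate is filled at its planned entry and that subsequent bars are available as (high, low) pairs; when a single bar touches both stop and target, the stop is counted first. The function name and the bar values in the example are hypothetical, not market data.

```python
def replay_r_multiple(side, entry, stop, target, bars):
    """Walk bars after a hypothetical fill and return the realized R.

    Returns -1.0 if the stop is hit first, reward/risk if the target is hit
    first, or None if neither level is reached within the supplied bars.
    """
    risk = abs(entry - stop)
    if risk == 0:
        raise ValueError("entry and stop must differ")
    reward = abs(target - entry)
    for high, low in bars:
        if side == "LONG":
            hit_stop, hit_target = low <= stop, high >= target
        else:
            hit_stop, hit_target = high >= stop, low <= target
        if hit_stop:  # pessimistic tie-break when both levels are in one bar
            return -1.0
        if hit_target:
            return reward / risk
    return None


# Geometry from the 00:06:41 blocked SHORT; the bars are made up.
made_up_bars = [(29527.5, 29522.0), (29523.0, 29516.5)]
print(replay_r_multiple("SHORT", 29526.75, 29528.75, 29517.0, made_up_bars))
```

Storing this value next to each blocked or unexecuted record would let the diary report what the gates cost as well as what they saved.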

3. Architecture Diagnosis

What the system actually does

LLM call (90% FLAT)
  → If LONG/SHORT: _validate_trade
       reject if |entry - price| > 25  (kills planned distant entries)
       reject if RR < 1.15
       reject if counter_bias
       reject on location_block_reason (remembered_resistance/support,
         generic_structure_not_enough)
  → Arm value-entry thesis (TTL 1200s)
  → Stalker:
       wait for trigger (lower-band rejection, middle-band pullback,
         BB-structure boundary, band-pressure reclaim, etc.)
       if triggered, run suspicion gate
           hard_vetoes set if phase == "unclear" without explicit
             cross-exam survival
           score ≥ 4.0 of 5 pillars required AND zero hard_vetoes
       if pass, _validate_trade again
       if pass, per_bot_session_gate.check
       publish signal
  → Vanta polls signal file:
       participation_gate decision_not_allow → skip
       max_source_age 45s → skip if stale
       level_confluence 15pt → skip if missing
       codex_execution_guard 2pt deviation → skip if drifted
       require_consecutive_same_signal = 3
       min_seconds_between_entries = 600s
       finally submit ATI order

What this system is

A defensive veto chain, not a trading agent. There are at least eight independent reasons a trade can be killed (model FLAT, main-path validation, location discipline, stalker bad-location, suspicion gate, session gate, participation gate, execution-guard staleness). Any one can kill any candidate. In practice all eight approve a single setup only a few times per day on the best days and not at all on bad days.
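The compounding effect of serial gates is easy to illustrate. The sketch below assumes the gates are statistically independent and uses a made-up 70% pass rate per gate; the real gates are correlated and their pass rates have not been measured, so only the shape of the result carries over.

```python
from math import prod


def joint_pass_rate(pass_rates):
    """Probability that a candidate survives every gate, assuming independence."""
    return prod(pass_rates)


ILLUSTRATIVE_PASS_RATE = 0.70  # hypothetical, not measured from the logs

eight_gates = joint_pass_rate([ILLUSTRATIVE_PASS_RATE] * 8)
three_gates = joint_pass_rate([ILLUSTRATIVE_PASS_RATE] * 3)
print(f"8 gates: {eight_gates:.1%} of candidates survive")
print(f"3 gates: {three_gates:.1%} of candidates survive")
```

Even generous per-gate pass rates leave only a few percent of candidates alive after eight gates, which is why cutting the number of layers matters more than loosening any single one.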

What it should be doing

A directional planner: pick a side per market regime, build a named entry zone (e.g. lower band + EMA20 + Asian high pivot 29580), wait there, and if price reaches it intact, execute. The LLM should do the thesis and zone-naming, deterministic code should do the waiting and triggering, and a single post-mortem layer should ask “did the kill prove right?” Today the LLM does the thesis and each tick gets a new chance to be vetoed. The stalker is not stalking; it is reacting to whatever near-price candidate _validate_trade lets through.
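A minimal sketch of that split, assuming the LLM emits a zone once and deterministic code then watches ticks. `EntryZone`, `watch`, and all field names are hypothetical; the 1200-second expiry mirrors the existing thesis TTL, and the prices are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryZone:
    side: str          # "LONG" or "SHORT"
    low: float         # lower edge of the named entry zone
    high: float        # upper edge of the named entry zone
    stop: float        # invalidation level beyond the zone
    target: float
    expires_at: float  # seconds; armed time plus TTL


def watch(zone, ticks):
    """Scan (timestamp, price) ticks and return the first decisive outcome."""
    for ts, price in ticks:
        if ts > zone.expires_at:
            return ("expired", ts, None)
        if zone.side == "LONG":
            invalidated = price <= zone.stop
        else:
            invalidated = price >= zone.stop
        if invalidated:
            return ("invalidated", ts, price)
        if zone.low <= price <= zone.high:
            return ("execute", ts, price)
    return ("waiting", None, None)


zone = EntryZone("LONG", 29578.0, 29580.0, 29576.0, 29590.0, expires_at=1200.0)
print(watch(zone, [(0, 29585.0), (60, 29582.0), (120, 29579.5)]))
```

The watcher never consults the LLM again: once the zone is named, the only outcomes are execute, invalidated, expired, or still waiting.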

Specific failure modes

  1. _validate_trade runs before thesis-arming. entry_too_far > 25pt kills the LLM’s planned-entry intent. This contradicts the thesis-arming prompt that explicitly tells the model to plan distant entries. The prompt says yes; the code says no.
  2. Suspicion-gate hard_vetoes are vocabulary-sensitive. unclear_setup_failed_cross_exam and no_defended_trade_location fire whenever the LLM doesn’t explicitly volunteer trap-language or preferred-structure tags. Sound geometric setups with 5+ RR get killed because the model wrote phase="unclear".
  3. Codex execution guard at codex_max_entry_deviation_points: 2.0 is too tight for the median NQ minute range. Combined with the fade router’s 3-point MAX_SETUP_DRIFT_POINTS, signals that pass everything else commonly fail in the 5–30s between publish and ATI.
  4. Vanta participation_gate is the most prolific kill (1,053 + 1,202 + 410 since May 8). Most of those are against standby packets, but the volume tells us upstream and downstream gates have near-redundant rejection logic.
  5. Codex55 is functionally a clone of Codex54 with shorter history and ~30% of the thesis volume. It produces theses but the absolute fill ceiling is 4 trades on the best day, 0 on the worst.
  6. Fade router qualification list is too narrow. The most common suspicion-block reason (unclear_setup_failed_cross_exam) is not in QUALIFYING_VETOES, so the majority of rejections never reach the fade lane. Codex55F has produced zero fills since startup; source-side rules starve it.
  7. Diary one-sidedness. Records what did happen (closed at X, blocked at Y) but never asks what would have happened to blocked candidates. The prompt-memory feedback therefore only reinforces caution.

4. Future Course

Recommended path

Hybrid LLM planner + deterministic execution watcher with mandatory shadow accounting.

Keep

Remove or sharply curtail

Rebuild

5. Decision Framework

A new architecture is justified only if all of these are true:

  1. Pre-deployment replay against the last 30 trading days produces a win rate ≥ 50% at the RR ladder used, on ≥ 200 simulated trades.
  2. The system can produce 5–15 executable signals per active session on average days (currently < 5 in best week, 0 in worst).
  3. Median time from “thesis armed” to “first eligible trigger” is ≤ 8 minutes.
  4. Independent veto layers per candidate ≤ 3 (currently 8).
  5. Every “skip” event carries enough geometry to be replayed; a nightly job actually replays them.
  6. Codex execution guard staleness rejection < 10% of submitted signals.
  7. Fade lane has independent qualification logic that does not depend on a specific veto vocabulary in the source pipeline.

If any of (1)–(4) cannot be met in design, the next version is the wrong next version.
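The quantitative criteria can be encoded as a checklist so a candidate design is scored the same way every time. In the sketch below the metric names are hypothetical, criteria 5 and 7 are left out because they are qualitative, and every example value except `veto_layers` is a placeholder rather than a measurement.

```python
def evaluate_framework(m):
    """Return {criterion_number: passed} for the quantitative criteria."""
    return {
        1: m["replay_win_rate"] >= 0.50 and m["replay_trades"] >= 200,
        2: 5 <= m["signals_per_session"] <= 15,
        3: m["median_arm_to_trigger_min"] <= 8,
        4: m["veto_layers"] <= 3,
        6: m["guard_staleness_reject_rate"] < 0.10,
    }


def blocking_failures(results):
    """Criteria (1)-(4) that fail; any entry here rules the design out."""
    return [k for k in (1, 2, 3, 4) if not results[k]]


example_metrics = {
    "replay_win_rate": 0.52,              # placeholder
    "replay_trades": 240,                 # placeholder
    "signals_per_session": 4,             # placeholder, below the 5-15 band
    "median_arm_to_trigger_min": 6.0,     # placeholder
    "veto_layers": 8,                     # current count per this review
    "guard_staleness_reject_rate": 0.05,  # placeholder
}
results = evaluate_framework(example_metrics)
print(results)
print("blocking failures:", blocking_failures(results))
```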

6. Immediate Next Actions (analysis-only)

  1. Replay published_not_executed and blocked candidates against bar history for 2026-05-01 → 2026-05-15. Compute realized R-multiple per candidate.
  2. Cross-tabulate each suspicion-gate hard_veto reason against subsequent 30-min directional move.
  3. Inventory every entry_stalk_rejected_bad_location while a thesis is armed: how often did price reach the planned entry within TTL but fail the stalker’s extra bounce/boundary requirements?
  4. Audit Codex55 vs Codex54 prompts and configs side-by-side. If identical, Codex55 is paying compute for nothing.
  5. Audit Vanta participation_gate decisions. Determine how many BLOCKs were against real entry signals vs standby packets.
  6. Move CodexF / Codex54F / Codex55F to shadow-only until source quality is restored. Keep the fade router gathering data; stop burning Vanta sessions on a 0–6-fill-per-lane signal base.
  7. Stop the “diary-as-warning” prompt injection in the next prompt revision. Replace with a structured R-multiple summary per regime.
  8. Pause Codex55 live polling until the prompt audit determines it is doing anything Codex54 isn’t.
  9. Build a single read-only dashboard over the existing JSONL showing per lane per hour: theses armed, triggers, vetoes by reason, Vanta submits, fills, R-multiple. Without this, every future tuning decision is guesswork.
  10. Decide before further work which path you want. The evidence supports hybrid LLM planner + deterministic execution watcher with the suspicion gate downgraded to structural-only.
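The dashboard in action 9 reduces to a per-lane, per-hour count of events. The sketch below assumes each JSONL record carries `ts` (ISO 8601), `lane`, and `event` fields; those field names and the sample records are hypothetical and would need to be mapped onto the actual log schema.

```python
import json
from collections import Counter, defaultdict
from datetime import datetime


def hourly_counts(lines):
    """Aggregate JSONL records into {(lane, hour): Counter(event -> count)}."""
    table = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        hour = datetime.fromisoformat(rec["ts"]).strftime("%Y-%m-%d %H:00")
        table[(rec["lane"], hour)][rec["event"]] += 1
    return table


# Made-up records for illustration only.
SAMPLE = [
    '{"ts": "2026-05-14T00:06:41", "lane": "Codex53", "event": "entry_stalk_suspicion_blocked"}',
    '{"ts": "2026-05-14T00:43:13", "lane": "Codex53", "event": "entry_stalk_suspicion_blocked"}',
    '{"ts": "2026-05-14T01:10:00", "lane": "Codex53", "event": "ati_order_submitted"}',
]
for (lane, hour), events in sorted(hourly_counts(SAMPLE).items()):
    print(lane, hour, dict(events))
```

Because it only reads existing logs, a job like this fits the analysis-only constraint of this section.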

Net assessment

The system is currently optimizing for “avoid the last loss type” rather than “execute the next good idea.” A 90% FLAT rate from a model whose prompt-memory tells it “your aggressive entries were toxic” combined with eight independent vetoes is not a trading agent. Fix the asymmetry — measure what the gates kill, not just that they fired — before any more code changes.