Forensic Review Architecture Paper Results 2026-05-15

Codex Pipeline Forensic Review

A blunt outside review of the Codex53 / Codex54 / Codex55 main traders and the CodexF / Codex54F / Codex55F fade lanes. Verified against runtime logs, Vanta execution logs, configs, and trade-diary outputs. No code edits, no service restarts — analysis only.

Window: 2026-04-05 → 2026-05-15 · Lanes: 6 · Records examined: ~400k runtime + ~215k Vanta · Mode: paper

Sections

Executive Verdict
Evidence
Architecture Diagnosis
Future Course
Decision Framework
Immediate Next Actions

Blunt up front

The current Codex pipeline is not a trading agent. It is a four-layer veto stack with an LLM at the top emitting mostly-FLAT opinions, a stalker that only sees pre-filtered near-price ideas, a suspicion gate that fires hard vetoes any time the model defaults to phase="unclear", and a Vanta layer that re-litigates everything with another participation gate and a 2–3 point staleness guard. The pre-May-5 version was a money-losing signal-spammer at a ~37% paper win rate. The post-May-5 version replaced it with a money-saving silence machine. Neither is the “cunning directional trader” the design called for.

1. Executive Verdict

Functional failure against the stated goal. The pipeline is partially viable for risk control and not viable for opportunity capture. It avoids the prior loss rate by simply not trading. The architecture is overweight on serial vetoes, underweight on directional planning, and the diary feedback loop only knows about kills — not about kills-that-should-not-have-been-kills. Continuing with this stack and merely “loosening filters” will not fix the conceptual problem.

2. Evidence

Volume and recency

Lane	Runtime lines	First event	Last event
Codex53 trader	182,329	2026-04-10	2026-05-15 11:21
Codex54 trader	189,279	2026-04-05	2026-05-15 11:21
Codex55 trader	26,953	2026-04-23	2026-05-15 11:21
Vanta Codex53	122,534	—	2026-05-15
Vanta Codex54	79,176	—	2026-05-15
Vanta Codex55	13,839	—	2026-05-15
Vanta CodexF	73	—	2026-05-15
Vanta Codex54F	53	—	2026-05-15
Vanta Codex55F	21	—	2026-05-15

LLM decision distribution (full history)

Lane	Total decisions	FLAT %	LONG %	SHORT %
Codex53	90,140	90.6%	5.3%	4.1%
Codex54	88,358	90.6%	4.9%	4.5%
Codex55	8,128	89.6%	4.7%	5.8%

Roughly 9–10% of polls produce any directional opinion. The prompt biases the model toward FLAT and tells it the diary lesson is that “persuasive entries tagged moved_too_soon have been toxic.” After a losing paper run, that recent diary memory makes FLAT the safer answer for the model on every poll.

Daily funnel — Codex53 (last week)

day        armed  trig  susp  badloc  sigval_fail  ati_sub  fills
2026-05-08      3     3     0      3            1        1      2
2026-05-11     49    18     0    381          260       10     10
2026-05-12    164    54     0    663          381       26     26
2026-05-13    138    41    14    911          385       21     21
2026-05-14    223     7    86   1095          398        4      4
2026-05-15    126     2    14    463          415        0      0

Daily funnel — Codex54 (last week)

day        armed  trig  susp  badloc  sigval_fail  ati_sub  fills
2026-05-08     37    14     0    195           81       10     19
2026-05-11     74    26     0    926          141       18     26
2026-05-12     80    35     0    585          132       12     12
2026-05-13     75    29     3    817          127       13     13
2026-05-14    109     4    34    835          163        1      1
2026-05-15     56     3    14    360          122        1      1

Daily funnel — Codex55 (last week)

day        armed  trig  susp  badloc  sigval_fail  ati_sub  fills
2026-05-11     31     3     0    138           60        2      2
2026-05-12     48     5     0    171           69        2      2
2026-05-13     51    11     1    229           88        4      4
2026-05-14     35     0     4    135           47        0      0
2026-05-15     18     0     0     79           43        0      0

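On busy days Codex53 arms 100–200 theses. On 2026-05-14 and 2026-05-15 only about 2–5% of armed theses on Codex53 and Codex54 reached a stalker trigger (none on Codex55), down from roughly 30–45% on 05-11 through 05-13. Codex55 is a smaller-volume clone of Codex54 and produced 0 fills on the last two days. Codex55F (the fade lane for Codex55) cannot be fed because Codex55 produces almost no qualifying suspicion-blocks.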
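The armed-to-trigger conversion can be read straight off the funnel tables. A minimal sketch using the Codex53 rows above; the tuple layout and the helper name `trigger_rate` are illustrative, not part of the pipeline:

```python
# Codex53 daily funnel rows copied from the table above: (day, armed, trig, fills).
codex53 = [
    ("2026-05-08", 3, 3, 2),
    ("2026-05-11", 49, 18, 10),
    ("2026-05-12", 164, 54, 26),
    ("2026-05-13", 138, 41, 21),
    ("2026-05-14", 223, 7, 4),
    ("2026-05-15", 126, 2, 0),
]


def trigger_rate(armed: int, trig: int) -> float:
    """Share of armed theses that reached a stalker trigger, in percent."""
    return 100.0 * trig / armed if armed else 0.0


# One rounded percentage per day, keyed by date string.
rates = {day: round(trigger_rate(armed, trig), 1) for day, armed, trig, _ in codex53}

for day, rate in rates.items():
    print(day, f"{rate:.1f}%")
```

The same loop over the Codex54 and Codex55 rows gives the per-lane comparison; the drop between 05-13 and 05-14 is the point where the `susp` column becomes non-trivial.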

Where attrition happens (Codex53 since 2026-05-08)

Filter stage	Count	What kills it
LLM FLAT	15,784 / 20,626	76.5% of polls; the model declines
Stalker bad-location (logged each poll)	3,516	Thesis alive but price not at value
`signal_validation_failed` total	1,917	`entry_too_far` (1,074), `generic_structure_not_enough` (426), `counter_bias` (266), `remembered_resistance/support` (113), `rr_below_min` (38)
`entry_stalk_suspicion_blocked`	114	Hard vetoes from suspicion gate
`ny_no_entry_cutoff_standby`	1,312	NY cutoff 15:30–20:00 ET
Vanta `participation_gate:decision_not_allow:BLOCK`	1,053	ATLAS decision packet says don’t trade
Vanta `codex_execution_guard_blocked` at ≤3 pts	21	Price drifted from planned entry
Vanta `stale_source_signal`	15	Source signal older than 45s
Reached `ati_order_submitted`	~86	—

Concrete blocked-trade examples (Codex53, 2026-05-14 ET)

00:06:41 SHORT 29526.75/29528.75→29517.0   risk=2  reward=9.75  RR=4.9
         vetoes=[unclear_setup_failed_cross_exam, no_defended_trade_location]

00:43:13 LONG  29578.25/29576.25→29588.75  risk=2  reward=10.5  RR=5.25
         vetoes=[unclear_setup_failed_cross_exam, no_defended_trade_location]

00:50:44 LONG  29583.5 /29581.5 →29593.0   risk=2  reward=9.5   RR=4.75
         vetoes=[no_defended_trade_location]

00:55:20 SHORT 29579.75/29581.75→29568.75  risk=2  reward=11.0  RR=5.5
         vetoes=[unclear_neutral_middle_requires_hard_evidence,
                 generic_middle_without_extra_confirmation,
                 unclear_setup_failed_cross_exam,
                 no_defended_trade_location]

These have ≥4.5 RR with tight 2-pt stops at the band/value zone. They were killed not because the geometry was bad but because the LLM defaulted phase to “unclear” and didn’t volunteer enough free-form trap-language or preferred-structure tags. The gate is taxing the vocabulary of the LLM, not the structure of the setup.
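The geometry of the four blocked examples can be checked independently of any LLM vocabulary. The sketch below assumes the log format is entry/stop→target; the `Candidate` class is a hypothetical container, not a type from the codebase:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    side: str      # "LONG" or "SHORT"
    entry: float
    stop: float
    target: float

    @property
    def risk(self) -> float:
        """Points at risk between entry and stop."""
        return abs(self.entry - self.stop)

    @property
    def reward(self) -> float:
        """Points between entry and target."""
        return abs(self.target - self.entry)

    @property
    def rr(self) -> float:
        """Reward-to-risk ratio."""
        return self.reward / self.risk


# The four Codex53 candidates blocked on 2026-05-14, as logged above.
blocked = [
    Candidate("SHORT", 29526.75, 29528.75, 29517.0),
    Candidate("LONG", 29578.25, 29576.25, 29588.75),
    Candidate("LONG", 29583.5, 29581.5, 29593.0),
    Candidate("SHORT", 29579.75, 29581.75, 29568.75),
]

for c in blocked:
    print(c.side, c.risk, c.reward, c.rr)
```

A structural gate would take only these numbers plus objective location data as input; none of the vetoes listed in the log depend on them.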

Fade lane evidence

codex_fade_router.jsonl since 2026-05-14 20:35 startup: 2 startups, 12 fade signals published, 32 rejected.
Reject reasons: 15 setup_location_stale (drift > 3 pt by the time the router built the fade), 15 non_fade_suspicion_block (block was qualified but had no listed “qualifying veto”), 1 session_fade_limit, 1 missing_geometry.
Vanta CodexF: 6 fills total. Vanta Codex54F: 4 fills. Vanta Codex55F: 0 fills (source never produced qualifying rejections).
Shadow stop-entry study: 12 studies armed, 12 entries taken, 60 ratio outcomes recorded. Functional but data-sparse.

Diary outcomes (full history)

Analyst	Closed trades	Target	Stop	Loss-exit	Profit-exit	Win rate
OpenClaw Codex (53)	465	154	203	86	21	37.7%
OpenClaw Codex54	355	111	139	87	17	36.2%
OpenClaw Codex55	18	0	1	15	2	11.1%

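Default MIN_RR = 1.15 with an RR ladder of 1.25–2.25. Break-even win rate at a given RR is 1/(1+RR): about 46.5% at 1.15 and 44.4% at 1.25, so a 37% win rate is net losing on the lower rungs and clears break-even only near the top rung (about 30.8% at 2.25), and then only if winners actually reach full target. The skepticism layer added in early May was a rational response to a real loss record; it then went too far the other way.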
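The win-rate arithmetic behind that judgment can be sketched as follows. It assumes an idealized all-or-nothing model in which every win pays the full RR and every loss costs 1R; the diary's loss-exit and profit-exit categories mean realized results will differ. Function names are illustrative:

```python
def breakeven_win_rate(rr: float) -> float:
    """Win rate at which expectancy is zero when wins pay rr R and losses cost 1R."""
    return 1.0 / (1.0 + rr)


def expectancy_r(win_rate: float, rr: float) -> float:
    """Expected R per trade under the same all-or-nothing assumption."""
    return win_rate * rr - (1.0 - win_rate)


# MIN_RR and the two ends of the RR ladder quoted above.
for rr in (1.15, 1.25, 2.25):
    print(rr, round(breakeven_win_rate(rr), 3), round(expectancy_r(0.37, rr), 3))
```

Under this model a 37% win rate has negative expectancy at 1.15 and 1.25 and turns positive only toward the 2.25 end of the ladder, which is why the distribution of trades across rungs matters as much as the headline win rate.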

The diary additionally has 487 trade / published_not_executed records, 336 trade / blocked, 98 thesis / skipped, and 26 thesis / blocked. None of these have post-hoc R-multiple reconstruction — the diary never asks “what would this rejected or unexecuted candidate have done if taken?”

3. Architecture Diagnosis

What the system actually does

LLM call (90% FLAT)
  → If LONG/SHORT: _validate_trade
       reject if |entry - price| > 25  (kills planned distant entries)
       reject if RR < 1.15
       reject if counter_bias
       reject on location_block_reason (remembered_resistance/support,
         generic_structure_not_enough)
  → Arm value-entry thesis (TTL 1200s)
  → Stalker:
       wait for trigger (lower-band rejection, middle-band pullback,
         BB-structure boundary, band-pressure reclaim, etc.)
       if triggered, run suspicion gate
           hard_vetoes set if phase == "unclear" without explicit
             cross-exam survival
           score ≥ 4.0 of 5 pillars required AND zero hard_vetoes
       if pass, _validate_trade again
       if pass, per_bot_session_gate.check
       publish signal
  → Vanta polls signal file:
       participation_gate decision_not_allow → skip
       max_source_age 45s → skip if stale
       level_confluence 15pt → skip if missing
       codex_execution_guard 2pt deviation → skip if drifted
       require_consecutive_same_signal = 3
       min_seconds_between_entries = 600s
       finally submit ATI order

What this system is

A defensive veto chain, not a trading agent. There are at least eight independent reasons a trade can be killed (model FLAT, main-path validation, location discipline, stalker bad-location, suspicion gate, session gate, participation gate, execution-guard staleness). Any one can kill any candidate. In practice all eight approve the same setup only a few times per day on the best days and never on bad days.
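A rough way to see why serial vetoes starve the funnel: if the gates were independent, the end-to-end pass rate would be the product of the per-gate pass rates. The sketch below uses a made-up 70% pass rate per gate purely for illustration; the real gates are correlated and their individual pass rates are not measured in this review.

```python
from math import prod
from typing import Sequence


def chain_pass_rate(pass_rates: Sequence[float]) -> float:
    """End-to-end pass rate of serial gates, assuming independence."""
    return prod(pass_rates)


# Hypothetical: eight gates, each passing 70% of what reaches it.
eight_gates = chain_pass_rate([0.7] * 8)   # roughly 0.058
# Same hypothetical rate with the three-layer cap from the decision framework.
three_gates = chain_pass_rate([0.7] * 3)   # roughly 0.343

print(f"{eight_gates:.3f} vs {three_gates:.3f}")
```

Even with individually lenient gates, eight in series pass only a small fraction of candidates; reducing the count does more than loosening any single threshold.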

What it should be doing

A directional planner: pick a side per market regime, build a named entry zone (e.g. lower band + EMA20 + Asian high pivot 29580), wait there, and if price reaches it intact, execute. The LLM should do the thesis and zone-naming, deterministic code should do the waiting and triggering, and a single post-mortem layer should ask “did the kill prove right?” Today the LLM does the thesis and each tick gets a new chance to be vetoed. The stalker is not stalking; it is reacting to whatever near-price candidate _validate_trade lets through.

Specific failure modes

_validate_trade runs before thesis-arming. entry_too_far > 25pt kills the LLM’s planned-entry intent. This contradicts the thesis-arming prompt that explicitly tells the model to plan distant entries. The prompt says yes; the code says no.
Suspicion-gate hard_vetoes are vocabulary-sensitive. unclear_setup_failed_cross_exam and no_defended_trade_location fire whenever the LLM doesn’t explicitly volunteer trap-language or preferred-structure tags. Sound geometric setups with 5+ RR get killed because the model wrote phase="unclear".
Codex execution guard at codex_max_entry_deviation_points: 2.0 is too tight for the median NQ minute range. Combined with the fade router’s 3-point MAX_SETUP_DRIFT_POINTS, signals that pass everything else commonly fail in the 5–30s between publish and ATI.
Vanta participation_gate is the most prolific kill (1,053 + 1,202 + 410 since May 8). Most of those are against standby packets, but the volume tells us upstream and downstream gates have near-redundant rejection logic.
Codex55 is functionally a clone of Codex54 with shorter history and ~30% of the thesis volume. It produces theses, but the absolute fill ceiling is 4 trades on the best day and 0 on the worst.
Fade router qualification list is too narrow. The most common suspicion-block reason (unclear_setup_failed_cross_exam) is not in QUALIFYING_VETOES, so the majority of rejections never reach the fade lane. Codex55F has produced zero fills since startup; source-side rules starve it.
Diary one-sidedness. Records what did happen (closed at X, blocked at Y) but never asks what would have happened to blocked candidates. The prompt-memory feedback therefore only reinforces caution.

4. Future Course

Recommended path

Hybrid LLM planner + deterministic execution watcher with mandatory shadow accounting.

LLM’s job once per 5–15 minutes: declare a directional playbook with a primary and secondary entry zone (price ranges, not single points), planned invalidation level, and target. No per-tick re-vetoing.
Deterministic watcher’s job continuously: monitor price vs the playbook’s named zones, trigger when price reaches a zone with the qualifier the LLM specified (e.g. “tap and 3-pt bounce off lower band with VWAP held”), submit. No second LLM call required for trigger.
Single post-mortem layer: every blocked or rejected candidate is replayed against the next N bars to compute the R-multiple it would have realized. That number, not exhortative text, feeds prompt-memory.
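The planner/watcher split can be sketched as below. This is a minimal illustration under stated assumptions, not the existing stalker: `Zone`, `Playbook`, `ZoneWatcher`, and the "tap then bounce" qualifier are hypothetical names, and a real qualifier would carry more conditions (for example a VWAP check).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Zone:
    low: float
    high: float

    def contains(self, price: float) -> bool:
        return self.low <= price <= self.high


@dataclass(frozen=True)
class Playbook:
    side: str                    # "LONG" or "SHORT"
    primary: Zone
    secondary: Zone
    invalidation: float
    target: float
    bounce_points: float = 3.0   # qualifier: bounce required after the zone tap


class ZoneWatcher:
    """Deterministic trigger: price taps a named zone, then bounces by bounce_points."""

    def __init__(self, playbook: Playbook) -> None:
        self.pb = playbook
        self.extreme: Optional[float] = None  # most adverse price since the tap
        self.done = False

    def on_price(self, price: float) -> Optional[str]:
        pb = self.pb
        if self.done:
            return None
        is_long = pb.side == "LONG"
        # Invalidation retires the playbook; no re-litigation per tick.
        if (is_long and price <= pb.invalidation) or (not is_long and price >= pb.invalidation):
            self.done = True
            return "INVALIDATED"
        if self.extreme is None:
            # Still waiting for the first tap of either zone.
            if pb.primary.contains(price) or pb.secondary.contains(price):
                self.extreme = price
            return None
        # After the tap, track the extreme and measure the bounce away from it.
        self.extreme = min(self.extreme, price) if is_long else max(self.extreme, price)
        bounce = price - self.extreme if is_long else self.extreme - price
        if bounce >= pb.bounce_points:
            self.done = True
            return "SUBMIT"
        return None
```

The LLM would emit one `Playbook` every 5–15 minutes; everything after that is plain code with a single exit per playbook (submit or invalidate), which also makes every outcome replayable.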

Keep

LM Studio + Qwen text-only architecture.
Vanta as the execution agent (ATI bracket guard, fill detection, time-flatten). Do not rebuild this.
Diary schema for closed trades — the schema is good; the analysis surface is what’s lacking.
Per-bot session gate concept (RR ladder by trade ordinal).
Cross-market context, red-folder news context, kill switches.
Fade shadow stop-entry study — the one piece of “what would have happened?” replay you already have. Generalize it to all gates.

Remove or sharply curtail

Suspicion gate in its current many-hard-vetoes-from-LLM-vocabulary form. Replace with a small set of structural vetoes that fire only on objective conditions.
_validate_trade running before thesis-arming. Move validation to the moment the stalker triggers, not the moment the LLM emits.
codex_max_entry_deviation_points: 2.0 as a hard reject. Replace with a slippage adjustment or widen to 5–8 pts.
Overlapping require_consecutive_same_signal: 3 (Vanta) plus REQUIRE_CONSECUTIVE_POLLS: 1 (trader) plus min_seconds_between_entries: 600s — pick one.
The diary memory string telling the model “your aggressive entries were toxic.” Replace with actual R-multiple statistics per regime.
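As a sketch of what "actual R-multiple statistics per regime" could look like in place of the warning string: the input shape (pairs of regime label and realized R) and the regime names in the example are assumptions, since the review does not specify the diary's regime field.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def regime_summary(trades: Iterable[Tuple[str, float]]) -> Dict[str, dict]:
    """Group (regime, r_multiple) pairs and compute count, win rate, and mean R."""
    by_regime = defaultdict(list)
    for regime, r in trades:
        by_regime[regime].append(r)
    return {
        regime: {
            "n": len(rs),
            "win_rate": sum(r > 0 for r in rs) / len(rs),
            "avg_r": mean(rs),
        }
        for regime, rs in by_regime.items()
    }


def to_prompt_line(regime: str, stats: dict) -> str:
    """Render one regime's statistics as a compact line for prompt memory."""
    return f"{regime}: n={stats['n']} win_rate={stats['win_rate']:.0%} avg_R={stats['avg_r']:+.2f}"


# Hypothetical sample, for illustration only.
sample = [("trend", 2.0), ("trend", -1.0), ("range", -1.0), ("range", -1.0)]
summary = regime_summary(sample)
for regime, stats in summary.items():
    print(to_prompt_line(regime, stats))
```

Feeding the model numbers of this kind lets it distinguish regimes where entries lost from regimes where they worked, which a blanket "toxic" label cannot do.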

Rebuild

Decision logging. Tag every kill with was_correct_kill = bool populated by the replay layer.
Codex55 lane. Either fold into Codex54 with a config flag, or give it a meaningfully different prompt. As a clone it is paying compute for almost nothing.
Fade lane qualification. Broaden QUALIFYING_VETOES to match the actually-most-common vetoes, or run fade routers in shadow-only until source quality improves.
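The replay layer and the `was_correct_kill` tag could be sketched as follows. This is a simplified model under explicit assumptions: the candidate is treated as filled at its planned entry, bars carry only high and low, and a bar that touches both stop and target is counted as a stop. `Bar` and `replay_r_multiple` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Bar:
    high: float
    low: float


def replay_r_multiple(side: str, entry: float, stop: float, target: float,
                      bars: Iterable[Bar]) -> Optional[float]:
    """Walk forward over bars; return +RR if target is hit first, -1.0 if the stop
    is hit first, or None if neither is reached within the replay window."""
    risk = abs(entry - stop)
    for bar in bars:
        if side == "LONG":
            hit_stop, hit_target = bar.low <= stop, bar.high >= target
        else:
            hit_stop, hit_target = bar.high >= stop, bar.low <= target
        if hit_stop:          # checked first: ambiguous bars count as losses
            return -1.0
        if hit_target:
            return abs(target - entry) / risk
    return None


def was_correct_kill(r_multiple: Optional[float]) -> Optional[bool]:
    """A kill was correct if the replayed trade would have lost; None if unresolved."""
    return None if r_multiple is None else r_multiple <= 0
```

Run against the 00:55:20 SHORT example with real bar history, the output would be a single number per blocked candidate that can be aggregated per veto reason.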

5. Decision Framework

A new architecture is justified only if all of these are true:

1. Pre-deployment replay against the last 30 trading days produces a win rate ≥ 50% at the RR ladder used, on ≥ 200 simulated trades.
2. The system can produce 5–15 executable signals per active session on average days (currently < 5 in best week, 0 in worst).
3. Median time from “thesis armed” to “first eligible trigger” is ≤ 8 minutes.
4. Independent veto layers per candidate ≤ 3 (currently 8).
5. Every “skip” event carries enough geometry to be replayed; a nightly job actually replays them.
6. Codex execution guard staleness rejection < 10% of submitted signals.
7. Fade lane has independent qualification logic that does not depend on a specific veto vocabulary in the source pipeline.

If any of (1)–(4) cannot be met in design, the next version is the wrong next version.
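The quantitative criteria lend themselves to a mechanical check. A minimal sketch covering the first four criteria and the execution-guard criterion; the replay and fade-independence criteria are qualitative and left out. `ReplayReport` and its field names are hypothetical, and the example values are illustrative rather than measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReplayReport:
    win_rate: float                    # replayed win rate at the RR ladder used
    simulated_trades: int              # number of simulated trades in the replay
    signals_per_session: float         # executable signals per active session
    median_arm_to_trigger_min: float   # minutes from thesis armed to first trigger
    veto_layers: int                   # independent veto layers per candidate
    guard_reject_share: float          # share of submitted signals rejected as stale


def failed_criteria(r: ReplayReport) -> List[str]:
    """Return the names of the quantitative criteria the report does not meet."""
    checks = {
        "win_rate>=0.50": r.win_rate >= 0.50,
        "simulated_trades>=200": r.simulated_trades >= 200,
        "signals_per_session in 5..15": 5 <= r.signals_per_session <= 15,
        "median_arm_to_trigger<=8min": r.median_arm_to_trigger_min <= 8,
        "veto_layers<=3": r.veto_layers <= 3,
        "guard_reject_share<0.10": r.guard_reject_share < 0.10,
    }
    return [name for name, ok in checks.items() if not ok]


# Illustrative values only; not taken from a real replay.
example = ReplayReport(0.37, 465, 3.0, 12.0, 8, 0.05)
print(failed_criteria(example))
```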

6. Immediate Next Actions (analysis-only)

Replay published_not_executed and blocked candidates against bar history for 2026-05-01 → 2026-05-15. Compute realized R-multiple per candidate.
Cross-tabulate each suspicion-gate hard_veto reason against subsequent 30-min directional move.
Inventory every entry_stalk_rejected_bad_location while a thesis is armed: how often did price reach the planned entry within TTL but fail the stalker’s extra bounce/boundary requirements?
Audit Codex55 vs Codex54 prompts and configs side-by-side. If identical, Codex55 is paying compute for nothing.
Audit Vanta participation_gate decisions. Determine how many BLOCKs were against real entry signals vs standby packets.
Move CodexF / Codex54F / Codex55F to shadow-only until source quality is restored. Keep the fade router gathering data; stop burning Vanta sessions on a 0–6-fill-per-lane signal base.
Stop the “diary-as-warning” prompt injection in the next prompt revision. Replace with a structured R-multiple summary per regime.
Pause Codex55 live polling until the prompt audit determines it is doing anything Codex54 isn’t.
Build a single read-only dashboard over the existing JSONL showing per lane per hour: theses armed, triggers, vetoes by reason, Vanta submits, fills, R-multiple. Without this, every future tuning decision is guesswork.
Decide before further work which path you want. The evidence supports hybrid LLM planner + deterministic execution watcher with the suspicion gate downgraded to structural-only.
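The per-lane, per-hour dashboard is mostly a counting job over the existing JSONL. A minimal sketch, assuming each record has `ts` (ISO 8601), `lane`, and `event` fields; the actual field names in the runtime and Vanta logs are not given in this review and may differ.

```python
import json
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def hourly_counts(lines: Iterable[str]) -> Counter:
    """Count events per (lane, hour, event) from JSONL lines.

    Assumed record shape: {"ts": "<ISO 8601>", "lane": "...", "event": "..."}.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        hour = datetime.fromisoformat(rec["ts"]).strftime("%Y-%m-%d %H:00")
        key: Tuple[str, str, str] = (rec["lane"], hour, rec["event"])
        counts[key] += 1
    return counts


# Hypothetical records using event names that appear in this review.
sample_lines = [
    '{"ts": "2026-05-14T00:06:41", "lane": "Codex53", "event": "entry_stalk_suspicion_blocked"}',
    '{"ts": "2026-05-14T00:43:13", "lane": "Codex53", "event": "entry_stalk_suspicion_blocked"}',
    '{"ts": "2026-05-14T01:10:00", "lane": "Codex53", "event": "ati_order_submitted"}',
]
for key, n in sorted(hourly_counts(sample_lines).items()):
    print(key, n)
```

Because it only reads files, this fits the analysis-only constraint; veto reasons and replayed R-multiples can be added as further key components once the replay job exists.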

Net assessment

The system is currently optimizing for “avoid the last loss type” rather than “execute the next good idea.” A 90% FLAT rate from a model whose prompt-memory tells it “your aggressive entries were toxic” combined with eight independent vetoes is not a trading agent. Fix the asymmetry — measure what the gates kill, not just that they fired — before any more code changes.