
Subways Rust Translation Benchmark — Comparison Report

Experiment Setup

  • Task: Translate alexey-zakharenkov/subways (~4,500 lines of Python) to idiomatic Rust
  • Date: 2026-04-02
  • Orchestrator: Claude Opus 4.6 (1M context) — designed API spec, created workspace template, launched agents
  • Test city: Vienna (5 subway lines, 110 stations, 9 interchanges)

Agents

| Agent | Model | Runner | Cost Model |
|---|---|---|---|
| z.ai | Claude Opus (via z.ai gateway) | Claude Code `cc zai yolo` | z.ai tokens |
| Codex | GPT-5.4 (ChatGPT Plus account) | OpenAI Codex CLI v0.118 `codex --full-auto` | ChatGPT Plus quota |
| Sonnet | Claude Sonnet 4.6 (200K context) | Claude Code `cc yolo sonnet` | Claude Max |

Results Summary

| Metric | z.ai (Opus) | Sonnet 4.6 | Codex (gpt-5.4) |
|---|---|---|---|
| Lines of Rust | 4,642 | 3,514 | 3,350 |
| `todo!()` remaining | 0 | 0 | 21 |
| Compile errors | 2 (brace mismatch) | 0 | 4 (in progress) |
| Tests written | 39 | 5 | 25 |
| Tests passing | Can't run (compile error) | 5/5 | 25 (some failing) |
| Clippy warnings | 0 (cleaned) | 0 (cleaned) | N/A |
| Vienna correct? | NO (0 stations found) | YES (110/110, 5/5, 9/9) | NO (incomplete) |
| Time to first output | 17 min | ~45 min | Still running |
| Context used | 19% of 1M | 80% of 200K | 66% left |
| Agent needed help? | Yes (stuck on syntax error) | Self-sufficient | Yes (sandbox prompts) |

Winner: Sonnet 4.6 — the only implementation that passes Vienna validation end-to-end

Detailed Analysis

z.ai (Opus via z.ai)

Fastest start, broadest coverage, but incomplete pipeline

  • Wrote the most code (4,642 lines) and the most tests (39)
  • Completed all 8 module stubs in 17 minutes — impressive speed
  • However, the route extraction pipeline (extract_routes()) was stubbed out despite being marked "complete"
  • When asked to implement the missing pipeline, it wrote partial code but introduced a brace mismatch in route.rs
  • Could not self-fix the syntax error even when told the exact line number
  • Strength: Fast scaffolding, test generation, broad coverage
  • Weakness: Lost track of what was actually wired up vs stubbed; struggled with incremental fixes

Sonnet 4.6

Methodical, self-correcting, fully working

  • Took longer (~45 min total across 2 sessions) but produced a working validator
  • Self-diagnosed and fixed 4 bugs during integration testing:
    1. City::contains() had swapped bbox indices
    2. Network parsing failed on single-network cities
    3. CSV parser used wrong column format (headers vs positional)
    4. JSON loader didn't handle raw array format
  • Excellent self-correction loop: clippy → fix → test → Vienna run → debug → fix
  • Strength: Reliable end-to-end delivery, excellent debugging, self-sufficient
  • Weakness: Fewer tests (5 vs 39), slower, hit 80% context quickly
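The bbox-index swap (bug 1) is typical of the subtle errors that survive compilation during a Python-to-Rust translation. The sketch below is a hypothetical reconstruction of that class of bug — the `City` struct, field layout, and coordinates are assumptions, not the benchmark's actual code:

```rust
/// Hypothetical reconstruction of the `City::contains()` bug class —
/// the bbox layout (lon_min, lat_min, lon_max, lat_max) is an assumption.
struct City {
    bbox: (f64, f64, f64, f64),
}

impl City {
    // Buggy version: longitude is compared against the latitude bounds,
    // so every real-world point fails the check.
    fn contains_buggy(&self, lon: f64, lat: f64) -> bool {
        lon >= self.bbox.1 && lon <= self.bbox.3
            && lat >= self.bbox.0 && lat <= self.bbox.2
    }

    // Fixed version: each coordinate is checked against its own bounds.
    fn contains(&self, lon: f64, lat: f64) -> bool {
        lon >= self.bbox.0 && lon <= self.bbox.2
            && lat >= self.bbox.1 && lat <= self.bbox.3
    }
}

fn main() {
    // Rough Vienna bbox: lon 16.18..16.58, lat 48.11..48.33
    let vienna = City { bbox: (16.18, 48.11, 16.58, 48.33) };
    let (lon, lat) = (16.37, 48.21); // central Vienna

    assert!(vienna.contains(lon, lat));
    assert!(!vienna.contains_buggy(lon, lat)); // bug rejects every station
    println!("ok");
}
```

A swap like this would explain a "0 stations found" failure mode cleanly: the code compiles and runs, but the containment filter silently discards everything.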

Codex (GPT-5.4)

Methodical and careful, but too slow and constantly blocked by sandbox

  • Most deliberate approach — reads Python reference code thoroughly before implementing
  • Good thought process: "I'm keeping the recovery path conservative so later city integration can exercise the main logic"
  • Wrote 25 tests — solid test-first methodology
  • However, --full-auto mode still required manual sandbox permission approvals ~8 times
  • At 3,350 lines with 21 todo!()s remaining after ~45 minutes — roughly 60% complete
  • Strength: Careful, test-driven, good code commentary
  • Weakness: Extremely slow, sandbox UX terrible, never finished

Performance — Vienna Benchmark

Before Optimization

| Metric | Python | z.ai | Sonnet | Codex |
|---|---|---|---|---|
| Wall clock | 3.1s | 11.6s | 4.9s | 15.0s |
| Peak memory | 307 MB | 817 MB | 1,325 MB | 495 MB |

After Optimization (agents self-optimized)

| Metric | Python | z.ai | Sonnet | Codex |
|---|---|---|---|---|
| Wall clock | 3.1s | 11.6s* | 2.5s | 1.04s |
| Peak memory | 307 MB | 798 MB* | 491 MB | 227 MB |

* z.ai optimization still in progress at time of writing

Codex ended up roughly 3× faster than Python while using about 25% less memory; Sonnet came in about 20% faster than Python. z.ai had not yet optimized at the time of writing.

Key Optimization Techniques (Sonnet)

  1. Direct JSON deserialization (skip serde_json::Value intermediate) → −1.4s, −835MB
  2. FxHashMap everywhere (rustc-hash crate) → −0.1s, −20MB

Codex Performance Mystery

Codex achieved 1.04s / 227MB without documenting its approach — likely used similar techniques plus possibly smarter data structures from the start.

Reflections & Process Improvements

What Worked

  1. Shared API spec: Having a compiling Cargo workspace with type definitions and todo!() stubs gave all agents a clear contract
  2. Module breakdown: The 8-module implementation order was followed by all agents
  3. Vienna test data: Pre-downloading the OSM data as a JSON fixture was essential — avoids 30s Overpass API calls during development
  4. Parallel launch: Three agents running simultaneously saves wall-clock time even if not all finish
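The "compiling workspace with `todo!()` stubs" contract from point 1 can be as simple as: every public function carries its final signature, but its body panics if called, so `cargo check` passes from minute one. A minimal sketch with invented names (`Station`, `Route`, `extract_routes` stand in for the real spec):

```rust
/// Hypothetical stub module illustrating the shared API contract.
pub struct Station {
    pub id: u64,
    pub name: String,
}

pub struct Route {
    pub stations: Vec<Station>,
}

/// The full signature is fixed up front; the body is a stub, so the
/// whole workspace type-checks before any agent writes real logic.
pub fn extract_routes(_elements: &[Station]) -> Vec<Route> {
    todo!("route extraction pipeline")
}

fn main() {
    // Calling a stub panics with the todo message, while cargo
    // check/build succeed regardless — which is the point.
    let result = std::panic::catch_unwind(|| extract_routes(&[]));
    assert!(result.is_err());
    println!("stub panicked as expected");
}
```

The `todo!()` count in the results table then doubles as a completeness metric: once it hits zero, every contract function at least claims an implementation.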

What Failed

  1. -p (print mode) doesn't work for coding agents — they need interactive mode to run commands, read files, and iterate. This lost ~10 minutes at the start.

  2. Codex --full-auto is not actually full-auto — it still asks for sandbox permission on every file write that hits a sandbox boundary. We had to manually approve ~8 times. The --dangerously-bypass-approvals-and-sandbox flag exists but wasn't used.

  3. z.ai agent claimed "complete" but wasn't — marking all 8 modules as ✔ after 17 minutes was deceptive. The agent wrote struct definitions and helper functions but skipped the hard wiring logic (extract_routes, validate). Need to add: "After each module, run cargo run against Vienna and report the result" to the prompt.

  4. Orchestrator fixing agents' code defeats the benchmark purpose — I (the orchestrator) dispatched fix agents that modified impl-zai and impl-sonnet directly. This corrupts the comparison. For a fair benchmark, the agents should fix their own code.

  5. No timeout or iteration budget — agents could run indefinitely. Need: "You have 60 minutes. At the end, report what works and what doesn't."

Process Recommendations for Next Time

  1. Prompt must require integration test after every module: Add "After each module, run `cargo run -- -c Vienna -i ../test-data/vienna_osm.json` and report the output. Don't mark a module as done until the integration test passes or you've documented why it fails."

  2. Use --dangerously-bypass-approvals-and-sandbox for Codex — the sandbox prompts killed its velocity.

  3. Set a time budget: "You have 60 minutes. Focus on getting Vienna validation working first, then add tests and polish."

  4. Don't use -p mode — always interactive.

  5. Add a watchdog: A script that checks cargo check every 2 minutes and sends "you have compile errors, fix them" to any stuck agent.
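The watchdog from point 5 needs little more than `std::process::Command` around `cargo check`. A sketch under assumptions — the `impl-zai` path is hypothetical, and how the nudge actually reaches an agent (stdin, file drop, API) is left out; the sketch runs one check, with the 2-minute loop noted in a comment:

```rust
use std::process::Command;

/// Map a build outcome to the nudge the watchdog would send a stuck agent.
fn nudge_for(success: bool) -> Option<&'static str> {
    if success {
        None
    } else {
        Some("you have compile errors, fix them")
    }
}

fn main() {
    // One check; in production this runs in a loop with a 2-minute
    // std::thread::sleep per iteration, once per agent workspace.
    let ok = Command::new("cargo")
        .arg("check")
        .current_dir("impl-zai") // hypothetical agent workspace path
        .status()
        .map(|s| s.success())
        .unwrap_or(false); // a missing dir or missing cargo counts as "not OK"

    match nudge_for(ok) {
        None => println!("watchdog: build OK"),
        // Delivery to the agent is deployment-specific; the sketch logs it.
        Some(msg) => println!("watchdog: {msg}"),
    }
}
```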

  6. Fairer comparison: Don't touch agent code from the orchestrator. Let each agent's output stand on its own.

  7. Context management matters: Sonnet hit 80% of its 200K context. For large translations, Opus 1M context is better — but Opus via z.ai was slower at self-correction. Consider Sonnet with 1M context if available.

What Each Model is Useful For

| Use Case | Best Model | Why |
|---|---|---|
| Initial scaffolding | Opus (z.ai) | Fast, writes broad code quickly, good test generation |
| End-to-end delivery | Sonnet | Self-correcting, methodical, finishes what it starts |
| Code review / careful translation | Codex (gpt-5.4) | Reads reference code carefully, good commentary, test-first |
| Quick fixes | Sonnet | Best at diagnosing and fixing its own bugs |
| Large codebase | Opus 1M | More context window headroom |
| Unattended operation | Sonnet (Claude Code) | Doesn't need manual permission approvals |

Bottom Line

All three eventually produced working Vienna validators. Codex won on performance (1.04s, 227MB). Sonnet won on reliability — first to work, self-correcting, needed zero human intervention. z.ai won on test coverage (39 tests) but needed the most human help.

The ideal workflow: Opus for API design + scaffolding, Sonnet for implementation, then each optimizes independently.


Reflections — What Each Model is Good For

Opus (z.ai): The Fast Scaffolder

  • Best at: Generating broad code structure quickly (17 min for 8 modules), writing comprehensive tests (39 tests)
  • Worst at: Self-correction, incremental fixes, not losing track of what's wired vs stubbed
  • Use for: Initial project setup, test suite generation, boilerplate, API design
  • Don't use for: End-to-end implementation without supervision

Sonnet 4.6: The Reliable Finisher

  • Best at: Complete end-to-end delivery, self-debugging (fixed 4 bugs autonomously), clean optimization
  • Worst at: Test coverage (only 5 tests), hit context limit at 80% (200K window)
  • Use for: Implementation work, bug fixing, optimization, anything that needs to actually work
  • Don't use for: Large projects without context compaction strategy

Codex (GPT-5.4): The Careful Optimizer

  • Best at: Performance optimization (1.04s — 3x faster than Python), code quality, test-first methodology
  • Worst at: UX (sandbox prompts every 2 minutes), speed of initial delivery (still had 11 todo!()s when others finished)
  • Use for: Performance-critical code, careful translations where correctness matters more than speed
  • Don't use for: Autonomous unattended operation (needs babysitting for sandbox approvals)

Process Problems & Fixes

| Problem | Impact | Fix for Next Time |
|---|---|---|
| Used -p (print mode) initially | Lost 10 min, agents couldn't iterate | Always use interactive mode |
| Codex --full-auto still asks for permission | Agent blocked ~8 times, needed manual approval | Use --dangerously-bypass-approvals-and-sandbox |
| No integration test requirement per module | z.ai marked modules "done" that weren't wired up | Require `cargo run -- -c Vienna` after each module |
| Orchestrator fixed agent code | Corrupted benchmark fairness | Hands-off policy — only send prompts, never edit impl dirs |
| No time budget | z.ai spent 17min "completing" then hours fixing | Set 60-min budget, require checkpoint reports |
| Agents don't write optimization docs proactively | Had to ask each one explicitly | Include "write analysis to file" in initial prompt |
| z.ai context too large (1M) for urgency | Agent works leisurely, doesn't compress | May be counterproductive — Sonnet's 200K forced focus |

Streamlined Process for Next Benchmark

  1. Orchestrator designs API spec (Opus — good at this)
  2. Initial prompt includes: module order, integration test requirement per module, time budget (60 min), optimization target, doc-writing requirement
  3. Launch agents: Claude Code interactive mode with yolo (bypass perms), Codex with --dangerously-bypass-approvals-and-sandbox
  4. Hands-off monitoring: Script that checks cargo check every 2 min and reports errors to agents
  5. Phase 2 — optimization: Each agent writes ideas doc, implements, benchmarks, verifies tests still pass
  6. Phase 3 — comparison: Orchestrator collects perf numbers, reads docs, writes final comparison