Subways Rust Translation Benchmark — Comparison Report

Experiment Setup

Task: Translate alexey-zakharenkov/subways (~4,500 lines of Python) to idiomatic Rust
Date: 2026-04-02
Orchestrator: Claude Opus 4.6 (1M context) — designed API spec, created workspace template, launched agents
Test city: Vienna (5 subway lines, 110 stations, 9 interchanges)

Agents

Agent	Model	Runner	Cost Model
z.ai	Claude Opus (via z.ai gateway)	Claude Code `cc zai yolo`	z.ai tokens
Codex	GPT-5.4 (ChatGPT Plus account)	OpenAI Codex CLI v0.118 `codex --full-auto`	ChatGPT Plus quota
Sonnet	Claude Sonnet 4.6 (200K context)	Claude Code `cc yolo sonnet`	Claude Max

Results Summary

Metric	z.ai (Opus)	Sonnet 4.6	Codex (gpt-5.4)
Lines of Rust	4,642	3,514	3,350
todo!() remaining	0	0	21
Compile errors	2 (brace mismatch)	0	4 (in progress)
Tests written	39	5	25
Tests passing	Can't run (compile error)	5/5	25 (some failing)
Clippy warnings	0 (cleaned)	0 (cleaned)	N/A
Vienna correct?	NO (0 stations found)	YES (110/110, 5/5, 9/9)	NO (incomplete)
Time to first output	17 min	~45 min	Still running
Context used	19% of 1M	80% of 200K	66% left
Agent needed help?	Yes (stuck on syntax error)	Self-sufficient	Yes (sandbox prompts)

Winner: Sonnet 4.6 — the only implementation that passes Vienna validation end-to-end

Detailed Analysis

z.ai (Opus via z.ai)

Fastest start, broadest coverage, but incomplete pipeline

Wrote the most code (4,642 lines) and the most tests (39)
Completed all 8 module stubs in 17 minutes — impressive speed
However, the route extraction pipeline (extract_routes()) was stubbed out despite being marked "complete"
When asked to implement the missing pipeline, it wrote partial code but introduced a brace mismatch in route.rs
Could not self-fix the syntax error even when told the exact line number
Strength: Fast scaffolding, test generation, broad coverage
Weakness: Lost track of what was actually wired up vs stubbed; struggled with incremental fixes

Sonnet 4.6

Methodical, self-correcting, fully working

Took longer (~45 min total across 2 sessions) but produced a working validator
Self-diagnosed and fixed 4 bugs during integration testing:
1. City::contains() had swapped bbox indices
2. Network parsing failed on single-network cities
3. CSV parser used wrong column format (headers vs positional)
4. JSON loader didn't handle raw array format
Excellent self-correction loop: clippy → fix → test → Vienna run → debug → fix
Strength: Reliable end-to-end delivery, excellent debugging, self-sufficient
Weakness: Fewer tests (5 vs 39), slower, hit 80% context quickly

Codex (GPT-5.4)

Methodical and careful, but too slow and constantly blocked by sandbox

Most deliberate approach — reads Python reference code thoroughly before implementing
Good thought process: "I'm keeping the recovery path conservative so later city integration can exercise the main logic"
Wrote 25 tests — solid test-first methodology
However, --full-auto mode still required manual sandbox permission approvals ~8 times
At 3,350 lines with 21 todo!()s remaining after ~45 minutes — roughly 60% complete
Strength: Careful, test-driven, good code commentary
Weakness: Extremely slow, sandbox UX terrible, never finished

Performance — Vienna Benchmark

Before Optimization

Metric	Python	z.ai	Sonnet	Codex
Wall clock	3.1s	11.6s	4.9s	15.0s
Peak memory	307 MB	817 MB	1,325 MB	495 MB

After Optimization (agents self-optimized)

Metric	Python	z.ai	Sonnet	Codex
Wall clock	3.1s	11.6s*	2.5s	1.04s
Peak memory	307 MB	798 MB*	491 MB	227 MB

* z.ai optimization still in progress at time of writing

Codex achieved 3x faster than Python, using 25% less memory. Sonnet is 20% faster than Python. z.ai hasn't optimized yet.

Key Optimization Techniques (Sonnet)

Direct JSON deserialization (skip serde_json::Value intermediate) → −1.4s, −835MB
FxHashMap everywhere (rustc-hash crate) → −0.1s, −20MB

Codex Performance Mystery

Codex achieved 1.04s / 227MB without documenting its approach — likely used similar techniques plus possibly smarter data structures from the start.

Reflections & Process Improvements

What Worked

Shared API spec: Having a compiling Cargo workspace with type definitions and todo!() stubs gave all agents a clear contract
Module breakdown: The 8-module implementation order was followed by all agents
Vienna test data: Pre-downloading the OSM data as a JSON fixture was essential — avoids 30s Overpass API calls during development
Parallel launch: Three agents running simultaneously saves wall-clock time even if not all finish

What Failed

-p (print mode) doesn't work for coding agents — they need interactive mode to run commands, read files, and iterate. This lost ~10 minutes at the start.
Codex --full-auto is not actually full-auto — it still asks for sandbox permission on every file write that hits a sandbox boundary. We had to manually approve ~8 times. The --dangerously-bypass-approvals-and-sandbox flag exists but wasn't used.
z.ai agent claimed "complete" but wasn't — marking all 8 modules as ✔ after 17 minutes was deceptive. The agent wrote struct definitions and helper functions but skipped the hard wiring logic (extract_routes, validate). Need to add: "After each module, run cargo run against Vienna and report the result" to the prompt.
Orchestrator fixing agents' code defeats the benchmark purpose — I (the orchestrator) dispatched fix agents that modified impl-zai and impl-sonnet directly. This corrupts the comparison. For a fair benchmark, the agents should fix their own code.
No timeout or iteration budget — agents could run indefinitely. Need: "You have 60 minutes. At the end, report what works and what doesn't."

Process Recommendations for Next Time

Prompt must require integration test after every module: Add After each module, run cargo run -- -c Vienna -i ../test-data/vienna_osm.json and report the output. Don't mark a module as done until the integration test passes or you've documented why it fails.
Use --dangerously-bypass-approvals-and-sandbox for Codex — the sandbox prompts killed its velocity.
Set a time budget: "You have 60 minutes. Focus on getting Vienna validation working first, then add tests and polish."
Don't use -p mode — always interactive.
Add a watchdog: A script that checks cargo check every 2 minutes and sends "you have compile errors, fix them" to any stuck agent.
Fairer comparison: Don't touch agent code from the orchestrator. Let each agent's output stand on its own.
Context management matters: Sonnet hit 80% of its 200K context. For large translations, Opus 1M context is better — but Opus via z.ai was slower at self-correction. Consider Sonnet with 1M context if available.

What Each Model is Useful For

Use Case	Best Model	Why
Initial scaffolding	Opus (z.ai)	Fast, writes broad code quickly, good test generation
End-to-end delivery	Sonnet	Self-correcting, methodical, finishes what it starts
Code review / careful translation	Codex (gpt-5.4)	Reads reference code carefully, good commentary, test-first
Quick fixes	Sonnet	Best at diagnosing and fixing its own bugs
Large codebase	Opus 1M	More context window headroom
Unattended operation	Sonnet (Claude Code)	Doesn't need manual permission approvals

Bottom Line

All three eventually produced working Vienna validators. Codex won on performance (1.04s, 227MB). Sonnet won on reliability — first to work, self-correcting, needed zero human intervention. z.ai won on test coverage (39 tests) but needed the most human help.

The ideal workflow: Opus for API design + scaffolding, Sonnet for implementation, then each optimizes independently.

Reflections — What Each Model is Good For

Opus (z.ai): The Fast Scaffolder

Best at: Generating broad code structure quickly (17 min for 8 modules), writing comprehensive tests (39 tests)
Worst at: Self-correction, incremental fixes, not losing track of what's wired vs stubbed
Use for: Initial project setup, test suite generation, boilerplate, API design
Don't use for: End-to-end implementation without supervision

Sonnet 4.6: The Reliable Finisher

Best at: Complete end-to-end delivery, self-debugging (fixed 4 bugs autonomously), clean optimization
Worst at: Test coverage (only 5 tests), hit context limit at 80% (200K window)
Use for: Implementation work, bug fixing, optimization, anything that needs to actually work
Don't use for: Large projects without context compaction strategy

Codex (GPT-5.4): The Careful Optimizer

Best at: Performance optimization (1.04s — 3x faster than Python), code quality, test-first methodology
Worst at: UX (sandbox prompts every 2 minutes), speed of initial delivery (still had 11 todo!()s when others finished)
Use for: Performance-critical code, careful translations where correctness matters more than speed
Don't use for: Autonomous unattended operation (needs babysitting for sandbox approvals)

Process Problems & Fixes

Problem	Impact	Fix for Next Time
Used `-p` (print mode) initially	Lost 10 min, agents couldn't iterate	Always use interactive mode
Codex `--full-auto` still asks for permission	Agent blocked ~8 times, needed manual approval	Use `--dangerously-bypass-approvals-and-sandbox`
No integration test requirement per module	z.ai marked modules "done" that weren't wired up	Require `cargo run -- -c Vienna` after each module
Orchestrator fixed agent code	Corrupted benchmark fairness	Hands-off policy — only send prompts, never edit impl dirs
No time budget	z.ai spent 17min "completing" then hours fixing	Set 60-min budget, require checkpoint reports
Agents don't write optimization docs proactively	Had to ask each one explicitly	Include "write analysis to file" in initial prompt
z.ai context too large (1M) for urgency	Agent works leisurely, doesn't compress	May be counterproductive — Sonnet's 200K forced focus

Streamlined Process for Next Benchmark

Orchestrator designs API spec (Opus — good at this)
Initial prompt includes: module order, integration test requirement per module, time budget (60 min), optimization target, doc-writing requirement
Launch agents: Claude Code interactive mode with yolo (bypass perms), Codex with --dangerously-bypass-approvals-and-sandbox
Hands-off monitoring: Script that checks cargo check every 2 min and reports errors to agents
Phase 2 — optimization: Each agent writes ideas doc, implements, benchmarks, verifies tests still pass
Phase 3 — comparison: Orchestrator collects perf numbers, reads docs, writes final comparison