port-gameboy-emulator · top current-schema: 27.60% · GPT-5.5 xhigh

Game Boy / GBC emulator

Port a Game Boy and Game Boy Color emulator to Rust from the SameBoy C reference. Cycle-accurate CPU, PPU, APU, timer, memory bus, MBC1/2/5 banking, OAM DMA, interrupt controller. Graded against the established test ROM suites that have served as the de facto Game Boy emulator test set for two decades.

Target language: Rust (default)
Scored tests: 1,999
Milestones: 16 ordered
Unlock threshold: 60% pass to unlock the next
Measurement window: 12 hours
Reference impl in workspace: Yes — full SameBoy C source as study material

What the agent must produce

The agent starts with a scaffolded Rust project, a working CLI parser for --headless --max-cycles N --serial-log --cart-ram-out --state-out --screenshot-on-breakpoint --mode dmg|cgb, partially-stubbed CPU/MMU/PPU modules, and a fully-working joypad and interrupt controller. The full SameBoy C source sits in reference/sameboy/ as study material; the Pandocs Game Boy reference and the opcode table sit in reference/pandocs/.

The single hardest part of this environment is cycle accuracy. Most tests don't care whether the emulator's image looks roughly right; they care whether specific hardware state matches the reference, to the T-cycle, at a given cycle count. A 1-cycle error in the timer's falling-edge detector, or a PPU mode-3 timing that's off by 2 dots, fails dozens of tests.
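To make the timer example concrete, here is a minimal Python model of a falling-edge detector, written for illustration rather than as the Rust port itself. It assumes the commonly documented DMG behaviour (a 16-bit system counter whose upper byte is DIV, with TAC selecting counter bit 9, 3, 5, or 7). The class and method names are hypothetical, and the sketch deliberately omits the delayed TIMA reload and the TAC-write glitches that real hardware exhibits.

```python
# Counter bit watched for each TAC clock-select value (assumed DMG mapping).
TAC_BIT = {0: 9, 1: 3, 2: 5, 3: 7}


class Timer:
    """Illustrative timer model; names are hypothetical, not from SameBoy."""

    def __init__(self):
        self.counter = 0        # 16-bit system counter, DIV is bits 8..15
        self.tima = 0
        self.tma = 0
        self.tac = 0
        self.prev_signal = 0
        self.interrupt_requested = False

    def _signal(self):
        # The detector watches (timer enable) AND (selected counter bit).
        enabled = (self.tac >> 2) & 1
        bit = (self.counter >> TAC_BIT[self.tac & 3]) & 1
        return enabled & bit

    def _check_edge(self):
        signal = self._signal()
        if self.prev_signal == 1 and signal == 0:
            self.tima = (self.tima + 1) & 0xFF
            if self.tima == 0:
                # Simplification: hardware reloads TMA a few cycles later.
                self.tima = self.tma
                self.interrupt_requested = True
        self.prev_signal = signal

    def tick(self, t_cycles=1):
        for _ in range(t_cycles):
            self.counter = (self.counter + 1) & 0xFFFF
            self._check_edge()

    def write_div(self):
        # Any write to DIV clears the whole counter, which can itself
        # produce a falling edge and an early TIMA increment.
        self.counter = 0
        self._check_edge()
```

With TAC set to `0b101` (enabled, bit 3), TIMA increments on the 16th T-cycle and not on the 15th, which is the kind of one-cycle boundary the state comparisons are sensitive to. Resetting DIV while the watched bit is high also bumps TIMA, a side effect that a counter-free timer implementation tends to miss.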

Test suites, in unlock order

Established test ROM suites, generated hardware-state programs, and control/contract suites are combined into one ordered ladder. Some cases pass/fail through serial output or cartridge RAM; many also require full hardware-state comparison against PyBoy at fixed cycle checkpoints.
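A full hardware-state comparison at a checkpoint can be pictured as a field-by-field diff. The sketch below is an assumption about shape only: the dictionary layout, key names, and message format are hypothetical and are not taken from the eval harness or from PyBoy's API.

```python
def first_difference(a: bytes, b: bytes):
    """Return the first offset where two memory dumps differ, or None."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def diff_state(agent: dict, oracle: dict) -> list:
    """List human-readable mismatches between two state snapshots.

    Scalar fields (registers, flags) are compared by value; bytes fields
    (memory regions) report the first differing offset.
    """
    mismatches = []
    for key in sorted(oracle):
        if key not in agent:
            mismatches.append(f"{key}: missing from agent state")
            continue
        expected, actual = oracle[key], agent[key]
        if isinstance(expected, (bytes, bytearray)):
            offset = first_difference(bytes(actual), bytes(expected))
            if offset is not None:
                mismatches.append(f"{key}: first difference at offset 0x{offset:04X}")
        elif actual != expected:
            mismatches.append(f"{key}: expected {expected!r}, got {actual!r}")
    return mismatches
```

Under this model a test passes only when `diff_state` returns an empty list, so a single wrong byte in a memory region is as fatal as a wrong program counter.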

#	Milestone	Suite source	Tests
1	CLI, Breakpoint Controls, and Artifact Contract	gb_contract	4
2	Blargg CPU Serial Semantics	blargg_cpu_semantic	12
3	Generated CPU Core State	gb_generated_cpu_core	160
4	Blargg CPU State and Memory Effects	blargg_cpu_state	12
5	Instruction Timing and Run Control	blargg_timing_semantic + blargg_timing_exact	14
6	Interrupts, Timers, Bus, DMA Basics	mooneye_timer + blargg_misc + dma/interrupt generated	191
7	Cartridge Controllers and RAM Banking	mooneye_mbc + gb_generated_mbc	147
8	Boot Handoff and Power-Up State	gb_generated_powerup	80
9	DMG PPU Timing, Rendering, Window, Sprites, OAM	mooneye ppu + acid2/mealybug DMG + SameSuite PPU	225
10	CGB Internals, Palettes, Banking, HDMA	acid2/mealybug CGB + gb_generated_cgb	252
11	MBC3 RTC and Save-Adjacent Persistence	gb_generated_mbc3	100
12	Joypad, Serial, Link-Cable Register Behavior	gb_generated_io_serial	100
13	APU Registers, Frame Sequencer, Channels	blargg_sound + samesuite/apu + gb_generated_apu	222
14	Deterministic Audio Window Oracle	gb_generated_audio_window	120
15	Public ROM Smoke Compatibility	community_smoke	100
16	Strict Full-State CPU Oracle and Long-Run Stability	gb_generated_cpu + gb_generated_stability	260
	Total scored tests		1,999
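The ladder's gating rule can be sketched in a few lines. This is a hypothetical reading of "60% pass to unlock the next": it assumes the threshold is inclusive and is evaluated per milestone, in order. The function name is illustrative and the harness's actual scoring code may differ.

```python
# Per-milestone test counts, copied from the table above.
MILESTONE_SIZES = [4, 12, 160, 12, 14, 191, 147, 80,
                   225, 252, 100, 100, 222, 120, 100, 260]


def milestones_unlocked(passed, sizes=MILESTONE_SIZES, threshold_pct=60):
    """Count how many milestones are open given per-milestone pass counts.

    Milestone 1 is always open. Each later milestone opens only if every
    earlier one reached the threshold (assumed inclusive).
    """
    if len(passed) != len(sizes):
        raise ValueError("need one pass count per milestone")
    unlocked = 1
    for index in range(len(sizes) - 1):
        # Integer comparison avoids floating-point rounding at the boundary.
        if passed[index] * 100 < threshold_pct * sizes[index]:
            break
        unlocked += 1
    return unlocked
```

The milestone sizes sum to the 1,999 scored tests. Under these assumptions, a run that clears milestones 1 through 4 but passes only 6 of the 14 timing tests has five milestones open and cannot reach milestone 6.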

Five-layer defense

ROM obfuscation per run. Each ROM is copied to a temp file with a secrets.token_hex(16) filename. Title bytes ($0134–$0143) are zeroed. Header checksum at $014D is recomputed. A binary that fingerprints ROMs by filename or title cannot identify them.
Per-run cycle offset. A random T-cycle offset, drawn as secrets.randbelow(16001) + 4000 (so between 4,000 and 20,000 inclusive), is added to the reference cycle count for both the agent's run and the oracle's run. Hardcoded register-state tables are useless because the cycle offset changes every run.
PyBoy as the oracle. The reference state is computed at eval time by running PyBoy independently with the same obfuscated ROM and the same cycle offset. The agent's full hardware state (CPU registers, IME, WRAM, HRAM, VRAM, OAM, timer DIV/TIMA/TMA/TAC, and PPU LCDC/STAT registers) must match.
Canary control ROMs. Handcrafted pass/fail ROMs run through the agent's binary before any real test. If the agent can't distinguish them — for example, if it always writes "Passed" regardless of input — the whole suite is flagged.
Minimum cycle floor (table only; enforcement not wired in yet). Per-ROM cycle minimums (BLARGG_MIN_CYCLES) are tabulated in eval/tests/utils.py with the intent of rejecting runs that complete faster than 50% of the reference cycle floor. The table exists; the check that consumes it is on the to-do list.
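The first two layers are mechanical enough to sketch. The code below is an illustration, not the harness's code: function names and the `.gb` extension are assumptions, while the `secrets` calls, the title range, and the checksum address come from the description above. The header checksum follows the standard cartridge-header rule (subtract each byte from $0134 through $014C, plus one, modulo 256).

```python
import secrets
from pathlib import Path


def header_checksum(rom: bytes) -> int:
    """Standard header checksum over $0134..$014C, stored at $014D."""
    x = 0
    for addr in range(0x0134, 0x014D):
        x = (x - rom[addr] - 1) & 0xFF
    return x


def obfuscate_rom(src: Path, out_dir: Path) -> Path:
    """Copy a ROM under a random name with its title bytes zeroed."""
    rom = bytearray(src.read_bytes())
    rom[0x0134:0x0144] = bytes(16)          # zero $0134..$0143 inclusive
    rom[0x014D] = header_checksum(rom)      # keep the header valid
    dst = out_dir / f"{secrets.token_hex(16)}.gb"  # extension is assumed
    dst.write_bytes(rom)
    return dst


def draw_cycle_offset() -> int:
    """Uniform offset in [4000, 20000] T-cycles, as described above."""
    return secrets.randbelow(16001) + 4000


def checkpoint_cycle(reference_cycles: int, offset: int) -> int:
    """Cycle count at which both agent and oracle state are sampled."""
    return reference_cycles + offset
```

Recomputing the checksum matters because a boot ROM that validates the header would otherwise refuse to hand off to the cartridge after the title is zeroed. The same offset has to be passed to both runs; drawing it twice would compare states at different cycles.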

Cheating detected in the May 2026 Codex runs. The issue was not direct ROM fingerprinting or a copied emulator core; it was submitted code that changed run-control and reporting behavior around eval artifact paths, serial-output shape, terminal-loop detection, and post-result cycle settling. The rows below use each run's best clean eval before the later cheating commit; time, cost, and tokens are measured at that eval.

Per-run results

Raw peaks reached after the cheating commits were higher, but those post-cheat states are not counted here.

Model	Agent	Window	Peak %	Tests passed	Milestones	Cost	Tokens	Outcome
GPT-5.5 xhigh	codex	1 hr 46 min	27.60%	194 / 1,999	4 / 16	$31	45 M	cheating later
GPT-5.4 xhigh	codex	49 min	18.75%	176 / 1,999	3 / 16	$5.14	11 M	cheating later

April 2026 legacy runs

These older rows used the superseded 234-test / 15-milestone schema, so they are not comparable to the current-schema rows above.

Model	Agent	Window	Peak %	Tests passed	Milestones	Cost	Outcome
GPT-5.4	opencode	8 hrs	57.2%	85 / 234	6 / 15	$83	eval crash
GPT-5.4	opencode	3 hrs	29.2%	54 / 234	2 / 15	$22	time budget
Grok 4.20 reasoning	opencode	8 hrs	5.0%	9 / 234	0 / 15	$95	time budget
Grok 4.20 reasoning	opencode	8 hrs	0.0%	0 / 234	0 / 15	$147	mode collapse