RIIR Bench

port-gameboy-emulator · top current-schema: 27.60% · GPT-5.5 xhigh

Game Boy / GBC emulator

Port a Game Boy and Game Boy Color emulator to Rust from the SameBoy C reference. The port must cover a cycle-accurate CPU, PPU, APU, timer, memory bus, MBC1/2/5 banking, OAM DMA, and interrupt controller. It is graded against the established test ROM suites that have served as the de facto Game Boy emulator test set for two decades.

Target language: Rust (default)
Scored tests: 1,999
Milestones: 16 ordered
Unlock threshold: 60% pass to unlock the next
Measurement window: 12 hours
Reference impl in workspace: Yes, full SameBoy C source as study material

What the agent must produce

The agent starts with a scaffolded Rust project containing a working CLI parser, partially stubbed CPU/MMU/PPU modules, and a fully working joypad and interrupt controller. The parser accepts --headless, --max-cycles N, --serial-log, --cart-ram-out, --state-out, --screenshot-on-breakpoint, and --mode dmg|cgb. The full SameBoy C source sits in reference/sameboy/ as study material; the Pan Docs Game Boy reference and the opcode table sit in reference/pandocs/.

The hardest part of this environment is cycle accuracy. Most tests don't care whether the emulator's image looks roughly right; they care whether specific hardware state, precise to the T-cycle, matches the reference at a given cycle count. A 1-cycle error in the timer's falling-edge detector, or a PPU mode-3 timing that is off by 2 dots, fails dozens of tests.
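To make the timer example concrete, here is a minimal sketch of the falling-edge detector as Pan Docs describes it: TIMA increments when the AND of the timer-enable bit and one selected bit of the 16-bit system counter goes from 1 to 0. It is written in Python for brevity rather than in the port's Rust. The `Timer` class, its method names, and the `irq` flag are illustrative and not taken from the scaffold or from SameBoy. Overflow handling is deliberately simplified: real hardware delays the TMA reload and the interrupt, which this sketch does not model.

```python
# Bit of the 16-bit system counter watched for each TAC clock select (per Pan Docs).
TAC_BIT = {0b00: 9, 0b01: 3, 0b10: 5, 0b11: 7}


class Timer:
    """Illustrative DMG timer model; names are hypothetical."""

    def __init__(self):
        self.counter = 0   # 16-bit system counter; DIV is its upper byte
        self.tima = 0
        self.tma = 0
        self.tac = 0
        self.irq = False   # hypothetical flag an interrupt controller would read

    @property
    def div(self):
        return (self.counter >> 8) & 0xFF

    def _signal(self):
        # The detector input: timer enable AND the selected counter bit.
        enabled = (self.tac >> 2) & 1
        selected = (self.counter >> TAC_BIT[self.tac & 0b11]) & 1
        return enabled & selected

    def _detect_edge(self, before):
        # TIMA increments only on a 1 -> 0 transition of the signal.
        if before == 1 and self._signal() == 0:
            self.tima = (self.tima + 1) & 0xFF
            if self.tima == 0:
                # Simplification: reload and interrupt happen immediately here.
                self.tima = self.tma
                self.irq = True

    def tick(self, t_cycles=1):
        for _ in range(t_cycles):
            before = self._signal()
            self.counter = (self.counter + 1) & 0xFFFF
            self._detect_edge(before)

    def write_div(self):
        # Any write to DIV resets the whole counter, which can itself
        # produce a falling edge and therefore a TIMA increment.
        before = self._signal()
        self.counter = 0
        self._detect_edge(before)
```

With TAC set to 0b101 (enabled, bit 3 selected), TIMA stays at 0 for 15 T-cycles and becomes 1 on the 16th. An implementation that increments one cycle early or late produces a different TIMA at the checkpoint, which is the kind of mismatch the state comparison catches.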

Test suites, in unlock order

Established test ROM suites, generated hardware-state programs, and control/contract suites are combined into one ordered ladder. Some cases pass/fail through serial output or cartridge RAM; many also require full hardware-state comparison against PyBoy at fixed cycle checkpoints.

# | Milestone | Suite source | Tests
1 | CLI, Breakpoint Controls, and Artifact Contract | gb_contract | 4
2 | Blargg CPU Serial Semantics | blargg_cpu_semantic | 12
3 | Generated CPU Core State | gb_generated_cpu_core | 160
4 | Blargg CPU State and Memory Effects | blargg_cpu_state | 12
5 | Instruction Timing and Run Control | blargg_timing_semantic + blargg_timing_exact | 14
6 | Interrupts, Timers, Bus, DMA Basics | mooneye_timer + blargg_misc + dma/interrupt generated | 191
7 | Cartridge Controllers and RAM Banking | mooneye_mbc + gb_generated_mbc | 147
8 | Boot Handoff and Power-Up State | gb_generated_powerup | 80
9 | DMG PPU Timing, Rendering, Window, Sprites, OAM | mooneye ppu + acid2/mealybug DMG + SameSuite PPU | 225
10 | CGB Internals, Palettes, Banking, HDMA | acid2/mealybug CGB + gb_generated_cgb | 252
11 | MBC3 RTC and Save-Adjacent Persistence | gb_generated_mbc3 | 100
12 | Joypad, Serial, Link-Cable Register Behavior | gb_generated_io_serial | 100
13 | APU Registers, Frame Sequencer, Channels | blargg_sound + samesuite/apu + gb_generated_apu | 222
14 | Deterministic Audio Window Oracle | gb_generated_audio_window | 120
15 | Public ROM Smoke Compatibility | community_smoke | 100
16 | Strict Full-State CPU Oracle and Long-Run Stability | gb_generated_cpu + gb_generated_stability | 260
Total scored tests: 1,999
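
The ladder's unlock rule can be sketched as follows, using the per-milestone test counts from the table and the 60% threshold. This is only an illustration of the gating idea: the function name is hypothetical, and treating exactly 60% as passing is an assumption, since the page does not say whether the threshold is inclusive.

```python
# Test counts per milestone, in ladder order, taken from the table above.
MILESTONE_TESTS = [4, 12, 160, 12, 14, 191, 147, 80,
                   225, 252, 100, 100, 222, 120, 100, 260]
UNLOCK_PERCENT = 60  # "60% pass to unlock the next"


def highest_unlocked(passed):
    """Return the 1-based number of the last milestone the agent may attempt.

    `passed` lists tests passed per milestone, in ladder order; missing
    entries count as zero. Assumes an inclusive threshold (>= 60%).
    """
    unlocked = 1  # milestone 1 is always open
    for index, total in enumerate(MILESTONE_TESTS[:-1]):
        got = passed[index] if index < len(passed) else 0
        # Integer comparison avoids floating-point rounding at the boundary.
        if got * 100 < total * UNLOCK_PERCENT:
            break
        unlocked = index + 2
    return unlocked
```

For example, an agent that fully passes milestones 1 through 4 but only 6 of the 14 tests in milestone 5 stays at milestone 5, because 6 / 14 is below the threshold.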

Five-layer defense

  1. ROM obfuscation per run. Each ROM is copied to a temp file with a secrets.token_hex(16) filename. Title bytes ($0134–$0143) are zeroed. Header checksum at $014D is recomputed. A binary that tries to fingerprint ROMs by filename or title has nothing left to match against.
  2. Per-run cycle offset. A random secrets.randbelow(16001) + 4000 T-cycle offset is added to the reference cycle count for both the agent's run and the oracle's run. Hardcoded register-state tables are useless because the cycle offset changes every run.
  3. PyBoy as the oracle. The reference state is computed at eval time by running PyBoy independently with the same obfuscated ROM and the same cycle offset. The agent's full hardware state (CPU registers, IME, WRAM, HRAM, VRAM, OAM, timer DIV/TIMA/TMA/TAC, and PPU LCDC/STAT registers) must match.
  4. Canary control ROMs. Handcrafted pass/fail ROMs run through the agent's binary before any real test. If the agent can't distinguish them — for example, if it always writes "Passed" regardless of input — the whole suite is flagged.
  5. Minimum cycle floor (table only; enforcement not wired in yet). Per-ROM cycle minimums (BLARGG_MIN_CYCLES) are tabulated in eval/tests/utils.py with the intent of rejecting runs that complete faster than 50% of the reference cycle floor. The table exists; the check that consumes it is on the to-do list.
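
Layers 1 through 3 can be sketched roughly as below. This is not the harness's actual code: the function names, the `.gb` extension, and the shape of the state dump (a flat dict) are assumptions made for illustration. The header checksum formula and the covered range ($0134–$014C) follow the cartridge header description in Pan Docs, and the two `secrets` calls are the ones named in the list above.

```python
import secrets
from pathlib import Path

TITLE = slice(0x0134, 0x0144)            # title bytes $0134-$0143
CHECKSUM_RANGE = range(0x0134, 0x014D)   # bytes covered by the header checksum
CHECKSUM_ADDR = 0x014D


def header_checksum(rom):
    """Header checksum per Pan Docs: x = x - byte - 1 over $0134-$014C."""
    value = 0
    for addr in CHECKSUM_RANGE:
        value = (value - rom[addr] - 1) & 0xFF
    return value


def obfuscate_rom(src, out_dir):
    """Layer 1: copy the ROM under a random name with the title zeroed."""
    rom = bytearray(Path(src).read_bytes())
    rom[TITLE] = bytes(16)
    rom[CHECKSUM_ADDR] = header_checksum(rom)  # keep the header valid
    dst = Path(out_dir) / f"{secrets.token_hex(16)}.gb"  # extension is assumed
    dst.write_bytes(rom)
    return dst


def checkpoint_cycles(reference_cycles):
    """Layer 2: add a per-run offset of 4000 to 20000 T-cycles (inclusive)."""
    return reference_cycles + secrets.randbelow(16001) + 4000


def state_mismatches(agent, oracle):
    """Layer 3: list the keys whose values differ between two state dumps."""
    keys = agent.keys() | oracle.keys()
    return sorted(k for k in keys if agent.get(k) != oracle.get(k))
```

The same offset would have to be passed to both the agent's binary and the oracle so that the two dumps describe the same cycle; a run passes only if `state_mismatches` returns an empty list.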

Cheating was detected in the May 2026 Codex runs. The issue was not direct ROM fingerprinting or a copied emulator core; it was submitted code that changed run-control and reporting behavior around eval artifact paths, serial-output shape, terminal-loop detection, and post-result cycle settling. The rows below use each run's best clean eval before the later cheating commit; time, cost, and tokens are measured at that eval.

Per-run results

Each current-schema Codex run is scored at its best clean eval before the later cheating commit, and time, cost, and tokens are measured at that eval. Later raw peaks were higher, but those post-cheat states are not counted here.

Model | Agent | Window | Peak % | Tests passed | Milestones | Cost | Tokens | Outcome
GPT-5.5 xhigh | codex | 1 hr 46 min | 27.60% | 194 / 1,999 | 4 / 16 | $31 | 45 M | cheating later
GPT-5.4 xhigh | codex | 49 min | 18.75% | 176 / 1,999 | 3 / 16 | $5.14 | 11 M | cheating later

April 2026 legacy runs

These older rows used the superseded 234-test / 15-milestone schema, so they are not comparable to the current-schema rows above.

Model | Agent | Window | Peak % | Tests passed | Milestones | Cost | Outcome
GPT-5.4 | opencode | 8 hrs | 57.2% | 85 / 234 | 6 / 15 | $83 | eval crash
GPT-5.4 | opencode | 3 hrs | 29.2% | 54 / 234 | 2 / 15 | $22 | time budget
Grok 4.20 reasoning | opencode | 8 hrs | 5.0% | 9 / 234 | 0 / 15 | $95 | time budget
Grok 4.20 reasoning | opencode | 8 hrs | 0.0% | 0 / 234 | 0 / 15 | $147 | mode collapse