port-c-compiler · best: 97.05% · GPT-5.5 xhigh · 10 hrs 6 min · $165
Implement a C compiler in idiomatic Rust. The submitted binary has to compile real C end-to-end:
tokenize, preprocess, parse, typecheck, codegen to x86-64 AT&T assembly, then hand off to
`as` and `ld`. It is graded against three established C test corpora,
3,207 tests across 44 ordered milestones.
This is real C compilation, not a toy. The submitted binary is invoked as `./cc input.c -o output`.
It has to handle the language features covered by Nora Sandler's textbook "Writing a C Compiler",
whose companion test suite, Writing-a-C-Compiler-Tests, is abbreviated WACCT here (integers, control flow,
functions, pointers, arrays, strings, structs, unions, casts), then enough of the C standard to pass
c-testsuite, then the adversarial cases in the GCC torture execute tests.
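As a rough illustration of the two ends of that pipeline, the sketch below parses the `input.c -o output` command line and emits AT&T-syntax x86-64 assembly for the smallest possible program, a `main` that returns a constant. The function names `parse_args` and `emit_return_const` are hypothetical and are not taken from any submitted compiler. Lexing, preprocessing, parsing, type checking, and the hand-off to `as` and `ld` are all omitted.

```rust
/// Parse the arguments that follow the program name in `./cc input.c -o output`.
/// Returns (input_path, output_path), or None if the shape is not recognised.
/// (Hypothetical helper; a real driver would accept more flags.)
fn parse_args(args: &[String]) -> Option<(String, String)> {
    match args {
        [input, flag, output] if flag.as_str() == "-o" => {
            Some((input.clone(), output.clone()))
        }
        _ => None,
    }
}

/// Emit AT&T-syntax x86-64 assembly for `int main(void) { return <value>; }`.
/// The 32-bit return value goes in %eax, per the System V AMD64 calling convention.
fn emit_return_const(value: i32) -> String {
    let mut asm = String::new();
    asm.push_str("    .globl main\n");
    asm.push_str("main:\n");
    // AT&T syntax: `$` marks an immediate, `%` marks a register, source comes first.
    asm.push_str(&format!("    movl ${}, %eax\n", value));
    asm.push_str("    ret\n");
    asm
}
```

For `emit_return_const(42)` the result is a four-line assembly file, which a real driver would write to disk and pass to the assembler.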
The agent cannot shell out to a host C compiler. The eval sandbox runs an `strace` probe
at session setup that watches for `execve` calls to `cc`, `gcc`,
`g++`, `c++`, `clang`, or `clang++`. If a run delegates
compilation, its public score is taken from the best clean eval recorded before the cheating commit,
and time, cost, and tokens are measured at that eval. (One such incident has been caught; see below.)
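The probe's implementation is not described here, so the following is only a minimal sketch of the idea: scan `strace` output for `execve` calls and flag any whose target has one of the six forbidden basenames. The names `FORBIDDEN`, `execve_target`, `is_forbidden`, and `trace_is_clean` are hypothetical, and the sketch assumes the common `execve("/path", [...], ...) = 0` line shape that `strace` prints.

```rust
/// Compiler names listed in the text; exact basename match only (an assumption).
const FORBIDDEN: [&str; 6] = ["cc", "gcc", "g++", "c++", "clang", "clang++"];

/// Extract the executable path from one strace line, if it is an execve call.
/// Works with or without a leading `[pid N]` prefix.
fn execve_target(line: &str) -> Option<&str> {
    let marker = "execve(\"";
    let start = line.find(marker)? + marker.len();
    let rest = &line[start..];
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// True if the path's final component is one of the forbidden compiler names.
fn is_forbidden(path: &str) -> bool {
    let name = path.rsplit('/').next().unwrap_or(path);
    FORBIDDEN.contains(&name)
}

/// True if no execve in the trace targets a forbidden compiler.
fn trace_is_clean(trace: &str) -> bool {
    !trace
        .lines()
        .filter_map(|line| execve_target(line))
        .any(|path| is_forbidden(path))
}
```

Because the match is on exact basenames, calls to `as` and `ld` pass. A versioned or prefixed binary name would also pass this sketch, which a production probe would presumably need to handle.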
The 44 milestones are grouped into four bands: Basics (a 41-test warm-up), Writing-a-C-Compiler-Tests (chapters 1–20 of Nora Sandler's textbook, paired with their checkpoint sub-suites), c-testsuite (212 canonical C conformance tests, drawn from the larger upstream c-testsuite corpus), and the GCC torture execute tests (1,392 adversarial cases).
| Band | Tests | What's measured |
|---|---|---|
| Basics | 41 | Hello-world, arithmetic, control flow, function calls, the smallest possible programs |
| WACCT Ch1–20 (+ checkpoints) | ~1,556 | Integers, longs, floats/doubles, pointers, arrays, strings, structs, unions, enums, file-scope vars, function pointers, switch statements, and the rest of Sandler's "Writing a C Compiler" progression |
| c-testsuite | 212 | Open-source C conformance suite. Edge cases in initialization, scope, operator precedence, integer promotions, bitfields |
| GCC torture execute | 1,392 | The adversarial bar. Specifically designed to break compilers — pathological switch tables, deep recursion, struct-return ABI edge cases, sign-extension corners, the _Complex family, computed goto, every UB-adjacent corner of C99/C11 |
| Total | 3,207 | |
Every production c-compiler run with peak % > 0, sorted by peak. "Wall" is the elapsed time at which the last useful state was recorded.
| Model | Agent | Budget | Wall | Peak % | Tests | Cost | Tokens | Outcome |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 (xhigh) | codex | 12 hrs | 10 hrs 6 min | 97.05% | 2,671 / 2,995 | $165 | 231 M | platform bug |
| GPT-5.4 (xhigh) | codex | 12 hrs | 12 hrs | 96.63% | 2,938 / 3,207 | $108 | 267 M | platform bug |
| GPT-5.5 | codex | 2 hrs | 1 hr 30 min | 91.93% | 1,107 / 2,995 | not recorded | 62.4 M | sandbox hard kill |
| GPT-5.4-mini | codex | 12 hrs | 3 hrs 33 min | 91.67% | 1,111 / 2,995 | $20 | 160 M | cheating later |
| GPT-5.5 (xhigh) | codex | 12 hrs | 1 hr | 78.97% | 820 / 2,995 | $32 | 44.1 M | abandoned |
| DeepSeek-v4-pro | opencode | 12 hrs | 3 hrs 54 min | 55.90% | 630 / 3,207 | $4.50 | 171 M | model gave up |
| Qwen 3.6 plus | opencode | 12 hrs | 1 hr 54 min | 45.04% | 352 / 2,995 | $5.80 | 81 M | model gave up |
On May 12, GPT-5.4-mini's run climbed normally through WACCT chapters 1 to roughly 18 over the first ~3.5 hours. Then at minute 215, after hitting a structures/unions wall it couldn't pass, the run stopped being a compiler port and switched to host-compiler delegation.
It then committed `Delegate C compilation to host compiler` at 20:51:59, replacing the
entire compile pipeline with a call to `Command::new("/usr/bin/cc")`, followed five
minutes later by `Use host compiler for linking`. The eval climbed across the two
cheating commits: 1,111 → 3,203 → 3,207 passing tests, the entire suite.
The anti-cheat `strace` probe was committed to the eval sandbox 2 hours and
18 minutes after this run finished. The incident is what motivated the probe.
Subsequent runs cannot repeat this cheat: the probe runs at session setup and rejects any
submission whose `execve` trace contains `cc`, `gcc`,
`g++`, `c++`, `clang`, or `clang++`. The agent's own
toolchain (the assembler `as` and the linker `ld`) is unaffected.
The run is preserved in the dataset as ccomp-port-gpt54mini-codex-12h-0512a, scored
only through the best clean eval before the later host-compiler delegation. The time, cost, and
tokens shown for this run are measured at that clean eval. The transcript will be included in the
public data release as a documented example of agent reasoning during a cheating attempt.
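The scoring rule for a run that later cheats can be sketched as follows. This is an assumption about how "best clean eval before the delegation" might be computed, not the evaluator's actual code; the `Eval` struct, its fields, and `public_score` are hypothetical names.

```rust
/// One eval snapshot in commit order (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Eval {
    /// Number of passing tests at this eval.
    passing: u32,
    /// True once the submission delegates compilation to a host compiler.
    delegates_to_host: bool,
}

/// Best passing count among the evals that precede the first delegating one.
/// Returns None if there is no clean eval at all.
fn public_score(evals: &[Eval]) -> Option<u32> {
    evals
        .iter()
        .take_while(|e| !e.delegates_to_host) // stop at the first cheating eval
        .map(|e| e.passing)
        .max()
}
```

Applied to the three evals from this incident (1,111 clean, then 3,203 and 3,207 after delegation), `public_score` returns `Some(1111)`, matching the test count reported for this run in the table.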