RIIR Bench

long-horizon coding agents · preliminary release

Can a frontier model Rewrite It In Rust?

Given as much time and resources as needed, can a frontier model build a full working implementation of real-world software in Rust?

Even if a model scores 100% on a task, I do not claim that the generated code is production-ready. The only guarantee is that it passes the tests that make up the eval.

The end goal of this project: one-shot safe, efficient, and production-grade Linux in Rust. All of it.

Best run per model, per task

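| # | Harness | Model | % completion | Time | Cost | Tokens |
|---|---------|-------|--------------|------|------|--------|
| 1 | codex | GPT-5.4 (xhigh) | 100% | 5 hrs | $52 | 139 M |
| 2 | codex | GPT-5.5 (xhigh) | 99.91% | 8 hrs 54 min | $258 | 386 M |
| 3 | codex | GPT-5.4-mini | 68.75% | 2 hrs | $9 | 64 M |
| 4 | opencode | DeepSeek-v4-pro | 36.59% | 4 hrs 54 min | $4.61 | 155 M |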

Turn 1 — after the first model turn

The task instructions ask for a one-shot port. No model manages one. A turn is one model-invocation cycle, which ends when the harness signals idle. Time, cost, and tokens are measured at the first evaluated turn boundary; % completion is the eval score of the commit at that boundary.

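| Harness | Model | % completion | Time | Cost | Tokens |
|---------|-------|--------------|------|------|--------|
| codex | GPT-5.4 (xhigh) | 27.70% | 29 min | ~$5.17 | 11.8 M |
| codex | GPT-5.5 (xhigh) | 48.45% | 44 min | ~$24.61 | 37.4 M |
| codex | GPT-5.4-mini | 11.92% | 11 min | ~$1.23 | 9.5 M |
| opencode | DeepSeek-v4-pro | 12.20% | 1 hr 33 min | $1.26 | 29.7 M |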
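
The Turn 1 rule described above can be sketched as picking the earliest turn boundary that has an eval attached. This is only an illustration: the `TurnBoundary` record, its field names, and the sample values are hypothetical and do not come from the RIIR Bench harness.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TurnBoundary:
    """Hypothetical snapshot taken when the harness reports idle."""
    turn: int                               # 1-based turn index
    commit: str                             # commit at the boundary (made-up ids below)
    elapsed_min: float                      # cumulative wall-clock minutes
    cost_usd: float                         # cumulative cost in USD
    tokens_m: float                         # cumulative tokens, in millions
    completion_pct: Optional[float] = None  # None if this boundary was not evaluated


def first_evaluated_boundary(
    boundaries: Sequence[TurnBoundary],
) -> Optional[TurnBoundary]:
    """Return the earliest boundary that has an eval score, or None."""
    for boundary in sorted(boundaries, key=lambda b: b.turn):
        if boundary.completion_pct is not None:
            return boundary
    return None


# Illustrative values only, not taken from any real run.
example = [
    TurnBoundary(turn=2, commit="bbb", elapsed_min=75.0, cost_usd=9.0,
                 tokens_m=20.0, completion_pct=40.0),
    TurnBoundary(turn=1, commit="aaa", elapsed_min=30.0, cost_usd=5.0,
                 tokens_m=12.0, completion_pct=25.0),
]
turn1 = first_evaluated_boundary(example)
print(turn1.commit, turn1.completion_pct)  # prints: aaa 25.0
```

Time, cost, and tokens for the Turn 1 row would then be read from the same record as the score, so all four columns describe one commit.
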
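
| # | Harness | Model | % completion | Time | Cost | Tokens |
|---|---------|-------|--------------|------|------|--------|
| 1 | codex | GPT-5.5 (xhigh) | 97.05% | 10 hrs 6 min | $165 | 231 M |
| 2 | codex | GPT-5.4 (xhigh) | 96.63% | 12 hrs | $108 | 267 M |
| 3 | codex | GPT-5.4-mini | 91.67%* | 3 hrs 33 min | $20 | 160 M |
| 4 | opencode | DeepSeek-v4-pro | 55.90% | 3 hrs 54 min | $4.50 | 171 M |
| 5 | opencode | Qwen 3.6 plus | 45.04% | 1 hr 54 min | $5.80 | 81 M |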

* Run cheated. Score reported is the best clean eval before cheating started.
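
One way to read the footnote is as a filter followed by a maximum: drop every eval at or after the first cheating commit, then keep the highest remaining score. The sketch below assumes a hypothetical `Eval` record and assumes ties go to the earlier eval; neither detail is stated in the source.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Eval:
    """Hypothetical record for one evaluated commit in a run."""
    order: int        # position of the evaluated commit within the run
    score_pct: float  # % completion reported by the eval


def best_clean_eval(
    evals: Sequence[Eval],
    first_cheating_order: Optional[int],
) -> Optional[Eval]:
    """Highest-scoring eval strictly before cheating started.

    If first_cheating_order is None the run never cheated and every
    eval counts as clean. Ties go to the earlier eval (an assumption).
    """
    clean = [
        e for e in evals
        if first_cheating_order is None or e.order < first_cheating_order
    ]
    if not clean:
        return None
    return max(clean, key=lambda e: (e.score_pct, -e.order))


# Illustrative scores only: the 100.0 at order 4 is excluded as unclean.
run = [Eval(1, 50.0), Eval(2, 75.0), Eval(3, 70.0), Eval(4, 100.0)]
best = best_clean_eval(run, first_cheating_order=4)
print(best.order, best.score_pct)  # prints: 2 75.0
```

The time, cost, and token figures for a starred row would be taken at the returned eval rather than at the end of the run.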

Turn 1 — after the first model turn

| Harness | Model | % completion | Time | Cost | Tokens |
|---------|-------|--------------|------|------|--------|
| codex | GPT-5.5 (xhigh) | 93.95% | 1 hr 50 min | ~$60.06 | 86.8 M |
| codex | GPT-5.4 (xhigh) | 42.81% | 30 min | ~$6.76 | 17.9 M |
| opencode | DeepSeek-v4-pro | 0% | 48 min | $0.71 | 3.7 M |
| opencode | Qwen 3.6 plus | 0.12% | 42 min | $1.81 | 26.0 M |
| codex | GPT-5.4-mini | 31.60% | 15 min | ~$1.69 | 10.8 M |
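
| # | Harness | Model | % completion | Time | Cost | Tokens |
|---|---------|-------|--------------|------|------|--------|
| 1 | codex | GPT-5.5 (xhigh) | 27.60%* | 1 hr 46 min | $31 | 45 M |
| 2 | codex | GPT-5.4 (xhigh) | 18.75%* | 49 min | $5.14 | 11 M |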

* Run cheated. Score reported is the best clean eval before cheating started.

Turn 1 — after the first model turn

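| Harness | Model | % completion | Time | Cost | Tokens |
|---------|-------|--------------|------|------|--------|
| codex | GPT-5.5 (xhigh) | 24.93% | 34 min | ~$15.73 | 22.6 M |
| codex | GPT-5.4 (xhigh) | 18.75% | 38 min | ~$4.51 | 10.8 M |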

% completion is the percentage of counted milestones completed. For rows marked with *, time, cost, and tokens are measured at the best clean eval, not at the later cheating commit or at the end of the run. Tokens are the cumulative total reported by the harness: Codex input counts already include cached reads, while for OpenCode Go the cache reads and writes are added to the total. Per-environment details are on the tasks page. Thesis and methodology are on the about page.
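
The token accounting can be illustrated with a small helper. It is a sketch under one reading of the note above: for Codex the reported input count is assumed to already contain cached reads, so nothing is added, while for OpenCode the cache reads and writes are assumed to be reported separately and are summed in. The function name, parameters, and sample numbers are hypothetical, and reasoning tokens are not modeled.

```python
def total_tokens(
    harness: str,
    input_tokens: int,
    output_tokens: int,
    cache_read: int = 0,
    cache_write: int = 0,
) -> int:
    """Cumulative token total for one run, per the assumed reporting rules."""
    if harness == "codex":
        # Assumption: cached reads are already part of input_tokens,
        # so adding cache_read again would double count them.
        return input_tokens + output_tokens
    if harness == "opencode":
        # Assumption: cache traffic is reported outside input_tokens.
        return input_tokens + output_tokens + cache_read + cache_write
    raise ValueError(f"unknown harness: {harness!r}")


# Illustrative numbers only.
print(total_tokens("codex", 100, 20, cache_read=80))                    # prints: 120
print(total_tokens("opencode", 100, 20, cache_read=80, cache_write=5))  # prints: 205
```

Under these assumptions the two harnesses end up on a comparable basis, since both totals count cached tokens exactly once.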

How to cite

@misc{hayat2026riirbench,
  title  = {RIIR Bench: A long-horizon benchmark for AI coding
            agents on real engineering ports to Rust},
  author = {Hassan Hayat},
  year   = {2026},
  url    = {https://riirbench.com},
  note   = {Independent researcher. Contact: hassan.hayat7@gmail.com}
}