RIIR Bench

port-sqlite-to-rust · best: 100% · GPT-5.4 xhigh · 5 hrs · $52

SQLite-compatible SQL engine

Build a SQLite-compatible SQL engine in Rust from scratch. Match the reference engine's behavior across 5.8 million sqllogictest records spanning DDL, joins, subqueries, and the indexed milestones that separate naive implementations from real query planners.

Target language: Rust (default)
Total tests: 5,806,415
Milestones: 18 ordered
Unlock threshold: 60% pass to unlock the next
Measurement window: 12 hours
Reference impl in workspace: Yes, full sqlite3.c amalgamation as study material

What the agent starts with

The agent starts with a scaffolded Cargo project: a working JSONL request/response driver, type-system stubs, an empty parser with all the AST types pre-declared, and an executor with "not implemented" stubs for every SQL statement variant. The full SQLite source amalgamation (~256,000 lines of C) sits in reference/ as study material.
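
To make the starting point concrete, here is a minimal sketch of what an executor with "not implemented" stubs might look like. The `Statement` variants, the `ExecError` type, and the `execute` signature are all hypothetical names chosen for illustration; the actual scaffold's types are not shown in this page.

```rust
/// Hypothetical subset of pre-declared AST statement variants.
/// The real scaffold declares one variant per SQL statement kind.
#[allow(dead_code)] // fields are unread until the stubs are filled in
#[derive(Debug)]
enum Statement {
    CreateTable { name: String },
    Insert { table: String },
    Select { sql: String },
}

/// Hypothetical error type returned by the stubbed executor.
#[derive(Debug, PartialEq)]
enum ExecError {
    NotImplemented(&'static str),
}

/// Result rows as plain strings, an assumption made for brevity.
type Rows = Vec<Vec<String>>;

/// Every arm starts as a stub; the agent's job is to replace each
/// one with a real implementation.
fn execute(stmt: &Statement) -> Result<Rows, ExecError> {
    match stmt {
        Statement::CreateTable { .. } => Err(ExecError::NotImplemented("CREATE TABLE")),
        Statement::Insert { .. } => Err(ExecError::NotImplemented("INSERT")),
        Statement::Select { .. } => Err(ExecError::NotImplemented("SELECT")),
    }
}
```

Because the `match` is exhaustive, the compiler flags any statement variant that has no arm, which is one reason a scaffold of this shape is convenient to fill in incrementally.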

The job is to fill in the parser, the executor, the storage layer, the scalar and aggregate functions, and an index-aware query planner, to the point where the engine matches real SQLite's behavior on the sqllogictest corpus. Linking against rusqlite or shelling out to sqlite3 is explicitly forbidden.

Test surface, in order

Milestones unlock sequentially: a milestone is attempted only once the previous one passes at least 60% of its tests. The first eight milestones are pure SQL semantics (~11k records total). The next seven add indexed operations (~2M records). The final three are the long tail: ~3.7M records of scalar expressions, aggregates, and scaled joins.
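
The gating rule can be sketched as a walk over the ordered milestones that stops after the first one below the threshold. This is an illustration of the rule as described here, not the harness's actual code; `MilestoneResult` and `attempted_suites` are hypothetical names, and integer arithmetic is used so that exactly 60% counts as a pass.

```rust
/// Unlock threshold from the eval description, in percent.
const UNLOCK_THRESHOLD_PCT: u64 = 60;

/// Hypothetical per-milestone result record.
struct MilestoneResult {
    suite: &'static str,
    passed: u64,
    total: u64,
}

impl MilestoneResult {
    /// True when passed / total >= 60%, computed without floats.
    fn unlocks_next(&self) -> bool {
        self.passed * 100 >= self.total * UNLOCK_THRESHOLD_PCT
    }
}

/// Returns the suites that get attempted, in order. A milestone
/// below the threshold is still attempted, but nothing after it is.
fn attempted_suites(results: &[MilestoneResult]) -> Vec<&'static str> {
    let mut attempted = Vec::new();
    for r in results {
        attempted.push(r.suite);
        if !r.unlocks_next() {
            break;
        }
    }
    attempted
}
```

For example, passing 69 of 69 on `ddl_foundations` but only 16 of 27 (about 59%) on `update_mutations` would leave `membership_null_logic` locked, while 17 of 27 (about 63%) would unlock it.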

# | Milestone | Suite | Tests
1 | DDL Foundations | ddl_foundations | 69
2 | UPDATE Mutations | update_mutations | 27
3 | IN / NOT IN with NULLs | membership_null_logic | 267
4 | SELECT: CASE and Subqueries | select_case_subqueries | 1,031
5 | SELECT: NULLs and COALESCE | select_nulls_coalesce | 1,031
6 | SELECT: EXISTS and Correlated | select_exists_correlated | 3,351
7 | SELECT: UNION / EXCEPT / INTERSECT | select_set_operations | 3,857
8 | SELECT: Multi-Table Joins | select_multitable_joins | 1,436
9 | Indexed DELETE and Filter | indexed_delete_filter | 160,942
10 | Indexed IN / NOT IN | indexed_membership | 132,907
11 | Indexed BETWEEN | indexed_range | 124,618
12 | Indexed ORDER BY | indexed_ordering | 808,122
13 | Indexed Predicate Commutativity | indexed_commutativity | 514,321
14 | Indexed Randomized Predicates | indexed_randomized | 208,168
15 | Indexed Views | indexed_views | 109,229
16 | Scalar Expressions | expressions_scalar | 1,200,366
17 | Aggregate Functions | aggregates | 1,292,350
18 | Multi-Table Joins (scaled) | joins_multitable | 1,244,323
Total | | | 5,806,415
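
Milestone 3 is a good example of where small semantic details matter. The sketch below shows one way to evaluate `IN` and `NOT IN` under SQL's three-valued logic, restricted to nullable integers for brevity. The names `SqlInt`, `sql_in`, and `sql_not_in` are hypothetical; `None` stands for NULL, and a `None` result means the expression evaluates to NULL.

```rust
/// Nullable SQL integer; `None` models NULL. Real engines need a
/// full value type, this restriction is only for illustration.
type SqlInt = Option<i64>;

/// Evaluates `needle IN (set...)`.
/// Some(true) / Some(false) are definite answers; None is NULL.
fn sql_in(needle: SqlInt, set: &[SqlInt]) -> Option<bool> {
    // SQLite treats an empty right-hand side as false, even when
    // the left operand is NULL.
    if set.is_empty() {
        return Some(false);
    }
    // NULL on the left with a non-empty set yields NULL.
    let needle = needle?;
    let mut saw_null = false;
    for item in set {
        match item {
            Some(v) if *v == needle => return Some(true),
            None => saw_null = true,
            _ => {}
        }
    }
    // No match: the answer is unknown if any candidate was NULL.
    if saw_null { None } else { Some(false) }
}

/// `NOT IN` negates definite answers and leaves NULL as NULL.
fn sql_not_in(needle: SqlInt, set: &[SqlInt]) -> Option<bool> {
    sql_in(needle, set).map(|b| !b)
}
```

The case that commonly trips up naive implementations is `2 NOT IN (1, NULL)`: it evaluates to NULL rather than true, so a `WHERE` clause using it filters the row out.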

Per-run results

Every production sqlite run, sorted by peak milestone-completion percentage. "Wall" is the elapsed time at which the last useful state was recorded; for runs that ended in a platform failure, this is when the failure happened, not the configured budget.

Model | Agent | Budget | Wall | Peak % | Tests | Cost | Tokens | Outcome
GPT-5.4 (xhigh) | codex | 12 hrs | 5 hrs | 100% | 5,806,415 / 5,806,415 | $52 | 139 M | all tests pass
GPT-5.5 (xhigh) | codex | 12 hrs | 8 hrs 54 min | 99.91% | 5,803,177 / 5,806,415 | $258 | 386 M | platform bug
GPT-5.4-mini | codex | 2 hrs | 2 hrs | 68.75% | 1,783,123 / 5,806,415 | $9 | 64 M | time budget
GPT-5.4-mini | codex | 12 hrs | 7 hrs 42 min | 58.17% | 431,937 / 5,806,415 | $20 | 162 M | model gave up
GPT-5.5 (xhigh) | codex | 12 hrs | 1 hr 24 min | 44.17% | 164,453 / 5,806,415 | $33 | 48 M | platform crash
DeepSeek-v4-pro | opencode | 12 hrs | 4 hrs 54 min | 36.59% | 8,038 / 5,806,415 | $4.61 | 155 M | model gave up
GPT-5.4-mini | codex | 12 hrs | 42 min | 14.44% | 325 / 5,806,415 | $3 | 23 M | platform crash
GPT-5.4-mini | codex | 2 hrs | 18 min | 0% | 0 / 5,806,415 | $2.39 | 22.7 M | sandbox lost

Tokens are cumulative reported totals: Codex input includes cached reads; for OpenCode Go, cache reads and writes are added from provider usage. Terminal labels come from platform state.