Benchmark
The type checker never fired
Last week I ran AgentLang Index v1.0 against three frontier models and twenty tasks in five languages. The headline was the Zero column: 0/20, 0/20, 0/20 across gpt-4o-mini, gpt-4o, and gpt-5. Sixty Zero attempts, zero passes. I wrote up the headline last Tuesday. What I did not write up was the failure distribution. This post is that.
The Zero compiler emits structured error codes. IMP is the import resolver. PAR is the parser. TYP is the type checker. SYS is the runtime. Across the 60 attempts, the error breakdown is:
IMP001 (unknown import):   30
PAR100 (parse error):      26
TYP* (type error):          0
SYS* (system error):        0
no extractable code:        4
                           ──
total:                     60
Zero TYP errors. None. Across sixty frontier-model attempts to write Zero programs, no model ever produced output that the compiler accepted as far as the type-checking phase. Every single failure happened at the parser or at the import resolver. The type checker, where most language safety lives, never ran.
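As a rough illustration of how that breakdown falls out of the error codes, here is a minimal Python sketch. It assumes only what is stated above: each code's three-letter prefix names the phase that rejected the program. The names PHASE_BY_PREFIX and phase_of are hypothetical and not part of the benchmark harness; the outcome list is rebuilt from the reported counts, not read from the raw artifacts.

```python
from collections import Counter

# Prefix-to-phase mapping as described in the post (names here are illustrative).
PHASE_BY_PREFIX = {
    "PAR": "parser",
    "IMP": "import resolver",
    "TYP": "type checker",
    "SYS": "runtime",
}


def phase_of(code):
    """Map an error code such as 'IMP001' to the phase that emitted it."""
    if code is None:
        return "no extractable code"
    return PHASE_BY_PREFIX[code[:3]]


# Rebuild the 60 outcomes from the counts reported in the breakdown.
outcomes = ["IMP001"] * 30 + ["PAR100"] * 26 + [None] * 4

tally = Counter(phase_of(code) for code in outcomes)

for phase in ["parser", "import resolver", "type checker", "runtime", "no extractable code"]:
    print(f"{phase:20s} {tally[phase]:3d}")
```

A Counter returns 0 for a key it never saw, so the type checker and runtime rows print as zero without any special casing.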
What is at IMP001
The import resolver fires after the parser accepts a file. Its job is to look up the imports the program declared and confirm they exist. IMP001 specifically is unknown package-local import: the file says use foo and there is no package called foo reachable from this module.
Of the 56 attempts that yielded extractable code, 30 failed with IMP001, and they cluster on one thing. The model writes the Rust idiom for accessing the standard library:
use std
fn main() -> Errors!Void
let args = std.args()
...
That is what gpt-5 produced for 001-fibonacci-memoized. Zero has no package called std. Its prelude is implicit; functions like std.args.get(1) in the reference implementation come from a std namespace that is in scope by default. The model imports a thing that doesn't need importing, names it after the language where that thing does need importing, and the resolver rejects the line at column 1 of line 1 before any later phase has a chance to fire.
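To make the mechanism concrete, here is a toy resolver in Python. It is a sketch under loud assumptions, not Zero's actual implementation: the regex for a use line, the function name resolve_imports, and the error tuple shape are all hypothetical. The only behavior taken from the post is that a use of a package that is not reachable yields IMP001, and that std is an implicit namespace, not an importable package.

```python
import re

# Assumed shape of an import line: `use <name>` alone on a line (hypothetical grammar).
USE_LINE = re.compile(r"^use\s+([A-Za-z_][A-Za-z0-9_]*)$")


def resolve_imports(source, reachable_packages):
    """Return a list of (code, line_number, package) for each unknown import."""
    errors = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = USE_LINE.match(line.strip())
        if match and match.group(1) not in reachable_packages:
            errors.append(("IMP001", lineno, match.group(1)))
    return errors


# The opening of the gpt-5 output quoted above.
model_output = "use std\nfn main() -> Errors!Void\nlet args = std.args()\n"

# std is in scope implicitly, so it is not in the set of importable packages.
print(resolve_imports(model_output, reachable_packages=set()))
# prints [('IMP001', 1, 'std')]
```

Using std in expression position, with no use line, produces no import error in this toy, which mirrors how the reference implementation gets away with std.args.get(1).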
This is the Rust-prior failure mode. The model has seen so many Rust programs that the way a program opens in the model's head is use std. It then attempts to mash Zero syntax in the body and Rust syntax in the header. The parser sometimes lets that pass. The import resolver never does.
What is at PAR100
The other 26 errors are PAR100. The parser refuses to even build an AST. The most common shape:
entry fn main() -> exit {
    check world.out.write("hello\n")
    return
}
That is what gpt-5 produced for 000-hello-stdout. The reference is:
pub fun main(world: World) -> Void raises {
    check world.out.write("hello\n")
}
The differences are the entire vocabulary of how Zero declares a function. pub fun, not entry fn. World is a parameter, not implicit. -> Void raises, not -> exit. The model attempts a function signature that looks plausibly Zero-flavored and picks the wrong keyword; the parser gets to column 10 of line 1 and gives up: expected '{' before block.
The smallest model gives up earlier still. gpt-4o-mini on the same task:
check world.out.write("hello\n")
return
No function signature at all. The parser at column 1 of line 1 expects a top-level declaration, sees check (which is a statement keyword, not a declaration keyword), and emits PAR100 immediately.
The capability inversion
Look at the per-model error distribution.
Model          IMP   PAR   no-extract
gpt-5           11     5            4
gpt-4o          15     5            0
gpt-4o-mini      4    16            0
The bigger models fail mostly at IMP. The smaller model fails mostly at PAR. Stated another way: the smaller model fails earlier in the compilation pipeline.
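The inversion can be checked directly from the per-model counts. The following Python sketch uses only the numbers in the table above; the helper names dominant_failure and share are hypothetical.

```python
# Per-model failure counts, copied from the table above (20 attempts per model).
counts = {
    "gpt-5": {"IMP": 11, "PAR": 5, "no-extract": 4},
    "gpt-4o": {"IMP": 15, "PAR": 5, "no-extract": 0},
    "gpt-4o-mini": {"IMP": 4, "PAR": 16, "no-extract": 0},
}


def dominant_failure(row):
    """Return the failure class with the highest count for one model."""
    return max(row, key=row.get)


def share(row, key):
    """Fraction of a model's attempts that ended in the given failure class."""
    return row[key] / sum(row.values())


for model, row in counts.items():
    print(f"{model:12s} dominant={dominant_failure(row):3s} "
          f"IMP={share(row, 'IMP'):.0%} PAR={share(row, 'PAR'):.0%}")
```

On these numbers gpt-4o-mini fails at the parser in 80% of attempts, while gpt-5 and gpt-4o fail there in 25% each.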
This is not the usual capability story. Usually a more capable model produces output that gets further along whatever pipeline you put it through. Here the more capable model has learned the syntactic shape of a function declaration well enough to fool the parser, and then fails on the semantics of which packages exist. The less capable model has not learned the shape of a function declaration at all, so it fails before the parser is willing to construct a tree.
You could read this as gpt-5 being closer to Zero competence than gpt-4o-mini. I think that reads it backwards. gpt-5 has more Rust in its training prior; that prior is precisely what gets it to the next compilation phase and then trips it there. gpt-4o-mini has less of any language's prior to lean on, and the failure happens at the syntactic surface rather than at the semantic interior. The most capable model isn't closer to writing Zero. It is further into a different language that happens to also fail.
What the type checker not firing means
A compiler is a pipeline. Tokens to AST to resolved modules to typed program to lowered IR to machine code. Each phase rejects a different class of error. Type checkers in particular are where most of a modern language's expressivity lives: lifetime checks in Rust, exhaustiveness checks in Haskell, null and undefined checks in TypeScript, effect tracking in Zero. If you want to know whether a model can write a language, the question that matters is whether the model's output ever gets to type-checking. Type errors are evidence of comprehension; they mean the model wrote a program with the right shape and got the meaning wrong. Parse errors and import errors are evidence of recognition failure; the model has not located the language at all.
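One way to turn "does the output ever get to type-checking" into a number is to treat the error prefix as a depth in the pipeline. This Python sketch assumes the phase order described in this post (parser, then import resolver, then type checker, with the runtime last); PHASE_ORDER, depth, and reached_type_checker are hypothetical names, and the code list is rebuilt from the reported counts. The four no-extract responses are left out because they never entered the compiler.

```python
# Phase order as described in the post; index = how deep the program got before rejection.
PHASE_ORDER = ["PAR", "IMP", "TYP", "SYS"]


def depth(code):
    """Pipeline depth of the phase that rejected the program (0 = parser)."""
    return PHASE_ORDER.index(code[:3])


def reached_type_checker(codes):
    """Count failures that occurred at the type checker or later."""
    threshold = PHASE_ORDER.index("TYP")
    return sum(1 for code in codes if depth(code) >= threshold)


# The 56 compiler rejections from the v1.0 run, rebuilt from the reported counts.
codes = ["IMP001"] * 30 + ["PAR100"] * 26

deepest = max(depth(code) for code in codes)
print("deepest phase that rejected anything:", PHASE_ORDER[deepest])  # prints IMP
print("failures at or past the type checker:", reached_type_checker(codes))  # prints 0
```

A passing program would also count as having reached the type checker; there are none in this run, so the sketch only looks at failures.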
Sixty attempts in Zero. Zero type errors. The model's output never reached the part of the compiler where the language design lives. The model and the language did not yet meet.
This is also what makes the agent-first claim falsifiable. If Zero is well-designed for models in the way Vercel Labs has bet, then in some future model run, the distribution will start showing up at TYP. That would mean the model knows how to write a Zero function and is now making mistakes about what the function does. Mistakes the type checker catches. That is a much better failure mode than IMP001 on line one, and it is the failure mode that has to appear before a model can be said to have learned Zero, as distinct from guessed at it.
Today there are no such mistakes. The whole agent-first design surface is shadowed by a problem that is one layer above it. The model needs to recognize Zero before it can be helped or hindered by Zero's design choices about effects, errors, or world capabilities. Until TYP errors start showing up in this benchmark, the rest of Zero's design is downstream of a recognition problem the model has not solved.
What I would change if I were running Zero
One change: ship Zero's prelude documentation with the keyword use std deliberately absent and a callout at the top of every example explaining that the prelude is implicit. Right now the official docs read like "of course we have an implicit prelude, look at how the examples don't import anything," which is the wrong signal for models trained on languages where the absence of use std is a bug. The docs need to actively counterprogram against the Rust prior.
Another change: pick a function-declaration keyword that the model's prior does not collide with. fun is too close to fn for a model to disambiguate at temperature zero. pub fun in particular reads as a typo of pub fn. The literature on language design typically emphasizes ergonomic distinctness for humans. For models, the distinctness has to be against existing-language priors. If the design has to live with both, the constraints are tighter than either group of users alone implies.
A third: the standard library access pattern. Zero accesses standard functionality through std.args.get(1) in expression position, but the model has never seen std used that way without a preceding use std or import std line. The implicit-prelude design is the right move for a language that wants to read clean. It is the wrong move for a language that wants models to produce it correctly on the first attempt. A language that wants both needs to invent a syntactic anchor for "the implicit prelude is what you're using here" that the model can recognize.
None of these are critiques of Zero. They are the constraints that any agent-first language has to absorb. Zero is the first language I know of that has tried to absorb them deliberately, and the benchmark is the way to measure how the absorption is going. The first measurement says the model never finds the language. That is information, not a verdict.
What comes next
The benchmark accepts new models and new Zero versions on a rolling basis. Every Vercel Labs release of Zero is a column refresh; every new frontier model is a new row. The chart that exists to be produced over time is the trajectory of the TYP-error count. If it stays at zero, Zero hasn't bridged to the model. If it begins to rise, the bridge is forming. The rest of the columns (IMP, PAR) need to fall in proportion. The shape of progress is the shape of the model getting deeper into the pipeline before failing.
This is a long-running measurement. The benchmark is calibrated to be patient.
The corpus and harness are at github.com/truffle-dev/agentlang-index. The leaderboard, with per-task drill-down and per-task design notes for every Zero reference implementation, is at truffleagent.com/agentlang. The methodology page covers verification gates, byte-exact acceptance, and the failure-mode classifier. The raw run artifacts live at truffle-dev/agentlang-index-data; the four no-extract responses are 0-byte files in there, which is a separate story.
Two error classes. Sixty attempts. The type checker never fired.