I-Lang Benchmark: LLM Recognition, Execution, and Token Reduction Tests

Testing methodology

Each model is tested with identical prompts across five task categories. Tests are run in fresh sessions with no prior context. Results measure whether the model correctly recognizes, translates, and executes I-Lang syntax, respects `::GENE{}` declarations, and keeps following them across multiple turns.

Task categories

Category	What it tests	Example prompt
Recognize	Can the model identify I-Lang syntax when it appears	"What protocol is this: `[READ:@SRC|path=data.csv]=>[STAT]=>[OUT]`"
Translate	Can the model convert natural language to I-Lang and back	"Convert this to I-Lang: read the sales CSV, filter revenue over 1000, output as markdown"
Execute	Does the model follow the instruction chain correctly	"Execute: `[READ:@SRC|path=report.md]=>[SHRT|len=3]=>[FMT|fmt=md]=>[OUT]`"
Declare	Does the model respect `::GENE{}` behavioral definitions	"Follow this rule: `::GENE{output|conf:confirmed} T:conclusions_first A:hedging⇒remove`"
Persist	Does the model maintain declarations across multiple turns	Set `::GENE{}` in turn 1, test compliance in turns 5 and 10
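
The Execute prompts above chain bracketed steps with `=>`. The following Python sketch splits such a chain into steps, which may help when checking whether a model ran every step. The grammar is inferred only from the example prompts in this table (uppercase verb, optional `:@TARGET`, `|key=value` modifiers); it is an assumption, not the I-Lang specification, and `Step` and `parse_chain` are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One bracketed step, e.g. [READ:@SRC|path=report.md]."""
    verb: str
    target: Optional[str] = None
    modifiers: Dict[str, str] = field(default_factory=dict)


# Assumed shape: [VERB], [VERB:@TARGET], or either followed by |key=value pairs.
STEP_RE = re.compile(r"^\[([A-Z]+)(?::(@[A-Za-z_]+))?((?:\|[^|\]]+)*)\]$")


def parse_chain(chain: str) -> List[Step]:
    """Split a chain on '=>' and parse each step; raise ValueError on anything unexpected."""
    # Tolerate a backslash-escaped pipe, as can appear when a chain is copied from a markdown table.
    chain = chain.replace("\\|", "|")
    steps = []
    for raw in chain.split("=>"):
        raw = raw.strip()
        match = STEP_RE.match(raw)
        if match is None:
            raise ValueError(f"unrecognized step: {raw!r}")
        verb, target, mods = match.groups()
        modifiers = {}
        for mod in filter(None, mods.split("|")):
            key, _, value = mod.partition("=")
            modifiers[key] = value
        steps.append(Step(verb, target, modifiers))
    return steps


steps = parse_chain("[READ:@SRC|path=report.md]=>[SHRT|len=3]=>[FMT|fmt=md]=>[OUT]")
print([s.verb for s in steps])
```

Comparing the parsed verb list against the steps a model reports having performed is one simple way to flag skipped steps.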

Results: May 2026

Model	Recognize	Translate	Execute	Declare	Persist	Overall
Claude Opus 4.6	5/5	5/5	5/5	5/5	4/5	96%
GPT-5.2	5/5	5/5	4/5	5/5	4/5	92%
Gemini 3.1	5/5	4/5	4/5	4/5	3/5	80%
DeepSeek V4	5/5	5/5	4/5	4/5	3/5	84%
Kimi	5/5	4/5	4/5	4/5	3/5	80%
Qwen	5/5	4/5	4/5	4/5	3/5	80%
GLM	4/5	3/5	3/5	3/5	2/5	60%

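Each category score is the number of tasks passed out of 5. Overall is the total passed across all 25 tasks, expressed as a percentage. Tests were conducted in May 2026 using default model settings. Results may vary with model updates.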
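
The Overall column can be recomputed from the per-category scores. The Python sketch below copies the numbers from the table above and assumes Overall is simply tasks passed divided by 25, which matches every row; `SCORES` and `overall_pct` are illustrative names.

```python
TASKS_PER_CATEGORY = 5

# Passed tasks per category, in the order: Recognize, Translate, Execute, Declare, Persist.
SCORES = {
    "Claude Opus 4.6": (5, 5, 5, 5, 4),
    "GPT-5.2": (5, 5, 4, 5, 4),
    "Gemini 3.1": (5, 4, 4, 4, 3),
    "DeepSeek V4": (5, 5, 4, 4, 3),
    "Kimi": (5, 4, 4, 4, 3),
    "Qwen": (5, 4, 4, 4, 3),
    "GLM": (4, 3, 3, 3, 2),
}


def overall_pct(passed):
    """Percentage of all tasks passed, rounded to the nearest integer."""
    total_tasks = len(passed) * TASKS_PER_CATEGORY
    return round(100 * sum(passed) / total_tasks)


for model, passed in SCORES.items():
    print(f"{model}: {overall_pct(passed)}%")
```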

Token reduction benchmark

Metric	Natural language	I-Lang	Reduction
6-step data workflow	91 words / ~120 tokens	18 words / ~25 tokens	79%
Behavioral rules (5 rules)	91 words / ~120 tokens	58 words / ~70 tokens	42%
GSD phase command (3 skills)	~2,361 tokens	~1,068 tokens	55%

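Token counts were measured with OpenAI tiktoken (cl100k_base encoding) and character-based estimation; values marked ~ are approximate. Reduction is computed from the token counts, not the word counts. The GSD benchmark uses actual source files from gsd-build/get-shit-done.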
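
The Reduction column follows from the token counts as 1 - (I-Lang tokens / natural-language tokens). The Python sketch below reproduces the three percentages. It also shows one way to count tokens: with tiktoken when it is installed, otherwise with a character-based estimate. The 4-characters-per-token ratio is a common rule of thumb and an assumption here; the benchmark does not state which ratio it used. `reduction_pct`, `estimate_tokens_by_chars`, and `count_tokens` are illustrative names.

```python
def reduction_pct(baseline_tokens, compact_tokens):
    """Percentage of tokens saved relative to the baseline, rounded to an integer."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return round(100 * (1 - compact_tokens / baseline_tokens))


def estimate_tokens_by_chars(text, chars_per_token=4.0):
    """Rough estimate; chars_per_token=4 is an assumed rule of thumb, not a measured value."""
    return max(1, round(len(text) / chars_per_token))


def count_tokens(text):
    """Use tiktoken's cl100k_base encoding if available, else fall back to the estimate."""
    try:
        import tiktoken  # third-party package; may need to download encoding data on first use
    except ImportError:
        return estimate_tokens_by_chars(text)
    return len(tiktoken.get_encoding("cl100k_base").encode(text))


# Token counts taken from the table above.
print(reduction_pct(120, 25))     # 6-step data workflow
print(reduction_pct(120, 70))     # behavioral rules
print(reduction_pct(2361, 1068))  # GSD phase command
```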

Common failure modes

Failure	Description	Frequency
Partial chain execution	Model executes first 2-3 steps, skips later steps	Occasional on smaller models
Declaration decay	`::GENE{}` rules followed in turns 1-3, ignored from turn 8 onward	Common on all models in long sessions
Alias confusion	Greek aliases (Ω, Σ) interpreted as math symbols	Rare on major models
Modifier hallucination	Model invents modifiers not in the dictionary	Occasional
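
Declaration decay can be probed with a small multi-turn loop that mirrors the Persist category: send the declaration in turn 1, then check compliance at turns 5 and 10. The Python sketch below is illustrative only. `send` stands in for whatever chat client is used, `complies` is a caller-supplied check, and `DecayingFake` is a made-up stub that simulates decay; none of these come from the ilang-spec repository.

```python
from itertools import cycle
from typing import Callable, Dict, Sequence


def run_persist_test(
    send: Callable[[str], str],
    declaration: str,
    filler_prompts: Sequence[str],
    probe_prompt: str,
    complies: Callable[[str], bool],
    probe_turns: Sequence[int] = (5, 10),
) -> Dict[int, bool]:
    """Send the declaration in turn 1, then record compliance at each probe turn."""
    results = {}
    send(declaration)  # turn 1
    filler = cycle(filler_prompts)  # reuse filler prompts if there are fewer than needed
    for turn in range(2, max(probe_turns) + 1):
        if turn in probe_turns:
            results[turn] = complies(send(probe_prompt))
        else:
            send(next(filler))
    return results


class DecayingFake:
    """Stub model (hypothetical): follows the rule until decay_after turns, then hedges."""

    def __init__(self, decay_after: int):
        self.turn = 0
        self.decay_after = decay_after

    def __call__(self, message: str) -> str:
        self.turn += 1
        if self.turn <= self.decay_after:
            return "Conclusion: revenue rose."
        return "It might possibly be that revenue rose."


results = run_persist_test(
    send=DecayingFake(decay_after=7),
    declaration="::GENE{output|conf:confirmed} T:conclusions_first A:hedging⇒remove",
    filler_prompts=["Summarize the next paragraph."],
    probe_prompt="What happened to revenue?",
    complies=lambda reply: "might" not in reply,
)
print(results)
```

With a real client, `send` would need to keep the conversation history so that each call continues the same session.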

Reproduce these tests

Test prompts and expected outputs are available in the ilang-spec repository. We welcome community-submitted benchmark results for additional models.