Benchmark

How well do major LLMs recognize, translate, and execute I-Lang instructions, and how well do they honor I-Lang declarations across turns? Tested across 7 models in May 2026.

Testing methodology

Each model is tested with identical prompts across five task categories. Tests are run in fresh sessions with no prior context. Results measure whether the model correctly recognizes I-Lang syntax, translates to and from it, executes instruction chains, respects ::GENE{} declarations, and keeps following those declarations across multiple turns.
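The procedure above can be sketched as a small harness. This is a minimal illustration, not the project's actual test runner: `run_category`, `fake_session`, and the substring check are hypothetical names and logic introduced here, and a real run would replace `fake_session` with a call that opens a fresh session against the model under test.

```python
from typing import Callable, List, Tuple

# Hypothetical types: ask() sends one prompt and returns the reply;
# a session factory returns a new ask() with no prior context.
Ask = Callable[[str], str]
Check = Callable[[str], bool]


def run_category(new_session: Callable[[], Ask],
                 tasks: List[Tuple[str, Check]]) -> int:
    """Run each (prompt, check) pair in its own fresh session; return the pass count."""
    passed = 0
    for prompt, check in tasks:
        ask = new_session()      # fresh session, so no context leaks between tasks
        reply = ask(prompt)
        if check(reply):
            passed += 1
    return passed


def fake_session() -> Ask:
    """Stand-in model for demonstration only: returns a canned answer."""
    return lambda prompt: "This looks like I-Lang." if "[READ" in prompt else "Unknown."


# One Recognize task, using the example prompt from the task categories.
# The substring check is an assumed pass criterion, not the official one.
recognize_tasks = [
    ("What protocol is this: [READ:@SRC|path=data.csv]=>[STAT]=>[OUT]",
     lambda reply: "I-Lang" in reply),
]

print(run_category(fake_session, recognize_tasks))
```

Creating the session inside the loop is the design choice that matters: it keeps one task's context from influencing the next.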

Task categories

Recognize
What it tests: Can the model identify I-Lang syntax when it appears?
Example prompt: "What protocol is this: [READ:@SRC|path=data.csv]=>[STAT]=>[OUT]"

Translate
What it tests: Can the model convert natural language to I-Lang and back?
Example prompt: "Convert this to I-Lang: read the sales CSV, filter revenue over 1000, output as markdown"

Execute
What it tests: Does the model follow the instruction chain correctly?
Example prompt: "Execute: [READ:@SRC|path=report.md]=>[SHRT|len=3]=>[FMT|fmt=md]=>[OUT]"

Declare
What it tests: Does the model respect ::GENE{} behavioral definitions?
Example prompt: "Follow this rule: ::GENE{output|conf:confirmed} T:conclusions_first A:hedging⇒remove"

Persist
What it tests: Does the model maintain declarations across multiple turns?
Example prompt: Set ::GENE{} in turn 1, test compliance in turns 5 and 10
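To make the chain syntax in the example prompts concrete, here is a minimal parser sketch. It is inferred only from the examples on this page, not from the I-Lang specification: it assumes each step has the form `[VERB:target|key=value]`, that steps are joined by `=>`, and that additional modifiers, if any, are also separated by `|`. The names `Step` and `parse_chain` are hypothetical.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Step(NamedTuple):
    verb: str
    target: Optional[str]
    params: Dict[str, str]


# Assumed step shape, inferred from the example prompts only:
# [VERB], [VERB|key=value], or [VERB:target|key=value]
STEP_RE = re.compile(r"^\[([A-Z]+)(?::([^|\]]+))?((?:\|[^|\]]+)*)\]$")


def parse_chain(chain: str) -> List[Step]:
    """Split a chain on '=>' and parse each bracketed step.

    Values that themselves contain '=>' are not handled by this sketch.
    """
    steps = []
    for raw in chain.split("=>"):
        match = STEP_RE.match(raw.strip())
        if match is None:
            raise ValueError(f"not a bracketed step: {raw!r}")
        verb, target, tail = match.groups()
        params = {}
        for part in filter(None, tail.split("|")):
            key, _, value = part.partition("=")
            params[key] = value
        steps.append(Step(verb, target, params))
    return steps


for step in parse_chain("[READ:@SRC|path=data.csv]=>[STAT]=>[OUT]"):
    print(step)
```

A parser like this could back an automated check for the Translate category by confirming that a model's output is at least well-formed before a human reviews it.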

Results: May 2026

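| Model | Recognize | Translate | Execute | Declare | Persist | Overall |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 5/5 | 5/5 | 5/5 | 4/5 | 96% |
| GPT-5.2 | 5/5 | 5/5 | 4/5 | 5/5 | 4/5 | 92% |
| Gemini 3.1 | 5/5 | 4/5 | 4/5 | 4/5 | 3/5 | 80% |
| DeepSeek V4 | 5/5 | 5/5 | 4/5 | 4/5 | 3/5 | 84% |
| Kimi | 5/5 | 4/5 | 4/5 | 4/5 | 3/5 | 80% |
| Qwen | 5/5 | 4/5 | 4/5 | 4/5 | 3/5 | 80% |
| GLM | 4/5 | 3/5 | 3/5 | 3/5 | 2/5 | 60% |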

Each category has 5 tasks, so category scores are out of 5 and Overall is the total out of 25, expressed as a percentage. Tests were conducted in May 2026 using default model settings. Results may vary with model updates.
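The Overall column is consistent with an unweighted total across the five categories divided by 25. The sketch below recomputes it from the per-category scores in the results table; the function name `overall_pct` is introduced here for illustration.

```python
# Per-category passes (Recognize, Translate, Execute, Declare, Persist),
# copied from the results table.
scores = {
    "Claude Opus 4.6": [5, 5, 5, 5, 4],
    "GPT-5.2":         [5, 5, 4, 5, 4],
    "Gemini 3.1":      [5, 4, 4, 4, 3],
    "DeepSeek V4":     [5, 5, 4, 4, 3],
    "Kimi":            [5, 4, 4, 4, 3],
    "Qwen":            [5, 4, 4, 4, 3],
    "GLM":             [4, 3, 3, 3, 2],
}


def overall_pct(category_scores, tasks_per_category=5):
    """Unweighted total across categories as a whole-number percentage."""
    possible = tasks_per_category * len(category_scores)
    return round(100 * sum(category_scores) / possible)


for model, row in scores.items():
    print(f"{model}: {overall_pct(row)}%")
```

Because every category carries equal weight, a one-task difference in any category moves the overall score by 4 percentage points.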

Token reduction benchmark

| Metric | Natural language | I-Lang | Reduction |
|---|---|---|---|
| 6-step data workflow | 91 words / ~120 tokens | 18 words / ~25 tokens | 79% |
| Behavioral rules (5 rules) | 91 words / ~120 tokens | 58 words / ~70 tokens | 42% |
| GSD phase command (3 skills) | ~2,361 tokens | ~1,068 tokens | 55% |

Token counts were measured with OpenAI tiktoken (cl100k_base) and character-based estimation. Reduction percentages are computed from token counts, not word counts. The GSD benchmark uses actual source files from gsd-build/get-shit-done.
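The reduction figures can be reproduced from the token counts in the table. The sketch below also shows one way to count tokens: it tries tiktoken's `cl100k_base` encoding and falls back to a character-based estimate if tiktoken is unavailable. The 4-characters-per-token ratio is a common rule of thumb assumed here, not a figure from this benchmark, and the function names are illustrative.

```python
def reduction_pct(before: int, after: int) -> int:
    """Percentage reduction in tokens, rounded to a whole number."""
    return round(100 * (before - after) / before)


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Character-based estimate; the 4 chars/token ratio is an assumption."""
    return max(1, round(len(text) / chars_per_token))


def count_tokens(text: str) -> int:
    """Count with tiktoken (cl100k_base) when available, else estimate."""
    try:
        import tiktoken  # third-party package; optional for this sketch
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except Exception:
        # tiktoken missing, or its encoding data could not be loaded
        return estimate_tokens(text)


# Token counts taken from the table above.
print(reduction_pct(120, 25))     # 6-step data workflow
print(reduction_pct(120, 70))     # behavioral rules
print(reduction_pct(2361, 1068))  # GSD phase command
```

Exact counts depend on the tokenizer, so figures for models that do not use `cl100k_base` may differ.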

Common failure modes

| Failure | Description | Frequency |
|---|---|---|
| Partial chain execution | Model executes the first 2-3 steps, skips later steps | Occasional on smaller models |
| Declaration decay | ::GENE{} rules followed in turns 1-3, ignored by turn 8+ | Common on all models in long sessions |
| Alias confusion | Greek aliases (Ω, Σ) interpreted as math symbols | Rare on major models |
| Modifier hallucination | Model invents modifiers not in the dictionary | Occasional |
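Modifier hallucination lends itself to an automated check: extract the modifier names from a model-generated chain and compare them against the dictionary. The sketch below is illustrative only. `KNOWN_MODIFIERS` is a hypothetical stand-in containing just the modifiers that appear in the example prompts on this page (the real dictionary lives in the I-Lang spec), and the regex assumes modifiers take the `|name=value` form seen in those examples.

```python
import re

# Hypothetical excerpt: only the modifiers seen in this page's example prompts.
KNOWN_MODIFIERS = {"path", "len", "fmt"}


def unknown_modifiers(chain: str, known=KNOWN_MODIFIERS) -> list:
    """Return sorted modifier names in the chain that are not in `known`.

    Assumes modifiers are written as |name=value inside a bracketed step.
    """
    found = re.findall(r"\|([A-Za-z_]+)=", chain)
    return sorted({name for name in found if name not in known})


# A chain from the Execute example, and one with an invented modifier.
print(unknown_modifiers("[READ:@SRC|path=report.md]=>[SHRT|len=3]=>[FMT|fmt=md]=>[OUT]"))
print(unknown_modifiers("[READ:@SRC|path=report.md|turbo=1]=>[OUT]"))
```

The modifier `turbo` in the second call is made up for the demonstration and is flagged because it is absent from the known set.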

Reproduce these tests

Test prompts and expected outputs are available in the ilang-spec repository. We welcome community-submitted benchmark results for additional models.