Methods before claims.

Token optimization is only worth doing if output quality holds. We are preparing the benchmarks, methods, and failure cases so teams can inspect how each technique behaves before adopting it.

Benchmark · In progress

Quality parity under aggressive context trimming

How trimmed prompts hold up against full-context baselines across real coding tasks.

Method · In progress

When routing to a lighter model is safe

A classifier that separates the cases where a smaller model matches the frontier one from the cases where it must not be used.

Report · In progress

Where the tokens actually go in a coding agent

A breakdown of token spend across a long bug hunt: re-reads, repeated history, and heavy-model overuse.

Methods and datasets will ship alongside each post.