ColdVault Benchmark

Stage: Live

Multi-model benchmark on real production tasks — live data from what models actually do, not what they score on multiple choice.

What it is

Not synthetic questions, not multiple-choice exams, not the same dataset everyone trains on. The tasks are pulled directly from production work at 9robots (code reviews, architectural brainstorms, regulated-domain analysis) and are updated regularly as new use cases emerge. Full results and methodology live at benchmark.coldvault.ai.

Real tasks, not synthetic

Most LLM benchmarks measure things models can study for. We measure what models actually do in production — does the code review catch the real defect, does the architecture proposal anticipate the edge case the team would have hit in week three. The headline numbers on academic benchmarks regularly disagree with this view; ours is the view that matters once a model is in your pipeline.

How it works

Multiple AI models evaluate each task independently (standalone), then with access to other models' responses (aggregation), then through structured adversarial argumentation (debate). Findings are deduped, scored against a consensus baseline using leave-one-out evaluation, and tracked over time so model regressions show up the same week they ship. The same dual-pipeline methodology is what powers the multi-model review pass inside ColdVault Code.
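
To make the scoring step concrete, here is a minimal Python sketch of leave-one-out scoring against a consensus baseline. It is an illustration under stated assumptions, not the ColdVault implementation: the function names (dedupe, leave_one_out_scores), the min_votes threshold, the string-normalization dedup, and the precision/recall metrics are all hypothetical choices made for this example. The full methodology is documented at benchmark.coldvault.ai.

```python
from collections import Counter


def dedupe(findings):
    """Collapse findings that differ only in case or whitespace.

    Assumption: real deduplication is likely semantic; this simple
    string normalization stands in for it.
    """
    return {" ".join(f.lower().split()) for f in findings}


def leave_one_out_scores(findings_by_model, min_votes=2):
    """Score each model against a baseline built from the OTHER models.

    Assumption: a finding joins the baseline when at least `min_votes`
    of the other models report it. The model being scored never votes
    on its own baseline, which is the leave-one-out part.
    """
    cleaned = {m: dedupe(f) for m, f in findings_by_model.items()}
    scores = {}
    for model, own in cleaned.items():
        votes = Counter()
        for other, theirs in cleaned.items():
            if other != model:
                votes.update(theirs)
        baseline = {f for f, n in votes.items() if n >= min_votes}
        hits = own & baseline
        scores[model] = {
            # Share of this model's findings that the others confirm.
            "precision": len(hits) / len(own) if own else 0.0,
            # Share of the others' consensus that this model caught.
            "recall": len(hits) / len(baseline) if baseline else 0.0,
        }
    return scores


# Hypothetical findings from three models on one code-review task.
example = {
    "model_a": ["Null check missing", "null check  missing", "Race condition in cache"],
    "model_b": ["null check missing", "race condition in cache"],
    "model_c": ["null check missing", "unused import"],
}
scores = leave_one_out_scores(example)
```

In this toy run, model_c is scored against a baseline that contains both the null check and the race condition (model_a and model_b agree on each), so it catches half of the consensus. Excluding a model from its own baseline keeps it from inflating its score by agreeing with itself.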
