Continuous eval harness
Evaluations
Per-task golden-set evaluation across model variants, balancing accuracy, latency and cost. The current default for each task is marked in the Default column.
Accuracy by task
GPT-5.1 vs GPT-4.1 Mini
Threshold callout
Self-hosted LLM economics
On current pricing through Core42 Compass, a self-hosted Llama-class model becomes economic above ~12M tokens/day sustained, including GPU reservation and operations. Today's fleet runs at ~3.1M tokens/day, roughly a quarter of that threshold.
Self-hosting is not proposed yet. Re-evaluate at the end of Q4, once Bureau Score v3 stabilises in PROD.
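As a rough check on the headroom, the two volumes quoted above can be compared directly. This is an illustrative sketch only: the constant and function names are hypothetical, and both inputs are the approximate figures from the callout, not measured telemetry.

```python
# Approximate figures quoted in the callout (both are "~" values in the text).
BREAKEVEN_TOKENS_PER_DAY = 12_000_000
CURRENT_TOKENS_PER_DAY = 3_100_000


def share_of_breakeven(current: int, breakeven: int) -> float:
    """Fraction of the break-even volume the fleet currently sustains."""
    return current / breakeven


def growth_multiple_needed(current: int, breakeven: int) -> float:
    """Factor by which sustained volume must grow to reach break-even."""
    return breakeven / current


share = share_of_breakeven(CURRENT_TOKENS_PER_DAY, BREAKEVEN_TOKENS_PER_DAY)
multiple = growth_multiple_needed(CURRENT_TOKENS_PER_DAY, BREAKEVEN_TOKENS_PER_DAY)

print(f"{share:.0%} of break-even; volume must grow ~{multiple:.1f}x")
```

On these inputs the fleet sits at about 26% of the threshold, so sustained volume would need to grow nearly fourfold before self-hosting reaches break-even.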
| Task | Model | Accuracy | p95 latency | Cost / 1k tok | Guardrail pass | Default |
|---|---|---|---|---|---|---|
| Test-case generation | GPT-5.1 | 92% | 4.1s | AED 0.038 | 99% | current default |
| Test-case generation | GPT-4.1 Mini | 84% | 1.6s | AED 0.008 | 96% | |
| Log clustering / triage | GPT-5.1 | 88% | 3.2s | AED 0.038 | 98% | |
| Log clustering / triage | GPT-4.1 Mini | 86% | 1.2s | AED 0.008 | 97% | current default |
| Requirement decomposition | GPT-5.1 | 84% | 6.4s | AED 0.038 | 95% | current default |
| Requirement decomposition | GPT-4.1 Mini | 71% | 2.1s | AED 0.008 | 91% | |
| Code generation | GPT-5.1 | 81% | 7.8s | AED 0.038 | 96% | current default |
| Code generation | GPT-4.1 Mini | 64% | 2.6s | AED 0.008 | 89% | |
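To make the accuracy, latency and cost trade-off in the table explicit, the per-task gap between the two models can be computed from the figures above. This is a minimal sketch, not the harness's actual selection logic, which this page does not describe. The `Result` class and `compare` function are hypothetical names, and the data is copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    task: str
    model: str
    accuracy_pct: int
    p95_latency_s: float
    cost_aed_per_1k: float


# Values copied from the evaluation table above.
RESULTS = [
    Result("Test-case generation", "GPT-5.1", 92, 4.1, 0.038),
    Result("Test-case generation", "GPT-4.1 Mini", 84, 1.6, 0.008),
    Result("Log clustering / triage", "GPT-5.1", 88, 3.2, 0.038),
    Result("Log clustering / triage", "GPT-4.1 Mini", 86, 1.2, 0.008),
    Result("Requirement decomposition", "GPT-5.1", 84, 6.4, 0.038),
    Result("Requirement decomposition", "GPT-4.1 Mini", 71, 2.1, 0.008),
    Result("Code generation", "GPT-5.1", 81, 7.8, 0.038),
    Result("Code generation", "GPT-4.1 Mini", 64, 2.6, 0.008),
]


def compare(task: str, big: str = "GPT-5.1", small: str = "GPT-4.1 Mini") -> dict:
    """Return what the larger model gains in accuracy and costs in latency and price."""
    by_model = {r.model: r for r in RESULTS if r.task == task}
    b, s = by_model[big], by_model[small]
    return {
        "accuracy_gain_pts": b.accuracy_pct - s.accuracy_pct,
        "latency_multiple": round(b.p95_latency_s / s.p95_latency_s, 2),
        "cost_multiple": round(b.cost_aed_per_1k / s.cost_aed_per_1k, 2),
    }


for task in dict.fromkeys(r.task for r in RESULTS):  # unique tasks, table order
    print(task, compare(task))
```

On these figures GPT-5.1 costs 4.75 times as much per 1k tokens for every task, while its accuracy gain ranges from 2 points (log clustering / triage) to 17 points (code generation). The smallest gain falls on the one task where GPT-4.1 Mini is the current default.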