🇦🇪 UAE-resident
Continuous eval harness

Evaluations

Per-task golden-set evaluation across model variants, balancing accuracy, latency and cost. The current default for each task is marked in the Default column.
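As a rough illustration of what a golden-set run measures, the sketch below scores one model variant against a small golden set and reports accuracy and p95 latency. It is not the harness's actual implementation: the `evaluate` and `nearest_rank_percentile` names, the exact-match scoring rule and the nearest-rank percentile method are all assumptions made for the example.

```python
import math
import time
from typing import Callable, Sequence


def nearest_rank_percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(model: Callable[[str], str], golden: Sequence[tuple[str, str]]) -> dict:
    """Run `model` over (prompt, expected) pairs; exact-match scoring is an assumption."""
    latencies = []
    correct = 0
    for prompt, expected in golden:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)
        if answer.strip() == expected.strip():
            correct += 1
    return {
        "accuracy": correct / len(golden),
        "p95_latency_s": nearest_rank_percentile(latencies, 95),
    }


if __name__ == "__main__":
    # Stand-in "model" for demonstration only; a real run would call the deployed variant.
    def stub(prompt: str) -> str:
        return prompt.upper()

    golden_set = [("a", "A"), ("b", "B"), ("c", "x"), ("d", "D")]
    print(evaluate(stub, golden_set))
```

Running the same golden set through each model variant yields the per-task accuracy and latency figures that the comparison relies on; cost and guardrail pass rate would need their own collectors.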

[Chart: Accuracy by task, GPT-5.1 vs GPT-4.1 Mini]

Threshold callout

Self-hosted LLM economics

On current pricing through Core42 Compass, a self-hosted Llama-class model becomes economical above a sustained ~12M tokens/day, once GPU reservation and operations costs are included. Today's fleet runs at ~3.1M tokens/day, roughly a quarter of that threshold.

Self-hosting is not proposed yet. Re-evaluate at the end of Q4, once Bureau Score v3 stabilises in PROD.
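The gap between current volume and the self-hosting threshold can be made concrete with a little arithmetic. The sketch below uses only the two figures quoted in the callout; the function names and the 10% monthly growth rate are hypothetical and serve only to show how a re-evaluation date might be estimated.

```python
import math

# Figures quoted in the callout (both approximate).
BREAK_EVEN_TOKENS_PER_DAY = 12_000_000  # ~12M tokens/day, sustained
CURRENT_TOKENS_PER_DAY = 3_100_000      # ~3.1M tokens/day


def utilisation(current: float, threshold: float) -> float:
    """Fraction of the break-even volume the fleet already consumes."""
    return current / threshold


def growth_multiple(current: float, threshold: float) -> float:
    """How many times current volume must grow to reach the threshold."""
    return threshold / current


def months_to_threshold(current: float, threshold: float, monthly_growth: float) -> float:
    """Months of compound growth needed to reach the threshold.

    `monthly_growth` is a hypothetical input (0.10 means 10% per month).
    """
    return math.log(threshold / current) / math.log(1 + monthly_growth)


if __name__ == "__main__":
    share = utilisation(CURRENT_TOKENS_PER_DAY, BREAK_EVEN_TOKENS_PER_DAY)
    multiple = growth_multiple(CURRENT_TOKENS_PER_DAY, BREAK_EVEN_TOKENS_PER_DAY)
    print(f"Share of break-even volume: {share:.0%}")   # about 26%
    print(f"Growth needed: {multiple:.1f}x")            # about 3.9x
    # Assumed growth rate, for illustration only.
    print(f"Months at 10% monthly growth: {months_to_threshold(CURRENT_TOKENS_PER_DAY, BREAK_EVEN_TOKENS_PER_DAY, 0.10):.1f}")
```

Because the threshold itself depends on current Core42 Compass pricing, any such projection should be recomputed whenever pricing or GPU reservation costs change.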
Task                       Model         Accuracy  p95 latency  Cost / 1k tok  Guardrail pass  Default
Test-case generation       GPT-5.1       92%       4.1s         AED 0.038      99%             current default
Test-case generation       GPT-4.1 Mini  84%       1.6s         AED 0.008      96%
Log clustering / triage    GPT-5.1       88%       3.2s         AED 0.038      98%
Log clustering / triage    GPT-4.1 Mini  86%       1.2s         AED 0.008      97%             current default
Requirement decomposition  GPT-5.1       84%       6.4s         AED 0.038      95%             current default
Requirement decomposition  GPT-4.1 Mini  71%       2.1s         AED 0.008      91%
Code generation            GPT-5.1       81%       7.8s         AED 0.038      96%             current default
Code generation            GPT-4.1 Mini  64%       2.6s         AED 0.008      89%
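One way to read the table is as a rule of the form "pick the cheapest model whose accuracy is within a few points of the best". The sketch below encodes the table rows and applies that rule. The rule, the `pick_default` name and the 3-point tolerance are assumptions for illustration, not the harness's documented selection logic, although with these figures the rule happens to reproduce the defaults shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    task: str
    model: str
    accuracy: float          # percent
    p95_latency_s: float     # seconds
    cost_aed_per_1k: float   # AED per 1k tokens
    guardrail_pass: float    # percent


# Rows transcribed from the table above.
RESULTS = [
    Result("Test-case generation", "GPT-5.1", 92, 4.1, 0.038, 99),
    Result("Test-case generation", "GPT-4.1 Mini", 84, 1.6, 0.008, 96),
    Result("Log clustering / triage", "GPT-5.1", 88, 3.2, 0.038, 98),
    Result("Log clustering / triage", "GPT-4.1 Mini", 86, 1.2, 0.008, 97),
    Result("Requirement decomposition", "GPT-5.1", 84, 6.4, 0.038, 95),
    Result("Requirement decomposition", "GPT-4.1 Mini", 71, 2.1, 0.008, 91),
    Result("Code generation", "GPT-5.1", 81, 7.8, 0.038, 96),
    Result("Code generation", "GPT-4.1 Mini", 64, 2.6, 0.008, 89),
]


def pick_default(results: list, task: str, tolerance_pts: float = 3.0) -> str:
    """Cheapest model within `tolerance_pts` of the best accuracy for `task`.

    Hypothetical rule: ties on cost are broken by lower p95 latency.
    """
    rows = [r for r in results if r.task == task]
    best = max(r.accuracy for r in rows)
    eligible = [r for r in rows if best - r.accuracy <= tolerance_pts]
    return min(eligible, key=lambda r: (r.cost_aed_per_1k, r.p95_latency_s)).model


if __name__ == "__main__":
    for task in dict.fromkeys(r.task for r in RESULTS):
        print(f"{task}: {pick_default(RESULTS, task)}")
```

Under this reading, GPT-4.1 Mini wins only log clustering / triage, where it trails GPT-5.1 by 2 accuracy points at roughly a fifth of the cost; on the other three tasks the accuracy gap (8 to 17 points) exceeds the tolerance. A stricter rule could also require a minimum guardrail pass rate before a model is eligible.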