INFRASTRUCTURE·May 5, 2025·7 MIN READ
2025 Open-Weight Benchmarks: When to Choose Small, Local, or Frontier
By Rowan Li
The numbers
- Llama 3 70B Instruct (quantized) hit a P95 latency of 140 ms on-device for summarization with receipts.
- DeepSeek Coder V2 matched GPT-4 Turbo on repo-understanding tasks while costing ~38% less.
- Frontier models still win on speculative planning and multi-hop reasoning, but only when paired with eval gates.
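For readers who want to reproduce a P95 figure like the one above on their own traffic, here is a minimal sketch of a nearest-rank percentile. The function name and the sample values are illustrative assumptions, not the harness behind these benchmarks.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # 1-based nearest rank: ceil(0.95 * n), done in integer math
    # to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical samples: 19 fast calls and one slow outlier.
samples = [120] * 19 + [400]
print(p95(samples))
```

With only 20 samples the single outlier sits above the 95th-percentile rank, so it does not move the result; a P99 or max would expose it. Which percentile to report is a judgment call about how much tail latency your flow can tolerate.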
Decision guide
- Choose small/local for privacy-first flows and deterministic latency.
- Choose open-weights in the cloud for cost-sensitive CRUD and summarization.
- Choose frontier for greenfield features where correctness matters more than cost and you can afford eval overhead.
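The guide above can be sketched as a small routing function. Everything here is a hypothetical illustration: the `Request` fields, the tier labels, and the order in which the rules are checked are assumptions, not part of the starter configs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # All fields are hypothetical flags a caller would set per request.
    privacy_sensitive: bool = False
    needs_deterministic_latency: bool = False
    correctness_over_cost: bool = False
    has_eval_budget: bool = False


def choose_tier(req: Request) -> str:
    """Map a request to a model tier, following the decision guide."""
    # Privacy-first flows and deterministic latency: keep it local.
    if req.privacy_sensitive or req.needs_deterministic_latency:
        return "small-local"
    # Frontier only when correctness outweighs cost AND evals are affordable.
    if req.correctness_over_cost and req.has_eval_budget:
        return "frontier"
    # Default: cost-sensitive work such as CRUD and summarization.
    return "open-weights-cloud"


print(choose_tier(Request(privacy_sensitive=True)))
print(choose_tier(Request(correctness_over_cost=True, has_eval_budget=True)))
print(choose_tier(Request()))
```

Checking the privacy rule first is a deliberate choice in this sketch: a privacy-sensitive request stays local even if it would otherwise qualify for a frontier model. Teams that weigh those constraints differently would reorder the checks.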
Starter configs
Included: Terraform modules for split inference (edge + cloud), prompt lint rules for open weights, and receipts dashboards that show which model handled each call.
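To make the receipts idea concrete, here is a rough sketch of what a per-call receipt and a dashboard-style tally could look like. The `Receipt` fields and the model labels are assumptions for illustration; the schema in the included dashboards may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    # Hypothetical schema: one record per inference call.
    call_id: str
    model: str
    latency_ms: float


def calls_per_model(receipts):
    """Count how many calls each model handled."""
    return Counter(r.model for r in receipts)


# Illustrative data only; not measured results.
receipts = [
    Receipt("c1", "small-local", 110.0),
    Receipt("c2", "open-weights-cloud", 420.0),
    Receipt("c3", "small-local", 135.0),
]
for model, count in calls_per_model(receipts).most_common():
    print(model, count)
```

Keeping the receipt immutable (`frozen=True`) is a small design choice that fits an audit-style log: once a call is recorded, nothing downstream should rewrite which model handled it.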