2025 Open-Weight Benchmarks: When to Choose Small, Local, or Frontier

The numbers

Llama 3 70B Instruct (quantized) hit a 140 ms P95 latency on-device for summarization, with receipts (per-call records of which model handled the request).
DeepSeek Coder V2 matched GPT-4 Turbo on repo-understanding tasks while costing ~38% less.
Frontier models still win on speculative planning and multi-hop reasoning, but only when paired with eval gates.
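
For readers who want to reproduce a P95 figure like the one above on their own traffic, here is a minimal sketch using the nearest-rank method. The function name and the sample values are illustrative assumptions, not the harness behind the numbers in this post, and other percentile definitions (for example, interpolated ones) will give slightly different results on small samples.

```python
def p95_nearest_rank(latencies_ms):
    """Return the 95th-percentile latency using the nearest-rank method.

    Assumption: latencies_ms is a non-empty list of per-call latencies
    in milliseconds. This is one common percentile definition, not
    necessarily the one used for the benchmark figures in this post.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical samples: 1 ms through 100 ms, one call each.
    samples = list(range(1, 101))
    print(p95_nearest_rank(samples))
```

P95 is the useful number here because it describes the slow tail that users actually notice, which an average would hide.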

Decision guide

Choose small/local for privacy-first flows and deterministic latency.
Choose open-weights in the cloud for cost-sensitive CRUD and summarization.
Choose frontier for greenfield features where correctness matters more than cost and you can afford eval overhead.
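
The three rules above can be sketched as a small routing function. Everything named here (Tier, Task, choose_tier, and the boolean fields) is hypothetical, and the precedence order (privacy and latency first, then frontier, then open weights in the cloud as the default) is an assumption; the guide itself does not rank the rules.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "small/local"
    OPEN_CLOUD = "open-weights in the cloud"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Task:
    # Hypothetical flags describing a workload; not part of any real API.
    privacy_first: bool
    needs_deterministic_latency: bool
    correctness_over_cost: bool
    can_afford_evals: bool


def choose_tier(task: Task) -> Tier:
    """Map a task to a model tier, following the decision guide.

    Assumed precedence: privacy/latency constraints win, then the
    frontier rule, and cost-sensitive work falls through to open
    weights in the cloud.
    """
    if task.privacy_first or task.needs_deterministic_latency:
        return Tier.LOCAL
    if task.correctness_over_cost and task.can_afford_evals:
        return Tier.FRONTIER
    return Tier.OPEN_CLOUD


if __name__ == "__main__":
    summarizer = Task(
        privacy_first=False,
        needs_deterministic_latency=False,
        correctness_over_cost=False,
        can_afford_evals=False,
    )
    print(choose_tier(summarizer).value)
```

Note that a task that values correctness but has no eval budget falls back to open weights in the cloud in this sketch, mirroring the point that frontier models only pay off when paired with eval gates.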

Starter configs

Included: Terraform modules for split inference (edge + cloud), prompt lint rules for open weights, and receipts dashboards that show which model handled each call.
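
To make "receipts" concrete, here is a rough sketch of what a per-call record and a dashboard-style rollup might look like. The Receipt fields, the calls_per_model helper, and the model names are all illustrative assumptions; the actual schema in the starter configs may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    # Hypothetical schema: one record per inference call.
    call_id: str
    model: str          # which model handled the call
    latency_ms: float   # end-to-end latency for the call


def calls_per_model(receipts):
    """Count how many calls each model handled (a basic dashboard rollup)."""
    return Counter(r.model for r in receipts)


if __name__ == "__main__":
    # Placeholder model names, not real deployment identifiers.
    log = [
        Receipt("c1", "edge-model", 120.0),
        Receipt("c2", "cloud-model", 310.0),
        Receipt("c3", "edge-model", 135.0),
    ]
    for model, count in calls_per_model(log).items():
        print(model, count)
```

Keeping the record immutable and flat makes it easy to ship to whatever log store or dashboard backs the split edge and cloud setup.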
