LLM Benchmark Comparison
Compare performance of frontier AI models across industry-standard benchmarks. Our benchmark data includes verified scores from official model releases and research papers, covering capabilities like general knowledge (MMLU), code generation (HumanEval), mathematical reasoning (GSM8K), and more.
Last updated: October 24, 2025
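For readers who want to work with these numbers programmatically, the sketch below shows one way to rank models on a single benchmark column. The scores are copied from the MMLU column of the table that follows; the `MMLU_SCORES` dictionary and the `rank_models` helper are illustrative names only, not part of this site or any library.

```python
from typing import Optional

# MMLU scores (%) for a few of the models listed below; None = not reported.
MMLU_SCORES: dict[str, Optional[float]] = {
    "Claude 3 Opus": 86.8,
    "GPT-4 Turbo": 86.5,
    "GPT-4": 86.4,
    "Gemini 1.5 Pro": 85.9,
    "Llama 3 70B": 82.0,
    "GPT-3.5 Turbo": 70.0,
}

def rank_models(scores: dict[str, Optional[float]]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted best-first, skipping missing scores."""
    reported = [(model, score) for model, score in scores.items() if score is not None]
    return sorted(reported, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for model, score in rank_models(MMLU_SCORES):
        print(f"{model}: {score:.1f}%")
```

The same pattern extends to any other column; scores missing from the table (shown as —) are represented as None and skipped when ranking.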
| Model | Provider | ARC | GSM8K | HellaSwag | HumanEval (pass@1) | MMLU |
|---|---|---|---|---|---|---|
| Claude 3 Haiku | Anthropic | — | 88.9% ✓ | — | 75.9% ✓ | 75.2% ✓ |
| Claude 3 Opus | Anthropic | — | 95% ✓ | — | 84.9% ✓ | 86.8% ✓ |
| Claude 3 Sonnet | Anthropic | — | 92.3% ✓ | — | 73% ✓ | 79% ✓ |
| Command R+ | Cohere | — | 66.9% ✓ | — | 70.1% ✓ | 75% ✓ |
| Gemini 1.0 Pro | Google | — | 86.5% ✓ | — | 67.7% ✓ | 79.13% ✓ |
| Gemini 1.5 Pro | Google | — | 91.7% ✓ | — | 71.9% ✓ | 85.9% ✓ |
| GPT-3.5 Turbo | OpenAI | 85.2% ✓ | 57.1% ✓ | 85.5% ✓ | — | 70% ✓ |
| GPT-4 | OpenAI | — | 92% ✓ | 95.3% ✓ | 67% ✓ | 86.4% ✓ |
| GPT-4 Turbo | OpenAI | — | 92% ✓ | — | 87.1% ✓ | 86.5% ✓ |
| Llama 3 70B | Meta | — | 93% ✓ | — | 81.7% ✓ | 82% ✓ |
| Llama 3 8B | Meta | — | 79.6% ✓ | — | 62.2% ✓ | 67.4% ✓ |
| Mistral Large | Mistral | — | 81% ✓ | — | 45.1% ✓ | 81.2% ✓ |
| Mixtral 8x7B | Mistral | — | 74.4% ✓ | 86.7% ✓ | 40.2% ✓ | 70.6% ✓ |

A ✓ marks a score verified against the official model release or research paper.
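The HumanEval column above reports pass@1: the share of problems for which a sampled solution passes all of the benchmark's unit tests. When a model is sampled more than once per problem, pass@k is typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). The sketch below is a minimal rendering of that formula; the function and argument names are chosen here for illustration.

```python
from math import comb

def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    num_samples: completions generated for the problem (n)
    num_correct: completions that pass all unit tests (c)
    k:           sampling budget the metric is evaluated at
    """
    # comb(n - c, k) is 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)

# Example: 3 of 10 samples for a problem pass the tests.
print(pass_at_k(10, 3, 1))  # ≈ 0.3 -- at k = 1 this reduces to c / n
# A benchmark-level pass@1 score averages this quantity over all problems.
```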
Methodology & Glossary
Benchmark methodology and a detailed glossary are coming soon.