LLM Benchmark Comparison
Compare performance of frontier AI models across industry-standard benchmarks. Our benchmark data includes verified scores from official model releases and research papers, covering capabilities like general knowledge (MMLU), code generation (HumanEval), mathematical reasoning (GSM8K), and more.
Last updated: October 24, 2025
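For readers who want to work with these numbers programmatically, the sketch below shows one way to rank models on a single benchmark column. The scores are copied from the MMLU column of the table that follows; the `MMLU_SCORES` dictionary and the `rank_models` helper are illustrative names only, not part of this site or any library.

```python
from typing import Optional

# MMLU scores (%) for a few of the models listed below; None = not reported.
MMLU_SCORES: dict[str, Optional[float]] = {
    "Claude 3 Opus": 86.8,
    "GPT-4 Turbo": 86.5,
    "GPT-4": 86.4,
    "Gemini 1.5 Pro": 85.9,
    "Llama 3 70B": 82.0,
    "GPT-3.5 Turbo": 70.0,
}

def rank_models(scores: dict[str, Optional[float]]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted best-first, skipping missing scores."""
    reported = [(model, score) for model, score in scores.items() if score is not None]
    return sorted(reported, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for model, score in rank_models(MMLU_SCORES):
        print(f"{model}: {score:.1f}%")
```

The same pattern extends to any other column; scores missing from the table (shown as —) are represented as None and skipped when ranking.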
| Model | Provider | ARC | GSM8K | HellaSwag | HumanEval (pass@1) | MMLU |
|---|---|---|---|---|---|---|
| Claude 3 Haiku | Anthropic | — | 88.9% ✓ | — | 75.9% ✓ | 75.2% ✓ |
| Claude 3 Opus | Anthropic | — | 95% ✓ | — | 84.9% ✓ | 86.8% ✓ |
| Claude 3 Sonnet | Anthropic | — | 92.3% ✓ | — | 73% ✓ | 79% ✓ |
| Command R+ | Cohere | — | 66.9% ✓ | — | 70.1% ✓ | 75% ✓ |
| Gemini 1.0 Pro | Google | — | 86.5% ✓ | — | 67.7% ✓ | 79.13% ✓ |
| Gemini 1.5 Pro | Google | — | 91.7% ✓ | — | 71.9% ✓ | 85.9% ✓ |
| GPT-3.5 Turbo | OpenAI | 85.2% ✓ | 57.1% ✓ | 85.5% ✓ | — | 70% ✓ |
| GPT-4 | OpenAI | — | 92% ✓ | 95.3% ✓ | 67% ✓ | 86.4% ✓ |
| GPT-4 Turbo | OpenAI | — | 92% ✓ | — | 87.1% ✓ | 86.5% ✓ |
| Llama 3 70B | Meta | — | 93% ✓ | — | 81.7% ✓ | 82% ✓ |
| Llama 3 8B | Meta | — | 79.6% ✓ | — | 62.2% ✓ | 67.4% ✓ |
| Mistral Large | Mistral | — | 81% ✓ | — | 45.1% ✓ | 81.2% ✓ |
| Mixtral 8x7B | Mistral | — | 74.4% ✓ | 86.7% ✓ | 40.2% ✓ | 70.6% ✓ |

A ✓ marks a score verified against the official model release or research paper.
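The HumanEval column above reports pass@1: the share of problems for which a sampled solution passes all of the benchmark's unit tests. When a model is sampled more than once per problem, pass@k is typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). The sketch below is a minimal rendering of that formula; the function and argument names are chosen here for illustration.

```python
from math import comb

def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    num_samples: completions generated for the problem (n)
    num_correct: completions that pass all unit tests (c)
    k:           sampling budget the metric is evaluated at
    """
    # comb(n - c, k) is 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)

# Example: 3 of 10 samples for a problem pass the tests.
print(pass_at_k(10, 3, 1))  # ≈ 0.3 -- at k = 1 this reduces to c / n
# A benchmark-level pass@1 score averages this quantity over all problems.
```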
Methodology & Glossary
Benchmark methodology and a detailed glossary are coming soon.