LLM Benchmark Comparison

Compare performance of frontier AI models across industry-standard benchmarks. Our benchmark data includes verified scores from official model releases and research papers, covering capabilities like general knowledge (MMLU), code generation (HumanEval), mathematical reasoning (GSM8K), and more.

Last updated: October 24, 2025

View benchmark methodology and glossary
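
A note on units: the code-generation scores in the table below are HumanEval pass@1 figures, i.e. the estimated probability that a single generated sample solves a problem and passes all of its unit tests. For reference only (this is not part of the site's data), here is a minimal Python sketch of the standard unbiased pass@k estimator from the HumanEval paper; the sample counts in the example are illustrative, not taken from the table.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, HumanEval).

    n: total samples generated for a problem
    c: samples that passed all unit tests
    k: number of samples the metric assumes you get to submit
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers only: 200 samples per problem, 150 passing.
print(pass_at_k(200, 150, 1))    # 0.75 -- pass@1 reduces to c/n
print(pass_at_k(200, 150, 10))   # close to 1.0
```

For k = 1 the estimator reduces to the fraction of passing samples, which is what the pass@1 column reports.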

All 13 models (scores are percentages; – = no reported score; HumanEval figures are pass@1):

Model              Provider    ARC      GSM8K    HellaSwag   HumanEval (pass@1)   MMLU
Claude 3 Haiku     Anthropic   –        88.9%    –           75.9%                75.2%
Claude 3 Opus      Anthropic   –        95%      –           84.9%                86.8%
Claude 3 Sonnet    Anthropic   –        92.3%    –           73%                  79%
Command R+         Cohere      –        66.9%    –           70.1%                75%
Gemini 1.0 Pro     Google      –        86.5%    –           67.7%                79.13%
Gemini 1.5 Pro     Google      –        91.7%    –           71.9%                85.9%
GPT-3.5 Turbo      OpenAI      85.2%    57.1%    85.5%       –                    70%
GPT-4              OpenAI      –        92%      95.3%       67%                  86.4%
GPT-4 Turbo        OpenAI      –        92%      –           87.1%                86.5%
Llama 3 70B        Meta        –        93%      –           81.7%                82%
Llama 3 8B         Meta        –        79.6%    –           62.2%                67.4%
Mistral Large      Mistral     –        81%      –           45.1%                81.2%
Mixtral 8x7B       Mistral     –        74.4%    86.7%       40.2%                70.6%
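
To compare models on a single benchmark, the scores above can be dropped into any scripting language. The sketch below (plain Python; the dictionary is hand-copied from a few rows of the table and is not an official dataset) ranks a handful of models by their HumanEval pass@1 score:

```python
# A few rows hand-copied from the table above (all values are percentages).
scores = {
    "Claude 3 Opus":  {"GSM8K": 95.0, "HumanEval": 84.9, "MMLU": 86.8},
    "GPT-4 Turbo":    {"GSM8K": 92.0, "HumanEval": 87.1, "MMLU": 86.5},
    "GPT-4":          {"GSM8K": 92.0, "HumanEval": 67.0, "MMLU": 86.4},
    "Gemini 1.5 Pro": {"GSM8K": 91.7, "HumanEval": 71.9, "MMLU": 85.9},
    "Llama 3 70B":    {"GSM8K": 93.0, "HumanEval": 81.7, "MMLU": 82.0},
}

def rank(benchmark: str):
    """Return (model, score) pairs sorted best-first on one benchmark."""
    return sorted(((m, s[benchmark]) for m, s in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

for model, score in rank("HumanEval"):
    print(f"{model:<15} {score:5.1f}")
```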

Methodology & Glossary

Benchmark methodology and detailed glossary coming soon.