Benchmarks
Track benchmark coverage, category leaders, data freshness, and source evidence.
Benchmark families: Reasoning, Coding, Math, Vision, Knowledge, Agentic
| Benchmark | Description | Category | Leading model | Best score | Updated |
|---|---|---|---|---|---|
| GPQA | Graduate-level science reasoning benchmark. | Reasoning | Claude 3.5 Sonnet | 67.2% | 2026-05-01 |
| MMLU | Massive multitask language understanding. | Knowledge | GPT-4o | 88.7% | 2026-05-01 |
| MMLU-Pro | Harder MMLU variant with more options. | Knowledge | Claude 3.5 Sonnet | 78.1% | 2026-05-01 |
| AIME | Competition math reasoning. | Math | GPT-4o | 76.4% | 2026-05-01 |
| MATH | Multi-level mathematical problem solving. | Math | Llama 3.1 405B | 73.8% | 2026-05-01 |
| HumanEval | Python function synthesis tasks. | Coding | DeepSeek-Coder V2 | 90.2% | 2026-05-01 |
| MMMU | Multidisciplinary multimodal understanding. | Vision | GPT-4o | 69.1% | 2026-05-01 |
| LiveCodeBench | Fresh coding challenge evaluation. | Coding | Claude 3.5 Sonnet | 55.8% | 2026-05-01 |
| SWE-Bench Verified | Real GitHub issue resolution benchmark. | Coding | Claude 3.5 Sonnet | 49% | 2026-05-01 |