YesAllInAI
Benchmarks

Track benchmark coverage, category leaders, data freshness, and source evidence.

Benchmark families: Reasoning, Coding, Math, Vision, Knowledge, Agentic
| Benchmark | Description | Category | Leading model | Best score | Updated |
| --- | --- | --- | --- | --- | --- |
| GPQA | Graduate-level science reasoning benchmark. | Reasoning | Claude 3.5 Sonnet | 67.2% | 2026-05-01 |
| MMLU | Massive multitask language understanding. | Knowledge | GPT-4o | 88.7% | 2026-05-01 |
| MMLU-Pro | Harder MMLU variant with more options. | Knowledge | Claude 3.5 Sonnet | 78.1% | 2026-05-01 |
| AIME | Competition math reasoning. | Math | GPT-4o | 76.4% | 2026-05-01 |
| MATH | Multi-level mathematical problem solving. | Math | Llama 3.1 405B | 73.8% | 2026-05-01 |
| HumanEval | Python function synthesis tasks. | Coding | DeepSeek-Coder V2 | 90.2% | 2026-05-01 |
| MMMU | Multidisciplinary multimodal understanding. | Vision | GPT-4o | 69.1% | 2026-05-01 |
| LiveCodeBench | Fresh coding challenge evaluation. | Coding | Claude 3.5 Sonnet | 55.8% | 2026-05-01 |
| SWE-Bench Verified | Real GitHub issue resolution benchmark. | Coding | Claude 3.5 Sonnet | 49% | 2026-05-01 |