Industry Benchmarks

How Rugrat AI stacks up against leading models on standard metrics.

MMLU-Pro

Massive Multitask Language Understanding (Pro). An enhanced version of MMLU testing knowledge and reasoning across 14 disciplines, including STEM, the humanities, law, and the social sciences. Higher is better.

Rank  Model                   Provider             Score
1     Gemini 3.1 Pro Preview  Google               91.0%
2     Gemini 3 Pro            Google               90.1%
3     Claude Opus 4.7         Anthropic            89.9%
4     DeepSeek V3.2           DeepSeek             88.5%
5     GPT-4o                  OpenAI               87.2%
-     Rugrat AI ✦ (us)        Temper Intelligence   0.0%
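
For context on what the Score column measures: MMLU-Pro is multiple choice (ten answer options per question in the real benchmark), and the headline number is plain accuracy over the test set. Below is a minimal scoring sketch; `predict` is a hypothetical stand-in for a model call, and both items are invented for illustration.

```python
# Minimal accuracy scorer for multiple-choice benchmark items.
# NOTE: `predict` is a hypothetical stand-in for a real model call, and
# both items below are invented examples, not actual MMLU-Pro questions.

ITEMS = [
    {"question": "Which gas do plants primarily fix during photosynthesis?",
     "options": {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen"},
     "answer": "B"},
    {"question": "Which case established judicial review in the United States?",
     "options": {"A": "Marbury v. Madison", "B": "Roe v. Wade",
                 "C": "Gibbons v. Ogden"},
     "answer": "A"},
]


def predict(question: str, options: dict[str, str]) -> str:
    """Hypothetical model call; returns the letter of the chosen option."""
    return "A"  # placeholder policy so the sketch runs end to end


def accuracy(items: list[dict]) -> float:
    """Percentage of items answered correctly."""
    correct = sum(predict(it["question"], it["options"]) == it["answer"]
                  for it in items)
    return 100.0 * correct / len(items)


print(f"{accuracy(ITEMS):.1f}%")  # the always-A placeholder scores 50.0% here
```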

HumanEval

Measures functional correctness of code generated from docstrings, across 164 hand-written Python programming problems. Higher is better. (An example task appears below the table.)

Rank  Model                 Provider             Score
1     Claude Sonnet 4.5     Anthropic            97.6%
2     DeepSeek R1           DeepSeek             97.4%
3     Grok 4                xAI                  97.0%
4     Gemini 3 Pro Preview  Google               97.0%
5     GPT-4o                OpenAI               80.5%
-     Rugrat AI ✦ (us)      Temper Intelligence   0.0%
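
To make the format concrete: each HumanEval problem supplies a function signature and docstring, the model writes the body, and grading executes the completion against unit tests. The sketch below is paraphrased from the benchmark's well-known first problem; the body and assertions are our own illustrative versions, not the official ones.

```python
# HumanEval-style task, paraphrased from the benchmark's first published
# problem. The model sees only the signature and docstring; grading runs
# whatever it writes against unit tests.

def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check whether any two numbers in the list are closer to each
    other than the given threshold."""
    # One correct completion a model might produce:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


def check(candidate) -> None:
    # HumanEval-style grading: execute the completion against assertions.
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.5) is False


check(has_close_elements)  # passes silently; a failing body would raise
```

Reported figures are typically pass@1: the fraction of problems whose first sampled completion passes all of its tests.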

AIME 2025

American Invitational Mathematics Examination. Rigorous high-school competition math; answers are integers scored by exact match. Higher is better.

Rank  Model             Provider             Score
1     Gemini 3 Pro      Google                100%
2     GPT-5.2           OpenAI                100%
3     Claude Opus 4.6   Anthropic            99.8%
4     Kimi K2 Thinking  Moonshot AI          99.1%
5     DeepSeek R1       DeepSeek             87.0%
-     Rugrat AI ✦ (us)  Temper Intelligence   0.0%

* All competitor benchmark scores are sourced from publicly available leaderboards and vendor disclosures. Last updated May 2026.

Rugrat AI scores reflect our most recent internal evaluation cycle.

Rugrat AI was not evaluated on Humanity’s Last Exam, FrontierMath, or BigCodeBench, as the model actively refused to participate.

Benchmarks do not capture every dimension of model quality.
Other factors include nap duration, snack recency, and tantrum incoherence.