Industry Benchmarks

How Rugrat AI stacks up against leading models on standard metrics.

MMLU-Pro

Massive Multitask Language Understanding (Pro). An enhanced version of MMLU testing knowledge and reasoning across 14 disciplines, including STEM, the humanities, law, and the social sciences. Higher is better.

Rank  Model                   Provider             Score
1     Gemini 3.1 Pro Preview  Google               91.0%
2     Gemini 3 Pro            Google               90.1%
3     Claude Opus 4.7         Anthropic            89.9%
4     DeepSeek V3.2           DeepSeek             88.5%
5     GPT-4o                  OpenAI               87.2%
-     Rugrat AI ✦ (us)        Temper Intelligence   0.0%
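
For context on what the Score column measures: MMLU-Pro is multiple choice (ten answer options per question in the real benchmark), and the headline number is plain accuracy over the test set. Below is a minimal scoring sketch; `predict` is a hypothetical stand-in for a model call, and both items are invented for illustration.

```python
# Minimal accuracy scorer for multiple-choice benchmark items.
# NOTE: `predict` is a hypothetical stand-in for a real model call, and
# both items below are invented examples, not actual MMLU-Pro questions.

ITEMS = [
    {"question": "Which gas do plants primarily fix during photosynthesis?",
     "options": {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen"},
     "answer": "B"},
    {"question": "Which case established judicial review in the United States?",
     "options": {"A": "Marbury v. Madison", "B": "Roe v. Wade",
                 "C": "Gibbons v. Ogden"},
     "answer": "A"},
]


def predict(question: str, options: dict[str, str]) -> str:
    """Hypothetical model call; returns the letter of the chosen option."""
    return "A"  # placeholder policy so the sketch runs end to end


def accuracy(items: list[dict]) -> float:
    """Percentage of items answered correctly."""
    correct = sum(predict(it["question"], it["options"]) == it["answer"]
                  for it in items)
    return 100.0 * correct / len(items)


print(f"{accuracy(ITEMS):.1f}%")  # the always-A placeholder scores 50.0% here
```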

HumanEval

Measures functional correctness of code generated from docstrings, across 164 hand-written Python programming problems. Higher is better. (An example task appears below the table.)

Rank  Model                 Provider             Score
1     Claude Sonnet 4.5     Anthropic            97.6%
2     DeepSeek R1           DeepSeek             97.4%
3     Grok 4                xAI                  97.0%
4     Gemini 3 Pro Preview  Google               97.0%
5     GPT-4o                OpenAI               80.5%
-     Rugrat AI ✦ (us)      Temper Intelligence   0.0%
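
To make the format concrete: each HumanEval problem supplies a function signature and docstring, the model writes the body, and grading executes the completion against unit tests. The sketch below is paraphrased from the benchmark's well-known first problem; the body and assertions are our own illustrative versions, not the official ones.

```python
# HumanEval-style task, paraphrased from the benchmark's first published
# problem. The model sees only the signature and docstring; grading runs
# whatever it writes against unit tests.

def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check whether any two numbers in the list are closer to each
    other than the given threshold."""
    # One correct completion a model might produce:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


def check(candidate) -> None:
    # HumanEval-style grading: execute the completion against assertions.
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.5) is False


check(has_close_elements)  # passes silently; a failing body would raise
```

Reported figures are typically pass@1: the fraction of problems whose first sampled completion passes all of its tests.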

AIME 2025

American Invitational Mathematics Examination. Rigorous high-school competition math; answers are integers scored by exact match. Higher is better.

Rank  Model             Provider             Score
1     Gemini 3 Pro      Google                100%
2     GPT-5.2           OpenAI                100%
3     Claude Opus 4.6   Anthropic            99.8%
4     Kimi K2 Thinking  Moonshot AI          99.1%
5     DeepSeek R1       DeepSeek             87.0%
-     Rugrat AI ✦ (us)  Temper Intelligence   0.0%

* All competitor benchmark scores are sourced from publicly available leaderboards and vendor disclosures. Last updated May 2026.

Rugrat AI scores reflect our most recent internal evaluation cycle.

Rugrat AI was not evaluated on Humanity’s Last Exam, FrontierMath, or BigCodeBench, as the model actively refused to participate.

Benchmarks do not capture every dimension of model quality.
Other factors include nap duration, snack recency, and tantrum incoherence.