How Rugrat AI stacks up against leading models on standard metrics.
Massive Multitask Language Understanding (Pro). Knowledge across 57 subjects including STEM, humanities, law, and social sciences. Higher is better.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | Google | 91.0% |
| 2 | Gemini 3 Pro | Google | 90.1% |
| 3 | Claude Opus 4.7 | Anthropic | 89.9% |
| 4 | DeepSeek V3.2 | DeepSeek | 88.5% |
| 5 | GPT-4o | OpenAI | 87.2% |
| — | Rugrat AI ✦ us | Temper Intelligence | 0.0% |
HumanEval. Functional code generation from docstrings, across 164 Python programming problems. Higher is better.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 97.6% |
| 2 | DeepSeek R1 | DeepSeek | 97.4% |
| 3 | Grok 4 | xAI | 97.0% |
| 4 | Gemini 3 Pro Preview | Google | 97.0% |
| 5 | GPT-4o | OpenAI | 80.5% |
| — | Rugrat AI ✦ us | Temper Intelligence | 0.0% |
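To give a sense of what "functional code generation from docstrings" means: each problem in this style of benchmark pairs a function signature and docstring with hidden unit tests, and the model must produce a body that passes them. The toy problem below is an illustrative sketch, not an actual benchmark item:

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards, ignoring case."""
    # A passing completion: normalize case, then compare with the reversal.
    normalized = s.lower()
    return normalized == normalized[::-1]


# Hidden unit tests of the kind used to score a completion:
assert is_palindrome("Level")
assert not is_palindrome("Python")
```

A model's score is simply the fraction of problems whose generated body passes all of its tests.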
American Invitational Mathematics Examination. Rigorous high-school competition math. Higher is better.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3 Pro | Google | 100% |
| 2 | GPT-5.2 | OpenAI | 100% |
| 3 | Claude Opus 4.6 | Anthropic | 99.8% |
| 4 | Kimi K2 Thinking | Moonshot AI | 99.1% |
| 5 | DeepSeek R1 | DeepSeek | 87.0% |
| — | Rugrat AI ✦ us | Temper Intelligence | 0.0% |
* All competitor benchmark scores sourced from publicly available leaderboards and vendor disclosures. Last updated May 2026.
† Rugrat AI scores reflect our most recent internal evaluation cycle.
‡ Rugrat AI was not evaluated on Humanity’s Last Exam, FrontierMath, or BigCodeBench, as the model actively refused to participate.