AI Model Performance Evaluations
Results for AI models and agents on Next.js code generation and migration tasks, measured by success rate, execution time, token usage, and quality improvements.
Model Performance Results
Model | Total Evals | Success Rate | Avg Duration | Total Tokens |
---|---|---|---|---|
gpt-5-codex | 50 | 42% | 42.80s | 186,082 |
claude-opus-4.1 | 50 | 40% | 29.47s | 165,810 |
glm-4.6 | 50 | 40% | 20.36s | 106,177 |
grok-4-fast-reasoning | 50 | 38% | 6.02s | 137,439 |
grok-4 | 50 | 38% | 53.10s | 207,672 |
kimi-k2-turbo | 50 | 38% | 4.13s | 82,567 |
gemini-2.5-pro | 50 | 36% | 50.98s | 322,147 |
kimi-k2-0905 | 50 | 36% | 1.82s | 85,713 |
gpt-5 | 50 | 34% | 25.62s | 149,904 |
grok-4-fast-non-reasoning | 50 | 34% | 3.81s | 131,962 |
claude-sonnet-4.5 | 50 | 32% | 11.14s | 139,310 |
claude-sonnet-4 | 50 | 32% | 10.27s | 134,302 |
claude-haiku-4.5 | 50 | 32% | 6.10s | 132,122 |
gemini-2.5-flash | 50 | 32% | 7.52s | 159,274 |
qwen3-coder | 50 | 32% | 0.78s | 89,090 |
qwen3-coder-plus | 50 | 32% | 5.08s | 88,820 |
claude-3.7-sonnet | 50 | 30% | 11.17s | 166,654 |
gpt-5-mini | 50 | 30% | 17.15s | 132,010 |
qwen3-max | 50 | 30% | 11.57s | 87,364 |
deepseek-v3.2-exp | 50 | 30% | 26.77s | 109,837 |
gpt-oss-120b | 50 | 28% | 1.39s | 109,730 |
gemini-2.0-flash | 50 | 26% | 2.82s | 99,913 |
gpt-4o | 50 | 26% | 4.77s | 81,569 |
gpt-4.1-mini | 50 | 24% | 6.15s | 88,294 |
gemini-2.5-flash-lite | 50 | 24% | 1.35s | 102,762 |
gemini-2.0-flash-lite | 50 | 22% | 2.46s | 98,950 |
gpt-5-nano | 50 | 14% | 21.29s | 194,587 |
gpt-4o-mini | 50 | 12% | 6.85s | 85,563 |
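
The percentages above can be converted back into pass counts and per-eval token figures. The following is a minimal sketch, not part of the evaluation harness: the `ModelResult` class and `parse_row` helper are hypothetical names, and it assumes that "Total Tokens" is summed across all evals for a model and that "Success Rate" is passed evals divided by total evals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """One row of the model table (hypothetical container for illustration)."""
    model: str
    total_evals: int
    success_rate_pct: int
    avg_duration_s: float
    total_tokens: int

    @property
    def passed(self) -> int:
        # Assumes success rate = passed / total evals.
        return round(self.total_evals * self.success_rate_pct / 100)

    @property
    def tokens_per_eval(self) -> float:
        # Assumes "Total Tokens" is the sum over all evals for this model.
        return self.total_tokens / self.total_evals


def parse_row(line: str) -> ModelResult:
    """Parse a pipe-delimited row such as 'gpt-4o | 50 | 26% | 4.77s | 81,569 |'."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    model, evals, rate, duration, tokens = cells
    return ModelResult(
        model=model,
        total_evals=int(evals),
        success_rate_pct=int(rate.rstrip("%")),
        avg_duration_s=float(duration.rstrip("s")),
        total_tokens=int(tokens.replace(",", "")),
    )


# Three rows copied verbatim from the table above.
ROWS = [
    "gpt-5-codex | 50 | 42% | 42.80s | 186,082 |",
    "kimi-k2-turbo | 50 | 38% | 4.13s | 82,567 |",
    "gpt-4o-mini | 50 | 12% | 6.85s | 85,563 |",
]

results = [parse_row(row) for row in ROWS]

for r in results:
    print(f"{r.model}: {r.passed}/{r.total_evals} passed, "
          f"{r.tokens_per_eval:.2f} tokens per eval")
```

With 50 evals per model, each 2-point step in success rate corresponds to a single eval, so adjacent rows in the table often differ by only one or two passed tasks.
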
Agent Performance Results
Agent | Total Evals | Success Rate |
---|---|---|
claude | 50 | 42% |
cursor | 50 | 30% |
codex | 50 | 30% |
gemini | 50 | 28% |