AI Model Performance Evaluations

Performance results of AI models and agents on Next.js code generation and migration, measuring success rate, execution time, token usage, and quality improvements.

Last run date: 21 October 2025
Next.js version: 15.5.6

Model Performance Results

| Model | Total Evals | Success Rate | Avg Duration | Total Tokens |
|---|---|---|---|---|
| gpt-5-codex | 50 | 42% | 42.80s | 186,082 |
| claude-opus-4.1 | 50 | 40% | 29.47s | 165,810 |
| glm-4.6 | 50 | 40% | 20.36s | 106,177 |
| grok-4-fast-reasoning | 50 | 38% | 6.02s | 137,439 |
| grok-4 | 50 | 38% | 53.10s | 207,672 |
| kimi-k2-turbo | 50 | 38% | 4.13s | 82,567 |
| gemini-2.5-pro | 50 | 36% | 50.98s | 322,147 |
| kimi-k2-0905 | 50 | 36% | 1.82s | 85,713 |
| gpt-5 | 50 | 34% | 25.62s | 149,904 |
| grok-4-fast-non-reasoning | 50 | 34% | 3.81s | 131,962 |
| claude-sonnet-4.5 | 50 | 32% | 11.14s | 139,310 |
| claude-sonnet-4 | 50 | 32% | 10.27s | 134,302 |
| claude-haiku-4.5 | 50 | 32% | 6.10s | 132,122 |
| gemini-2.5-flash | 50 | 32% | 7.52s | 159,274 |
| qwen3-coder | 50 | 32% | 0.78s | 89,090 |
| qwen3-coder-plus | 50 | 32% | 5.08s | 88,820 |
| claude-3.7-sonnet | 50 | 30% | 11.17s | 166,654 |
| gpt-5-mini | 50 | 30% | 17.15s | 132,010 |
| qwen3-max | 50 | 30% | 11.57s | 87,364 |
| deepseek-v3.2-exp | 50 | 30% | 26.77s | 109,837 |
| gpt-oss-120b | 50 | 28% | 1.39s | 109,730 |
| gemini-2.0-flash | 50 | 26% | 2.82s | 99,913 |
| gpt-4o | 50 | 26% | 4.77s | 81,569 |
| gpt-4.1-mini | 50 | 24% | 6.15s | 88,294 |
| gemini-2.5-flash-lite | 50 | 24% | 1.35s | 102,762 |
| gemini-2.0-flash-lite | 50 | 22% | 2.46s | 98,950 |
| gpt-5-nano | 50 | 14% | 21.29s | 194,587 |
| gpt-4o-mini | 50 | 12% | 6.85s | 85,563 |
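
The summary columns above (success rate, average duration, total tokens) can be derived from per-run eval records. Here is a minimal sketch in TypeScript, assuming a hypothetical `EvalRun` record shape; the actual eval harness's schema and field names may differ:

```typescript
// Hypothetical per-run record; field names are assumptions, not the harness's real schema.
interface EvalRun {
  model: string;
  passed: boolean;      // did the generated/migrated code pass the eval's checks
  durationMs: number;   // wall-clock time for this run
  tokens: number;       // tokens consumed by this run
}

// Aggregate one model's runs into the columns shown in the table above.
function summarize(runs: EvalRun[]) {
  const total = runs.length;
  const passes = runs.filter((r) => r.passed).length;
  return {
    totalEvals: total,
    successRate: `${Math.round((passes / total) * 100)}%`,
    avgDuration: `${(runs.reduce((s, r) => s + r.durationMs, 0) / total / 1000).toFixed(2)}s`,
    totalTokens: runs.reduce((s, r) => s + r.tokens, 0),
  };
}
```

For example, four runs with two passes, durations of 1–4 seconds, and 100 tokens each would summarize to a 50% success rate, a 2.50s average duration, and 400 total tokens.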

Agent Performance Results

| Agent | Total Evals | Success Rate |
|---|---|---|
| claude | 50 | 42% |
| cursor | 50 | 30% |
| codex | 50 | 30% |
| gemini | 50 | 28% |