Frontier Comparison

Open Source Frontier LLMs

Performance, cost, and real-world capability benchmarks across the latest frontier open-weight models. Each model is tested against the same prompts so you can compare apples to apples, then click through to see the raw output for yourself.

Models Tested

$0.34

Total Test Cost

Test Suites

Model Outputs

Reference Baseline

Baseline reference output

View Output →

Z·AI GLM 5.1

Z-AI's frontier powerhouse

🎨 Marketing Design

95/100

586 Score / $ · $0.162 / run

Pricing /1M tokens $1.395 in · $4.40 out

View Output →

MiniMax MiniMax M2.7

Ultra-efficient large model

🎨 Marketing Design

95/100

3,571 Score / $ · $0.0266 / run

Pricing /1M tokens $0.133 in · $1.20 out

View Output →

Front End Design

Moonshot AI Kimi K2.5

Best for frontend design

🎨 Marketing Design

98/100

824 Score / $ · $0.119 / run

Pricing /1M tokens $0.6 in · $3.00 out

View Output →

Alibaba Qwen3.6 Plus

Strong reasoning & efficiency

🎨 Marketing Design

92/100

3,321 Score / $ · $0.0277 / run

Pricing /1M tokens $0.325 in · $1.95 out

View Output →

AI Assistant

StepFun Step 3.5 Flash

Best value & tool calling

🎨 Marketing Design

90/100

18,443 Score / $ · $0.00488 / run

📅 Calendar Updates

PASS

$0.0026 / run

Pricing /1M tokens $0.032 in · $0.300 out

View Output →

Deepseek 3.2

Pricing /1M tokens $0.26 in · $0.38 out

Full Comparison

Model	Input /1M	Output /1M	🎨 Marketing Design Score	🎨 Marketing Design Cost / run	🎨 Marketing Design Score / $	📅 Calendar Updates Cost / run	📅 Calendar Updates Pass/Fail
anthropic/opus-4.6	$20/month	$20/month	100	~8% of 5 hour window	—	—	—
z-ai/glm-5.1	$1.395	$4.40	95	$0.162	586.42	—	—
minimax/minimax-m2.7	$0.133	$1.20	95	$0.0266	3571.43	—	—
moonshotai/kimi-k2.5	$0.6	$3.00	98	$0.119	823.53	—	—
qwen/qwen3.6-plus	$0.325	$1.95	92	$0.0277	3321.30	—	—
stepfun/step-3.5-flash	$0.032	$0.300	90	$0.00488	18442.62	$0.0026	✅
Deepseek 3.2	$0.26	$0.38	—	—	—	—	—
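
The Score / $ column appears to be the Marketing Design score divided by the cost of one run. A minimal sketch that reproduces the column from the table's own numbers (the formula is inferred from the figures, and the `results` dict and `score_per_dollar` helper are illustrative names, not part of the test harness):

```python
# Marketing Design score and cost per run, copied from the Full Comparison table.
results = {
    "z-ai/glm-5.1": (95, 0.162),
    "minimax/minimax-m2.7": (95, 0.0266),
    "moonshotai/kimi-k2.5": (98, 0.119),
    "qwen/qwen3.6-plus": (92, 0.0277),
    "stepfun/step-3.5-flash": (90, 0.00488),
}


def score_per_dollar(score: float, cost_per_run: float) -> float:
    """Assumed definition of Score / $: quality points per dollar spent on one run."""
    return score / cost_per_run


# Rank models from best to worst value.
ranked = sorted(
    results.items(),
    key=lambda item: score_per_dollar(*item[1]),
    reverse=True,
)

for model, (score, cost) in ranked:
    print(f"{model:<26} {score_per_dollar(score, cost):>10.2f}")
```

Sorting by this ratio puts Step 3.5 Flash first and GLM 5.1 last, even though GLM 5.1 scores higher in absolute terms.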

Current Recommendations

🤖

AI Assistant

🏆 Step 3.5 Flash

Why? Step 3.5 Flash is great at tool calling and is hella cheap. If all you want to do is turn a messaging app into your executive assistant, this is a great solution.

🎨

Front End Design

🏆 Kimi K2.5

Why? It's very close, but Kimi is best at one-shotting a good-looking, cohesive webpage that doesn't reek of LLM-based design (emojis) or make formatting errors in the more complex areas of the design (code blocks, etc.). ❗️But wait! Consider Step 3.5 Flash: it's _so_ economical that you could run about 24 Step calls/iterations for the cost of one Kimi run ($0.119 vs. $0.00488 per run).
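
To sanity-check the trade-off, here is a rough sketch of how many Step 3.5 Flash runs fit inside the cost of a single Kimi K2.5 run, using the Marketing Design per-run costs from the comparison table. The `runs_for_budget` helper is a hypothetical name, and this assumes each iteration costs about the same as the measured run:

```python
import math

# Per-run costs for the Marketing Design test, from the Full Comparison table.
KIMI_COST_PER_RUN = 0.119
STEP_COST_PER_RUN = 0.00488


def runs_for_budget(budget: float, cost_per_run: float) -> int:
    """Whole number of runs affordable within the given budget."""
    return math.floor(budget / cost_per_run)


step_runs = runs_for_budget(KIMI_COST_PER_RUN, STEP_COST_PER_RUN)
print(f"One Kimi K2.5 run buys {step_runs} Step 3.5 Flash runs")
```

Whether that many cheaper iterations actually close the 98 vs. 90 quality gap is something the scores alone don't show.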