Open Source Frontier LLMs

Performance, cost, and real-world capability benchmarks across the latest frontier open-weight models. Each model is tested against the same prompts so you can compare apples-to-apples — and click through to see the raw output for yourself.

Models Tested: 6
Total Test Cost: $0.34
Test Suites: 2
Model Outputs
Reference Baseline
Baseline reference output
Z·AI GLM 5.1
Z-AI's frontier powerhouse
🎨 Marketing Design
95/100
586 Score / $ · $0.162 / run
Pricing /1M tokens $1.395 in · $4.40 out
MiniMax MiniMax M2.7
Ultra-efficient large model
🎨 Marketing Design
95/100
3,571 Score / $ · $0.0266 / run
Pricing /1M tokens $0.133 in · $1.20 out
Front End Design
Moonshot AI Kimi K2.5
Best for frontend design
🎨 Marketing Design
98/100
824 Score / $ · $0.119 / run
Pricing /1M tokens $0.6 in · $3.00 out
Alibaba Qwen3.6 Plus
Strong reasoning & efficiency
🎨 Marketing Design
92/100
3,321 Score / $ · $0.0277 / run
Pricing /1M tokens $0.325 in · $1.95 out
AI Assistant
StepFun Step 3.5 Flash
Best value & tool calling
🎨 Marketing Design
90/100
18,443 Score / $ · $0.00488 / run
📅 Calendar Updates
PASS
$0.0026 / run
Pricing /1M tokens $0.032 in · $0.300 out
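The per-run costs on these cards follow from the per-million-token prices and the number of tokens each run consumes. Here is a minimal sketch of that arithmetic. The token counts are hypothetical (the page does not publish them) and `run_cost_usd` is an illustrative helper name, not part of any test harness.

```python
def run_cost_usd(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost of one run in USD, given token counts and prices per 1M tokens."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


# Step 3.5 Flash list prices from the card above: $0.032 in, $0.300 out per 1M tokens.
# The 2,000 input / 15,000 output token counts are assumptions for illustration only.
example_cost = run_cost_usd(2_000, 15_000, 0.032, 0.300)
print(f"${example_cost:.6f} per run")
```

Because the output price is usually several times the input price, long generations such as a full web page tend to dominate the per-run cost.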
Full Comparison
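| Model | Input /1M | Output /1M | 🎨 Marketing Design: Score | 🎨 Marketing Design: Cost | 🎨 Marketing Design: Score / $ | 📅 Calendar Updates: Cost | 📅 Calendar Updates: Pass/Fail |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *anthropic/opus-4.6* | $20/month | $20/month | 100 | ~8% of 5 hour window | | | |
| z-ai/glm-5.1 | $1.395 | $4.40 | 95 | $0.162 | 586.42 | | |
| minimax/minimax-m2.7 | $0.133 | $1.20 | 95 | $0.0266 | 3571.43 | | |
| moonshotai/kimi-k2.5 | $0.6 | $3.00 | 98 | $0.119 | 823.53 | | |
| qwen/qwen3.6-plus | $0.325 | $1.95 | 92 | $0.0277 | 3321.30 | | |
| stepfun/step-3.5-flash | $0.032 | $0.300 | 90 | $0.00488 | 18442.62 | $0.0026 | PASS |
| Deepseek 3.2 | $0.26 | $0.38 | | | | | |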
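The Score / $ column appears to be the Marketing Design score divided by the cost of one run in USD. The sketch below recomputes it from the scores and costs listed above and ranks the models by value. The function and variable names are illustrative, not taken from the site.

```python
# Marketing Design score and per-run cost (USD), copied from the comparison above.
results = {
    "z-ai/glm-5.1": (95, 0.162),
    "minimax/minimax-m2.7": (95, 0.0266),
    "moonshotai/kimi-k2.5": (98, 0.119),
    "qwen/qwen3.6-plus": (92, 0.0277),
    "stepfun/step-3.5-flash": (90, 0.00488),
}


def score_per_dollar(score, cost_usd):
    """Quality score divided by the cost of a single run in USD."""
    return score / cost_usd


# Rank models from best to worst value.
ranked = sorted(
    results.items(),
    key=lambda item: score_per_dollar(*item[1]),
    reverse=True,
)

for model, (score, cost) in ranked:
    print(f"{model:<26} {score_per_dollar(score, cost):>12,.2f}")
```

The metric rewards cheap runs heavily: Step 3.5 Flash has the lowest raw score of the five but ranks first by a wide margin.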
Current Recommendations
🤖
AI Assistant
🏆 Step 3.5 Flash
Why? Step 3.5 Flash is great at tool calling and is hella cheap. If all you want to do is turn a messaging app into your executive assistant, this is a great solution.
🎨
Front End Design
🏆 Kimi K2.5
Why? It's very close, but Kimi is best at one-shotting a good-looking, cohesive webpage that doesn't reek of LLM-based design (emojis) or make formatting errors in more complex areas of the design (code blocks, etc.).
❗️But wait! Consider Step 3.5 Flash: it's _so_ economical that you could run about 24 Step calls/iterations for the cost of a single Kimi run ($0.119 vs. $0.00488 per Marketing Design run).
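To sanity-check the iteration budget, here is a rough sketch that divides one model's per-run cost by another's, using the Marketing Design costs from the comparison above. `runs_for_same_cost` is a hypothetical helper name, and the result assumes each iteration costs about the same as the benchmark run.

```python
def runs_for_same_cost(reference_cost_usd, candidate_cost_usd):
    """Whole number of candidate runs that fit in the cost of one reference run."""
    return int(reference_cost_usd // candidate_cost_usd)


KIMI_K2_5_RUN_USD = 0.119        # Marketing Design cost per run
STEP_3_5_FLASH_RUN_USD = 0.00488  # Marketing Design cost per run

step_runs = runs_for_same_cost(KIMI_K2_5_RUN_USD, STEP_3_5_FLASH_RUN_USD)
print(step_runs)  # 24
```

Iterative prompting usually resends earlier output as input, so later iterations may cost more than the first and the real multiple could be lower.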