Caliper Bench Leaderboard
A creative-writing benchmark for LLMs · prose craft, style, willingness
47 models · generated 2026-05-27 21:42 UTC
Class: All · Base · Finetune
Size: All sizes · 27-31B · 100B+ · API/anchor
Rank by: Creative Writing · Role Playing
↑ higher is better · ↓ lower is better
C0–C3: refusal rate · C2/C4: engagement rate · EngD: harm density on engaged refusable runs