Caliper Bench Leaderboard
A creative-writing benchmark for LLMs · prose craft, style, willingness
35 models · generated 2026-05-25 23:05 UTC
Class: All · Base · Finetune
Size: All sizes · 27-31B · 100B+ · API/anchor
Rank by: Creative Writing · Role Playing
↑ higher is better
↓ lower is better
C0–C3: refusal rate
C2/C4: engagement rate
EngD: harm density on engaged refusable runs