Caliper Bench Leaderboard
A creative-writing benchmark for LLMs · prose craft, style, willingness
37 models · generated 2026-05-27 03:48 UTC
Class: All · Base · Finetune
Size: All sizes · 27-31B · 100B+ · API/anchor
Rank by: Creative Writing · Role Playing
↑ higher is better · ↓ lower is better
C0–C3: refusal rate
C2/C4: engagement rate
EngD: harm density on engaged refusable runs