Caliper Bench Leaderboard
A creative-writing benchmark for LLMs · prose craft, style, willingness
35 models · generated 2026-05-25 22:25 UTC
↑ higher is better · ↓ lower is better
C0–C3: refusal rate · C2/C4: engagement rate · EngD: harm density on engaged refusable runs