Caliper Bench Leaderboard

A creative-writing benchmark for LLMs · prose craft, style, willingness
35 models · generated 2026-05-25 22:25 UTC
↑ higher is better · ↓ lower is better
C0–C3 refusal rate · C2/C4 engagement rate · EngD harm density on engaged refusable runs