caliper bench

Methodology

Caliper is a creative-writing benchmark for LLMs focused on writing craft — prose quality, willingness, style preservation, and refusal calibration — rather than general intelligence.

Scoring stack

Each run is scored in two layers. Programmatic detectors cover length adherence, slop matching against a shared corpus of AI clichés, within-output repetition, lexical complexity, and style preservation. An LLM grader returns per-dimension scores. Final composites are computed in Python from those per-dimension scores, because we don't trust the grader to do its own weighting.
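As a rough illustration of computing a composite outside the grader, here is a minimal sketch of a weighted mean over per-dimension scores. The dimension names, the weights, and the 0 to 10 scale are all hypothetical; the actual rubric and weighting are not given on this page.

```python
from typing import Mapping

# Hypothetical dimensions and weights, chosen only for illustration.
WEIGHTS = {"prose": 0.5, "voice": 0.3, "coherence": 0.2}


def composite(scores: Mapping[str, float],
              weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted mean of the grader's per-dimension scores.

    The grader only reports raw per-dimension numbers; the weighting
    happens here, in code, so it is fixed and auditable.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"grader output missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Example grader output (made-up numbers on an assumed 0-10 scale).
grader_output = {"prose": 8.0, "voice": 6.0, "coherence": 7.0}
print(composite(grader_output))
```

Keeping the weights in code means a change to the weighting can be re-applied to stored grader outputs without re-running the grader.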

Two headline scores

Willingness gates the composites multiplicatively — a model that refuses cannot rank highly on craft alone. Use the CW / RP toggle above the table to rank by either score.
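The multiplicative gate can be sketched as below. This assumes willingness is expressed as a fraction between 0 and 1 and that the gate is a plain product; the exact formula and scales used by Caliper are not stated here, so treat the function name and ranges as hypothetical.

```python
def gated_score(craft: float, willingness: float) -> float:
    """Scale a craft composite by willingness (assumed to lie in [0, 1]).

    A plain product is the simplest multiplicative gate: a model that
    refuses most prompts loses most of its craft score.
    """
    if not 0.0 <= willingness <= 1.0:
        raise ValueError("willingness must be between 0 and 1")
    return craft * willingness


# Made-up numbers: strong craft but frequent refusals vs. decent craft, no refusals.
strong_but_refusing = gated_score(craft=9.0, willingness=0.25)
decent_and_willing = gated_score(craft=7.0, willingness=1.0)
print(strong_but_refusing < decent_and_willing)
```

Under this sketch the model with the higher craft score still ranks lower, which is the behavior the gating rule describes.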

Modes

Refusal tiers (C0–C4)

Lean vs willingness

NSFW / dark-content density on engaged runs is a tonal-lean measurement, distinct from refusal/willingness. A model can be highly willing yet engage only softly, or refuse often yet produce explicit content when it does engage. The two are scored separately.
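To make the separation concrete, here is a minimal sketch that computes the two numbers independently from the same set of runs. The `Run` record, its `density` field (assumed to be a 0 to 1 value), and both function names are hypothetical and do not reflect Caliper's actual data model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Run:
    refused: bool
    density: float  # hypothetical 0..1 dark-content density; ignored if refused


def willingness_rate(runs: Sequence[Run]) -> float:
    """Fraction of runs where the model engaged instead of refusing."""
    if not runs:
        raise ValueError("no runs to score")
    return sum(not r.refused for r in runs) / len(runs)


def lean(runs: Sequence[Run]) -> Optional[float]:
    """Mean content density over engaged runs only; None if every run refused."""
    engaged = [r.density for r in runs if not r.refused]
    return sum(engaged) / len(engaged) if engaged else None


# Always engages, but softly.
willing_soft = [Run(False, 0.25)] * 4
# Refuses three times out of four, but is explicit when it does engage.
refusing_explicit = [Run(True, 0.0)] * 3 + [Run(False, 1.0)]

print(willingness_rate(willing_soft), lean(willing_soft))
print(willingness_rate(refusing_explicit), lean(refusing_explicit))
```

Because refused runs are excluded from the lean average, a high refusal rate does not drag the lean number toward zero, and the two measurements stay independent.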

—SpiritFather