Caliper is a creative-writing benchmark for LLMs focused on writing craft — prose quality, willingness, style preservation, and refusal calibration — rather than general intelligence.
Each run is scored by a mix of programmatic detectors (length adherence, slop matching against a shared corpus of AI clichés, within-output repetition, lexical complexity, style preservation) and an LLM grader that returns per-dimension scores. Final composites are computed in Python from the grader's per-dimension outputs, because we don't trust the grader to apply the weighting itself.
Willingness gates the composites multiplicatively — a model that refuses cannot rank highly on craft alone. Use the CW / RP toggle above the table to rank by either score.
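As a rough illustration of the multiplicative gating described above, the sketch below takes a weighted mean of per-dimension grader scores and multiplies it by a willingness factor. The dimension names, the weights, the 0-10 scale, and the `composite` function are all assumptions made for this example; they are not Caliper's actual configuration.

```python
from typing import Dict

# Hypothetical dimensions and weights; Caliper's real ones are not stated here.
WEIGHTS: Dict[str, float] = {
    "prose_quality": 0.5,
    "style_preservation": 0.3,
    "coherence": 0.2,
}


def composite(scores: Dict[str, float], willingness: float) -> float:
    """Weighted mean of per-dimension scores, gated multiplicatively by willingness.

    scores: per-dimension grader outputs, assumed to be on a 0-10 scale.
    willingness: assumed fraction of runs the model engaged with, in [0, 1].
    """
    if not 0.0 <= willingness <= 1.0:
        raise ValueError("willingness must be in [0, 1]")
    total_weight = sum(WEIGHTS.values())
    craft = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight
    # The gate: a craft score is scaled down in proportion to refusals.
    return craft * willingness


strong_but_refusing = composite(
    {"prose_quality": 9.0, "style_preservation": 9.0, "coherence": 9.0}, 0.2
)
weaker_but_willing = composite(
    {"prose_quality": 7.0, "style_preservation": 7.0, "coherence": 7.0}, 1.0
)
print(strong_but_refusing < weaker_but_willing)  # True
```

Because the weighting lives in code rather than in the grader prompt, the weights can be changed and composites recomputed without re-running the grader.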
prose: short-form continuations from 33 PromptItems × 4 runs.
cards: character-card roleplay openers (28 cards × 4 runs).
calibration: refusal probes across safety tiers.
recognition: literary passage attribution.
long_ctx_final: single-turn 64K-prefix continuation.
multi_turn_prefill: progressive multi-turn extension.

C0: refusal rate on safe prompts (lower = more willing).
C1: refusal rate on mild-jailbreak prompts.
C2: engagement rate (1 − refusal) on mild jailbreaks.
C3: refusal rate on heavy-jailbreak prompts.
C4: engagement rate on heavy jailbreaks.

NSFW / dark-content density on engaged runs is a tonal-lean measurement, distinct from refusal/willingness. A model can be highly willing yet produce soft engagement, or refuse often yet produce explicit content when it does engage. These are scored separately.
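The C0 to C4 definitions above can be sketched as simple per-tier rates. This is a minimal example under stated assumptions: the tier labels (`safe`, `mild`, `heavy`), the `refusal_rates` helper, and the sample outcomes are hypothetical, and the sketch takes each run's refused/engaged flag as given rather than showing how a refusal is detected.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def refusal_rates(runs: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of refused runs for each tier.

    runs: (tier, refused) pairs; the tier labels are hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [refused, total]
    for tier, refused in runs:
        counts[tier][0] += int(refused)
        counts[tier][1] += 1
    return {tier: refused / total for tier, (refused, total) in counts.items()}


# Made-up outcomes, for illustration only.
runs = [
    ("safe", False), ("safe", False), ("safe", False), ("safe", True),
    ("mild", True), ("mild", False),
    ("heavy", True), ("heavy", True), ("heavy", True), ("heavy", False),
]
rates = refusal_rates(runs)
metrics = {
    "C0": rates["safe"],       # refusal rate on safe prompts
    "C1": rates["mild"],       # refusal rate on mild jailbreaks
    "C2": 1 - rates["mild"],   # engagement rate on mild jailbreaks
    "C3": rates["heavy"],      # refusal rate on heavy jailbreaks
    "C4": 1 - rates["heavy"],  # engagement rate on heavy jailbreaks
}
print(metrics)
# {'C0': 0.25, 'C1': 0.5, 'C2': 0.5, 'C3': 0.75, 'C4': 0.25}
```

In this sketch C2 and C4 are the complements of C1 and C3, following the "1 − refusal" definition of engagement rate.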
—SpiritFather