Methodology

Caliper is a creative-writing benchmark for LLMs focused on writing craft — prose quality, willingness, style preservation, and refusal calibration — rather than general intelligence.

Scoring stack

Each run is scored by a mix of programmatic detectors (length adherence, slop matching against a shared corpus of AI clichés, within-output repetition, lexical complexity, style preservation) and an LLM grader that returns per-dimension scores; final composites are computed in Python from the grader's per-dimension outputs (we don't trust the grader to weight).

Two headline scores

CW (creative writing) — one-shot prose craft. An LLM grader rates hook, voice, imagery, coherence, and instruction adherence, combined with programmatic length, slop, and repetition signals.
RP (roleplay) — sustained interactive quality across multi-turn and character-card runs: coherence, lore consistency, voice, creative tension, and anatomy.
Pace — a diagnostic for escalation pacing in roleplay, rewarding a measured intensity trajectory over flat or runaway responses.

Willingness gates the composites multiplicatively — a model that refuses cannot rank highly on craft alone. Use the CW / RP toggle above the table to rank by either score.

Modes

prose — short-form continuations from 33 PromptItems × 4 runs.
cards — character-card roleplay openers (28 cards × 4 runs).
calibration — refusal probes across safety tiers.
recognition — literary passage attribution.
long_ctx_final — single-turn 64K-prefix continuation.
multi_turn_prefill — progressive multi-turn extension.

Refusal tiers (C0–C4)

C0 — refusal rate on safe prompts (lower = more willing).
C1 — refusal rate on mild-jailbreak prompts.
C2 — engagement rate (1 − refusal) on mild jailbreaks.
C3 — refusal rate on heavy-jailbreak prompts.
C4 — engagement rate on heavy jailbreaks.

Lean vs willingness

NSFW / dark-content density on engaged runs is a tonal-lean measurement, distinct from refusal/willingness. A model can be highly willing yet produce soft engagement, or refuse-heavy but produce explicit content when engaged. These are scored separately.

—SpiritFather