Human writing has rhythm: sentence length and complexity vary in bursts. Machine text is rhythmically flat, and that flatness is the single most reliable tell that a passage was produced by a language model. Today the property is defined only by a detector heuristic and controlled only by post-hoc rewriting. We make burstiness a reproducible distributional target, steer it during generation, and then test whether readers perceive the shift.
The burstiness vector B replaces the single-scalar detector heuristic. Decomposable, measurable, usable as a control target.
The field's implicit measure (stdev of surprisal) is the weakest discriminator. Local jumpiness and mean surprisal separate human from machine far better.
Token-altitude control fails at the noise floor; boundary-altitude control moves the target while preserving coherence. The first coherence-preserving rhythm dial.
A pre-registered 2x2 study (adapted from speech synthesis) to test whether controlled rhythm shifts perceived humanness.
This is a shared steering surface, not a scoreboard. Two design partners co-evolve one paper: Vittoria leads the research and Dico holds the directional controls. The workspace itself therefore acts as a directional controller for the collaboration, the same idea the paper is built on. Live surfaces: the research dashboard, the repo (one-command build: bash paper/build.sh), and the companion NotebookLM (Audio Overview, Study Guide, flashcards, slide deck). All stay in sync from the single repo source.
A vector, not a scalar. Sentence-length variance and kurtosis, surprisal level and its local jumpiness, and punctuation entropy. Each collapses toward zero for flat machine text.
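To make the text-only components concrete, here is a minimal sketch of how they might be computed. It is not the paper's implementation: the regex sentence splitter, word-count lengths, the punctuation set, the choice of excess (rather than raw) kurtosis, and every function name are assumptions made for illustration. The surprisal components need a scoring model and are left out here.

```python
import math
import re
import statistics
from collections import Counter

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def sentence_lengths(text):
    # Length in whitespace-delimited words; a real pipeline might count tokens.
    return [len(s.split()) for s in split_sentences(text)]

def excess_kurtosis(values):
    # Population excess kurtosis; defined as 0.0 when the values have no spread.
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

def punctuation_entropy(text, marks=".,;:!?-()\"'"):
    # Shannon entropy (bits) of the distribution over punctuation marks used.
    counts = Counter(ch for ch in text if ch in marks)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def text_components(text):
    # Hypothetical container for the three text-only entries of B.
    lengths = sentence_lengths(text)
    return {
        "var_L": statistics.pvariance(lengths) if lengths else 0.0,
        "kurt_L": excess_kurtosis(lengths) if lengths else 0.0,
        "punct_entropy": punctuation_entropy(text),
    }
```

On a passage of equal-length sentences that all end in a period, all three entries come out as zero, which matches the "collapses toward zero" behaviour described above.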
| Statistic | Dev d | Corpus d | Verdict |
|---|---|---|---|
| mean_surprisal | +1.43 | +2.84 | strong, both |
| fluc_abs_diff | +1.25 | +3.15 | best, both |
| fluc_raw (stdev) | +0.59 | +2.19 | weak - the wrong default |
| fluc_windowed | -0.84 | +1.16 | reverses |
| fluc_cv | -0.52 | -0.90 | reverses |
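
The table's statistics can be sketched as follows, assuming a list of per-token surprisals is already available from some scoring model. The definitions are guesses from the row names (fluc_raw as the standard deviation, fluc_abs_diff as the mean absolute first difference, fluc_cv as standard deviation over mean); fluc_windowed is omitted because its window is not specified here. The d columns are read as Cohen's d with a pooled standard deviation, which is also an assumption.

```python
import statistics

def surprisal_stats(surprisals):
    # surprisals: per-token -log p values from any scoring model (assumed given).
    mean = statistics.fmean(surprisals)
    stdev = statistics.pstdev(surprisals)
    # Local jumpiness: how far surprisal moves from one token to the next.
    diffs = [abs(b - a) for a, b in zip(surprisals, surprisals[1:])]
    return {
        "mean_surprisal": mean,
        "fluc_raw": stdev,
        "fluc_abs_diff": statistics.fmean(diffs) if diffs else 0.0,
        "fluc_cv": stdev / mean if mean else 0.0,
    }

def cohens_d(group_a, group_b):
    # Standardized mean difference using the pooled sample standard deviation.
    # Each group needs at least two values with some spread.
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled
```

The sign of d depends on which group is passed first, so a sign reversal in the table is a change in which source scores higher on that statistic, not a property of the code.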
The noise in var(L) is a sample-size problem in the metric, not the model: the estimator's variance is Var(var̂) ~ 2σ⁴/(n-1). On ~10-sentence generations the estimator's std exceeds its mean, so no effect can clear the floor. Fix: n ≥ 30 sentences plus paired (common-random-number) evaluation.
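The scaling of that floor with n can be checked with a few lines. This is a sketch under stated assumptions: sentence lengths are treated as i.i.d., and the 2σ⁴/(n-1) term is the Gaussian case. The optional excess_kurtosis argument uses the standard general result Var(s²) = σ⁴(2/(n-1) + κ/n), so heavier-tailed length distributions raise the floor further. The function name is hypothetical.

```python
import math

def relative_sd_of_variance(n, excess_kurtosis=0.0):
    # SD of the sample variance divided by the true variance, for n i.i.d. values.
    # With excess_kurtosis=0 (Gaussian case) this reduces to sqrt(2 / (n - 1)).
    if n < 2:
        raise ValueError("need at least two sentences")
    return math.sqrt(2.0 / (n - 1) + excess_kurtosis / n)

for n in (10, 30, 100):
    print(n, round(relative_sd_of_variance(n), 3))
```

In the Gaussian case the relative noise falls only as 1/sqrt(n), which is why moving from roughly 10 to 30 or more sentences matters so much and why pairing runs on common random numbers is a useful complement.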
var(L) is a paragraph-level structural plan, but B1/B2/GRPO steer at the token level. That amounts to steering a paragraph's skeleton by nudging individual tokens. Fix: control the sentence-boundary decision, the native altitude of sentence length.
The burstiness metric is computable, but so far it has been used only as a terminal reward (best-of-N, GRPO). It is far more powerful as a running, per-step discriminator during decoding. That insight is the crack in the wall.
| Arm | Altitude | Result |
|---|---|---|
| B1 activation steering vector | token | no reliable var(L) dial; gains within noise |
| B2 best-of-N LoRA | token | strong training signal, no transfer |
| GRPO controller | token | directional; sub-threshold gen length |
| Boundary-FUDGE | sentence boundary | moves var(L) with coherence held - first to do so |
Synthesize a target sentence-length plan whose variance is the dial. During decoding, track the running sentence length and boost the sentence-ending punctuation logits by λ once the running length reaches the planned target (and suppress them below half of it). The realized length sequence tracks the plan, so var(L) is set by the plan, whose variance is exact by construction and carries no estimator noise. Training-free; the actuator runs at the sentence-boundary rate, matching the controlled variable.
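The mechanism above can be sketched in plain Python. This is an illustration, not the paper's code: the alternating short/long plan is just one simple way to get a plan whose population variance is exact, the suppression strength is assumed to equal λ, and the names make_plan, steer_boundary_logits, and end_token_ids are hypothetical.

```python
import statistics

def make_plan(mean_len, spread, n_sentences):
    # Alternate short and long targets. With an even count, the population
    # variance of the plan is exactly spread ** 2, so spread acts as the dial.
    if n_sentences % 2 or spread >= mean_len:
        raise ValueError("need an even count and spread < mean_len")
    return [mean_len - spread if i % 2 == 0 else mean_len + spread
            for i in range(n_sentences)]

def steer_boundary_logits(logits, end_token_ids, running_len, target_len, lam):
    # Return a copy of logits with sentence-ending tokens biased by the plan.
    adjusted = list(logits)
    if running_len >= target_len:
        delta = lam      # at or past the target: encourage ending the sentence
    elif running_len < target_len / 2:
        delta = -lam     # below half the target: discourage ending (assumed -lam)
    else:
        delta = 0.0      # in between: leave the model's own preference alone
    for tok in end_token_ids:
        adjusted[tok] += delta
    return adjusted

plan = make_plan(mean_len=15, spread=5, n_sentences=8)
print(plan, statistics.pvariance(plan))
```

In a decoding loop, the caller would reset running_len and advance to the next planned target each time a sentence-ending token is emitted, so the bias changes only at the sentence-boundary rate.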
The robust result is the clean monotonic dial at λ=8 (narrow range). A low-power λ sweep first suggested that the range widens sharply (~7x) at higher steering, but a fuller verdict (λ=16, more seeds and prompts) did not replicate it: the range stayed at ~30 and the intermediate levels collapsed. The apparent widening was a var(L) estimator artifact on short generations, the same noise effect the paper formalizes. Widening the dial range robustly remains open and points to longer generations and a larger base, not just stronger steering.
| Criterion | Threshold | Meaning |
|---|---|---|
| C1 monotonic | Spearman ρ ≥ 0.80 | realized var(L) rises with the dial across K≥4 levels |
| C2 effect | Cohen's d ≥ 0.80 | the dial clears the noise floor |
| C3 coherence | drift ≤ 1.0 | rhythm not bought with incoherence |
| C4 content | cosine ≥ 0.60 | meaning preserved, only rhythm changes |
Structural gates: n_sentences ≥ 30 (gen_len ≥ 450), seeds ≥ 8, K ≥ 4 dial levels, base prefix-only gap reported. Proof = three independent runs pass (replication + a second dimension).
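A minimal sketch of how the four thresholds might be checked for one run follows. The thresholds are taken from the table above; the helper names and the tie-aware rank correlation are illustrative assumptions, and the structural gates are not enforced here.

```python
def average_ranks(values):
    # 1-based ranks, with tied values sharing the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman correlation computed as the Pearson correlation of the ranks.
    # Undefined (division by zero) if either input is constant.
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def check_spec(rho, d, drift, cosine):
    # One boolean per criterion, using the thresholds from the table.
    return {
        "C1": rho >= 0.80,
        "C2": d >= 0.80,
        "C3": drift <= 1.0,
        "C4": cosine >= 0.60,
    }
```

For C1, x would be the K dial levels and y the realized var(L) at each level, for example spearman_rho([1, 2, 3, 4], realized_var) with realized_var a hypothetical list of measurements.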
| Criterion | Value | Threshold | Result |
|---|---|---|---|
| C1 monotonicity (Spearman ρ) | 1.00 | ≥ 0.80 | PASS |
| C2 effect (Cohen's d) | 0.93 | ≥ 0.80 | PASS |
| C3 coherence drift | 0.17 | ≤ 1.0 | PASS |
| C4 content (cross vs floor) | 0.34 ≈ 0.31 | no dial effect | PASS* |
Boundary-FUDGE gives a monotonic, above-noise, coherence-preserving burstiness dial (the first to do so), with no detectable content change beyond the base model's own randomness. *C4 is a base-chaos artifact: cross-dial content similarity (0.31) equals same-dial (0.31), so C4 measures distilgpt2's seed-to-seed chaos, not the dial. Proper C4 needs paired decoding or a coherent base. Next: larger base on Colab.
Prompt-only control against a distributional target, decaying with output length. The baseline whose ceiling motivates model-level control.
Boundary-FUDGE vs the token-altitude arms, judged by the ε-controllability spec. The core contribution.
2x2 within-subject (rhythm x source), pre-registered, adapted from a speech-synthesis method. Tests perceived humanness.
Formal definition, metric ablation, the noise theorem, the calibrated negative on token control, the boundary-FUDGE positive, the paper build, brand figures, eval tables, inline citations.
Full C1-C4 controllability verdict; spec-grade confirmation at longer generations.
Scale boundary-FUDGE to a larger base (Colab GPU) for headline numbers; the P1/P2/P3 proof runs; re-score B1/B2/GRPO under the paired eval.
Run the perception study; write the method section; arXiv first, then EMNLP/ACL.