METAVENTIONS AI · RESEARCH

The Burstiness Engine

Modeling human rhythmic patterns in language generation.

Authors: Vittoria Lanzo · Dico Angelo
Target: EMNLP / ACL / arXiv
Corpus: 43 papers
Status: Results paper - controller working

The thesis

Human writing has rhythm: sentence length and complexity vary in bursts. Machine text is rhythmically flat, and that flatness is the single most reliable tell that a passage was produced by a language model. Today the property is defined only by a detector heuristic and controlled only by post-hoc rewriting. We make burstiness a reproducible distributional target, steer it during generation, and then test whether readers perceive the shift.

The four contributions

1. A reproducible definition

The burstiness vector B replaces the single-scalar detector heuristic. Decomposable, measurable, usable as a control target.

2. A metric ablation

The field's implicit measure (stdev of surprisal) is a weak discriminator and the wrong default. Local jumpiness and mean surprisal separate human from machine far better.

3. An in-generation controller

Token-altitude control fails at the noise floor; boundary-altitude control moves the target while preserving coherence. The first coherence-preserving rhythm dial.

4. A perception protocol

A pre-registered 2x2 study (adapted from speech synthesis) to test whether controlled rhythm shifts perceived humanness.

The team

Vittoria Lanzo

Research Lead & Paper Architect

Shapes the project vision and research direction; authored the 7 core research questions
Scope decisions: model-level control over prompt baseline, TTS deferred to Phase 2, emotion out of scope, ethics-aware title
Machine-learning experiments: writes the code, runs the core experiments and detector benchmarks
Designs the human perception study (does controlled burstiness shift perception?)

VittoriaLanzo LinkedIn

Dico Angelo

Paper Structure, Research Ops & Strategy

Founder, Metaventions AI; owns paper structure and writing (introduction, gap, methodology)
Runs the literature review via the ResearchGravity orchestration tool; builds the related-work corpus
Prototypes the in-generation controls (humanizer baseline, adaptive sampling, conditioning vectors) and the evaluation framework
Publication strategy (EMNLP / ACL / Interspeech / arXiv); data gathering and exploratory framing

Dicoangelo LinkedIn metaventionsai.com

Our co-evolution workspace

This is a shared steering surface, not a scoreboard. Two design partners co-evolve one paper: Vittoria leads the research and Dico holds the directional controls, so the workspace itself acts as a directional controller for the collaboration, the same idea the paper is built on. Live surfaces: the research dashboard, the repo (one-command build: bash paper/build.sh), and the companion NotebookLM (Audio Overview, Study Guide, flashcards, slide deck). All stay in sync from the single repo source.

The burstiness vector B

B(D) = [ var(L), kurt(L), mean_surprisal, fluc_abs_diff(S), punct_entropy ]

A vector, not a scalar. Sentence-length variance and kurtosis, surprisal level and its local jumpiness, and punctuation entropy. Each collapses toward zero for flat machine text.
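A minimal sketch of how the five components could be computed for one document. Several choices here are assumptions, not the paper's definitions: sentences are split on ., ! and ?, length is counted in whitespace-separated words, kurtosis is population excess kurtosis, fluc_abs_diff is taken to be the mean absolute difference between consecutive surprisal values, and punctuation entropy is Shannon entropy in bits. Surprisals are passed in as a plain list so the sketch needs no language model.

```python
import math
import re
from collections import Counter
from statistics import fmean, pvariance

# Assumed punctuation inventory; the paper may use a different one.
PUNCT = set(".,;:!?-()\"'")

def sentence_lengths(text):
    """Word count per sentence, splitting after ., ! or ? (a simplification)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(p.split()) for p in parts if p]

def excess_kurtosis(xs):
    """Population excess kurtosis; returns 0.0 when the variance is zero."""
    m = fmean(xs)
    v = pvariance(xs, m)
    if v == 0:
        return 0.0
    return fmean([(x - m) ** 4 for x in xs]) / v ** 2 - 3.0

def punct_entropy(text):
    """Shannon entropy (bits) of the punctuation-mark distribution."""
    counts = Counter(ch for ch in text if ch in PUNCT)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def burstiness_vector(text, surprisals):
    """Return the five components of B(D) as a dict."""
    lengths = sentence_lengths(text)
    diffs = [abs(b - a) for a, b in zip(surprisals, surprisals[1:])]
    return {
        "var_L": pvariance(lengths) if lengths else 0.0,
        "kurt_L": excess_kurtosis(lengths) if lengths else 0.0,
        "mean_surprisal": fmean(surprisals) if surprisals else 0.0,
        "fluc_abs_diff": fmean(diffs) if diffs else 0.0,
        "punct_entropy": punct_entropy(text),
    }
```

On a deliberately flat input such as "a b c. d e f. g h i." with constant surprisal, every component except mean_surprisal comes out as zero, which matches the intuition that flat text has no rhythm to measure.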

Metric ablation: which statistic separates human from machine?

Statistic	Dev d	Corpus d	Verdict
mean_surprisal	+1.43	+2.84	strong, both
fluc_abs_diff	+1.25	+3.15	best, both
fluc_raw (stdev)	+0.59	+2.19	weak - the wrong default
fluc_windowed	-0.84	+1.16	reverses
fluc_cv	-0.52	-0.90	reverses

Statistic	Human	Flat machine
var(L)	260	0.49
punct entropy	2.00	0.00

Why every controller failed: two walls

1. The noise wall

The noise in var(L) is a sample-size problem in the metric, not the model: Var(var̂) ~ 2σ⁴/(n-1). On ~10-sentence generations the std exceeds the mean, so no effect can clear the floor. Fix: n ≥ 30 sentences plus paired (common-random-number) evaluation.
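The size of this floor can be checked numerically. The sketch below is illustrative: it evaluates the relative standard error of the sample variance from the formula above, adds the standard kurtosis correction for i.i.d. samples (Var(var̂) = σ⁴(2/(n-1) + κ/n), with κ the excess kurtosis), and confirms the Gaussian case by simulation. The function names and the Gaussian simulation are assumptions made for the example, not the paper's code.

```python
import math
import random
from statistics import fmean, pstdev, variance

def relative_se_of_variance(n, excess_kurtosis=0.0):
    """Std of the sample variance divided by sigma^2 for n i.i.d. samples.

    With excess_kurtosis=0 this reduces to sqrt(2 / (n - 1)), the Gaussian case.
    """
    return math.sqrt(2.0 / (n - 1) + excess_kurtosis / n)

def simulate_relative_se(n, trials=20000, seed=0):
    """Monte Carlo check on Gaussian data: std(var_hat) / mean(var_hat)."""
    rng = random.Random(seed)
    estimates = [
        variance([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return pstdev(estimates) / fmean(estimates)

if __name__ == "__main__":
    for n in (10, 30, 100):
        print(n, round(relative_se_of_variance(n), 3), round(simulate_relative_se(n), 3))
```

For Gaussian lengths the relative error is about 0.47 at n = 10 and about 0.26 at n = 30. Heavy tails make it worse: with an excess kurtosis of roughly 8 or more, the standard deviation of the estimate at n = 10 exceeds the variance being estimated.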

2. The altitude wall

var(L) is a paragraph-level structural plan, but B1/B2/GRPO steer at the token level, which amounts to steering a paragraph's skeleton by nudging individual tokens. Fix: control the sentence-boundary decision, the native altitude of sentence length.

The under-used asset

Because the burstiness metric is computable, it has so far been used only as a terminal reward (best-of-N, GRPO). It is far more powerful as a running, per-step discriminator during decoding. That insight is the crack in the wall.

Controller arms

Arm	Altitude	Result
B1 activation steering vector	token	no reliable var(L) dial; gains within noise
B2 best-of-N LoRA	token	strong training signal, no transfer
GRPO controller	token	directional; sub-threshold gen length
Boundary-FUDGE	sentence boundary	moves var(L) with coherence held - first to do so

Boundary-FUDGE first result (distilgpt2, paired)

Quantity	Value
realized var(L), low plan	24.6
realized var(L), high plan	81.0
paired Δ (high - low)	+56 ± 27 (clears 2·SE)
coherence (mean surprisal)	held (2.76 vs 2.77)
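A paired comparison of this kind can be sketched as follows. The assumption is that each low-plan run is matched with a high-plan run that shares its prompt and random seed (common random numbers), and that the reported ± is the standard error of the mean paired difference. Function names are hypothetical.

```python
import math
from statistics import fmean, stdev

def paired_delta(low, high):
    """Mean paired difference (high - low) and its standard error.

    low[i] and high[i] must come from the same prompt and seed.
    """
    if len(low) != len(high) or len(low) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [h - l for l, h in zip(low, high)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    return fmean(diffs), se

def clears_two_se(delta, se):
    """True when the effect is larger than twice its standard error."""
    return abs(delta) > 2.0 * se
```

Pairing removes the variation that both arms share (prompt difficulty, seed luck), so the standard error reflects only how much the difference itself varies. Under the assumption above, the reported numbers satisfy the check: 56 is larger than 2 × 27 = 54.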

How it works (the mechanism)

Synthesize a target sentence-length plan whose variance is the dial. During decoding, track the running sentence length and boost the sentence-ending punctuation logits by λ once the running length reaches the planned target (and suppress them below half of it). The realized length sequence tracks the plan, so var(L) is set by the plan, whose variance is exact by construction and carries no estimator noise. Training-free; the actuator runs at the sentence-boundary rate, matching the controlled variable.
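The two pieces of this mechanism, the plan and the actuator, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: the plan is built by rescaling an arbitrary shape to a chosen mean and variance, the suppression below half the target is assumed to be a symmetric -λ shift (the text gives only the boost size), and logits are a plain list indexed by token id. All function and parameter names are hypothetical.

```python
import math
from statistics import fmean, pstdev

def synthesize_plan(base_shape, mean_len, target_var):
    """Rescale base_shape to the requested mean and population variance.

    Lengths are real-valued; rounding them to integers makes the
    variance approximate rather than exact.
    """
    mu, sd = fmean(base_shape), pstdev(base_shape)
    if sd == 0:
        raise ValueError("base_shape must not be constant")
    scale = math.sqrt(target_var) / sd
    return [mean_len + (x - mu) * scale for x in base_shape]

def steer_boundary_logits(logits, running_len, target_len, end_ids, lam):
    """Return a copy of logits with sentence-ending tokens shifted.

    At or past the target: boost by lam. Below half the target:
    suppress by lam (assumed). In between: leave unchanged.
    """
    if running_len >= target_len:
        shift = lam
    elif running_len < 0.5 * target_len:
        shift = -lam
    else:
        shift = 0.0
    out = list(logits)
    for i in end_ids:
        out[i] += shift
    return out
```

In a decoding loop, running_len would be reset to zero and target_len advanced to the next plan entry each time a sentence-ending token is emitted. A real plan also needs a floor so that a large target variance cannot produce lengths at or below zero.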

Operating point & an honest caveat

The robust result is the clean monotonic dial at λ=8 (narrow range). A low-power λ sweep first suggested that the range widens sharply (~7x) at higher steering, but a fuller run (λ=16, more seeds and prompts) did not replicate it: the range stayed ~30 and the intermediate levels collapsed. The apparent widening was a var(L) estimator artifact on short generations, the same noise effect the paper formalizes. Widening the dial range robustly remains open and points to longer generations and a larger base, not just stronger steering.

ε-controllability spec (the pass/fail criteria)

Criterion	Threshold	Meaning
C1 monotonic	Spearman ρ ≥ 0.80	realized var(L) rises with the dial across K≥4 levels
C2 effect	Cohen's d ≥ 0.80	the dial clears the noise floor
C3 coherence	drift ≤ 1.0	rhythm not bought with incoherence
C4 content	cosine ≥ 0.60	meaning preserved, only rhythm changes

Structural gates: n_sentences ≥ 30 (gen_len ≥ 450), seeds ≥ 8, K ≥ 4 dial levels, base prefix-only gap reported. Proof = three independent runs pass (replication + a second dimension).

Verdict: boundary-FUDGE (distilgpt2, 4 levels)

Criterion	Value	Threshold	Result
C1 monotonicity (Spearman ρ)	1.00	≥ 0.80	PASS
C2 effect (Cohen's d)	0.93	≥ 0.80	PASS
C3 coherence drift	0.17	≤ 1.0	PASS
C4 content (cross vs floor)	0.34 ≈ 0.31	no dial effect	PASS*

All four satisfied (C4 = no content effect) · var(L) dial 24.6 → 51.9

Boundary-FUDGE gives a monotonic, above-noise, coherence-preserving burstiness dial (the first to do so), with no detectable content change beyond the base model's own randomness. *C4 is a base-chaos artifact: cross-dial content similarity (0.31) equals same-dial (0.31), so C4 measures distilgpt2's seed-to-seed chaos, not the dial. Proper C4 needs paired decoding or a coherent base. Next: larger base on Colab.

Three experiments

Experiment A

Prompt ceiling

Prompt-only control against a distributional target; its effect decays with output length. This is the baseline whose ceiling motivates model-level control.

Experiment B

The controller

Boundary-FUDGE vs the token-altitude arms, judged by the ε-controllability spec. The core contribution.

Experiment C

Perception

2x2 within-subject (rhythm x source), pre-registered, adapted from a speech-synthesis method. Tests perceived humanness.

Roadmap to submission

Done

Formal definition, metric ablation, the noise theorem, the calibrated negative on token control, the boundary-FUDGE positive, the paper build, brand figures, eval tables, inline citations.

In progress

Full C1-C4 controllability verdict; spec-grade confirmation at longer generations.

Scale boundary-FUDGE to a larger base (Colab GPU) for headline numbers; the P1/P2/P3 proof runs; re-score B1/B2/GRPO under the paired eval.

Then

Run the perception study; write the method section; arXiv first, then EMNLP/ACL.