npm - openwriter - Versions diffs - 0.35.1 → 0.36.0 - Mend

openwriter 0.35.1 → 0.36.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (32) hide show

package/dist/plugins/authors-voice/skill/catalog/anchor-prompt.md ADDED Viewed

@@ -0,0 +1,189 @@
+# Anchor Prompt
+> The matcher logic the agent follows to produce a voice anchor blend entirely in-agent — no network, no hosted service.
+## Role
+You are a literary stylometry analyst. You identify which training-data authors a piece of writing mechanically resembles. You match on **prose mechanics only** — sentence structure, punctuation patterns, rhetorical moves, discourse rhythm, vocab register. You **never** match on content, themes, or subject matter.
+If the writer discusses biology, you do NOT match them to Dawkins because of the topic. You match them to whichever author's *sentence construction* most resembles theirs. If the writer discusses startups, you do NOT match them to Paul Graham because of the topic. You match on prose mechanics — how the sentences are shaped, how paragraphs transition, how punctuation lands.
+## Hard rules
+1. Score ONLY on prose mechanics. Forbidden: matching on themes, topics, subject matter, ideology, worldview, or what the writer is "about."
+2. Output exactly 3-5 authors with weights summing to exactly 100.
+3. Each author's `features_matched` MUST cite at least 2 specific PROSE features. Each feature must describe HOW sentences/paragraphs are constructed, NOT what they are about.
+4. Prefer authors from the author hints list — the model has the deepest compressed style modes for them. You MAY go outside the list if a clearly better stylistic match exists, but only when justified by prose features.
+5. If the sample is too short or too uniform to distinguish style confidently, return fewer authors and flag confidence as `low`.
+## Inputs the agent loads before running this protocol
+1. **The user's corpus** — every file in `voice/corpus/` (strip YAML frontmatter). **Keep each sample separate** — do not concatenate yet. You need per-sample analysis before the blend.
+2. **The deterministic stats** — read `voice/stats.md` if it exists. The sentence-length distribution and punctuation density numbers are the anchors that prevent agent drift. If `stats.md` doesn't exist yet, run the Analysis Protocol first to generate it.
+3. **The author hint list** — read `catalog/author-hints.md` for curated authors with heavy training-data representation, grouped by category.
+4. **The conversational guard** — set aside everything you know about the user from the current conversation (their projects, their interests, the topic they've been discussing). Score the corpus as if you've never read it before, with no context.
+## Per-sample register analysis (run BEFORE final scoring)
+This step prevents the single-large-sample bias that overweights features concentrated in one piece. The matcher used to concatenate all samples into one text and score volume-weighted — which meant the largest sample's signature features got baked into the blend as if they were corpus-wide patterns. They aren't.
+For each sample in `voice/corpus/`, record:
+1. **Word count** — split on whitespace, count tokens.
+2. **Address mode** — first-person (`I/me/my/we/our`), second-person (`you/your`), third-person (`he/she/they/the modern man`), or mixed. If mixed, note the dominant mode.
+3. **Tone register** — instructional, analytical/expository, polemical, conversational, reflective, narrative. Pick the dominant register.
+4. **Signature moves present** — list 1-3 prose mechanics that stand out in THIS sample (e.g., anaphoric `Your X.` stacking, definitional `X is Y` pivots, dismissive concession `whether X matters less than Y`).
+Build a sample table:
+| Sample | Words | Address | Register | Signature moves |
+|--------|-------|---------|----------|-----------------|
+| 001 | 200 | mixed (you + we) | instructional | … |
+| 002 | 100 | generic-you | analytical | … |
+| ... | | | | |
+Then compute:
+- **Volume share** per sample: `sample_word_count / total_corpus_words * 100`. Flag any sample with >25% volume share — its features will skew the blend unless deliberately balanced.
+- **Register clusters**: group samples by register. Are there 2+ clearly distinct registers (e.g., third-person expository AND second-person instructional)? If yes, this is a **multi-register corpus** — surface as a warning in the final output.
+## Register-aware feature validation
+When you're about to cite a feature in `features_matched` for any author in the blend, run this check:
+1. **In which samples does this feature appear?** List them.
+2. **What share of samples (by count) contains it?**
+3. **What share of corpus volume contains it?**
+Then apply the rules:
+- **Feature appears in ≥40% of samples by count** → corpus-wide signature, valid to cite without caveat.
+- **Feature appears in <40% of samples by count BUT ≥40% of volume** → concentrated in fewer-but-larger samples. **Cite with caveat**: append `(concentrated in samples N, M — not corpus-wide)`.
+- **Feature appears in <33% of samples AND <33% of volume** → not a real corpus signature. Do NOT cite. If this was your strongest evidence for an author, drop the author from the blend.
+This rule kills the "one big sample's signature shows up as the dominant author weight" bug. The Manson-anaphora pattern from a single 480-word piece doesn't get to drive a 38% weight — unless it's actually replicated across multiple samples.
+## Scoring dimensions
+Score the user's prose against each of these dimensions. Use these as the basis for feature matching:
+1. **Sentence length distribution** — what percentage short / medium / long / very-long? How does this compare to each candidate author's known distribution? Use the numbers from `voice/stats.md`.
+2. **Punctuation profile** — em-dash density (telling signal), semicolon use, colon use (for reveals vs. lists), parenthetical frequency, question marks, exclamations. Use the numbers from `voice/stats.md`.
+3. **Vocab register** — academic / casual / instructional / aphoristic / journalistic / literary / polemical. Diction level. Use of jargon, slang, formal vocabulary.
+4. **Rhetorical moves** — definitional pivots ("X is Y"), anaphora ("Your X. Your Y. Your Z."), rule-of-three, dismissive concession ("X matters less than Y"), concept-coinage (naming a concept then referring to it as proper-noun), claim-then-evidence vs evidence-then-claim, direct address.
+5. **Discourse patterns** — how paragraphs transition. Use of "But", "So", "However", "Moreover". Question pivots. Time markers vs logical connectors.
+6. **Paragraph rhythm** — uniform vs varied length. Where short paragraphs land (impact moments? insights? transitions?).
+7. **Person & address** — first-person dominance, second-person directness, third-person formality.
+8. **Hedging vs assertion** — frequency of qualifiers ("perhaps", "may", "could"), certainty markers, presence/absence of softeners.
+## Examples of valid `features_matched` entries
+- `period-heavy short sentences (avg 9 words/sentence matches their pattern)`
+- `concept-coinage with definitional pivots ("Sleep debt is X")`
+- `anaphora across consecutive sentences ("Your job. Your family. Your kids.")`
+- `low em-dash density (0.3/1000 words matches their 0.4)`
+- `rule-of-three concrete lists, not abstract flourishes`
+- `dismissive concession move ("whether X holds matters less than Y")`
+## Examples of INVALID `features_matched` entries
+- `writes about biology` — content
+- `anti-establishment themes` — content
+- `interested in masculinity` — content
+- `discusses religion` — content
+- `concerned with personal development` — content
+All of these are forbidden because they describe what the writer is ABOUT, not how the prose is constructed.
+## Self-criticism step
+Before finalizing the blend, re-read each `features_matched` entry. For each entry, ask:
+- Does this describe HOW the sentences/paragraphs are constructed? (good)
+- Or does it describe WHAT the text is about — topics, themes, subject matter, ideology, worldview? (forbidden)
+If ANY entry describes content rather than mechanics, replace it with a prose feature or remove that author from the blend.
+After your check, set the self-check flags:
+- `any_thematic_reasoning`: `true` if you still had thematic reasoning that you couldn't fully fix. Otherwise `false`.
+- `confidence`:
+  - `high` if corpus total > 800 words AND prose features are clearly distinguishable
+  - `medium` if corpus total 400-800 words OR features are mixed/ambiguous
+  - `low` if corpus total < 400 words OR samples are too uniform to distinguish
+## Output
+Write `voice/anchor.md` in this exact format:
+```markdown
+# Writer's Voice Blend
+> Generated in-agent by the writers-voice skill (fully local).
+> Pasted on YYYY-MM-DD.
+> Context: <general | tweets | essays | newsletter | email>
+## Blend
+- **<weight>% <Author Name>**
+    - <prose feature 1>
+    - <prose feature 2>
+    - <optional prose feature 3+>
+- **<weight>% <Author Name>**
+    - <prose feature 1>
+    - <prose feature 2>
+- ...
+## Per-Sample Composition
+| Sample | Words | Volume % | Address | Register | Signature moves |
+|--------|-------|----------|---------|----------|-----------------|
+| 001 | 200 | 14% | mixed | instructional | … |
+| 002 | 100 | 7% | generic-you | analytical | … |
+## Register Diversity
+- **Detected registers:** <list, e.g., "third-person expository, second-person instructional, first-person reflective">
+- **Multi-register corpus:** <yes | no>
+- **Dominant-sample warning:** <none | "Sample N is X% of corpus volume — its signature features drive the blend disproportionately. Consider running a context-specific anchor for the other registers (see Multi-Register Anchors in SKILL.md).">
+## Self-check
+- Confidence: <high | medium | low>
+- Any thematic reasoning: <true | false>
+- Notes: <one short sentence about sample adequacy and match quality>
+## Apply Directive
+Write in this blended style. Match each author's prose mechanics in proportion
+to weight. Maintain across the conversation.
+## When this anchor doesn't fit
+If you're writing in a register that this corpus DOESN'T represent well (e.g., your corpus is mostly conversational but you're drafting a book in third-person expository), this anchor will pull you toward the wrong register. Options:
+1. Add 2-3 samples in the missing register to `voice/corpus/`, then re-run the Anchor Protocol.
+2. Maintain a separate context-specific anchor (see "Multi-Register Anchors" in SKILL.md). File pattern: `voice/anchor-<context>.md` (e.g., `voice/anchor-book.md`).
+```
+Weights are positive integers summing to exactly 100. Authors listed in descending weight order.
+## Recommending a multi-register split
+If your register-diversity analysis above flagged the corpus as multi-register, do NOT just produce one blend. After writing the main `voice/anchor.md`, tell the user:
+> "Your corpus spans multiple registers: [list them]. The blend above represents the dominant register ([register name], ~X% of corpus volume). If you write in other registers, I recommend generating a separate anchor file per register. Want me to do that now? — I'll re-run the matcher with only the samples that belong to each register, and save the results as `voice/anchor-<context>.md`."
+If the user says yes, run the matcher once per register, with only the samples that match that register. Save each as `voice/anchor-<context>.md` (where `<context>` is a short slug like `book`, `essay`, `tweets`, `instructional`, `expository`). The main `voice/anchor.md` stays as the corpus-wide blend with the multi-register warning.
+## How this protocol runs
+This is the **only** way to produce an anchor — entirely on the user's own agent,
+no network, no hosted service, no cost. Best launched as a sub-agent (see
+"Launching the anchor as a sub-agent" in `docs/setup.md`) so the rubric stays out
+of the main session. It needs a corpus on disk (≥300 words); if the user has none,
+seed a few samples first. It re-runs over the full corpus on demand and supports
+per-register anchor files (`voice/anchor-<context>.md`).
+> The hosted matcher at `openwriter.io/writers-voice` is **deprecated** — do not
+> route users to it. All anchor derivation is local.

package/dist/plugins/authors-voice/skill/catalog/author-hints.md ADDED Viewed

@@ -0,0 +1,119 @@
+# Author Hints
+> Curated list of authors with heavy training-data representation across registers.
+> The agent picks 3-5 from this list (or goes outside if a clearly better stylistic match exists).
+> Goal: high consistency, broad coverage of style modes.
+>
+> Each entry describes the author's characteristic PROSE FEATURES — not content, not topics.
+## Literary stylists — sentence as instrument
+- **Ernest Hemingway** — iceberg theory, short declarative, almost no adjectives, hard nouns and verbs
+- **Cormac McCarthy** — no commas in dialogue, long unpunctuated runs, biblical cadence
+- **Joan Didion** — measured precision, lists of three, the specific over the general
+- **David Foster Wallace** — maximalist sentences, footnote energy, recursive parentheticals, vocabulary
+- **Toni Morrison** — rhythmic repetition, sensory specificity, vernacular weaved with formal
+- **James Baldwin** — long balanced sentences, moral urgency in clause stacking, semicolons used as breath marks
+- **Annie Dillard** — observational compression, present-tense immediacy, sentence as image
+- **Marilynne Robinson** — theological cadence in plain words, long meditative sentences, Calvinist patience
+## Essayists / analytical
+- **Paul Graham** — thinking out loud, simple words, short sentences mixed with one long earned conclusion
+- **Patrick McKenzie** — long discursive sentences with parenthetical asides, domain-specific precision, dry humor
+- **Tim Urban** — extended metaphors, conversational asides, building up frameworks with named characters
+- **Scott Alexander** — rationalist sectioning, exhaustive enumeration of possibilities, fair-witness analysis
+- **Ben Thompson** — business-strategy decomposition, recurring framework names, lots of "this is why"
+- **Tyler Cowen** — compressed, list-heavy, blogger-shorthand, range of references in one paragraph
+- **Malcolm Gladwell** — narrative anchor → general principle → reversal, three-act essay structure
+- **Adam Grant** — research-anchored, paired contrasts, gentle prescriptive framing
+## Self-help / instructional / productivity
+- **Mark Manson** — period-heavy clean prose, "Your X is Y" definitional moves, profane confidence, direct second-person
+- **James Clear** — concept-coinage backbone, numbered enumeration, clean instructional cadence, named laws/frameworks
+- **Ryan Holiday** — Stoic-instructional, drawing concepts from antiquity, clean delivery, repetition for emphasis
+- **Tim Ferriss** — list-heavy, hack/protocol framing, second-person direct, capitalized concept names
+- **Cal Newport** — academic-instructional, named frameworks, research-backed prescriptions, sober register
+- **Greg McKeown** — one-idea-per-page rhythm, named principles, short paragraphs, prescriptive minimalism
+- **Brené Brown** — vulnerability-as-rhetoric, personal anecdotes anchoring research, conversational warmth
+- **Atomic Habits voice** — short-paragraph instruction, principle-then-example, mechanical-cause-and-effect language
+## Polemicists / contrarians / philosophical-provocative
+- **Nassim Nicholas Taleb** — concept-coinage (Black Swan, antifragile), aphoristic stabs, attacking IYI, ancient-thinker citations
+- **Jordan Peterson** — lecture cadence, biological-evolutionary framing, religious overlay, definitional pivots
+- **Bronze Age Pervert** — baroque Nietzschean, mock-archaic spelling, vitalist anti-modernity, ironic register
+- **Curtis Yarvin** — reactionary historical, concept-naming as branding, sneering wit, long allusive sentences
+- **Camille Paglia** — punchy contrarian, aesthetic-biological frame, dense allusion, no hedging
+- **Bryan Caplan** — libertarian-economic, direct refutation, hypothetical thought experiments, plain professorial prose
+## Tech / startup
+- **Naval Ravikant** — aphoristic, tweet-shaped, concept-as-brand, distilled-wisdom cadence
+- **Marc Andreessen** — manifesto-mode, accumulating short declarative lines, exhortation register
+- **Sam Altman** — short, contrarian, "obvious in hindsight" framing, blog-post brevity
+- **Peter Thiel** — paradox-as-thesis, philosophical-tech crossover, Strauss-influenced indirection
+- **Joel Spolsky** — conversational tech-blog, anecdote-then-principle, signposting humor
+- **Steve Yegge** — long discursive rants, programmer-culture inside jokes, accumulating digressions
+## Narrative non-fiction / journalism
+- **Michael Lewis** — character-first reporting, scene-as-argument, clean unobtrusive prose
+- **Sebastian Junger** — documentary precision, anthropological framing, present-tense narrative
+- **Jon Krakauer** — present-tense urgency, sensory immediacy, restraint in adjective use
+- **John McPhee** — list-as-paragraph, structural patterning, long sentences with specific detail
+- **Ta-Nehisi Coates** — meditative-historical, repetition as emphasis, address as form of argument
+- **Tom Wolfe** — New Journalism, exclamation, capitalization, italics for sound, voice-jumping
+- **Joan Didion (essays)** — see literary; her journalism has the same compressed precision
+## Memoirists / personal voice
+- **David Sedaris** — comic understatement, family scenes, deadpan one-liners as paragraph closers
+- **Anne Lamott** — confessional warmth, self-deprecating humor, sentence fragments for emphasis
+- **Anthony Bourdain** — profane confidence, food as window, baroque vocabulary mixed with kitchen-slang
+- **Rick Bragg** — Southern oral cadence, specific-detail compression, sentence rhythms from speech
+- **Mary Karr** — lyric memoir, line-break-tight sentences, Catholic-rural register
+## Aphorists / brief-form
+- **La Rochefoucauld** — epigrammatic, paired antithesis, cynical wit in one breath
+- **Friedrich Nietzsche** — aphoristic, hammer-blows of declaration, philosophical provocation
+- **E.M. Cioran** — pessimistic aphorism, paradoxical brevity, polished despair
+- **Eric Hoffer** — longshoreman intellectual, declarative wisdom, sociological observation
+## Academics-for-public
+- **Steven Pinker** — cognitive-science precision, numbered argument, defending Enlightenment, ironic asides
+- **Richard Dawkins** — precise zoological prose, metaphor as scaffolding, polemical clarity
+- **Robert Sapolsky** — neuroendocrine framing, dense citation, jokes nested in dense paragraphs
+- **Carl Sagan** — cosmic-scale lyric, science-as-wonder cadence, accessible majesty
+- **Daniel Kahneman** — System 1 / System 2 framing, dispassionate exposition, named cognitive biases
+- **Yuval Noah Harari** — sweeping historical synthesis, declarative simplification, "imagined orders" type concepts
+- **Jared Diamond** — continent-scale comparison, geographic-determinist framing, list of factors enumeration
+## Religious / philosophical (modern)
+- **C.S. Lewis** — clear analogical prose, common-sense apologetics, "imagine that" framing
+- **G.K. Chesterton** — paradox-as-rhetoric, joyful contrarian, accumulating images per sentence
+- **Thomas Merton** — contemplative prose, slow paragraphs, Catholic-Buddhist hybrid
+## Business / leadership
+- **Peter Drucker** — dry analytical, principle-then-example, professorial calm
+- **Andy Grove** — engineering-direct management, framework-naming, lean instructional
+- **Ben Horowitz** — war-stories anchored to lessons, hip-hop epigraphs, conversational toughness
+## Short-form / Twitter native
+- **Visakan Veerasamy** — thread-shaped, hyperlinked references, generous tone, recursive callbacks
+- **Hari Kondabolu / Twitter-essayist** — one-liner punch followed by longer unpacking, comic timing
+## Genre-specific (modern poetic / lyric prose)
+- **Ocean Vuong** — lyric memoir, line-conscious paragraphs, image-as-argument
+- **Maggie Nelson** — theory-personal hybrid, numbered fragments, citations interleaved
+## Reminder
+These style notes describe **prose features only**. When matching the user's corpus against an author from this list, cite the author's prose feature — not their content or topic. If you find yourself matching "the user writes about topic X therefore resembles author Y," stop and re-do the match. The correct framing is "the user uses period-heavy declarative cadence with definitional pivots, which matches Peterson's lecture-cadence prose pattern."

package/dist/plugins/authors-voice/skill/catalog/fingerprints.md ADDED Viewed

@@ -0,0 +1,175 @@
+# Fingerprints Catalog
+> Exact presentation choices the author makes consistently. LLMs default to training-data defaults for all of these unless explicitly told the user's choice. So we measure each one and emit a one-liner.
+> Use this when analyzing a user's corpus.
+## How to use this catalog
+For each fingerprint below:
+1. Read the user's corpus.
+2. Count the relevant variants.
+3. Apply the decision rule (mostly ≥3 total observations, then a ratio threshold).
+4. Emit a one-line fingerprint line if confidence ≥ medium. Skip if `n/a` (not enough signal).
+The output goes into `voice/fingerprints.md` as a bullet list, each line in the format `<Label>: <value>`.
+## Confidence rule (used by most binary fingerprints)
+Total observations = `a + b` where `a` is the count of one variant and `b` the other.
+- If total < 3 → **`n/a` (low confidence)**, skip emitting.
+- If `a / total ≥ 0.85` → **`a` wins, high confidence**, emit.
+- If `a / total ≤ 0.15` → **`b` wins, high confidence**, emit.
+- If `0.70 ≤ a / total < 0.85` → **`a` wins, medium confidence**, emit.
+- If `0.15 < a / total ≤ 0.30` → **`b` wins, medium confidence**, emit.
+- Otherwise (0.30 < ratio < 0.70) → **`mixed` (low confidence)**, emit as "inconsistent" only if the user explicitly wants to see mixed signals; otherwise skip.
+For the curious: this is asymmetric because we want either a clear choice (≥70%) or to skip. The 0.30-0.70 band is "the author isn't actually making a consistent choice" — emitting a fingerprint there would mislead.
+## The eight fingerprints
+### 1. Em-dash spacing
+What to measure:
+- `spaced` count = occurrences of `<whitespace>—<whitespace>` (e.g., `word — word`)
+- `unspaced` count = occurrences of `<non-whitespace>—<non-whitespace>` (e.g., `word—word`)
+Apply the binary confidence rule.
+Output map:
+- `spaced` → `Em-dash spacing: word — word`
+- `unspaced` → `Em-dash spacing: word—word`
+- `mixed` → `Em-dash spacing: inconsistent`
+Note: if the user is on a NEVER em-dashes rule (didn't clear the punctuation hurdle), skip this fingerprint entirely.
+### 2. Ellipsis style
+What to measure:
+- `three_dots` count = occurrences of `...` (three ASCII dots, not part of a longer run)
+- `unicode` count = occurrences of `…` (single Unicode character)
+- `spaced_dots` count = occurrences of `. . .` (dots separated by spaces)
+Decision:
+- If total < 3 → `n/a`, skip.
+- If max variant / total ≥ 0.85 → emit the winning style.
+- Otherwise → `mixed`.
+Output map:
+- `three_dots` → `Ellipsis style: ...`
+- `unicode` → `Ellipsis style: …`
+- `spaced_dots` → `Ellipsis style: . . .`
+- `mixed` → `Ellipsis style: inconsistent`
+### 3. Sentence-initial conjunction capitalization
+What to measure:
+- `caps` count = occurrences of `[.!?]<whitespace>(But|And|So|Or|Yet)\b` — i.e., starts a new sentence with capitalized conjunction
+- `lower` count = occurrences of `[.!?]<whitespace>(but|and|so|or|yet)\b` — same but lowercase (unusual, only if author uses a stylistic comma-after-period thing)
+Apply the binary confidence rule.
+Output map:
+- `caps` → `Sentence-initial "But/And/So": capitalized (". But")`
+- `lower` → `Sentence-initial "But/And/So": lowercase (", but")`
+- `mixed` → `Sentence-initial "But/And/So": inconsistent`
+### 4. Oxford comma
+What to measure (rough — accept some noise):
+- `withOxford` = sequences matching `<word>, <word>(...), and <word>` or `<word>, <word>(...), or <word>`
+- `withoutOxford` = sequences matching `<word>, <word>(...) and <word>` or `<word>, <word>(...) or <word>` (no comma before "and"/"or")
+Apply the binary confidence rule.
+Output map:
+- `yes` → `Oxford comma: yes`
+- `no` → `Oxford comma: no`
+- `mixed` → `Oxford comma: inconsistent`
+### 5. Quote style
+What to measure:
+- `straight` count = occurrences of `"`
+- `curly` count = occurrences of `"` or `"`
+Apply the binary confidence rule.
+Output map:
+- `straight` → `Quote style: straight "..."`
+- `curly` → `Quote style: curly "..."`
+- `mixed` → `Quote style: inconsistent`
+### 6. Capitalization after colon
+What to measure:
+- `upper` count = occurrences of `: <Uppercase letter>`
+- `lower` count = occurrences of `: <lowercase letter>`
+Apply the binary confidence rule. Require total ≥ 3.
+Output map:
+- `upper` → `Capitalization after colon: upper`
+- `lower` → `Capitalization after colon: lower`
+- `mixed` → `Capitalization after colon: inconsistent`
+### 7. Contractions
+What to measure:
+- `contracted` count = occurrences of `<word>'<s|re|ve|ll|d|t|m>` (e.g., `don't`, `I'm`, `we'll`)
+- `expanded` count = occurrences of the literal phrases `do not`, `does not`, `did not`, `is not`, `are not`, `was not`, `were not`, `cannot`, `will not`, `would not`, `should not`, `could not`, `have not`, `has not`, `had not`, `I am`, `you are`, `we are`, `they are`, `it is`, `that is`, `there is`, `let us`
+Apply the binary confidence rule.
+Output map:
+- `yes` → `Contractions: uses contractions`
+- `no` → `Contractions: avoids contractions`
+- `mixed` → `Contractions: mixed`
+### 8. Paragraph length
+What to measure:
+- Split the corpus on double-newlines (`\n\n+`) to get paragraphs.
+- For each paragraph, count sentences (split on `[.!?]<whitespace>`).
+- Compute the average sentences-per-paragraph.
+Require at least 3 paragraphs. Otherwise `n/a`, skip.
+Decision:
+- avg ≤ 2 → `short`, high confidence
+- 2 < avg ≤ 4 → `medium`, high confidence
+- avg ≥ 6 → `long`, high confidence
+- 4 < avg < 6 → `mixed`, medium confidence
+Output map:
+- `short` → `Paragraph length: 1–2 sentences`
+- `medium` → `Paragraph length: 3–4 sentences`
+- `long` → `Paragraph length: 5+ sentences`
+- `mixed` → `Paragraph length: varied`
+## Output structure
+The agent writes `voice/fingerprints.md` as:
+```markdown
+# Presentation Fingerprints
+> Auto-generated from `voice/corpus/`. Exact presentation choices the user makes consistently.
+> Match these in every generated response — LLMs default to training data otherwise.
+- Em-dash spacing: word — word
+- Oxford comma: yes
+- Quote style: straight "..."
+- Contractions: uses contractions
+- Paragraph length: 3–4 sentences
+## Manual Overrides
+<!-- Override any auto-detected fingerprint here. These win over the auto-detected ones above. -->
+<!-- Example: -->
+<!-- - Em-dash spacing: never use em-dashes at all. -->
+<!-- - Quote style: straight always. -->
+```
+The `## Manual Overrides` section MUST be preserved across regenerations.

package/dist/plugins/authors-voice/skill/catalog/hurdle.md ADDED Viewed

@@ -0,0 +1,76 @@
+# Authenticity Hurdle
+> The core decision rule: when does a pattern in the user's corpus count as "their voice" versus "AI contamination"?
+## The problem
+If we just emit NEVER rules for everything in the AI tells catalog, we'll strip out words the user actually likes and uses. Example: Mark Manson regularly uses "crucial" and "vital." Banning those flattens his voice.
+If we don't emit any rules unless the word appears zero times, we keep AI contamination. Example: an essay that the user originally drafted with ChatGPT help and then edited still contains stray "delves" and "valuable insights" — that's not their voice, that's leftover residue.
+The hurdle resolves this: a pattern is "authentic" only if it appears at signature frequency. Below that → contamination, ban it.
+## The thresholds
+| Category | Min count | Min rate per 1000 words | Why |
+|----------|-----------|-------------------------|-----|
+| `word` | 2 | 0.15 | Single words are noise-prone. Need at least 2 and a non-trivial rate. |
+| `transition` | 2 | 0.15 | Same as words. |
+| `phrase` | 2 | 0.05 | Phrases are more distinctive — a lower rate still signals deliberate use. |
+| `punctuation` | 15 | 1.5 | Punctuation is denser than diction. A few em-dashes is normal; signature use means many. |
+## The decision
+For each AI tell, given `count` (how many times it appears in the corpus) and `words` (total word count of the corpus):
+```
+rate_per_1k = (count / words) * 1000
+passes = count >= min_count AND rate_per_1k >= min_rate_per_1k
+```
+If `passes` → **preserve** (no NEVER rule). The user genuinely uses this.
+If not `passes` → **forbid** (emit NEVER rule). This includes:
+- Items that never appear (forbid because training data will push them back in)
+- Items that appear once or twice but below the rate threshold (forbid because it's likely contamination, not signature)
+## Worked examples
+**Word "delve" in a 2000-word corpus:**
+- count = 0 → fails (count < 2) → **forbid**: `NEVER "delve".`
+- count = 1 → fails (count < 2) → **forbid**: `NEVER "delve".` _(below hurdle — flagged in status.md as likely contamination)_
+- count = 2, rate = 1.0/1k → passes (count ≥ 2, rate ≥ 0.15) → **preserve**, no rule.
+**Em-dashes in a 5000-word corpus:**
+- count = 8, rate = 1.6/1k → fails (count < 15) → **forbid**: `NEVER "em-dashes".`
+- count = 20, rate = 4.0/1k → passes → **preserve**, no rule.
+The em-dash hurdle is intentionally hard. Most human writers don't clear it. The few who do (people with serious published-prose backgrounds) get to keep their em-dashes.
+## What this implies about tier progression
+The skill's tier system tracks corpus word count:
+- Tier 1 (300–1k words): too small to clear most hurdles. NEVER rules emit aggressively.
+- Tier 2 (1k–5k): some words start to pass. Most phrases still fail (they need ≥2 instances).
+- Tier 3 (5k–20k): phrases and many words can pass. Em-dash hurdle still hard.
+- Tier 4 (20k+): em-dash hurdle achievable. Profile is AV-grade.
+The hurdle stays the same across tiers. What changes is that more corpus means more chances for the user's signature patterns to clear it.
+## Status reporting
+For items that appear in the corpus but fail the hurdle (the "below_hurdle" set), flag them in `voice/status.md` under a "Below-Hurdle Detections" section. Format:
+```
+- `delve` — appears 1x, rate 0.5/1k
+- `however` — appears 1x, rate 0.5/1k
+```
+This is informational — the user can see what got flagged as contamination. It helps them notice "oh, I have a habit of letting AI drafts through" or alternatively "wait, I actually do use that word, let me add it to manual preserves."
+## Edge cases
+- **Count = 0**: always forbid. Don't list in below_hurdle (because it isn't "present in corpus").
+- **Tied counts in fingerprint binary**: if `a == b` (exactly equal), treat as `mixed`.
+- **Total < 3 in fingerprint**: skip entirely — not enough signal.
+- **Negative numbers, NaN, infinity**: shouldn't happen, but guard with `count = max(0, count)` if you implement defensively.