openwriter 0.35.1 → 0.36.0

Files changed (32)
  1. package/dist/client/assets/{index-Be_l2OOL.css → index-B5p6e-z0.css} +1 -1
  2. package/dist/client/assets/{index-BPDt3Psd.js → index-BMhKsQ_t.js} +53 -53
  3. package/dist/client/index.html +2 -2
  4. package/dist/plugins/authors-voice/skill/LICENSE +21 -0
  5. package/dist/plugins/authors-voice/skill/README.md +126 -0
  6. package/dist/plugins/authors-voice/skill/SKILL.md +151 -0
  7. package/dist/plugins/authors-voice/skill/catalog/ai-tells.md +144 -0
  8. package/dist/plugins/authors-voice/skill/catalog/anchor-prompt.md +189 -0
  9. package/dist/plugins/authors-voice/skill/catalog/author-hints.md +119 -0
  10. package/dist/plugins/authors-voice/skill/catalog/fingerprints.md +175 -0
  11. package/dist/plugins/authors-voice/skill/catalog/hurdle.md +76 -0
  12. package/dist/plugins/authors-voice/skill/catalog/post-write-audit.md +105 -0
  13. package/dist/plugins/authors-voice/skill/docs/analysis.md +31 -0
  14. package/dist/plugins/authors-voice/skill/docs/anchor-iteration.md +176 -0
  15. package/dist/plugins/authors-voice/skill/docs/api/import.md +78 -0
  16. package/dist/plugins/authors-voice/skill/docs/api/protocol.md +140 -0
  17. package/dist/plugins/authors-voice/skill/docs/api/setup.md +37 -0
  18. package/dist/plugins/authors-voice/skill/docs/api/tools.md +102 -0
  19. package/dist/plugins/authors-voice/skill/docs/api/troubleshooting.md +7 -0
  20. package/dist/plugins/authors-voice/skill/docs/apply-protocol-deep.md +191 -0
  21. package/dist/plugins/authors-voice/skill/docs/context-hygiene.md +33 -0
  22. package/dist/plugins/authors-voice/skill/docs/setup.md +74 -0
  23. package/dist/plugins/authors-voice/skill/docs/tiers.md +13 -0
  24. package/dist/plugins/authors-voice/skill/package.json +35 -0
  25. package/dist/plugins/authors-voice/skill/prompts/skeleton.md +29 -0
  26. package/dist/plugins/authors-voice/skill/voice/README.md +51 -0
  27. package/dist/plugins/authors-voice/skill/voice/corpus/.gitkeep +0 -0
  28. package/dist/server/documents.js +7 -10
  29. package/dist/server/state.js +27 -7
  30. package/dist/server/title-resolve.js +87 -0
  31. package/dist/server/workspaces.js +10 -4
  32. package/package.json +1 -1
@@ -0,0 +1,189 @@
1
+ # Anchor Prompt
2
+
3
+ > The matcher logic the agent follows to produce a voice anchor blend entirely in-agent — no network, no hosted service.
4
+
5
+ ## Role
6
+
7
+ You are a literary stylometry analyst. You identify which training-data authors a piece of writing mechanically resembles. You match on **prose mechanics only** — sentence structure, punctuation patterns, rhetorical moves, discourse rhythm, vocab register. You **never** match on content, themes, or subject matter.
8
+
9
+ If the writer discusses biology, you do NOT match them to Dawkins because of the topic. You match them to whichever author's *sentence construction* most resembles theirs. If the writer discusses startups, you do NOT match them to Paul Graham because of the topic. You match on prose mechanics — how the sentences are shaped, how paragraphs transition, how punctuation lands.
10
+
11
+ ## Hard rules
12
+
13
+ 1. Score ONLY on prose mechanics. Forbidden: matching on themes, topics, subject matter, ideology, worldview, or what the writer is "about."
14
+ 2. Output exactly 3-5 authors with weights summing to exactly 100.
15
+ 3. Each author's `features_matched` MUST cite at least 2 specific PROSE features. Each feature must describe HOW sentences/paragraphs are constructed, NOT what they are about.
16
+ 4. Prefer authors from the author hints list — the model has the deepest compressed style modes for them. You MAY go outside the list if a clearly better stylistic match exists, but only when justified by prose features.
17
+ 5. If the sample is too short or too uniform to distinguish style confidently, return fewer authors and flag confidence as `low`.
18
+
19
+ ## Inputs the agent loads before running this protocol
20
+
21
+ 1. **The user's corpus** — every file in `voice/corpus/` (strip YAML frontmatter). **Keep each sample separate** — do not concatenate yet. You need per-sample analysis before the blend.
22
+ 2. **The deterministic stats** — read `voice/stats.md` if it exists. The sentence-length distribution and punctuation density numbers are the anchors that prevent agent drift. If `stats.md` doesn't exist yet, run the Analysis Protocol first to generate it.
23
+ 3. **The author hint list** — read `catalog/author-hints.md` for curated authors with heavy training-data representation, grouped by category.
24
+ 4. **The conversational guard** — set aside everything you know about the user from the current conversation (their projects, their interests, the topic they've been discussing). Score the corpus as if you've never read it before, with no context.
25
+
26
+ ## Per-sample register analysis (run BEFORE final scoring)
27
+
28
+ This step prevents the single-large-sample bias that overweights features concentrated in one piece. The matcher used to concatenate all samples into one text and score volume-weighted — which meant the largest sample's signature features got baked into the blend as if they were corpus-wide patterns. They aren't.
29
+
30
+ For each sample in `voice/corpus/`, record:
31
+
32
+ 1. **Word count** — split on whitespace, count tokens.
33
+ 2. **Address mode** — first-person (`I/me/my/we/our`), second-person (`you/your`), third-person (`he/she/they/the modern man`), or mixed. If mixed, note the dominant mode.
34
+ 3. **Tone register** — instructional, analytical/expository, polemical, conversational, reflective, narrative. Pick the dominant register.
35
+ 4. **Signature moves present** — list 1-3 prose mechanics that stand out in THIS sample (e.g., anaphoric `Your X.` stacking, definitional `X is Y` pivots, dismissive concession `whether X matters less than Y`).
36
+
37
+ Build a sample table:
38
+
39
+ | Sample | Words | Address | Register | Signature moves |
40
+ |--------|-------|---------|----------|-----------------|
41
+ | 001 | 200 | mixed (you + we) | instructional | … |
42
+ | 002 | 100 | generic-you | analytical | … |
43
+ | ... | | | | |
44
+
45
+ Then compute:
46
+
47
+ - **Volume share** per sample: `sample_word_count / total_corpus_words * 100`. Flag any sample with >25% volume share — its features will skew the blend unless deliberately balanced.
48
+ - **Register clusters**: group samples by register. Are there 2+ clearly distinct registers (e.g., third-person expository AND second-person instructional)? If yes, this is a **multi-register corpus** — surface as a warning in the final output.
49
+
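As an illustration of the volume-share step above (not part of the packaged file), here is a minimal Python sketch. The function names `word_count` and `volume_shares` and the sample IDs are hypothetical; the whitespace split and the 25% flag threshold come from the text.

```python
def word_count(text: str) -> int:
    """Count tokens by splitting on whitespace, as the protocol describes."""
    return len(text.split())


def volume_shares(word_counts: dict[str, int], flag_above: float = 25.0):
    """Return (shares, flagged): each sample's percentage of corpus words,
    plus the samples whose share exceeds the flag threshold."""
    total = sum(word_counts.values())
    if total == 0:
        return {}, []
    shares = {name: count / total * 100 for name, count in word_counts.items()}
    # Strictly greater than the threshold, matching ">25% volume share".
    flagged = [name for name, share in shares.items() if share > flag_above]
    return shares, flagged


# Hypothetical corpus: one large sample and two smaller ones.
shares, flagged = volume_shares({"001": 200, "002": 100, "003": 100})
```

In this hypothetical corpus only sample `001` is flagged, because a share of exactly 25% does not exceed the threshold.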
50
+ ## Register-aware feature validation
51
+
52
+ When you're about to cite a feature in `features_matched` for any author in the blend, run this check:
53
+
54
+ 1. **In which samples does this feature appear?** List them.
55
+ 2. **What share of samples (by count) contains it?**
56
+ 3. **What share of corpus volume contains it?**
57
+
58
+ Then apply the rules:
59
+
60
+ - **Feature appears in ≥40% of samples by count** → corpus-wide signature, valid to cite without caveat.
61
+ - **Feature appears in <40% of samples by count BUT ≥40% of volume** → concentrated in fewer-but-larger samples. **Cite with caveat**: append `(concentrated in samples N, M — not corpus-wide)`.
62
+ - **Feature appears in <33% of samples AND <33% of volume** → not a real corpus signature. Do NOT cite. If this was your strongest evidence for an author, drop the author from the blend.
63
+
64
+ This rule kills the "one big sample's signature shows up as the dominant author weight" bug. The Manson-anaphora pattern from a single 480-word piece doesn't get to drive a 38% weight — unless it's actually replicated across multiple samples.
65
+
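The three rules above can be sketched as a small classifier (an illustration, not part of the packaged file). The function name `validate_feature` and the return labels are hypothetical. The rules as written do not cover every combination (for example, under 40% of samples with 33-40% of volume), so this sketch returns `"unspecified"` for those cases rather than guessing.

```python
def validate_feature(samples_with_feature: set[str],
                     word_counts: dict[str, int]) -> str:
    """Classify a candidate feature by sample-count share and volume share."""
    total_samples = len(word_counts)
    total_words = sum(word_counts.values())
    if total_samples == 0 or total_words == 0:
        return "unspecified"

    present = [s for s in word_counts if s in samples_with_feature]
    count_share = len(present) / total_samples
    volume_share = sum(word_counts[s] for s in present) / total_words

    if count_share >= 0.40:
        return "cite"               # corpus-wide signature
    if volume_share >= 0.40:
        return "cite_with_caveat"   # concentrated in fewer, larger samples
    if count_share < 0.33 and volume_share < 0.33:
        return "do_not_cite"        # not a real corpus signature
    return "unspecified"            # combination not covered by the rules


# Hypothetical corpus with one dominant 480-word sample.
corpus = {"001": 480, "002": 100, "003": 100, "004": 100, "005": 100}
verdict = validate_feature({"001"}, corpus)
```

Here a feature found only in the 480-word sample appears in 20% of samples but about 55% of volume, so it is cited with the caveat instead of driving the blend on its own.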
66
+ ## Scoring dimensions
67
+
68
+ Score the user's prose against each of these dimensions. Use these as the basis for feature matching:
69
+
70
+ 1. **Sentence length distribution** — what percentage short / medium / long / very-long? How does this compare to each candidate author's known distribution? Use the numbers from `voice/stats.md`.
71
+ 2. **Punctuation profile** — em-dash density (telling signal), semicolon use, colon use (for reveals vs. lists), parenthetical frequency, question marks, exclamations. Use the numbers from `voice/stats.md`.
72
+ 3. **Vocab register** — academic / casual / instructional / aphoristic / journalistic / literary / polemical. Diction level. Use of jargon, slang, formal vocabulary.
73
+ 4. **Rhetorical moves** — definitional pivots ("X is Y"), anaphora ("Your X. Your Y. Your Z."), rule-of-three, dismissive concession ("X matters less than Y"), concept-coinage (naming a concept then referring to it as proper-noun), claim-then-evidence vs evidence-then-claim, direct address.
74
+ 5. **Discourse patterns** — how paragraphs transition. Use of "But", "So", "However", "Moreover". Question pivots. Time markers vs logical connectors.
75
+ 6. **Paragraph rhythm** — uniform vs varied length. Where short paragraphs land (impact moments? insights? transitions?).
76
+ 7. **Person & address** — first-person dominance, second-person directness, third-person formality.
77
+ 8. **Hedging vs assertion** — frequency of qualifiers ("perhaps", "may", "could"), certainty markers, presence/absence of softeners.
78
+
79
+ ## Examples of valid `features_matched` entries
80
+
81
+ - `period-heavy short sentences (avg 9 words/sentence matches their pattern)`
82
+ - `concept-coinage with definitional pivots ("Sleep debt is X")`
83
+ - `anaphora across consecutive sentences ("Your job. Your family. Your kids.")`
84
+ - `low em-dash density (0.3/1000 words matches their 0.4)`
85
+ - `rule-of-three concrete lists, not abstract flourishes`
86
+ - `dismissive concession move ("whether X holds matters less than Y")`
87
+
88
+ ## Examples of INVALID `features_matched` entries
89
+
90
+ - `writes about biology` — content
91
+ - `anti-establishment themes` — content
92
+ - `interested in masculinity` — content
93
+ - `discusses religion` — content
94
+ - `concerned with personal development` — content
95
+
96
+ All of these are forbidden because they describe what the writer is ABOUT, not how the prose is constructed.
97
+
98
+ ## Self-criticism step
99
+
100
+ Before finalizing the blend, re-read each `features_matched` entry. For each entry, ask:
101
+
102
+ - Does this describe HOW the sentences/paragraphs are constructed? (good)
103
+ - Or does it describe WHAT the text is about — topics, themes, subject matter, ideology, worldview? (forbidden)
104
+
105
+ If ANY entry describes content rather than mechanics, replace it with a prose feature or remove that author from the blend.
106
+
107
+ After your check, set the self-check flags:
108
+
109
+ - `any_thematic_reasoning`: `true` if you still had thematic reasoning that you couldn't fully fix. Otherwise `false`.
110
+ - `confidence`:
110
+   - `high` if corpus total > 800 words AND prose features are clearly distinguishable
111
+   - `medium` if corpus total 400-800 words OR features are mixed/ambiguous
112
+   - `low` if corpus total < 400 words OR samples are too uniform to distinguish
114
+
115
+ ## Output
116
+
117
+ Write `voice/anchor.md` in this exact format:
118
+
119
+ ```markdown
120
+ # Writer's Voice Blend
121
+
122
+ > Generated in-agent by the writers-voice skill (fully local).
123
+ > Pasted on YYYY-MM-DD.
124
+ > Context: <general | tweets | essays | newsletter | email>
125
+
126
+ ## Blend
127
+
128
+ - **<weight>% <Author Name>**
128
+   - <prose feature 1>
129
+   - <prose feature 2>
130
+   - <optional prose feature 3+>
131
+ - **<weight>% <Author Name>**
132
+   - <prose feature 1>
133
+   - <prose feature 2>
135
+ - ...
136
+
137
+ ## Per-Sample Composition
138
+
139
+ | Sample | Words | Volume % | Address | Register | Signature moves |
140
+ |--------|-------|----------|---------|----------|-----------------|
141
+ | 001 | 200 | 14% | mixed | instructional | … |
142
+ | 002 | 100 | 7% | generic-you | analytical | … |
143
+
144
+ ## Register Diversity
145
+
146
+ - **Detected registers:** <list, e.g., "third-person expository, second-person instructional, first-person reflective">
147
+ - **Multi-register corpus:** <yes | no>
148
+ - **Dominant-sample warning:** <none | "Sample N is X% of corpus volume — its signature features drive the blend disproportionately. Consider running a context-specific anchor for the other registers (see Multi-Register Anchors in SKILL.md).">
149
+
150
+ ## Self-check
151
+
152
+ - Confidence: <high | medium | low>
153
+ - Any thematic reasoning: <true | false>
154
+ - Notes: <one short sentence about sample adequacy and match quality>
155
+
156
+ ## Apply Directive
157
+
158
+ Write in this blended style. Match each author's prose mechanics in proportion
159
+ to weight. Maintain across the conversation.
160
+
161
+ ## When this anchor doesn't fit
162
+
163
+ If you're writing in a register that this corpus DOESN'T represent well (e.g., your corpus is mostly conversational but you're drafting a book in third-person expository), this anchor will pull you toward the wrong register. Options:
164
+
165
+ 1. Add 2-3 samples in the missing register to `voice/corpus/`, then re-run the Anchor Protocol.
166
+ 2. Maintain a separate context-specific anchor (see "Multi-Register Anchors" in SKILL.md). File pattern: `voice/anchor-<context>.md` (e.g., `voice/anchor-book.md`).
167
+ ```
168
+
169
+ Weights are positive integers summing to exactly 100. Authors listed in descending weight order.
170
+
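The structural constraints on a blend (author count, integer weights summing to 100, descending order, at least two features per author) can be checked mechanically. The sketch below is an illustration, not part of the packaged file; `BlendEntry`, `blend_problems`, and the `low_confidence` switch are hypothetical names. The switch is an assumption that models hard rule 5, which allows fewer authors when confidence is `low`.

```python
from dataclasses import dataclass, field


@dataclass
class BlendEntry:
    author: str
    weight: int
    features: list[str] = field(default_factory=list)


def blend_problems(blend: list[BlendEntry], low_confidence: bool = False) -> list[str]:
    """Return human-readable rule violations; an empty list means the blend passes."""
    problems = []
    min_authors = 1 if low_confidence else 3
    if not min_authors <= len(blend) <= 5:
        problems.append(f"expected {min_authors}-5 authors, got {len(blend)}")

    weights = [entry.weight for entry in blend]
    if any(not isinstance(w, int) or w <= 0 for w in weights):
        problems.append("weights must be positive integers")
    elif sum(weights) != 100:
        problems.append(f"weights sum to {sum(weights)}, not 100")

    if weights != sorted(weights, reverse=True):
        problems.append("authors are not in descending weight order")

    for entry in blend:
        if len(entry.features) < 2:
            problems.append(f"{entry.author}: needs at least 2 prose features")
    return problems


# Placeholder authors and features, for illustration only.
ok = [
    BlendEntry("Author A", 50, ["feature 1", "feature 2"]),
    BlendEntry("Author B", 30, ["feature 1", "feature 2"]),
    BlendEntry("Author C", 20, ["feature 1", "feature 2"]),
]
issues = blend_problems(ok)
```

This checks structure only. Whether a feature describes prose mechanics rather than content still needs the self-criticism step.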
171
+ ## Recommending a multi-register split
172
+
173
+ If your register-diversity analysis above flagged the corpus as multi-register, do NOT just produce one blend. After writing the main `voice/anchor.md`, tell the user:
174
+
175
+ > "Your corpus spans multiple registers: [list them]. The blend above represents the dominant register ([register name], ~X% of corpus volume). If you write in other registers, I recommend generating a separate anchor file per register. Want me to do that now? — I'll re-run the matcher with only the samples that belong to each register, and save the results as `voice/anchor-<context>.md`."
176
+
177
+ If the user says yes, run the matcher once per register, with only the samples that match that register. Save each as `voice/anchor-<context>.md` (where `<context>` is a short slug like `book`, `essay`, `tweets`, `instructional`, `expository`). The main `voice/anchor.md` stays as the corpus-wide blend with the multi-register warning.
178
+
179
+ ## How this protocol runs
180
+
181
+ This is the **only** way to produce an anchor — entirely on the user's own agent,
182
+ no network, no hosted service, no cost. Best launched as a sub-agent (see
183
+ "Launching the anchor as a sub-agent" in `docs/setup.md`) so the rubric stays out
184
+ of the main session. It needs a corpus on disk (≥300 words); if the user has none,
185
+ seed a few samples first. It re-runs over the full corpus on demand and supports
186
+ per-register anchor files (`voice/anchor-<context>.md`).
187
+
188
+ > The hosted matcher at `openwriter.io/writers-voice` is **deprecated** — do not
189
+ > route users to it. All anchor derivation is local.
@@ -0,0 +1,119 @@
1
+ # Author Hints
2
+
3
+ > Curated list of authors with heavy training-data representation across registers.
4
+ > The agent picks 3-5 from this list (or goes outside if a clearly better stylistic match exists).
5
+ > Goal: high consistency, broad coverage of style modes.
6
+ >
7
+ > Each entry describes the author's characteristic PROSE FEATURES — not content, not topics.
8
+
9
+ ## Literary stylists — sentence as instrument
10
+
11
+ - **Ernest Hemingway** — iceberg theory, short declarative, almost no adjectives, hard nouns and verbs
12
+ - **Cormac McCarthy** — no commas in dialogue, long unpunctuated runs, biblical cadence
13
+ - **Joan Didion** — measured precision, lists of three, the specific over the general
14
+ - **David Foster Wallace** — maximalist sentences, footnote energy, recursive parentheticals, vocabulary
15
+ - **Toni Morrison** — rhythmic repetition, sensory specificity, vernacular woven with formal
16
+ - **James Baldwin** — long balanced sentences, moral urgency in clause stacking, semicolons used as breath marks
17
+ - **Annie Dillard** — observational compression, present-tense immediacy, sentence as image
18
+ - **Marilynne Robinson** — theological cadence in plain words, long meditative sentences, Calvinist patience
19
+
20
+ ## Essayists / analytical
21
+
22
+ - **Paul Graham** — thinking out loud, simple words, short sentences mixed with one long earned conclusion
23
+ - **Patrick McKenzie** — long discursive sentences with parenthetical asides, domain-specific precision, dry humor
24
+ - **Tim Urban** — extended metaphors, conversational asides, building up frameworks with named characters
25
+ - **Scott Alexander** — rationalist sectioning, exhaustive enumeration of possibilities, fair-witness analysis
26
+ - **Ben Thompson** — business-strategy decomposition, recurring framework names, lots of "this is why"
27
+ - **Tyler Cowen** — compressed, list-heavy, blogger-shorthand, range of references in one paragraph
28
+ - **Malcolm Gladwell** — narrative anchor → general principle → reversal, three-act essay structure
29
+ - **Adam Grant** — research-anchored, paired contrasts, gentle prescriptive framing
30
+
31
+ ## Self-help / instructional / productivity
32
+
33
+ - **Mark Manson** — period-heavy clean prose, "Your X is Y" definitional moves, profane confidence, direct second-person
34
+ - **James Clear** — concept-coinage backbone, numbered enumeration, clean instructional cadence, named laws/frameworks
35
+ - **Ryan Holiday** — Stoic-instructional, drawing concepts from antiquity, clean delivery, repetition for emphasis
36
+ - **Tim Ferriss** — list-heavy, hack/protocol framing, second-person direct, capitalized concept names
37
+ - **Cal Newport** — academic-instructional, named frameworks, research-backed prescriptions, sober register
38
+ - **Greg McKeown** — one-idea-per-page rhythm, named principles, short paragraphs, prescriptive minimalism
39
+ - **Brené Brown** — vulnerability-as-rhetoric, personal anecdotes anchoring research, conversational warmth
40
+ - **Atomic Habits voice** — short-paragraph instruction, principle-then-example, mechanical-cause-and-effect language
41
+
42
+ ## Polemicists / contrarians / philosophical-provocative
43
+
44
+ - **Nassim Nicholas Taleb** — concept-coinage (Black Swan, antifragile), aphoristic stabs, attacking IYI, ancient-thinker citations
45
+ - **Jordan Peterson** — lecture cadence, biological-evolutionary framing, religious overlay, definitional pivots
46
+ - **Bronze Age Pervert** — baroque Nietzschean, mock-archaic spelling, vitalist anti-modernity, ironic register
47
+ - **Curtis Yarvin** — reactionary historical, concept-naming as branding, sneering wit, long allusive sentences
48
+ - **Camille Paglia** — punchy contrarian, aesthetic-biological frame, dense allusion, no hedging
49
+ - **Bryan Caplan** — libertarian-economic, direct refutation, hypothetical thought experiments, plain professorial prose
50
+
51
+ ## Tech / startup
52
+
53
+ - **Naval Ravikant** — aphoristic, tweet-shaped, concept-as-brand, distilled-wisdom cadence
54
+ - **Marc Andreessen** — manifesto-mode, accumulating short declarative lines, exhortation register
55
+ - **Sam Altman** — short, contrarian, "obvious in hindsight" framing, blog-post brevity
56
+ - **Peter Thiel** — paradox-as-thesis, philosophical-tech crossover, Strauss-influenced indirection
57
+ - **Joel Spolsky** — conversational tech-blog, anecdote-then-principle, signposting humor
58
+ - **Steve Yegge** — long discursive rants, programmer-culture inside jokes, accumulating digressions
59
+
60
+ ## Narrative non-fiction / journalism
61
+
62
+ - **Michael Lewis** — character-first reporting, scene-as-argument, clean unobtrusive prose
63
+ - **Sebastian Junger** — documentary precision, anthropological framing, present-tense narrative
64
+ - **Jon Krakauer** — present-tense urgency, sensory immediacy, restraint in adjective use
65
+ - **John McPhee** — list-as-paragraph, structural patterning, long sentences with specific detail
66
+ - **Ta-Nehisi Coates** — meditative-historical, repetition as emphasis, address as form of argument
67
+ - **Tom Wolfe** — New Journalism, exclamation, capitalization, italics for sound, voice-jumping
68
+ - **Joan Didion (essays)** — see literary; her journalism has the same compressed precision
69
+
70
+ ## Memoirists / personal voice
71
+
72
+ - **David Sedaris** — comic understatement, family scenes, deadpan one-liners as paragraph closers
73
+ - **Anne Lamott** — confessional warmth, self-deprecating humor, sentence fragments for emphasis
74
+ - **Anthony Bourdain** — profane confidence, food as window, baroque vocabulary mixed with kitchen-slang
75
+ - **Rick Bragg** — Southern oral cadence, specific-detail compression, sentence rhythms from speech
76
+ - **Mary Karr** — lyric memoir, line-break-tight sentences, Catholic-rural register
77
+
78
+ ## Aphorists / brief-form
79
+
80
+ - **La Rochefoucauld** — epigrammatic, paired antithesis, cynical wit in one breath
81
+ - **Friedrich Nietzsche** — aphoristic, hammer-blows of declaration, philosophical provocation
82
+ - **E.M. Cioran** — pessimistic aphorism, paradoxical brevity, polished despair
83
+ - **Eric Hoffer** — longshoreman intellectual, declarative wisdom, sociological observation
84
+
85
+ ## Academics-for-public
86
+
87
+ - **Steven Pinker** — cognitive-science precision, numbered argument, defending Enlightenment, ironic asides
88
+ - **Richard Dawkins** — precise zoological prose, metaphor as scaffolding, polemical clarity
89
+ - **Robert Sapolsky** — neuroendocrine framing, dense citation, jokes nested in dense paragraphs
90
+ - **Carl Sagan** — cosmic-scale lyric, science-as-wonder cadence, accessible majesty
91
+ - **Daniel Kahneman** — System 1 / System 2 framing, dispassionate exposition, named cognitive biases
92
+ - **Yuval Noah Harari** — sweeping historical synthesis, declarative simplification, "imagined orders" type concepts
93
+ - **Jared Diamond** — continent-scale comparison, geographic-determinist framing, list of factors enumeration
94
+
95
+ ## Religious / philosophical (modern)
96
+
97
+ - **C.S. Lewis** — clear analogical prose, common-sense apologetics, "imagine that" framing
98
+ - **G.K. Chesterton** — paradox-as-rhetoric, joyful contrarian, accumulating images per sentence
99
+ - **Thomas Merton** — contemplative prose, slow paragraphs, Catholic-Buddhist hybrid
100
+
101
+ ## Business / leadership
102
+
103
+ - **Peter Drucker** — dry analytical, principle-then-example, professorial calm
104
+ - **Andy Grove** — engineering-direct management, framework-naming, lean instructional
105
+ - **Ben Horowitz** — war-stories anchored to lessons, hip-hop epigraphs, conversational toughness
106
+
107
+ ## Short-form / Twitter native
108
+
109
+ - **Visakan Veerasamy** — thread-shaped, hyperlinked references, generous tone, recursive callbacks
110
+ - **Hari Kondabolu / Twitter-essayist** — one-liner punch followed by longer unpacking, comic timing
111
+
112
+ ## Genre-specific (modern poetic / lyric prose)
113
+
114
+ - **Ocean Vuong** — lyric memoir, line-conscious paragraphs, image-as-argument
115
+ - **Maggie Nelson** — theory-personal hybrid, numbered fragments, citations interleaved
116
+
117
+ ## Reminder
118
+
119
+ These style notes describe **prose features only**. When matching the user's corpus against an author from this list, cite the author's prose feature — not their content or topic. If you find yourself matching "the user writes about topic X therefore resembles author Y," stop and re-do the match. The correct framing is "the user uses period-heavy declarative cadence with definitional pivots, which matches Peterson's lecture-cadence prose pattern."
@@ -0,0 +1,175 @@
1
+ # Fingerprints Catalog
2
+
3
+ > Exact presentation choices the author makes consistently. LLMs default to training-data defaults for all of these unless explicitly told the user's choice. So we measure each one and emit a one-liner.
4
+ > Use this when analyzing a user's corpus.
5
+
6
+ ## How to use this catalog
7
+
8
+ For each fingerprint below:
9
+
10
+ 1. Read the user's corpus.
11
+ 2. Count the relevant variants.
12
+ 3. Apply the decision rule (mostly ≥3 total observations, then a ratio threshold).
13
+ 4. Emit a one-line fingerprint line if confidence ≥ medium. Skip if `n/a` (not enough signal).
14
+
15
+ The output goes into `voice/fingerprints.md` as a bullet list, each line in the format `<Label>: <value>`.
16
+
17
+ ## Confidence rule (used by most binary fingerprints)
18
+
19
+ Total observations = `a + b` where `a` is the count of one variant and `b` the other.
20
+
21
+ - If total < 3 → **`n/a` (low confidence)**, skip emitting.
22
+ - If `a / total ≥ 0.85` → **`a` wins, high confidence**, emit.
23
+ - If `a / total ≤ 0.15` → **`b` wins, high confidence**, emit.
24
+ - If `0.70 ≤ a / total < 0.85` → **`a` wins, medium confidence**, emit.
25
+ - If `0.15 < a / total ≤ 0.30` → **`b` wins, medium confidence**, emit.
26
+ - Otherwise (0.30 < ratio < 0.70) → **`mixed` (low confidence)**, emit as "inconsistent" only if the user explicitly wants to see mixed signals; otherwise skip.
27
+
28
+ For the curious: the middle band is deliberately wide because we want either a clear choice (≥70%) or to skip. The 0.30-0.70 band means "the author isn't actually making a consistent choice", and emitting a fingerprint there would mislead.
29
+
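The confidence rule can be expressed as a single function. This is a sketch for illustration (not part of the packaged file); the name `binary_confidence` and the `(winner, confidence)` return shape are assumptions, while the thresholds are the ones listed above.

```python
def binary_confidence(a: int, b: int) -> tuple[str, str]:
    """Return (winner, confidence) for a two-variant fingerprint.

    winner is "a", "b", "mixed", or "n/a".
    """
    total = a + b
    if total < 3:
        return ("n/a", "low")       # not enough observations, skip emitting
    ratio = a / total
    if ratio >= 0.85:
        return ("a", "high")
    if ratio <= 0.15:
        return ("b", "high")
    if ratio >= 0.70:
        return ("a", "medium")
    if ratio <= 0.30:
        return ("b", "medium")
    return ("mixed", "low")         # 0.30 < ratio < 0.70


# Example: 9 observations of variant a against 1 of variant b.
winner, confidence = binary_confidence(9, 1)
```

Checking the high-confidence bands first means the medium checks only see ratios strictly between 0.15 and 0.85.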
30
+ ## The eight fingerprints
31
+
32
+ ### 1. Em-dash spacing
33
+
34
+ What to measure:
35
+ - `spaced` count = occurrences of `<whitespace>—<whitespace>` (e.g., `word — word`)
36
+ - `unspaced` count = occurrences of `<non-whitespace>—<non-whitespace>` (e.g., `word—word`)
37
+
38
+ Apply the binary confidence rule.
39
+
40
+ Output map:
41
+ - `spaced` → `Em-dash spacing: word — word`
42
+ - `unspaced` → `Em-dash spacing: word—word`
43
+ - `mixed` → `Em-dash spacing: inconsistent`
44
+
45
+ Note: if the user is on a NEVER em-dashes rule (didn't clear the punctuation hurdle), skip this fingerprint entirely.
46
+
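One way to count the two em-dash variants is with lookaround regular expressions, sketched below (an illustration, not part of the packaged file). The name `em_dash_counts` is hypothetical. Lookarounds are used so that chained dashes such as `a—b—c` are each counted; half-spaced cases such as `word —word` fall into neither bucket, which is an assumption of this sketch.

```python
import re

# An em-dash with whitespace on both sides, e.g. "word — word".
SPACED = re.compile(r"(?<=\s)—(?=\s)")
# An em-dash with non-whitespace on both sides, e.g. "word—word".
UNSPACED = re.compile(r"(?<=\S)—(?=\S)")


def em_dash_counts(text: str) -> tuple[int, int]:
    """Return (spaced, unspaced) em-dash counts for the corpus text."""
    return len(SPACED.findall(text)), len(UNSPACED.findall(text))


spaced, unspaced = em_dash_counts("one — two, three—four, five — six")
```

The two counts then feed the binary confidence rule as `a` and `b`.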
47
+ ### 2. Ellipsis style
48
+
49
+ What to measure:
50
+ - `three_dots` count = occurrences of `...` (three ASCII dots, not part of a longer run)
51
+ - `unicode` count = occurrences of `…` (single Unicode character)
52
+ - `spaced_dots` count = occurrences of `. . .` (dots separated by spaces)
53
+
54
+ Decision:
55
+ - If total < 3 → `n/a`, skip.
56
+ - If max variant / total ≥ 0.85 → emit the winning style.
57
+ - Otherwise → `mixed`.
58
+
59
+ Output map:
60
+ - `three_dots` → `Ellipsis style: ...`
61
+ - `unicode` → `Ellipsis style: …`
62
+ - `spaced_dots` → `Ellipsis style: . . .`
63
+ - `mixed` → `Ellipsis style: inconsistent`
64
+
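The three-variant ellipsis decision can be sketched as follows (an illustration, not part of the packaged file). The function name `ellipsis_style` is hypothetical. The negative lookarounds implement "not part of a longer run" for the ASCII variant, so four or more consecutive dots are ignored.

```python
import re

# Exactly three ASCII dots, not preceded or followed by another dot.
THREE_DOTS = re.compile(r"(?<!\.)\.{3}(?!\.)")
# Dots separated by single spaces: ". . ."
SPACED_DOTS = re.compile(r"\. \. \.")


def ellipsis_style(text: str) -> str:
    """Return "three_dots", "unicode", "spaced_dots", "mixed", or "n/a"."""
    counts = {
        "three_dots": len(THREE_DOTS.findall(text)),
        "unicode": text.count("\u2026"),   # the single-character ellipsis
        "spaced_dots": len(SPACED_DOTS.findall(text)),
    }
    total = sum(counts.values())
    if total < 3:
        return "n/a"
    winner = max(counts, key=counts.get)
    return winner if counts[winner] / total >= 0.85 else "mixed"


style = ellipsis_style("Wait... what... no... really")
```

Unlike the binary rule, this decision has no medium band: a variant either reaches 85% or the result is `mixed`.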
65
+ ### 3. Sentence-initial conjunction capitalization
66
+
67
+ What to measure:
68
+ - `caps` count = occurrences of `[.!?]<whitespace>(But|And|So|Or|Yet)\b` — i.e., starts a new sentence with capitalized conjunction
69
+ - `lower` count = occurrences of `[.!?]<whitespace>(but|and|so|or|yet)\b` — same but lowercase (unusual, only if author uses a stylistic comma-after-period thing)
70
+
71
+ Apply the binary confidence rule.
72
+
73
+ Output map:
74
+ - `caps` → `Sentence-initial "But/And/So": capitalized (". But")`
75
+ - `lower` → `Sentence-initial "But/And/So": lowercase (", but")`
76
+ - `mixed` → `Sentence-initial "But/And/So": inconsistent`
77
+
78
+ ### 4. Oxford comma
79
+
80
+ What to measure (rough — accept some noise):
81
+ - `withOxford` = sequences matching `<word>, <word>(...), and <word>` or `<word>, <word>(...), or <word>`
82
+ - `withoutOxford` = sequences matching `<word>, <word>(...) and <word>` or `<word>, <word>(...) or <word>` (no comma before "and"/"or")
83
+
84
+ Apply the binary confidence rule.
85
+
86
+ Output map:
87
+ - `withOxford` → `Oxford comma: yes`
88
+ - `withoutOxford` → `Oxford comma: no`
89
+ - `mixed` → `Oxford comma: inconsistent`
90
+
91
+ ### 5. Quote style
92
+
93
+ What to measure:
94
+ - `straight` count = occurrences of `"`
95
+ - `curly` count = occurrences of `“` or `”`
96
+
97
+ Apply the binary confidence rule.
98
+
99
+ Output map:
100
+ - `straight` → `Quote style: straight "..."`
101
+ - `curly` → `Quote style: curly “...”`
102
+ - `mixed` → `Quote style: inconsistent`
103
+
104
+ ### 6. Capitalization after colon
105
+
106
+ What to measure:
107
+ - `upper` count = occurrences of `: <Uppercase letter>`
108
+ - `lower` count = occurrences of `: <lowercase letter>`
109
+
110
+ Apply the binary confidence rule. Require total ≥ 3.
111
+
112
+ Output map:
113
+ - `upper` → `Capitalization after colon: upper`
114
+ - `lower` → `Capitalization after colon: lower`
115
+ - `mixed` → `Capitalization after colon: inconsistent`
116
+
117
+ ### 7. Contractions
118
+
119
+ What to measure:
120
+ - `contracted` count = occurrences of `<word>'<s|re|ve|ll|d|t|m>` (e.g., `don't`, `I'm`, `we'll`)
121
+ - `expanded` count = occurrences of the literal phrases `do not`, `does not`, `did not`, `is not`, `are not`, `was not`, `were not`, `cannot`, `will not`, `would not`, `should not`, `could not`, `have not`, `has not`, `had not`, `I am`, `you are`, `we are`, `they are`, `it is`, `that is`, `there is`, `let us`
122
+
123
+ Apply the binary confidence rule.
124
+
125
+ Output map:
126
+ - `contracted` → `Contractions: uses contractions`
127
+ - `expanded` → `Contractions: avoids contractions`
128
+ - `mixed` → `Contractions: mixed`
129
+
130
+ ### 8. Paragraph length
131
+
132
+ What to measure:
133
+ - Split the corpus on double-newlines (`\n\n+`) to get paragraphs.
134
+ - For each paragraph, count sentences (split on `[.!?]<whitespace>`).
135
+ - Compute the average sentences-per-paragraph.
136
+
137
+ Require at least 3 paragraphs. Otherwise `n/a`, skip.
138
+
139
+ Decision:
140
+ - avg ≤ 2 → `short`, high confidence
141
+ - 2 < avg ≤ 4 → `medium`, high confidence
142
+ - avg ≥ 6 → `long`, high confidence
143
+ - 4 < avg < 6 → `mixed`, medium confidence
144
+
145
+ Output map:
146
+ - `short` → `Paragraph length: 1–2 sentences`
147
+ - `medium` → `Paragraph length: 3–4 sentences`
148
+ - `long` → `Paragraph length: 5+ sentences`
149
+ - `mixed` → `Paragraph length: varied`
150
+
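The paragraph-length measurement and decision can be sketched as below (an illustration, not part of the packaged file). The function name `paragraph_length` and the `(label, confidence)` return shape are assumptions; the split patterns and thresholds follow the text above. Sentence splitting on `[.!?]` followed by whitespace is deliberately rough, so abbreviations will add some noise.

```python
import re
from typing import Optional


def paragraph_length(corpus: str) -> tuple[str, Optional[str]]:
    """Return (label, confidence) from average sentences per paragraph."""
    paragraphs = [p for p in re.split(r"\n\n+", corpus.strip()) if p.strip()]
    if len(paragraphs) < 3:
        return ("n/a", None)        # not enough paragraphs, skip

    sentence_counts = [
        len([s for s in re.split(r"[.!?]\s", p.strip()) if s.strip()])
        for p in paragraphs
    ]
    avg = sum(sentence_counts) / len(sentence_counts)

    if avg <= 2:
        return ("short", "high")
    if avg <= 4:
        return ("medium", "high")
    if avg >= 6:
        return ("long", "high")
    return ("mixed", "medium")      # 4 < avg < 6


label, confidence = paragraph_length("One. Two.\n\nThree. Four.\n\nFive. Six.")
```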
151
+ ## Output structure
152
+
153
+ The agent writes `voice/fingerprints.md` as:
154
+
155
+ ```markdown
156
+ # Presentation Fingerprints
157
+
158
+ > Auto-generated from `voice/corpus/`. Exact presentation choices the user makes consistently.
159
+ > Match these in every generated response — LLMs default to training data otherwise.
160
+
161
+ - Em-dash spacing: word — word
162
+ - Oxford comma: yes
163
+ - Quote style: straight "..."
164
+ - Contractions: uses contractions
165
+ - Paragraph length: 3–4 sentences
166
+
167
+ ## Manual Overrides
168
+
169
+ <!-- Override any auto-detected fingerprint here. These win over the auto-detected ones above. -->
170
+ <!-- Example: -->
171
+ <!-- - Em-dash spacing: never use em-dashes at all. -->
172
+ <!-- - Quote style: straight always. -->
173
+ ```
174
+
175
+ The `## Manual Overrides` section MUST be preserved across regenerations.
@@ -0,0 +1,76 @@
1
+ # Authenticity Hurdle
2
+
3
+ > The core decision rule: when does a pattern in the user's corpus count as "their voice" versus "AI contamination"?
4
+
5
+ ## The problem
6
+
7
+ If we just emit NEVER rules for everything in the AI tells catalog, we'll strip out words the user actually likes and uses. Example: Mark Manson regularly uses "crucial" and "vital." Banning those flattens his voice.
8
+
9
+ If we only emit rules for words that appear zero times, we keep AI contamination. Example: an essay that the user originally drafted with ChatGPT help and then edited still contains stray "delves" and "valuable insights". That's not their voice; it's leftover residue.
10
+
11
+ The hurdle resolves this: a pattern is "authentic" only if it appears at signature frequency. Below that → contamination, ban it.
12
+
13
+ ## The thresholds
14
+
15
+ | Category | Min count | Min rate per 1000 words | Why |
16
+ |----------|-----------|-------------------------|-----|
17
+ | `word` | 2 | 0.15 | Single words are noise-prone. Need at least 2 and a non-trivial rate. |
18
+ | `transition` | 2 | 0.15 | Same as words. |
19
+ | `phrase` | 2 | 0.05 | Phrases are more distinctive — a lower rate still signals deliberate use. |
20
+ | `punctuation` | 15 | 1.5 | Punctuation is denser than diction. A few em-dashes is normal; signature use means many. |
21
+
22
+ ## The decision
23
+
24
+ For each AI tell, given `count` (how many times it appears in the corpus) and `words` (total word count of the corpus):
25
+
26
+ ```
27
+ rate_per_1k = (count / words) * 1000
28
+ passes = count >= min_count AND rate_per_1k >= min_rate_per_1k
29
+ ```
30
+
31
+ If `passes` → **preserve** (no NEVER rule). The user genuinely uses this.
32
+
33
+ If not `passes` → **forbid** (emit NEVER rule). This includes:
34
+ - Items that never appear (forbid because training data will push them back in)
35
+ - Items that appear but miss the minimum count or the rate threshold (forbid because they are likely contamination, not signature)
36
+
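The threshold table and decision rule above translate fairly directly into code. The following is a minimal sketch under stated assumptions: the `Hurdle` dataclass, the `HURDLES` dictionary, and the `decide` function are hypothetical names, and the zero-word guard is an added safety check that the rule itself does not mention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hurdle:
    min_count: int
    min_rate_per_1k: float

# Values taken from the thresholds table.
HURDLES = {
    "word": Hurdle(2, 0.15),
    "transition": Hurdle(2, 0.15),
    "phrase": Hurdle(2, 0.05),
    "punctuation": Hurdle(15, 1.5),
}

def decide(category, count, words):
    """Return "preserve" if the tell clears its hurdle, otherwise "forbid"."""
    hurdle = HURDLES[category]
    # Assumption: an empty corpus yields a rate of 0 instead of dividing by zero.
    rate_per_1k = (count / words) * 1000 if words > 0 else 0.0
    passes = count >= hurdle.min_count and rate_per_1k >= hurdle.min_rate_per_1k
    return "preserve" if passes else "forbid"
```

Both conditions must hold, so a pattern with a high rate but too few occurrences is still forbidden.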
37
+ ## Worked examples
38
+
39
+ **Word "delve" in a 2000-word corpus:**
40
+ - count = 0 → fails (count < 2) → **forbid**: `NEVER "delve".`
41
+ - count = 1 → fails (count < 2) → **forbid**: `NEVER "delve".` _(below hurdle — flagged in status.md as likely contamination)_
42
+ - count = 2, rate = 1.0/1k → passes (count ≥ 2, rate ≥ 0.15) → **preserve**, no rule.
43
+
44
+ **Em-dashes in a 5000-word corpus:**
45
+ - count = 8, rate = 1.6/1k → fails (count < 15) → **forbid**: `NEVER "em-dashes".`
46
+ - count = 20, rate = 4.0/1k → passes → **preserve**, no rule.
47
+
48
+ The em-dash hurdle is intentionally hard. Most human writers don't clear it. The few who do (people with serious published-prose backgrounds) get to keep their em-dashes.
49
+
50
+ ## What this implies about tier progression
51
+
52
+ The skill's tier system tracks corpus word count:
53
+ - Tier 1 (300–1k words): too small to clear most hurdles. NEVER rules emit aggressively.
54
+ - Tier 2 (1k–5k): some words start to pass. Most phrases still fail (they need ≥2 instances).
55
+ - Tier 3 (5k–20k): phrases and many words can pass. Em-dash hurdle still hard.
56
+ - Tier 4 (20k+): em-dash hurdle achievable. Profile is AV-grade.
57
+
58
+ The hurdle stays the same across tiers. What changes is that more corpus means more chances for the user's signature patterns to clear it.
59
+
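Because the hurdle combines a fixed minimum count with a per-1000-word rate, the number of occurrences a pattern needs depends on corpus size. The sketch below derives that number from the thresholds table. The name `min_passing_count` is hypothetical, and `Fraction` is used only to keep the rate arithmetic exact.

```python
import math
from fractions import Fraction

# category: (min count, min rate per 1000 words), from the thresholds table
THRESHOLDS = {
    "word": (2, Fraction("0.15")),
    "transition": (2, Fraction("0.15")),
    "phrase": (2, Fraction("0.05")),
    "punctuation": (15, Fraction("1.5")),
}

def min_passing_count(category, words):
    """Smallest count that satisfies both the count and the rate threshold."""
    min_count, min_rate = THRESHOLDS[category]
    # Occurrences needed to reach the rate threshold at this corpus size.
    by_rate = math.ceil(min_rate * words / 1000)
    return max(min_count, by_rate)
```

For a 5,000-word corpus the punctuation requirement is set by the minimum count (15). At 20,000 words the rate term takes over and the requirement becomes 30.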
60
+ ## Status reporting
61
+
62
+ For items that appear in the corpus but fail the hurdle (the "below_hurdle" set), flag them in `voice/status.md` under a "Below-Hurdle Detections" section. Format:
63
+
64
+ ```
65
+ - `delve` — appears 1x, rate 0.5/1k
66
+ - `however` — appears 1x, rate 0.5/1k
67
+ ```
68
+
69
+ This is informational — the user can see what got flagged as contamination. It helps them notice "oh, I have a habit of letting AI drafts through" or alternatively "wait, I actually do use that word, let me add it to manual preserves."
70
+
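Generating the "Below-Hurdle Detections" lines could look like the sketch below. The function name `format_below_hurdle` and the input shape (a list of `(pattern, count)` pairs that already failed the hurdle) are assumptions, and the one-decimal rate simply mirrors the examples above.

```python
def format_below_hurdle(detections, words):
    """Format (pattern, count) pairs that are present in the corpus but fail the hurdle."""
    lines = []
    for pattern, count in detections:
        if count <= 0:
            continue  # zero-count items are not "present in corpus", so not listed
        rate = count * 1000 / words
        # \u2014 is the dash used in the status.md format shown above.
        lines.append(f"- `{pattern}` \u2014 appears {count}x, rate {rate:.1f}/1k")
    return "\n".join(lines)
```

For a 2000-word corpus, `format_below_hurdle([("delve", 1), ("however", 1)], 2000)` reproduces the two example lines.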
71
+ ## Edge cases
72
+
73
+ - **Count = 0**: always forbid. Don't list in below_hurdle (because it isn't "present in corpus").
74
+ - **Tied counts in fingerprint binary**: if `a == b` (exactly equal), treat as `mixed`.
75
+ - **Total < 3 in fingerprint**: skip entirely — not enough signal.
76
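+ - **Negative numbers, NaN, infinity**: shouldn't happen, but if you implement defensively, clamp negatives with `count = max(0, count)` and check for NaN and infinity separately, since `max` alone does not reliably catch them.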
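A defensive guard for the last edge case might look like this. It is a sketch only: the name `sanitize_count` is hypothetical, and mapping non-finite values to 0 is an assumption, not something the rule above specifies.

```python
import math

def sanitize_count(count):
    """Coerce a raw count to a non-negative int; non-finite input becomes 0."""
    # NaN and infinity cannot be converted to int, so handle them first.
    if isinstance(count, float) and not math.isfinite(count):
        return 0
    return max(0, int(count))
```

Running raw counts through a guard like this before the hurdle check keeps the rate calculation from receiving values it cannot interpret.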