openwriter 0.35.1 → 0.35.2

Files changed (25)
  1. package/dist/plugins/authors-voice/skill/LICENSE +21 -0
  2. package/dist/plugins/authors-voice/skill/README.md +126 -0
  3. package/dist/plugins/authors-voice/skill/SKILL.md +151 -0
  4. package/dist/plugins/authors-voice/skill/catalog/ai-tells.md +144 -0
  5. package/dist/plugins/authors-voice/skill/catalog/anchor-prompt.md +189 -0
  6. package/dist/plugins/authors-voice/skill/catalog/author-hints.md +119 -0
  7. package/dist/plugins/authors-voice/skill/catalog/fingerprints.md +175 -0
  8. package/dist/plugins/authors-voice/skill/catalog/hurdle.md +76 -0
  9. package/dist/plugins/authors-voice/skill/catalog/post-write-audit.md +105 -0
  10. package/dist/plugins/authors-voice/skill/docs/analysis.md +31 -0
  11. package/dist/plugins/authors-voice/skill/docs/anchor-iteration.md +176 -0
  12. package/dist/plugins/authors-voice/skill/docs/api/import.md +78 -0
  13. package/dist/plugins/authors-voice/skill/docs/api/protocol.md +140 -0
  14. package/dist/plugins/authors-voice/skill/docs/api/setup.md +37 -0
  15. package/dist/plugins/authors-voice/skill/docs/api/tools.md +102 -0
  16. package/dist/plugins/authors-voice/skill/docs/api/troubleshooting.md +7 -0
  17. package/dist/plugins/authors-voice/skill/docs/apply-protocol-deep.md +191 -0
  18. package/dist/plugins/authors-voice/skill/docs/context-hygiene.md +33 -0
  19. package/dist/plugins/authors-voice/skill/docs/setup.md +74 -0
  20. package/dist/plugins/authors-voice/skill/docs/tiers.md +13 -0
  21. package/dist/plugins/authors-voice/skill/package.json +35 -0
  22. package/dist/plugins/authors-voice/skill/prompts/skeleton.md +29 -0
  23. package/dist/plugins/authors-voice/skill/voice/README.md +51 -0
  24. package/dist/plugins/authors-voice/skill/voice/corpus/.gitkeep +0 -0
  25. package/package.json +1 -1
@@ -0,0 +1,119 @@
1
+ # Author Hints
2
+
3
+ > Curated list of authors with heavy training-data representation across registers.
4
+ > The agent picks 3-5 from this list (or goes outside if a clearly better stylistic match exists).
5
+ > Goal: high consistency, broad coverage of style modes.
6
+ >
7
+ > Each entry describes the author's characteristic PROSE FEATURES — not content, not topics.
8
+
9
+ ## Literary stylists — sentence as instrument
10
+
11
+ - **Ernest Hemingway** — iceberg theory, short declarative, almost no adjectives, hard nouns and verbs
12
+ - **Cormac McCarthy** — no commas in dialogue, long unpunctuated runs, biblical cadence
13
+ - **Joan Didion** — measured precision, lists of three, the specific over the general
14
+ - **David Foster Wallace** — maximalist sentences, footnote energy, recursive parentheticals, vocabulary
15
+ - **Toni Morrison** — rhythmic repetition, sensory specificity, vernacular woven with formal
16
+ - **James Baldwin** — long balanced sentences, moral urgency in clause stacking, semicolons used as breath marks
17
+ - **Annie Dillard** — observational compression, present-tense immediacy, sentence as image
18
+ - **Marilynne Robinson** — theological cadence in plain words, long meditative sentences, Calvinist patience
19
+
20
+ ## Essayists / analytical
21
+
22
+ - **Paul Graham** — thinking out loud, simple words, short sentences mixed with one long earned conclusion
23
+ - **Patrick McKenzie** — long discursive sentences with parenthetical asides, domain-specific precision, dry humor
24
+ - **Tim Urban** — extended metaphors, conversational asides, building up frameworks with named characters
25
+ - **Scott Alexander** — rationalist sectioning, exhaustive enumeration of possibilities, fair-witness analysis
26
+ - **Ben Thompson** — business-strategy decomposition, recurring framework names, lots of "this is why"
27
+ - **Tyler Cowen** — compressed, list-heavy, blogger-shorthand, range of references in one paragraph
28
+ - **Malcolm Gladwell** — narrative anchor → general principle → reversal, three-act essay structure
29
+ - **Adam Grant** — research-anchored, paired contrasts, gentle prescriptive framing
30
+
31
+ ## Self-help / instructional / productivity
32
+
33
+ - **Mark Manson** — period-heavy clean prose, "Your X is Y" definitional moves, profane confidence, direct second-person
34
+ - **James Clear** — concept-coinage backbone, numbered enumeration, clean instructional cadence, named laws/frameworks
35
+ - **Ryan Holiday** — Stoic-instructional, drawing concepts from antiquity, clean delivery, repetition for emphasis
36
+ - **Tim Ferriss** — list-heavy, hack/protocol framing, second-person direct, capitalized concept names
37
+ - **Cal Newport** — academic-instructional, named frameworks, research-backed prescriptions, sober register
38
+ - **Greg McKeown** — one-idea-per-page rhythm, named principles, short paragraphs, prescriptive minimalism
39
+ - **Brené Brown** — vulnerability-as-rhetoric, personal anecdotes anchoring research, conversational warmth
40
+ - **Atomic Habits voice** — short-paragraph instruction, principle-then-example, mechanical-cause-and-effect language
41
+
42
+ ## Polemicists / contrarians / philosophical-provocative
43
+
44
+ - **Nassim Nicholas Taleb** — concept-coinage (Black Swan, antifragile), aphoristic stabs, attacking IYI, ancient-thinker citations
45
+ - **Jordan Peterson** — lecture cadence, biological-evolutionary framing, religious overlay, definitional pivots
46
+ - **Bronze Age Pervert** — baroque Nietzschean, mock-archaic spelling, vitalist anti-modernity, ironic register
47
+ - **Curtis Yarvin** — reactionary historical, concept-naming as branding, sneering wit, long allusive sentences
48
+ - **Camille Paglia** — punchy contrarian, aesthetic-biological frame, dense allusion, no hedging
49
+ - **Bryan Caplan** — libertarian-economic, direct refutation, hypothetical thought experiments, plain professorial prose
50
+
51
+ ## Tech / startup
52
+
53
+ - **Naval Ravikant** — aphoristic, tweet-shaped, concept-as-brand, distilled-wisdom cadence
54
+ - **Marc Andreessen** — manifesto-mode, accumulating short declarative lines, exhortation register
55
+ - **Sam Altman** — short, contrarian, "obvious in hindsight" framing, blog-post brevity
56
+ - **Peter Thiel** — paradox-as-thesis, philosophical-tech crossover, Strauss-influenced indirection
57
+ - **Joel Spolsky** — conversational tech-blog, anecdote-then-principle, signposting humor
58
+ - **Steve Yegge** — long discursive rants, programmer-culture inside jokes, accumulating digressions
59
+
60
+ ## Narrative non-fiction / journalism
61
+
62
+ - **Michael Lewis** — character-first reporting, scene-as-argument, clean unobtrusive prose
63
+ - **Sebastian Junger** — documentary precision, anthropological framing, present-tense narrative
64
+ - **Jon Krakauer** — present-tense urgency, sensory immediacy, restraint in adjective use
65
+ - **John McPhee** — list-as-paragraph, structural patterning, long sentences with specific detail
66
+ - **Ta-Nehisi Coates** — meditative-historical, repetition as emphasis, address as form of argument
67
+ - **Tom Wolfe** — New Journalism, exclamation, capitalization, italics for sound, voice-jumping
68
+ - **Joan Didion (essays)** — see literary; her journalism has the same compressed precision
69
+
70
+ ## Memoirists / personal voice
71
+
72
+ - **David Sedaris** — comic understatement, family scenes, deadpan one-liners as paragraph closers
73
+ - **Anne Lamott** — confessional warmth, self-deprecating humor, sentence fragments for emphasis
74
+ - **Anthony Bourdain** — profane confidence, food as window, baroque vocabulary mixed with kitchen-slang
75
+ - **Rick Bragg** — Southern oral cadence, specific-detail compression, sentence rhythms from speech
76
+ - **Mary Karr** — lyric memoir, line-break-tight sentences, Catholic-rural register
77
+
78
+ ## Aphorists / brief-form
79
+
80
+ - **La Rochefoucauld** — epigrammatic, paired antithesis, cynical wit in one breath
81
+ - **Friedrich Nietzsche** — aphoristic, hammer-blows of declaration, philosophical provocation
82
+ - **E.M. Cioran** — pessimistic aphorism, paradoxical brevity, polished despair
83
+ - **Eric Hoffer** — longshoreman intellectual, declarative wisdom, sociological observation
84
+
85
+ ## Academics-for-public
86
+
87
+ - **Steven Pinker** — cognitive-science precision, numbered argument, defending Enlightenment, ironic asides
88
+ - **Richard Dawkins** — precise zoological prose, metaphor as scaffolding, polemical clarity
89
+ - **Robert Sapolsky** — neuroendocrine framing, dense citation, jokes nested in dense paragraphs
90
+ - **Carl Sagan** — cosmic-scale lyric, science-as-wonder cadence, accessible majesty
91
+ - **Daniel Kahneman** — System 1 / System 2 framing, dispassionate exposition, named cognitive biases
92
+ - **Yuval Noah Harari** — sweeping historical synthesis, declarative simplification, "imagined orders" type concepts
93
+ - **Jared Diamond** — continent-scale comparison, geographic-determinist framing, list of factors enumeration
94
+
95
+ ## Religious / philosophical (modern)
96
+
97
+ - **C.S. Lewis** — clear analogical prose, common-sense apologetics, "imagine that" framing
98
+ - **G.K. Chesterton** — paradox-as-rhetoric, joyful contrarian, accumulating images per sentence
99
+ - **Thomas Merton** — contemplative prose, slow paragraphs, Catholic-Buddhist hybrid
100
+
101
+ ## Business / leadership
102
+
103
+ - **Peter Drucker** — dry analytical, principle-then-example, professorial calm
104
+ - **Andy Grove** — engineering-direct management, framework-naming, lean instructional
105
+ - **Ben Horowitz** — war-stories anchored to lessons, hip-hop epigraphs, conversational toughness
106
+
107
+ ## Short-form / Twitter native
108
+
109
+ - **Visakan Veerasamy** — thread-shaped, hyperlinked references, generous tone, recursive callbacks
110
+ - **Hari Kondabolu / Twitter-essayist** — one-liner punch followed by longer unpacking, comic timing
111
+
112
+ ## Genre-specific (modern poetic / lyric prose)
113
+
114
+ - **Ocean Vuong** — lyric memoir, line-conscious paragraphs, image-as-argument
115
+ - **Maggie Nelson** — theory-personal hybrid, numbered fragments, citations interleaved
116
+
117
+ ## Reminder
118
+
119
+ These style notes describe **prose features only**. When matching the user's corpus against an author from this list, cite the author's prose feature — not their content or topic. If you find yourself matching "the user writes about topic X therefore resembles author Y," stop and re-do the match. The correct framing is "the user uses period-heavy declarative cadence with definitional pivots, which matches Peterson's lecture-cadence prose pattern."
@@ -0,0 +1,175 @@
1
+ # Fingerprints Catalog
2
+
3
+ > Exact presentation choices the author makes consistently. LLMs default to training-data defaults for all of these unless explicitly told the user's choice. So we measure each one and emit a one-liner.
4
+ > Use this when analyzing a user's corpus.
5
+
6
+ ## How to use this catalog
7
+
8
+ For each fingerprint below:
9
+
10
+ 1. Read the user's corpus.
11
+ 2. Count the relevant variants.
12
+ 3. Apply the decision rule (mostly ≥3 total observations, then a ratio threshold).
13
+ 4. Emit a one-line fingerprint if confidence ≥ medium. Skip if `n/a` (not enough signal).
14
+
15
+ The output goes into `voice/fingerprints.md` as a bullet list, each line in the format `<Label>: <value>`.
16
+
17
+ ## Confidence rule (used by most binary fingerprints)
18
+
19
+ Total observations = `a + b` where `a` is the count of one variant and `b` the other.
20
+
21
+ - If total < 3 → **`n/a` (low confidence)**, skip emitting.
22
+ - If `a / total ≥ 0.85` → **`a` wins, high confidence**, emit.
23
+ - If `a / total ≤ 0.15` → **`b` wins, high confidence**, emit.
24
+ - If `0.70 ≤ a / total < 0.85` → **`a` wins, medium confidence**, emit.
25
+ - If `0.15 < a / total ≤ 0.30` → **`b` wins, medium confidence**, emit.
26
+ - Otherwise (0.30 < ratio < 0.70) → **`mixed` (low confidence)**, emit as "inconsistent" only if the user explicitly wants to see mixed signals; otherwise skip.
27
+
28
+ For the curious: the bands mirror each other around 0.5, and there is deliberately no simple-majority case, because we want either a clear choice (at least 70% one way) or to skip. The 0.30-0.70 band means "the author isn't actually making a consistent choice"; emitting a fingerprint there would mislead.
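The confidence rule above can be sketched as a small function. This is an illustrative sketch only: the name `binary_confidence` and the tuple return shape are assumptions, not something the skill defines.

```python
from typing import Optional, Tuple


def binary_confidence(a: int, b: int) -> Tuple[Optional[str], str]:
    """Return (winner, confidence) for two variant counts.

    winner is "a", "b", "mixed", or None when there is too little signal.
    The function name and return shape are hypothetical.
    """
    total = a + b
    if total < 3:
        return None, "n/a"      # not enough observations: skip emitting
    ratio = a / total
    if ratio >= 0.85:
        return "a", "high"
    if ratio <= 0.15:
        return "b", "high"
    if ratio >= 0.70:
        return "a", "medium"    # 0.70 <= ratio < 0.85
    if ratio <= 0.30:
        return "b", "medium"    # 0.15 < ratio <= 0.30
    return "mixed", "low"       # 0.30 < ratio < 0.70


winner, confidence = binary_confidence(9, 1)
```

A caller would emit a fingerprint line only when the confidence comes back as `high` or `medium`.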
29
+
30
+ ## The eight fingerprints
31
+
32
+ ### 1. Em-dash spacing
33
+
34
+ What to measure:
35
+ - `spaced` count = occurrences of `<whitespace>—<whitespace>` (e.g., `word — word`)
36
+ - `unspaced` count = occurrences of `<non-whitespace>—<non-whitespace>` (e.g., `word—word`)
37
+
38
+ Apply the binary confidence rule.
39
+
40
+ Output map:
41
+ - `spaced` → `Em-dash spacing: word — word`
42
+ - `unspaced` → `Em-dash spacing: word—word`
43
+ - `mixed` → `Em-dash spacing: inconsistent`
44
+
45
+ Note: if the user is on a NEVER em-dashes rule (didn't clear the punctuation hurdle), skip this fingerprint entirely.
46
+
47
+ ### 2. Ellipsis style
48
+
49
+ What to measure:
50
+ - `three_dots` count = occurrences of `...` (three ASCII dots, not part of a longer run)
51
+ - `unicode` count = occurrences of `…` (single Unicode character)
52
+ - `spaced_dots` count = occurrences of `. . .` (dots separated by spaces)
53
+
54
+ Decision:
55
+ - If total < 3 → `n/a`, skip.
56
+ - If max variant / total ≥ 0.85 → emit the winning style.
57
+ - Otherwise → `mixed`.
58
+
59
+ Output map:
60
+ - `three_dots` → `Ellipsis style: ...`
61
+ - `unicode` → `Ellipsis style: …`
62
+ - `spaced_dots` → `Ellipsis style: . . .`
63
+ - `mixed` → `Ellipsis style: inconsistent`
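As a rough sketch of the three-way ellipsis decision, the following counts each variant with a regular expression and applies the 0.85 rule. The function name `ellipsis_style` and the exact regexes are assumptions; the catalog describes the patterns only in prose.

```python
import re
from typing import Optional


def ellipsis_style(text: str) -> Optional[str]:
    """Return "three_dots", "unicode", "spaced_dots", "mixed", or None (n/a)."""
    counts = {
        # exactly three ASCII dots, not part of a longer run of dots
        "three_dots": len(re.findall(r"(?<!\.)\.\.\.(?!\.)", text)),
        # the single Unicode ellipsis character U+2026
        "unicode": text.count("\u2026"),
        # dots separated by single spaces: ". . ."
        "spaced_dots": len(re.findall(r"\. \. \.", text)),
    }
    total = sum(counts.values())
    if total < 3:
        return None  # n/a, skip
    winner = max(counts, key=counts.get)
    if counts[winner] / total >= 0.85:
        return winner
    return "mixed"
```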
64
+
65
+ ### 3. Sentence-initial conjunction capitalization
66
+
67
+ What to measure:
68
+ - `caps` count = occurrences of `[.!?]<whitespace>(But|And|So|Or|Yet)\b` — i.e., starts a new sentence with capitalized conjunction
69
+ - `lower` count = occurrences of `[.!?]<whitespace>(but|and|so|or|yet)\b` — same but lowercase (unusual, only if author uses a stylistic comma-after-period thing)
70
+
71
+ Apply the binary confidence rule.
72
+
73
+ Output map:
74
+ - `caps` → `Sentence-initial "But/And/So": capitalized (". But")`
75
+ - `lower` → `Sentence-initial "But/And/So": lowercase (", but")`
76
+ - `mixed` → `Sentence-initial "But/And/So": inconsistent`
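The two counts for this fingerprint could be taken with a pair of compiled patterns, as in the sketch below. It treats `<whitespace>` as one or more whitespace characters, which is an assumption, and the helper name `conjunction_counts` is hypothetical. The resulting pair would then go through the binary confidence rule.

```python
import re

# Sentence terminator, whitespace, then a conjunction as a whole word.
CAPS = re.compile(r"[.!?]\s+(?:But|And|So|Or|Yet)\b")
LOWER = re.compile(r"[.!?]\s+(?:but|and|so|or|yet)\b")


def conjunction_counts(text: str) -> tuple:
    """Return (caps, lower) counts of sentence-initial conjunctions."""
    return len(CAPS.findall(text)), len(LOWER.findall(text))
```

The `\b` word boundary keeps words such as "Sometimes" or "Android" from being counted as "So" or "And".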
77
+
78
+ ### 4. Oxford comma
79
+
80
+ What to measure (rough — accept some noise):
81
+ - `withOxford` = sequences matching `<word>, <word>(...), and <word>` or `<word>, <word>(...), or <word>`
82
+ - `withoutOxford` = sequences matching `<word>, <word>(...) and <word>` or `<word>, <word>(...) or <word>` (no comma before "and"/"or")
83
+
84
+ Apply the binary confidence rule.
85
+
86
+ Output map:
87
+ - `withOxford` wins → `Oxford comma: yes`
88
+ - `withoutOxford` wins → `Oxford comma: no`
89
+ - `mixed` → `Oxford comma: inconsistent`
90
+
91
+ ### 5. Quote style
92
+
93
+ What to measure:
94
+ - `straight` count = occurrences of `"`
95
+ - `curly` count = occurrences of `“` or `”`
96
+
97
+ Apply the binary confidence rule.
98
+
99
+ Output map:
100
+ - `straight` → `Quote style: straight "..."`
101
+ - `curly` → `Quote style: curly “...”`
102
+ - `mixed` → `Quote style: inconsistent`
103
+
104
+ ### 6. Capitalization after colon
105
+
106
+ What to measure:
107
+ - `upper` count = occurrences of `: <Uppercase letter>`
108
+ - `lower` count = occurrences of `: <lowercase letter>`
109
+
110
+ Apply the binary confidence rule. Require total ≥ 3.
111
+
112
+ Output map:
113
+ - `upper` → `Capitalization after colon: upper`
114
+ - `lower` → `Capitalization after colon: lower`
115
+ - `mixed` → `Capitalization after colon: inconsistent`
116
+
117
+ ### 7. Contractions
118
+
119
+ What to measure:
120
+ - `contracted` count = occurrences of `<word>'<s|re|ve|ll|d|t|m>` (e.g., `don't`, `I'm`, `we'll`)
121
+ - `expanded` count = occurrences of the literal phrases `do not`, `does not`, `did not`, `is not`, `are not`, `was not`, `were not`, `cannot`, `will not`, `would not`, `should not`, `could not`, `have not`, `has not`, `had not`, `I am`, `you are`, `we are`, `they are`, `it is`, `that is`, `there is`, `let us`
122
+
123
+ Apply the binary confidence rule.
124
+
125
+ Output map:
126
+ - `contracted` wins → `Contractions: uses contractions`
127
+ - `expanded` wins → `Contractions: avoids contractions`
128
+ - `mixed` → `Contractions: mixed`
129
+
130
+ ### 8. Paragraph length
131
+
132
+ What to measure:
133
+ - Split the corpus on double-newlines (`\n\n+`) to get paragraphs.
134
+ - For each paragraph, count sentences (split on `[.!?]<whitespace>`).
135
+ - Compute the average sentences-per-paragraph.
136
+
137
+ Require at least 3 paragraphs. Otherwise `n/a`, skip.
138
+
139
+ Decision:
140
+ - avg ≤ 2 → `short`, high confidence
141
+ - 2 < avg ≤ 4 → `medium`, high confidence
142
+ - avg ≥ 6 → `long`, high confidence
143
+ - 4 < avg < 6 → `mixed`, medium confidence
144
+
145
+ Output map:
146
+ - `short` → `Paragraph length: 1–2 sentences`
147
+ - `medium` → `Paragraph length: 3–4 sentences`
148
+ - `long` → `Paragraph length: 5+ sentences`
149
+ - `mixed` → `Paragraph length: varied`
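The paragraph-length measurement and decision bands can be sketched as follows. The function name `paragraph_length` is hypothetical, and dropping empty fragments after each split is an assumption about how edge cases are handled.

```python
import re
from typing import Optional


def paragraph_length(corpus: str) -> Optional[str]:
    """Return "short", "medium", "long", "mixed", or None (n/a)."""
    paragraphs = [p.strip() for p in re.split(r"\n\n+", corpus) if p.strip()]
    if len(paragraphs) < 3:
        return None  # fewer than 3 paragraphs: not enough signal
    per_paragraph = []
    for p in paragraphs:
        # split on sentence terminator followed by whitespace
        sentences = [s for s in re.split(r"[.!?]\s+", p) if s.strip()]
        per_paragraph.append(len(sentences))
    avg = sum(per_paragraph) / len(per_paragraph)
    if avg <= 2:
        return "short"
    if avg <= 4:
        return "medium"
    if avg >= 6:
        return "long"
    return "mixed"  # 4 < avg < 6
```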
150
+
151
+ ## Output structure
152
+
153
+ The agent writes `voice/fingerprints.md` as:
154
+
155
+ ```markdown
156
+ # Presentation Fingerprints
157
+
158
+ > Auto-generated from `voice/corpus/`. Exact presentation choices the user makes consistently.
159
+ > Match these in every generated response — LLMs default to training data otherwise.
160
+
161
+ - Em-dash spacing: word — word
162
+ - Oxford comma: yes
163
+ - Quote style: straight "..."
164
+ - Contractions: uses contractions
165
+ - Paragraph length: 3–4 sentences
166
+
167
+ ## Manual Overrides
168
+
169
+ <!-- Override any auto-detected fingerprint here. These win over the auto-detected ones above. -->
170
+ <!-- Example: -->
171
+ <!-- - Em-dash spacing: never use em-dashes at all. -->
172
+ <!-- - Quote style: straight always. -->
173
+ ```
174
+
175
+ The `## Manual Overrides` section MUST be preserved across regenerations.
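One way to honor the preservation requirement is to locate the heading only at the start of a line, so a mention of the literal inside a blockquote is not mistaken for the section. The sketch below assumes the section runs to the end of the file, as in the template above; the function names and the trimmed-down header are hypothetical.

```python
import re

# Match the heading only when it starts a line (not inside "> ..." text).
HEADING = re.compile(r"^## Manual Overrides[ \t]*$", re.MULTILINE)


def extract_manual_overrides(existing: str) -> str:
    """Return the Manual Overrides section verbatim, or "" if absent."""
    match = HEADING.search(existing)
    return existing[match.start():] if match else ""


def regenerate(existing: str, auto_lines: list) -> str:
    """Rebuild the file from fresh auto-detected lines, keeping overrides."""
    header = "# Presentation Fingerprints\n\n"
    body = "".join(f"- {line}\n" for line in auto_lines)
    overrides = extract_manual_overrides(existing) or "## Manual Overrides\n"
    return header + body + "\n" + overrides
```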
@@ -0,0 +1,76 @@
1
+ # Authenticity Hurdle
2
+
3
+ > The core decision rule: when does a pattern in the user's corpus count as "their voice" versus "AI contamination"?
4
+
5
+ ## The problem
6
+
7
+ If we just emit NEVER rules for everything in the AI tells catalog, we'll strip out words the user actually likes and uses. Example: Mark Manson regularly uses "crucial" and "vital." Banning those flattens his voice.
8
+
9
+ If we don't emit any rules unless the word appears zero times, we keep AI contamination. Example: an essay that the user originally drafted with ChatGPT help and then edited still contains stray "delves" and "valuable insights" — that's not their voice, that's leftover residue.
10
+
11
+ The hurdle resolves this: a pattern is "authentic" only if it appears at signature frequency. Below that → contamination, ban it.
12
+
13
+ ## The thresholds
14
+
15
+ | Category | Min count | Min rate per 1000 words | Why |
16
+ |----------|-----------|-------------------------|-----|
17
+ | `word` | 2 | 0.15 | Single words are noise-prone. Need at least 2 and a non-trivial rate. |
18
+ | `transition` | 2 | 0.15 | Same as words. |
19
+ | `phrase` | 2 | 0.05 | Phrases are more distinctive — a lower rate still signals deliberate use. |
20
+ | `punctuation` | 15 | 1.5 | Punctuation is denser than diction. A few em-dashes is normal; signature use means many. |
21
+
22
+ ## The decision
23
+
24
+ For each AI tell, given `count` (how many times it appears in the corpus) and `words` (total word count of the corpus):
25
+
26
+ ```
27
+ rate_per_1k = (count / words) * 1000
28
+ passes = count >= min_count AND rate_per_1k >= min_rate_per_1k
29
+ ```
30
+
31
+ If `passes` → **preserve** (no NEVER rule). The user genuinely uses this.
32
+
33
+ If not `passes` → **forbid** (emit NEVER rule). This includes:
34
+ - Items that never appear (forbid because training data will push them back in)
35
+ - Items that appear once or twice but below the rate threshold (forbid because it's likely contamination, not signature)
36
+
37
+ ## Worked examples
38
+
39
+ **Word "delve" in a 2000-word corpus:**
40
+ - count = 0 → fails (count < 2) → **forbid**: `NEVER "delve".`
41
+ - count = 1 → fails (count < 2) → **forbid**: `NEVER "delve".` _(below hurdle — flagged in status.md as likely contamination)_
42
+ - count = 2, rate = 1.0/1k → passes (count ≥ 2, rate ≥ 0.15) → **preserve**, no rule.
43
+
44
+ **Em-dashes in a 5000-word corpus:**
45
+ - count = 8, rate = 1.6/1k → fails (count < 15) → **forbid**: `NEVER "em-dashes".`
46
+ - count = 20, rate = 4.0/1k → passes → **preserve**, no rule.
47
+
48
+ The em-dash hurdle is intentionally hard. Most human writers don't clear it. The few who do (people with serious published-prose backgrounds) get to keep their em-dashes.
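The threshold table and decision rule can be checked against the worked examples with a short sketch like this one. The `THRESHOLDS` mapping and the `passes_hurdle` name are illustrative assumptions; the numbers are taken from the table above.

```python
# category -> (min_count, min_rate_per_1k), copied from the thresholds table
THRESHOLDS = {
    "word": (2, 0.15),
    "transition": (2, 0.15),
    "phrase": (2, 0.05),
    "punctuation": (15, 1.5),
}


def passes_hurdle(category: str, count: int, words: int) -> bool:
    """True means preserve (no NEVER rule); False means forbid."""
    min_count, min_rate = THRESHOLDS[category]
    if words <= 0:
        return False  # empty corpus: nothing can clear the hurdle
    count = max(0, count)  # defensive guard against negative counts
    rate_per_1k = count / words * 1000
    return count >= min_count and rate_per_1k >= min_rate


# Worked examples from the text
delve_twice = passes_hurdle("word", 2, 2000)            # rate 1.0/1k
few_em_dashes = passes_hurdle("punctuation", 8, 5000)   # rate 1.6/1k, count < 15
```

Note that eight em-dashes in 5000 words clears the rate threshold but still fails on the minimum count, which is what makes the punctuation hurdle hard.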
49
+
50
+ ## What this implies about tier progression
51
+
52
+ The skill's tier system tracks corpus word count:
53
+ - Tier 1 (300–1k words): too small to clear most hurdles. NEVER rules emit aggressively.
54
+ - Tier 2 (1k–5k): some words start to pass. Most phrases still fail (they need ≥2 instances).
55
+ - Tier 3 (5k–20k): phrases and many words can pass. Em-dash hurdle still hard.
56
+ - Tier 4 (20k+): em-dash hurdle achievable. Profile is AV-grade.
57
+
58
+ The hurdle stays the same across tiers. What changes is that more corpus means more chances for the user's signature patterns to clear it.
59
+
60
+ ## Status reporting
61
+
62
+ For items that appear in the corpus but fail the hurdle (the "below_hurdle" set), flag them in `voice/status.md` under a "Below-Hurdle Detections" section. Format:
63
+
64
+ ```
65
+ - `delve` — appears 1x, rate 0.5/1k
66
+ - `however` — appears 1x, rate 0.5/1k
67
+ ```
68
+
69
+ This is informational — the user can see what got flagged as contamination. It helps them notice "oh, I have a habit of letting AI drafts through" or alternatively "wait, I actually do use that word, let me add it to manual preserves."
70
+
71
+ ## Edge cases
72
+
73
+ - **Count = 0**: always forbid. Don't list in below_hurdle (because it isn't "present in corpus").
74
+ - **Tied counts in fingerprint binary**: if `a == b` (exactly equal), treat as `mixed`.
75
+ - **Total < 3 in fingerprint**: skip entirely — not enough signal.
76
+ - **Negative numbers, NaN, infinity**: shouldn't happen, but guard with `count = max(0, count)` if you implement defensively.
@@ -0,0 +1,105 @@
1
+ # Post-Write Audit
2
+
3
+ > Distribution-level statistical checks the orchestrator runs against the minion's returned prose. Sits between step 6 (NEVER scan) and step 7 (integration) of the Apply Protocol. Catches statistical fingerprints the minion's prompt can't reasonably prevent without cognitively overloading the writing pass.
4
+
5
+ ## When this runs
6
+
7
+ After step 6 (NEVER-violations scan + brief-error patching), before step 7 (integration). The orchestrator reads this file, applies each check to the minion's output, and surgically rewrites the smallest span that brings the failing metric back into range.
8
+
9
+ ## Why this layer exists
10
+
11
+ The minion writes prose. The orchestrator polices distribution and lexicon. Anything mechanically detectable after the fact lives here, not in the writing-pass prompt — the minion's cognitive budget should go to channeling the anchor and hitting the commitments, not tracking 60 micro-bans.
12
+
13
+ Two enforcement points still exist for the bans the minion DOES need to see (contrastive negation, sentence-opener repetition, em-dashes, etc.) — those live in `voice/never-rules.md` and get scanned at step 6. This audit is for the slop that's cheaper to scrub than to prevent.
14
+
15
+ ## Remediation principle
16
+
17
+ For each failing check, rewrite the **smallest local span** that fixes the metric. Do not regenerate. Do not reach for stylistic improvement. The minion's voice IS the result — the audit only nudges the statistics.
18
+
19
+ If a failing span is load-bearing (a specific image, a coined term, a structural beat the brief demanded), leave it. Audit findings are advisory at the boundary case. The minion's intent wins ties.
20
+
21
+ Aim for the lightest touch: 5-10 small substitutions across a typical draft brings rates back in line. Heavier rewrites mean the audit is being misused.
22
+
23
+ ## Distribution checks
24
+
25
+ ### 1. Sentence-opener repetition
26
+
27
+ **What to measure:** walk the output sentence by sentence. For each window of 3 consecutive sentences, check whether all three start with the same first word.
28
+
29
+ **Threshold:** flag if >30% of windows trigger.
30
+
31
+ **Why this number:** human writing sits at ~17% (DFT 2026 — mostly from intentional list structures like "How does X?... How does Y?... How does Z?"). SFT models at T=0.7 hit 53.3%. The 30% line cleanly separates human from AI.
32
+
33
+ **Action:** locate the offending windows. For each, rewrite the second OR third sentence to start with a different word. If the window forms an intentional list, leave it — list structure is the human use case the 17% baseline reflects.
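A minimal sketch of the window measurement, assuming sentences are split after `.`, `!`, or `?` followed by whitespace and that first words are compared case-insensitively with punctuation stripped. Neither detail is specified by the source, and the function name is hypothetical.

```python
import re


def opener_repetition_rate(text: str) -> float:
    """Share of 3-sentence windows whose sentences open with the same word."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # first word of each sentence, lowercased, punctuation removed
    openers = [re.sub(r"\W+", "", s.split()[0]).lower() for s in sentences]
    windows = len(openers) - 2
    if windows < 1:
        return 0.0  # fewer than three sentences: nothing to measure
    hits = sum(
        1
        for i in range(windows)
        if openers[i] and openers[i] == openers[i + 1] == openers[i + 2]
    )
    return hits / windows


def should_flag(text: str) -> bool:
    return opener_repetition_rate(text) > 0.30
```

Whether a flagged window is an intentional list still needs a human-style judgment call; the function only locates candidates.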
34
+
35
+ ### 2. Sentence-initial "The" frequency
36
+
37
+ **What to measure:** percentage of sentences that begin with the word "The."
38
+
39
+ **Threshold:** flag if >15% of all sentences.
40
+
41
+ **Why this number:** "The" at sentence start is over-used by ~90% in SFT output vs human writing (DFT 2026, 14B SFT model). The +90% inflation puts AI rates well above the natural human range.
42
+
43
+ **Action:** locate sentences starting with "The." Rewrite a portion to start with a different determiner ("A", "An", "These"), a pronoun, a prepositional phrase, or a different subject. Five to seven swaps across a typical paragraph is usually sufficient.
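The percentage for this check could be computed as below. The sentence splitter is the same simplifying assumption as elsewhere (terminator followed by whitespace), and `the_opener_share` is a hypothetical name.

```python
import re


def the_opener_share(text: str) -> float:
    """Fraction of sentences that begin with the word "The"."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return 0.0
    # \b prevents "Then" or "These" from counting as "The"
    hits = sum(1 for s in sentences if re.match(r"The\b", s))
    return hits / len(sentences)


def should_flag_the(text: str) -> bool:
    return the_opener_share(text) > 0.15
```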
44
+
45
+ ### 3. Function-word over-use
46
+
47
+ **What to read for:** the AI distribution-distance signal lives mostly in function words, not fancy diction. Top-10 tokens account for 87.2% of L2 distribution distance in SFT output (DFT 2026). Watch for:
48
+
49
+ | Token | SFT inflation vs human |
50
+ |---|---|
51
+ | `is` | +44% |
52
+ | `was` | +49% |
53
+ | `are` | +31% |
54
+ | `that` | +25% |
55
+ | `a` | +15% |
56
+ | `to` | +11% |
57
+ | `.` (period) | +19% |
58
+
59
+ **Heuristic check (no exact threshold):** scan the draft for clusters of short copular sentences ("X is Y. Z is W. P is Q.") and high period density (many short sentences in a row). Both are signatures of function-word inflation.
60
+
61
+ **Action:** when noticed, merge two short copular sentences into one with a participial or relative clause; vary sentence structure to use action verbs instead of "is/was"; combine short sentences to drop period count. Three to five rewrites across a paragraph usually levels the distribution.
62
+
63
+ ### 4. Sentence-length variance
64
+
65
+ **What to measure:** compute standard deviation of sentence length (in words) across the output. If the user has a `voice/stats.md`, compare to the user's own σ. Otherwise compare to baseline σ ≥ 8 words.
66
+
67
+ **Threshold:** flag if σ < 6 words (low variance — uniform sentence length is an AI signature).
68
+
69
+ **Action:** locate runs of similar-length sentences. Merge two short ones into a longer compound, or split a medium one. The goal is to restore length variance, not hit a specific number.
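For the variance check, a sketch using the standard library's `statistics.pstdev` is shown here. Using the population standard deviation rather than the sample one is an assumption, since the source does not say which is meant; the function names are hypothetical.

```python
import re
import statistics


def sentence_length_sigma(text: str) -> float:
    """Population standard deviation of sentence length in words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # a single sentence has no spread
    return statistics.pstdev(lengths)


def low_variance(text: str, threshold: float = 6.0) -> bool:
    """Flag uniform sentence length (sigma below the threshold)."""
    return sentence_length_sigma(text) < threshold
```

When `voice/stats.md` supplies the user's own σ, that value could be compared alongside the fixed threshold.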
70
+
71
+ ## Lexical watch list
72
+
73
+ Mechanical word-level scrubs. The minion doesn't see these — the audit handles them on the way out.
74
+
75
+ ### GPT-5 specific over-used tokens (DFT 2026)
76
+
77
+ When the minion is a GPT-5-class model, these tokens are inflated vs human writing. Scan for them:
78
+
79
+ | Token | Inflation vs human | Human baseline |
80
+ |---|---|---|
81
+ | `corridors` | +45.2% | 0.1% |
82
+ | `norms` | +43.1% | 0.1% |
83
+ | `align` / `aligns` / `alignment` | +36.0% | 0.2% |
84
+ | `metrics` | +27.2% | 0.2% |
85
+ | `engagement` | +26.5% | 0.2% |
86
+ | `targeted` | +5.1% | 1.6% |
87
+ | `identity` | +5.0% | 1.0% |
88
+ | `trust` | +4.9% | 1.2% |
89
+
90
+ **Action:** swap to a context-appropriate alternative when the word appears in surplus (3+ uses in a short piece, OR any use in a context where the word feels generic). If the user's corpus contains the word at signature frequency (in `voice/stats.md` or `voice/never-rules.md` exempts), leave it — they own that word.
91
+
92
+ ### Named-character defaults
93
+
94
+ AI defaults to specific generated names in fiction. Known examples:
95
+
96
+ - `Elara Voss` — documented in OpenAI's "goblin problem"
97
+ - Add new defaults as documented.
98
+
99
+ **Action:** if found in fiction output without explicit user specification, rename to something contextually appropriate or to a name the user has used in their corpus.
100
+
101
+ ## Source
102
+
103
+ Distribution thresholds and over-use rates from "Fixing LLM Writing with Distribution Fine-Tuning," Rosmine 2026 (https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distribution-fine-tuning/). Token over-use rates measured against 14B SFT vs human fineweb baseline. Sentence-opener repetition methodology: percent of texts containing 3+ consecutive sentences starting with the same first word.
104
+
105
+ This file is a living checklist. New research that surfaces measurable thresholds for AI-vs-human writing belongs here, not in `voice/never-rules.md` — the writing pass stays lean.
@@ -0,0 +1,31 @@
1
+ # Analysis Protocol
2
+
3
+ Regenerates the voice files from the corpus. Run any time the corpus changes (new samples added, samples removed, samples revised). Loaded only when triggered — not in context during normal writing sessions.
4
+
5
+ ## Protocol
6
+
7
+ 1. **Read inputs.** Concatenate every file in `voice/corpus/` (strip frontmatter). Count words. Read `catalog/ai-tells.md`, `catalog/fingerprints.md`, `catalog/hurdle.md`.
8
+
9
+ 2. **Compute deterministic tally** (best effort — counts may drift ±1 on long corpora):
10
+ - **Sentence distribution**: split on `[.!?]\s`, compute short/medium/long/very-long percentages, average length. Set `short_max` (25th-pct, clamped [6,12]) and `long_min` (75th-pct, clamped [18,28]). Do NOT emit a sentence-length cap in the apply directive — the corpus distribution carries the right ceiling and an arbitrary cap suppresses signature long sentences.
11
+ - **Punctuation density per 1k words** for em/en dash, colon, semicolon, question, exclamation, ellipsis, paren, bracket, straight/curly quotes. Categorize as `never` / `rare` / `low` / `strong`.
12
+ - **AI-tell tally**: count each item from `catalog/ai-tells.md`. Apply hurdle from `catalog/hurdle.md`: passes hurdle → preserve; fails → emit NEVER rule; below-hurdle but present → log to `below_hurdle`.
13
+ - **Fingerprints**: apply each detector from `catalog/fingerprints.md` with its decision rule.
14
+
15
+ 3. **Determine tier** by word count: <300 = 0 Empty; 300-999 = 1 Anchor; 1000-4999 = 2 Preliminary; 5000-19999 = 3 Full Coverage; ≥20000 = 4 AV-Grade. See `docs/tiers.md` for what unlocks at each tier.
16
+
17
+ 4. **Write `voice/stats.md`** — corpus stats, sentence distribution table, punctuation density table.
18
+
19
+ 5. **Write `voice/never-rules.md`** — preserve `## Manual Additions` section verbatim (anchored to start-of-line; the literal also appears in the intro blockquote — naive search will mis-grab it).
20
+
21
+ 6. **Write `voice/fingerprints.md`** — preserve `## Manual Overrides` section, same caution.
22
+
23
+ 7. **Write `voice/status.md`** — tier, words, active features, locked features, next milestone, file list, below-hurdle detections.
24
+
25
+ 8. **Report** — new tier, what changed in NEVER rules, what's locked next.
26
+
27
+ For corpora >10k words, count in passes (words → phrases → transitions) rather than tracking 60 counters at once.
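The tier boundaries in step 3 map directly onto a lookup like the one below. It is a sketch only; the function name `tier_for` and the tuple return value are assumptions.

```python
def tier_for(words: int) -> tuple:
    """Map corpus word count to (tier number, tier name) per step 3."""
    if words < 300:
        return 0, "Empty"
    if words < 1000:
        return 1, "Anchor"
    if words < 5000:
        return 2, "Preliminary"
    if words < 20000:
        return 3, "Full Coverage"
    return 4, "AV-Grade"
```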
28
+
29
+ ## Adding Samples Later
30
+
31
+ When the user says "add this to my voice profile" or pastes new writing, append it to the next `voice/corpus/sample-NNN.md`, re-run the Analysis Protocol, and report any tier change.