@houtini/voice-analyser 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +144 -461
- package/dist/analyzers/clustering.d.ts +36 -0
- package/dist/analyzers/clustering.d.ts.map +1 -0
- package/dist/analyzers/clustering.js +218 -0
- package/dist/analyzers/clustering.js.map +1 -0
- package/dist/analyzers/detection-risk.d.ts +28 -0
- package/dist/analyzers/detection-risk.d.ts.map +1 -0
- package/dist/analyzers/detection-risk.js +237 -0
- package/dist/analyzers/detection-risk.js.map +1 -0
- package/dist/analyzers/expression-markers.d.ts +57 -0
- package/dist/analyzers/expression-markers.d.ts.map +1 -0
- package/dist/analyzers/expression-markers.js +235 -0
- package/dist/analyzers/expression-markers.js.map +1 -0
- package/dist/analyzers/lexical-diversity.d.ts +36 -0
- package/dist/analyzers/lexical-diversity.d.ts.map +1 -0
- package/dist/analyzers/lexical-diversity.js +120 -0
- package/dist/analyzers/lexical-diversity.js.map +1 -0
- package/dist/analyzers/syntactic-patterns.d.ts +47 -0
- package/dist/analyzers/syntactic-patterns.d.ts.map +1 -0
- package/dist/analyzers/syntactic-patterns.js +186 -0
- package/dist/analyzers/syntactic-patterns.js.map +1 -0
- package/dist/tools/analyze-corpus.d.ts.map +1 -1
- package/dist/tools/analyze-corpus.js +119 -0
- package/dist/tools/analyze-corpus.js.map +1 -1
- package/dist/tools/generate-enhanced-guide.d.ts +21 -1
- package/dist/tools/generate-enhanced-guide.d.ts.map +1 -1
- package/dist/tools/generate-enhanced-guide.js +193 -2
- package/dist/tools/generate-enhanced-guide.js.map +1 -1
- package/dist/utils/advanced-statistics.d.ts +81 -0
- package/dist/utils/advanced-statistics.d.ts.map +1 -0
- package/dist/utils/advanced-statistics.js +162 -0
- package/dist/utils/advanced-statistics.js.map +1 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -1,566 +1,249 @@
- # Voice
+ # Voice Analyser

-
+ [](https://www.npmjs.com/package/@houtini/voice-analyser)
+ [](https://opensource.org/licenses/MIT)

- Extract
+ Extract statistical voice models from your published writing. Generate LLM-optimized style guides that actually replicate how you write.

-
-
- ## What This Does
-
- Analyzes your published writing corpus (blog posts, articles) to create **statistical voice models** that LLMs can use to replicate your authentic voice. No more subjective "does this sound like me?" - measure it.
-
- **Real results:**
- - 90% first-pass acceptance (up from 60% with generic style guides)
- - 55 minutes saved per article (35 min vs 90 min with rewrites)
- - AI cliché detection in YOUR writing (patterns you didn't know you had)
- - Function word fingerprints (z-scores show over-use/avoidance patterns)
-
- ---
-
- ## Quick Start
-
- ### Installation
+ ## Installation

-
- cd C:\dev\content-machine\mcp-server-voice-analysis
- npm install
- npm run build
- ```
-
- ### Add to Claude Desktop
+ ### Claude Desktop

- Add to `claude_desktop_config.json`:
+ Add to your `claude_desktop_config.json`:

  ```json
  {
    "mcpServers": {
      "voice-analysis": {
-       "command": "
-       "args": [
-         "C:/dev/content-machine/mcp-server-voice-analysis/dist/index.js"
-       ]
+       "command": "npx",
+       "args": ["-y", "@houtini/voice-analyser@latest"]
      }
    }
  }
  ```

-
+ **Config locations:**
+ - Windows: `%APPDATA%\Claude\claude_desktop_config.json`
+ - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
+ - Linux: `~/.config/Claude/claude_desktop_config.json`

-
+ Restart Claude Desktop after saving.

-
- ```typescript
- voice-analysis:collect_corpus({
-   sitemap_url: "https://yoursite.com/post-sitemap.xml",
-   output_name: "your-name",
-   max_articles: 50
- })
- ```
+ ### Requirements

-
- - XML sitemaps
- - RSS/Atom feeds
- - Individual URLs (via Firecrawl integration)
-
- **2. Analyze linguistic patterns:**
- ```typescript
- voice-analysis:analyze_corpus({
-   corpus_name: "your-name",
-   analysis_type: "full"
- })
- ```
+ - Node.js 20+

-
- ```typescript
- voice-analysis:generate_enhanced_guide({
-   corpus_name: "your-name",
-   output_format: "llm"
- })
- ```
+ ## What It Does

-
+ Analyses your published writing to create statistical voice models. No subjective "does this sound like me?" - measure it.

-
+ **Captures:**
+ - Sentence length distributions (not averages - the full histogram)
+ - Function word fingerprints (z-scores showing your over-use/avoidance patterns)
+ - First-person density and hedging patterns
+ - Punctuation habits and British/American markers
+ - N-gram patterns at character, word, and part-of-speech levels
+ - AI clichés already present in YOUR writing

-
+ **Output:** 20,000-25,000 word statistical model ready for Claude to load as context.

-
-
- **Sentence patterns:**
- - Length distribution (not just average - the entire histogram)
- - Syntactic structures (how you start sentences, common modifications)
- - Sentence openers (frequency of "I", "The", "But", etc.)
+ ## Quick Start

-
- - Z-scores comparing your usage to general English
- - Over-use patterns (distinctive markers)
- - Avoidance patterns (words you rarely use)
+ ### 1. Collect Your Writing

- Example from real analysis:
  ```
-
- "was": z = -2.46 (highly avoided - prefer active voice)
- "of": z = -2.49 (avoided - use direct constructions)
+ Collect corpus from https://yoursite.com/post-sitemap.xml as "your-name"
  ```

-
- - First-person density (0.6 per 100 words typical for authority voice)
- - Hedging language ("I think", "seems to", "pretty much")
- - British vs American English patterns
- - Equipment specificity patterns ("my Simucube 2 Pro" not "a wheelbase")
-
- **Anti-patterns detected:**
- - AI clichés in YOUR corpus ("delve", "leverage", "unlock")
- - Marketing speak patterns
- - Generic references vs specific products
-
- **Punctuation fingerprints:**
- - Comma density (0.6-0.8 per sentence)
- - Exclamation usage (5-8 per 1000 words for genuine enthusiasm)
- - Quotation style (British double quotes)
- - Dash preference patterns
-
- ### N-Gram Analysis (Enhanced Mode)
-
- **Character n-grams:**
- - Contraction patterns (`'s `, `'t `, `'ll `)
- - Punctuation combinations
- - Unique character sequences
-
- **Word n-grams:**
- - Phrase patterns (2-4 word sequences)
- - Transitional phrases ("but I", "but it", "but the")
- - Signature combinations
+ Works with XML sitemaps, RSS feeds, or individual URLs. Collects up to 100 articles by default.

-
- - Syntactic patterns (ADJ NOUN, DET ADJ NOUN)
- - Sentence structure fingerprints
- - Grammatical constructions
-
- ---
-
- ## Output Format
-
- ### Generated Files
+ ### 2. Analyse Patterns

  ```
- corpus
- └── your-name/
-     ├── articles/                 # Collected markdown
-     │   ├── 001-article-title.md
-     │   └── 002-another-article.md
-     ├── corpus.json               # Metadata
-     └── analysis/                 # Analysis outputs
-         ├── vocabulary.json
-         ├── sentence.json
-         ├── voice.json
-         ├── function-words.json
-         ├── character-ngrams.json  # Enhanced mode
-         ├── word-ngrams.json       # Enhanced mode
-         └── pos-ngrams.json        # Enhanced mode
-
- templates/
- └── writing_style_your-name.md    # LLM-optimized guide (25k words)
+ Analyse corpus "your-name"
  ```

-
+ Generates statistical analysis of vocabulary, sentence structure, voice markers, and function word usage.

-
+ ### 3. Generate Voice Model

-
-
-
- 4. **Vocabulary** - Domain-specific terms, British English markers
- 5. **Punctuation Patterns** - Density targets, style preferences
- 6. **Function Word Fingerprint** - Z-scores, over-use/avoidance patterns
- 7. **Transitional Phrases** - Connectives, discourse markers
- 8. **Anti-Patterns** - AI clichés to eliminate (detected in YOUR writing)
- 9. **Annotated Examples** - Good vs bad examples with pattern analysis
- 10. **Validation Checklist** - Concrete pass/fail criteria
+ ```
+ Generate enhanced guide for "your-name"
+ ```

-
+ Creates an LLM-optimized style guide with concrete targets and examples.

- ##
+ ## Usage Examples

- ###
+ ### Collect from Different Sources

- **
+ **XML Sitemap:**
  ```
-
+ Collect corpus from https://example.com/post-sitemap.xml as "writer-name" with max 50 articles
  ```

-
-
-
-
- - First-person count (target: 5+ statements)
- - Sentence length distribution (15-21 ± 11-18 words)
- - British English (100%)
- - AI clichés (0)- Equipment specificity (named models, not generic)
- - Zero marketing speak
+ **RSS Feed:**
+ ```
+ Collect corpus from https://example.com/feed/ as "writer-name"
+ ```

- **
-
-
-
+ **Filter by URL pattern:**
+ ```
+ Collect corpus from https://example.com/sitemap.xml as "writer-name" filtering URLs matching "blog"
+ ```

- ###
+ ### Analysis Options

-
+ **Full analysis (recommended):**
+ ```
+ Analyse corpus "writer-name" with full analysis
+ ```

-
-
-
-
- collect_corpus({ url: "https://company.com/author/", output_name: "writer-corporate" })
+ **Quick iteration:**
+ ```
+ Analyse corpus "writer-name" with quick analysis
+ ```

-
- analyze_corpus({ corpus_name: "writer-tech" })
- analyze_corpus({ corpus_name: "writer-personal" })
- analyze_corpus({ corpus_name: "writer-corporate" })
+ ### Output Formats

-
-
+ **LLM-optimized (for Claude context):**
+ ```
+ Generate enhanced guide for "writer-name" in llm format
  ```

- **
-
-
-
-
- The analysis captures these as appropriate variations, not errors.
+ **Human-readable overview:**
+ ```
+ Generate enhanced guide for "writer-name" in human format
+ ```

-
+ **Both formats:**
+ ```
+ Generate enhanced guide for "writer-name" in both formats
+ ```

- ##
+ ## What Gets Measured

  ### Function Word Stylometry

- Z-scores reveal unconscious
+ Z-scores reveal unconscious patterns:

  | Z-Score | Meaning | Example |
  |---------|---------|---------|
- | +2.0+ | Highly distinctive
- | +1.0 to +2.0 | Distinctive
+ | +2.0+ | Highly distinctive | "whilst" +5.7 (British marker) |
+ | +1.0 to +2.0 | Distinctive | "you" +1.75 (direct engagement) |
  | -1.0 to +1.0 | Normal range | Typical usage |
- | -
- | -2.0- | Highly avoided
-
- **Why this matters:** These patterns are invisible to you whilst writing but glaringly obvious when absent. That's why AI content feels "off" even when grammatically perfect.
-
- ### AI Cliché Detection
-
- Analyzes YOUR corpus for overused AI-generated phrases:
-
- **Detected patterns:**
- - "dive into" (outlier frequency)
- - "unlock" (appears unnaturally)
- - "leverage", "seamless", "robust" (if present)
+ | -2.0 to -1.0 | Avoided | "the" -1.48 (prefer specific) |
+ | -2.0- | Highly avoided | "was" -2.46 (avoid passive) |

-
+ These patterns are invisible whilst writing but obvious when absent. That's why AI content feels "off" even when grammatically correct.

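The z-scores in this table follow the standard stylometric formula that the 1.0.0 README spelled out: z = (observed_frequency - reference_mean) / reference_stddev, computed over per-1,000-word frequencies. A minimal TypeScript sketch of that calculation, using hypothetical reference statistics rather than the package's actual data:

```typescript
// Sketch of function-word z-scoring. The reference values below are
// hypothetical, for illustration only - not the package's reference data.
interface ReferenceStat {
  mean: number;   // mean frequency per 1,000 words in general English
  stddev: number; // standard deviation in general English
}

const REFERENCE: Record<string, ReferenceStat> = {
  was: { mean: 9.2, stddev: 2.1 },
  whilst: { mean: 0.1, stddev: 0.08 },
};

function zScore(word: string, count: number, totalWords: number): number {
  const ref = REFERENCE[word];
  if (!ref) throw new Error(`No reference data for "${word}"`);
  const observed = (count / totalWords) * 1000; // frequency per 1,000 words
  return (observed - ref.mean) / ref.stddev;
}

// e.g. 47 occurrences of "was" in a 10,000-word corpus -> strongly avoided
console.log(zScore("was", 47, 10_000).toFixed(2)); // "-2.14"
```
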
263
|
-
###
|
|
140
|
+
### Sentence Patterns
|
|
264
141
|
|
|
265
|
-
|
|
142
|
+
- Length distribution with variance (not just average)
|
|
143
|
+
- Sentence openers frequency ("I", "The", "But", etc.)
|
|
144
|
+
- Syntactic structures and modifications
|
|
266
145
|
|
|
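Capturing the distribution rather than the average is a small computation; a rough TypeScript sketch (naive sentence splitting for illustration, not the package's actual tokenizer):

```typescript
// Sketch: sentence-length histogram plus mean and standard deviation.
// Splits naively on ., !, ? - the real analyzer is more careful.
function sentenceLengthStats(text: string) {
  const sentences = text
    .split(/[.!?]+/)
    .map((s) => s.trim())
    .filter(Boolean);
  if (sentences.length === 0) return null;

  const lengths = sentences.map((s) => s.split(/\s+/).length);

  const histogram = new Map<number, number>(); // length -> count
  for (const len of lengths) {
    histogram.set(len, (histogram.get(len) ?? 0) + 1);
  }

  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance =
    lengths.reduce((sum, len) => sum + (len - mean) ** 2, 0) / lengths.length;

  return { histogram, mean, stddev: Math.sqrt(variance) };
}
```
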
267
|
-
|
|
268
|
-
- Contraction usage (`'s ` appears 6 times)
|
|
269
|
-
- Punctuation combinations
|
|
270
|
-
- Unique sequences
|
|
146
|
+
### Voice Markers
|
|
271
147
|
|
|
272
|
-
|
|
273
|
-
-
|
|
274
|
-
-
|
|
275
|
-
- "I think" (18 uses - hedging pattern)
|
|
276
|
-
|
|
277
|
-
**POS-level patterns:**
|
|
278
|
-
- DET NOUN (1242 times): the metrics, a domain
|
|
279
|
-
- ADJ NOUN (874 times): international markets, new website
|
|
280
|
-
- PRON AUX (735 times): I 'd, It 's
|
|
281
|
-
|
|
282
|
-
**Purpose:** Capture syntactic DNA that generic grammar rules miss.
|
|
283
|
-
|
|
284
|
-
---
|
|
285
|
-
|
|
286
|
-
## Requirements
|
|
148
|
+
- First-person density (0.6 per 100 words typical for authority voice)
|
|
149
|
+
- Hedging language frequency ("I think", "seems to")
|
|
150
|
+
- Equipment specificity ("my Simucube 2 Pro" vs "a wheelbase")
|
|
287
151
|
|
|
288
|
-
###
|
|
152
|
+
### N-Gram Analysis
|
|
289
153
|
|
|
290
|
-
**
|
|
291
|
-
-
|
|
292
|
-
-
|
|
293
|
-
-
|
|
154
|
+
**Word patterns:**
|
|
155
|
+
- "but I" (contrast marker)
|
|
156
|
+
- "in my" (authority phrase)
|
|
157
|
+
- Transitional phrases
|
|
294
158
|
|
|
295
|
-
**
|
|
159
|
+
**POS patterns:**
|
|
160
|
+
- DET NOUN (the metrics)
|
|
161
|
+
- ADJ NOUN (international markets)
|
|
162
|
+
- Syntactic DNA
|
|
296
163
|
|
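The word-level patterns listed above come down to sliding-window counting over tokens. A minimal TypeScript sketch of bigram extraction, illustrative rather than the package's implementation:

```typescript
// Sketch: count word bigrams such as "but I" / "in my".
function wordBigrams(text: string): Map<string, number> {
  // Lowercase and keep only word-like tokens (letters and apostrophes).
  const tokens = text.toLowerCase().match(/[a-z']+/g) ?? [];
  const counts = new Map<string, number>();
  for (let i = 0; i < tokens.length - 1; i++) {
    const gram = `${tokens[i]} ${tokens[i + 1]}`;
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Usage: sort by count to surface signature phrases.
const top = [...wordBigrams("But I think, in my experience, but I digress.")]
  .sort((a, b) => b[1] - a[1]);
console.log(top.slice(0, 5));
```
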
-
+ ### Anti-Patterns

-
+ Detects AI clichés in YOUR corpus:
+ - "delve", "leverage", "unlock"
+ - Marketing speak patterns
+ - Generic references

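Detection of this kind is essentially frequency counting against a phrase list. A small TypeScript sketch, with an illustrative cliché list rather than the package's actual one:

```typescript
// Sketch: flag AI-cliché phrases by occurrence count.
// This phrase list is illustrative, not the package's actual list.
const AI_CLICHES = ["delve", "leverage", "unlock", "seamless", "robust"];

function clicheCounts(text: string): Record<string, number> {
  const lower = text.toLowerCase();
  const counts: Record<string, number> = {};
  for (const phrase of AI_CLICHES) {
    const matches = lower.match(new RegExp(`\\b${phrase}\\b`, "g"));
    if (matches) counts[phrase] = matches.length; // only report hits
  }
  return counts;
}

console.log(clicheCounts("Let's delve into how we leverage this."));
// { delve: 1, leverage: 1 }
```
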
-
- - Single author (no guest posts or collaborative writing)
- - Consistent genre/domain (all technical, or all personal - or analyze separately)
- - Recent writing (voice evolves - re-analyze quarterly)
- - Published content (avoid unpublished drafts with incomplete editing)
+ ## Output Structure

-
+ ```
+ corpus/
+ └── your-name/
+     ├── articles/       # Collected markdown
+     ├── corpus.json     # Metadata
+     └── analysis/       # JSON analysis files

-
+ templates/
+ └── writing_style_your-name.md   # LLM-optimized guide
+ ```

- ##
+ ## Tools Reference

  ### collect_corpus

-
-
-
-
-
-
-   output_name: string;        // Corpus identifier (e.g., "john-smith")
-   max_articles?: number;      // Limit articles to collect (default: 100)
-   article_pattern?: string;   // Optional regex filter for URLs
- }
- ```
-
- **Output:**
- - Creates `corpus/{output_name}/` directory
- - Saves articles as clean markdown
- - Generates `corpus.json` with metadata
-
- **Cleaning process:**
- - Strips HTML, navigation, ads, comments
- - Preserves article prose only
- - Normalizes whitespace and formatting
+ | Parameter | Required | Description |
+ |-----------|----------|-------------|
+ | `sitemap_url` | Yes | XML sitemap, RSS feed, or URL |
+ | `output_name` | Yes | Corpus identifier |
+ | `max_articles` | No | Limit (default: 100) |
+ | `article_pattern` | No | Regex filter for URLs |

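Outside Claude Desktop, these parameters map onto an ordinary MCP tool call. A sketch using the `@modelcontextprotocol/sdk` client, assuming the current SDK API and hypothetical argument values; the server is spawned the same way as in the Claude Desktop config above:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over stdio, as Claude Desktop does.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@houtini/voice-analyser@latest"],
});

const client = new Client({ name: "voice-analyser-example", version: "1.0.0" });
await client.connect(transport);

// Arguments mirror the collect_corpus parameter table above.
const result = await client.callTool({
  name: "collect_corpus",
  arguments: {
    sitemap_url: "https://example.com/post-sitemap.xml",
    output_name: "writer-name",
    max_articles: 50,
  },
});
console.log(result);
```
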
337
195
|
### analyze_corpus
|
|
338
196
|
|
|
339
|
-
|
|
340
|
-
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
{
|
|
344
|
-
corpus_name: string; // Name from collect_corpus
|
|
345
|
-
analysis_type: "full" | "quick" | "vocabulary" | "syntax";
|
|
346
|
-
}
|
|
347
|
-
```
|
|
348
|
-
|
|
349
|
-
**Analysis types:**
|
|
350
|
-
- **full**: Complete analysis (recommended)
|
|
351
|
-
- **quick**: Fast iteration during testing
|
|
352
|
-
- **vocabulary**: Word frequency only
|
|
353
|
-
- **syntax**: Sentence structure only
|
|
354
|
-
|
|
355
|
-
**Output files in `corpus/{name}/analysis/`:**
|
|
356
|
-
- vocabulary.json
|
|
357
|
-
- sentence.json
|
|
358
|
-
- voice.json
|
|
359
|
-
- paragraph.json
|
|
360
|
-
- punctuation.json
|
|
361
|
-
- function-words.json
|
|
362
|
-
- function-words-summary.md (human-readable)
|
|
197
|
+
| Parameter | Required | Description |
|
|
198
|
+
|-----------|----------|-------------|
|
|
199
|
+
| `corpus_name` | Yes | Name from collect_corpus |
|
|
200
|
+
| `analysis_type` | No | full, quick, vocabulary, syntax |
|
|
363
201
|
|
|
364
202
|
### generate_enhanced_guide
|
|
365
203
|
|
|
366
|
-
|
|
367
|
-
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
{
|
|
371
|
-
corpus_name: string;
|
|
372
|
-
output_format: "llm" | "human" | "both";
|
|
373
|
-
}
|
|
374
|
-
```
|
|
375
|
-
|
|
376
|
-
**Output:**
|
|
377
|
-
- **llm**: Optimized for AI consumption (25k words, statistical targets)
|
|
378
|
-
- **human**: Readable overview for writers
|
|
379
|
-
- **both**: Generates both formats
|
|
380
|
-
|
|
381
|
-
**Saved to:** `templates/writing_style_{corpus_name}.md`
|
|
204
|
+
| Parameter | Required | Description |
|
|
205
|
+
|-----------|----------|-------------|
|
|
206
|
+
| `corpus_name` | Yes | Name from analyze_corpus |
|
|
207
|
+
| `output_format` | No | llm, human, both |
|
|
382
208
|
|
|
383
|
-
|
|
209
|
+
### generate_tov_guide
|
|
384
210
|
|
|
385
|
-
|
|
211
|
+
Legacy basic guide generation. Use `generate_enhanced_guide` for production.
|
|
386
212
|
|
|
387
|
-
|
|
213
|
+
## Minimum Corpus Size
|
|
388
214
|
|
|
389
|
-
**
|
|
215
|
+
- **Minimum:** 15,000 words (20 articles)
|
|
216
|
+
- **Recommended:** 30,000 words
|
|
217
|
+
- **Ideal:** 50,000+ words
|
|
390
218
|
|
|
391
|
-
|
|
392
|
-
|
|
393
|
-
---
|
|
219
|
+
Below 15k words, you're measuring noise, not signal.
|
|
394
220
|
|
|
395
221
|
## Troubleshooting
|
|
396
222
|
|
|
397
|
-
|
|
398
|
-
|
|
399
|
-
**Solution:**
|
|
400
|
-
1. Run `collect_corpus` first
|
|
401
|
-
2. Check corpus name matches exactly (case-sensitive)
|
|
402
|
-
3. Verify `corpus/` directory exists in project root
|
|
403
|
-
|
|
404
|
-
### "Not enough data for reliable analysis"
|
|
405
|
-
|
|
406
|
-
**Solution:**
|
|
407
|
-
1. Collect more articles (minimum 20 articles, 15,000 words)
|
|
408
|
-
2. Check articles aren't empty after HTML stripping
|
|
409
|
-
3. Verify sitemap URL is accessible
|
|
410
|
-
|
|
411
|
-
### "Z-scores all near zero"
|
|
412
|
-
|
|
413
|
-
**Interpretation:** Indicates very typical English usage - not necessarily wrong
|
|
414
|
-
|
|
415
|
-
**Causes:**
|
|
416
|
-
- Generic corporate content (averaged voice)
|
|
417
|
-
- Mixed authorship (multiple writers)
|
|
418
|
-
- AI-edited content (stripped of distinctive patterns)
|
|
419
|
-
|
|
420
|
-
**Solution:** Collect more distinctive personal writing or domain-specific content
|
|
223
|
+
**"No corpus found"**
|
|
224
|
+
Run collect_corpus first. Check name matches exactly (case-sensitive).
|
|
421
225
|
|
|
422
|
-
|
|
226
|
+
**"Not enough data"**
|
|
227
|
+
Collect more articles. Minimum 20 articles, 15,000 words.
|
|
423
228
|
|
|
424
|
-
**
|
|
425
|
-
|
|
426
|
-
**Solution:**
|
|
427
|
-
- Re-analyze quarterly
|
|
428
|
-
- Focus on recent articles (filter by date)
|
|
429
|
-
- Document which corpus articles best represent current voice
|
|
430
|
-
|
|
431
|
-
---
|
|
229
|
+
**"Z-scores all near zero"**
|
|
230
|
+
Indicates typical English usage. May mean mixed authorship or AI-edited content.
|
|
432
231
|
|
|
433
232
|
## Development
|
|
434
233
|
|
|
435
|
-
### Build from Source
|
|
436
|
-
|
|
437
234
|
```bash
|
|
438
|
-
git clone https://github.com/
|
|
439
|
-
cd
|
|
235
|
+
git clone https://github.com/houtini-ai/voice-analyser-mcp.git
|
|
236
|
+
cd voice-analyser-mcp
|
|
440
237
|
npm install
|
|
441
238
|
npm run build
|
|
442
239
|
```
|
|
443
240
|
|
|
444
|
-
|
|
445
|
-
|
|
446
|
-
```
|
|
447
|
-
src/
|
|
448
|
-
├── index.ts # MCP server entry point
|
|
449
|
-
├── tools/ # MCP tool implementations
|
|
450
|
-
│ ├── collect.ts
|
|
451
|
-
│ ├── analyze.ts
|
|
452
|
-
│ └── generate.ts
|
|
453
|
-
├── analyzers/ # Linguistic analysis modules
|
|
454
|
-
│ ├── vocabulary.ts
|
|
455
|
-
│ ├── sentence.ts
|
|
456
|
-
│ ├── function-words.ts
|
|
457
|
-
│ ├── character-ngrams.ts
|
|
458
|
-
│ ├── word-ngrams.ts
|
|
459
|
-
│ └── pos-ngrams.ts
|
|
460
|
-
├── utils/ # Shared utilities
|
|
461
|
-
│ ├── cleaner.ts
|
|
462
|
-
│ ├── tokenizer.ts
|
|
463
|
-
│ └── stats.ts
|
|
464
|
-
└── reference/ # Reference data
|
|
465
|
-
└── english-reference.ts # Function word norms
|
|
466
|
-
|
|
467
|
-
dist/ # Compiled JavaScript output
|
|
468
|
-
corpus/ # Collected writing samples
|
|
469
|
-
templates/ # Generated voice models
|
|
470
|
-
```
|
|
471
|
-
|
|
472
|
-
### Dependencies
|
|
473
|
-
|
|
474
|
-
**Core:**
|
|
475
|
-
- `@modelcontextprotocol/sdk` - MCP protocol implementation
|
|
476
|
-
- `compromise` - Natural language processing
|
|
477
|
-
- `cheerio` - HTML parsing for content extraction
|
|
478
|
-
- `fast-xml-parser` - Sitemap and RSS parsing
|
|
479
|
-
|
|
480
|
-
**Analysis:**
|
|
481
|
-
- Function word reference data (50 most common English function words)
|
|
482
|
-
- Part-of-speech tagging
|
|
483
|
-
- N-gram extraction (character, word, POS)
|
|
484
|
-
|
|
485
|
-
---
|
|
486
|
-
|
|
487
|
-
## Technical Details
|
|
488
|
-
|
|
489
|
-
### Statistical Methods
|
|
490
|
-
|
|
491
|
-
**Z-Score Calculation:**
|
|
492
|
-
```
|
|
493
|
-
z = (observed_frequency - reference_mean) / reference_stddev
|
|
494
|
-
```
|
|
495
|
-
|
|
496
|
-
Where:
|
|
497
|
-
- observed_frequency = word count per 1000 words in your corpus
|
|
498
|
-
- reference_mean = average frequency in general English
|
|
499
|
-
- reference_stddev = standard deviation in general English
|
|
500
|
-
|
|
501
|
-
**Interpretation:** Z-scores create a statistical fingerprint. Replicating patterns by chance is astronomically unlikely.
|
|
502
|
-
|
|
503
|
-
### N-Gram Extraction
|
|
504
|
-
|
|
505
|
-
**Character n-grams:** Sequences of 2-4 characters
|
|
506
|
-
- Captures contractions, punctuation patterns
|
|
507
|
-
- Example: `'s ` (possessive), `n't ` (negation)
|
|
508
|
-
|
|
509
|
-
**Word n-grams:** Sequences of 2-4 words
|
|
510
|
-
- Captures phrase patterns, transitional markers
|
|
511
|
-
- Example: "but I think", "in my opinion"
|
|
512
|
-
|
|
513
|
-
**POS n-grams:** Sequences of 2-4 part-of-speech tags
|
|
514
|
-
- Captures syntactic structure
|
|
515
|
-
- Example: DET ADJ NOUN ("the big dog")
|
|
516
|
-
|
|
517
|
-
**Purpose:** These patterns encode voice at multiple levels - from character quirks to sentence structure DNA.
|
|
518
|
-
|
|
519
|
-
---
|
|
520
|
-
|
|
521
|
-
## Roadmap
|
|
522
|
-
|
|
523
|
-
### Current Status (v1.0)
|
|
524
|
-
- ✅ Corpus collection (sitemaps, RSS, URLs)
|
|
525
|
-
- ✅ Full linguistic analysis
|
|
526
|
-
- ✅ Function word stylometry
|
|
527
|
-
- ✅ Enhanced n-gram analysis
|
|
528
|
-
- ✅ LLM-optimized guide generation
|
|
529
|
-
- ✅ Anti-pattern detection
|
|
530
|
-
|
|
531
|
-
### Planned Features
|
|
532
|
-
- **Real-time validation**: API endpoint for live content checking
|
|
533
|
-
- **Voice drift detection**: Alert when published content deviates from model
|
|
534
|
-
- **Multi-author analysis**: Team voice harmonization
|
|
535
|
-
- **Competitive analysis**: Analyze competitor voices for differentiation
|
|
536
|
-
- **Delta distance scoring**: Automated authorship verification
|
|
537
|
-
|
|
538
|
-
---
|
|
539
|
-
|
|
540
|
-
## License
|
|
541
|
-
|
|
542
|
-
MIT
|
|
543
|
-
|
|
544
|
-
---
|
|
545
|
-
|
|
546
|
-
## Citation
|
|
547
|
-
|
|
548
|
-
If you use this tool in research or commercial projects:
|
|
549
|
-
|
|
550
|
-
```
|
|
551
|
-
Voice Analysis MCP Server (2025)
|
|
552
|
-
Statistical voice modeling for authentic AI content generation
|
|
553
|
-
https://github.com/yourusername/mcp-server-voice-analysis
|
|
554
|
-
```
|
|
555
|
-
|
|
556
|
-
---
|
|
241
|
+
## Research Foundation
|
|
557
242
|
|
|
558
|
-
|
|
243
|
+
The function word stylometry approach draws from computational authorship analysis research. Z-score comparisons against reference English corpora create statistical fingerprints that are difficult to replicate by chance.
|
|
559
244
|
|
|
560
|
-
|
|
561
|
-
**Documentation:** See `QUICKSTART.md` for detailed workflow examples
|
|
562
|
-
**Research:** See `research/` directory for technical background
|
|
245
|
+
Key insight: function words (the, of, and, to) are used unconsciously and form stable individual patterns. Content words vary by topic; function words reveal the author.
|
|
563
246
|
|
|
564
247
|
---
|
|
565
248
|
|
|
566
|
-
|
|
249
|
+
MIT License - [Houtini.ai](https://houtini.ai)
|