@houtini/voice-analyser 1.0.0 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. package/README.md +144 -461
  2. package/dist/analyzers/clustering.d.ts +36 -0
  3. package/dist/analyzers/clustering.d.ts.map +1 -0
  4. package/dist/analyzers/clustering.js +218 -0
  5. package/dist/analyzers/clustering.js.map +1 -0
  6. package/dist/analyzers/detection-risk.d.ts +28 -0
  7. package/dist/analyzers/detection-risk.d.ts.map +1 -0
  8. package/dist/analyzers/detection-risk.js +237 -0
  9. package/dist/analyzers/detection-risk.js.map +1 -0
  10. package/dist/analyzers/expression-markers.d.ts +57 -0
  11. package/dist/analyzers/expression-markers.d.ts.map +1 -0
  12. package/dist/analyzers/expression-markers.js +235 -0
  13. package/dist/analyzers/expression-markers.js.map +1 -0
  14. package/dist/analyzers/lexical-diversity.d.ts +36 -0
  15. package/dist/analyzers/lexical-diversity.d.ts.map +1 -0
  16. package/dist/analyzers/lexical-diversity.js +120 -0
  17. package/dist/analyzers/lexical-diversity.js.map +1 -0
  18. package/dist/analyzers/syntactic-patterns.d.ts +47 -0
  19. package/dist/analyzers/syntactic-patterns.d.ts.map +1 -0
  20. package/dist/analyzers/syntactic-patterns.js +186 -0
  21. package/dist/analyzers/syntactic-patterns.js.map +1 -0
  22. package/dist/tools/analyze-corpus.d.ts.map +1 -1
  23. package/dist/tools/analyze-corpus.js +119 -0
  24. package/dist/tools/analyze-corpus.js.map +1 -1
  25. package/dist/tools/generate-enhanced-guide.d.ts +21 -1
  26. package/dist/tools/generate-enhanced-guide.d.ts.map +1 -1
  27. package/dist/tools/generate-enhanced-guide.js +193 -2
  28. package/dist/tools/generate-enhanced-guide.js.map +1 -1
  29. package/dist/utils/advanced-statistics.d.ts +81 -0
  30. package/dist/utils/advanced-statistics.d.ts.map +1 -0
  31. package/dist/utils/advanced-statistics.js +162 -0
  32. package/dist/utils/advanced-statistics.js.map +1 -0
  33. package/package.json +1 -1
package/README.md CHANGED
@@ -1,566 +1,249 @@
- # Voice Analysis MCP Server
+ # Voice Analyser

- **Statistical voice analysis for authentic AI content generation.**
+ [![npm version](https://img.shields.io/npm/v/@houtini/voice-analyser)](https://www.npmjs.com/package/@houtini/voice-analyser)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

- Extract linguistic fingerprints from published writing, generate LLM-optimized voice models, and eliminate "AI slop" through data-driven style replication.
+ Extract statistical voice models from your published writing. Generate LLM-optimized style guides that actually replicate how you write.

- ---
-
- ## What This Does
-
- Analyzes your published writing corpus (blog posts, articles) to create **statistical voice models** that LLMs can use to replicate your authentic voice. No more subjective "does this sound like me?" - measure it.
-
- **Real results:**
- - 90% first-pass acceptance (up from 60% with generic style guides)
- - 55 minutes saved per article (35 min vs 90 min with rewrites)
- - AI cliché detection in YOUR writing (patterns you didn't know you had)
- - Function word fingerprints (z-scores show over-use/avoidance patterns)
-
- ---
-
- ## Quick Start
-
- ### Installation
+ ## Installation

- ```bash
- cd C:\dev\content-machine\mcp-server-voice-analysis
- npm install
- npm run build
- ```
-
- ### Add to Claude Desktop
+ ### Claude Desktop

- Add to `claude_desktop_config.json`:
+ Add to your `claude_desktop_config.json`:

 ```json
 {
   "mcpServers": {
     "voice-analysis": {
-      "command": "node",
-      "args": [
-        "C:/dev/content-machine/mcp-server-voice-analysis/dist/index.js"
-      ]
+      "command": "npx",
+      "args": ["-y", "@houtini/voice-analyser@latest"]
     }
   }
 }
 ```

- Restart Claude Desktop.
+ **Config locations:**
+ - Windows: `%APPDATA%\Claude\claude_desktop_config.json`
+ - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
+ - Linux: `~/.config/Claude/claude_desktop_config.json`

- ### Three-Step Workflow
+ Restart Claude Desktop after saving.

- **1. Collect corpus from your published content:**
- ```typescript
- voice-analysis:collect_corpus({
-   sitemap_url: "https://yoursite.com/post-sitemap.xml",
-   output_name: "your-name",
-   max_articles: 50
- })
- ```
+ ### Requirements

- **Supported sources:**
- - XML sitemaps
- - RSS/Atom feeds
- - Individual URLs (via Firecrawl integration)
-
- **2. Analyze linguistic patterns:**
- ```typescript
- voice-analysis:analyze_corpus({
-   corpus_name: "your-name",
-   analysis_type: "full"
- })
- ```
+ - Node.js 20+

- **3. Generate LLM-optimized voice guide:**
- ```typescript
- voice-analysis:generate_enhanced_guide({
-   corpus_name: "your-name",
-   output_format: "llm"
- })
- ```
+ ## What It Does

- Output: 20,000-25,000 word statistical model ready for Claude to load.
+ Analyses your published writing to create statistical voice models. No subjective "does this sound like me?" - measure it.

- ---
+ **Captures:**
+ - Sentence length distributions (not averages - the full histogram)
+ - Function word fingerprints (z-scores showing your over-use/avoidance patterns)
+ - First-person density and hedging patterns
+ - Punctuation habits and British/American markers
+ - N-gram patterns at character, word, and part-of-speech levels
+ - AI clichés already present in YOUR writing

- ## What Gets Analyzed
+ **Output:** 20,000-25,000 word statistical model ready for Claude to load as context.

- ### Statistical Fingerprints
-
- **Sentence patterns:**
- - Length distribution (not just average - the entire histogram)
- - Syntactic structures (how you start sentences, common modifications)
- - Sentence openers (frequency of "I", "The", "But", etc.)
+ ## Quick Start

- **Function word usage:**
- - Z-scores comparing your usage to general English
- - Over-use patterns (distinctive markers)
- - Avoidance patterns (words you rarely use)
+ ### 1. Collect Your Writing

- Example from real analysis:
 ```
- "you": z = +1.75 (highly distinctive - direct engagement style)
- "was": z = -2.46 (highly avoided - prefer active voice)
- "of": z = -2.49 (avoided - use direct constructions)
+ Collect corpus from https://yoursite.com/post-sitemap.xml as "your-name"
 ```

- **Voice markers:**
- - First-person density (0.6 per 100 words typical for authority voice)
- - Hedging language ("I think", "seems to", "pretty much")
- - British vs American English patterns
- - Equipment specificity patterns ("my Simucube 2 Pro" not "a wheelbase")
-
- **Anti-patterns detected:**
- - AI clichés in YOUR corpus ("delve", "leverage", "unlock")
- - Marketing speak patterns
- - Generic references vs specific products
-
- **Punctuation fingerprints:**
- - Comma density (0.6-0.8 per sentence)
- - Exclamation usage (5-8 per 1000 words for genuine enthusiasm)
- - Quotation style (British double quotes)
- - Dash preference patterns
-
- ### N-Gram Analysis (Enhanced Mode)
-
- **Character n-grams:**
- - Contraction patterns (`'s `, `'t `, `'ll `)
- - Punctuation combinations
- - Unique character sequences
-
- **Word n-grams:**
- - Phrase patterns (2-4 word sequences)
- - Transitional phrases ("but I", "but it", "but the")
- - Signature combinations
+ Works with XML sitemaps, RSS feeds, or individual URLs. Collects up to 100 articles by default.
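+
+ If you prefer the explicit tool call over a natural-language prompt, the underlying `collect_corpus` invocation takes the same arguments (call shape carried over from the 1.0.0 docs; the URL and names are placeholders):
+
+ ```typescript
+ voice-analysis:collect_corpus({
+   sitemap_url: "https://yoursite.com/post-sitemap.xml",  // XML sitemap, RSS feed, or URL
+   output_name: "your-name",                              // corpus identifier
+   max_articles: 50                                       // optional, default 100
+ })
+ ```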

- **POS n-grams:**
- - Syntactic patterns (ADJ NOUN, DET ADJ NOUN)
- - Sentence structure fingerprints
- - Grammatical constructions
-
- ---
-
- ## Output Format
-
- ### Generated Files
+ ### 2. Analyse Patterns

 ```
- corpus/
- └── your-name/
-     ├── articles/                  # Collected markdown
-     │   ├── 001-article-title.md
-     │   └── 002-another-article.md
-     ├── corpus.json                # Metadata
-     └── analysis/                  # Analysis outputs
-         ├── vocabulary.json
-         ├── sentence.json
-         ├── voice.json
-         ├── function-words.json
-         ├── character-ngrams.json  # Enhanced mode
-         ├── word-ngrams.json       # Enhanced mode
-         └── pos-ngrams.json        # Enhanced mode
-
- templates/
- └── writing_style_your-name.md     # LLM-optimized guide (25k words)
+ Analyse corpus "your-name"
 ```

- ### LLM-Optimized Guide Structure
+ Generates statistical analysis of vocabulary, sentence structure, voice markers, and function word usage.

- The generated voice model includes:
+ ### 3. Generate Voice Model

- 1. **Corpus Statistics** - Total words, vocabulary richness, date range
- 2. **Sentence Construction** - Length targets, syntactic patterns, openers
- 3. **Voice & Authority** - First-person usage, hedging density, approved phrases
- 4. **Vocabulary** - Domain-specific terms, British English markers
- 5. **Punctuation Patterns** - Density targets, style preferences
- 6. **Function Word Fingerprint** - Z-scores, over-use/avoidance patterns
- 7. **Transitional Phrases** - Connectives, discourse markers
- 8. **Anti-Patterns** - AI clichés to eliminate (detected in YOUR writing)
- 9. **Annotated Examples** - Good vs bad examples with pattern analysis
- 10. **Validation Checklist** - Concrete pass/fail criteria
+ ```
+ Generate enhanced guide for "your-name"
+ ```

- ---
+ Creates an LLM-optimized style guide with concrete targets and examples.

- ## Real-World Usage
+ ## Usage Examples

- ### Integration with Content Workflows
+ ### Collect from Different Sources

- **Before writing:**
+ **XML Sitemap:**
 ```
- Load C:\path\to\templates\writing_style_your-name.md
+ Collect corpus from https://example.com/post-sitemap.xml as "writer-name" with max 50 articles
 ```

- Claude now has 25,000 words of statistical patterns as context. Every sentence generated is checked against your actual usage.
-
- **After drafting:**
- Run validation checklist from voice model:
- - First-person count (target: 5+ statements)
- - Sentence length distribution (15-21 ± 11-18 words)
- - British English (100%)
- - AI clichés (0)
- - Equipment specificity (named models, not generic)
- - Zero marketing speak
+ **RSS Feed:**
+ ```
+ Collect corpus from https://example.com/feed/ as "writer-name"
+ ```

- **Pass rate improvement:**
- - Before: 60% first-pass acceptance
- - After: 90% first-pass acceptance
- - Time savings: 55 minutes per article
+ **Filter by URL pattern:**
+ ```
+ Collect corpus from https://example.com/sitemap.xml as "writer-name" filtering URLs matching "blog"
+ ```

- ### Multi-Domain Voice Modeling
+ ### Analysis Options

- Analyze writing across different domains to capture full voice range:
+ **Full analysis (recommended):**
+ ```
+ Analyse corpus "writer-name" with full analysis
+ ```

- ```typescript
- // Collect from multiple sources
- collect_corpus({ url: "https://techblog.com/feed/", output_name: "writer-tech" })
- collect_corpus({ url: "https://personalblog.com/sitemap.xml", output_name: "writer-personal" })
- collect_corpus({ url: "https://company.com/author/", output_name: "writer-corporate" })
+ **Quick iteration:**
+ ```
+ Analyse corpus "writer-name" with quick analysis
+ ```

- // Analyze each separately to identify domain variations
- analyze_corpus({ corpus_name: "writer-tech" })
- analyze_corpus({ corpus_name: "writer-personal" })
- analyze_corpus({ corpus_name: "writer-corporate" })
+ ### Output Formats

- // Generate comprehensive multi-domain guide
- // (Manual combination of insights from each domain)
+ **LLM-optimized (for Claude context):**
+ ```
+ Generate enhanced guide for "writer-name" in llm format
 ```

- **Insight:** First-person usage naturally varies by domain:
- - Technical documentation: 0.4 per 100 words
- - Personal narratives: 0.9 per 100 words
- - Corporate content: 0.6 per 100 words
-
- The analysis captures these as appropriate variations, not errors.
+ **Human-readable overview:**
+ ```
+ Generate enhanced guide for "writer-name" in human format
+ ```

- ---
+ **Both formats:**
+ ```
+ Generate enhanced guide for "writer-name" in both formats
+ ```

- ## Advanced Features
+ ## What Gets Measured

 ### Function Word Stylometry

- Z-scores reveal unconscious style patterns:
+ Z-scores reveal unconscious patterns:

 | Z-Score | Meaning | Example |
 |---------|---------|---------|
- | +2.0+ | Highly distinctive (much more than typical) | "whilst" +5.7 (British marker) |
- | +1.0 to +2.0 | Distinctive (more than typical) | "you" +1.75 (direct engagement) |
+ | Above +2.0 | Highly distinctive | "whilst" +5.7 (British marker) |
+ | +1.0 to +2.0 | Distinctive | "you" +1.75 (direct engagement) |
 | -1.0 to +1.0 | Normal range | Typical usage |
- | -1.0 to -2.0 | Avoided (less than typical) | "the" -1.48 (prefer specific) |
- | -2.0- | Highly avoided (much less than typical) | "was" -2.46 (avoid passive) |
-
- **Why this matters:** These patterns are invisible to you whilst writing but glaringly obvious when absent. That's why AI content feels "off" even when grammatically perfect.
-
- ### AI Cliché Detection
-
- Analyzes YOUR corpus for overused AI-generated phrases:
-
- **Detected patterns:**
- - "dive into" (outlier frequency)
- - "unlock" (appears unnaturally)
- - "leverage", "seamless", "robust" (if present)
+ | -2.0 to -1.0 | Avoided | "the" -1.48 (prefer specific) |
+ | Below -2.0 | Highly avoided | "was" -2.46 (avoid passive) |

- **Elimination:** Voice model explicitly flags these as anti-patterns even if they appeared in your historical writing.
+ These patterns are invisible whilst writing but obvious when absent. That's why AI content feels "off" even when grammatically correct.
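+
+ For intuition, a minimal sketch of the z-score calculation, using the formula from the 1.0.0 technical notes (the reference values below are illustrative placeholders, not the package's shipped norms):
+
+ ```typescript
+ // z = (observed_frequency - reference_mean) / reference_stddev,
+ // with all frequencies measured per 1,000 words.
+ interface ReferenceStats { mean: number; stddev: number }
+
+ function functionWordZScore(count: number, totalWords: number, ref: ReferenceStats): number {
+   const observedPerThousand = (count / totalWords) * 1000;
+   return (observedPerThousand - ref.mean) / ref.stddev;
+ }
+
+ // Illustrative numbers only: a rare word used 40 times in a 60,000-word corpus.
+ functionWordZScore(40, 60_000, { mean: 0.1, stddev: 0.1 }); // ≈ +5.7
+ ```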

- ### Enhanced N-Gram Mode
+ ### Sentence Patterns

- Activated via `generate_enhanced_guide`:
+ - Length distribution with variance (not just average)
+ - Sentence opener frequency ("I", "The", "But", etc.)
+ - Syntactic structures and modifications

- **Character-level patterns:**
- - Contraction usage (`'s ` appears 6 times)
- - Punctuation combinations
- - Unique sequences
+ ### Voice Markers

- **Word-level patterns:**
- - "but I" (7-14 uses - primary contrast marker)
- - "in my" (23 uses - authority phrase)
- - "I think" (18 uses - hedging pattern)
-
- **POS-level patterns:**
- - DET NOUN (1242 times): the metrics, a domain
- - ADJ NOUN (874 times): international markets, new website
- - PRON AUX (735 times): I 'd, It 's
-
- **Purpose:** Capture syntactic DNA that generic grammar rules miss.
-
- ---
-
- ## Requirements
+ - First-person density (0.6 per 100 words typical for authority voice)
+ - Hedging language frequency ("I think", "seems to")
+ - Equipment specificity ("my Simucube 2 Pro" vs "a wheelbase")

- ### Minimum Corpus Size
+ ### N-Gram Analysis

- **For reliable statistics:**
- - Minimum: 15,000 words
- - Recommended: 30,000 words
- - Ideal: 50,000+ words
+ **Word patterns:**
+ - "but I" (contrast marker)
+ - "in my" (authority phrase)
+ - Transitional phrases

- **Example:** 50 blog posts × 1,200 words = 60,000 words (excellent)
+ **POS patterns:**
+ - DET NOUN (the metrics)
+ - ADJ NOUN (international markets)
+ - Syntactic DNA
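+
+ A minimal sketch of how word-level n-grams can be counted, assuming naive whitespace tokenisation (the package's actual tokenisation may differ):
+
+ ```typescript
+ // Count 2-word sequences ("but I", "in my") and surface the most frequent.
+ function topWordBigrams(text: string, limit = 10): [string, number][] {
+   const words = text.toLowerCase().split(/\s+/).filter(Boolean);
+   const counts = new Map<string, number>();
+   for (let i = 0; i < words.length - 1; i++) {
+     const gram = `${words[i]} ${words[i + 1]}`;
+     counts.set(gram, (counts.get(gram) ?? 0) + 1);
+   }
+   return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
+ }
+ ```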

- **Why size matters:** Below 15k words, you're measuring noise, not signal. Statistical patterns aren't stable.
+ ### Anti-Patterns

- ### Content Quality
+ Detects AI clichés in YOUR corpus:
+ - "delve", "leverage", "unlock"
+ - Marketing speak patterns
+ - Generic references

- **Best results when corpus contains:**
- - Single author (no guest posts or collaborative writing)
- - Consistent genre/domain (all technical, or all personal - or analyze separately)
- - Recent writing (voice evolves - re-analyze quarterly)
- - Published content (avoid unpublished drafts with incomplete editing)
+ ## Output Structure

- **Multi-domain:** Analyze separately, then combine insights to understand context-appropriate variations.
+ ```
+ corpus/
+ └── your-name/
+     ├── articles/      # Collected markdown
+     ├── corpus.json    # Metadata
+     └── analysis/      # JSON analysis files

- ---
+ templates/
+ └── writing_style_your-name.md    # LLM-optimized guide
+ ```

- ## Tool Reference
+ ## Tools Reference

 ### collect_corpus

- **Purpose:** Extract clean writing samples from web sources
-
- **Parameters:**
- ```typescript
- {
-   sitemap_url: string;      // XML sitemap, RSS feed, or individual URL
-   output_name: string;      // Corpus identifier (e.g., "john-smith")
-   max_articles?: number;    // Limit articles to collect (default: 100)
-   article_pattern?: string; // Optional regex filter for URLs
- }
- ```
-
- **Output:**
- - Creates `corpus/{output_name}/` directory
- - Saves articles as clean markdown
- - Generates `corpus.json` with metadata
-
- **Cleaning process:**
- - Strips HTML, navigation, ads, comments
- - Preserves article prose only
- - Normalizes whitespace and formatting
+ | Parameter | Required | Description |
+ |-----------|----------|-------------|
+ | `sitemap_url` | Yes | XML sitemap, RSS feed, or URL |
+ | `output_name` | Yes | Corpus identifier |
+ | `max_articles` | No | Limit (default: 100) |
+ | `article_pattern` | No | Regex filter for URLs |

 ### analyze_corpus

- **Purpose:** Perform linguistic analysis on collected corpus
-
- **Parameters:**
- ```typescript
- {
-   corpus_name: string; // Name from collect_corpus
-   analysis_type: "full" | "quick" | "vocabulary" | "syntax";
- }
- ```
-
- **Analysis types:**
- - **full**: Complete analysis (recommended)
- - **quick**: Fast iteration during testing
- - **vocabulary**: Word frequency only
- - **syntax**: Sentence structure only
-
- **Output files in `corpus/{name}/analysis/`:**
- - vocabulary.json
- - sentence.json
- - voice.json
- - paragraph.json
- - punctuation.json
- - function-words.json
- - function-words-summary.md (human-readable)
+ | Parameter | Required | Description |
+ |-----------|----------|-------------|
+ | `corpus_name` | Yes | Name from collect_corpus |
+ | `analysis_type` | No | full, quick, vocabulary, syntax |

 ### generate_enhanced_guide

- **Purpose:** Create LLM-optimized statistical voice model
-
- **Parameters:**
- ```typescript
- {
-   corpus_name: string;
-   output_format: "llm" | "human" | "both";
- }
- ```
-
- **Output:**
- - **llm**: Optimized for AI consumption (25k words, statistical targets)
- - **human**: Readable overview for writers
- - **both**: Generates both formats
-
- **Saved to:** `templates/writing_style_{corpus_name}.md`
+ | Parameter | Required | Description |
+ |-----------|----------|-------------|
+ | `corpus_name` | Yes | Name from analyze_corpus |
+ | `output_format` | No | llm, human, both |
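+
+ Taken together, the three tools chain into the full workflow (call shapes carried over from the 1.0.0 docs; "your-name" is a placeholder):
+
+ ```typescript
+ voice-analysis:collect_corpus({ sitemap_url: "https://yoursite.com/post-sitemap.xml", output_name: "your-name" })
+ voice-analysis:analyze_corpus({ corpus_name: "your-name", analysis_type: "full" })
+ voice-analysis:generate_enhanced_guide({ corpus_name: "your-name", output_format: "llm" })
+ ```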

- **Enhanced mode:** Automatically includes n-gram analysis (character, word, POS patterns) for maximum voice fidelity.
+ ### generate_tov_guide

- ### generate_tov_guide (Legacy)
+ Legacy basic guide generation. Use `generate_enhanced_guide` for production.

- **Purpose:** Generate basic voice guide (pre-enhanced version)
+ ## Minimum Corpus Size

- **Use case:** Simpler output format, faster generation
+ - **Minimum:** 15,000 words (about 20 articles)
+ - **Recommended:** 30,000 words
+ - **Ideal:** 50,000+ words
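+
+ **Example:** 50 blog posts × 1,200 words = 60,000 words (excellent).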

- **Note:** Use `generate_enhanced_guide` for production work. This tool maintained for backward compatibility.
-
- ---
+ Below 15k words, you're measuring noise, not signal.

 ## Troubleshooting

- ### "No corpus found"
-
- **Solution:**
- 1. Run `collect_corpus` first
- 2. Check corpus name matches exactly (case-sensitive)
- 3. Verify `corpus/` directory exists in project root
-
- ### "Not enough data for reliable analysis"
-
- **Solution:**
- 1. Collect more articles (minimum 20 articles, 15,000 words)
- 2. Check articles aren't empty after HTML stripping
- 3. Verify sitemap URL is accessible
-
- ### "Z-scores all near zero"
-
- **Interpretation:** Indicates very typical English usage - not necessarily wrong
-
- **Causes:**
- - Generic corporate content (averaged voice)
- - Mixed authorship (multiple writers)
- - AI-edited content (stripped of distinctive patterns)
-
- **Solution:** Collect more distinctive personal writing or domain-specific content
+ **"No corpus found"**
+ Run `collect_corpus` first. Check the corpus name matches exactly (case-sensitive).

- ### Voice model doesn't match current style
+ **"Not enough data"**
+ Collect more articles. Minimum 20 articles, 15,000 words.

- **Cause:** Voice evolution over time
-
- **Solution:**
- - Re-analyze quarterly
- - Focus on recent articles (filter by date)
- - Document which corpus articles best represent current voice
-
- ---
+ **"Z-scores all near zero"**
+ Indicates typical English usage. May mean mixed authorship or AI-edited content.

 ## Development

- ### Build from Source
-
 ```bash
- git clone https://github.com/yourusername/mcp-server-voice-analysis
- cd mcp-server-voice-analysis
+ git clone https://github.com/houtini-ai/voice-analyser-mcp.git
+ cd voice-analyser-mcp
 npm install
 npm run build
 ```

- ### Project Structure
-
- ```
- src/
- ├── index.ts                 # MCP server entry point
- ├── tools/                   # MCP tool implementations
- │   ├── collect.ts
- │   ├── analyze.ts
- │   └── generate.ts
- ├── analyzers/               # Linguistic analysis modules
- │   ├── vocabulary.ts
- │   ├── sentence.ts
- │   ├── function-words.ts
- │   ├── character-ngrams.ts
- │   ├── word-ngrams.ts
- │   └── pos-ngrams.ts
- ├── utils/                   # Shared utilities
- │   ├── cleaner.ts
- │   ├── tokenizer.ts
- │   └── stats.ts
- └── reference/               # Reference data
-     └── english-reference.ts # Function word norms
-
- dist/        # Compiled JavaScript output
- corpus/      # Collected writing samples
- templates/   # Generated voice models
- ```
-
- ### Dependencies
-
- **Core:**
- - `@modelcontextprotocol/sdk` - MCP protocol implementation
- - `compromise` - Natural language processing
- - `cheerio` - HTML parsing for content extraction
- - `fast-xml-parser` - Sitemap and RSS parsing
-
- **Analysis:**
- - Function word reference data (50 most common English function words)
- - Part-of-speech tagging
- - N-gram extraction (character, word, POS)
-
- ---
-
- ## Technical Details
-
- ### Statistical Methods
-
- **Z-Score Calculation:**
- ```
- z = (observed_frequency - reference_mean) / reference_stddev
- ```
-
- Where:
- - observed_frequency = word count per 1000 words in your corpus
- - reference_mean = average frequency in general English
- - reference_stddev = standard deviation in general English
-
- **Interpretation:** Z-scores create a statistical fingerprint. Replicating patterns by chance is astronomically unlikely.
-
- ### N-Gram Extraction
-
- **Character n-grams:** Sequences of 2-4 characters
- - Captures contractions, punctuation patterns
- - Example: `'s ` (possessive), `n't ` (negation)
-
- **Word n-grams:** Sequences of 2-4 words
- - Captures phrase patterns, transitional markers
- - Example: "but I think", "in my opinion"
-
- **POS n-grams:** Sequences of 2-4 part-of-speech tags
- - Captures syntactic structure
- - Example: DET ADJ NOUN ("the big dog")
-
- **Purpose:** These patterns encode voice at multiple levels - from character quirks to sentence structure DNA.
-
- ---
-
- ## Roadmap
-
- ### Current Status (v1.0)
- - ✅ Corpus collection (sitemaps, RSS, URLs)
- - ✅ Full linguistic analysis
- - ✅ Function word stylometry
- - ✅ Enhanced n-gram analysis
- - ✅ LLM-optimized guide generation
- - ✅ Anti-pattern detection
-
- ### Planned Features
- - **Real-time validation**: API endpoint for live content checking
- - **Voice drift detection**: Alert when published content deviates from model
- - **Multi-author analysis**: Team voice harmonization
- - **Competitive analysis**: Analyze competitor voices for differentiation
- - **Delta distance scoring**: Automated authorship verification
-
- ---
-
- ## License
-
- MIT
-
- ---
-
- ## Citation
-
- If you use this tool in research or commercial projects:
-
- ```
- Voice Analysis MCP Server (2025)
- Statistical voice modeling for authentic AI content generation
- https://github.com/yourusername/mcp-server-voice-analysis
- ```
-
- ---
+ ## Research Foundation

- ## Support
+ The function word stylometry approach draws from computational authorship analysis research. Z-score comparisons against reference English corpora create statistical fingerprints that are difficult to replicate by chance.

- **Issues:** Open issue on GitHub
- **Documentation:** See `QUICKSTART.md` for detailed workflow examples
- **Research:** See `research/` directory for technical background
+ Key insight: function words (the, of, and, to) are used unconsciously and form stable individual patterns. Content words vary by topic; function words reveal the author.

 ---

- **Built for Content Machine project** - Systematic WordPress content enhancement with voice preservation.
+ MIT License - [Houtini.ai](https://houtini.ai)