codesummary 1.1.1 → 1.2.0

package/features.md CHANGED
@@ -1,502 +1,418 @@
1
- # CodeSummary – Detailed Features and Functional Specification
2
-
3
- ## 1. Overview
4
-
5
- **CodeSummary** is a **Node.js-based, cross-platform CLI tool** (distributed via **npm**) that automatically scans a project's source code and generates a **clean, professional PDF document** containing:
6
-
7
- - **Complete project file structure** with hierarchical organization
8
- - **Full source code content** for all selected files (no truncation)
9
- - **Intelligent file type detection** and user-selectable filtering
10
- - **Clean, readable formatting** optimized for code documentation
11
-
12
- Its primary goal is to **simplify code reviews, audits, and archival snapshots**, enabling teams and individuals to produce **self-contained, complete documentation** of their codebases with minimal setup.
13
-
14
- > **Repository**: [https://github.com/skamoll/CodeSummary](https://github.com/skamoll/CodeSummary)
15
- > **npm Package Name**: `codesummary`
16
-
17
- ---
18
-
19
- ### 1.1 Target Audience
20
-
21
- - **Developers** needing quick overviews of large projects with complete content
22
- - **Auditors/Consultants** requiring traceable documentation snapshots without size limits
23
- - **Educators/Students** preparing comprehensive code handovers or learning materials
24
- - **Teams** performing thorough code reviews or compliance checks
25
- - **Project Managers** creating complete project documentation for stakeholders
26
-
27
- ---
28
-
29
- ### 1.2 Core Objectives
30
-
31
- 1. **Complete automated documentation** — includes ALL file content without truncation
32
- 2. **Cross-platform reliability** — identical behavior on Windows, macOS, and Linux
33
- 3. **Advanced configurability** — user-defined filters, styles, and output preferences
34
- 4. **Unlimited scalability** — handles projects of any size with efficient streaming
35
- 5. **Intelligent safe defaults** — avoids binaries and unwanted files with smart filtering
36
- 6. **Professional output** — clean, readable PDFs suitable for all professional contexts
37
- 7. **Smart conflict handling** — automatic timestamped filenames when files are in use
38
-
39
- ---
40
-
41
- ### 1.3 Key Differentiators
42
-
43
- - **No content limits**: processes files of any size completely
44
- - **Smart file conflict resolution**: automatic timestamped fallbacks
45
- - **Terminal compatibility**: works with all terminal types across platforms
46
- - **Whitelist-driven filtering** with extensive language support
47
- - **Interactive first-run setup** with persistent global configuration
48
- - **Memory-efficient streaming** for optimal performance on large projects
49
- - **Non-destructive scanning** with comprehensive error handling
50
- - **Fully offline operation** with no external dependencies
51
-
52
- ---
53
-
54
- ### 1.4 Technology Stack
55
-
56
- - **Node.js** 18 (for native ES modules and modern APIs)
57
- - **PDFKit** for professional PDF generation with streaming support
58
- - **Inquirer.js** for interactive command-line prompts
59
- - **Chalk** for cross-platform terminal styling
60
- - **Ora** for progress indicators and status updates
61
- - **fs-extra** for enhanced file system operations
62
-
63
- ---
64
-
65
- ## 2. Functional Requirements
66
-
67
- ### 2.1 Command-Line Interface
68
-
69
- #### 2.1.1 Primary Commands
70
-
71
- | Command | Description | Example |
72
- |---------|-------------|---------|
73
- | `codesummary` | Scan current directory and generate PDF | `codesummary` |
74
- | `codesummary config` | Launch interactive configuration editor | `codesummary config` |
75
- | `codesummary --show-config` | Display current configuration settings | `codesummary --show-config` |
76
- | `codesummary --reset-config` | Reset to defaults and run setup wizard | `codesummary --reset-config` |
77
- | `codesummary --help` | Show comprehensive help information | `codesummary --help` |
78
-
79
- #### 2.1.2 Command-Line Options
80
-
81
- | Option | Short | Description | Example |
82
- |--------|-------|-------------|---------|
83
- | `--output` | `-o` | Override output directory | `codesummary -o ./docs` |
84
- | `--show-config` | - | Display current configuration | `codesummary --show-config` |
85
- | `--reset-config` | - | Reset configuration to defaults | `codesummary --reset-config` |
86
- | `--help` | `-h` | Show help message | `codesummary -h` |
87
-
88
- #### 2.1.3 Interactive Workflow
89
-
90
- 1. **First-Run Setup**
91
- - Detects missing configuration automatically
92
- - Launches interactive setup wizard
93
- - Configures output mode (relative/fixed path)
94
- - Creates output directory if needed
95
- - Saves persistent global configuration
96
-
97
- 2. **Directory Scanning**
98
- - Recursively scans current working directory
99
- - Applies whitelist filtering for file extensions
100
- - Excludes common build/dependency directories
101
- - Shows comprehensive scan summary with statistics
102
-
103
- 3. **Extension Selection**
104
- - Presents detected file types in checkbox format
105
- - Shows file counts for each extension
106
- - Allows selective inclusion/exclusion
107
- - Pre-selects all detected extensions by default
108
-
109
- 4. **PDF Generation**
110
- - Processes all selected files completely (no truncation)
111
- - Shows progress indicators for large files
112
- - Handles file conflicts with timestamped names
113
- - Generates clean, professional PDF output
114
-
115
- ---
116
-
117
- ### 2.2 Configuration Management
118
-
119
- #### 2.2.1 Global Configuration Storage
120
-
121
- **Storage Locations:**
122
- - **Linux/macOS**: `~/.codesummary/config.json`
123
- - **Windows**: `%APPDATA%\\CodeSummary\\config.json`
124
-
125
- #### 2.2.2 Configuration Structure
126
-
127
- ```json
128
- {
129
- \"output\": {
130
- \"mode\": \"fixed\" | \"relative\",
131
- \"fixedPath\": \"string (absolute path)\"
132
- },
133
- \"allowedExtensions\": [\"array of file extensions\"],
134
- \"excludeDirs\": [\"array of directory names to exclude\"],
135
- \"styles\": {
136
- \"colors\": {
137
- \"title\": \"#333353\",
138
- \"section\": \"#00FFB9\",
139
- \"text\": \"#333333\",
140
- \"error\": \"#FF4D4D\",
141
- \"footer\": \"#666666\"
142
- },
143
- \"layout\": {
144
- \"marginLeft\": 40,
145
- \"marginTop\": 40,
146
- \"marginRight\": 40,
147
- \"footerHeight\": 20
148
- }
149
- },
150
- \"settings\": {
151
- \"documentTitle\": \"Project Code Summary\",
152
- \"maxFilesBeforePrompt\": 500
153
- }
154
- }
155
- ```
156
-
157
- #### 2.2.3 Configuration Features
158
-
159
- - **Cross-platform path handling** with automatic normalization
160
- - **Validation system** prevents invalid configurations
161
- - **Interactive editor** for all configuration sections
162
- - **Automatic backup and recovery** for corrupted configurations
163
- - **Reset functionality** to restore defaults
164
-
165
- ---
166
-
167
- ### 2.3 File System Scanning
168
-
169
- #### 2.3.1 Scanning Algorithm
170
-
171
- 1. **Recursive Directory Traversal**
172
- - Starts from current working directory
173
- - Follows symbolic links safely
174
- - Respects file system permissions
175
- - Handles large directory structures efficiently
176
-
177
- 2. **Filtering Logic**
178
- - **Whitelist approach**: Only processes explicitly allowed extensions
179
- - **Directory exclusions**: Skips common build/dependency directories
180
- - **Hidden file handling**: Includes important dot files (.gitignore, .env.example)
181
- - **Binary detection**: Automatically skips binary files
182
-
183
- 3. **Error Handling**
184
- - Graceful handling of permission denied errors
185
- - Continues scanning despite individual file failures
186
- - Logs warnings for inaccessible files
187
- - Provides detailed error context
188
-
189
- #### 2.3.2 Supported File Extensions
190
-
191
- **Programming Languages:**
192
- - JavaScript: `.js`, `.jsx`, `.mjs`
193
- - TypeScript: `.ts`, `.tsx`, `.d.ts`
194
- - Python: `.py`, `.pyw`, `.pyx`
195
- - Java: `.java`
196
- - C/C++: `.c`, `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp`
197
- - C#: `.cs`
198
- - Go: `.go`
199
- - Rust: `.rs`
200
- - Swift: `.swift`
201
- - Kotlin: `.kt`, `.kts`
202
- - Scala: `.scala`
203
- - PHP: `.php`, `.phtml`
204
- - Ruby: `.rb`, `.rbw`
205
-
206
- **Web Technologies:**
207
- - HTML: `.html`, `.htm`
208
- - CSS: `.css`, `.scss`, `.sass`, `.less`
209
- - Vue.js: `.vue`
210
- - Svelte: `.svelte`
211
-
212
- **Data & Configuration:**
213
- - JSON: `.json`, `.jsonc`
214
- - XML: `.xml`, `.xsd`, `.xsl`
215
- - YAML: `.yaml`, `.yml`
216
- - TOML: `.toml`
217
- - SQL: `.sql`
218
- - GraphQL: `.graphql`, `.gql`
219
-
220
- **Scripts & Shell:**
221
- - Shell: `.sh`, `.bash`, `.zsh`
222
- - Batch: `.bat`, `.cmd`
223
- - PowerShell: `.ps1`, `.psm1`
224
-
225
- **Documentation:**
226
- - Markdown: `.md`, `.markdown`
227
- - Text: `.txt`
228
- - Dockerfile: `.dockerfile`
229
-
230
- #### 2.3.3 Directory Exclusions
231
-
232
- **Default Excluded Directories:**
233
- - `node_modules` (Node.js dependencies)
234
- - `.git` (Git version control)
235
- - `.vscode` (VS Code settings)
236
- - `dist`, `build` (Build outputs)
237
- - `coverage` (Test coverage reports)
238
- - `out` (Output directories)
239
- - `__pycache__` (Python cache)
240
- - `.next` (Next.js build)
241
- - `.nuxt` (Nuxt.js build)
242
- - `vendor` (Dependency directories)
243
- - `.cache` (Cache directories)
244
-
245
- ---
246
-
247
- ### 2.4 PDF Generation
248
-
249
- #### 2.4.1 Document Structure
250
-
251
- **1. Project Overview Section**
252
- - Document title (configurable)
253
- - Project name (derived from directory)
254
- - Generation timestamp
255
- - List of included file types with descriptions
256
- - Clean, professional formatting
257
-
258
- **2. File Structure Section**
259
- - Complete hierarchical file listing
260
- - Organized by relative paths from project root
261
- - Sorted alphabetically for easy navigation
262
- - Monospace font for proper alignment
263
-
264
- **3. File Content Section**
265
- - **Complete source code** for each selected file
266
- - **No truncation or size limits**
267
- - Proper monospace formatting for code readability
268
- - File headers with clear identification
269
- - Natural page breaks when needed
270
- - Error handling for unreadable files
271
-
272
- #### 2.4.2 PDF Specifications
273
-
274
- **Format & Layout:**
275
- - **Paper size**: A4 (595 × 842 points)
276
- - **Margins**: 40pt on all sides for optimal content area
277
- - **Fonts**:
278
- - Headers: Helvetica Bold
279
- - Body text: Helvetica
280
- - Code content: Courier (monospace)
281
- - **Colors**: Professional color scheme with high contrast
282
-
283
- **Advanced Features:**
284
- - **Streaming generation** for memory efficiency
285
- - **Automatic page breaks** handled by PDFKit
286
- - **Smart file conflict handling** with timestamped names
287
- - **Progress indicators** for large file processing
288
- - **Error recovery** with graceful failure handling
289
-
290
- #### 2.4.3 File Naming Convention
291
-
292
- **Standard naming:**
293
- ```
294
- PROJECTNAME_code.pdf
295
- ```
296
-
297
- **Conflict resolution (when file is in use):**
298
- ```
299
- PROJECTNAME_code_YYYYMMDD_HHMMSS.pdf
300
- ```
301
-
302
- **Example:**
303
- ```
304
- MYPROJECT_code.pdf # Standard
305
- MYPROJECT_code_20250729_141602.pdf # Timestamped fallback
306
- ```
307
-
308
- ---
309
-
310
- ### 2.5 Cross-Platform Compatibility
311
-
312
- #### 2.5.1 Operating System Support
313
-
314
- - **Windows** (10, 11, Server 2019+)
315
- - **macOS** (10.15+, including Apple Silicon)
316
- - **Linux** (Ubuntu 18.04+, CentOS 7+, other major distributions)
317
-
318
- #### 2.5.2 Terminal Compatibility
319
-
320
- - **Universal ASCII output** - no special Unicode characters
321
- - **Color support detection** with graceful fallbacks
322
- - **All terminal types supported** (cmd, PowerShell, bash, zsh, fish)
323
- - **Screen reader compatible** output format
324
-
325
- #### 2.5.3 Path Handling
326
-
327
- - **Automatic path normalization** across platforms
328
- - **Unicode filename support** for international characters
329
- - **Long path support** on Windows (>260 characters)
330
- - **Case sensitivity handling** appropriate to each platform
331
-
332
- ---
333
-
334
- ### 2.6 Performance & Scalability
335
-
336
- #### 2.6.1 Memory Management
337
-
338
- - **Streaming file processing** to minimize memory usage
339
- - **Efficient PDF generation** with incremental building
340
- - **Garbage collection optimization** for large projects
341
- - **Memory usage monitoring** with warnings for extreme cases
342
-
343
- #### 2.6.2 Large Project Handling
344
-
345
- - **No file size limits** - processes files of any size completely
346
- - **Progress indicators** for files with >1000 lines
347
- - **Configurable warning thresholds** (default: 500 files)
348
- - **User confirmation** for very large projects
349
- - **Streaming architecture** prevents memory overflow
350
-
351
- #### 2.6.3 Performance Optimizations
352
-
353
- - **Parallel file scanning** where safe
354
- - **Efficient binary detection** to skip non-text files quickly
355
- - **Smart caching** of file metadata
356
- - **Optimized PDF rendering** with minimal memory footprint
357
-
358
- ---
359
-
360
- ### 2.7 Error Handling & Validation
361
-
362
- #### 2.7.1 Input Validation
363
-
364
- - **Path validation** with security checks
365
- - **Configuration validation** with schema enforcement
366
- - **File extension validation** with normalization
367
- - **Permission checking** before operations
368
-
369
- #### 2.7.2 Error Recovery
370
-
371
- - **Graceful degradation** when files are inaccessible
372
- - **Automatic retry** for transient failures
373
- - **Detailed error logging** with context information
374
- - **User-friendly error messages** with suggested solutions
375
-
376
- #### 2.7.3 File Conflict Handling
377
-
378
- - **Automatic detection** of files in use
379
- - **Timestamped filename generation** for conflicts
380
- - **User notification** of filename changes
381
- - **Fallback mechanisms** for write failures
382
-
383
- ---
384
-
385
- ## 3. Technical Architecture
386
-
387
- ### 3.1 Module Structure
388
-
389
- ```
390
- src/
391
- ├── cli.js # Command-line interface and user interaction
392
- ├── configManager.js # Global configuration management
393
- ├── scanner.js # File system scanning and filtering
394
- ├── pdfGenerator.js # PDF creation and formatting
395
- └── errorHandler.js # Comprehensive error handling
396
- ```
397
-
398
- ### 3.2 Key Design Patterns
399
-
400
- - **Modular architecture** with clear separation of concerns
401
- - **Event-driven processing** for scalable file handling
402
- - **Stream-based operations** for memory efficiency
403
- - **Functional programming principles** where appropriate
404
- - **Comprehensive error boundaries** with graceful recovery
405
-
406
- ### 3.3 Dependencies
407
-
408
- **Core Dependencies:**
409
- - `pdfkit` - Professional PDF generation
410
- - `inquirer` - Interactive command-line prompts
411
- - `chalk` - Cross-platform terminal styling
412
- - `ora` - Progress indicators and spinners
413
- - `fs-extra` - Enhanced file system operations
414
-
415
- **Development Dependencies:**
416
- - Modern ES modules (Node.js 18+)
417
- - Native Promise-based APIs
418
- - Cross-platform path handling
419
- - Unicode and internationalization support
420
-
421
- ---
422
-
423
- ## 4. Quality Assurance
424
-
425
- ### 4.1 Testing Strategy
426
-
427
- - **Cross-platform testing** on Windows, macOS, and Linux
428
- - **Large project stress testing** with thousands of files
429
- - **Memory usage profiling** for optimization
430
- - **Terminal compatibility verification** across different environments
431
- - **File conflict scenario testing** with various edge cases
432
-
433
- ### 4.2 Security Considerations
434
-
435
- - **Path traversal prevention** with input validation
436
- - **Permission-based access control** respecting system security
437
- - **No external network dependencies** for complete offline operation
438
- - **Safe file handling** with proper error boundaries
439
- - **Configuration validation** to prevent malicious settings
440
-
441
- ### 4.3 Documentation Standards
442
-
443
- - **Comprehensive README** with usage examples
444
- - **Detailed feature specification** (this document)
445
- - **Inline code documentation** with JSDoc standards
446
- - **Error message clarity** with actionable guidance
447
- - **Contributing guidelines** for open-source collaboration
448
-
449
- ---
450
-
451
- ## 5. Future Enhancements
452
-
453
- ### 5.1 Planned Features
454
-
455
- - **Syntax highlighting** in PDF output for better code readability
456
- - **Clickable table of contents** with bookmarks for navigation
457
- - **Multiple output formats** (HTML, JSON, Markdown)
458
- - **Project metrics and statistics** (line counts, complexity analysis)
459
- - **CI/CD integration mode** for automated documentation pipelines
460
- - **Custom PDF themes** and styling options
461
- - **Plugin system** for custom file processors
462
-
463
- ### 5.2 Advanced Capabilities
464
-
465
- - **Incremental updates** for changed files only
466
- - **Git integration** for commit-specific documentation
467
- - **Code annotation** system for additional context
468
- - **Multi-language support** for international users
469
- - **Web-based configuration** interface for easier setup
470
- - **Integration APIs** for third-party tools
471
-
472
- ---
473
-
474
- ## 6. Success Metrics
475
-
476
- ### 6.1 Performance Targets
477
-
478
- - **Scan speed**: >1000 files per second on modern hardware
479
- - **Memory usage**: <200MB for projects with 10,000+ files
480
- - **PDF generation**: <30 seconds for typical projects (100 files)
481
- - **Cross-platform consistency**: 100% feature parity across platforms
482
-
483
- ### 6.2 Quality Targets
484
-
485
- - **Zero data loss**: All file content included without truncation
486
- - **Error rate**: <0.1% failure rate on valid projects
487
- - **User satisfaction**: Clear, actionable error messages for all failure cases
488
- - **Compatibility**: Works on 99%+ of supported platform/terminal combinations
489
-
490
- ---
491
-
492
- ## 7. Conclusion
493
-
494
- CodeSummary represents a comprehensive solution for automated code documentation, combining professional-grade PDF output with intelligent file processing and cross-platform compatibility. Its focus on complete content inclusion, smart conflict handling, and terminal compatibility makes it suitable for both individual developers and enterprise environments.
495
-
496
- The tool's architecture supports unlimited scalability while maintaining efficient resource usage, ensuring it can handle projects of any size. With its extensive language support and intelligent filtering, CodeSummary serves as a valuable tool for code reviews, audits, documentation, and archival purposes.
497
-
498
- ---
499
-
500
- **Document Version**: 2.0
501
- **Last Updated**: January 2025
502
- **Status**: Implementation Complete - Ready for Release
1
+ # CodeSummary – Detailed Features and Functional Specification
2
+
3
+ ## 1. Overview
4
+
5
+ **CodeSummary** is a **Node.js-based, cross-platform CLI tool** (distributed via **npm**) that automatically scans a project's source code and generates output in three formats:
6
+
7
+ - **PDF**: clean, professional A4 documentation for code reviews, audits, and archival snapshots
8
+ - **RAG JSON**: structured output with semantic chunks, byte offsets, and token estimates — built for vector databases and programmatic LLM integration
9
+ - **LLM Markdown**: a single token-optimised Markdown file for pasting directly into any chat-based LLM interface
10
+
11
+ > **Repository**: [https://github.com/skamoll/CodeSummary](https://github.com/skamoll/CodeSummary)
12
+ > **npm Package Name**: `codesummary`
13
+
14
+ ---
15
+
16
+ ### 1.1 Target Audience
17
+
18
+ - **Developers** who need quick, complete overviews of large projects
19
+ - **Auditors / Consultants** requiring traceable documentation snapshots
20
+ - **Educators / Students** preparing comprehensive code handovers
21
+ - **AI Engineers** building RAG pipelines or feeding code into vector databases
22
+ - **Anyone** who wants to work with their codebase inside a chat-based LLM efficiently
23
+
24
+ ---
25
+
26
+ ### 1.2 Core Objectives
27
+
28
+ 1. **Three output modes** — PDF for humans, RAG JSON for machines, LLM Markdown for chat
29
+ 2. **Cross-platform reliability** — identical behaviour on Windows, macOS, and Linux
30
+ 3. **Lossless content optimisation** — reduce token count without altering code meaning
31
+ 4. **Smart config migration** — new defaults merge into existing config without data loss
32
+ 5. **Versioned output** — `-v1`, `-v2` suffixes prevent overwrites and timestamp clutter
33
+ 6. **Non-interactive operation** — `--no-interactive` for CI/CD pipelines
34
+
35
+ ---
36
+
37
+ ### 1.3 Technology Stack
38
+
39
+ - **Node.js** ≥ 18 (native ES modules)
40
+ - **PDFKit** for PDF generation with streaming support
41
+ - **Inquirer.js** for interactive prompts
42
+ - **Chalk** for terminal styling
43
+ - **Ora** for progress indicators
44
+ - **fs-extra** for enhanced file system operations
45
+ - **js-yaml** for YAML config loading
46
+ - **ajv** for JSON schema validation
47
+
48
+ ---
49
+
50
+ ## 2. Functional Requirements
51
+
52
+ ### 2.1 Command-Line Interface
53
+
54
+ #### 2.1.1 Primary Commands
55
+
56
+ | Command | Description |
57
+ |---------|-------------|
58
+ | `codesummary` | Scan current directory, generate PDF |
59
+ | `codesummary --format rag` | Generate RAG-optimised JSON |
60
+ | `codesummary --format llm` | Generate LLM-optimised Markdown |
61
+ | `codesummary --format both` | Generate PDF + RAG JSON (single scan) |
62
+ | `codesummary config` | Launch interactive configuration editor |
63
+ | `codesummary --show-config` | Display current configuration |
64
+ | `codesummary --reset-config` | Reset to defaults and run setup wizard |
65
+ | `codesummary --help` | Show help |
66
+ | `codesummary --version` | Show version |
67
+
68
+ #### 2.1.2 Command-Line Options
69
+
70
+ | Option | Short | Description |
71
+ |--------|-------|-------------|
72
+ | `--format <format>` | `-f` | `pdf` (default), `rag`, `llm`, or `both` |
73
+ | `--output <path>` | `-o` | Override output directory for this run |
74
+ | `--no-interactive` | | Skip all prompts; auto-select all extensions |
75
+ | `--show-config` | | Display current configuration |
76
+ | `--reset-config` | | Reset configuration to defaults |
77
+ | `--help` | `-h` | Show help message |
78
+ | `--version` | `-v` | Show version |
79
+
80
+ #### 2.1.3 Interactive Workflow
81
+
82
+ 1. **First-run setup**: detects missing config, launches setup wizard, creates output directory
83
+ 2. **Directory scanning**: recursive scan with whitelist filtering and exclusion rules
84
+ 3. **Extension selection**: checkbox prompt with file counts; skipped with `--no-interactive`
85
+ 4. **Generation**: selected format(s) generated; versioned filenames used if the target exists
86
+
87
+ ---
88
+
89
+ ### 2.2 Output Formats
90
+
91
+ #### 2.2.1 PDF (`--format pdf`)
92
+
93
+ Generates a professional A4 PDF with three sections:
94
+
95
+ 1. **Project overview**: title, project name, timestamp, included file types
96
+ 2. **File structure**: sorted complete file listing
97
+ 3. **File content**: full source of every selected file, monospace font, no truncation
98
+
99
+ File naming: `PROJECTNAME_code.pdf` → `PROJECTNAME_code-v1.pdf` → `PROJECTNAME_code-v2.pdf` → ...
100
+
101
+ #### 2.2.2 RAG JSON (`--format rag`)
102
+
103
+ Generates a structured JSON file built for embedding and retrieval in AI/ML pipelines.
104
+
105
+ **When to use RAG:**
106
+ - Loading code into a vector database (Pinecone, Qdrant, Chroma, etc.)
107
+ - Building a retrieval-augmented generation pipeline
108
+ - Programmatic LLM integration where you control chunking and retrieval
109
+ - Rapid chunk seeking via byte offsets without re-parsing the full JSON
110
+
111
+ **JSON structure:**
112
+ ```json
113
+ {
114
+ "metadata": { "projectName": "...", "generatedAt": "...", "version": "..." },
115
+ "files": [
116
+ {
117
+ "id": "abc123",
118
+ "path": "src/component.js",
119
+ "language": "JavaScript",
120
+ "hash": "sha256-...",
121
+ "chunks": [
122
+ {
123
+ "id": "chunk_abc123_0",
124
+ "content": "function myFn() { ... }",
125
+ "tokenEstimate": 45,
126
+ "lineStart": 1,
127
+ "lineEnd": 15,
128
+ "chunkingMethod": "semantic-function",
129
+ "context": "function_myFn",
130
+ "imports": ["react"],
131
+ "calls": ["useState"]
132
+ }
133
+ ]
134
+ }
135
+ ],
136
+ "index": {
137
+ "chunkOffsets": {
138
+ "chunk_abc123_0": { "contentStart": 12123, "contentEnd": 12356 }
139
+ },
140
+ "statistics": { "processingTimeMs": 245, "chunksWithValidOffsets": 387 }
141
+ }
142
+ }
143
+ ```
144
+
145
+ **Key RAG features:**
146
+ - Semantic chunking by function, class, or logical block
147
+ - Byte-accurate content offsets for fast random access
148
+ - SHA-256 file hashes for deduplication
149
+ - Language-aware token estimation (±20% accuracy)
150
+ - Import and call graph extraction
151
+ - YAML-configurable via `raggen.config.yaml`
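
The byte offsets in `index.chunkOffsets` are what allow a single chunk to be read without parsing the whole file. The sketch below only illustrates that idea. It assumes that `contentStart` and `contentEnd` are UTF-8 byte positions into the output file, that the end position is exclusive, and that the range covers the raw chunk text; this document confirms none of those details. `readChunkByOffset` is a hypothetical helper, not part of CodeSummary.

```javascript
// Hypothetical helper: slices one chunk out of the raw file bytes.
// Assumes contentStart/contentEnd are UTF-8 byte positions, end-exclusive.
function readChunkByOffset(fileBuffer, offsets) {
  return fileBuffer
    .subarray(offsets.contentStart, offsets.contentEnd)
    .toString("utf8");
}

// Byte offsets differ from character offsets once non-ASCII text appears,
// so they are computed on the encoded buffer, not on the JavaScript string.
const content = "const name = 'Andrés';";
const fileBuffer = Buffer.from(`{"content":"${content}"}`, "utf8");
const contentStart = fileBuffer.indexOf(content, 0, "utf8");
const contentEnd = contentStart + Buffer.byteLength(content, "utf8");

console.log(readChunkByOffset(fileBuffer, { contentStart, contentEnd }));
```

In a real pipeline the same slice could be taken from a file handle with a positioned read, so only the requested bytes are loaded.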
152
+
153
+ File naming: `PROJECTNAME_rag.json` → `PROJECTNAME_rag-v1.json` → ...
154
+
155
+ #### 2.2.3 LLM Markdown (`--format llm`)
156
+
157
+ Generates a single Markdown file optimised for direct consumption by chat-based LLMs.
158
+
159
+ **When to use LLM Markdown:**
160
+ - Asking any LLM chat interface to review or explain your codebase
161
+ - One-off questions that don't justify setting up a RAG pipeline
162
+ - Sharing project context in a conversation without a file upload feature
163
+
164
+ **File structure:**
165
+ ```markdown
166
+ # ProjectName — Code Summary
167
+
168
+ **Generated:** 2026-04-05 | **Files:** 42 | **Total size:** 1.2 MB
169
+
170
+ ---
171
+
172
+ ## File Tree
173
+
174
+ ```
175
+ src/cli.js
176
+ src/scanner.js
177
+ ...
178
+ ```
179
+
180
+ ---
181
+
182
+ ## src/cli.js
183
+
184
+ ```js
185
+ // full file content
186
+ ```
187
+ ```
188
+
189
+ **Lossless optimisations applied automatically:**
190
+
191
+ | Optimisation | Applies to | Notes |
192
+ |---|---|---|
193
+ | Normalise line endings (`\r\n` → `\n`) | All files | Safe for all languages |
194
+ | Strip trailing whitespace per line | All files | Never has semantic meaning |
195
+ | Remove leading/trailing blank lines | All files | Per-file trimming |
196
+ | Compact JSON | `.json` files | Re-serialise without indentation |
197
+ | Max 2 consecutive blank lines | `.md` / `.mdx` | Preserves paragraph semantics |
198
+ | Max 1 consecutive blank line | All other files | Removes filler without touching indentation |
199
+
200
+ **What is never modified:**
201
+ - Indentation (critical for Python, YAML, Makefiles)
202
+ - Multiple spaces within a line (may be in string literals)
203
+ - Comments
204
+ - Code logic
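
The rules in the table above can be sketched as one small function. This is an illustration under assumptions, not the code in `llmGenerator.js`: the name `optimiseContent` is hypothetical, JSON compaction is assumed to go through `JSON.parse` and `JSON.stringify`, and falling back to the generic rules for invalid JSON is a choice made for this sketch.

```javascript
// Hypothetical sketch of the whitespace optimisations listed above.
function optimiseContent(content, filePath) {
  // Normalise line endings for every file type.
  const text = content.replace(/\r\n/g, "\n");

  // Compact JSON by re-serialising without indentation (assumed approach).
  if (filePath.toLowerCase().endsWith(".json")) {
    try {
      return JSON.stringify(JSON.parse(text));
    } catch {
      // Invalid JSON: fall through to the generic rules (sketch assumption).
    }
  }

  // Strip trailing spaces and tabs; leading indentation is never touched.
  const lines = text.split("\n").map((line) => line.replace(/[ \t]+$/, ""));

  // Remove leading and trailing blank lines.
  while (lines.length > 0 && lines[0] === "") lines.shift();
  while (lines.length > 0 && lines[lines.length - 1] === "") lines.pop();

  // Cap runs of blank lines: 2 for Markdown, 1 for everything else.
  const maxBlank = /\.mdx?$/i.test(filePath) ? 2 : 1;
  const kept = [];
  let blankRun = 0;
  for (const line of lines) {
    blankRun = line === "" ? blankRun + 1 : 0;
    if (blankRun <= maxBlank) kept.push(line);
  }
  return kept.join("\n");
}

console.log(JSON.stringify(optimiseContent("a  \r\n\r\n\r\n\r\n  b\r\n\r\n", "x.py")));
```

Only whole blank lines and trailing whitespace are removed, so indentation-sensitive files such as Python or YAML keep their structure.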
205
+
206
+ File naming: `PROJECTNAME_llm.md` → `PROJECTNAME_llm-v1.md` → ...
207
+
208
+ #### 2.2.4 Both (`--format both`)
209
+
210
+ Runs PDF and RAG generation in sequence from a single scan pass. Generation continues on error: if one format fails, the other is still attempted. The process exits with code 1 if either format failed.
211
+
212
+ ---
213
+
214
+ ### 2.3 Configuration Management
215
+
216
+ #### 2.3.1 Storage Locations
217
+
218
+ - **Linux/macOS**: `~/.codesummary/config.json`
219
+ - **Windows**: `%APPDATA%\CodeSummary\config.json`
220
+
221
+ #### 2.3.2 Configuration Structure
222
+
223
+ ```json
224
+ {
225
+ "configVersion": "1.1.0",
226
+ "output": {
227
+ "mode": "fixed | relative",
228
+ "fixedPath": "absolute path"
229
+ },
230
+ "allowedExtensions": ["array of extensions"],
231
+ "excludeDirs": ["array of directory names"],
232
+ "excludeFiles": ["array of glob patterns"],
233
+ "styles": { "colors": {}, "layout": {}, "fonts": {} },
234
+ "settings": {
235
+ "documentTitle": "Project Code Summary",
236
+ "maxFilesBeforePrompt": 500
237
+ }
238
+ }
239
+ ```
240
+
241
+ #### 2.3.3 Smart Migration
242
+
243
+ On every run, new defaults are merged into the existing config using `smartMergeArrays`:
244
+ - Items already present are kept in place
245
+ - New items are appended at the end
246
+ - User removals are respected (removed items are not re-added)
247
+ - Changes are saved automatically and the user is notified
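
The merge rules above can be sketched as follows. The real signature of `smartMergeArrays` is not shown in this document, so the three-argument form here is an assumption. In particular, the sketch assumes the previous defaults are available, which is one way to tell a user removal apart from a default that is genuinely new.

```javascript
// Hypothetical sketch: merges new defaults into a user-edited array.
// userItems:        what the user currently has in config
// previousDefaults: defaults shipped with the version that wrote the config
// newDefaults:      defaults shipped with the current version
function smartMergeArrays(userItems, previousDefaults, newDefaults) {
  const present = new Set(userItems);
  const previouslyOffered = new Set(previousDefaults);
  const additions = [];

  for (const item of newDefaults) {
    // Skip items the user already has, and items that were offered before:
    // if an old default is missing now, the user removed it on purpose.
    if (!present.has(item) && !previouslyOffered.has(item)) {
      additions.push(item);
      present.add(item);
    }
  }

  // Existing items keep their order; new items are appended at the end.
  return [...userItems, ...additions];
}

const merged = smartMergeArrays(
  [".js", ".ts"],                 // user removed ".py"
  [".js", ".py", ".ts"],
  [".js", ".py", ".ts", ".dart"]  // ".dart" is new in this release
);
console.log(merged);
```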
248
+
249
+ #### 2.3.4 Interactive Editor
250
+
251
+ Sections available via `codesummary config`:
252
+ - Output settings (mode, fixed path)
253
+ - Allowed extensions
254
+ - Excluded directories
255
+ - Excluded file patterns
256
+ - General settings (document title, file warning threshold)
257
+
258
+ ---
259
+
260
+ ### 2.4 File System Scanning
261
+
262
+ #### 2.4.1 Algorithm
263
+
264
+ 1. Recursive directory traversal from `process.cwd()`
265
+ 2. Whitelist filtering by allowed extensions
266
+ 3. Directory exclusion by exact name match + built-in common-skip list
267
+ 4. File exclusion by glob pattern matching
268
+ 5. Symlink detection (skipped to avoid loops)
269
+ 6. File size limit: 100 MB per file
270
+ 7. Duplicate detection via absolute path tracking
271
+
272
+ #### 2.4.2 Supported Extensions (defaults)
273
+
274
+ **Web & JavaScript ecosystem:**
275
+ `.js`, `.jsx`, `.ts`, `.tsx`, `.vue`, `.svelte`, `.astro`, `.mdx`
276
+
277
+ **Backend languages:**
278
+ `.py`, `.java`, `.cs`, `.cpp`, `.c`, `.h`, `.go`, `.rs`, `.swift`, `.kt`, `.scala`, `.rb`, `.php`, `.dart`, `.lua`, `.r`, `.ex`, `.exs`, `.pl`
279
+
280
+ **Web & markup:**
281
+ `.html`, `.css`, `.scss`, `.xml`
282
+
283
+ **Data & config:**
284
+ `.json`, `.yaml`, `.yml`, `.toml`, `.ini`, `.properties`, `.tf`, `.tfvars`, `.env`, `.local`, `.cfg`, `.conf`
285
+
286
+ **Schema & query:**
287
+ `.sql`, `.graphql`, `.gql`, `.proto`, `.prisma`
288
+
289
+ **Scripts:**
290
+ `.sh`, `.bat`, `.ps1`, `.mk`, `.cmake`
291
+
292
+ **Documentation:**
293
+ `.md`, `.mdx`, `.txt`
294
+
295
+ **Specialised:**
296
+ `.ino` (Arduino), `.j2` (Jinja2), `.service`, `.timer` (systemd), `.crt` (certificates), `.csv`, `.tsv`
297
+
298
+ #### 2.4.3 Default Excluded Directories
299
+
300
+ Build output: `dist`, `build`, `out`, `target`
301
+ Dependencies: `node_modules`, `vendor`, `bower_components`
302
+ Caches: `.cache`, `.turbo`, `.gradle`, `.yarn`, `.pnpm-store`, `.pytest_cache`, `.mypy_cache`, `.tox`, `htmlcov`
303
+ Version control & IDE: `.git`, `.vscode`, `.idea`
304
+ Framework: `.next`, `.nuxt`, `.angular`, `.svelte-kit`, `.expo`, `.dart_tool`, `storybook-static`
305
+ Python: `__pycache__`, `venv`, `.venv`
306
+ Infrastructure: `.terraform`
307
+
308
+ #### 2.4.4 Default Excluded File Patterns
309
+
310
+ Lock files: `*-lock.json`, `*.lock`, `composer.lock`, `Pipfile.lock`, `*-lock.yaml`
311
+ Minified: `*.min.js`, `*.min.css`, `*.map`
312
+ Compiled: `*.pyc`, `*.pyo`, `*.class`
313
+ Temporary: `*.log`, `*.tmp`, `*.temp`, `*.swp`, `*.bak`, `*.orig`
314
+ OS metadata: `.DS_Store`, `Thumbs.db`, `desktop.ini`, `ehthumbs.db`
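
A pattern list like the one above needs only a small matcher. The sketch below assumes `*` matches any run of characters and `?` matches exactly one, applied to the bare file name and case-sensitively. The name mirrors `matchesGlobPattern` from `utils.js`, but the actual implementation and its matching rules are not shown in this document.

```javascript
// Hypothetical sketch of a minimal glob matcher for file names.
function matchesGlobPattern(fileName, pattern) {
  // Escape regex metacharacters, leaving the glob wildcards * and ? alone.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  // Translate the glob wildcards into their regex equivalents.
  const body = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp(`^${body}$`).test(fileName);
}

const excludeFiles = ["*-lock.json", "*.min.js", ".DS_Store"];
const isExcluded = (name) => excludeFiles.some((p) => matchesGlobPattern(name, p));

console.log(isExcluded("package-lock.json"), isExcluded("app.js"));
```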
315
+
316
+ ---
317
+
318
+ ### 2.5 Versioned Output Files
319
+
320
+ When a target file already exists, a `-vN` suffix is added instead of overwriting:
321
+
322
+ ```
323
+ PROJECTNAME_llm.md ← exists
324
+ PROJECTNAME_llm-v1.md ← created
325
+ (next run: -v1 already exists)
326
+ PROJECTNAME_llm-v2.md ← created
327
+ ```
328
+
329
+ Applies to all three formats. Existing `-vN` suffixes are stripped before re-versioning to avoid `name-v1-v1.md`.
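
The versioning rule can be sketched as below. `resolveVersionedPath` is listed among the `utils.js` helpers, but its signature is not given here, so this version is an assumption: it takes an `exists` predicate instead of touching the file system, which keeps the sketch self-contained.

```javascript
// Hypothetical sketch: returns targetPath if it is free, otherwise the first
// free "-vN" variant. `exists` is an injected predicate (sketch assumption).
function resolveVersionedPath(targetPath, exists) {
  if (!exists(targetPath)) return targetPath;

  // Split off the extension, ignoring dots in directory names and dotfiles.
  const dot = targetPath.lastIndexOf(".");
  const slash = Math.max(targetPath.lastIndexOf("/"), targetPath.lastIndexOf("\\"));
  const hasExt = dot > slash + 1;
  const ext = hasExt ? targetPath.slice(dot) : "";

  // Strip an existing -vN suffix so "name-v1.md" never becomes "name-v1-v1.md".
  const stem = (hasExt ? targetPath.slice(0, dot) : targetPath).replace(/-v\d+$/, "");

  for (let n = 1; ; n += 1) {
    const candidate = `${stem}-v${n}${ext}`;
    if (!exists(candidate)) return candidate;
  }
}

const onDisk = new Set(["out/PROJECT_llm.md", "out/PROJECT_llm-v1.md"]);
console.log(resolveVersionedPath("out/PROJECT_llm.md", (p) => onDisk.has(p)));
```

In the CLI itself the predicate would presumably be backed by a file system check; that wiring is outside the scope of this sketch.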
330
+
331
+ ---
332
+
333
+ ### 2.6 Non-Interactive Mode
334
+
335
+ `--no-interactive` (or non-TTY stdin) skips:
336
+ - Extension selection checkbox → all detected extensions selected
337
+ - File count confirmation prompt → proceeds automatically
338
+
339
+ Designed for use in CI/CD pipelines and scripted environments.
340
+
341
+ ---
342
+
343
+ ### 2.7 Error Handling
344
+
345
+ - **Path traversal prevention**: blocks `..`, null bytes, and Windows reserved names
346
+ - **Non-ASCII path support**: Unicode characters in paths (e.g. `C:\Users\Andrés\...`) are preserved correctly
347
+ - **Graceful scan errors**: permission denied and missing files are logged but don't abort the scan
348
+ - **PDF stream errors**: file-in-use (EBUSY/EACCES) triggers versioned filename fallback
349
+ - **LLM/RAG errors**: unreadable files emit a warning block in output instead of crashing
350
+ - **`--format both` failures**: continue-on-error; both outputs attempted, all errors reported together
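
The first two bullets can be illustrated with a small check. This is a sketch under assumptions rather than the logic in `errorHandler.js`: `isSafePath` is a hypothetical name, and the reserved-name list follows the standard Windows device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9), of which this document names only a few.

```javascript
// Windows reserved device names, with or without an extension (e.g. "NUL.txt").
const WINDOWS_RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])(\..*)?$/i;

// Hypothetical sketch: rejects traversal segments, null bytes, and reserved
// names, while leaving non-ASCII characters in the path untouched.
function isSafePath(inputPath) {
  if (inputPath.includes("\0")) return false;
  const segments = inputPath.split(/[\\/]+/);
  return segments.every(
    (segment) => segment !== ".." && !WINDOWS_RESERVED.test(segment)
  );
}

console.log(isSafePath("C:\\Users\\Andrés\\project"), isSafePath("../etc/passwd"));
```

Checking per path segment, rather than searching the whole string for `..`, avoids rejecting legitimate names such as `notes..md`.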
351
+
352
+ ---
353
+
354
+ ## 3. Technical Architecture
355
+
356
+ ### 3.1 Module Structure
357
+
358
+ ```
359
+ src/
360
+ ├── cli.js # Argument parsing, orchestration, user interaction
361
+ ├── scanner.js # Recursive directory scanning and filtering
362
+ ├── pdfGenerator.js # PDF generation (PDFKit, streaming)
363
+ ├── ragGenerator.js # RAG JSON generation with semantic chunking
364
+ ├── llmGenerator.js # LLM Markdown generation with content optimisations
365
+ ├── configManager.js # Global config load, save, migrate, edit
366
+ ├── ragConfig.js # RAG-specific config (YAML loading, defaults)
367
+ ├── errorHandler.js # Path validation, sanitisation, global error handlers
368
+ └── utils.js # Shared: formatFileSize, getExtensionDescription,
369
+ # matchesGlobPattern, resolveVersionedPath
370
+ ```
371
+
372
+ ### 3.2 Data Flow
373
+
374
+ ```
375
+ bin/codesummary.js
376
+ └─ src/index.js (bootstrap)
377
+ └─ src/cli.js (parse args → executeMainFlow)
378
+ ├─ scanner.js (scan → filesByExtension)
379
+ ├─ pdfGenerator.js (format: pdf)
380
+ ├─ ragGenerator.js (format: rag) ← uses ragConfig.js
381
+ └─ llmGenerator.js (format: llm)
382
+ ```
383
+
384
+ ### 3.3 Key Design Decisions
385
+
386
+ - **ESM modules** throughout (`"type": "module"`)
387
+ - **No singleton exports** — all modules export classes, instantiated at call site
388
+ - **Shared utilities** in `utils.js` — single source of truth, no duplication
389
+ - **Streaming writes** for PDF and LLM output — memory-efficient on large projects
390
+ - **Static imports** only — dynamic `import()` avoided for consistency
391
+
392
+ ---
393
+
394
+ ## 4. Security
395
+
396
+ - Path traversal (`..`) blocked via pattern matching before any file operation
397
+ - User-supplied paths sanitised: control characters and injection sequences removed
398
+ - Unicode characters in paths preserved (non-ASCII allowed)
399
+ - Windows reserved device names (CON, NUL, COM1, etc.) rejected
400
+ - No external network calls at runtime — fully offline operation
401
+ - Config validated against schema before use; corrupt config prompts reset
402
+
403
+ ---
404
+
405
+ ## 5. Future Enhancements
406
+
407
+ - `--format all`: PDF + RAG + LLM in a single pass
408
+ - Syntax highlighting in PDF output
409
+ - Clickable table of contents with bookmarks in PDF
410
+ - Git integration: document only changed files since last commit
411
+ - CI/CD plugin for automated documentation on push
412
+ - Custom PDF themes and styling
413
+
414
+ ---
415
+
416
+ **Document Version**: 3.0
417
+ **Last Updated**: 2026-04-05
418
+ **Status**: Implementation Complete