codesummary 1.1.1 → 1.2.1

package/README.md CHANGED
@@ -1,607 +1,483 @@
1
- # CodeSummary
2
-
3
- [![npm version](https://badge.fury.io/js/codesummary.svg)](https://badge.fury.io/js/codesummary)
4
- [![Node.js Version](https://img.shields.io/badge/node-%3E%3D18.0.0-brightgreen.svg)](https://nodejs.org/)
5
- [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
6
- [![Cross-Platform](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey)](#)
7
-
8
- A **cross-platform CLI tool** that automatically scans project source code and generates both **clean, professional PDF documentation** and **RAG-optimized JSON outputs** for AI/ML applications. Perfect for code reviews, audits, project documentation, archival snapshots, and feeding code into vector databases or LLM systems.
9
-
10
- ## 🚀 Key Features
11
-
12
- ### 📄 **PDF Generation**
13
- - **🔍 Intelligent Scanning**: Recursively scans project directories with configurable file type filtering
14
- - **📄 Clean PDF Output**: Generates well-structured A4 PDFs with optimized formatting and complete content flow
15
- - **📝 Complete Content**: Includes ALL file content without truncation - no size limits
16
-
17
- ### 🤖 **RAG & AI Integration** *(New in v1.1.0)*
18
- - **📊 RAG-Optimized JSON**: Purpose-built output format for vector databases and LLM applications
19
- - **🎯 Semantic Chunking**: Intelligent code segmentation by functions, classes, and logical blocks
20
- - **📈 Precision Offsets**: Byte-accurate indexing for rapid content retrieval (99.8% precision)
21
- - **🧠 Smart Token Estimation**: Language-aware token counting with 20% improved accuracy
22
- - **⚡ High-Performance Seeking**: Complete offset index for instant chunk access in RAG pipelines
23
- - **🔄 Schema Versioning**: Future-proof JSON structure with migration support
24
- - **⚙️ Global Configuration**: One-time setup with persistent cross-platform user preferences
25
- - **🎯 Interactive Selection**: Choose which file types to include via intuitive checkbox prompts
26
- - **🛡️ Safe & Smart**: Whitelist-driven approach prevents binary files, with intelligent fallbacks
27
- - **🌍 Cross-Platform**: Works identically on Windows, macOS, and Linux with terminal compatibility
28
- - **📊 Smart Filtering**: Automatically excludes build directories, dependencies, and temporary files
29
- - **⚡ Performance Optimized**: Efficient memory usage and streaming for large projects
30
- - **🔄 File Conflict Handling**: Automatic timestamped filenames when original files are in use
31
-
32
- ## 📦 Installation
33
-
34
- ```bash
35
- npm install -g codesummary
36
- ```
37
-
38
- **Requirements**: Node.js ≥ 18.0.0
39
-
40
- ## 🎯 Dual Output Modes
41
-
42
- ### 📄 PDF Mode (Default)
43
- Generate clean, professional PDF documentation:
44
-
45
- ```bash
46
- codesummary
47
- # Creates: PROJECT_code.pdf
48
- ```
49
-
50
- ### 🤖 RAG Mode *(New!)*
51
- Generate RAG-optimized JSON for AI applications:
52
-
53
- ```bash
54
- codesummary --rag
55
- # Creates: PROJECT_rag.json with semantic chunks and precise offsets
56
- ```
57
-
58
- ### 🔄 Both Modes
59
- Generate both PDF and RAG outputs:
60
-
61
- ```bash
62
- codesummary --both
63
- # Creates: PROJECT_code.pdf + PROJECT_rag.json
64
- ```
65
-
66
- ## 🎯 Quick Start
67
-
68
- ### 📄 **PDF Generation**
69
- 1. **First-time setup** (interactive wizard):
70
- ```bash
71
- codesummary
72
- ```
73
-
74
- 2. **Generate PDF for current project**:
75
- ```bash
76
- cd /path/to/your/project
77
- codesummary
78
- ```
79
-
80
- ### 🤖 **RAG/AI Integration**
81
- 1. **Generate RAG JSON** for vector databases:
82
- ```bash
83
- codesummary --rag
84
- ```
85
-
86
- 2. **Use in your AI pipeline**:
87
- ```javascript
88
- // Example: Loading and using RAG output
89
- const ragData = JSON.parse(fs.readFileSync('project_rag.json'));
90
-
91
- // Access semantic chunks
92
- const chunks = ragData.files.flatMap(f => f.chunks);
93
-
94
- // Use precise offsets for rapid seeking
95
- const chunkId = 'chunk_abc123_0';
96
- const offset = ragData.index.chunkOffsets[chunkId];
97
- // Seek from offset.contentStart to offset.contentEnd for exact content
98
- ```
99
-
100
- 3. **Override output location**:
101
- ```bash
102
- codesummary --rag --output ./ai-data
103
- ```
104
-
105
- ## 📖 Usage
106
-
107
- ### Interactive Workflow
108
-
109
- #### 1. First Run Setup
110
-
111
- ```bash
112
- $ codesummary
113
- Welcome to CodeSummary!
114
- No configuration found. Starting setup...
115
-
116
- Where should the PDF be generated by default?
117
- > [ ] Current working directory (relative mode)
118
- > [x] Fixed folder (absolute mode)
119
-
120
- Enter absolute path for fixed folder:
121
- > ~/Desktop/CodeSummaries
122
- ```
123
-
124
- #### 2. Extension Selection
125
-
126
- ```bash
127
- Scanning directory: /path/to/project
128
-
129
- Scan Summary:
130
- Extensions found: .js, .ts, .md, .json
131
- Total files: 127
132
- Total size: 2.4 MB
133
-
134
- Select file extensions to include:
135
- [x] .js → JavaScript (42 files)
136
- [x] .ts → TypeScript (28 files)
137
- [x] .md → Markdown (5 files)
138
- [ ] .json → JSON (52 files)
139
- ```
140
-
141
- #### 3. Generation Complete
142
-
143
- ```bash
144
- SUCCESS: PDF generation completed successfully!
145
-
146
- Summary:
147
- Output: ~/Desktop/CodeSummaries/MYPROJECT_code.pdf
148
- Extensions: .js, .ts, .md
149
- Total files: 75
150
- PDF size: 2.3 MB
151
- ```
152
-
153
- ### Command Reference
154
-
155
- | Command | Description |
156
- | ---------------------------- | --------------------------------------- |
157
- | `codesummary` | Generate PDF documentation (default) |
158
- | `codesummary --rag` | Generate RAG-optimized JSON output |
159
- | `codesummary --both` | Generate both PDF and RAG outputs |
160
- | `codesummary config` | Edit configuration settings |
161
- | `codesummary --show-config` | Display current configuration |
162
- | `codesummary --reset-config` | Reset configuration to defaults |
163
- | `codesummary --help` | Show help information |
164
-
165
- ### Command Line Options
166
-
167
- | Option | Description |
168
- | --------------------- | ---------------------------------------- |
169
- | `-o, --output <path>` | Override output directory for this run |
170
- | `--rag` | Generate RAG-optimized JSON output |
171
- | `--both` | Generate both PDF and RAG outputs |
172
- | `--show-config` | Display current configuration |
173
- | `--reset-config` | Reset configuration and run setup wizard |
174
- | `-h, --help` | Show help message |
175
-
176
- ### Examples
177
-
178
- ```bash
179
- # Generate PDF with default settings
180
- codesummary
181
-
182
- # Generate RAG JSON for AI/ML applications
183
- codesummary --rag
184
-
185
- # Generate both PDF and RAG outputs
186
- codesummary --both
187
-
188
- # Save outputs to specific directory
189
- codesummary --both --output ~/Documents/AIData
190
-
191
- # Edit configuration
192
- codesummary config
193
-
194
- # View current settings
195
- codesummary --show-config
196
- ```
197
-
198
- ## ⚙️ Configuration
199
-
200
- CodeSummary stores global configuration in:
201
-
202
- - **Linux/macOS**: `~/.codesummary/config.json`
203
- - **Windows**: `%APPDATA%\\CodeSummary\\config.json`
204
-
205
- ### Default Configuration
206
-
207
- ```json
208
- {
209
- "output": {
210
- "mode": "fixed",
211
- "fixedPath": "~/Desktop/CodeSummaries"
212
- },
213
- "allowedExtensions": [
214
- ".json", ".ts", ".js", ".jsx", ".tsx", ".xml", ".html",
215
- ".css", ".scss", ".md", ".txt", ".py", ".java", ".cs",
216
- ".cpp", ".c", ".h", ".yaml", ".yml", ".sh", ".bat",
217
- ".ps1", ".php", ".rb", ".go", ".rs", ".swift", ".kt",
218
- ".scala", ".vue", ".svelte", ".dockerfile", ".sql", ".graphql"
219
- ],
220
- "excludeDirs": [
221
- "node_modules", ".git", ".vscode", "dist", "build",
222
- "coverage", "out", "__pycache__", ".next", ".nuxt"
223
- ],
224
- "styles": {
225
- "colors": {
226
- "title": "#333353",
227
- "section": "#00FFB9",
228
- "text": "#333333",
229
- "error": "#FF4D4D",
230
- "footer": "#666666"
231
- },
232
- "layout": {
233
- "marginLeft": 40,
234
- "marginTop": 40,
235
- "marginRight": 40,
236
- "footerHeight": 20
237
- }
238
- },
239
- "settings": {
240
- "documentTitle": "Project Code Summary",
241
- "maxFilesBeforePrompt": 500
242
- }
243
- }
244
- ```
245
-
246
- ## 📋 PDF Structure
247
-
248
- Generated PDFs use **A4 format** with optimized margins and contain three main sections:
249
-
250
- ### 1. Project Overview
251
-
252
- - Document title and project name
253
- - Generation timestamp
254
- - List of included file types with descriptions
255
-
256
- ### 2. File Structure
257
-
258
- - Complete hierarchical listing of all included files
259
- - Organized by relative paths from project root
260
- - Sorted alphabetically for easy navigation
261
-
262
- ### 3. File Content
263
-
264
- - **Complete source code** for each file (no truncation)
265
- - Proper formatting with monospace fonts for code
266
- - Intelligent text wrapping without overlap
267
- - Natural page breaks when needed
268
- - Error handling for unreadable files
269
-
270
- ## 🤖 RAG JSON Structure *(New in v1.1.0)*
271
-
272
- The RAG-optimized JSON output is purpose-built for AI/ML applications, vector databases, and LLM integration:
273
-
274
- ### 📊 **Complete JSON Schema**
275
-
276
- ```json
277
- {
278
- "metadata": {
279
- "projectName": "MyProject",
280
- "generatedAt": "2025-07-31T08:00:00.000Z",
281
- "version": "3.1.0",
282
- "schemaVersion": "1.0",
283
- "schemaUrl": "https://github.com/skamoll/CodeSummary/schemas/rag-output.json",
284
- "config": {
285
- "maxTokensPerChunk": 1000,
286
- "tokenEstimationMethod": "enhanced_heuristic_v1.0"
287
- }
288
- },
289
- "files": [
290
- {
291
- "id": "abc123def456",
292
- "path": "src/component.js",
293
- "language": "JavaScript",
294
- "size": 2048,
295
- "hash": "sha256-...",
296
- "chunks": [
297
- {
298
- "id": "chunk_abc123def456_0",
299
- "content": "function myFunction() { ... }",
300
- "tokenEstimate": 45,
301
- "lineStart": 1,
302
- "lineEnd": 15,
303
- "chunkingMethod": "semantic-function",
304
- "context": "function_myFunction",
305
- "imports": ["lodash", "react"],
306
- "calls": ["useState", "useEffect"]
307
- }
308
- ]
309
- }
310
- ],
311
- "index": {
312
- "summary": {
313
- "fileCount": 42,
314
- "chunkCount": 387,
315
- "totalBytes": 1048576,
316
- "languages": ["JavaScript", "TypeScript"],
317
- "extensions": [".js", ".ts"]
318
- },
319
- "chunkOffsets": {
320
- "chunk_abc123def456_0": {
321
- "jsonStart": 12045,
322
- "jsonEnd": 12389,
323
- "contentStart": 12123,
324
- "contentEnd": 12356,
325
- "filePath": "src/component.js"
326
- }
327
- },
328
- "fileOffsets": {
329
- "abc123def456": [8192, 16384]
330
- },
331
- "statistics": {
332
- "processingTimeMs": 245,
333
- "bytesPerSecond": 4278190,
334
- "chunksWithValidOffsets": 387
335
- }
336
- }
337
- }
338
- ```
339
-
340
- ### 🎯 **Key RAG Features**
341
-
342
- #### **1. Semantic Chunking**
343
- - **Function-based segmentation**: Each function, class, or logical block becomes a chunk
344
- - **Context preservation**: Maintains relationships between code elements
345
- - **Smart boundaries**: Respects language syntax and structure
346
- - **Metadata enrichment**: Includes imports, function calls, and context tags
347
-
348
- #### **2. Precision Offsets (99.8% accuracy)**
349
- - **Byte-accurate positioning**: Exact start/end positions for rapid seeking
350
- - **Dual offset system**: Both JSON structure and content offsets
351
- - **Instant retrieval**: No need to parse entire file to access specific chunks
352
- - **Vector DB optimized**: Perfect for embedding-based retrieval systems
353
-
354
- #### **3. Enhanced Token Estimation**
355
- - **Language-aware calculation**: JavaScript gets different treatment than Python
356
- - **Syntax consideration**: Accounts for operators, brackets, and language-specific tokens
357
- - **20% more accurate**: Better LLM context planning and token budget management
358
- - **Multiple heuristics**: Character count, word count, and syntax analysis combined
359
-
360
- #### **4. Complete Statistics & Monitoring**
361
- - **Processing metrics**: Time, throughput, success rates
362
- - **Quality indicators**: Valid offsets, empty files, error tracking
363
- - **Project insights**: Language distribution, file sizes, chunk density
364
-
365
- ### 🚀 **RAG Integration Examples**
366
-
367
- #### **Vector Database Integration**
368
- ```javascript
369
- // Load RAG output
370
- const ragData = JSON.parse(fs.readFileSync('project_rag.json'));
371
-
372
- // Extract chunks for embedding
373
- const chunks = ragData.files.flatMap(file =>
374
- file.chunks.map(chunk => ({
375
- id: chunk.id,
376
- content: chunk.content,
377
- metadata: {
378
- filePath: file.path,
379
- language: file.language,
380
- tokenEstimate: chunk.tokenEstimate,
381
- context: chunk.context
382
- }
383
- }))
384
- );
385
-
386
- // Create embeddings and store in vector DB
387
- for (const chunk of chunks) {
388
- const embedding = await createEmbedding(chunk.content);
389
- await vectorDB.store(chunk.id, embedding, chunk.metadata);
390
- }
391
- ```
392
-
393
- #### **Rapid Content Retrieval**
394
- ```javascript
395
- // Fast chunk access using offsets
396
- const chunkId = 'chunk_abc123def456_15';
397
- const offset = ragData.index.chunkOffsets[chunkId];
398
-
399
- // Direct file seeking (no JSON parsing needed)
400
- const fd = fs.openSync('project_rag.json', 'r');
401
- const buffer = Buffer.alloc(offset.contentEnd - offset.contentStart);
402
- fs.readSync(fd, buffer, 0, buffer.length, offset.contentStart);
403
- const chunkContent = buffer.toString();
404
- ```
405
-
406
- #### **LLM Context Building**
407
- ```javascript
408
- // Smart context assembly
409
- function buildContext(relevantChunkIds, maxTokens = 4000) {
410
- let context = '';
411
- let tokenCount = 0;
412
-
413
- for (const chunkId of relevantChunkIds) {
414
- const chunk = findChunkById(chunkId);
415
- if (tokenCount + chunk.tokenEstimate <= maxTokens) {
416
- context += `// File: ${chunk.filePath}\n${chunk.content}\n\n`;
417
- tokenCount += chunk.tokenEstimate;
418
- }
419
- }
420
-
421
- return { context, tokenCount };
422
- }
423
- ```
424
-
425
- ### 📈 **Performance Benefits**
426
-
427
- | Operation | Traditional Parsing | RAG Offsets | Speedup |
428
- |-----------|-------------------|-------------|----------|
429
- | Single chunk access | ~50ms | ~0.1ms | **500x** |
430
- | Multiple chunk retrieval | ~200ms | ~0.5ms | **400x** |
431
- | File-based filtering | ~100ms | ~0.2ms | **500x** |
432
- | Context assembly | ~300ms | ~1ms | **300x** |
433
-
434
- ## 🔧 Advanced Features
435
-
436
- ### Smart File Conflict Handling
437
-
438
- When the target PDF file is in use (e.g., open in a PDF viewer), CodeSummary automatically creates a timestamped version:
439
-
440
- ```bash
441
- # Original filename
442
- MYPROJECT_code.pdf
443
-
444
- # If file is in use, creates:
445
- MYPROJECT_code_20250729_141602.pdf
446
- ```
447
-
448
- ### Large File Processing
449
-
450
- - **No file size limits**: Processes files of any size completely
451
- - **Progress indicators**: Shows processing status for large files
452
- - **Memory efficient**: Uses streaming for optimal performance
453
- - **Smart warnings**: Informs about large files being processed
454
-
455
- ### Terminal Compatibility
456
-
457
- - **Universal compatibility**: Works with all terminal types and operating systems
458
- - **No special characters**: Uses standard ASCII text for maximum compatibility
459
- - **Clear output**: Color-coded messages with fallback text indicators
460
-
461
- ## 🎨 Supported File Types
462
-
463
- CodeSummary supports an extensive range of text-based file formats:
464
-
465
- | Extension | Language/Type | Extension | Language/Type |
466
- | --------- | -------------- | ------------ | ------------- |
467
- | `.js` | JavaScript | `.py` | Python |
468
- | `.ts` | TypeScript | `.java` | Java |
469
- | `.jsx` | React JSX | `.cs` | C# |
470
- | `.tsx` | TypeScript JSX | `.cpp` | C++ |
471
- | `.json` | JSON | `.c` | C |
472
- | `.xml` | XML | `.h` | Header |
473
- | `.html` | HTML | `.yaml/.yml` | YAML |
474
- | `.css` | CSS | `.sh` | Shell Script |
475
- | `.scss` | SCSS | `.bat` | Batch File |
476
- | `.md` | Markdown | `.ps1` | PowerShell |
477
- | `.txt` | Plain Text | `.php` | PHP |
478
- | `.go` | Go | `.rb` | Ruby |
479
- | `.rs` | Rust | `.swift` | Swift |
480
- | `.kt` | Kotlin | `.scala` | Scala |
481
- | `.vue` | Vue.js | `.svelte` | Svelte |
482
- | `.sql` | SQL | `.graphql` | GraphQL |
483
-
484
- ## 🛠️ Development
485
-
486
- ### Project Structure
487
-
488
- ```
489
- codesummary/
490
- ├── bin/
491
- │ └── codesummary.js # Global executable entry point
492
- ├── src/
493
- │ ├── cli.js # Command line interface
494
- │ ├── configManager.js # Global configuration management
495
- │ ├── scanner.js # File system scanning and filtering
496
- │ ├── pdfGenerator.js # PDF creation and formatting
497
- │ └── errorHandler.js # Comprehensive error handling
498
- ├── package.json
499
- ├── README.md
500
- └── features.md
501
- ```
502
-
503
- ### Building from Source
504
-
505
- ```bash
506
- # Clone repository
507
- git clone https://github.com/skamoll/CodeSummary.git
508
- cd CodeSummary
509
-
510
- # Install dependencies
511
- npm install
512
-
513
- # Test the CLI
514
- node bin/codesummary.js --help
515
-
516
- # Run locally without global install
517
- node bin/codesummary.js
518
- ```
519
-
520
- ## 🔍 Troubleshooting
521
-
522
- ### Common Issues
523
-
524
- **Configuration not found**
525
-
526
- - Run `codesummary` to trigger first-time setup
527
- - Check file permissions in config directory
528
-
529
- **PDF generation fails**
530
-
531
- - Verify output directory permissions
532
- - Ensure Node.js version ≥18.0.0
533
- - Close any open PDF viewers on the target file
534
-
535
- **Files not showing up**
536
-
537
- - Check that file extensions are in `allowedExtensions`
538
- - Verify directories aren't in `excludeDirs` list
539
- - Ensure files are text-based (not binary)
540
-
541
- **Large project performance**
542
-
543
- - Adjust `maxFilesBeforePrompt` in configuration
544
- - Use extension filtering to reduce file count
545
- - CodeSummary handles large files efficiently with streaming
546
-
547
- ### Getting Help
548
-
549
- 1. Run `codesummary --help` for usage information
550
- 2. Check configuration with `codesummary --show-config`
551
- 3. Reset configuration with `codesummary --reset-config`
552
- 4. Open an issue on [GitHub](https://github.com/skamoll/CodeSummary/issues)
553
-
554
- ## 🤝 Contributing
555
-
556
- We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
557
-
558
- ### Development Setup
559
-
560
- 1. Fork the repository
561
- 2. Clone your fork: `git clone https://github.com/yourusername/CodeSummary.git`
562
- 3. Install dependencies: `npm install`
563
- 4. Create a feature branch: `git checkout -b feature-name`
564
- 5. Make your changes and test thoroughly
565
- 6. Submit a pull request
566
-
567
- ## 📄 License
568
-
569
- This project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE) file for details.
570
-
571
- ### License Summary
572
-
573
- - ✅ Commercial use permitted
574
- - ✅ Modification allowed
575
- - ✅ Distribution allowed
576
- - ✅ Private use allowed
577
- - ❗ Copyleft: derivative works must use GPL-3.0
578
- - ❗ Must include license and copyright notice
579
-
580
- ## 🙏 Acknowledgments
581
-
582
- - Built with [PDFKit](https://pdfkit.org/) for PDF generation
583
- - Uses [Inquirer.js](https://github.com/SBoudrias/Inquirer.js) for interactive prompts
584
- - Styled with [Chalk](https://github.com/chalk/chalk) for colorful console output
585
- - Uses [Ora](https://github.com/sindresorhus/ora) for progress indicators
586
-
587
- ## 📊 Roadmap
588
-
589
- ### Future Enhancements
590
-
591
- - [ ] Syntax highlighting in PDF output
592
- - [ ] Clickable table of contents with bookmarks
593
- - [ ] Multiple output formats (HTML, JSON, Markdown)
594
- - [ ] Project metrics and code statistics
595
- - [ ] CI/CD integration mode for automated documentation
596
- - [ ] Custom PDF themes and styling options
597
- - [ ] Plugin system for custom processors
598
-
599
- ## 📞 Support
600
-
601
- - 📧 Report bugs: [GitHub Issues](https://github.com/skamoll/CodeSummary/issues)
602
- - 💬 Ask questions: [GitHub Discussions](https://github.com/skamoll/CodeSummary/discussions)
603
- - 📖 Documentation: [Wiki](https://github.com/skamoll/CodeSummary/wiki)
604
-
605
- ---
606
-
607
- **Made with ❤️ for developers worldwide**
1
+ # CodeSummary
2
+
3
+ [![npm version](https://badge.fury.io/js/codesummary.svg)](https://badge.fury.io/js/codesummary)
4
+ [![Node.js Version](https://img.shields.io/badge/node-%3E%3D18.0.0-brightgreen.svg)](https://nodejs.org/)
5
+ [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
6
+ [![Cross-Platform](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey)](#)
7
+
8
+ A **cross-platform CLI tool** that scans your project's source code and generates three types of output:
9
+
10
+ - **PDF** — clean, professional A4 documentation for code reviews and audits
11
+ - **RAG JSON** — semantic chunks with byte offsets and token estimates, ready for vector databases and LLM pipelines
12
+ - **LLM Markdown** — token-optimised single file to paste directly into any chat-based LLM
13
+
14
+ ## 🚀 Key Features
15
+
16
+ - **Three output formats**: PDF, RAG JSON, LLM Markdown — pick what you need
17
+ - **Intelligent scanning**: recursive traversal with configurable whitelist filtering
18
+ - **Extensive language support**: 50+ file types out of the box
19
+ - **Versioned output files**: `-v1`, `-v2` suffixes instead of overwriting or timestamps
20
+ - **Non-interactive mode**: `--no-interactive` for CI/CD and scripted use
21
+ - **Smart config migration**: new defaults merge into existing config without overwriting customisations
22
+ - **Cross-platform**: identical behaviour on Windows, macOS, and Linux
23
+
24
+ ## 📦 Installation
25
+
26
+ ```bash
27
+ npm install -g codesummary
28
+ ```
29
+
30
+ **Requirements**: Node.js ≥ 18.0.0
31
+
32
+ ---
33
+
34
+ ## 🎯 Output Modes
35
+
36
+ ### 📄 PDF — Documentation & audits
37
+
38
+ Generate a professional A4 PDF with file structure and complete source content:
39
+
40
+ ```bash
41
+ codesummary
42
+ # or explicitly:
43
+ codesummary --format pdf
44
+ # Output: MYPROJECT_code.pdf
45
+ ```
46
+
47
+ Best for: code reviews, client handovers, compliance audits, archival snapshots.
48
+
49
+ ---
50
+
51
+ ### 🤖 RAG — Vector databases & AI pipelines
52
+
53
+ Generate a structured JSON file built for embedding and retrieval:
54
+
55
+ ```bash
56
+ codesummary --format rag
57
+ # Output: MYPROJECT_rag.json
58
+ ```
59
+
60
+ The JSON contains semantic chunks (split by function, class, or logical block) with:
61
+ - Byte-accurate content offsets for fast seeking
62
+ - SHA-256 file hashes for deduplication
63
+ - Token estimates for context budget planning
64
+ - Import/call extraction for graph-based retrieval
65
+ - Full statistics for monitoring
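The token estimates listed above can drive a simple context budget. The following is a minimal sketch, assuming each chunk object carries the `tokenEstimate` field shown in the RAG JSON structure; `selectWithinBudget` is a hypothetical helper name, not part of CodeSummary.

```javascript
// Hypothetical helper: greedily keep chunks (in the given order) while the
// running sum of their tokenEstimate values stays within maxTokens.
function selectWithinBudget(chunks, maxTokens) {
  const selected = [];
  let used = 0;
  for (const chunk of chunks) {
    // Skip any chunk that would push the total over the budget.
    if (used + chunk.tokenEstimate > maxTokens) continue;
    selected.push(chunk);
    used += chunk.tokenEstimate;
  }
  return { selected, used };
}

// Illustrative data only; real chunks come from the generated JSON file.
const demoChunks = [
  { id: 'a', tokenEstimate: 45 },
  { id: 'b', tokenEstimate: 80 },
  { id: 'c', tokenEstimate: 30 },
];
console.log(selectWithinBudget(demoChunks, 100));
```

The greedy pass keeps the caller's ordering, so ranking chunks by relevance before calling it decides which ones win the budget.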
66
+
67
+ Best for: building RAG systems, loading code into a vector database (Pinecone, Qdrant, Chroma, etc.), or programmatic LLM integration where you control chunking and retrieval.
68
+
69
+ ---
70
+
71
+ ### 💬 LLM — Direct chat with any AI assistant
72
+
73
+ Generate a single, token-optimised Markdown file to paste directly into any LLM:
74
+
75
+ ```bash
76
+ codesummary --format llm
77
+ # Output: MYPROJECT_llm.md
78
+ ```
79
+
80
+ The file contains a project header, a complete file tree, and each file's content in a fenced code block with syntax highlighting hints. These lossless optimisations are applied automatically to reduce token count:
81
+
82
+ - Line endings normalised to `\n`
83
+ - Trailing whitespace stripped per line
84
+ - Leading/trailing blank lines removed per file
85
+ - JSON files compacted (re-serialised without indentation)
86
+ - Markdown files: max 2 consecutive blank lines preserved
87
+ - All other files: max 1 consecutive blank line
88
+
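The optimisations listed above can be sketched as one small function. This is an illustration under stated assumptions, not CodeSummary's actual implementation: `normaliseForLlm` is a hypothetical name, the extension is passed in explicitly, and invalid JSON is assumed to fall back to the plain-text rules.

```javascript
// Hypothetical sketch of the lossless clean-up steps described in the README.
function normaliseForLlm(content, extension) {
  // JSON: re-serialise without indentation. Assumption: invalid JSON falls
  // through to the generic text rules instead of failing.
  if (extension === '.json') {
    try {
      return JSON.stringify(JSON.parse(content));
    } catch {
      // not valid JSON, handle as plain text below
    }
  }

  // Markdown keeps up to 2 consecutive blank lines, everything else 1.
  const maxBlank = extension === '.md' ? 2 : 1;

  const lines = content
    .replace(/\r\n?/g, '\n')                     // normalise line endings to \n
    .split('\n')
    .map(line => line.replace(/[ \t]+$/, ''));   // strip trailing whitespace

  // Collapse runs of blank lines down to the allowed maximum.
  const out = [];
  let blankRun = 0;
  for (const line of lines) {
    blankRun = line === '' ? blankRun + 1 : 0;
    if (blankRun <= maxBlank) out.push(line);
  }

  // Remove leading and trailing blank lines.
  while (out.length && out[0] === '') out.shift();
  while (out.length && out[out.length - 1] === '') out.pop();
  return out.join('\n');
}

console.log(JSON.stringify(normaliseForLlm('\r\n\r\na  \r\n\r\n\r\n\r\nb\t\n\n', '.js')));
```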
89
+ Best for: asking any LLM chat interface to review, explain, or work with your codebase in a single paste — much more token-efficient than a PDF.
90
+
91
+ ---
92
+
93
+ ### 🔄 Both — PDF + RAG together
94
+
95
+ ```bash
96
+ codesummary --format both
97
+ # Output: MYPROJECT_code.pdf + MYPROJECT_rag.json
98
+ ```
99
+
100
+ Uses a single scan pass. If one format fails, the other still completes.
101
+
102
+ ---
103
+
104
+ ## 📖 Usage
105
+
106
+ ### Quick start
107
+
108
+ ```bash
109
+ # First run: interactive setup wizard
110
+ codesummary
111
+
112
+ # Generate LLM Markdown for the current project
113
+ codesummary --format llm
114
+
115
+ # Generate RAG JSON and save to a specific directory
116
+ codesummary --format rag --output ./ai-data
117
+
118
+ # Skip all prompts (CI-friendly)
119
+ codesummary --format llm --no-interactive
120
+
121
+ # Generate everything at once
122
+ codesummary --format both
123
+ ```
124
+
125
+ ### Interactive workflow
126
+
127
+ #### 1. First-run setup
128
+
129
+ ```
130
+ Welcome to CodeSummary!
131
+ No configuration found. Starting setup...
132
+
133
+ Where should the output be saved by default?
134
+ > [ ] Current working directory (relative mode)
135
+ > [x] Fixed folder (absolute mode)
136
+
137
+ Enter absolute path for fixed folder:
138
+ > ~/Desktop/CodeSummaries
139
+ ```
140
+
141
+ #### 2. Extension selection
142
+
143
+ ```
144
+ Scan Summary:
145
+ Extensions found: .js, .ts, .md, .json
146
+ Total files: 127 — Total size: 2.4 MB
147
+
148
+ Select file extensions to include:
149
+ [x] .js → JavaScript (42 files)
150
+ [x] .ts → TypeScript (28 files)
151
+ [x] .md → Markdown (5 files)
152
+ [ ] .json → JSON (52 files)
153
+ ```
154
+
155
+ #### 3. Output
156
+
157
+ ```
158
+ SUCCESS: LLM-optimised Markdown generated successfully!
159
+
160
+ Output: ~/Desktop/CodeSummaries/MYPROJECT_llm.md
161
+ Extensions: .js, .ts, .md
162
+ Total files: 75
163
+ File size: 1.1 MB
164
+ Ready to paste into any LLM chat interface
165
+ ```
166
+
167
+ ### Command reference
168
+
169
+ | Command | Description |
170
+ | ------- | ----------- |
171
+ | `codesummary` | Scan and generate PDF (default) |
172
+ | `codesummary --format pdf` | Generate PDF documentation |
173
+ | `codesummary --format rag` | Generate RAG-optimised JSON |
174
+ | `codesummary --format llm` | Generate LLM-optimised Markdown |
175
+ | `codesummary --format both` | Generate PDF + RAG JSON |
176
+ | `codesummary config` | Edit configuration interactively |
177
+ | `codesummary --show-config` | Display current configuration |
178
+ | `codesummary --reset-config` | Reset configuration to defaults |
179
+
180
+ ### Options
181
+
182
+ | Option | Short | Description |
183
+ | ------ | ----- | ----------- |
184
+ | `--format <format>` | `-f` | Output format: `pdf` (default), `rag`, `llm`, or `both` |
185
+ | `--output <path>` | `-o` | Override output directory for this run |
186
+ | `--no-interactive` | | Skip all prompts; auto-select all extensions |
187
+ | `--show-config` | | Display current configuration |
188
+ | `--reset-config` | | Reset configuration to defaults |
189
+ | `--help` | `-h` | Show help |
190
+ | `--version` | `-v` | Show version |
191
+
192
+ ---
193
+
194
+ ## ⚙️ Configuration
195
+
196
+ Configuration is stored globally at:
197
+
198
+ - **Linux/macOS**: `~/.codesummary/config.json`
199
+ - **Windows**: `%APPDATA%\CodeSummary\config.json`
200
+
201
+ Existing configuration is never overwritten on upgrade — new defaults are merged in automatically.
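The README does not spell out the merge rules, so the following is only a sketch of one plausible policy: keys missing from the user's file adopt the default, arrays are unioned with the user's entries first, nested objects are merged recursively, and existing scalar values are left alone. `mergeDefaults` and `isPlainObject` are hypothetical names.

```javascript
// Hypothetical helpers; the real migration logic in CodeSummary may differ.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

function mergeDefaults(userConfig, defaults) {
  const merged = { ...userConfig };
  for (const [key, defaultValue] of Object.entries(defaults)) {
    const userValue = userConfig[key];
    if (userValue === undefined) {
      // Key is new in this release: adopt the default.
      merged[key] = defaultValue;
    } else if (Array.isArray(userValue) && Array.isArray(defaultValue)) {
      // Assumed policy: union, keeping the user's entries and order first.
      merged[key] = [...new Set([...userValue, ...defaultValue])];
    } else if (isPlainObject(userValue) && isPlainObject(defaultValue)) {
      merged[key] = mergeDefaults(userValue, defaultValue);
    }
    // Otherwise the user's existing value is kept untouched.
  }
  return merged;
}

const demoUser = { output: { mode: 'relative' }, excludeDirs: ['node_modules', 'tmp'] };
const demoDefaults = {
  output: { mode: 'fixed', fixedPath: '~/Desktop/CodeSummaries' },
  excludeDirs: ['node_modules', '.git'],
  excludeFiles: ['*.log'],
};
console.log(mergeDefaults(demoUser, demoDefaults));
```

One trade-off of the array union: an entry the user deliberately removed would reappear after an upgrade, so a real implementation may need to track which defaults have already been offered.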
202
+
203
+ ### Default configuration
204
+
205
+ ```json
206
+ {
207
+ "output": {
208
+ "mode": "fixed",
209
+ "fixedPath": "~/Desktop/CodeSummaries"
210
+ },
211
+ "allowedExtensions": [
212
+ ".js", ".jsx", ".ts", ".tsx", ".json", ".html", ".css", ".scss",
213
+ ".md", ".txt", ".py", ".java", ".cs", ".cpp", ".c", ".h",
214
+ ".xml", ".yaml", ".yml", ".sh", ".bat", ".ps1",
215
+ ".cfg", ".conf", ".env", ".local", ".service", ".timer",
216
+ ".ino", ".j2", ".csv", ".tsv", ".crt", ".sql",
217
+ ".toml", ".ini", ".properties", ".tf", ".tfvars", ".proto", ".prisma",
218
+ ".dart", ".lua", ".r", ".ex", ".exs", ".pl", ".mk", ".cmake",
219
+ ".mdx", ".astro", ".graphql", ".gql"
220
+ ],
221
+ "excludeDirs": [
222
+ "node_modules", ".git", ".vscode", "dist", "build", "coverage",
223
+ "out", "__pycache__", ".next", ".nuxt",
224
+ ".idea", "target", ".gradle", "venv", ".venv",
225
+ ".pytest_cache", ".mypy_cache", ".tox", ".terraform", ".turbo",
226
+ ".angular", ".svelte-kit", ".yarn", ".pnpm-store",
227
+ ".expo", ".dart_tool", "storybook-static", "htmlcov"
228
+ ],
229
+ "excludeFiles": [
230
+ "*-lock.json", "*.lock", "*.min.js", "*.min.css", "*.map",
231
+ ".DS_Store", "Thumbs.db", "desktop.ini", "ehthumbs.db",
232
+ "*.pyc", "*.pyo", "*.class", "*.log", "*.tmp", "*.temp",
233
+ "*.swp", "*.bak", "*.orig"
234
+ ],
235
+ "settings": {
236
+ "documentTitle": "Project Code Summary",
237
+ "maxFilesBeforePrompt": 500
238
+ }
239
+ }
240
+ ```
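The `excludeFiles` entries read like simple glob patterns. The sketch below shows one way such patterns could be matched, assuming `*` stands for any run of characters and that patterns are tested against the file's base name only; both points are assumptions, and `globToRegExp` and `isExcludedFile` are hypothetical helpers.

```javascript
// Hypothetical matcher; CodeSummary's real matching rules may differ.
function globToRegExp(pattern) {
  // Escape every regex metacharacter except `*`, then expand `*` to `.*`.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`);
}

function isExcludedFile(baseName, patterns) {
  return patterns.some(pattern => globToRegExp(pattern).test(baseName));
}

const demoPatterns = ['*-lock.json', '*.lock', '*.min.js', '.DS_Store'];
for (const name of ['package-lock.json', 'yarn.lock', 'app.min.js', 'app.js']) {
  console.log(name, isExcludedFile(name, demoPatterns));
}
```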
241
+
242
+ ---
243
+
244
+ ## 📋 PDF structure
245
+
246
+ Generated PDFs are A4 with three sections:
247
+
248
+ 1. **Project overview** — title, project name, generation timestamp, included file types
249
+ 2. **File structure** — complete sorted file listing
250
+ 3. **File content** — full source of every selected file, monospace font, no truncation
251
+
252
+ ---
253
+
254
+ ## 🤖 RAG JSON structure
255
+
256
+ ```json
257
+ {
258
+ "metadata": {
259
+ "projectName": "MyProject",
260
+ "generatedAt": "2025-07-31T08:00:00.000Z",
261
+ "version": "3.1.0"
262
+ },
263
+ "files": [
264
+ {
265
+ "id": "abc123def456",
266
+ "path": "src/component.js",
267
+ "language": "JavaScript",
268
+ "hash": "sha256-...",
269
+ "chunks": [
270
+ {
271
+ "id": "chunk_abc123def456_0",
272
+ "content": "function myFunction() { ... }",
273
+ "tokenEstimate": 45,
274
+ "lineStart": 1,
275
+ "lineEnd": 15,
276
+ "chunkingMethod": "semantic-function",
277
+ "context": "function_myFunction",
278
+ "imports": ["lodash", "react"],
279
+ "calls": ["useState", "useEffect"]
280
+ }
281
+ ]
282
+ }
283
+ ],
284
+ "index": {
285
+ "chunkOffsets": {
286
+ "chunk_abc123def456_0": {
287
+ "contentStart": 12123,
288
+ "contentEnd": 12356,
289
+ "filePath": "src/component.js"
290
+ }
291
+ },
292
+ "statistics": { "processingTimeMs": 245, "chunksWithValidOffsets": 387 }
293
+ }
294
+ }
295
+ ```
296
+
297
+ ### RAG integration example
298
+
299
+ ```javascript
300
+ import fs from 'node:fs';
+ const ragData = JSON.parse(fs.readFileSync('project_rag.json', 'utf8'));
301
+
302
+ // Extract all chunks for embedding
303
+ const chunks = ragData.files.flatMap(file =>
304
+ file.chunks.map(chunk => ({
305
+ id: chunk.id,
306
+ content: chunk.content,
307
+ metadata: { filePath: file.path, language: file.language }
308
+ }))
309
+ );
310
+
311
+ // Store in your vector database (`embed` and `vectorDB` are placeholders for your own client)
312
+ for (const chunk of chunks) {
313
+ const embedding = await embed(chunk.content);
314
+ await vectorDB.upsert(chunk.id, embedding, chunk.metadata);
315
+ }
316
+ ```
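The `index.chunkOffsets` entries can also be used to pull a single chunk out of the file without parsing all of it. A minimal sketch, assuming `contentStart` and `contentEnd` are byte offsets into the generated JSON file with an exclusive end (the README does not confirm the end convention); `readChunkBytes` is a hypothetical helper, and the demo buffer stands in for the result of `fs.readFileSync` on the real output.

```javascript
// Hypothetical helper: slice one chunk's bytes out of an in-memory copy of the RAG file.
// Assumes contentStart is inclusive and contentEnd is exclusive.
function readChunkBytes(fileBuffer, offsets) {
  const { contentStart, contentEnd } = offsets;
  const valid =
    Number.isInteger(contentStart) &&
    Number.isInteger(contentEnd) &&
    contentStart >= 0 &&
    contentEnd >= contentStart &&
    contentEnd <= fileBuffer.length;
  if (!valid) {
    throw new RangeError(`Invalid offsets: ${contentStart}..${contentEnd}`);
  }
  // Offsets are in bytes, so slice the Buffer before decoding to a string.
  return fileBuffer.subarray(contentStart, contentEnd).toString('utf8');
}

// Stand-in for the real file: a tiny buffer with offsets computed on the spot.
const demoBuffer = Buffer.from('{"content":"héllo"}', 'utf8');
const demoStart = demoBuffer.indexOf('héllo');
const demoOffsets = {
  contentStart: demoStart,
  contentEnd: demoStart + Buffer.byteLength('héllo'),
};
console.log(readChunkBytes(demoBuffer, demoOffsets));
```

Because the slice is taken from the serialised JSON, the returned text may still contain JSON escape sequences such as `\n` or `\"`, which would need decoding before use.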
317
+
318
+ ---
319
+
320
+ ## 💬 LLM Markdown structure
321
+
322
+ ````markdown
323
+ # MyProject — Code Summary
324
+
325
+ **Generated:** 2026-04-05 | **Files:** 42 | **Total size:** 1.2 MB
326
+
327
+ ---
328
+
329
+ ## File Tree
330
+
331
+ ```
332
+ src/cli.js
333
+ src/scanner.js
334
+ ...
335
+ ```
336
+
337
+ ---
338
+
339
+ ## src/cli.js
340
+
341
+ ```js
342
+ import chalk from 'chalk';
343
+ ...
344
+ ```
345
+ ````
346
+
347
+ Paste the `.md` file directly into any LLM chat interface. No further processing needed.
348
+
349
+ ---
350
+
351
+ ## 🔧 Advanced features
352
+
353
+ ### Versioned output filenames
354
+
355
+ When the target file already exists, CodeSummary creates a versioned copy instead of overwriting:
356
+
357
+ ```
358
+ MYPROJECT_llm.md ← exists
359
+ MYPROJECT_llm-v1.md ← created
360
+ MYPROJECT_llm-v1.md ← exists on next run
361
+ MYPROJECT_llm-v2.md ← created
362
+ ```
363
+
364
+ This applies to all three output formats (PDF, RAG JSON, LLM Markdown).
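
The naming rule can be sketched as follows. This is not CodeSummary's actual implementation: `nextOutputPath` is a hypothetical function, and it assumes the tool picks the first unused `-vN` suffix, which is consistent with the example above but not confirmed by it. The `exists` check is passed in, so it can be `fs.existsSync` in real use or a stub in a test.

```javascript
// Hypothetical sketch of the versioning rule described above.
// `exists` is any function that reports whether a path is already taken.
function nextOutputPath(targetPath, exists) {
  if (!exists(targetPath)) return targetPath;

  // Split "name.ext" at the last dot of the filename part, ignoring dots in
  // directory names and a leading dot on the filename itself.
  const slash = Math.max(targetPath.lastIndexOf('/'), targetPath.lastIndexOf('\\'));
  const dot = targetPath.lastIndexOf('.');
  const cut = dot > slash + 1 ? dot : targetPath.length;
  const base = targetPath.slice(0, cut);
  const ext = targetPath.slice(cut);

  // Try -v1, -v2, ... until an unused name is found.
  let version = 1;
  while (exists(`${base}-v${version}${ext}`)) version += 1;
  return `${base}-v${version}${ext}`;
}

// Stub file system matching the second run in the example above.
const existing = new Set(['MYPROJECT_llm.md', 'MYPROJECT_llm-v1.md']);
console.log(nextOutputPath('MYPROJECT_llm.md', p => existing.has(p))); // MYPROJECT_llm-v2.md
```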
365
+
366
+ ### Non-interactive mode
367
+
368
+ Skip all prompts and auto-select all detected extensions:
369
+
370
+ ```bash
371
+ codesummary --format llm --no-interactive
372
+ ```
373
+
374
+ Useful for CI pipelines or scripted documentation generation.
375
+
376
+ ---
377
+
378
+ ## 🎨 Supported file types
379
+
380
+ | Extension | Type | Extension | Type |
381
+ | --------- | ---- | --------- | ---- |
382
+ | `.js` `.jsx` | JavaScript | `.ts` `.tsx` | TypeScript |
383
+ | `.py` | Python | `.java` | Java |
384
+ | `.cs` | C# | `.cpp` `.c` `.h` | C/C++ |
385
+ | `.go` | Go | `.rs` | Rust |
386
+ | `.swift` | Swift | `.kt` | Kotlin |
387
+ | `.rb` | Ruby | `.php` | PHP |
388
+ | `.dart` | Dart | `.lua` | Lua |
389
+ | `.r` | R | `.ex` `.exs` | Elixir |
390
+ | `.pl` | Perl | `.scala` | Scala |
391
+ | `.html` | HTML | `.css` `.scss` | CSS |
392
+ | `.vue` | Vue.js | `.svelte` | Svelte |
393
+ | `.astro` | Astro | `.mdx` | MDX |
394
+ | `.json` | JSON | `.yaml` `.yml` | YAML |
395
+ | `.toml` | TOML | `.xml` | XML |
396
+ | `.ini` | INI | `.properties` | Java Properties |
397
+ | `.tf` `.tfvars` | Terraform | `.proto` | Protobuf |
398
+ | `.prisma` | Prisma | `.graphql` `.gql` | GraphQL |
399
+ | `.sql` | SQL | `.md` `.txt` | Docs |
400
+ | `.sh` `.bash` | Shell | `.bat` | Batch |
401
+ | `.ps1` | PowerShell | `.mk` `.cmake` | Build |
402
+ | `.cfg` `.conf` | Config | `.env` `.local` | Environment |
403
+ | `.service` `.timer` | Systemd | `.ino` | Arduino |
404
+ | `.j2` | Jinja2 | `.csv` `.tsv` | Data |
405
+ | `.crt` | Certificate | `.dockerfile` | Docker |
406
+
407
+ ---
408
+
409
+ ## 🛠️ Project structure
410
+
411
+ ```
412
+ codesummary/
413
+ ├── bin/
414
+ │ └── codesummary.js # Entry point
415
+ ├── src/
416
+ │ ├── cli.js # Argument parsing, orchestration
417
+ │ ├── scanner.js # Recursive directory scanning
418
+ │ ├── pdfGenerator.js # PDF generation (PDFKit)
419
+ │ ├── ragGenerator.js # RAG JSON generation with semantic chunking
420
+ │ ├── llmGenerator.js # LLM Markdown generation with optimisations
421
+ │ ├── configManager.js # Global config storage and migration
422
+ │ ├── ragConfig.js # RAG-specific configuration and YAML loading
423
+ │ ├── errorHandler.js # Centralised error handling and path validation
424
+ │ └── utils.js # Shared utilities (formatFileSize, etc.)
425
+ ├── rag-schema.json
426
+ ├── raggen.config.yaml
427
+ └── package.json
428
+ ```
429
+
430
+ ---
431
+
432
+ ## 🔍 Troubleshooting
433
+
434
+ **No files found after scan**
435
+ - Check `allowedExtensions` in your config (`codesummary --show-config`)
436
+ - Verify the directory is not listed in `excludeDirs`
437
+
438
+ **Output file not generated**
439
+ - Check write permissions on the output directory
440
+ - Try `--output ./` to write to the current directory
441
+
442
+ **Non-ASCII characters in paths cause issues**
443
+ - Update to v1.2.0 or later, which fixes Windows path handling for accented characters
444
+
445
+ **CI pipeline hangs**
446
+ - Add `--no-interactive` to skip all prompts
447
+
448
+ ---
449
+
450
+ ## 🤝 Contributing
451
+
452
+ 1. Fork the repository
453
+ 2. Clone: `git clone https://github.com/skamoll/CodeSummary.git`
454
+ 3. Install: `npm install`
455
+ 4. Test: `node bin/codesummary.js --help`
456
+ 5. Submit a pull request
457
+
458
+ ---
459
+
460
+ ## 📄 License
461
+
462
+ GNU General Public License v3.0 — see [LICENSE](LICENSE) for details.
463
+
464
+ ---
465
+
466
+ ## 📊 Roadmap
467
+
468
+ - [ ] Syntax highlighting in PDF output
469
+ - [ ] Clickable table of contents in PDF
470
+ - [x] LLM-optimised Markdown output (`--format llm`)
471
+ - [x] Versioned output filenames (`-v1`, `-v2`)
472
+ - [x] Non-interactive mode (`--no-interactive`)
473
+ - [x] RAG JSON with semantic chunking
474
+ - [ ] `--format all` (PDF + RAG + LLM in one pass)
475
+ - [ ] Git integration (document only changed files)
476
+ - [ ] CI/CD plugin for automated documentation
477
+
478
+ ---
479
+
480
+ ## 📞 Support
481
+
482
+ - Report bugs: [GitHub Issues](https://github.com/skamoll/CodeSummary/issues)
483
+ - Questions: [GitHub Discussions](https://github.com/skamoll/CodeSummary/discussions)