codesummary 1.1.0 → 1.2.0

package/CHANGELOG.md CHANGED
@@ -1,166 +1,235 @@
1
- # Changelog
2
-
3
- All notable changes to this project will be documented in this file.
4
-
5
- The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
- and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
-
8
- ## [1.1.0] - 2025-07-31
9
-
10
- ### 🎉 Major Features Added
11
-
12
- #### 🔧 **Complete RAG System Refactoring**
13
- - **Atomic JSON Generation**: Eliminated streaming-based approach that caused JSON corruption
14
- - **100% Thread-Safe Processing**: All files processed in memory before writing
15
- - **Robust Error Handling**: No more duplicate keys or malformed JSON output
16
- - **Performance Boost**: ~107 more chunks generated with improved stability
17
-
18
- #### 📊 **Precision Offset Index System**
19
- - **Complete fileOffsets**: Format `fileId -> [start, end]` for rapid file seeking
20
- - **Detailed chunkOffsets**: Individual chunk positions with `jsonStart`, `jsonEnd`, `contentStart`, `contentEnd`
21
- - **99.8% Precision**: 509/510 chunks with valid byte-accurate offsets
22
- - **RAG-Optimized**: Enables high-performance vector database operations
23
-
24
- #### 🧠 **Enhanced Token Estimation Engine**
25
- - **Multi-Heuristic Algorithm**: Replaces simple `ceil(length/4)` with sophisticated analysis
26
- - **Language-Aware Processing**: Specialized calculations for JavaScript, Python, Java, C++, etc.
27
- - **Syntax Analysis**: Accounts for brackets, operators, and language-specific tokens
28
- - **20% More Accurate**: Example: 100 chars JavaScript goes from 25 → 30 tokens
29
-
30
- #### 📈 **Complete Processing Statistics**
31
- - **Real-Time Metrics**: Processing time, throughput, bytes written
32
- - **Quality Assurance**: Empty files count, chunks with valid offsets
33
- - **Performance Tracking**: `bytesPerSecond`, `avgFileSize`, `avgChunksPerFile`
34
- - **Error Collection**: Detailed error tracking and reporting
35
-
36
- #### 🔄 **Future-Proof Schema System**
37
- - **Schema Versioning**: `schemaVersion: "1.0"` for migration management
38
- - **Method Tracking**: `tokenEstimationMethod: "enhanced_heuristic_v1.0"`
39
- - **Schema URL**: Links to official schema definition for validation
40
- - **Backward Compatibility**: Maintains compatibility with existing consumers
41
-
42
- ### 🛠️ **Technical Improvements**
43
-
44
- #### **Code Quality & Architecture**
45
- - Eliminated 5+ problematic streaming methods (`streamingGeneration`, `writeMainBody`, etc.)
46
- - Consolidated to single `generate()` method for clarity
47
- - Removed global state variables that caused race conditions
48
- - Enhanced function detection regex for better semantic chunking
49
-
50
- #### **Performance Optimizations**
51
- - **Processing Speed**: 510 chunks generated in 56ms (vs previous inconsistent timing)
52
- - **Memory Efficiency**: 18.4 MB/s throughput with atomic processing
53
- - **Output Size**: Optimized JSON structure - 1.03 MB for comprehensive indexing
54
- - **Validation**: Built-in JSON structure validation with detailed reporting
55
-
56
- #### **Enhanced ScriptHandler**
57
- - Improved regex patterns for TypeScript interfaces, enums, class methods
58
- - Better support for `const enum`, `implements`, access modifiers
59
- - Enhanced arrow function detection with `let`, `var` support
60
- - More precise function boundary detection with brace matching
61
-
62
- ### 🐛 **Bugs Fixed**
63
-
64
- #### **Critical JSON Corruption Issues**
65
- - ❌ **Fixed**: Duplicate `index` sections in output JSON
66
- ❌ **Fixed**: Negative `processingTimeMs` values
67
- - **Fixed**: Inconsistent chunk counts between sections
68
- ❌ **Fixed**: Missing or incorrect byte offsets
69
- - ❌ **Fixed**: Malformed JSON due to concurrent writes
70
- - **Fixed**: Stream truncation issues with large files
71
-
72
- #### **Data Integrity Issues**
73
- - **Fixed**: Inconsistent statistics across different JSON sections
74
- - ❌ **Fixed**: Incorrect `totalBytes` calculations
75
- - ❌ **Fixed**: Missing `chunkOffsets` for seek operations
76
- - ❌ **Fixed**: Race conditions in multi-file processing
77
-
78
- ### 📊 **Performance Metrics (Before vs After)**
79
-
80
- | Metric | v1.0.2 | v1.1.0 | Improvement |
81
- |--------|--------|--------|-------------|
82
- | JSON Validity | ❌ Corrupted | ✅ 100% Valid | +100% |
83
- | Chunk Generation | ~400 chunks | 510 chunks | +27% |
84
- | Processing Time | Inconsistent | 56ms stable | Consistent |
85
- | Offset Precision | ~60% valid | 99.8% valid | +66% |
86
- | Memory Safety | Race conditions | Thread-safe | Stable |
87
- | Output Size | Bloated/corrupt | 1.03 MB optimized | Efficient |
88
-
89
- ### 🔍 **API Changes**
90
-
91
- #### **New JSON Structure Fields**
92
- ```json
93
- {
94
- "metadata": {
95
- "schemaVersion": "1.0",
96
- "schemaUrl": "https://github.com/skamoll/CodeSummary/schemas/rag-output.json",
97
- "config": {
98
- "tokenEstimationMethod": "enhanced_heuristic_v1.0"
99
- }
100
- },
101
- "index": {
102
- "chunkOffsets": {
103
- "chunk_id": {
104
- "jsonStart": 1234,
105
- "jsonEnd": 5678,
106
- "contentStart": 2000,
107
- "contentEnd": 4000,
108
- "filePath": "src/file.js"
109
- }
110
- },
111
- "fileOffsets": {
112
- "file_id": [startByte, endByte]
113
- },
114
- "statistics": {
115
- "processingTimeMs": 56,
116
- "bytesPerSecond": 18404786,
117
- "chunksWithValidOffsets": 509,
118
- "emptyFiles": 0
119
- }
120
- }
121
- }
122
- ```
123
-
124
- ### 🎯 **Use Cases Enabled**
125
-
126
- #### **RAG/Vector Database Applications**
127
- - **Rapid Content Retrieval**: Use `chunkOffsets` for instant chunk access
128
- - **Efficient File Processing**: `fileOffsets` enable selective file loading
129
- - **Quality Metrics**: Statistics help optimize chunk size and processing
130
-
131
- #### **Code Analysis Tools**
132
- - **Semantic Navigation**: Enhanced function detection for better code understanding
133
- - **Token Budget Planning**: Accurate token estimation for LLM interactions
134
- - **Processing Monitoring**: Detailed metrics for pipeline optimization
135
-
136
- ### 🔗 **Migration Guide**
137
-
138
- #### **From v1.0.x to v1.1.0**
139
- 1. **JSON Structure**: New `index` section with detailed offsets - update parsers
140
- 2. **Token Estimates**: Values may be ~20% higher due to improved accuracy
141
- 3. **Statistics**: New fields available in `index.statistics`
142
- 4. **Schema**: Check `metadata.schemaVersion` for compatibility
143
-
144
- #### **Backward Compatibility**
145
- ✅ All existing `metadata` and `files` sections unchanged
146
- - ✅ Chunk structure remains the same
147
- ✅ CLI interface identical
148
- - ⚠️ New `index` section - consumers should handle gracefully
149
-
150
- ---
151
-
152
- ## [1.0.2] - 2025-07-29
153
- ### Fixed
154
- - Bug fixes and stability improvements
155
- - Enhanced cross-platform compatibility
156
-
157
- ## [1.0.1] - 2025-07-28
158
- ### Added
159
- - Initial RAG functionality
160
- - Basic PDF generation
161
-
162
- ## [1.0.0] - 2025-07-27
163
- ### Added
164
- - Initial release
165
- - Core PDF generation functionality
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [1.2.0] - 2026-04-05
9
+
10
+ ### New Features
11
+
12
+ #### LLM Markdown output (`--format llm`)
13
+ - New output format that generates a single Markdown file optimised for direct use with any chat-based LLM
14
+ - Includes project header, file tree, and full file contents in fenced code blocks with language hints
15
+ - Lossless optimisations applied automatically: line ending normalisation, trailing whitespace removal, blank line collapsing, JSON compaction
16
+ - File naming follows the same versioning scheme as other formats
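The optimisations listed for `--format llm` can be pictured with a small sketch. The function names `normalizeText` and `compactJson` are hypothetical, and collapsing to at most one blank line is an assumption; the changelog does not say how the package implements these steps.

```javascript
// Hypothetical helpers illustrating the optimisations named in the changelog.
// Not the package's implementation.

// Normalise line endings, strip trailing whitespace, collapse blank-line runs.
function normalizeText(text) {
  return text
    .replace(/\r\n?/g, '\n')      // CRLF and lone CR become LF
    .replace(/[ \t]+$/gm, '')     // drop trailing spaces and tabs on each line
    .replace(/\n{3,}/g, '\n\n');  // assumption: keep at most one blank line
}

// Re-serialise JSON without insignificant whitespace; leave invalid JSON as-is.
function compactJson(text) {
  try {
    return JSON.stringify(JSON.parse(text));
  } catch {
    return text;
  }
}
```

A parse-and-stringify round trip can change number formatting (for example `1.0` becomes `1`), so whether the real compaction is strictly lossless depends on details the changelog does not give.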
17
+
18
+ #### Versioned output filenames
19
+ - All output formats (PDF, RAG JSON, LLM Markdown) now use `-v1`, `-v2` suffixes when the target file already exists
20
+ - Replaces the previous timestamp-based fallback for PDF; now consistent across all formats
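A minimal sketch of the `-v1`, `-v2` scheme described above. The changelog names a `resolveVersionedPath` utility but not its signature, so `resolveVersionedPathSketch`, its injectable `exists` parameter, and placing the suffix before the extension are all assumptions.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical stand-in for the package's resolveVersionedPath.
// Returns targetPath if it is free, otherwise the first free "-vN" variant.
function resolveVersionedPathSketch(targetPath, exists = fs.existsSync) {
  if (!exists(targetPath)) return targetPath;
  const { dir, name, ext } = path.parse(targetPath);
  for (let n = 1; ; n += 1) {
    // Assumption: the suffix goes before the extension, e.g. report-v1.pdf
    const candidate = path.join(dir, `${name}-v${n}${ext}`);
    if (!exists(candidate)) return candidate;
  }
}
```

Passing the existence check as a parameter keeps the sketch easy to exercise without touching the file system.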
21
+
22
+ #### Non-interactive mode fixed
23
+ - `--no-interactive` flag now correctly skips the extension selection prompt and the large-project confirmation
24
+ - Also activates automatically when stdin is not a TTY (CI/CD environments)
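The two triggers described above (the explicit flag, or stdin not being a TTY) could be combined as in the following sketch. `isNonInteractive` is a hypothetical name; only `--no-interactive` and Node's `process.stdin.isTTY` come from the changelog and the Node.js API.

```javascript
// Hypothetical helper: true when prompts should be skipped.
// process.stdin.isTTY is true for a terminal and undefined otherwise
// (for example when input is piped in a CI job).
function isNonInteractive(argv = process.argv.slice(2), stdin = process.stdin) {
  return argv.includes('--no-interactive') || !stdin.isTTY;
}
```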
25
+
26
+ ### 🔧 Improvements
27
+
28
+ #### Architecture
29
+ - Extracted shared utilities (`formatFileSize`, `getExtensionDescription`, `matchesGlobPattern`, `resolveVersionedPath`) into `src/utils.js` — no more duplicated code across modules
30
+ - `ragConfig.js` now exports the class instead of a singleton instance, eliminating shared state between runs
31
+ - `ragGenerator.js` imports switched to static; `RagConfigManager` instantiated locally
32
+ - `RagGenerator` constructor no longer accepts an unused `config` parameter
33
+ - `cli.js` uses `createRequire` for version reading — fixes fragile Windows path hack
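For the `createRequire` item, a sketch of reading the version through Node's `module.createRequire`. The helper name `readPackageVersion` and the location of `package.json` relative to the calling module are assumptions; the changelog only states that `createRequire` is used.

```javascript
const { createRequire } = require('node:module');

// Hypothetical helper. `moduleUrl` is the file URL (or absolute path) of the
// calling module; inside an ES module this would be `import.meta.url`.
// `relativeManifest` is an assumed location for package.json.
function readPackageVersion(moduleUrl, relativeManifest = './package.json') {
  const localRequire = createRequire(moduleUrl);
  return localRequire(relativeManifest).version;
}
```

Because resolution is anchored to the module's own URL, no manual path string handling is needed on any platform.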
34
+
35
+ #### Bug fixes
36
+ - `validatePath`: removed false positive that blocked valid absolute Windows paths (e.g. `C:\Users\Name\...`) when passed as `--output`
37
+ - `sanitizeInput` with `allowPath: true`: now preserves non-ASCII characters in paths (e.g. accented letters in user profile directories on Windows)
38
+ - Two additional `sanitizeInput` call sites in `cli.js` updated to pass `allowPath: true`
39
+ - `migrateConfig` side-channel (`_pendingNotification`) replaced with explicit return value
40
+
41
+ #### Extended defaults
42
+ - 19 new `allowedExtensions`: `.toml`, `.ini`, `.properties`, `.tf`, `.tfvars`, `.proto`, `.prisma`, `.dart`, `.lua`, `.r`, `.ex`, `.exs`, `.pl`, `.mk`, `.cmake`, `.mdx`, `.astro`, `.graphql`, `.gql`
43
+ - 13 previously user-only extensions promoted to defaults: `.ps1`, `.cfg`, `.conf`, `.env`, `.local`, `.service`, `.timer`, `.ino`, `.j2`, `.csv`, `.tsv`, `.crt`, `.sql`
44
+ - 18 new `excludeDirs`: `.idea`, `target`, `.gradle`, `venv`, `.venv`, `.pytest_cache`, `.mypy_cache`, `.tox`, `.terraform`, `.turbo`, `.angular`, `.svelte-kit`, `.yarn`, `.pnpm-store`, `.expo`, `.dart_tool`, `storybook-static`, `htmlcov`
45
+ - 11 new `excludeFiles`: `*.pyc`, `*.pyo`, `*.class`, `*.log`, `*.tmp`, `*.temp`, `*.swp`, `*.bak`, `*.orig`, `desktop.ini`, `ehthumbs.db`
46
+
47
+ ### 📋 Migration Notes
48
+ - No breaking changes to existing configuration or CLI flags
49
+ - Existing config files are migrated automatically on first run — new extensions, dirs, and file patterns are appended; customisations are preserved
50
+ - `--format pdf` (explicit) and bare `codesummary` behaviour unchanged
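The migration behaviour described above (new defaults appended, customisations preserved, result returned explicitly) might look roughly like this. `migrateConfigSketch`, the flat config shape, and the `added` field are assumptions made for illustration; the real `migrateConfig` is not shown in the changelog.

```javascript
// Append defaults that are not already present, keeping user order intact.
function appendMissing(existing, defaults) {
  const seen = new Set(existing);
  return [...existing, ...defaults.filter((item) => !seen.has(item))];
}

// Hypothetical migration: returns the new config and what was added,
// instead of stashing a notification on a side channel.
function migrateConfigSketch(config, defaults) {
  const migrated = { ...config };
  const added = [];
  // Assumption: these three lists live at the top level of the config object.
  for (const key of ['allowedExtensions', 'excludeDirs', 'excludeFiles']) {
    const before = config[key] ?? [];
    migrated[key] = appendMissing(before, defaults[key] ?? []);
    added.push(...migrated[key].slice(before.length));
  }
  return { config: migrated, added };
}
```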
51
+
52
+ ## [1.1.1] - 2025-07-31
53
+
54
+ ### 🔧 **Fixes & Improvements**
55
+
56
+ #### **CLI Enhancements**
57
+ - **Added Version Flag**: New `--version` and `-v` flags to display current version
58
+ - **Cross-Platform Compatibility**: Fixed Windows path resolution for version detection
59
+ - **Help Documentation**: Updated help text to include version option
60
+
61
+ #### **Dependency Cleanup**
62
+ - **Removed Deprecated Crypto**: Eliminated `crypto@1.0.1` dependency (now uses built-in Node.js crypto)
63
+ - **Security Improvement**: No more npm warnings about deprecated packages
64
+ - **Cleaner Dependencies**: Reduced package footprint
65
+
66
+ #### **Bug Fixes**
67
+ - **Merge Conflicts**: Resolved conflicts between main and develop branches
68
+ - **CLI Argument Parsing**: Fixed unknown option error for `--version` flag
69
+
70
+ ### 📋 **Migration Notes**
71
+ - No breaking changes
72
+ - Existing installations will benefit from cleaner dependencies
73
+ - New `--version` flag available immediately after update
74
+
75
+ ---
76
+
77
+ ## [1.1.0] - 2025-07-31
78
+
79
+ ### 🎉 Major Features Added
80
+
81
+ #### 🔧 **Complete RAG System Refactoring**
82
+ - **Atomic JSON Generation**: Eliminated streaming-based approach that caused JSON corruption
83
+ - **100% Thread-Safe Processing**: All files processed in memory before writing
84
+ - **Robust Error Handling**: No more duplicate keys or malformed JSON output
85
+ - **Performance Boost**: ~107 more chunks generated with improved stability
86
+
87
+ #### 📊 **Precision Offset Index System**
88
+ - **Complete fileOffsets**: Format `fileId -> [start, end]` for rapid file seeking
89
+ - **Detailed chunkOffsets**: Individual chunk positions with `jsonStart`, `jsonEnd`, `contentStart`, `contentEnd`
90
+ - **99.8% Precision**: 509/510 chunks with valid byte-accurate offsets
91
+ - **RAG-Optimized**: Enables high-performance vector database operations
92
+
93
+ #### 🧠 **Enhanced Token Estimation Engine**
94
+ - **Multi-Heuristic Algorithm**: Replaces simple `ceil(length/4)` with sophisticated analysis
95
+ - **Language-Aware Processing**: Specialized calculations for JavaScript, Python, Java, C++, etc.
96
+ - **Syntax Analysis**: Accounts for brackets, operators, and language-specific tokens
97
+ - **20% More Accurate**: Example: 100 chars JavaScript goes from 25 → 30 tokens
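The 25 → 30 example can be reproduced with a toy calculation. `baselineTokens` is the `ceil(length/4)` rule quoted above; `weightedTokens` and its single factor of 1.2 are assumptions standing in for the multi-heuristic engine, whose actual rules are not given in the changelog.

```javascript
// The previous estimate quoted in the changelog: ceil(length / 4).
function baselineTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical stand-in for the enhanced estimate: one language factor.
// The real engine reportedly weighs brackets, operators, and language tokens.
function weightedTokens(text, factor = 1.2) {
  return Math.ceil(baselineTokens(text) * factor);
}
```

For a 100-character input the baseline gives 25 and the weighted variant gives 30, matching the figures in the bullet above.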
98
+
99
+ #### 📈 **Complete Processing Statistics**
100
+ - **Real-Time Metrics**: Processing time, throughput, bytes written
101
+ - **Quality Assurance**: Empty files count, chunks with valid offsets
102
+ - **Performance Tracking**: `bytesPerSecond`, `avgFileSize`, `avgChunksPerFile`
103
+ - **Error Collection**: Detailed error tracking and reporting
104
+
105
+ #### 🔄 **Future-Proof Schema System**
106
+ - **Schema Versioning**: `schemaVersion: "1.0"` for migration management
107
+ - **Method Tracking**: `tokenEstimationMethod: "enhanced_heuristic_v1.0"`
108
+ - **Schema URL**: Links to official schema definition for validation
109
+ - **Backward Compatibility**: Maintains compatibility with existing consumers
110
+
111
+ ### 🛠️ **Technical Improvements**
112
+
113
+ #### **Code Quality & Architecture**
114
+ - Eliminated 5+ problematic streaming methods (`streamingGeneration`, `writeMainBody`, etc.)
115
+ - Consolidated to single `generate()` method for clarity
116
+ - Removed global state variables that caused race conditions
117
+ - Enhanced function detection regex for better semantic chunking
118
+
119
+ #### **Performance Optimizations**
120
+ - **Processing Speed**: 510 chunks generated in 56ms (vs previous inconsistent timing)
121
+ - **Memory Efficiency**: 18.4 MB/s throughput with atomic processing
122
+ - **Output Size**: Optimized JSON structure - 1.03 MB for comprehensive indexing
123
+ - **Validation**: Built-in JSON structure validation with detailed reporting
124
+
125
+ #### **Enhanced ScriptHandler**
126
+ - Improved regex patterns for TypeScript interfaces, enums, class methods
127
+ - Better support for `const enum`, `implements`, access modifiers
128
+ - Enhanced arrow function detection with `let`, `var` support
129
+ - More precise function boundary detection with brace matching
130
+
131
+ ### 🐛 **Bugs Fixed**
132
+
133
+ #### **Critical JSON Corruption Issues**
134
+ ❌ **Fixed**: Duplicate `index` sections in output JSON
135
+ - ❌ **Fixed**: Negative `processingTimeMs` values
136
+ - **Fixed**: Inconsistent chunk counts between sections
137
+ - ❌ **Fixed**: Missing or incorrect byte offsets
138
+ ❌ **Fixed**: Malformed JSON due to concurrent writes
139
+ - **Fixed**: Stream truncation issues with large files
140
+
141
+ #### **Data Integrity Issues**
142
+ - **Fixed**: Inconsistent statistics across different JSON sections
143
+ - ❌ **Fixed**: Incorrect `totalBytes` calculations
144
+ ❌ **Fixed**: Missing `chunkOffsets` for seek operations
145
+ ❌ **Fixed**: Race conditions in multi-file processing
146
+
147
+ ### 📊 **Performance Metrics (Before vs After)**
148
+
149
+ | Metric | v1.0.2 | v1.1.0 | Improvement |
150
+ |--------|--------|--------|-------------|
151
+ | JSON Validity | ❌ Corrupted | ✅ 100% Valid | +100% |
152
+ | Chunk Generation | ~400 chunks | 510 chunks | +27% |
153
+ | Processing Time | Inconsistent | 56ms stable | Consistent |
154
+ | Offset Precision | ~60% valid | 99.8% valid | +66% |
155
+ | Memory Safety | Race conditions | Thread-safe | Stable |
156
+ | Output Size | Bloated/corrupt | 1.03 MB optimized | Efficient |
157
+
158
+ ### 🔍 **API Changes**
159
+
160
+ #### **New JSON Structure Fields**
161
+ ```json
162
+ {
163
+ "metadata": {
164
+ "schemaVersion": "1.0",
165
+ "schemaUrl": "https://github.com/skamoll/CodeSummary/schemas/rag-output.json",
166
+ "config": {
167
+ "tokenEstimationMethod": "enhanced_heuristic_v1.0"
168
+ }
169
+ },
170
+ "index": {
171
+ "chunkOffsets": {
172
+ "chunk_id": {
173
+ "jsonStart": 1234,
174
+ "jsonEnd": 5678,
175
+ "contentStart": 2000,
176
+ "contentEnd": 4000,
177
+ "filePath": "src/file.js"
178
+ }
179
+ },
180
+ "fileOffsets": {
181
+ "file_id": [startByte, endByte]
182
+ },
183
+ "statistics": {
184
+ "processingTimeMs": 56,
185
+ "bytesPerSecond": 18404786,
186
+ "chunksWithValidOffsets": 509,
187
+ "emptyFiles": 0
188
+ }
189
+ }
190
+ }
191
+ ```
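A consumer could use a `chunkOffsets` entry to read one chunk without parsing the whole output file. The sketch below assumes `jsonStart` is an inclusive and `jsonEnd` an exclusive byte offset into the RAG JSON file; the changelog does not specify this, and `readChunkJson` is a hypothetical name.

```javascript
const fs = require('node:fs');

// Hypothetical reader: returns the raw text between two byte offsets.
// Assumption: jsonStart is inclusive and jsonEnd is exclusive.
function readChunkJson(ragFilePath, { jsonStart, jsonEnd }) {
  const length = jsonEnd - jsonStart;
  const buffer = Buffer.alloc(length);
  const fd = fs.openSync(ragFilePath, 'r');
  try {
    // Positional read: only the requested byte range is loaded.
    fs.readSync(fd, buffer, 0, length, jsonStart);
  } finally {
    fs.closeSync(fd);
  }
  return buffer.toString('utf8');
}
```

If the offsets turn out to be inclusive on both ends, the length would need to be `jsonEnd - jsonStart + 1`.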
192
+
193
+ ### 🎯 **Use Cases Enabled**
194
+
195
+ #### **RAG/Vector Database Applications**
196
+ - **Rapid Content Retrieval**: Use `chunkOffsets` for instant chunk access
197
+ - **Efficient File Processing**: `fileOffsets` enable selective file loading
198
+ - **Quality Metrics**: Statistics help optimize chunk size and processing
199
+
200
+ #### **Code Analysis Tools**
201
+ - **Semantic Navigation**: Enhanced function detection for better code understanding
202
+ - **Token Budget Planning**: Accurate token estimation for LLM interactions
203
+ - **Processing Monitoring**: Detailed metrics for pipeline optimization
204
+
205
+ ### 🔗 **Migration Guide**
206
+
207
+ #### **From v1.0.x to v1.1.0**
208
+ 1. **JSON Structure**: New `index` section with detailed offsets - update parsers
209
+ 2. **Token Estimates**: Values may be ~20% higher due to improved accuracy
210
+ 3. **Statistics**: New fields available in `index.statistics`
211
+ 4. **Schema**: Check `metadata.schemaVersion` for compatibility
212
+
213
+ #### **Backward Compatibility**
214
+ - ✅ All existing `metadata` and `files` sections unchanged
215
+ - ✅ Chunk structure remains the same
216
+ - ✅ CLI interface identical
217
+ - ⚠️ New `index` section - consumers should handle gracefully
218
+
219
+ ---
220
+
221
+ ## [1.0.2] - 2025-07-29
222
+ ### Fixed
223
+ - Bug fixes and stability improvements
224
+ - Enhanced cross-platform compatibility
225
+
226
+ ## [1.0.1] - 2025-07-28
227
+ ### Added
228
+ - Initial RAG functionality
229
+ - Basic PDF generation
230
+
231
+ ## [1.0.0] - 2025-07-27
232
+ ### Added
233
+ - Initial release
234
+ - Core PDF generation functionality
166
235
  - Multi-language support