claude-self-reflect 2.7.4 → 2.8.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -4,7 +4,7 @@ description: Docker Compose orchestration expert for container management, servi
  tools: Read, Edit, Bash, Grep, LS
  ---
 
- You are a Docker orchestration specialist for the memento-stack project. You manage multi-container deployments, monitor service health, and troubleshoot container issues.
+ You are a Docker orchestration specialist for the claude-self-reflect project. You manage multi-container deployments, monitor service health, and troubleshoot container issues.
 
  ## Project Context
  - Main stack: Qdrant vector database + MCP server + Python importer
@@ -4,7 +4,7 @@ description: Import pipeline debugging specialist for JSONL processing, Python s
  tools: Read, Edit, Bash, Grep, Glob, LS
  ---
 
- You are an import pipeline debugging expert for the memento-stack project. You specialize in troubleshooting JSONL file processing, Python import scripts, and conversation chunking strategies.
+ You are an import pipeline debugging expert for the claude-self-reflect project. You specialize in troubleshooting JSONL file processing, Python import scripts, and conversation chunking strategies.
 
  ## Project Context
  - Processes Claude Desktop logs from ~/.claude/projects/
@@ -4,7 +4,7 @@ description: MCP (Model Context Protocol) server development expert for Claude D
  tools: Read, Edit, Bash, Grep, Glob, WebFetch
  ---
 
- You are an MCP server development specialist for the memento-stack project. You handle Claude Desktop integration, implement MCP tools, and ensure seamless communication between Claude and the vector database.
+ You are an MCP server development specialist for the claude-self-reflect project. You handle Claude Desktop integration, implement MCP tools, and ensure seamless communication between Claude and the vector database.
 
  ## Project Context
  - MCP server: claude-self-reflection
@@ -4,7 +4,7 @@ description: Qdrant vector database expert for collection management, troublesho
  tools: Read, Bash, Grep, Glob, LS, WebFetch
  ---
 
- You are a Qdrant vector database specialist for the memento-stack project. Your expertise covers collection management, vector search optimization, and embedding strategies.
+ You are a Qdrant vector database specialist for the claude-self-reflect project. Your expertise covers collection management, vector search optimization, and embedding strategies.
 
  ## Project Context
  - The system uses Qdrant for storing conversation embeddings from Claude Desktop logs
@@ -4,7 +4,7 @@ description: Search quality optimization expert for improving semantic search ac
  tools: Read, Edit, Bash, Grep, Glob, WebFetch
  ---
 
- You are a search optimization specialist for the memento-stack project. You improve semantic search quality, tune parameters, and analyze embedding model performance.
+ You are a search optimization specialist for the claude-self-reflect project. You improve semantic search quality, tune parameters, and analyze embedding model performance.
 
  ## Project Context
  - Current baseline: 66.1% search accuracy with Voyage AI
@@ -30,8 +30,11 @@ RUN mkdir -p /root/.cache/fastembed && \
  # Set working directory
  WORKDIR /app
 
- # Copy scripts
- COPY scripts/ /scripts/
+ # Copy application scripts
+ COPY scripts/ /app/scripts/
+
+ # Make watcher-loop.sh executable
+ RUN chmod +x /app/scripts/watcher-loop.sh
 
  # Create config directory
  RUN mkdir -p /config
@@ -41,4 +44,4 @@ ENV PYTHONUNBUFFERED=1
  ENV MALLOC_ARENA_MAX=2
 
  # Run the watcher loop
- CMD ["/scripts/watcher-loop.sh"]
+ CMD ["/app/scripts/watcher-loop.sh"]
package/README.md CHANGED
@@ -24,67 +24,12 @@
 
  Give Claude perfect memory of all your conversations. Search past discussions instantly. Never lose context again.
 
- **100% Local by Default** - Your conversations never leave your machine. No cloud services required, no API keys needed, complete privacy out of the box.
+ **🔒 100% Local by Default** **⚡ Blazing Fast Search** **🚀 Zero Configuration** **🏭 Production Ready**
 
- **Blazing Fast Search** - Semantic search across thousands of conversations in milliseconds. Find that discussion about database schemas from three weeks ago in seconds.
+ ## 🚀 Quick Install
 
- **Zero Configuration** - Works immediately after installation. Smart auto-detection handles everything. No manual setup, no environment variables, just install and use.
-
- **Production Ready** - Battle-tested with 600+ conversations across 24 projects. Handles mixed embedding types automatically. Scales from personal use to team deployments.
-
- ## Table of Contents
-
- - [What You Get](#what-you-get)
- - [Requirements](#requirements)
- - [Quick Install/Uninstall](#quick-installuninstall)
- - [The Magic](#the-magic)
- - [Before & After](#before--after)
- - [Real Examples](#real-examples-that-made-us-build-this)
- - [How It Works](#how-it-works)
- - [Import Architecture](#import-architecture)
- - [Using It](#using-it)
- - [Key Features](#key-features)
- - [Performance](#performance)
- - [Configuration](#configuration)
- - [Technical Stack](#the-technical-stack)
- - [Problems](#problems)
- - [What's New](#whats-new)
- - [Advanced Topics](#advanced-topics)
- - [Contributors](#contributors)
-
- ## What You Get
-
- Ask Claude about past conversations. Get actual answers. **100% local by default** - your conversations never leave your machine. Cloud-enhanced search available when you need it.
-
- **Proven at Scale**: Successfully indexed 682 conversation files with 100% reliability. No data loss, no corruption, just seamless conversation memory that works.
-
- **Before**: "I don't have access to previous conversations"
- **After**:
- ```
- reflection-specialist(Search FastEmbed vs cloud embedding decision)
- ⎿ Done (3 tool uses · 8.2k tokens · 12.4s)
-
- "Found it! Yesterday we decided on FastEmbed for local mode - better privacy,
- no API calls, 384-dimensional embeddings. Works offline too."
- ```
-
- The reflection specialist is a specialized sub-agent that Claude automatically spawns when you ask about past conversations. It searches your conversation history in its own isolated context, keeping your main chat clean and focused.
-
- Your conversations become searchable. Your decisions stay remembered. Your context persists.
-
- ## Requirements
-
- - **Docker Desktop** (macOS/Windows) or **Docker Engine** (Linux)
- - **Node.js** 16+ (for the setup wizard)
- - **Claude Desktop** app
-
- ## Quick Install/Uninstall
-
- ### Install
-
- #### Local Mode (Default - Your Data Stays Private)
  ```bash
- # Install and run automatic setup
+ # Install and run automatic setup (5 minutes, everything automatic)
  npm install -g claude-self-reflect
  claude-self-reflect setup
 
@@ -93,11 +38,12 @@ claude-self-reflect setup
  # ✅ Configure everything automatically
  # ✅ Install the MCP in Claude Code
  # ✅ Start monitoring for new conversations
- # ✅ Verify the reflection tools work
  # 🔒 Keep all data local - no API keys needed
  ```
 
- #### Cloud Mode (Better Search Accuracy)
+ <details open>
+ <summary>📡 Cloud Mode (Better Search Accuracy)</summary>
+
  ```bash
  # Step 1: Get your free Voyage AI key
  # Sign up at https://www.voyageai.com/ - it takes 30 seconds
@@ -108,17 +54,17 @@ claude-self-reflect setup --voyage-key=YOUR_ACTUAL_KEY_HERE
  ```
  *Note: Cloud mode provides more accurate semantic search but sends conversation data to Voyage AI for processing.*
 
- 5 minutes. Everything automatic. Just works.
+ </details>
 
- ## The Magic
+ ## The Magic
 
  ![Self Reflection vs The Grind](docs/images/red-reflection.webp)
 
- ## Before & After
+ ## 📊 Before & After
 
  ![Before and After Claude Self-Reflect](docs/diagrams/before-after-combined.webp)
 
- ## Real Examples That Made Us Build This
+ ## 💬 Real Examples
 
  ```
  You: "What was that PostgreSQL optimization we figured out?"
@@ -137,34 +83,7 @@ Claude: "3 conversations found:
  - Nov 20: Added rate limiting per authenticated connection"
  ```
 
- ## How It Works
-
- Your conversations → Vector embeddings → Semantic search → Claude remembers
-
- Technical details exist. You don't need them to start.
-
- ## Import Architecture
-
- Here's how your conversations get imported and prioritized:
-
- ![Import Architecture](docs/diagrams/import-architecture.png)
-
- **The system intelligently prioritizes your conversations:**
- - **HOT** (< 5 minutes): Switches to 2-second intervals for near real-time import
- - **🌡️ WARM** (< 24 hours): Normal priority, processed every 60 seconds
- - **❄️ COLD** (> 24 hours): Batch processed, max 5 per cycle to prevent blocking
-
- ## Using It
-
- Once installed, just talk naturally:
-
- - "What did we discuss about database optimization?"
- - "Find our debugging session from last week"
- - "Remember this solution for next time"
-
- The reflection specialist automatically activates. No special commands needed.
-
- ## Key Features
+ ## 🎯 Key Features
 
  ### Project-Scoped Search
  Searches are **project-aware by default**. Claude automatically searches within your current project:
@@ -182,16 +101,37 @@ Claude: [Searches across ALL your projects]
  ### ⏱️ Memory Decay
  Recent conversations matter more. Old ones fade. Like your brain, but reliable.
 
- ### 🚀 Performance
- - **Search**: <3ms average response time across 121+ collections (7.55ms max)
- - **Import**: Production streaming importer with 100% reliability
- - **Memory**: 302MB operational (60% of 500MB limit) - 96% reduction from v2.5.15
- - **CPU**: <1% sustained usage (99.93% reduction from 1437% peak)
- - **Scale**: 100% indexing success rate across all conversation types
- - **V2 Migration**: 100% complete - all conversations use token-aware chunking
+ ### Performance at Scale
+ - **Search**: <3ms average response time
+ - **Scale**: 600+ conversations across 24 projects
+ - **Reliability**: 100% indexing success rate
+ - **Memory**: 96% reduction from v2.5.15
+
+ ## 🏗️ Architecture
+
+ ![Import Architecture](docs/diagrams/import-architecture.png)
+
+ <details>
+ <summary>🔥 HOT/WARM/COLD Intelligent Prioritization</summary>
+
+ - **🔥 HOT** (< 5 minutes): 2-second intervals for near real-time import
+ - **🌡️ WARM** (< 24 hours): Normal priority with starvation prevention
+ - **❄️ COLD** (> 24 hours): Batch processed to prevent blocking
+
+ Files are categorized by age and processed with priority queuing to ensure newest content gets imported quickly while preventing older files from being starved.
+
+ </details>
+
+ ## 🛠️ Requirements
+
+ - **Docker Desktop** (macOS/Windows) or **Docker Engine** (Linux)
+ - **Node.js** 16+ (for the setup wizard)
+ - **Claude Desktop** app
 
+ ## 📖 Documentation
 
- ## The Technical Stack
+ <details>
+ <summary>🔧 Technical Stack</summary>
 
  - **Vector DB**: Qdrant (local, your data stays yours)
  - **Embeddings**:
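
Editor's note on the prioritization described in this hunk: the HOT/WARM/COLD tiers map directly onto the age thresholds exposed later in this diff as compose variables (`HOT_WINDOW_MINUTES`, `WARM_WINDOW_HOURS`). A minimal sketch of that classification, assuming those defaults; the function name and structure are illustrative, not the package's actual watcher code:

```python
import time
from pathlib import Path

HOT_WINDOW_MINUTES = 5   # compose default HOT_WINDOW_MINUTES
WARM_WINDOW_HOURS = 24   # compose default WARM_WINDOW_HOURS

def classify_file(path: Path) -> str:
    """Bucket a conversation file as HOT, WARM, or COLD by modification age."""
    age_seconds = time.time() - path.stat().st_mtime
    if age_seconds < HOT_WINDOW_MINUTES * 60:
        return "HOT"   # polled on the fast loop (HOT_CHECK_INTERVAL_S, default 2s)
    if age_seconds < WARM_WINDOW_HOURS * 3600:
        return "WARM"  # normal priority, with starvation prevention
    return "COLD"      # batched, at most MAX_COLD_FILES per cycle
```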
@@ -200,18 +140,62 @@ Recent conversations matter more. Old ones fade. Like your brain, but reliable.
  - **MCP Server**: Python + FastMCP
  - **Search**: Semantic similarity with time decay
 
- ## Problems
+ </details>
+
+ <details>
+ <summary>📚 Advanced Topics</summary>
+
+ - [Performance tuning](docs/performance-guide.md)
+ - [Security & privacy](docs/security.md)
+ - [Windows setup](docs/windows-setup.md)
+ - [Architecture details](docs/architecture-details.md)
+ - [Contributing](CONTRIBUTING.md)
+
+ </details>
+
+ <details>
+ <summary>🐛 Troubleshooting</summary>
 
  - [Troubleshooting Guide](docs/troubleshooting.md)
  - [GitHub Issues](https://github.com/ramakay/claude-self-reflect/issues)
  - [Discussions](https://github.com/ramakay/claude-self-reflect/discussions)
 
- ## Upgrading to v2.5.19
+ </details>
+
+ <details>
+ <summary>🗑️ Uninstall</summary>
+
+ For complete uninstall instructions, see [docs/UNINSTALL.md](docs/UNINSTALL.md).
+
+ Quick uninstall:
+ ```bash
+ # Remove MCP server
+ claude mcp remove claude-self-reflect
+
+ # Stop Docker containers
+ docker-compose down
+
+ # Uninstall npm package
+ npm uninstall -g claude-self-reflect
+ ```
+
+ </details>
+
+ ## 📦 What's New
+
+ <details>
+ <summary>🎉 v2.8.0 - Latest Release</summary>
 
- ### 🆕 New Feature: Metadata Enrichment
- v2.5.19 adds searchable metadata to your conversations - concepts, files, and tools!
+ - **🔧 Fixed MCP Indexing**: Now correctly shows 97.1% progress (was showing 0%)
+ - **🔥 HOT/WARM/COLD**: Intelligent file prioritization for near real-time imports
+ - **📊 Enhanced Monitoring**: Real-time status with visual indicators
 
- #### For Existing Users
+ </details>
+
+ <details>
+ <summary>✨ v2.5.19 - Metadata Enrichment</summary>
+
+ ### For Existing Users
  ```bash
  # Update to latest version
  npm update -g claude-self-reflect
@@ -224,50 +208,30 @@ claude-self-reflect setup
  docker compose run --rm importer python /app/scripts/delta-metadata-update-safe.py
  ```
 
- #### What You Get
+ ### What You Get
  - `search_by_concept("docker")` - Find conversations by topic
  - `search_by_file("server.py")` - Find conversations that touched specific files
  - Better search accuracy with metadata-based filtering
 
- ## What's New
+ </details>
+
+ <details>
+ <summary>📜 Release History</summary>
 
- - **v2.5.19** - Metadata Enrichment! Search by concepts, files, and tools. [Full release notes](docs/releases/v2.5.19-RELEASE-NOTES.md)
  - **v2.5.18** - Security dependency updates
- - **v2.5.17** - Critical CPU fix and memory limit adjustment. [Full release notes](docs/releases/v2.5.17-release-notes.md)
- - **v2.5.16** - (Pre-release only) Initial streaming importer with CPU throttling
+ - **v2.5.17** - Critical CPU fix and memory limit adjustment
+ - **v2.5.16** - Initial streaming importer with CPU throttling
  - **v2.5.15** - Critical bug fixes and collection creation improvements
- - **v2.5.14** - Async importer collection fix - All conversations now searchable
- - **v2.5.11** - Critical cloud mode fix - Environment variables now properly passed to MCP server
- - **v2.5.10** - Emergency hotfix for MCP server startup failure (dead code removal)
- - **v2.5.6** - Tool Output Extraction - Captures git changes & tool outputs for cross-agent discovery
+ - **v2.5.14** - Async importer collection fix
+ - **v2.5.11** - Critical cloud mode fix
+ - **v2.5.10** - Emergency hotfix for MCP server startup
+ - **v2.5.6** - Tool Output Extraction
 
  [Full changelog](docs/release-history.md)
 
- ## Advanced Topics
-
- - [Performance tuning](docs/performance-guide.md)
- - [Security & privacy](docs/security.md)
- - [Windows setup](docs/windows-setup.md)
- - [Architecture details](docs/architecture-details.md)
- - [Contributing](CONTRIBUTING.md)
-
- ### Uninstall
-
- For complete uninstall instructions, see [docs/UNINSTALL.md](docs/UNINSTALL.md).
-
- Quick uninstall:
- ```bash
- # Remove MCP server
- claude mcp remove claude-self-reflect
-
- # Stop Docker containers
- docker-compose down
-
- # Uninstall npm package
- npm uninstall -g claude-self-reflect
- ```
+ </details>
 
- ## Contributors
+ ## 👥 Contributors
 
  Special thanks to our contributors:
  - **[@TheGordon](https://github.com/TheGordon)** - Fixed timestamp parsing (#10)
@@ -276,4 +240,4 @@ Special thanks to our contributors:
 
  ---
 
- Built with ❤️ by [ramakay](https://github.com/ramakay) for the Claude community.
+ Built with ❤️ by [ramakay](https://github.com/ramakay) for the Claude community.
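
The `search_by_concept` and `search_by_file` tools promised in the README changes above rest on the metadata payloads added in v2.5.19. A hedged sketch of what a concept-filtered Qdrant query could look like with the qdrant-client API; the collection name, payload key, and placeholder vector are assumptions, and the package's actual tool implementation may differ:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Hypothetical: restrict semantic search to chunks whose payload metadata
# lists "docker" among their extracted concepts.
hits = client.search(
    collection_name="conv_example_local",  # placeholder collection name
    query_vector=[0.0] * 384,              # real code would embed the query text
    query_filter=models.Filter(
        must=[models.FieldCondition(key="concepts",
                                    match=models.MatchValue(value="docker"))]
    ),
    limit=5,
)
```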
@@ -177,21 +177,29 @@ services:
        - ./scripts:/scripts:ro
      environment:
        - QDRANT_URL=http://qdrant:6333
-       - STATE_FILE=/config/watcher-state.json
+       - STATE_FILE=/config/csr-watcher.json
+       - LOGS_DIR=/logs  # Fixed: Point to mounted volume
        - VOYAGE_KEY=${VOYAGE_KEY:-}
        - PREFER_LOCAL_EMBEDDINGS=${PREFER_LOCAL_EMBEDDINGS:-true}
-       - HOT_WINDOW_MINUTES=${HOT_WINDOW_MINUTES:-15}
-       - MAX_COLD_FILES_PER_CYCLE=${MAX_COLD_FILES_PER_CYCLE:-3}
-       - MAX_MEMORY_MB=${MAX_MEMORY_MB:-300}
-       - WATCH_INTERVAL_SECONDS=${WATCH_INTERVAL_SECONDS:-30}
-       - MAX_FILES_PER_CYCLE=${MAX_FILES_PER_CYCLE:-10}
+       - ENABLE_MEMORY_DECAY=${ENABLE_MEMORY_DECAY:-false}
+       - DECAY_WEIGHT=${DECAY_WEIGHT:-0.3}
+       - DECAY_SCALE_DAYS=${DECAY_SCALE_DAYS:-90}
+       - CHECK_INTERVAL_S=${CHECK_INTERVAL_S:-60}
+       - HOT_CHECK_INTERVAL_S=${HOT_CHECK_INTERVAL_S:-2}
+       - HOT_WINDOW_MINUTES=${HOT_WINDOW_MINUTES:-5}
+       - WARM_WINDOW_HOURS=${WARM_WINDOW_HOURS:-24}
+       - MAX_COLD_FILES=${MAX_COLD_FILES:-5}
+       - MAX_WARM_WAIT_MINUTES=${MAX_WARM_WAIT_MINUTES:-30}
+       - MAX_MESSAGES_PER_CHUNK=${MAX_MESSAGES_PER_CHUNK:-10}
        - MAX_CHUNK_SIZE=${MAX_CHUNK_SIZE:-50}  # Messages per chunk for streaming
+       - MEMORY_LIMIT_MB=${MEMORY_LIMIT_MB:-1000}
+       - MEMORY_WARNING_MB=${MEMORY_WARNING_MB:-500}
        - PYTHONUNBUFFERED=1
        - MALLOC_ARENA_MAX=2
-     restart: "no"  # Manual start only - prevent system overload
-     profiles: ["safe-watch"]  # Requires explicit profile to run
-     mem_limit: 600m  # Increased from 400m to handle large files safely
-     memswap_limit: 600m
+     restart: unless-stopped
+     profiles: ["safe-watch", "watch"]  # Requires explicit profile to run
+     mem_limit: 1g  # Increased to 1GB to match MEMORY_LIMIT_MB
+     memswap_limit: 1g
      cpus: 1.0  # Single CPU core limit
 
    # MCP server for Claude integration
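
The new decay variables above (`ENABLE_MEMORY_DECAY`, `DECAY_WEIGHT`, `DECAY_SCALE_DAYS`) parameterize the time-decay scoring the README mentions. The exact formula is not part of this diff; one plausible reading of these parameters, an exponential recency bonus blended into the similarity score, is sketched below for orientation only:

```python
import math

DECAY_WEIGHT = 0.3        # compose default DECAY_WEIGHT
DECAY_SCALE_DAYS = 90.0   # compose default DECAY_SCALE_DAYS

def decayed_score(similarity: float, age_days: float) -> float:
    """Blend raw vector similarity with an exponential recency bonus.

    Illustrative only: the package's actual decay formula is not shown in
    this diff. With weight 0.3 and scale 90, a 90-day-old chunk retains
    roughly e^-1 (about 37%) of its recency bonus.
    """
    recency = math.exp(-age_days / DECAY_SCALE_DAYS)
    return (1.0 - DECAY_WEIGHT) * similarity + DECAY_WEIGHT * recency
```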
@@ -454,6 +454,26 @@ async function enrichMetadata() {
  }
  }
 
+ async function startWatcher() {
+   console.log('\n🔄 Starting the streaming watcher...');
+   console.log('   • HOT files (<5 min): 2-second processing');
+   console.log('   • WARM files (<24 hrs): Normal priority');
+   console.log('   • COLD files (>24 hrs): Batch processing');
+
+   try {
+     safeExec('docker', ['compose', '--profile', 'watch', 'up', '-d', 'safe-watcher'], {
+       cwd: projectRoot,
+       stdio: 'inherit'
+     });
+     console.log('✅ Watcher started successfully!');
+     return true;
+   } catch (error) {
+     console.log('⚠️ Could not start watcher automatically');
+     console.log('   You can start it manually with: docker compose --profile watch up -d');
+     return false;
+   }
+ }
+
  async function showFinalInstructions() {
    console.log('\n✅ Setup complete!');
 
@@ -461,7 +481,7 @@ async function showFinalInstructions() {
    console.log('   • 🌐 Qdrant Dashboard: http://localhost:6333/dashboard/');
    console.log('   • 📊 Status: All services running');
    console.log('   • 🔍 Search: Semantic search with memory decay enabled');
-   console.log('   • 🚀 Import: Watcher checking every 60 seconds');
+   console.log('   • 🚀 Watcher: HOT/WARM/COLD prioritization active');
 
    console.log('\n📋 Quick Reference Commands:');
    console.log('   • Check status: docker compose ps');
@@ -568,6 +588,9 @@ async function main() {
    // Enrich metadata (new in v2.5.19)
    await enrichMetadata();
 
+   // Start the watcher
+   await startWatcher();
+
    // Show final instructions
    await showFinalInstructions();
 
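The wizard's new `startWatcher()` step is just a Docker Compose invocation using the `watch` profile added in the compose change above. For anyone scripting the same step outside the Node wizard, an equivalent hedged sketch in Python (error handling simplified; service and profile names come from this diff):

```python
import subprocess

def start_watcher(project_root: str) -> bool:
    """Start the streaming watcher container via the 'watch' compose profile."""
    try:
        subprocess.run(
            ["docker", "compose", "--profile", "watch", "up", "-d", "safe-watcher"],
            cwd=project_root,
            check=True,
        )
        print("Watcher started successfully")
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("Could not start watcher; run: docker compose --profile watch up -d")
        return False
```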
@@ -54,7 +54,7 @@ class ProjectResolver:
      4. Fuzzy matching on collection names
 
      Args:
-         user_project_name: User-provided project name (e.g., "anukruti", "Anukruti", full path)
+         user_project_name: User-provided project name (e.g., "example-project", "Example-Project", full path)
 
      Returns:
          List of collection names that match the project
@@ -362,7 +362,7 @@ class ProjectResolver:
 
      Examples:
      - -Users-name-projects-my-app-src -> ['my', 'app', 'src']
-     - -Users-name-Code-freightwise-documents -> ['freightwise', 'documents']
+     - -Users-name-Code-example-project -> ['example', 'project']
 
      Args:
          path: Path in any format
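
These docstring examples describe how dash-flattened Claude project paths are reduced to search terms. The resolver's real matching logic is not included in this diff; a simplified sketch of the idea, assuming a fixed set of scaffolding components to drop:

```python
# Simplified illustration of the docstring examples above; the real
# ProjectResolver logic (not shown in this diff) is more involved.
STOP_COMPONENTS = {"users", "home", "projects", "code", "name"}

def extract_search_terms(path: str) -> list[str]:
    """Split a dash-flattened path like '-Users-name-Code-example-project'
    into candidate search terms, dropping path scaffolding components."""
    parts = [p for p in path.strip("-").split("-") if p]
    return [p for p in parts if p.lower() not in STOP_COMPONENTS]

assert extract_search_terms("-Users-name-Code-example-project") == ["example", "project"]
assert extract_search_terms("-Users-name-projects-my-app-src") == ["my", "app", "src"]
```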
@@ -9,6 +9,7 @@ import json
  import numpy as np
  import hashlib
  import time
+ import logging
 
  from fastmcp import FastMCP, Context
  from .utils import normalize_project_name
@@ -80,15 +81,23 @@ def initialize_embeddings():
          print(f"[ERROR] Failed to initialize embeddings: {e}")
          return False
 
- # Debug environment loading
- print(f"[DEBUG] Environment variables loaded:")
- print(f"[DEBUG] ENABLE_MEMORY_DECAY: {ENABLE_MEMORY_DECAY}")
- print(f"[DEBUG] USE_NATIVE_DECAY: {USE_NATIVE_DECAY}")
- print(f"[DEBUG] DECAY_WEIGHT: {DECAY_WEIGHT}")
- print(f"[DEBUG] DECAY_SCALE_DAYS: {DECAY_SCALE_DAYS}")
- print(f"[DEBUG] PREFER_LOCAL_EMBEDDINGS: {PREFER_LOCAL_EMBEDDINGS}")
- print(f"[DEBUG] EMBEDDING_MODEL: {EMBEDDING_MODEL}")
- print(f"[DEBUG] env_path: {env_path}")
+ # Debug environment loading and startup
+ import sys
+ import datetime as dt
+ startup_time = dt.datetime.now().isoformat()
+ print(f"[STARTUP] MCP Server starting at {startup_time}", file=sys.stderr)
+ print(f"[STARTUP] Python: {sys.version}", file=sys.stderr)
+ print(f"[STARTUP] Working directory: {os.getcwd()}", file=sys.stderr)
+ print(f"[STARTUP] Script location: {__file__}", file=sys.stderr)
+ print(f"[DEBUG] Environment variables loaded:", file=sys.stderr)
+ print(f"[DEBUG] QDRANT_URL: {QDRANT_URL}", file=sys.stderr)
+ print(f"[DEBUG] ENABLE_MEMORY_DECAY: {ENABLE_MEMORY_DECAY}", file=sys.stderr)
+ print(f"[DEBUG] USE_NATIVE_DECAY: {USE_NATIVE_DECAY}", file=sys.stderr)
+ print(f"[DEBUG] DECAY_WEIGHT: {DECAY_WEIGHT}", file=sys.stderr)
+ print(f"[DEBUG] DECAY_SCALE_DAYS: {DECAY_SCALE_DAYS}", file=sys.stderr)
+ print(f"[DEBUG] PREFER_LOCAL_EMBEDDINGS: {PREFER_LOCAL_EMBEDDINGS}", file=sys.stderr)
+ print(f"[DEBUG] EMBEDDING_MODEL: {EMBEDDING_MODEL}", file=sys.stderr)
+ print(f"[DEBUG] env_path: {env_path}", file=sys.stderr)
 
 
  class SearchResult(BaseModel):
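
A pattern worth noting in the hunk above: every startup and debug line now goes to `sys.stderr`. An MCP server speaking JSON-RPC over stdio must keep stdout reserved for protocol frames, or clients fail to parse the stream. A minimal sketch of the convention (illustrative, not the package's code):

```python
import sys
import logging

# Send all diagnostics to stderr so stdout carries only JSON-RPC frames.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.debug("startup diagnostics belong on stderr")
# Anything written to stdout is reserved for protocol messages only.
```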
@@ -124,18 +133,48 @@ indexing_status = {
      "is_checking": False
  }
 
- async def update_indexing_status():
+ # Cache for indexing status (5-second TTL)
+ _indexing_cache = {"result": None, "timestamp": 0}
+
+ # Setup logger
+ logger = logging.getLogger(__name__)
+
+ def normalize_path(path_str: str) -> str:
+     """Normalize path for consistent comparison across platforms.
+
+     Args:
+         path_str: Path string to normalize
+
+     Returns:
+         Normalized path string with consistent separators
+     """
+     if not path_str:
+         return path_str
+     p = Path(path_str).expanduser().resolve()
+     return str(p).replace('\\', '/')  # Consistent separators for all platforms
+
+ async def update_indexing_status(cache_ttl: int = 5):
      """Update indexing status by checking JSONL files vs Qdrant collections.
-     This is a lightweight check that compares file counts, not full content."""
-     global indexing_status
+     This is a lightweight check that compares file counts, not full content.
+
+     Args:
+         cache_ttl: Cache time-to-live in seconds (default: 5)
+     """
+     global indexing_status, _indexing_cache
+
+     # Check cache first (5-second TTL to prevent performance issues)
+     current_time = time.time()
+     if _indexing_cache["result"] and current_time - _indexing_cache["timestamp"] < cache_ttl:
+         # Use cached result
+         indexing_status = _indexing_cache["result"].copy()
+         return
 
      # Don't run concurrent checks
      if indexing_status["is_checking"]:
          return
 
-     # Only check every 5 minutes to avoid overhead
-     current_time = time.time()
-     if current_time - indexing_status["last_check"] < 300:  # 5 minutes
+     # Check immediately on first call, then every 60 seconds to avoid overhead
+     if indexing_status["last_check"] > 0 and current_time - indexing_status["last_check"] < 60:  # 1 minute
          return
 
      indexing_status["is_checking"] = True
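
The `normalize_path` helper introduced above is what lets the merged state-file lookups later in this hunk series compare host paths, `~`-prefixed paths, and Docker-mounted paths consistently. A small usage sketch (paths are examples; both spellings resolve to the same normalized string on the same host):

```python
from pathlib import Path

def normalize_path(path_str: str) -> str:
    """Same logic as the helper above: expand ~, resolve, unify separators."""
    if not path_str:
        return path_str
    p = Path(path_str).expanduser().resolve()
    return str(p).replace("\\", "/")

a = normalize_path("~/.claude/projects/demo/session.jsonl")
b = normalize_path(str(Path.home() / ".claude" / "projects" / "demo" / "session.jsonl"))
assert a == b  # different spellings, one canonical form
```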
@@ -151,47 +190,108 @@ async def update_indexing_status():
          jsonl_files = list(projects_dir.glob("**/*.jsonl"))
          total_files = len(jsonl_files)
 
-         # Check imported-files.json to see what's been imported
-         # The streaming importer uses imported-files.json with nested structure
-         # Try multiple possible locations for the config file
+         # Check imported-files.json AND watcher state files to see what's been imported
+         # The system uses multiple state files that need to be merged
+         all_imported_files = set()  # Use set to avoid duplicates
+         file_metadata = {}
+
+         # 1. Check imported-files.json (batch importer)
          possible_paths = [
              Path.home() / ".claude-self-reflect" / "config" / "imported-files.json",
              Path(__file__).parent.parent.parent / "config" / "imported-files.json",
              Path("/config/imported-files.json")  # Docker path if running in container
          ]
 
-         imported_files_path = None
          for path in possible_paths:
              if path.exists():
-                 imported_files_path = path
-                 break
+                 try:
+                     with open(path, 'r') as f:
+                         imported_data = json.load(f)
+                         imported_files_dict = imported_data.get("imported_files", {})
+                         file_metadata.update(imported_data.get("file_metadata", {}))
+                         # Normalize paths before adding to set
+                         normalized_files = {normalize_path(k) for k in imported_files_dict.keys()}
+                         all_imported_files.update(normalized_files)
+                 except (json.JSONDecodeError, IOError) as e:
+                     logger.debug(f"Failed to read state file {path}: {e}")
+                     pass  # Continue if file is corrupted
 
-         if imported_files_path and imported_files_path.exists():
-             with open(imported_files_path, 'r') as f:
-                 imported_data = json.load(f)
-                 # The actual structure has imported_files and file_metadata at the top level
-                 # NOT nested under stream_position as previously assumed
-                 imported_files_dict = imported_data.get("imported_files", {})
-                 file_metadata = imported_data.get("file_metadata", {})
-
-                 # Convert dict keys to list for compatibility with existing logic
-                 imported_files_list = list(imported_files_dict.keys())
-
-                 # Count files that have been imported
-                 for file_path in jsonl_files:
-                     # Try multiple path formats to match Docker's state file
-                     file_str = str(file_path).replace(str(Path.home()), "/logs").replace("\\", "/")
-                     # Also try without .claude/projects prefix (Docker mounts directly)
-                     file_str_alt = file_str.replace("/.claude/projects", "")
-
-                     # Check if file is in imported_files list (fully imported)
-                     if file_str in imported_files_list or file_str_alt in imported_files_list:
-                         indexed_files += 1
-                     # Or if it has metadata with position > 0 (partially imported)
-                     elif file_str in file_metadata and file_metadata[file_str].get("position", 0) > 0:
-                         indexed_files += 1
-                     elif file_str_alt in file_metadata and file_metadata[file_str_alt].get("position", 0) > 0:
-                         indexed_files += 1
+         # 2. Check csr-watcher.json (streaming watcher - local mode)
+         watcher_paths = [
+             Path.home() / ".claude-self-reflect" / "config" / "csr-watcher.json",
+             Path("/config/csr-watcher.json")  # Docker path
+         ]
+
+         for path in watcher_paths:
+             if path.exists():
+                 try:
+                     with open(path, 'r') as f:
+                         watcher_data = json.load(f)
+                         watcher_files = watcher_data.get("imported_files", {})
+                         # Normalize paths before adding to set
+                         normalized_files = {normalize_path(k) for k in watcher_files.keys()}
+                         all_imported_files.update(normalized_files)
+                         # Add to metadata with normalized paths
+                         for file_path, info in watcher_files.items():
+                             normalized = normalize_path(file_path)
+                             if normalized not in file_metadata:
+                                 file_metadata[normalized] = {
+                                     "position": 1,
+                                     "chunks": info.get("chunks", 0)
+                                 }
+                 except (json.JSONDecodeError, IOError) as e:
+                     logger.debug(f"Failed to read watcher state file {path}: {e}")
+                     pass  # Continue if file is corrupted
+
+         # 3. Check csr-watcher-cloud.json (streaming watcher - cloud mode)
+         cloud_watcher_path = Path.home() / ".claude-self-reflect" / "config" / "csr-watcher-cloud.json"
+         if cloud_watcher_path.exists():
+             try:
+                 with open(cloud_watcher_path, 'r') as f:
+                     cloud_data = json.load(f)
+                     cloud_files = cloud_data.get("imported_files", {})
+                     # Normalize paths before adding to set
+                     normalized_files = {normalize_path(k) for k in cloud_files.keys()}
+                     all_imported_files.update(normalized_files)
+                     # Add to metadata with normalized paths
+                     for file_path, info in cloud_files.items():
+                         normalized = normalize_path(file_path)
+                         if normalized not in file_metadata:
+                             file_metadata[normalized] = {
+                                 "position": 1,
+                                 "chunks": info.get("chunks", 0)
+                             }
+             except (json.JSONDecodeError, IOError) as e:
+                 logger.debug(f"Failed to read cloud watcher state file {cloud_watcher_path}: {e}")
+                 pass  # Continue if file is corrupted
+
+         # Convert set to list for compatibility
+         imported_files_list = list(all_imported_files)
+
+         # Count files that have been imported
+         for file_path in jsonl_files:
+             # Normalize the current file path for consistent comparison
+             normalized_file = normalize_path(str(file_path))
+
+             # Try multiple path formats to match Docker's state file
+             file_str = str(file_path).replace(str(Path.home()), "/logs").replace("\\", "/")
+             # Also try without .claude/projects prefix (Docker mounts directly)
+             file_str_alt = file_str.replace("/.claude/projects", "")
+
+             # Normalize alternative paths as well
+             normalized_alt = normalize_path(file_str)
+             normalized_alt2 = normalize_path(file_str_alt)
+
+             # Check if file is in imported_files list (fully imported)
+             if normalized_file in imported_files_list or normalized_alt in imported_files_list or normalized_alt2 in imported_files_list:
+                 indexed_files += 1
+             # Or if it has metadata with position > 0 (partially imported)
+             elif normalized_file in file_metadata and file_metadata[normalized_file].get("position", 0) > 0:
+                 indexed_files += 1
+             elif normalized_alt in file_metadata and file_metadata[normalized_alt].get("position", 0) > 0:
+                 indexed_files += 1
+             elif normalized_alt2 in file_metadata and file_metadata[normalized_alt2].get("position", 0) > 0:
+                 indexed_files += 1
 
          # Update status
          indexing_status["last_check"] = current_time
@@ -203,9 +303,14 @@ async def update_indexing_status():
              indexing_status["percentage"] = (indexed_files / total_files) * 100
          else:
              indexing_status["percentage"] = 100.0
+
+         # Update cache
+         _indexing_cache["result"] = indexing_status.copy()
+         _indexing_cache["timestamp"] = current_time
 
      except Exception as e:
          print(f"[WARNING] Failed to update indexing status: {e}")
+         logger.error(f"Failed to update indexing status: {e}", exc_info=True)
      finally:
          indexing_status["is_checking"] = False
 
@@ -1422,4 +1527,8 @@ if __name__ == "__main__":
      sys.exit(0)
 
      # Normal MCP server operation
+     print(f"[STARTUP] Starting FastMCP server in stdio mode...", file=sys.stderr)
+     print(f"[STARTUP] Server name: {mcp.name}", file=sys.stderr)
+     print(f"[STARTUP] Calling mcp.run()...", file=sys.stderr)
      mcp.run()
+     print(f"[STARTUP] Server exited normally", file=sys.stderr)
@@ -5,6 +5,7 @@ Designed for <20ms execution time to support status bars and shell scripts.
  """
 
  import json
+ import time
  from pathlib import Path
  from collections import defaultdict
 
@@ -53,11 +54,36 @@ def normalize_file_path(file_path: str) -> str:
      return file_path
 
 
+ def get_watcher_status() -> dict:
+     """Get streaming watcher status if available."""
+     watcher_state_file = Path.home() / "config" / "csr-watcher.json"
+
+     if not watcher_state_file.exists():
+         return {"running": False, "status": "not configured"}
+
+     try:
+         with open(watcher_state_file) as f:
+             state = json.load(f)
+
+         # Check if watcher is active (modified recently)
+         file_age = time.time() - watcher_state_file.stat().st_mtime
+         is_active = file_age < 120  # Active if updated in last 2 minutes
+
+         return {
+             "running": is_active,
+             "files_processed": len(state.get("imported_files", {})),
+             "last_update_seconds": int(file_age),
+             "status": "🟢 active" if is_active else "🔴 inactive"
+         }
+     except:
+         return {"running": False, "status": "error reading state"}
+
+
  def get_status() -> dict:
      """Get indexing status with overall stats and per-project breakdown.
 
      Returns:
-         dict: JSON structure with overall and per-project indexing status
+         dict: JSON structure with overall and per-project indexing status, plus watcher status
      """
      projects_dir = Path.home() / ".claude" / "projects"
      project_stats = defaultdict(lambda: {"indexed": 0, "total": 0})
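
Note that `get_watcher_status()` above infers liveness purely from the state file's mtime (fresh within two minutes means active), which keeps the check inside the script's stated <20ms budget by avoiding any Docker API calls. A hedged usage sketch for a shell status bar; the module import path is an assumption, since in the package the script may only be invoked directly:

```python
# Hypothetical import path for the status script shown in this diff.
from status import get_status  # assumption: adjust to the real module name

status = get_status()
watcher = status["watcher"]
print(f'{watcher["status"]} · {watcher.get("files_processed", 0)} files processed')
```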
@@ -154,6 +180,9 @@ def get_status() -> dict:
              "total": stats["total"]
          }
 
+     # Add watcher status
+     result["watcher"] = get_watcher_status()
+
      return result
 
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
      "name": "claude-self-reflect",
-     "version": "2.7.4",
+     "version": "2.8.1",
      "description": "Give Claude perfect memory of all your conversations - Installation wizard for Python MCP server",
      "keywords": [
          "claude",
@@ -1,374 +0,0 @@
- #!/usr/bin/env python3
- """
- Streaming importer with true line-by-line processing to prevent OOM.
- Processes JSONL files without loading entire file into memory.
- """
-
- import json
- import os
- import sys
- import hashlib
- import gc
- from pathlib import Path
- from datetime import datetime
- from typing import List, Dict, Any, Optional
- import logging
-
- # Add the project root to the Python path
- project_root = Path(__file__).parent.parent
- sys.path.insert(0, str(project_root))
-
- from qdrant_client import QdrantClient
- from qdrant_client.models import PointStruct, Distance, VectorParams
-
- # Set up logging
- logging.basicConfig(
-     level=logging.INFO,
-     format='%(asctime)s - %(levelname)s - %(message)s'
- )
- logger = logging.getLogger(__name__)
-
- # Environment variables
- QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
- STATE_FILE = os.getenv("STATE_FILE", "/config/imported-files.json")
- PREFER_LOCAL_EMBEDDINGS = os.getenv("PREFER_LOCAL_EMBEDDINGS", "true").lower() == "true"
- VOYAGE_API_KEY = os.getenv("VOYAGE_KEY")
- MAX_CHUNK_SIZE = int(os.getenv("MAX_CHUNK_SIZE", "50"))  # Messages per chunk
-
- # Initialize Qdrant client
- client = QdrantClient(url=QDRANT_URL)
-
- # Initialize embedding provider
- embedding_provider = None
- embedding_dimension = None
-
- if PREFER_LOCAL_EMBEDDINGS or not VOYAGE_API_KEY:
-     logger.info("Using local embeddings (fastembed)")
-     from fastembed import TextEmbedding
-     embedding_provider = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
-     embedding_dimension = 384
-     collection_suffix = "local"
- else:
-     logger.info("Using Voyage AI embeddings")
-     import voyageai
-     embedding_provider = voyageai.Client(api_key=VOYAGE_API_KEY)
-     embedding_dimension = 1024
-     collection_suffix = "voyage"
-
- def normalize_project_name(project_name: str) -> str:
-     """Normalize project name for consistency."""
-     return project_name.replace("-Users-ramakrishnanannaswamy-projects-", "").replace("-", "_").lower()
-
- def get_collection_name(project_path: Path) -> str:
-     """Generate collection name from project path."""
-     normalized = normalize_project_name(project_path.name)
-     name_hash = hashlib.md5(normalized.encode()).hexdigest()[:8]
-     return f"conv_{name_hash}_{collection_suffix}"
-
- def ensure_collection(collection_name: str):
-     """Ensure collection exists with correct configuration."""
-     collections = client.get_collections().collections
-     if not any(c.name == collection_name for c in collections):
-         logger.info(f"Creating collection: {collection_name}")
-         client.create_collection(
-             collection_name=collection_name,
-             vectors_config=VectorParams(size=embedding_dimension, distance=Distance.COSINE)
-         )
-
- def generate_embeddings(texts: List[str]) -> List[List[float]]:
-     """Generate embeddings for texts."""
-     if PREFER_LOCAL_EMBEDDINGS or not VOYAGE_API_KEY:
-         embeddings = list(embedding_provider.passage_embed(texts))
-         return [emb.tolist() if hasattr(emb, 'tolist') else emb for emb in embeddings]
-     else:
-         response = embedding_provider.embed(texts, model="voyage-3")
-         return response.embeddings
-
- def process_and_upload_chunk(messages: List[Dict[str, Any]], chunk_index: int,
-                              conversation_id: str, created_at: str,
-                              metadata: Dict[str, Any], collection_name: str,
-                              project_path: Path) -> int:
-     """Process and immediately upload a single chunk."""
-     if not messages:
-         return 0
-
-     # Extract text content
-     texts = []
-     for msg in messages:
-         role = msg.get("role", "unknown")
-         content = msg.get("content", "")
-         if content:
-             texts.append(f"{role.upper()}: {content}")
-
-     if not texts:
-         return 0
-
-     chunk_text = "\n".join(texts)
-
-     try:
-         # Generate embedding
-         embeddings = generate_embeddings([chunk_text])
-
-         # Create point ID
-         point_id = hashlib.md5(
-             f"{conversation_id}_{chunk_index}".encode()
-         ).hexdigest()[:16]
-
-         # Create payload
-         payload = {
-             "text": chunk_text,
-             "conversation_id": conversation_id,
-             "chunk_index": chunk_index,
-             "timestamp": created_at,
-             "project": normalize_project_name(project_path.name),
-             "start_role": messages[0].get("role", "unknown") if messages else "unknown",
-             "message_count": len(messages)
-         }
-
-         # Add metadata
-         if metadata:
-             payload.update(metadata)
-
-         # Create point
-         point = PointStruct(
-             id=int(point_id, 16) % (2**63),
-             vector=embeddings[0],
-             payload=payload
-         )
-
-         # Upload immediately
-         client.upsert(
-             collection_name=collection_name,
-             points=[point],
-             wait=True
-         )
-
-         return 1
-
-     except Exception as e:
-         logger.error(f"Error processing chunk {chunk_index}: {e}")
-         return 0
-
- def extract_metadata_single_pass(file_path: str) -> tuple[Dict[str, Any], str]:
-     """Extract metadata in a single pass, return metadata and first timestamp."""
-     metadata = {
-         "files_analyzed": [],
-         "files_edited": [],
-         "tools_used": [],
-         "concepts": []
-     }
-
-     first_timestamp = None
-
-     try:
-         with open(file_path, 'r', encoding='utf-8') as f:
-             for line in f:
-                 if not line.strip():
-                     continue
-
-                 try:
-                     data = json.loads(line)
-
-                     # Get timestamp from first valid entry
-                     if first_timestamp is None and 'timestamp' in data:
-                         first_timestamp = data.get('timestamp')
-
-                     # Extract tool usage from messages
-                     if 'message' in data and data['message']:
-                         msg = data['message']
-                         if msg.get('content'):
-                             content = msg['content']
-                             if isinstance(content, list):
-                                 for item in content:
-                                     if isinstance(item, dict) and item.get('type') == 'tool_use':
-                                         tool_name = item.get('name', '')
-                                         if tool_name and tool_name not in metadata['tools_used']:
-                                             metadata['tools_used'].append(tool_name)
-
-                                         # Extract file references
-                                         if 'input' in item:
-                                             input_data = item['input']
-                                             if isinstance(input_data, dict):
-                                                 if 'file_path' in input_data:
-                                                     file_ref = input_data['file_path']
-                                                     if file_ref not in metadata['files_analyzed']:
-                                                         metadata['files_analyzed'].append(file_ref)
-                                                 if 'path' in input_data:
-                                                     file_ref = input_data['path']
-                                                     if file_ref not in metadata['files_analyzed']:
-                                                         metadata['files_analyzed'].append(file_ref)
-
-                 except json.JSONDecodeError:
-                     continue
-                 except Exception:
-                     continue
-
-     except Exception as e:
-         logger.warning(f"Error extracting metadata: {e}")
-
-     return metadata, first_timestamp or datetime.now().isoformat()
-
- def stream_import_file(jsonl_file: Path, collection_name: str, project_path: Path) -> int:
-     """Stream import a single JSONL file without loading it into memory."""
-     logger.info(f"Streaming import of {jsonl_file.name}")
-
-     # Extract metadata in first pass (lightweight)
-     metadata, created_at = extract_metadata_single_pass(str(jsonl_file))
-
-     # Stream messages and process in chunks
-     chunk_buffer = []
-     chunk_index = 0
-     total_chunks = 0
-     conversation_id = jsonl_file.stem
-
-     try:
-         with open(jsonl_file, 'r', encoding='utf-8') as f:
-             for line_num, line in enumerate(f, 1):
-                 line = line.strip()
-                 if not line:
-                     continue
-
-                 try:
-                     data = json.loads(line)
-
-                     # Skip non-message lines
-                     if data.get('type') == 'summary':
-                         continue
-
-                     # Extract message if present
-                     if 'message' in data and data['message']:
-                         msg = data['message']
-                         if msg.get('role') and msg.get('content'):
-                             # Extract content
-                             content = msg['content']
-                             if isinstance(content, list):
-                                 text_parts = []
-                                 for item in content:
-                                     if isinstance(item, dict) and item.get('type') == 'text':
-                                         text_parts.append(item.get('text', ''))
-                                     elif isinstance(item, str):
-                                         text_parts.append(item)
-                                 content = '\n'.join(text_parts)
-
-                             if content:
-                                 chunk_buffer.append({
-                                     'role': msg['role'],
-                                     'content': content
-                                 })
-
-                             # Process chunk when buffer reaches MAX_CHUNK_SIZE
-                             if len(chunk_buffer) >= MAX_CHUNK_SIZE:
-                                 chunks = process_and_upload_chunk(
-                                     chunk_buffer, chunk_index, conversation_id,
-                                     created_at, metadata, collection_name, project_path
-                                 )
-                                 total_chunks += chunks
-                                 chunk_buffer = []
-                                 chunk_index += 1
-
-                                 # Force garbage collection after each chunk
-                                 gc.collect()
-
-                                 # Log progress
-                                 if chunk_index % 10 == 0:
-                                     logger.info(f"Processed {chunk_index} chunks from {jsonl_file.name}")
-
-                 except json.JSONDecodeError:
-                     logger.debug(f"Skipping invalid JSON at line {line_num}")
-                 except Exception as e:
-                     logger.debug(f"Error processing line {line_num}: {e}")
-
-         # Process remaining messages
-         if chunk_buffer:
-             chunks = process_and_upload_chunk(
-                 chunk_buffer, chunk_index, conversation_id,
-                 created_at, metadata, collection_name, project_path
-             )
-             total_chunks += chunks
-
-         logger.info(f"Imported {total_chunks} chunks from {jsonl_file.name}")
-         return total_chunks
-
-     except Exception as e:
-         logger.error(f"Failed to import {jsonl_file}: {e}")
-         return 0
-
- def load_state() -> dict:
-     """Load import state."""
-     if os.path.exists(STATE_FILE):
-         try:
-             with open(STATE_FILE, 'r') as f:
-                 return json.load(f)
-         except:
-             pass
-     return {"imported_files": {}}
-
- def save_state(state: dict):
-     """Save import state."""
-     os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
-     with open(STATE_FILE, 'w') as f:
-         json.dump(state, f, indent=2)
-
- def should_import_file(file_path: Path, state: dict) -> bool:
-     """Check if file should be imported."""
-     file_str = str(file_path)
-     if file_str in state.get("imported_files", {}):
-         file_info = state["imported_files"][file_str]
-         last_modified = file_path.stat().st_mtime
-         if file_info.get("last_modified") == last_modified:
-             logger.info(f"Skipping unchanged file: {file_path.name}")
-             return False
-     return True
-
- def update_file_state(file_path: Path, state: dict, chunks: int):
-     """Update state for imported file."""
-     file_str = str(file_path)
-     state["imported_files"][file_str] = {
-         "imported_at": datetime.now().isoformat(),
-         "last_modified": file_path.stat().st_mtime,
-         "chunks": chunks
-     }
-
- def main():
-     """Main import function."""
-     # Load state
-     state = load_state()
-     logger.info(f"Loaded state with {len(state.get('imported_files', {}))} previously imported files")
-
-     # Find all projects
-     logs_dir = Path(os.getenv("LOGS_DIR", "/logs"))
-     project_dirs = [d for d in logs_dir.iterdir() if d.is_dir()]
-     logger.info(f"Found {len(project_dirs)} projects to import")
-
-     total_imported = 0
-
-     for project_dir in project_dirs:
-         # Get collection name
-         collection_name = get_collection_name(project_dir)
-         logger.info(f"Importing project: {project_dir.name} -> {collection_name}")
-
-         # Ensure collection exists
-         ensure_collection(collection_name)
-
-         # Find JSONL files
-         jsonl_files = sorted(project_dir.glob("*.jsonl"))
-
-         # Limit files per cycle if specified
-         max_files = int(os.getenv("MAX_FILES_PER_CYCLE", "1000"))
-         jsonl_files = jsonl_files[:max_files]
-
-         for jsonl_file in jsonl_files:
-             if should_import_file(jsonl_file, state):
-                 chunks = stream_import_file(jsonl_file, collection_name, project_dir)
-                 if chunks > 0:
-                     update_file_state(jsonl_file, state, chunks)
-                     save_state(state)
-                     total_imported += 1
-
-                 # Force GC after each file
-                 gc.collect()
-
-     logger.info(f"Import complete: processed {total_imported} files")
-
- if __name__ == "__main__":
-     main()