@ngocsangairvds/vsaf 3.0.11 → 3.1.0

@@ -1,1306 +1,4 @@
1
1
  ---
2
2
  name: graphify
3
- description: any input (code, docs, papers, images) → knowledge graph → clustered communities → HTML + JSON + audit report
4
- trigger: /graphify
3
+ description: DEPRECATED. This skill has been removed from the project. Do not use.
5
4
  ---
6
-
7
- # /graphify
8
-
9
- Turn any folder of files into a navigable knowledge graph with community detection, an honest audit trail, and three outputs: interactive HTML, GraphRAG-ready JSON, and a plain-language GRAPH_REPORT.md.
10
-
11
- ## Usage
12
-
13
- ```
14
- /graphify # full pipeline on current directory → Obsidian vault
15
- /graphify <path> # full pipeline on specific path
16
- /graphify <path> --mode deep # thorough extraction, richer INFERRED edges
17
- /graphify <path> --update # incremental - re-extract only new/changed files
18
- /graphify <path> --directed # build directed graph (preserves edge direction: source→target)
19
- /graphify <path> --whisper-model medium # use a larger Whisper model for better transcription accuracy
20
- /graphify <path> --cluster-only # rerun clustering on existing graph
21
- /graphify <path> --no-viz # skip visualization, just report + JSON
22
- /graphify <path> --html # (HTML is generated by default - this flag is a no-op)
23
- /graphify <path> --svg # also export graph.svg (embeds in Notion, GitHub)
24
- /graphify <path> --graphml # export graph.graphml (Gephi, yEd)
25
- /graphify <path> --neo4j # generate graphify-out/cypher.txt for Neo4j
26
- /graphify <path> --neo4j-push bolt://localhost:7687 # push directly to Neo4j
27
- /graphify <path> --mcp # start MCP stdio server for agent access
28
- /graphify <path> --watch # watch folder, auto-rebuild on code changes (no LLM needed)
29
- /graphify <path> --wiki # build agent-crawlable wiki (index.md + one article per community)
30
- /graphify <path> --obsidian --obsidian-dir ~/vaults/my-project # write vault to custom path (e.g. existing vault)
31
- /graphify add <url> # fetch URL, save to ./raw, update graph
32
- /graphify add <url> --author "Name" # tag who wrote it
33
- /graphify add <url> --contributor "Name" # tag who added it to the corpus
34
- /graphify query "<question>" # BFS traversal - broad context
35
- /graphify query "<question>" --dfs # DFS - trace a specific path
36
- /graphify query "<question>" --budget 1500 # cap answer at N tokens
37
- /graphify path "AuthModule" "Database" # shortest path between two concepts
38
- /graphify explain "SwinTransformer" # plain-language explanation of a node
39
- ```
40
-
41
- ## What graphify is for
42
-
43
- graphify is built around Andrej Karpathy's /raw folder workflow: drop anything into a folder - papers, tweets, screenshots, code, notes - and get a structured knowledge graph that shows you what you didn't know was connected.
44
-
45
- Three things it does that Claude alone cannot:
46
- 1. **Persistent graph** - relationships are stored in `graphify-out/graph.json` and survive across sessions. Ask questions weeks later without re-reading everything.
47
- 2. **Honest audit trail** - every edge is tagged EXTRACTED, INFERRED, or AMBIGUOUS. You know what was found vs invented.
48
- 3. **Cross-document surprise** - community detection finds connections between concepts in different files that you would never think to ask about directly.
49
-
50
- Use it for:
51
- - A codebase you're new to (understand architecture before touching anything)
52
- - A reading list (papers + tweets + notes → one navigable graph)
53
- - A research corpus (citation graph + concept graph in one)
54
- - Your personal /raw folder (drop everything in, let it grow, query it)
55
-
56
- ## What You Must Do When Invoked
57
-
58
- If no path was given, use `.` (current directory). Do not ask the user for a path.
59
-
60
- Follow these steps in order. Do not skip steps.
61
-
62
- ### Step 1 - Ensure graphify is installed
63
-
64
- ```bash
65
- # Detect the correct Python interpreter (handles pipx, venv, system installs)
66
- GRAPHIFY_BIN=$(which graphify 2>/dev/null)
67
- if [ -n "$GRAPHIFY_BIN" ]; then
68
- PYTHON=$(head -1 "$GRAPHIFY_BIN" | tr -d '#!')
69
- case "$PYTHON" in
70
- *[!a-zA-Z0-9/_.-]*) PYTHON="python3" ;;
71
- esac
72
- else
73
- PYTHON="python3"
74
- fi
75
- "$PYTHON" -c "import graphify" 2>/dev/null || "$PYTHON" -m pip install graphifyy -q 2>/dev/null || "$PYTHON" -m pip install graphifyy -q --break-system-packages 2>&1 | tail -3
76
- # Write interpreter path for all subsequent steps (persists across invocations)
77
- mkdir -p graphify-out
78
- "$PYTHON" -c "import sys; open('graphify-out/.graphify_python', 'w').write(sys.executable)"
79
- ```
80
-
81
- If the import succeeds, print nothing and move straight to Step 2.
82
-
83
- **In every subsequent bash block, replace `python3` with `$(cat graphify-out/.graphify_python)` to use the correct interpreter.**
84
-
85
- ### Step 2 - Detect files
86
-
87
- ```bash
88
- $(cat graphify-out/.graphify_python) -c "
89
- import json
90
- from graphify.detect import detect
91
- from pathlib import Path
92
- result = detect(Path('INPUT_PATH'))
93
- print(json.dumps(result))
94
- " > graphify-out/.graphify_detect.json
95
- ```
96
-
97
- Replace INPUT_PATH with the actual path the user provided. Do NOT cat or print the JSON - read it silently and present a clean summary instead:
98
-
99
- ```
100
- Corpus: X files · ~Y words
101
- code: N files (.py .ts .go ...)
102
- docs: N files (.md .txt ...)
103
- papers: N files (.pdf ...)
104
- images: N files
105
- video: N files (.mp4 .mp3 ...)
106
- ```
107
-
108
- Omit any category with 0 files from the summary.
109
-
110
- Then act on it:
111
- - If `total_files` is 0: stop with "No supported files found in [path]."
112
- - If `skipped_sensitive` is non-empty: mention file count skipped, not the file names.
113
- - If `total_words` > 2,000,000 OR `total_files` > 200: show the warning and the top 5 subdirectories by file count, then ask which subfolder to run on. Wait for the user's answer before proceeding.
114
- - Otherwise: proceed directly to Step 2.5 if video files were detected, or Step 3 if not.
115
-
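The branching in the list above can be sketched as one small function. This is an illustration only: the name `next_action` and the returned strings are hypothetical, and the keys `total_files`, `total_words`, and `files` are assumed to have the shape this skill describes for the `detect` output.

```python
def next_action(detect: dict) -> str:
    """Map a detect() result onto the next step (hypothetical helper)."""
    total_files = detect.get("total_files", 0)
    if total_files == 0:
        return "stop"
    # Large corpora need a user decision before any extraction starts.
    if detect.get("total_words", 0) > 2_000_000 or total_files > 200:
        return "ask_subfolder"
    # Video and audio must be transcribed before semantic extraction.
    if detect.get("files", {}).get("video"):
        return "step_2.5"
    return "step_3"
```

The `skipped_sensitive` rule is left out on purpose: it only changes what is reported to the user, not which step runs next.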
116
- ### Step 2.5 - Transcribe video / audio files (only if video files detected)
117
-
118
- Skip this step entirely if `detect` returned zero `video` files.
119
-
120
- Video and audio files cannot be read directly. Transcribe them to text first, then treat the transcripts as doc files in Step 3.
121
-
122
- **Strategy:** Read the god nodes from `graphify-out/.graphify_detect.json` (or the analysis file if it exists from a previous run). You are already a language model — write a one-sentence domain hint yourself from those labels. Then pass it to Whisper as the initial prompt. No separate API call needed.
123
-
124
- **However**, if the corpus has *only* video files and no other docs/code, use the generic fallback prompt: `"Use proper punctuation and paragraph breaks."`
125
-
126
- **Step 1 - Write the Whisper prompt yourself.**
127
-
128
- Read the top god node labels from detect output or analysis, then compose a short domain hint sentence, for example:
129
-
130
- - Labels: `transformer, attention, encoder, decoder` → `"Machine learning research on transformer architectures and attention mechanisms. Use proper punctuation and paragraph breaks."`
131
- - Labels: `kubernetes, deployment, pod, helm` → `"DevOps discussion about Kubernetes deployments and Helm charts. Use proper punctuation and paragraph breaks."`
132
-
133
- Export it as `GRAPHIFY_WHISPER_PROMPT`, the environment variable the next command reads.
134
-
135
- **Step 2 - Transcribe:**
136
-
137
- ```bash
138
- export GRAPHIFY_WHISPER_MODEL=base # or whatever --whisper-model the user passed
139
- $(cat graphify-out/.graphify_python) -c "
140
- import json, os
141
- from pathlib import Path
142
- from graphify.transcribe import transcribe_all
143
-
144
- detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
145
- video_files = detect.get('files', {}).get('video', [])
146
- prompt = os.environ.get('GRAPHIFY_WHISPER_PROMPT', 'Use proper punctuation and paragraph breaks.')
147
-
148
- transcript_paths = transcribe_all(video_files, initial_prompt=prompt)
149
- print(json.dumps(transcript_paths))
150
- " > graphify-out/.graphify_transcripts.json
151
- ```
152
-
153
- After transcription:
154
- - Read the transcript paths from `graphify-out/.graphify_transcripts.json`
155
- - Add them to the docs list before dispatching semantic subagents in Step 3B
156
- - Print how many transcripts were created: `Transcribed N video file(s) -> treating as docs`
157
- - If transcription fails for a file, print a warning and continue with the rest
158
-
159
- **Whisper model:** Default is `base`. If the user passed `--whisper-model <name>`, set `GRAPHIFY_WHISPER_MODEL=<name>` in the environment before running the command above.
160
-
161
- ### Step 3 - Extract entities and relationships
162
-
163
- **Before starting:** note whether `--mode deep` was given. You must pass `DEEP_MODE=true` to every subagent in Step B2 if it was. Track this from the original invocation - do not lose it.
164
-
165
- This step has two parts: **structural extraction** (deterministic, free) and **semantic extraction** (Claude, costs tokens).
166
-
167
- **Run Part A (AST) and Part B (semantic) in parallel. Dispatch all semantic subagents AND start AST extraction in the same message. Both can run simultaneously since they operate on different file types. Merge results in Part C as before.**
168
-
169
- Note: Parallelizing AST + semantic saves 5-15s on large corpora. AST is deterministic and fast; start it while subagents are processing docs/papers.
170
-
171
- #### Part A - Structural extraction for code files
172
-
173
- For any code files detected, run AST extraction in parallel with Part B subagents:
174
-
175
- ```bash
176
- $(cat graphify-out/.graphify_python) -c "
177
- import sys, json
178
- from graphify.extract import collect_files, extract
179
- from pathlib import Path
180
- import json
181
-
182
- code_files = []
183
- detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
184
- for f in detect.get('files', {}).get('code', []):
184
-     code_files.extend(collect_files(Path(f)) if Path(f).is_dir() else [Path(f)])
185
-
186
- if code_files:
187
-     result = extract(code_files)
188
-     Path('graphify-out/.graphify_ast.json').write_text(json.dumps(result, indent=2))
189
-     print(f'AST: {len(result[\"nodes\"])} nodes, {len(result[\"edges\"])} edges')
190
- else:
191
-     Path('graphify-out/.graphify_ast.json').write_text(json.dumps({'nodes':[],'edges':[],'input_tokens':0,'output_tokens':0}))
192
-     print('No code files - skipping AST extraction')
194
- "
195
- ```
196
-
197
- #### Part B - Semantic extraction (parallel subagents)
198
-
199
- **Fast path:** If detection found zero docs, papers, and images (code-only corpus), skip Part B entirely and go straight to Part C. AST handles code - there is nothing for semantic subagents to do.
200
-
201
- **MANDATORY: You MUST use the Agent tool here. Reading files yourself one-by-one is forbidden - it is 5-10x slower. If you do not use the Agent tool you are doing this wrong.**
202
-
203
- Before dispatching subagents, print a timing estimate:
204
- - Load `total_words` and file counts from `graphify-out/.graphify_detect.json`
205
- - Estimate agents needed: `ceil(uncached_non_code_files / 22)` (chunk size is 20-25)
206
- - Estimate time: ~45s per agent batch (they run in parallel, so total ≈ 45s × ceil(agents/parallel_limit))
207
- - Print: "Semantic extraction: ~N files → X agents, estimated ~Ys"
208
-
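The estimate above is plain arithmetic and can be sketched as follows. The function name and the default `parallel_limit=5` are assumptions made for illustration; the skill does not say how many agents actually run at once.

```python
import math


def estimate_semantic_run(
    uncached_non_code_files: int,
    parallel_limit: int = 5,      # assumed value, not stated in the skill
    files_per_agent: int = 22,    # midpoint of the 20-25 chunk size
    seconds_per_batch: int = 45,  # rough per-batch figure quoted above
) -> tuple[int, int]:
    """Return (agents needed, estimated seconds) for semantic extraction."""
    agents = math.ceil(uncached_non_code_files / files_per_agent)
    batches = math.ceil(agents / parallel_limit)
    return agents, batches * seconds_per_batch


files = 100
agents, seconds = estimate_semantic_run(files)
print(f"Semantic extraction: ~{files} files → {agents} agents, estimated ~{seconds}s")
```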
209
- **Step B0 - Check extraction cache first**
210
-
211
- Before dispatching any subagents, check which files already have cached extraction results:
212
-
213
- ```bash
214
- $(cat graphify-out/.graphify_python) -c "
215
- import json
216
- from graphify.cache import check_semantic_cache
217
- from pathlib import Path
218
-
219
- detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
220
- all_files = [f for files in detect['files'].values() for f in files]
221
-
222
- cached_nodes, cached_edges, cached_hyperedges, uncached = check_semantic_cache(all_files)
223
-
224
- if cached_nodes or cached_edges or cached_hyperedges:
224
-     Path('graphify-out/.graphify_cached.json').write_text(json.dumps({'nodes': cached_nodes, 'edges': cached_edges, 'hyperedges': cached_hyperedges}))
226
- Path('graphify-out/.graphify_uncached.txt').write_text('\n'.join(uncached))
227
- print(f'Cache: {len(all_files)-len(uncached)} files hit, {len(uncached)} files need extraction')
228
- "
229
- ```
230
-
231
- Only dispatch subagents for files listed in `graphify-out/.graphify_uncached.txt`. If all files are cached, skip to Part C directly.
232
-
233
- **Step B1 - Split into chunks**
234
-
235
- Load files from `graphify-out/.graphify_uncached.txt`. Split into chunks of 20-25 files each. Each image gets its own chunk (vision needs separate context). When splitting, group files from the same directory together so related artifacts land in the same chunk and cross-file relationships are more likely to be extracted.
236
-
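One possible way to do the split described above is sketched here. The helper name, the image suffix list, and the fixed `max_size` of 22 are assumptions; note that the last non-image chunk can end up smaller than 20 files.

```python
from pathlib import Path

# Assumed set of image extensions; the skill does not list them.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def split_into_chunks(files: list[str], max_size: int = 22) -> list[list[str]]:
    """Split files into chunks, keeping directory siblings adjacent."""
    images = [f for f in files if Path(f).suffix.lower() in IMAGE_SUFFIXES]
    others = [f for f in files if Path(f).suffix.lower() not in IMAGE_SUFFIXES]
    # Sorting by parent directory puts files from one folder next to each
    # other, so they tend to land in the same chunk.
    others.sort(key=lambda f: (str(Path(f).parent), f))
    chunks = [others[i:i + max_size] for i in range(0, len(others), max_size)]
    # Each image gets a chunk of its own.
    chunks.extend([img] for img in images)
    return chunks
```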
237
- **Step B2 - Dispatch ALL subagents in a single message**
238
-
239
- Call the Agent tool multiple times IN THE SAME RESPONSE - one call per chunk. This is the only way they run in parallel. If you make one Agent call, wait, then make another, you are doing it sequentially and defeating the purpose.
240
-
241
- **IMPORTANT - subagent type:** Always use `subagent_type="general-purpose"`. Do NOT use `Explore` - it is read-only and cannot write chunk files to disk, which silently drops extraction results. General-purpose has Write and Bash access which the subagent needs.
242
-
243
- Concrete example for 3 chunks:
244
- ```
245
- [Agent tool call 1: files 1-22, subagent_type="general-purpose"]
245
- [Agent tool call 2: files 23-44, subagent_type="general-purpose"]
246
- [Agent tool call 3: files 45-66, subagent_type="general-purpose"]
248
- ```
249
- All three in one message. Not three separate messages.
250
-
251
- Each subagent receives this exact prompt (substitute FILE_LIST, CHUNK_NUM, TOTAL_CHUNKS, and DEEP_MODE):
252
-
253
- ```
254
- You are a graphify extraction subagent. Read the files listed and extract a knowledge graph fragment.
255
- Output ONLY valid JSON matching the schema below - no explanation, no markdown fences, no preamble.
256
-
257
- Files (chunk CHUNK_NUM of TOTAL_CHUNKS):
258
- FILE_LIST
259
-
260
- Rules:
261
- - EXTRACTED: relationship explicit in source (import, call, citation, "see §3.2")
262
- - INFERRED: reasonable inference (shared data structure, implied dependency)
263
- - AMBIGUOUS: uncertain - flag for review, do not omit
264
-
265
- Code files: focus on semantic edges AST cannot find (call relationships, shared data, arch patterns).
266
- Do not re-extract imports - AST already has those.
267
- Doc/paper files: extract named concepts, entities, citations. Also extract rationale — sections that explain WHY a decision was made, trade-offs chosen, or design intent. These become nodes with `rationale_for` edges pointing to the concept they explain.
268
- Image files: use vision to understand what the image IS - do not just OCR.
269
- UI screenshot: layout patterns, design decisions, key elements, purpose.
270
- Chart: metric, trend/insight, data source.
271
- Tweet/post: claim as node, author, concepts mentioned.
272
- Diagram: components and connections.
273
- Research figure: what it demonstrates, method, result.
274
- Handwritten/whiteboard: ideas and arrows, mark uncertain readings AMBIGUOUS.
275
-
276
- DEEP_MODE (if --mode deep was given): be aggressive with INFERRED edges - indirect deps,
277
- shared assumptions, latent couplings. Mark uncertain ones AMBIGUOUS instead of omitting.
278
-
279
- Semantic similarity: if two concepts in this chunk solve the same problem or represent the same idea without any structural link (no import, no call, no citation), add a `semantically_similar_to` edge marked INFERRED with a confidence_score reflecting how similar they are (0.6-0.95). Examples:
280
- - Two functions that both validate user input but never call each other
281
- - A class in code and a concept in a paper that describe the same algorithm
282
- - Two error types that handle the same failure mode differently
283
- Only add these when the similarity is genuinely non-obvious and cross-cutting. Do not add them for trivially similar things.
284
-
285
- Hyperedges: if 3 or more nodes clearly participate together in a shared concept, flow, or pattern that is not captured by pairwise edges alone, add a hyperedge to a top-level `hyperedges` array. Examples:
286
- - All classes that implement a common protocol or interface
287
- - All functions in an authentication flow (even if they don't all call each other)
288
- - All concepts from a paper section that form one coherent idea
289
- Use sparingly — only when the group relationship adds information beyond the pairwise edges. Maximum 3 hyperedges per chunk.
290
-
291
- If a file has YAML frontmatter (--- ... ---), copy source_url, captured_at, author,
292
- contributor onto every node from that file.
293
-
294
- confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default:
295
- - EXTRACTED edges: confidence_score = 1.0 always
296
- - INFERRED edges: reason about each edge individually.
297
- Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
298
- Reasonable inference with some uncertainty: 0.6-0.7.
299
- Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
300
- - AMBIGUOUS edges: 0.1-0.3
301
-
302
- Output exactly this JSON (no other text):
303
- {"nodes":[{"id":"filestem_entityname","label":"Human Readable Name","file_type":"code|document|paper|image","source_file":"relative/path","source_location":null,"source_url":null,"captured_at":null,"author":null,"contributor":null}],"edges":[{"source":"node_id","target":"node_id","relation":"calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to|rationale_for","confidence":"EXTRACTED|INFERRED|AMBIGUOUS","confidence_score":1.0,"source_file":"relative/path","source_location":null,"weight":1.0}],"hyperedges":[{"id":"snake_case_id","label":"Human Readable Label","nodes":["node_id1","node_id2","node_id3"],"relation":"participate_in|implement|form","confidence":"EXTRACTED|INFERRED","confidence_score":0.75,"source_file":"relative/path"}],"input_tokens":0,"output_tokens":0}
304
- ```
305
-
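The `confidence_score` rules in the prompt above could be checked after the fact with something like the sketch below. The helper is hypothetical and not part of graphify. Treating INFERRED as a single 0.4-0.95 range (0.95 being the ceiling quoted for `semantically_similar_to` edges) is an assumption.

```python
from typing import Optional

# Allowed confidence_score range per confidence tag, read off the prompt rules.
SCORE_RANGES = {
    "EXTRACTED": (1.0, 1.0),
    "INFERRED": (0.4, 0.95),   # assumed: merges the general and similarity ranges
    "AMBIGUOUS": (0.1, 0.3),
}


def confidence_problem(edge: dict) -> Optional[str]:
    """Return a description of the rule an edge breaks, or None if it passes."""
    tag = edge.get("confidence")
    score = edge.get("confidence_score")
    if tag not in SCORE_RANGES:
        return f"unknown confidence tag: {tag!r}"
    if not isinstance(score, (int, float)):
        return "confidence_score is missing"
    low, high = SCORE_RANGES[tag]
    if not low <= score <= high:
        return f"{tag} score {score} is outside {low}-{high}"
    return None
```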
306
- **Step B3 - Collect, cache, and merge**
307
-
308
- Wait for all subagents. For each result:
309
- - Check that `graphify-out/.graphify_chunk_NN.json` exists on disk — this is the success signal
310
- - If the file exists and contains valid JSON with `nodes` and `edges`, include it and save to cache
311
- - If the file is missing, the subagent was likely dispatched as read-only (Explore type) — print a warning: "chunk N missing from disk — subagent may have been read-only. Re-run with general-purpose agent." Do not silently skip.
312
- - If a subagent failed or returned invalid JSON, print a warning and skip that chunk - do not abort
313
-
314
- If more than half the chunks failed or are missing, stop and tell the user to re-run and ensure `subagent_type="general-purpose"` is used.
315
-
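The collection rules above can be sketched as follows. The function name is hypothetical, and reading `NN` as a 1-based, zero-padded two-digit chunk number is an assumption about the file naming.

```python
import json
from pathlib import Path


def collect_chunks(out_dir: Path, total_chunks: int) -> tuple[dict, list[int]]:
    """Merge valid chunk files and report which chunk numbers failed."""
    merged = {"nodes": [], "edges": [], "hyperedges": []}
    failed = []
    for n in range(1, total_chunks + 1):
        path = out_dir / f".graphify_chunk_{n:02d}.json"
        try:
            data = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            # Missing file or invalid JSON: warn and skip, do not abort.
            print(f"warning: chunk {n} missing or invalid, skipping")
            failed.append(n)
            continue
        if not isinstance(data, dict) or "nodes" not in data or "edges" not in data:
            print(f"warning: chunk {n} lacks nodes/edges, skipping")
            failed.append(n)
            continue
        for key in merged:
            merged[key].extend(data.get(key, []))
    # More than half failed: stop instead of building a mostly empty graph.
    if len(failed) * 2 > total_chunks:
        raise RuntimeError('too many chunks failed; re-run with subagent_type="general-purpose"')
    return merged, failed
```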
316
- Save new results to cache:
317
- ```bash
318
- $(cat graphify-out/.graphify_python) -c "
319
- import json
320
- from graphify.cache import save_semantic_cache
321
- from pathlib import Path
322
-
323
- new = json.loads(Path('graphify-out/.graphify_semantic_new.json').read_text()) if Path('graphify-out/.graphify_semantic_new.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
324
- saved = save_semantic_cache(new.get('nodes', []), new.get('edges', []), new.get('hyperedges', []))
325
- print(f'Cached {saved} files')
326
- "
327
- ```
328
-
329
- Merge cached + new results into `graphify-out/.graphify_semantic.json`:
330
- ```bash
331
- $(cat graphify-out/.graphify_python) -c "
332
- import json
333
- from pathlib import Path
334
-
335
- cached = json.loads(Path('graphify-out/.graphify_cached.json').read_text()) if Path('graphify-out/.graphify_cached.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
336
- new = json.loads(Path('graphify-out/.graphify_semantic_new.json').read_text()) if Path('graphify-out/.graphify_semantic_new.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
337
-
338
- all_nodes = cached['nodes'] + new.get('nodes', [])
339
- all_edges = cached['edges'] + new.get('edges', [])
340
- all_hyperedges = cached.get('hyperedges', []) + new.get('hyperedges', [])
341
- seen = set()
342
- deduped = []
343
- for n in all_nodes:
343
-     if n['id'] not in seen:
344
-         seen.add(n['id'])
345
-         deduped.append(n)
347
-
348
- merged = {
349
- 'nodes': deduped,
350
- 'edges': all_edges,
351
- 'hyperedges': all_hyperedges,
352
- 'input_tokens': new.get('input_tokens', 0),
353
- 'output_tokens': new.get('output_tokens', 0),
354
- }
355
- Path('graphify-out/.graphify_semantic.json').write_text(json.dumps(merged, indent=2))
356
- print(f'Extraction complete - {len(deduped)} nodes, {len(all_edges)} edges ({len(cached[\"nodes\"])} from cache, {len(new.get(\"nodes\",[]))} new)')
357
- "
358
- ```
359
- Clean up temp files: `rm -f graphify-out/.graphify_cached.json graphify-out/.graphify_uncached.txt graphify-out/.graphify_semantic_new.json`
360
-
361
- #### Part C - Merge AST + semantic into final extraction
362
-
363
- ```bash
364
- $(cat graphify-out/.graphify_python) -c "
365
- import sys, json
366
- from pathlib import Path
367
-
368
- ast = json.loads(Path('graphify-out/.graphify_ast.json').read_text())
369
- sem = json.loads(Path('graphify-out/.graphify_semantic.json').read_text())
370
-
371
- # Merge: AST nodes first, semantic nodes deduplicated by id
372
- seen = {n['id'] for n in ast['nodes']}
373
- merged_nodes = list(ast['nodes'])
374
- for n in sem['nodes']:
374
-     if n['id'] not in seen:
375
-         merged_nodes.append(n)
376
-         seen.add(n['id'])
378
-
379
- merged_edges = ast['edges'] + sem['edges']
380
- merged_hyperedges = sem.get('hyperedges', [])
381
- merged = {
382
- 'nodes': merged_nodes,
383
- 'edges': merged_edges,
384
- 'hyperedges': merged_hyperedges,
385
- 'input_tokens': sem.get('input_tokens', 0),
386
- 'output_tokens': sem.get('output_tokens', 0),
387
- }
388
- Path('graphify-out/.graphify_extract.json').write_text(json.dumps(merged, indent=2))
389
- total = len(merged_nodes)
390
- edges = len(merged_edges)
391
- print(f'Merged: {total} nodes, {edges} edges ({len(ast[\"nodes\"])} AST + {len(sem[\"nodes\"])} semantic)')
392
- "
393
- ```
394
-
395
- ### Step 4 - Build graph, cluster, analyze, generate outputs
396
-
397
- **Before starting:** note whether `--directed` was given. If so, pass `directed=True` to `build_from_json()` in the code block below. This builds a `DiGraph` that preserves edge direction (source→target) instead of the default undirected `Graph`.
398
-
399
- ```bash
400
- mkdir -p graphify-out
401
- $(cat graphify-out/.graphify_python) -c "
402
- import sys, json
403
- from graphify.build import build_from_json
404
- from graphify.cluster import cluster, score_all
405
- from graphify.analyze import god_nodes, surprising_connections, suggest_questions
406
- from graphify.report import generate
407
- from graphify.export import to_json
408
- from pathlib import Path
409
-
410
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
411
- detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
412
-
413
- G = build_from_json(extraction)
414
- communities = cluster(G)
415
- cohesion = score_all(G, communities)
416
- tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
417
- gods = god_nodes(G)
418
- surprises = surprising_connections(G, communities)
419
- labels = {cid: 'Community ' + str(cid) for cid in communities}
420
- # Placeholder questions - regenerated with real labels in Step 5
421
- questions = suggest_questions(G, communities, labels)
422
-
423
- report = generate(G, communities, cohesion, labels, gods, surprises, detection, tokens, 'INPUT_PATH', suggested_questions=questions)
424
- Path('graphify-out/GRAPH_REPORT.md').write_text(report)
425
- to_json(G, communities, 'graphify-out/graph.json')
426
-
427
- analysis = {
428
- 'communities': {str(k): v for k, v in communities.items()},
429
- 'cohesion': {str(k): v for k, v in cohesion.items()},
430
- 'gods': gods,
431
- 'surprises': surprises,
432
- 'questions': questions,
433
- }
434
- Path('graphify-out/.graphify_analysis.json').write_text(json.dumps(analysis, indent=2))
435
- if G.number_of_nodes() == 0:
435
-     print('ERROR: Graph is empty - extraction produced no nodes.')
436
-     print('Possible causes: all files were skipped, binary-only corpus, or extraction failed.')
437
-     raise SystemExit(1)
439
- print(f'Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges, {len(communities)} communities')
440
- "
441
- ```
442
-
443
- If this step prints `ERROR: Graph is empty`, stop and tell the user what happened - do not proceed to labeling or visualization.
444
-
445
- Replace INPUT_PATH with the actual path.
446
-
447
- ### Step 5 - Label communities
448
-
449
- Read `graphify-out/.graphify_analysis.json`. For each community key, look at its node labels and write a 2-5 word plain-language name (e.g. "Attention Mechanism", "Training Pipeline", "Data Loading").
450
-
451
- Then regenerate the report and save the labels for the visualizer:
452
-
453
- ```bash
454
- $(cat graphify-out/.graphify_python) -c "
455
- import sys, json
456
- from graphify.build import build_from_json
457
- from graphify.cluster import score_all
458
- from graphify.analyze import god_nodes, surprising_connections, suggest_questions
459
- from graphify.report import generate
460
- from pathlib import Path
461
-
462
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
463
- detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
464
- analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
465
-
466
- G = build_from_json(extraction)
467
- communities = {int(k): v for k, v in analysis['communities'].items()}
468
- cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
469
- tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
470
-
471
- # LABELS - replace these with the names you chose above
472
- labels = LABELS_DICT
473
-
474
- # Regenerate questions with real community labels (labels affect question phrasing)
475
- questions = suggest_questions(G, communities, labels)
476
-
477
- report = generate(G, communities, cohesion, labels, analysis['gods'], analysis['surprises'], detection, tokens, 'INPUT_PATH', suggested_questions=questions)
478
- Path('graphify-out/GRAPH_REPORT.md').write_text(report)
479
- Path('graphify-out/.graphify_labels.json').write_text(json.dumps({str(k): v for k, v in labels.items()}))
480
- print('Report updated with community labels')
481
- "
482
- ```
483
-
484
- Replace `LABELS_DICT` with the actual dict you constructed (e.g. `{0: "Attention Mechanism", 1: "Training Pipeline"}`).
485
- Replace INPUT_PATH with the actual path.
486
-
487
- ### Step 6 - Generate Obsidian vault (opt-in) + HTML
488
-
489
- **Generate HTML always** (unless `--no-viz`). **Obsidian vault only if `--obsidian` was explicitly given** — skip it otherwise, it generates one file per node.
490
-
491
- If `--obsidian` was given:
492
-
493
- - If `--obsidian-dir <path>` was also given, use that path as the vault directory. Otherwise default to `graphify-out/obsidian`.
494
-
495
- ```bash
496
- $(cat graphify-out/.graphify_python) -c "
497
- import sys, json
498
- from graphify.build import build_from_json
499
- from graphify.export import to_obsidian, to_canvas
500
- from pathlib import Path
501
-
502
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
503
- analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
504
- labels_raw = json.loads(Path('graphify-out/.graphify_labels.json').read_text()) if Path('graphify-out/.graphify_labels.json').exists() else {}
505
-
506
- G = build_from_json(extraction)
507
- communities = {int(k): v for k, v in analysis['communities'].items()}
508
- cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
509
- labels = {int(k): v for k, v in labels_raw.items()}
510
-
511
- obsidian_dir = 'OBSIDIAN_DIR' # replace with --obsidian-dir value, or 'graphify-out/obsidian' if not given
512
-
513
- n = to_obsidian(G, communities, obsidian_dir, community_labels=labels or None, cohesion=cohesion)
514
- print(f'Obsidian vault: {n} notes in {obsidian_dir}/')
515
-
516
- to_canvas(G, communities, f'{obsidian_dir}/graph.canvas', community_labels=labels or None)
517
- print(f'Canvas: {obsidian_dir}/graph.canvas - open in Obsidian for structured community layout')
518
- print()
519
- print(f'Open {obsidian_dir}/ as a vault in Obsidian.')
520
- print(' Graph view - nodes colored by community (set automatically)')
521
- print(' graph.canvas - structured layout with communities as groups')
522
- print(' _COMMUNITY_* - overview notes with cohesion scores and dataview queries')
523
- "
524
- ```
525
-
526
- Generate the HTML graph (always, unless `--no-viz`):
527
-
528
- ```bash
529
- $(cat graphify-out/.graphify_python) -c "
530
- import sys, json
531
- from graphify.build import build_from_json
532
- from graphify.export import to_html
533
- from pathlib import Path
534
-
535
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
536
- analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
537
- labels_raw = json.loads(Path('graphify-out/.graphify_labels.json').read_text()) if Path('graphify-out/.graphify_labels.json').exists() else {}
538
-
539
- G = build_from_json(extraction)
540
- communities = {int(k): v for k, v in analysis['communities'].items()}
541
- labels = {int(k): v for k, v in labels_raw.items()}
542
-
543
- if G.number_of_nodes() > 5000:
543
-     print(f'Graph has {G.number_of_nodes()} nodes - too large for HTML viz. Use Obsidian vault instead.')
544
- else:
545
-     to_html(G, communities, 'graphify-out/graph.html', community_labels=labels or None)
546
-     print('graph.html written - open in any browser, no server needed')
548
- "
549
- ```
550
-
551
- ### Step 6b - Wiki (only if --wiki flag)
552
-
553
- **Only run this step if `--wiki` was explicitly given in the original command.**
554
-
555
- Run this before Step 9 (cleanup) so `.graphify_labels.json` is still available.
556
-
557
- ```bash
558
- $(cat graphify-out/.graphify_python) -c "
559
- import json
560
- from graphify.build import build_from_json
561
- from graphify.wiki import to_wiki
562
- from graphify.analyze import god_nodes
563
- from pathlib import Path
564
-
565
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
566
- analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
567
- labels_raw = json.loads(Path('graphify-out/.graphify_labels.json').read_text()) if Path('graphify-out/.graphify_labels.json').exists() else {}
568
-
569
- G = build_from_json(extraction)
570
- communities = {int(k): v for k, v in analysis['communities'].items()}
571
- cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
572
- labels = {int(k): v for k, v in labels_raw.items()}
573
- gods = god_nodes(G)
574
-
575
- n = to_wiki(G, communities, 'graphify-out/wiki', community_labels=labels or None, cohesion=cohesion, god_nodes_data=gods)
576
- print(f'Wiki: {n} articles written to graphify-out/wiki/')
577
- print(' graphify-out/wiki/index.md -> agent entry point')
578
- "
579
- ```
580
-
581
- ### Step 7 - Neo4j export (only if --neo4j or --neo4j-push flag)
582
-
583
- **If `--neo4j`** - generate a Cypher file for manual import:
584
-
585
- ```bash
586
- $(cat graphify-out/.graphify_python) -c "
587
- import sys, json
588
- from graphify.build import build_from_json
589
- from graphify.export import to_cypher
590
- from pathlib import Path
591
-
592
- G = build_from_json(json.loads(Path('graphify-out/.graphify_extract.json').read_text()))
593
- to_cypher(G, 'graphify-out/cypher.txt')
594
- print('cypher.txt written - import with: cypher-shell < graphify-out/cypher.txt')
595
- "
596
- ```
597
-
598
- **If `--neo4j-push <uri>`** - push directly to a running Neo4j instance. Ask the user for credentials if not provided:
599
-
600
- ```bash
601
- $(cat graphify-out/.graphify_python) -c "
602
- import sys, json
603
- from graphify.build import build_from_json
604
- from graphify.cluster import cluster
605
- from graphify.export import push_to_neo4j
606
- from pathlib import Path
607
-
608
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
609
- analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
610
- G = build_from_json(extraction)
611
- communities = {int(k): v for k, v in analysis['communities'].items()}
612
-
613
- result = push_to_neo4j(G, uri='NEO4J_URI', user='NEO4J_USER', password='NEO4J_PASSWORD', communities=communities)
614
- print(f'Pushed to Neo4j: {result[\"nodes\"]} nodes, {result[\"edges\"]} edges')
615
- "
616
- ```
617
-
618
- Replace `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` with actual values. The default URI is `bolt://localhost:7687` and the default user is `neo4j`. The push uses `MERGE`, so it is safe to re-run without creating duplicates.
619
-
620
- ### Step 7b - SVG export (only if --svg flag)
621
-
622
- ```bash
623
- $(cat graphify-out/.graphify_python) -c "
624
- import sys, json
625
- from graphify.build import build_from_json
626
- from graphify.export import to_svg
627
- from pathlib import Path
628
-
629
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
630
- analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
631
- labels_raw = json.loads(Path('graphify-out/.graphify_labels.json').read_text()) if Path('graphify-out/.graphify_labels.json').exists() else {}
632
-
633
- G = build_from_json(extraction)
634
- communities = {int(k): v for k, v in analysis['communities'].items()}
635
- labels = {int(k): v for k, v in labels_raw.items()}
636
-
637
- to_svg(G, communities, 'graphify-out/graph.svg', community_labels=labels or None)
638
- print('graph.svg written - embeds in Obsidian, Notion, GitHub READMEs')
639
- "
640
- ```
641
-
642
- ### Step 7c - GraphML export (only if --graphml flag)
643
-
644
- ```bash
645
- $(cat graphify-out/.graphify_python) -c "
646
- import json
647
- from graphify.build import build_from_json
648
- from graphify.export import to_graphml
649
- from pathlib import Path
650
-
651
- extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
652
- analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
653
-
654
- G = build_from_json(extraction)
655
- communities = {int(k): v for k, v in analysis['communities'].items()}
656
-
657
- to_graphml(G, communities, 'graphify-out/graph.graphml')
658
- print('graph.graphml written - open in Gephi, yEd, or any GraphML tool')
659
- "
660
- ```
661
-
662
- ### Step 7d - MCP server (only if --mcp flag)
663
-
664
- ```bash
665
- python3 -m graphify.serve graphify-out/graph.json
666
- ```
667
-
668
- This starts a stdio MCP server that exposes tools: `query_graph`, `get_node`, `get_neighbors`, `get_community`, `god_nodes`, `graph_stats`, `shortest_path`. Add to Claude Desktop or any MCP-compatible agent orchestrator so other agents can query the graph live.
669
-
670
- To configure in Claude Desktop, add to `claude_desktop_config.json`:
671
- ```json
672
- {
673
- "mcpServers": {
674
- "graphify": {
675
- "command": "python3",
676
- "args": ["-m", "graphify.serve", "/absolute/path/to/graphify-out/graph.json"]
677
- }
678
- }
679
- }
680
- ```
681
-
682
- ### Step 8 - Token reduction benchmark (only if total_words > 5000)
683
-
684
- If `total_words` from `graphify-out/.graphify_detect.json` is greater than 5,000, run:
685
-
686
- ```bash
687
- $(cat graphify-out/.graphify_python) -c "
688
- import json
689
- from graphify.benchmark import run_benchmark, print_benchmark
690
- from pathlib import Path
691
-
692
- detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
693
- result = run_benchmark('graphify-out/graph.json', corpus_words=detection['total_words'])
694
- print_benchmark(result)
695
- "
696
- ```
697
-
698
- Print the output directly in chat. If `total_words <= 5000`, skip silently: for small corpora the graph's value is structural clarity, not token compression.
699
-
700
- ---
701
-
702
- ### Step 9 - Save manifest, update cost tracker, clean up, and report
703
-
704
- ```bash
705
- $(cat graphify-out/.graphify_python) -c "
706
- import json
707
- from pathlib import Path
708
- from datetime import datetime, timezone
709
- from graphify.detect import save_manifest
710
-
711
- # Save manifest for --update
712
- detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
713
- save_manifest(detect['files'])
714
-
715
- # Update cumulative cost tracker
716
- extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
717
- input_tok = extract.get('input_tokens', 0)
718
- output_tok = extract.get('output_tokens', 0)
719
-
720
- cost_path = Path('graphify-out/cost.json')
721
- if cost_path.exists():
721
-     cost = json.loads(cost_path.read_text())
722
- else:
723
-     cost = {'runs': [], 'total_input_tokens': 0, 'total_output_tokens': 0}
724
-
725
- cost['runs'].append({
726
-     'date': datetime.now(timezone.utc).isoformat(),
727
-     'input_tokens': input_tok,
728
-     'output_tokens': output_tok,
729
-     'files': detect.get('total_files', 0),
730
- })
732
- cost['total_input_tokens'] += input_tok
733
- cost['total_output_tokens'] += output_tok
734
- cost_path.write_text(json.dumps(cost, indent=2))
735
-
736
- print(f'This run: {input_tok:,} input tokens, {output_tok:,} output tokens')
737
- print(f'All time: {cost[\"total_input_tokens\"]:,} input, {cost[\"total_output_tokens\"]:,} output ({len(cost[\"runs\"])} runs)')
738
- "
739
- rm -f graphify-out/.graphify_detect.json graphify-out/.graphify_extract.json graphify-out/.graphify_ast.json graphify-out/.graphify_semantic.json graphify-out/.graphify_analysis.json graphify-out/.graphify_labels.json
740
- rm -f graphify-out/.needs_update 2>/dev/null || true
741
- ```
742
-
743
- Tell the user (omit the obsidian line unless --obsidian was given):
744
- ```
745
- Graph complete. Outputs in PATH_TO_DIR/graphify-out/
746
-
747
- graph.html - interactive graph, open in browser
748
- GRAPH_REPORT.md - audit report
749
- graph.json - raw graph data
750
- obsidian/ - Obsidian vault (only if --obsidian was given)
751
- ```
752
-
753
- If graphify saved you time, consider supporting it: https://github.com/sponsors/safishamsi
754
-
755
- Replace PATH_TO_DIR with the actual absolute path of the directory that was processed.
756
-
757
- Then paste these sections from GRAPH_REPORT.md directly into the chat:
758
- - God Nodes
759
- - Surprising Connections
760
- - Suggested Questions
761
-
762
- Do NOT paste the full report - just those three sections. Keep it concise.
763
-
764
- Then immediately offer to explore. Pick the single most interesting suggested question from the report - the one that crosses the most community boundaries or has the most surprising bridge node - and ask:
765
-
766
- > "The most interesting question this graph can answer: **[question]**. Want me to trace it?"
767
-
768
- If the user says yes, run `/graphify query "[question]"` on the graph and walk them through the answer using the graph structure - which nodes connect, which community boundaries get crossed, what the path reveals. Keep going as long as they want to explore. Each answer should end with a natural follow-up ("this connects to X - want to go deeper?") so the session feels like navigation, not a one-shot report.
769
-
770
- The graph is the map. Your job after the pipeline is to be the guide.
771
-
772
- ---
773
-
774
- ## Interpreter guard for subcommands
775
-
776
- Before running any subcommand below (`--update`, `--cluster-only`, `query`, `path`, `explain`, `add`), check that `graphify-out/.graphify_python` exists. If it's missing (e.g. the user deleted `graphify-out/`), re-resolve the interpreter first:
777
-
778
- ```bash
779
- if [ ! -f graphify-out/.graphify_python ]; then
780
- GRAPHIFY_BIN=$(which graphify 2>/dev/null)
781
- if [ -n "$GRAPHIFY_BIN" ]; then
782
- PYTHON=$(head -1 "$GRAPHIFY_BIN" | tr -d '#!')
783
- case "$PYTHON" in *[!a-zA-Z0-9/_.-]*) PYTHON="python3" ;; esac
784
- else
785
- PYTHON="python3"
786
- fi
787
- mkdir -p graphify-out
788
- "$PYTHON" -c "import sys; open('graphify-out/.graphify_python', 'w').write(sys.executable)"
789
- fi
790
- ```
791
-
792
- ## For --update (incremental re-extraction)
793
-
794
- Use when you've added or modified files since the last run. Only re-extracts changed files - saves tokens and time.
795
-
796
- ```bash
797
- $(cat graphify-out/.graphify_python) -c "
798
- import sys, json
799
- from graphify.detect import detect_incremental, save_manifest
800
- from pathlib import Path
801
-
802
- result = detect_incremental(Path('INPUT_PATH'))
803
- new_total = result.get('new_total', 0)
804
- print(json.dumps(result, indent=2))
805
- Path('graphify-out/.graphify_incremental.json').write_text(json.dumps(result))
806
- if new_total == 0:
806
-     print('No files changed since last run. Nothing to update.')
807
-     raise SystemExit(0)
809
- print(f'{new_total} new/changed file(s) to re-extract.')
810
- "
811
- ```
812
-
813
- If new files exist, first check whether all changed files are code files:
814
-
815
- ```bash
816
- $(cat graphify-out/.graphify_python) -c "
817
- import json
818
- from pathlib import Path
819
-
820
- result = json.loads(Path('graphify-out/.graphify_incremental.json').read_text()) if Path('graphify-out/.graphify_incremental.json').exists() else {}
821
- code_exts = {'.py','.ts','.js','.go','.rs','.java','.cpp','.c','.rb','.swift','.kt','.cs','.scala','.php','.cc','.cxx','.hpp','.h','.kts','.lua','.toc'}
822
- new_files = result.get('new_files', {})
823
- all_changed = [f for files in new_files.values() for f in files]
824
- code_only = all(Path(f).suffix.lower() in code_exts for f in all_changed)
825
- print('code_only:', code_only)
826
- "
827
- ```
828
-
829
- If `code_only` is True: print `[graphify update] Code-only changes detected - skipping semantic extraction (no LLM needed)`, run only Step 3A (AST) on the changed files, skip Step 3B entirely (no subagents), then go straight to merge and Steps 4–8.
830
-
831
- If `code_only` is False (any changed file is a doc/paper/image): run the full Steps 3A–3C pipeline as normal.
832
-
833
- Then:
834
-
835
- ```bash
836
- $(cat graphify-out/.graphify_python) -c "
837
- import sys, json
838
- from graphify.build import build_from_json
839
- from graphify.export import to_json
840
- from networkx.readwrite import json_graph
841
- import networkx as nx
842
- from pathlib import Path
843
-
844
- # Load existing graph
845
- existing_data = json.loads(Path('graphify-out/graph.json').read_text())
846
- G_existing = json_graph.node_link_graph(existing_data, edges='links')
847
-
848
- # Load new extraction
849
- new_extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
850
- G_new = build_from_json(new_extraction)
851
-
852
- # Prune nodes from deleted files
853
- incremental = json.loads(Path('graphify-out/.graphify_incremental.json').read_text())
854
- deleted = set(incremental.get('deleted_files', []))
855
- if deleted:
855
-     to_remove = [n for n, d in G_existing.nodes(data=True) if d.get('source_file') in deleted]
856
-     G_existing.remove_nodes_from(to_remove)
857
-     print(f'Pruned {len(to_remove)} ghost nodes from {len(deleted)} deleted file(s)')
859
-
860
- # Merge: new nodes/edges into existing graph
861
- G_existing.update(G_new)
862
- print(f'Merged: {G_existing.number_of_nodes()} nodes, {G_existing.number_of_edges()} edges')
863
- "
864
- ```
865
-
866
- Then run Steps 4–8 on the merged graph as normal.
867
-
868
- After Step 4, show the graph diff:
869
-
870
- ```bash
871
- $(cat graphify-out/.graphify_python) -c "
872
- import json
873
- from graphify.analyze import graph_diff
874
- from graphify.build import build_from_json
875
- from networkx.readwrite import json_graph
876
- import networkx as nx
877
- from pathlib import Path
878
-
879
- # Load old graph (before update) from backup written before merge
880
- old_data = json.loads(Path('graphify-out/.graphify_old.json').read_text()) if Path('graphify-out/.graphify_old.json').exists() else None
881
- new_extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
882
- G_new = build_from_json(new_extract)
883
-
884
- if old_data:
884
-     G_old = json_graph.node_link_graph(old_data, edges='links')
885
-     diff = graph_diff(G_old, G_new)
886
-     print(diff['summary'])
887
-     if diff['new_nodes']:
888
-         print('New nodes:', ', '.join(n['label'] for n in diff['new_nodes'][:5]))
889
-     if diff['new_edges']:
890
-         print('New edges:', len(diff['new_edges']))
892
- "
893
- ```
894
-
895
- Before the merge step above, back up the old graph so the diff has a baseline: `cp graphify-out/graph.json graphify-out/.graphify_old.json`
896
- Clean up after: `rm -f graphify-out/.graphify_old.json`
897
-
898
- ---
899
-
900
- ## For --cluster-only
901
-
902
- Skip Steps 1–3. Load the existing graph from `graphify-out/graph.json` and re-run clustering:
903
-
904
- ```bash
905
- $(cat graphify-out/.graphify_python) -c "
906
- import sys, json
907
- from graphify.cluster import cluster, score_all
908
- from graphify.analyze import god_nodes, surprising_connections
909
- from graphify.report import generate
910
- from graphify.export import to_json
911
- from networkx.readwrite import json_graph
912
- import networkx as nx
913
- from pathlib import Path
914
-
915
- data = json.loads(Path('graphify-out/graph.json').read_text())
916
- G = json_graph.node_link_graph(data, edges='links')
917
-
918
- detection = {'total_files': 0, 'total_words': 99999, 'needs_graph': True, 'warning': None,
919
- 'files': {'code': [], 'document': [], 'paper': []}}
920
- tokens = {'input': 0, 'output': 0}
921
-
922
- communities = cluster(G)
923
- cohesion = score_all(G, communities)
924
- gods = god_nodes(G)
925
- surprises = surprising_connections(G, communities)
926
- labels = {cid: 'Community ' + str(cid) for cid in communities}
927
-
928
- report = generate(G, communities, cohesion, labels, gods, surprises, detection, tokens, '.')
929
- Path('graphify-out/GRAPH_REPORT.md').write_text(report)
930
- to_json(G, communities, 'graphify-out/graph.json')
931
-
932
- analysis = {
933
- 'communities': {str(k): v for k, v in communities.items()},
934
- 'cohesion': {str(k): v for k, v in cohesion.items()},
935
- 'gods': gods,
936
- 'surprises': surprises,
937
- }
938
- Path('graphify-out/.graphify_analysis.json').write_text(json.dumps(analysis, indent=2))
939
- print(f'Re-clustered: {len(communities)} communities')
940
- "
941
- ```
942
-
943
- Then run Steps 5–9 as normal (label communities, generate viz, benchmark, clean up, report).
944
-
945
- ---
946
-
947
- ## For /graphify query
948
-
949
- Two traversal modes - choose based on the question:
950
-
951
- | Mode | Flag | Best for |
952
- |------|------|----------|
953
- | BFS (default) | _(none)_ | "What is X connected to?" - broad context, nearest neighbors first |
954
- | DFS | `--dfs` | "How does X reach Y?" - trace a specific chain or dependency path |
955
-
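The difference between the two modes can be seen on a toy graph. This is a minimal sketch, not graphify's own traversal code: the adjacency list and node names are made up for illustration, and the depth limits (3 for BFS, 6 for DFS) simply mirror the ones used in the query script further down.

```python
from collections import deque

# Toy adjacency list; node names are hypothetical.
GRAPH = {
    "api": ["auth", "db"],
    "auth": ["tokens"],
    "db": ["schema"],
    "tokens": ["crypto"],
    "schema": [],
    "crypto": [],
}


def bfs(graph, start, max_depth=3):
    """Visit nodes layer by layer, nearest neighbors first."""
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


def dfs(graph, start, max_depth=6):
    """Follow one chain as deep as possible before backtracking."""
    seen = []
    stack = [(start, 0)]
    while stack:
        node, depth = stack.pop()
        if node in seen or depth > max_depth:
            continue
        seen.append(node)
        # Push in reverse so the first-listed neighbor is explored first.
        for neighbor in reversed(graph[node]):
            if neighbor not in seen:
                stack.append((neighbor, depth + 1))
    return seen


print("BFS:", bfs(GRAPH, "api"))
print("DFS:", dfs(GRAPH, "api"))
```

BFS reaches both direct neighbors of `api` before anything two hops away, which suits "what is X connected to?" questions. DFS runs down the `auth` chain to its end before it touches `db`, which suits tracing a single dependency path.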
956
- First check the graph exists:
957
- ```bash
958
- $(cat graphify-out/.graphify_python) -c "
959
- from pathlib import Path
960
- if not Path('graphify-out/graph.json').exists():
960
-     print('ERROR: No graph found. Run /graphify <path> first to build the graph.')
961
-     raise SystemExit(1)
963
- "
964
- ```
965
- If it fails, stop and tell the user to run `/graphify <path>` first.
966
-
967
- Load `graphify-out/graph.json`, then:
968
-
969
- 1. Find the 1-3 nodes whose label best matches key terms in the question.
970
- 2. Run the appropriate traversal from each starting node.
971
- 3. Read the subgraph - node labels, edge relations, confidence tags, source locations.
972
- 4. Answer using **only** what the graph contains. Quote `source_location` when citing a specific fact.
973
- 5. If the graph lacks enough information, say so - do not hallucinate edges.
974
-
975
- ```bash
976
- $(cat graphify-out/.graphify_python) -c "
977
- import sys, json
978
- from networkx.readwrite import json_graph
979
- import networkx as nx
980
- from pathlib import Path
981
-
982
- data = json.loads(Path('graphify-out/graph.json').read_text())
983
- G = json_graph.node_link_graph(data, edges='links')
984
-
985
- question = 'QUESTION'
986
- mode = 'MODE' # 'bfs' or 'dfs'
987
- terms = [t.lower() for t in question.split() if len(t) > 3]
988
-
989
- # Find best-matching start nodes
990
- scored = []
991
- for nid, ndata in G.nodes(data=True):
991
-     label = ndata.get('label', '').lower()
992
-     score = sum(1 for t in terms if t in label)
993
-     if score > 0:
994
-         scored.append((score, nid))
996
- scored.sort(reverse=True)
997
- start_nodes = [nid for _, nid in scored[:3]]
998
-
999
- if not start_nodes:
1000
-     print('No matching nodes found for query terms:', terms)
1001
-     sys.exit(0)
1002
-
1003
- subgraph_nodes = set()
1004
- subgraph_edges = []
1005
-
1006
- if mode == 'dfs':
1007
-     # DFS: follow one path as deep as possible before backtracking.
1008
-     # Depth-limited to 6 to avoid traversing the whole graph.
1009
-     visited = set()
1010
-     stack = [(n, 0) for n in reversed(start_nodes)]
1011
-     while stack:
1012
-         node, depth = stack.pop()
1013
-         if node in visited or depth > 6:
1014
-             continue
1015
-         visited.add(node)
1016
-         subgraph_nodes.add(node)
1017
-         for neighbor in G.neighbors(node):
1018
-             if neighbor not in visited:
1019
-                 stack.append((neighbor, depth + 1))
1020
-                 subgraph_edges.append((node, neighbor))
1021
- else:
1022
-     # BFS: explore all neighbors layer by layer up to depth 3.
1023
-     frontier = set(start_nodes)
1024
-     subgraph_nodes = set(start_nodes)
1025
-     for _ in range(3):
1026
-         next_frontier = set()
1027
-         for n in frontier:
1028
-             for neighbor in G.neighbors(n):
1029
-                 if neighbor not in subgraph_nodes:
1030
-                     next_frontier.add(neighbor)
1031
-                     subgraph_edges.append((n, neighbor))
1032
-         subgraph_nodes.update(next_frontier)
1033
-         frontier = next_frontier
1034
-
1035
- # Token-budget aware output: rank by relevance, cut at budget (~4 chars/token)
1036
- token_budget = BUDGET # default 2000
1037
- char_budget = token_budget * 4
1038
-
1039
- # Score each node by term overlap for ranked output
1040
- def relevance(nid):
1040
-     label = G.nodes[nid].get('label', '').lower()
1041
-     return sum(1 for t in terms if t in label)
1043
-
1044
- ranked_nodes = sorted(subgraph_nodes, key=relevance, reverse=True)
1045
-
1046
- lines = [f'Traversal: {mode.upper()} | Start: {[G.nodes[n].get(\"label\",n) for n in start_nodes]} | {len(subgraph_nodes)} nodes']
1047
- for nid in ranked_nodes:
1048
-     d = G.nodes[nid]
1049
-     lines.append(f' NODE {d.get(\"label\", nid)} [src={d.get(\"source_file\",\"\")} loc={d.get(\"source_location\",\"\")}]')
1050
- for u, v in subgraph_edges:
1051
-     if u in subgraph_nodes and v in subgraph_nodes:
1052
-         d = G.edges[u, v]
1053
-         lines.append(f' EDGE {G.nodes[u].get(\"label\",u)} --{d.get(\"relation\",\"\")} [{d.get(\"confidence\",\"\")}]--> {G.nodes[v].get(\"label\",v)}')
1054
-
1055
- output = '\n'.join(lines)
1056
- if len(output) > char_budget:
1056
-     output = output[:char_budget] + f'\n... (truncated at ~{token_budget} token budget - use --budget N for more)'
1058
- print(output)
1059
- "
1060
- ```
1061
-
1062
- Replace `QUESTION` with the user's actual question, `MODE` with `bfs` or `dfs`, and `BUDGET` with the token budget (default `2000`, or whatever `--budget N` specifies). Then answer based on the subgraph output above.
1063
-
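The budget cut in the script above is a plain character cap derived from a rough 4-characters-per-token ratio. A minimal standalone sketch of that arithmetic follows; the function name `truncate_to_budget` and its `chars_per_token` parameter are hypothetical names used only for this illustration.

```python
def truncate_to_budget(text: str, token_budget: int = 2000, chars_per_token: int = 4) -> str:
    """Cap text at roughly token_budget tokens using a chars-per-token estimate."""
    char_budget = token_budget * chars_per_token
    if len(text) <= char_budget:
        return text  # already within budget, return unchanged
    notice = f"\n... (truncated at ~{token_budget} token budget - use --budget N for more)"
    return text[:char_budget] + notice


# A 50-character string against a 10-token budget keeps 10 * 4 = 40 characters.
sample = truncate_to_budget("x" * 50, token_budget=10)
print(len(sample.split("\n")[0]))
```

Because the node lines are ranked by relevance before the cut, the lines that survive truncation are the ones most likely to matter for the question. The ratio is only an estimate, so the real token count can drift in either direction.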
1064
- After writing the answer, save it back into the graph so it improves future queries:
1065
-
1066
- ```bash
1067
- $(cat graphify-out/.graphify_python) -m graphify save-result --question "QUESTION" --answer "ANSWER" --type query --nodes NODE1 NODE2
1068
- ```
1069
-
1070
- Replace `QUESTION` with the question, `ANSWER` with your full answer text, and `NODE1 NODE2` with the labels of the nodes you cited. This closes the feedback loop: the next `--update` will extract this Q&A as a node in the graph.
1071
-
1072
- ---
1073
-
1074
- ## For /graphify path
1075
-
1076
- Find the shortest path between two named concepts in the graph.
1077
-
1078
- First check the graph exists:
1079
- ```bash
1080
- $(cat graphify-out/.graphify_python) -c "
1081
- from pathlib import Path
1082
- if not Path('graphify-out/graph.json').exists():
1082
-     print('ERROR: No graph found. Run /graphify <path> first to build the graph.')
1083
-     raise SystemExit(1)
1085
- "
1086
- ```
1087
- If it fails, stop and tell the user to run `/graphify <path>` first.
1088
-
1089
- ```bash
1090
- $(cat graphify-out/.graphify_python) -c "
1091
- import json, sys
1092
- import networkx as nx
1093
- from networkx.readwrite import json_graph
1094
- from pathlib import Path
1095
-
1096
- data = json.loads(Path('graphify-out/graph.json').read_text())
1097
- G = json_graph.node_link_graph(data, edges='links')
1098
-
1099
- a_term = 'NODE_A'
1100
- b_term = 'NODE_B'
1101
-
1102
- def find_node(term):
1103
-     term = term.lower()
1104
-     scored = sorted(
1105
-         [(sum(1 for w in term.split() if w in G.nodes[n].get('label','').lower()), n)
1106
-          for n in G.nodes()],
1107
-         reverse=True
1108
-     )
1109
-     return scored[0][1] if scored and scored[0][0] > 0 else None
1110
-
1111
- src = find_node(a_term)
1112
- tgt = find_node(b_term)
1113
-
1114
- if not src or not tgt:
1114
-     print(f'Could not find nodes matching: {a_term!r} or {b_term!r}')
1115
-     sys.exit(0)
1117
-
1118
- try:
1119
-     path = nx.shortest_path(G, src, tgt)
1120
-     print(f'Shortest path ({len(path)-1} hops):')
1121
-     for i, nid in enumerate(path):
1122
-         label = G.nodes[nid].get('label', nid)
1123
-         if i < len(path) - 1:
1124
-             edge = G.edges[nid, path[i+1]]
1125
-             rel = edge.get('relation', '')
1126
-             conf = edge.get('confidence', '')
1127
-             print(f' {label} --{rel}--> [{conf}]')
1128
-         else:
1129
-             print(f' {label}')
1130
- except nx.NetworkXNoPath:
1131
-     print(f'No path found between {a_term!r} and {b_term!r}')
1132
- except nx.NodeNotFound as e:
1133
-     print(f'Node not found: {e}')
1134
- "
1135
- ```
1136
-
1137
- Replace `NODE_A` and `NODE_B` with the actual concept names from the user. Then explain the path in plain language - what each hop means, why it's significant.
1138
-
1139
- After writing the explanation, save it back:
1140
-
1141
- ```bash
1142
- $(cat graphify-out/.graphify_python) -m graphify save-result --question "Path from NODE_A to NODE_B" --answer "ANSWER" --type path_query --nodes NODE_A NODE_B
1143
- ```
1144
-
1145
- ---
1146
-
1147
- ## For /graphify explain
1148
-
1149
- Give a plain-language explanation of a single node - everything connected to it.
1150
-
1151
- First check the graph exists:
1152
- ```bash
1153
- $(cat graphify-out/.graphify_python) -c "
1154
- from pathlib import Path
1155
- if not Path('graphify-out/graph.json').exists():
1155
-     print('ERROR: No graph found. Run /graphify <path> first to build the graph.')
1156
-     raise SystemExit(1)
1158
- "
1159
- ```
1160
- If it fails, stop and tell the user to run `/graphify <path>` first.
1161
-
1162
- ```bash
1163
- $(cat graphify-out/.graphify_python) -c "
1164
- import json, sys
1165
- import networkx as nx
1166
- from networkx.readwrite import json_graph
1167
- from pathlib import Path
1168
-
1169
- data = json.loads(Path('graphify-out/graph.json').read_text())
1170
- G = json_graph.node_link_graph(data, edges='links')
1171
-
1172
- term = 'NODE_NAME'
1173
- term_lower = term.lower()
1174
-
1175
- # Find best matching node
1176
- scored = sorted(
1177
- [(sum(1 for w in term_lower.split() if w in G.nodes[n].get('label','').lower()), n)
1178
- for n in G.nodes()],
1179
- reverse=True
1180
- )
1181
- if not scored or scored[0][0] == 0:
1181
-     print(f'No node matching {term!r}')
1182
-     sys.exit(0)
1184
-
1185
- nid = scored[0][1]
1186
- data_n = G.nodes[nid]
1187
- print(f'NODE: {data_n.get(\"label\", nid)}')
1188
- print(f' source: {data_n.get(\"source_file\",\"unknown\")}')
1189
- print(f' type: {data_n.get(\"file_type\",\"unknown\")}')
1190
- print(f' degree: {G.degree(nid)}')
1191
- print()
1192
- print('CONNECTIONS:')
1193
- for neighbor in G.neighbors(nid):
1193
-     edge = G.edges[nid, neighbor]
1194
-     nlabel = G.nodes[neighbor].get('label', neighbor)
1195
-     rel = edge.get('relation', '')
1196
-     conf = edge.get('confidence', '')
1197
-     src_file = G.nodes[neighbor].get('source_file', '')
1198
-     print(f' --{rel}--> {nlabel} [{conf}] ({src_file})')
1200
- "
1201
- ```
1202
-
1203
- Replace `NODE_NAME` with the concept the user asked about. Then write a 3-5 sentence explanation of what this node is, what it connects to, and why those connections are significant. Use the source locations as citations.
1204
-
1205
- After writing the explanation, save it back:
1206
-
1207
- ```bash
1208
- $(cat graphify-out/.graphify_python) -m graphify save-result --question "Explain NODE_NAME" --answer "ANSWER" --type explain --nodes NODE_NAME
1209
- ```
1210
-
1211
- ---
1212
-
1213
- ## For /graphify add
1214
-
1215
- Fetch a URL and add it to the corpus, then update the graph.
1216
-
1217
- ```bash
1218
- $(cat graphify-out/.graphify_python) -c "
1219
- import sys
1220
- from graphify.ingest import ingest
1221
- from pathlib import Path
1222
-
1223
- try:
1224
-     out = ingest('URL', Path('./raw'), author='AUTHOR', contributor='CONTRIBUTOR')
1225
-     print(f'Saved to {out}')
1226
- except ValueError as e:
1227
-     print(f'error: {e}', file=sys.stderr)
1228
-     sys.exit(1)
1229
- except RuntimeError as e:
1230
-     print(f'error: {e}', file=sys.stderr)
1231
-     sys.exit(1)
1232
- "
1233
- ```
1234
-
1235
- Replace `URL` with the actual URL, `AUTHOR` with the user's name if provided, `CONTRIBUTOR` likewise. If the command exits with an error, tell the user what went wrong - do not silently continue. After a successful save, automatically run the `--update` pipeline on `./raw` to merge the new file into the existing graph.
1236
-
1237
- Supported URL types (auto-detected):
1238
- - YouTube / any video URL → audio downloaded via yt-dlp, transcribed to `.txt` on next run (requires `pip install 'graphifyy[video]'`)
1239
- - Twitter/X → fetched via oEmbed, saved as `.md` with tweet text and author
1240
- - arXiv → abstract + metadata saved as `.md`
1241
- - PDF → downloaded as `.pdf`
1242
- - Images (.png/.jpg/.webp) → downloaded, Claude vision extracts on next run
1243
- - Any webpage → converted to markdown via html2text
1244
-
1245
- ---
1246
-
1247
- ## For --watch
1248
-
1249
- Start a background watcher that monitors a folder and auto-updates the graph when files change.
1250
-
1251
- ```bash
1252
- python3 -m graphify.watch INPUT_PATH --debounce 3
1253
- ```
1254
-
1255
- Replace INPUT_PATH with the folder to watch. Behavior depends on what changed:
1256
-
1257
- - **Code files only (.py, .ts, .go, etc.):** re-runs AST extraction + rebuild + cluster immediately, no LLM needed. `graph.json` and `GRAPH_REPORT.md` are updated automatically.
1258
- - **Docs, papers, or images:** writes a `graphify-out/.needs_update` flag (the same file Step 9 removes) and prints a notification to run `/graphify --update` (LLM semantic re-extraction required).
1259
-
1260
- Debounce (default 3s): waits until file activity stops before triggering, so a wave of parallel agent writes doesn't trigger a rebuild per file.
1261
-
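The debounce behavior can be sketched independently of any file-watching library. The following is an illustration only, not the code in `graphify.watch`: the `Debouncer` class and its method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow and to test.

```python
class Debouncer:
    """Collects change events and releases them only after a quiet period."""

    def __init__(self, quiet_seconds: float = 3.0):
        self.quiet_seconds = quiet_seconds
        self.pending = set()
        self.last_event = None

    def record(self, path: str, now: float) -> None:
        """Remember a changed path and restart the quiet-period timer."""
        self.pending.add(path)
        self.last_event = now

    def flush_if_quiet(self, now: float):
        """Return the batch of changed paths once activity has stopped, else None."""
        if not self.pending or now - self.last_event < self.quiet_seconds:
            return None
        batch, self.pending = sorted(self.pending), set()
        return batch


d = Debouncer(quiet_seconds=3.0)
d.record("a.py", now=0.0)
d.record("b.py", now=1.0)      # a second write restarts the timer
print(d.flush_if_quiet(2.0))   # still inside the quiet window
print(d.flush_if_quiet(4.0))   # 3 seconds after the last write
```

A burst of writes therefore produces one batch, and one rebuild, instead of one rebuild per file. In a real watcher, `now` would come from something like `time.monotonic()`.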
1262
- Press Ctrl+C to stop.
1263
-
1264
- For agentic workflows: run `--watch` in a background terminal. Code changes from agent waves are picked up automatically between waves. If agents are also writing docs or notes, you'll need a manual `/graphify --update` after those waves.
1265
-
1266
- ---
1267
-
1268
- ## For git commit hook
1269
-
1270
- Install a post-commit hook that auto-rebuilds the graph after every commit. No background process needed - triggers once per commit, works with any editor.
1271
-
1272
- ```bash
1273
- graphify hook install # install
1274
- graphify hook uninstall # remove
1275
- graphify hook status # check
1276
- ```
1277
-
1278
- After every `git commit`, the hook detects which code files changed (via `git diff HEAD~1`), re-runs AST extraction on those files, and rebuilds `graph.json` and `GRAPH_REPORT.md`. Doc/image changes are ignored by the hook - run `/graphify --update` manually for those.
1279
-
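The hook's file selection can be approximated as follows. This is a sketch under assumptions: the extension set is copied from the `--update` section, the function names are hypothetical, and the use of `--name-only` is a guess at how the file list is obtained, since the text only mentions `git diff HEAD~1`.

```python
import subprocess
from pathlib import Path

# Extension set taken from the --update section of this document.
CODE_EXTS = {'.py', '.ts', '.js', '.go', '.rs', '.java', '.cpp', '.c', '.rb', '.swift',
             '.kt', '.cs', '.scala', '.php', '.cc', '.cxx', '.hpp', '.h', '.kts', '.lua', '.toc'}


def changed_since_previous_commit():
    """List files touched by the last commit (assumed invocation, must run inside a git repo)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def split_changed(paths):
    """Separate code files (AST re-extraction) from everything else (ignored by the hook)."""
    code, other = [], []
    for p in paths:
        (code if Path(p).suffix.lower() in CODE_EXTS else other).append(p)
    return code, other


code, other = split_changed(["src/app.py", "README.md", "lib/x.GO", "docs/fig.png"])
print("re-extract:", code)
print("ignored:", other)
```

Only the first list would feed the AST rebuild. Anything in the second list still needs a manual `/graphify --update`.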
1280
- If a post-commit hook already exists, graphify appends to it rather than replacing it.
1281
-
1282
- ---
1283
-
1284
- ## For native CLAUDE.md integration
1285
-
1286
- Run once per project to make graphify always-on in Claude Code sessions:
1287
-
1288
- ```bash
1289
- graphify claude install
1290
- ```
1291
-
1292
- This writes a `## graphify` section to the local `CLAUDE.md` that instructs Claude to check the graph before answering codebase questions and rebuild it after code changes. No manual `/graphify` needed in future sessions.
1293
-
1294
- ```bash
1295
- graphify claude uninstall # remove the section
1296
- ```
1297
-
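Installing and removing a named section in a Markdown file is easiest to reason about when both operations are idempotent. The sketch below shows one way to do that. It is not the CLI's implementation: the function names, the section body, and the rule that a section ends at the next level-1 or level-2 heading are all assumptions.

```python
import re

SECTION_HEADING = "## graphify"

# Matches the heading line and everything up to the next '# ' / '## ' heading or end of file.
_SECTION_RE = re.compile(r"^## graphify[ \t]*\n.*?(?=^#{1,2} |\Z)", re.M | re.S)


def remove_section(text: str) -> str:
    """Drop the '## graphify' section if present; other content is left untouched."""
    return _SECTION_RE.sub("", text)


def install_section(text: str, body: str) -> str:
    """Replace any existing section and append a fresh one, so repeated installs do not stack."""
    base = remove_section(text).rstrip("\n")
    section = f"{SECTION_HEADING}\n\n{body.strip()}\n"
    return f"{base}\n\n{section}" if base else section


original = "# Project\n\nIntro.\n"
once = install_section(original, "Check the graph before answering codebase questions.")
twice = install_section(once, "Check the graph before answering codebase questions.")
print(once == twice)
```

Removing the old section before appending the new one means a second `install` leaves the file unchanged, and `uninstall` reduces to `remove_section` alone.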
1298
- ---
1299
-
1300
- ## Honesty Rules
1301
-
1302
- - Never invent an edge. If unsure, use AMBIGUOUS.
1303
- - Never skip the corpus check warning.
1304
- - Always show token cost in the report.
1305
- - Never hide cohesion scores behind symbols - show the raw number.
1306
- - Never run HTML viz on a graph with more than 5,000 nodes without warning the user.