@zeyue0329/xiaoma-cli 1.16.1-next.1 → 1.17.1-next.0

package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
  "$schema": "https://json.schemastore.org/package.json",
  "name": "@zeyue0329/xiaoma-cli",
- "version": "1.16.1-next.1",
+ "version": "1.17.1-next.0",
  "description": "XiaoMa Universal AI Agent Framework",
  "keywords": [
  "agile",
@@ -11,3 +11,4 @@ Core,xiaoma-review-adversarial-general,Adversarial Review,AR,"Use for quality as
  Core,xiaoma-review-edge-case-hunter,Edge Case Hunter Review,ECH,Use alongside adversarial review for orthogonal coverage — method-driven not attitude-driven.,,[path],anytime,,,false,,
  Core,xiaoma-spec,Spec,SP,"Use to distill any intent input (brief, PRD, transcript, brain dump, design folder, mixed multi-source) into a succinct, no-fluff SPEC.md contract + companions that downstream work derives from. Locks the WHAT before the HOW. Works for software, game design, research, editorial, policy, business, anything intent-bearing. Validation mode also available.",,[path],anytime,,,false,{output_folder}/specs/spec-{slug},SPEC.md + companion files
  Core,xiaoma-customize,XiaoMa Customize,BC,"Use when you want to change how an agent or workflow behaves — add persistent facts, swap templates, insert activation hooks, or customize menus. Scans what's customizable, picks the right scope (agent vs workflow), writes the override to _xiaoma/custom/, and verifies the merge. No TOML hand-authoring required.",,,anytime,,,false,{project-root}/_xiaoma/custom,TOML override files
+ Core,xiaoma-heap-dump-analysis,Heap Dump Leak Analysis,HDA,Use when you have a JVM .hprof / OOM heap dump and need the true root cause of a memory leak — which collection retains the leaked objects and why they are never released.,,[path to .hprof] [suspected class],anytime,,,false,report located next to the dump,root-cause memory-leak report
@@ -0,0 +1,137 @@
1
+ ---
2
+ name: xiaoma-heap-dump-analysis
3
+ description: Analyze a JVM heap dump (.hprof) to find the TRUE root cause of a memory leak — not just "which objects are large" but "who retains them and why they are never released". Drives a class histogram, reverse-reference GC-root tracing, optional Eclipse MAT cross-validation, and precise collection/field measurement into a verified report. Use when the user provides a JVM .hprof / heap dump / OOM dump (or says 内存泄漏 / 堆转储 / 堆 dump(.hprof)分析) and asks to analyze it, find the leak, or 找出内存泄漏的真实原因. NOT for thread dumps (jstack), GC logs, or hs_err crash logs — this is heap (.hprof) memory-leak analysis only.
4
+ argument-hint: "[path to .hprof] [optional: suspected class name]"
5
+ ---
6
+
7
+ # Heap Dump Leak Analysis
8
+
9
+ ## Overview
10
+
11
+ This skill finds the **true root cause** of a JVM heap memory leak from a `.hprof` dump. A histogram alone only tells you *what* objects are numerous — it points at generic containers (`char[]`, `String`, `HashMap$Node`) and rarely names the bug. The real question is **who retains the leaked objects on a GC-root path, and why they are never released.**
12
+
13
+ The method is **multi-evidence and self-validating**: every conclusion is reached by at least two independent routes (a self-written reverse-reference tracer + Eclipse MAT's dominator tree, plus exact field/collection measurement). You act as a JVM memory forensics specialist.
14
+
15
+ The bundled `scripts/` are pure-stdlib streaming HPROF parsers (no third-party deps, memory usage independent of object count) and work on multi-GB dumps with tens of millions of objects. **MAT is optional** — the scripts alone can locate the holder chain; MAT is used as cross-validation when available.
16
+
17
+ **Out of scope (non-goals).** This skill analyzes JVM **heap dumps (`.hprof`)** for memory leaks only. It does **not** handle thread dumps (`jstack` / `*.tdump`), GC logs, or `hs_err_pid*` crash logs; those are different artifacts that need different tooling. The bundled parsers assume the HPROF binary format and hard-fail with a clear `[err]` if the input doesn't start with the `JAVA PROFILE` magic. If the user hands you one of those other dump/log types, say so and stop rather than forcing it through this pipeline.
18
+
19
+ ## Conventions
20
+
21
+ - Bare paths (e.g. `scripts/trace_referrers.py`, `resources/methodology.md`) resolve from this skill's root.
22
+ - All scripts run with the system `python3` (3.8+), no dependencies. Run any script with `--help` first.
23
+ - Class names accept dot or slash form: `com.foo.Bar` == `com/foo/Bar`.
24
+ - **Large dumps — run in background and poll.** Every script does a full streaming pass
25
+ (~1–2 min per few GB; `trace_referrers` does one pass per hop). Don't block the foreground:
26
+
27
+ ```bash
28
+ nohup python3 scripts/trace_referrers.py <dump> <class> --hops 6 > /tmp/trace.out 2>&1 &
29
+ until grep -q '\[done\]' /tmp/trace.out 2>/dev/null; do sleep 5; done
30
+ cat /tmp/trace.out
31
+ ```
32
+
33
+ Scripts emit progress to stderr (`[passA]`, `heap segment #N`, per-hop headers) and finish
34
+ with a `[done]` marker — poll for it.
35
+
36
+ ## Prerequisites & Data Safety — DO THIS FIRST
37
+
38
+ Before any analysis, run the safety preflight. **These steps prevent the two failures that wasted the most time in practice.**
39
+
40
+ 1. **Locate and verify the dump.** Confirm it is HPROF: the first 12 bytes are `JAVA PROFILE`. Note its size.
41
+
42
+ 2. **⚠️ Special characters in the filename break tools.** Filenames containing `%`, spaces, or other shell/URI-sensitive characters (common when JVMs write `-XX:HeapDumpPath=app_%p.hprof`) cause Eclipse MAT and many launchers to fail silently. **Immediately create a hardlink with a clean name** and use it everywhere downstream:
43
+ ```bash
44
+ ln '/path/to/app_%p.hprof' '/path/to/app_clean.hprof' # hardlink, instant, no extra disk
45
+ ```
46
+
47
+ 3. **⚠️ Protect the data — the dump may be the only copy.** A heap dump is irreplaceable, and external tooling (uploaders, splitters) may move or delete it mid-analysis. **Create a second hardlink as a backup** so the inode survives even if the original name is removed:
48
+ ```bash
49
+ ln '/path/to/app_clean.hprof' ~/app_dump_backup.hprof
50
+ ```
51
+ (A hardlink shares the inode, so the data lives as long as ANY link exists, at zero extra disk cost. Hardlinks cannot cross filesystems, so keep the backup link on the same volume as the dump.)
52
+
53
+ 4. **Check tooling.** `python3 --version` (required). `java -version` (JDK 17+, only needed if using MAT). MAT is optional; see `resources/mat-headless-runbook.md` for install + headless invocation.
54
+
55
+ 5. **If the user named a suspected class**, note it for Stage 2. Otherwise Stage 1's histogram will surface candidates.
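The first two preflight steps can also be scripted. The following is a minimal sketch, not part of the bundled scripts: `looks_like_hprof` and `clean_hardlink` are hypothetical helper names, and the clean link is assumed to live in the same directory (and therefore on the same filesystem) as the original dump.

```python
import os

HPROF_MAGIC = b"JAVA PROFILE"  # 12 bytes, followed by a version such as " 1.0.2" and a NUL


def looks_like_hprof(path):
    """Return True if the file begins with the HPROF magic string."""
    with open(path, "rb") as f:
        return f.read(len(HPROF_MAGIC)) == HPROF_MAGIC


def clean_hardlink(src, clean_name):
    """Hardlink src under a shell-safe name in the same directory.

    os.link shares the inode, so no data is copied; it raises OSError if the
    filesystem does not support hardlinks.
    """
    dst = os.path.join(os.path.dirname(os.path.abspath(src)), clean_name)
    if not os.path.exists(dst):
        os.link(src, dst)
    return dst
```

Calling `clean_hardlink('/path/to/app_%p.hprof', 'app_clean.hprof')` mirrors the `ln` command in step 2; checking the magic first avoids linking a file that is not a heap dump at all.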
56
+
57
+ ## Stages
58
+
59
+ | # | Stage | Tool | Purpose |
60
+ |---|-------|------|---------|
61
+ | 1 | Histogram | `scripts/hprof_histogram.py` | Find which classes are abnormally numerous / large — especially business & third-party classes |
62
+ | 2 | Reverse trace | `scripts/trace_referrers.py` | Walk *up* from a suspect class to its GC-root holder chain (the "path to GC roots") |
63
+ | 3 | MAT cross-check *(optional)* | `resources/mat-headless-runbook.md` | Run MAT's Leak Suspects + dominator tree headless; corroborate Stage 2 |
64
+ | 4 | Precise measure | `scripts/inspect_objects.py` | Nail the exact entry count of the suspect collection / the exact value of config fields |
65
+ | 5 | Report | — | Synthesize: symptom → multi-evidence → leak chain → mechanism → fix |
66
+
67
+ ### Stage 1: Histogram — what is abnormal
68
+
69
+ ```bash
70
+ python3 scripts/hprof_histogram.py <dump.hprof> --top 50
71
+ ```
72
+ Read the four leaderboards. The **business/third-party leaderboards (JDK excluded)** are the most diagnostic — they point at *your* objects, not generic containers. Look for a domain object whose instance count is wildly higher than the number of live "real" things it should represent (e.g. session objects ≫ live TCP connections). That mismatch is the leak signature; that class is the Stage 2 suspect.
73
+
74
+ Sanity-anchor the count against reality: compare the suspect against `io.netty.channel.socket.nio.NioSocketChannel`, `sun.nio.ch.SocketChannelImpl`, `java.io.FileDescriptor` (live connections), or a domain "online user" object. A large gap = retained zombies.
75
+
76
+ ### Stage 2: Reverse trace — who retains them
77
+
78
+ ```bash
79
+ python3 scripts/trace_referrers.py <dump.hprof> <suspect-class> --hops 6
80
+ ```
81
+ Each hop reports who references the current set, **by referrer class and by exact field name**, and flags two GC-root signals:
82
+ - **★ static field holder** — a `static` field of some class directly references the objects (classic static-cache / registry leak).
83
+ - **★ referrer is itself a GC root** — thread, JNI global, sticky class, etc.
84
+
85
+ Follow the chain until it converges on a single container (e.g. one `ConcurrentHashMap$Node[]` table, one singleton holder) or hits a ★ anchor. That holder + field is the leak's retention point — the collection that should have been pruned but wasn't.
86
+
87
+ **Expect the object graph to contain cycles** (e.g. `ClientHead.clientsBox → ClientsBox → map → ClientHead`); the reverse-BFS `next` set will balloon in later hops. That is normal — focus on the *first* hops where a single container/field clearly dominates, and on the ★ anchors.
88
+
89
+ ### Stage 3: MAT cross-validation *(OPTIONAL — skip if MAT isn't installed)*
90
+
91
+ **Quick gate:** if `/Applications/MemoryAnalyzer.app` (or a `ParseHeapDump.sh` on PATH) is
92
+ absent, **skip this stage entirely** — the pure-Python Stages 1/2/4 already locate and prove
93
+ the holder chain. Do this stage only when MAT is available and you want the authoritative
94
+ dominator-tree + retained-size corroboration.
95
+
96
+ If Eclipse MAT is available, generate the official Leak Suspects report headless and confirm it names the same holder. **Follow `resources/mat-headless-runbook.md` exactly**: the macOS `.app` launcher fails silently (`exit 14`) from a headless shell, so you must invoke the Equinox launcher jar directly with a raised `-Xmx`, and the dump filename must be free of special characters (the Prerequisites preflight already handled this).
97
+
98
+ MAT's "Problem Suspect 1" (a dominator occupying the bulk of the heap) and its `Node[N]` accumulation point should match the container Stage 2 converged on. Two independent methods agreeing = high confidence.
99
+
100
+ ### Stage 4: Precise measurement — nail it
101
+
102
+ Turn inference into hard numbers. First list the holder's fields, then measure the collection and/or read config fields:
103
+ ```bash
104
+ # List the holder class's fields (decide what to read)
105
+ python3 scripts/inspect_objects.py <dump.hprof> --class <holder-class>
106
+ # Measure the suspect collection(s) — entry count via table capacity + non-empty buckets
107
+ python3 scripts/inspect_objects.py <dump.hprof> --class <holder-class> --map-fields <field1,field2>
108
+ # Read config / state fields (e.g. is a timeout misconfigured?)
109
+ python3 scripts/inspect_objects.py <dump.hprof> --class <config-class> --fields <field1,field2>
110
+ # Measure a STATIC cache/registry map (static fields are GC-root-level holders — a top leak source)
111
+ python3 scripts/inspect_objects.py <dump.hprof> --class <util-class> --static-fields <field1,field2>
112
+ ```
113
+ This distinguishes the real leak collection from innocent ones (a sibling map with thousands of entries is not the 100k+ one), and reads the actual runtime config that governs cleanup (heartbeat/timeout values, flags). **Note:** `ConcurrentHashMap.baseCount`/`HashMap.size` can read low under concurrency; trust **non-empty bucket count** + **table capacity (a power of two)** for the real magnitude.
114
+
115
+ ### Stage 5: Report
116
+
117
+ Synthesize a report in the user's language (Chinese if the user wrote in Chinese). Required structure:
118
+
119
+ 1. **One-line conclusion** — the true root cause in a sentence.
120
+ 2. **Heap overview** — file size, object count, app stack (framework versions if visible).
121
+ 3. **Evidence (multi-route)** — a table: histogram counts, reverse-trace holder chain, MAT suspect %, exact measured entry counts. Show they agree.
122
+ 4. **Leak chain** — `GC root → … → holder.field (collection) → leaked objects → their retained subtree`. Quantify the retained subtree (what fills the heap).
123
+ 5. **Mechanism root cause** — *why* the collection is never pruned (missing remove on disconnect, listener never deregistered, unbounded cache, misconfigured timeout, framework bug, …). Cite the measured config/code evidence. Be explicit about what is *proven* vs *inferred*.
124
+ 6. **Fix recommendations** — ordered: permanent fix (code/version upgrade), config correction, guardrails (limits/monitoring/alerts), temporary mitigation (heap bump + rolling restart / scheduled cleanup).
125
+ 7. **Data-safety note** — if the original file was renamed/deleted, tell the user which hardlink now holds the data.
126
+
127
+ ## Graceful degradation & scaling
128
+
129
+ - **No MAT / MAT won't run** → Stages 1, 2, 4 (pure-Python) are fully sufficient to locate and prove the holder chain. MAT is corroboration, not a dependency.
130
+ - **Very large dumps / multiple suspects** → run Stage 2 on each suspect; the scripts are streaming and re-runnable. You may fan out independent traces as subagents and synthesize.
131
+ - **Unfamiliar framework** → after Stage 2 names the holder class + field, look up that library's source for the add/remove lifecycle of that collection to explain the *mechanism* (Stage 5). Quote method names.
132
+
133
+ ## Resources
134
+
135
+ - `resources/methodology.md` — the full five-stage leak-hunting methodology, decision heuristics, and a worked end-to-end example.
136
+ - `resources/mat-headless-runbook.md` — installing Eclipse MAT and running it **headless** (the exit-14 pitfall, Equinox launcher invocation, `-Xmx`, reading the report zips with `textutil`).
137
+ - `resources/hprof-internals.md` — HPROF binary format reference and how the bundled scripts parse it (so you can extend them for a new question).
@@ -0,0 +1,157 @@
1
+ # HPROF Binary Format — Reference & How the Scripts Parse It
2
+
3
+ The bundled scripts are self-contained streaming parsers of the HPROF binary format. This
4
+ note documents the format and the parsing strategy so you can **extend the scripts** for a
5
+ new question (e.g. "dump every field of object X", "list the keys of map Y", "histogram of
6
+ retained subtree of class Z").
7
+
8
+ ## File layout
9
+
10
+ ```
11
+ [format string, NUL-terminated] e.g. "JAVA PROFILE 1.0.1" / "1.0.2"
12
+ [identifier size : u4] usually 8 (64-bit, no compressed-oop in the dump format)
13
+ [timestamp : u8]
14
+ [record]* a flat sequence of top-level records
15
+ ```
16
+
17
+ Each top-level record:
18
+
19
+ ```
20
+ [tag : u1][time : u4][length : u4][body : <length> bytes]
21
+ ```
22
+
23
+ ### Top-level tags used
24
+
25
+ | tag | name | body |
26
+ |-----|------|------|
27
+ | 0x01 | STRING (UTF-8) | `id` + UTF-8 bytes → the string table (class names, field names) |
28
+ | 0x02 | LOAD_CLASS | class serial(u4) + **class object id** + stack serial(u4) + **name string id** |
29
+ | 0x0C | HEAP_DUMP | a container of heap sub-records |
30
+ | 0x1C | HEAP_DUMP_SEGMENT | same as 0x0C, just chunked (large dumps emit several) |
31
+ | 0x2C | HEAP_DUMP_END | marker |
32
+
33
+ Other tags (stack frames, stack traces, allocation sites, etc.) are skipped via `length`.
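The file and record layouts above can be read with nothing more than `struct`. This is a minimal sketch under the stated layout (all integers big-endian); `read_header` and `iter_records` are illustrative names, not the functions used by the bundled scripts.

```python
import io
import struct

TAG_STRING, TAG_LOAD_CLASS = 0x01, 0x02
TAG_HEAP_DUMP, TAG_HEAP_DUMP_SEGMENT, TAG_HEAP_DUMP_END = 0x0C, 0x1C, 0x2C


def read_header(f):
    """Read the file header; return (format_string, id_size, timestamp)."""
    fmt = bytearray()
    while True:
        b = f.read(1)
        if not b:
            raise ValueError("unexpected EOF inside the format string")
        if b == b"\x00":
            break
        fmt += b
    id_size, timestamp = struct.unpack(">IQ", f.read(12))  # u4 + u8
    return fmt.decode("ascii"), id_size, timestamp


def iter_records(f):
    """Yield (tag, length, body_offset) for each top-level record.

    The body is never read here; the stream is moved past it with seek(),
    so the caller decides which bodies are worth loading.
    """
    while True:
        head = f.read(9)  # tag u1 + time u4 + length u4
        if len(head) < 9:
            return
        tag, _time, length = struct.unpack(">BII", head)
        body_offset = f.tell()
        yield tag, length, body_offset
        f.seek(body_offset + length)
```

A caller would open the dump in binary mode, call `read_header` once, then loop over `iter_records` and only read bodies whose tag is one of the constants above.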
34
+
35
+ ## Heap-dump sub-records
36
+
37
+ Inside a 0x0C/0x1C body is a sequence of sub-records, each led by a 1-byte sub-tag. To stay
38
+ aligned you must consume **exactly** the right number of bytes for every sub-tag — including
39
+ GC-root records you don't care about.
40
+
41
+ ### GC roots (fixed sizes — must skip precisely)
42
+
43
+ | sub-tag | root kind | after the leading `id`, extra bytes |
44
+ |---------|-----------|-------------------------------------|
45
+ | 0xFF | UNKNOWN | 0 |
46
+ | 0x01 | JNI GLOBAL | + 1 `id` (JNI ref) |
47
+ | 0x02 | JNI LOCAL | + u4 + u4 (= 8) |
48
+ | 0x03 | JAVA FRAME | + u4 + u4 (= 8) |
49
+ | 0x04 | NATIVE STACK | + u4 (= 4) |
50
+ | 0x05 | STICKY CLASS | 0 |
51
+ | 0x06 | THREAD BLOCK | + u4 (= 4) |
52
+ | 0x07 | MONITOR USED | 0 |
53
+ | 0x08 | THREAD OBJECT | + u4 + u4 (= 8) |
54
+
55
+ The leading `id` of each is the rooted object — collect these into a set to detect
56
+ "referrer is itself a GC root."
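The table translates directly into a lookup that both skips each root record and collects the rooted object id. A minimal sketch, assuming the heap-dump body is available as `bytes` or a `memoryview`; `GC_ROOT_EXTRA` and `consume_gc_root` are hypothetical names chosen for illustration.

```python
# Extra bytes after the leading object id, per GC-root sub-tag (from the table above).
# None means "one more id", whose width depends on the dump's identifier size.
GC_ROOT_EXTRA = {
    0xFF: 0,     # UNKNOWN
    0x01: None,  # JNI GLOBAL: + JNI ref id
    0x02: 8,     # JNI LOCAL
    0x03: 8,     # JAVA FRAME
    0x04: 4,     # NATIVE STACK
    0x05: 0,     # STICKY CLASS
    0x06: 4,     # THREAD BLOCK
    0x07: 0,     # MONITOR USED
    0x08: 8,     # THREAD OBJECT
}


def consume_gc_root(buf, pos, id_size, roots):
    """Consume one GC-root sub-record whose sub-tag byte is at buf[pos].

    Adds the rooted object id to `roots` and returns the position of the next
    sub-record, or None if the sub-tag is not a GC-root tag.
    """
    sub_tag = buf[pos]
    if sub_tag not in GC_ROOT_EXTRA:
        return None
    roots.add(int.from_bytes(buf[pos + 1:pos + 1 + id_size], "big"))
    extra = GC_ROOT_EXTRA[sub_tag]
    return pos + 1 + id_size + (id_size if extra is None else extra)
```

Returning `None` for a non-root sub-tag lets the caller fall through to its CLASS_DUMP / INSTANCE_DUMP / array handlers without losing alignment.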
57
+
58
+ ### 0x20 CLASS_DUMP (the tricky one)
59
+
60
+ ```
61
+ class object id : id
62
+ stack trace serial : u4
63
+ super class object id : id
64
+ class loader id : id
65
+ signers id : id
66
+ protection domain id : id
67
+ reserved : id
68
+ reserved : id
69
+ instance size : u4
70
+ constant pool count : u2, then each: cp-index(u2) + type(u1) + value(size-by-type)
71
+ static fields count : u2, then each: name string id(id) + type(u1) + value(size-by-type)
72
+ instance fields count : u2, then each: name string id(id) + type(u1)
73
+ ```
74
+
75
+ From CLASS_DUMP the scripts cache: `super` (to walk the chain), the **instance field list**
76
+ `(type, name-id)` in declaration order, and **static field references** (object-typed
77
+ statics — these are GC-root-level holders and how static-cache leaks are detected).
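As an illustration of why this record is the tricky one, here is a minimal sketch that walks one CLASS_DUMP body following the layout above. It is not the bundled scripts' parser: `parse_class_dump` and the shape of the dictionary it returns are assumptions made for this example.

```python
import struct

TYPE_OBJECT = 2
TYPE_SIZE = {4: 1, 5: 2, 6: 4, 7: 8, 8: 1, 9: 2, 10: 4, 11: 8}  # primitives only


def value_size(type_tag, id_size):
    """Size in bytes of a value of the given basic type."""
    return id_size if type_tag == TYPE_OBJECT else TYPE_SIZE[type_tag]


def parse_class_dump(buf, pos, id_size):
    """Parse a CLASS_DUMP body starting at pos (just after the 0x20 sub-tag).

    Returns (info, new_pos); new_pos is the first byte of the next sub-record.
    """
    def read_id(p):
        return int.from_bytes(buf[p:p + id_size], "big")

    class_id = read_id(pos)
    pos += id_size + 4                 # class object id + stack trace serial
    super_id = read_id(pos)
    pos += id_size                     # super class object id
    pos += 5 * id_size                 # loader, signers, protection domain, 2 x reserved
    pos += 4                           # instance size

    (cp_count,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    for _ in range(cp_count):
        type_tag = buf[pos + 2]        # type follows the u2 constant-pool index
        pos += 3 + value_size(type_tag, id_size)

    (static_count,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    static_refs = []
    for _ in range(static_count):
        name_id = read_id(pos)
        type_tag = buf[pos + id_size]
        pos += id_size + 1
        if type_tag == TYPE_OBJECT:    # object-typed statics are GC-root-level holders
            static_refs.append((name_id, read_id(pos)))
        pos += value_size(type_tag, id_size)

    (field_count,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    fields = []
    for _ in range(field_count):
        fields.append((buf[pos + id_size], read_id(pos)))  # (type, name-id)
        pos += id_size + 1

    info = {"class_id": class_id, "super_id": super_id,
            "static_refs": static_refs, "fields": fields}
    return info, pos
```

The constant pool and static fields carry inline values whose width depends on the type tag, which is why they cannot be skipped with a fixed stride.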
78
+
79
+ ### 0x21 INSTANCE_DUMP
80
+
81
+ ```
82
+ object id : id
83
+ stack trace serial : u4
84
+ class object id : id
85
+ num bytes that follow : u4
86
+ field values : <num bytes> (raw, laid out per the class's field layout)
87
+ ```
88
+
89
+ **Field layout order:** the values are this class's instance fields in declaration order,
90
+ **then the super class's, then its super's**, … each field occupying `size-by-type` bytes.
91
+ To read a specific field you compute its byte offset by walking
92
+ `class → super → super…`, summing field sizes; reference fields (type 2) hold an `id`.
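The offset computation described above can be sketched as follows. The `classes` dictionary is a hypothetical in-memory stand-in for the cached CLASS_DUMP metadata, and a super id of `0` is assumed to mean "no super class".

```python
TYPE_OBJECT = 2
TYPE_SIZE = {4: 1, 5: 2, 6: 4, 7: 8, 8: 1, 9: 2, 10: 4, 11: 8}  # primitives only


def field_offset(classes, class_id, field_name, id_size=8):
    """Return (byte_offset, size) of a field inside an INSTANCE_DUMP's field values.

    classes maps class_id -> {"super": super_id_or_0, "fields": [(type_tag, name), ...]}
    with fields in declaration order. The walk goes class -> super -> super...,
    matching the order in which the values are laid out.
    """
    offset = 0
    cid = class_id
    while cid:
        info = classes[cid]
        for type_tag, name in info["fields"]:
            size = id_size if type_tag == TYPE_OBJECT else TYPE_SIZE[type_tag]
            if name == field_name:
                return offset, size
            offset += size
        cid = info["super"]
    raise KeyError(field_name)
```

For a class declaring `int count; Object head;` whose super declares `long stamp;`, `head` sits at offset 4 and the inherited `stamp` at offset 12 when ids are 8 bytes wide.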
93
+
94
+ ### 0x22 OBJECT_ARRAY_DUMP
95
+
96
+ ```
97
+ array object id : id
98
+ stack serial : u4
99
+ num elements : u4
100
+ array class id : id
101
+ elements : num × id
102
+ ```
103
+
104
+ ### 0x23 PRIMITIVE_ARRAY_DUMP
105
+
106
+ ```
107
+ array object id : id
108
+ stack serial : u4
109
+ num elements : u4
110
+ element type : u1
111
+ elements : num × size-by-type
112
+ ```
113
+
114
+ ## Basic type tags & sizes
115
+
116
+ | tag | type | size |
117
+ |-----|------|------|
118
+ | 2 | object (ref) | `id_size` (usually 8) |
119
+ | 4 | boolean | 1 |
120
+ | 5 | char | 2 |
121
+ | 6 | float | 4 |
122
+ | 7 | double | 8 |
123
+ | 8 | byte | 1 |
124
+ | 9 | short | 2 |
125
+ | 10 | int | 4 |
126
+ | 11 | long | 8 |
127
+
128
+ ## Parsing strategy used by the scripts
129
+
130
+ - **Streaming, low memory.** Read each top-level record header (9 bytes), then read the
131
+ body only for records of interest (STRING, LOAD_CLASS, HEAP_DUMP, HEAP_DUMP_SEGMENT); `seek()` past the rest.
132
+ Heap-dump bodies are walked sub-record by sub-record over a `memoryview` (no copies).
133
+ - **Class metadata is cached once**, object counts/fields are never all held in memory — so
134
+ memory use is independent of object count (works on 40M-object dumps).
135
+ - **Reference offsets** (`get_refoffs`) precompute, per class, the byte offsets of all
136
+ reference-typed fields (self + supers) for fast referrer scanning in `trace_referrers.py`.
137
+ - **Reverse tracing = repeated full streaming passes.** Each hop re-scans the file, testing
138
+ every object's references against the current target id-set. Slower than building an
139
+ in-memory graph but uses trivial memory and is robust on huge dumps.
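The reverse-tracing idea can be shown on a toy object list. This is a sketch only: a small in-memory list stands in for a full streaming pass over the dump, `reverse_hop` and `trace` are hypothetical names, and the `seen` set that stops already-visited objects from being expanded again is a choice made for this example rather than a description of `trace_referrers.py`.

```python
from collections import Counter


def reverse_hop(objects, targets):
    """One hop: find every object that references something in `targets`.

    objects: iterable of (object_id, class_name, [(field_name, referenced_id), ...]).
    Returns (referrer_ids, Counter keyed by (class_name, field_name)).
    """
    referrers = set()
    by_field = Counter()
    for obj_id, cls, refs in objects:
        for field, ref in refs:
            if ref in targets:
                referrers.add(obj_id)
                by_field[(cls, field)] += 1
    return referrers, by_field


def trace(objects, start, hops, roots=frozenset()):
    """Yield (hop_number, by_field, referrers_that_are_gc_roots) per hop."""
    seen = set(start)
    targets = set(start)
    for hop in range(1, hops + 1):
        referrers, by_field = reverse_hop(objects, targets)
        yield hop, by_field, referrers & set(roots)
        targets = referrers - seen   # skip objects already expanded (cycles)
        seen |= targets
        if not targets:
            break
```

Each call to `reverse_hop` corresponds to one full pass over the file, which is why the cost grows with the number of hops rather than with the size of an in-memory graph.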
140
+
141
+ ## Extending the scripts
142
+
143
+ Common one-off questions and where to hook in:
144
+
145
+ - **Print every field of a specific object id** → in a heap walk, when
146
+ `INSTANCE_DUMP.object id == target`, decode each field with the class's
147
+ `layout` (see `inspect_objects.py`'s `read_val` + `layout`).
148
+ - **Enumerate a map's keys/values** → read the map's `table` (a `Node[]`), then for each
149
+ non-null `Node` follow `key`/`val`/`next`; resolve value object classes via the
150
+ class-id → name table.
151
+ - **Retained-style grouping** → for a target class, in one pass collect each instance's
152
+ outgoing refs, then attribute their shallow sizes (approximate; real dominator/retained
153
+ needs MAT).
154
+
155
+ Keep the GC-root and CLASS_DUMP skip logic byte-exact — a single miscounted field
156
+ desynchronizes the entire stream. The scripts raise on an unknown sub-tag, which is the
157
+ canary for a misalignment.
@@ -0,0 +1,108 @@
1
+ # Eclipse MAT — Headless Runbook (and pitfalls)
2
+
3
+ Eclipse Memory Analyzer (MAT) is the gold standard for `.hprof` analysis: it builds a
4
+ dominator tree and produces a **Leak Suspects** report. This skill uses it only as
5
+ **cross-validation** for the pure-Python scripts — but when it runs, it is authoritative.
6
+ This runbook captures the exact invocation and the traps that make it fail *silently*.
7
+
8
+ ## Install
9
+
10
+ macOS (Apple Silicon or Intel), via Homebrew:
11
+
12
+ ```bash
13
+ brew install --cask mat # cask is named "mat" (Eclipse Memory Analyzer)
14
+ # installs to /Applications/MemoryAnalyzer.app
15
+ ```
16
+
17
+ Other platforms: download the **Standalone / RCP** build from
18
+ <https://eclipse.dev/mat/download/> (needs a JDK 17+ on PATH).
19
+
20
+ ## ⚠️ Pitfall 1 — the macOS `.app` launcher exits 14, silently
21
+
22
+ Running `…/MemoryAnalyzer.app/Contents/Eclipse/ParseHeapDump.sh` (which calls the
23
+ `Contents/MacOS/MemoryAnalyzer` binary) from a **headless shell** (no GUI/WindowServer
24
+ session) fails with **exit code 14, zero output, no index files**. The Cocoa launcher
25
+ forces `-XstartOnFirstThread` and cannot reach the window server.
26
+
27
+ **Do not fight `ParseHeapDump.sh`.** Invoke the Equinox launcher jar directly with your own
28
+ JDK — fully headless, no GUI bootstrap:
29
+
30
+ ```bash
31
+ ECL=/Applications/MemoryAnalyzer.app/Contents/Eclipse
32
+ LAUNCHER=$(ls "$ECL"/plugins/org.eclipse.equinox.launcher_*.jar)
33
+
34
+ java -Xmx12g \
35
+ --add-exports=java.base/jdk.internal.org.objectweb.asm=ALL-UNNAMED \
36
+ -Djava.awt.headless=true \
37
+ -Dosgi.install.area="file:$ECL/" \
38
+ -Dosgi.configuration.area=file:/tmp/mat_cfg/ \
39
+ -jar "$LAUNCHER" \
40
+ -consoleLog -nosplash -data /tmp/mat_ws \
41
+ -application org.eclipse.mat.api.parse \
42
+ '/path/to/dump_clean.hprof' \
43
+ org.eclipse.mat.api:suspects org.eclipse.mat.api:top_components
44
+ ```
45
+
46
+ Notes:
47
+
48
+ - `-Dosgi.install.area` / `-Dosgi.configuration.area` are **required** — without them the
49
+ launcher loads no MAT bundle and the application is a no-op (the same silent exit 14).
50
+ Point `configuration.area` at a writable dir (the install `configuration/` is read-only).
51
+ - `-Xmx`: give MAT roughly **the dump size or more** (12g for a 2.6 GB dump is comfortable;
52
+ MAT's indexing is efficient but needs headroom). If you instead use `ParseHeapDump.sh` on
53
+ Linux/CI, raise `-Xmx` by editing `MemoryAnalyzer.ini`'s `-vmargs` block.
54
+ - Report ids: `org.eclipse.mat.api:suspects` (Leak Suspects),
55
+ `org.eclipse.mat.api:top_components`, `org.eclipse.mat.api:overview`.
56
+
57
+ ## ⚠️ Pitfall 2 — special characters in the dump filename
58
+
59
+ `%`, spaces, etc. (common from `-XX:HeapDumpPath=app_%p.hprof`) break Equinox/URI handling
60
+ and cause the *same* silent exit 14. **Always analyze a clean-named hardlink** (see the
61
+ skill's Prerequisites). This also protects the data if the original is later deleted.
62
+
63
+ ## ⚠️ Pitfall 3 — Gatekeeper quarantine (macOS)
64
+
65
+ A freshly cask-installed app may be quarantined. If launch is blocked:
66
+
67
+ ```bash
68
+ xattr -dr com.apple.quarantine /Applications/MemoryAnalyzer.app
69
+ ```
70
+
71
+ ## Outputs
72
+
73
+ A successful parse writes, next to the dump:
74
+
75
+ - Index files: `*.index`, `*.idx.index`, `*.domIn.index`, `*.domOut.index`,
76
+ `*.o2ret.index`, `*.threads`, … (the dominator tree + retained-size indices).
77
+ - Report zips: `<name>_Leak_Suspects.zip`, `<name>_Top_Components.zip`,
78
+ `<name>_System_Overview.zip`.
79
+
80
+ The presence of `*.index` files is the reliable "parsing actually started" signal when
81
+ watching progress.
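A small check for these outputs, as a hedged sketch: it assumes `<name>` is the dump's filename without its extension and that MAT writes its files into the dump's directory, as described above. `mat_outputs` is a hypothetical helper name.

```python
from pathlib import Path


def mat_outputs(dump_path):
    """Return (index_files, report_zips) found next to the dump, as sorted file names."""
    dump = Path(dump_path)
    prefix = dump.stem  # e.g. "app_clean" for "app_clean.hprof"
    index_files, report_zips = [], []
    for entry in sorted(dump.parent.iterdir()):
        if entry.name.startswith(prefix + ".") and entry.name.endswith(".index"):
            index_files.append(entry.name)
        elif entry.name.startswith(prefix + "_") and entry.name.endswith(".zip"):
            report_zips.append(entry.name)
    return index_files, report_zips
```

An empty `index_files` list a minute or two after launch suggests the parse never started, which is the silent-failure case the pitfalls above describe.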
82
+
83
+ ## Reading the report headlessly
84
+
85
+ The zips contain HTML reports. Convert them to text with macOS `textutil` (or any html-to-text tool):
86
+
87
+ ```bash
88
+ cd /tmp && rm -rf leak && mkdir leak && cd leak
89
+ unzip -o '/path/to/<name>_Leak_Suspects.zip' >/dev/null
90
+ textutil -convert txt -stdout index.html | sed '/^[[:space:]]*$/d' | head -120
91
+ ```
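Where `textutil` is unavailable (Linux, CI), the same extraction can be approximated with the Python standard library. A minimal sketch: it assumes the zip contains an `index.html` member, as the commands above imply, and `report_text` is a hypothetical helper name.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def report_text(zip_source, member="index.html"):
    """Return the text of one HTML page inside a MAT report zip (path or file object)."""
    with zipfile.ZipFile(zip_source) as zf:
        html = zf.read(member).decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Printing the first hundred or so lines of `report_text('/path/to/<name>_Leak_Suspects.zip')` gives roughly the same view as the `textutil | head -120` pipeline.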
92
+
93
+ What to extract from **Leak Suspects**:
94
+
95
+ - **Problem Suspect 1**: "one instance of `<HolderClass>` … occupies `N` bytes (`P`%) …"
96
+ → the dominator and its retained-heap share.
97
+ - "The memory is accumulated in one instance of `<X>[capacity]`" → the accumulation point
98
+ (often a `ConcurrentHashMap$Node[]` / `Object[]` table).
99
+ - "Thread `…` has a local variable … on the shortest path to …" → the GC-root path.
100
+
101
+ Confirm this holder matches what `trace_referrers.py` converged on. Detailed dominator
102
+ trees and per-object paths live in `pages/*.html` and `_Top_Components` / `_System_Overview`.
103
+
104
+ ## Quick decision
105
+
106
+ - MAT runs → use it to corroborate + get retained sizes. Strongest evidence.
107
+ - MAT won't run after ~2 attempts → **don't keep fighting it.** The pure-Python Stages 1/2/4
108
+ fully locate and prove the holder chain on their own.
@@ -0,0 +1,148 @@
1
+ # Heap Leak Hunting — Methodology
2
+
3
+ The single idea behind this skill: **a histogram tells you *what* is numerous; it does not
4
+ tell you *who retains it* or *why it is never freed*.** Generic containers (`char[]`,
5
+ `String`, `HashMap$Node`, `Object[]`) always top the histogram and almost never name the
6
+ bug. Root-causing a leak means walking *up the reference graph* to the GC-root holder, then
7
+ explaining the *mechanism* that keeps the collection growing.
8
+
9
+ Every conclusion should be reached by **at least two independent routes** so it is not an
10
+ artifact of one tool.
11
+
12
+ ---
13
+
14
+ ## Stage 1 — Histogram: find the abnormal class
15
+
16
+ Run `scripts/hprof_histogram.py`. Ignore the all-classes board first; read the
17
+ **business / third-party board (JDK excluded)**. You are hunting for a *domain* class whose
18
+ instance count is wildly larger than the number of real things it should represent.
19
+
20
+ **Anchor the count against reality.** Pick objects whose count equals "live work":
21
+
22
+ - `io.netty.channel.socket.nio.NioSocketChannel`, `sun.nio.ch.SocketChannelImpl`,
23
+ `java.io.FileDescriptor` → number of live TCP connections / sockets.
24
+ - a domain "online user / active session" object, thread-pool size, etc.
25
+
26
+ If a session/handler/listener/entry class has, say, 190k instances while live connections
+ are 40k and active users are 6k, the surplus (~150k over live connections, ~184k over
+ active users) is **retained zombies**. That class is your Stage-2 suspect.
29
+
30
+ Cross-check internal consistency: leaked parents usually drag fixed-ratio children. If
31
+ `SuspectA = N`, look for classes at `≈N`, `≈2N`, `≈k·N` — they confirm a whole subtree
32
+ leaks together (e.g. each leaked session keeps 1 handshake + 2 transport states + 1 store).
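The fixed-ratio check is easy to automate once the histogram counts are in a dictionary. A minimal sketch with a hypothetical helper name (`fixed_ratio_companions`) and an arbitrary 5% tolerance; neither comes from the bundled scripts.

```python
def fixed_ratio_companions(histogram, suspect, max_k=4, tolerance=0.05):
    """Find classes whose instance count is close to k * count(suspect), k = 1..max_k.

    histogram: {class_name: instance_count}. Returns a sorted list of
    (class_name, k, count) tuples.
    """
    n = histogram[suspect]
    matches = []
    for cls, count in histogram.items():
        if cls == suspect:
            continue
        k = round(count / n)
        if 1 <= k <= max_k and abs(count - k * n) <= tolerance * k * n:
            matches.append((cls, k, count))
    return sorted(matches)
```

Classes that show up here at `k = 1` or `k = 2` are good candidates for belonging to the same leaked subtree as the suspect; classes far below `N` (such as live channels) are filtered out because their ratio rounds to zero.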
33
+
34
+ ---
35
+
36
+ ## Stage 2 — Reverse-reference trace: who retains them
37
+
38
+ Run `scripts/trace_referrers.py <dump> <suspect-class>`. It implements MAT's
39
+ "path to GC roots" in pure Python: from every instance of the suspect, find referrers via
40
+ **instance fields, array elements, and static fields**, hop by hop, until reaching a
41
+ GC root or a single accumulation container.
42
+
43
+ Read each hop for:
44
+
45
+ - **The dominant referrer field.** `('java.util.concurrent.ConcurrentHashMap$Node', 'val')`
46
+ with a count ≈ suspect count means "they live as *values* of a ConcurrentHashMap."
47
+ Next hops climb `Node → Node[] table → ConcurrentHashMap → holder.field`.
48
+ - **★ static field holder** — a `static` field directly references the set. This is the
49
+ smoking gun for static-registry / static-cache leaks; the holder is a GC root by itself.
50
+ - **★ referrer is itself a GC root** — thread (local var / thread object), JNI global,
51
+ sticky class.
52
+
53
+ **Convergence is the goal.** You are looking for the hop where many objects funnel into
54
+ *one* container instance (one big `Node[]` table, one singleton holder). That container's
55
+ owning field is the retention point.
56
+
57
+ **Cycles are expected.** Frameworks wire back-references (child holds a pointer to the
58
+ registry that holds the child), so the reverse-BFS frontier *grows* after a few hops. Don't
59
+ chase the ballooning frontier — read the *early* hops and the ★ anchors.
60
+
61
+ ---
62
+
63
+ ## Stage 3 — MAT cross-validation (optional)
64
+
65
+ If Eclipse MAT runs (see `mat-headless-runbook.md`), its **Leak Suspects** report should
66
+ independently name the same holder ("Problem Suspect 1: one instance of X occupies N% …
67
+ accumulated in one `Node[capacity]`"). Two unrelated methods agreeing turns a strong
68
+ inference into a confident finding. MAT also gives **retained size** (true cost of the
69
+ subtree), which the histogram's shallow size cannot.
70
+
71
+ ---
72
+
73
+ ## Stage 4 — Precise measurement: turn inference into numbers
74
+
75
+ Run `scripts/inspect_objects.py` to:
76
+
77
+ - **Measure the suspect collection's true size** — `--map-fields`. Distinguish the real
78
+ leak map (100k+ entries, huge power-of-two table) from innocent siblings (thousands).
79
+ - **Read the runtime config that governs cleanup** — `--fields`. Heartbeat/timeout values,
80
+ feature flags, max sizes. A misconfigured timeout (e.g. 100× the default) often explains
81
+ *why* cleanup never fires.
82
+
83
+ > **Trust non-empty bucket count + table capacity, not `size`/`baseCount`.**
84
+ > `ConcurrentHashMap` spreads its counter across `counterCells` under contention, so
85
+ > `baseCount` can read absurdly low (e.g. 552 for a map with ~190k entries). The
86
+ > `Node[]` table capacity (a power of two) and the non-empty bucket count are reliable.
87
+ > The script flags this automatically (⚠️) when `size` and non-empty buckets differ by >10×.
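Non-empty buckets undercount entries because some buckets chain several nodes. If hashes are assumed to spread uniformly over the table (a modelling assumption, not something the heap proves), the entry count can be estimated from the two reliable numbers with the standard balls-into-bins relation `k ≈ m · (1 − e^(−n/m))`, solved for `n`. The function name below is hypothetical.

```python
import math


def estimate_entries(capacity, non_empty_buckets):
    """Estimate map entries n from table capacity m and non-empty bucket count k.

    Solves k = m * (1 - exp(-n / m)) for n, which assumes keys hash uniformly.
    """
    if not 0 <= non_empty_buckets < capacity:
        raise ValueError("non_empty_buckets must be in [0, capacity)")
    return -capacity * math.log(1 - non_empty_buckets / capacity)
```

For a table of 524288 slots with 160,989 non-empty buckets this gives roughly 192k entries. Treat the result as an order-of-magnitude check; skewed hash distributions will shift it.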
88
+
89
+ When comparing sibling collections, remember **they may hold different things**: a registry
90
+ keyed by session-id can be huge while a sibling keyed by live channel stays small (it only
91
+ holds currently-connected transports, not every session). Don't expect every map's size to
92
+ equal the live-connection count — judge each against *what it is supposed to hold*. The leak
93
+ is the one whose size has no business being that large.
94
+
95
+ ---
96
+
97
+ ## Stage 5 — Mechanism & report
98
+
99
+ Locating the holder is half the job; explain **why it grows unbounded**:
100
+
101
+ - **Static cache / registry never pruned** — entries added on event A, removal depends on
102
+ event B that doesn't always happen.
103
+ - **Listener / callback never deregistered** — observer holds the subject (or vice-versa).
104
+ - **Connection / session not removed on disconnect** — cleanup depends on a timeout or an
105
+ explicit close callback that some paths skip.
106
+ - **Unbounded queue / buffer** — producer outruns consumer.
107
+ - **`ThreadLocal` on a pooled thread** — value outlives the request.
108
+ - **`ClassLoader` leak** — a long-lived object pins a webapp/plugin classloader.
109
+ - **Misconfigured timeout / TTL** — cleanup mechanism exists but is effectively disabled
110
+ (set to 0) or set so large it never fires.
111
+
112
+ To pin the mechanism, read the owning library's source for the **add/remove lifecycle** of
113
+ that exact collection, and quote the method names. State plainly what is *proven from the
114
+ heap* vs *inferred*.
115
+
116
+ ---
117
+
118
+ ## Worked example (anonymized, real case)
119
+
120
+ > This is a real, anonymized case shown to illustrate the flow end-to-end. For a *new* dump,
121
+ > derive every number yourself with the scripts — **do not pattern-match your conclusion to
122
+ > this example.** A different leak will have a different holder class, field, and mechanism;
123
+ > the *method* transfers, the answer does not.
124
+
125
+ **Symptom.** 2.6 GB dump, ~40M objects. Histogram: a Socket.IO session class `ClientHead`
126
+ = 192,639; live `NioSocketChannel` = 41,773; active session-wrapper = 6,530. → ~186k
127
+ zombie sessions.
128
+
129
+ **Reverse trace.** `ClientHead ← ConcurrentHashMap$Node.val (194,844) ← Node[] table ←
130
+ ClientsBox (singleton) ← GC root (Netty I/O thread → SocketIOChannelInitializer)`. Sibling
131
+ ratios confirmed: `TransportState ≈ 2× ClientHead`, message queues `≈ 2× ClientHead`.
132
+
133
+ **MAT.** Leak Suspects: "one `ClientsBox` occupies 64.67% … accumulated in
134
+ `ConcurrentHashMap$Node[524288]`." Same holder. ✓
135
+
136
+ **Measure.** `ClientsBox.uuid2clients`: table 524288, non-empty buckets 160,989 (≈190k entries, since some buckets chain several nodes);
137
+ `channel2clients`: 2,027; business static maps: 939 / 14,496. → leak is **only**
138
+ `uuid2clients`. Config: `pingTimeout=120000` (2× default), `upgradeTimeout=1,000,000` (100×
139
+ default).
140
+
141
+ **Mechanism.** `AuthorizeHandler.authorize()` does `clientsBox.addClient()` on handshake;
142
+ removal depends on ping-timeout or explicit disconnect. Only ~18k timeout tasks exist for
143
+ 190k sessions → most zombies have no pending cleanup. The blown-up `upgradeTimeout`
144
+ amplifies retention of mid-upgrade/aborted polling connections.
145
+
146
+ **Fix.** Upgrade the library; restore `upgradeTimeout`/`pingTimeout` to sane values; add
147
+ ingress rate-limit/auth to stop invalid handshakes; monitor `uuid2clients.size`; mitigate
148
+ with heap bump + rolling restart / scheduled stale-session cleanup.