@zeyue0329/xiaoma-cli 1.17.0 → 1.18.0
- package/package.json +1 -1
- package/src/core-skills/module-help.csv +1 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/SKILL.md +137 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/resources/hprof-internals.md +157 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/resources/mat-headless-runbook.md +108 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/resources/methodology.md +148 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/scripts/hprof_histogram.py +215 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/scripts/inspect_objects.py +335 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/scripts/trace_referrers.py +305 -0
- package/src/core-skills/xiaoma-heap-dump-analysis/xiaoma-skill-manifest.yaml +15 -0
- package/tools/installer/ide/_config-driven.js +2 -2
- package/tools/installer/modules/official-modules.js +4 -0
package/package.json
CHANGED
|
package/src/core-skills/module-help.csv
@@ -11,3 +11,4 @@ Core,xiaoma-review-adversarial-general,Adversarial Review,AR,"Use for quality as
|
|
|
11
11
|
Core,xiaoma-review-edge-case-hunter,Edge Case Hunter Review,ECH,Use alongside adversarial review for orthogonal coverage — method-driven not attitude-driven.,,[path],anytime,,,false,,
|
|
12
12
|
Core,xiaoma-spec,Spec,SP,"Use to distill any intent input (brief, PRD, transcript, brain dump, design folder, mixed multi-source) into a succinct, no-fluff SPEC.md contract + companions that downstream work derives from. Locks the WHAT before the HOW. Works for software, game design, research, editorial, policy, business, anything intent-bearing. Validation mode also available.",,[path],anytime,,,false,{output_folder}/specs/spec-{slug},SPEC.md + companion files
|
|
13
13
|
Core,xiaoma-customize,XiaoMa Customize,BC,"Use when you want to change how an agent or workflow behaves — add persistent facts, swap templates, insert activation hooks, or customize menus. Scans what's customizable, picks the right scope (agent vs workflow), writes the override to _xiaoma/custom/, and verifies the merge. No TOML hand-authoring required.",,,anytime,,,false,{project-root}/_xiaoma/custom,TOML override files
|
|
14
|
+
Core,xiaoma-heap-dump-analysis,Heap Dump Leak Analysis,HDA,Use when you have a JVM .hprof / OOM heap dump and need the true root cause of a memory leak — which collection retains the leaked objects and why they are never released.,,[path to .hprof] [suspected class],anytime,,,false,report located next to the dump,root-cause memory-leak report
|
|
package/src/core-skills/xiaoma-heap-dump-analysis/SKILL.md
@@ -0,0 +1,137 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: xiaoma-heap-dump-analysis
|
|
3
|
+
description: Analyze a JVM heap dump (.hprof) to find the TRUE root cause of a memory leak — not just "which objects are large" but "who retains them and why they are never released". Drives a class histogram, reverse-reference GC-root tracing, optional Eclipse MAT cross-validation, and precise collection/field measurement into a verified report. Use when the user provides a JVM .hprof / heap dump / OOM dump (or says 内存泄漏 / 堆转储 / 堆 dump(.hprof)分析) and asks to analyze it, find the leak, or 找出内存泄漏的真实原因. NOT for thread dumps (jstack), GC logs, or hs_err crash logs — this is heap (.hprof) memory-leak analysis only.
|
|
4
|
+
argument-hint: "[path to .hprof] [optional: suspected class name]"
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
# Heap Dump Leak Analysis
|
|
8
|
+
|
|
9
|
+
## Overview
|
|
10
|
+
|
|
11
|
+
This skill finds the **true root cause** of a JVM heap memory leak from a `.hprof` dump. A histogram alone only tells you *what* objects are numerous — it points at generic containers (`char[]`, `String`, `HashMap$Node`) and rarely names the bug. The real question is **who retains the leaked objects on a GC-root path, and why they are never released.**
|
|
12
|
+
|
|
13
|
+
The method is **multi-evidence and self-validating**: every conclusion is reached by at least two independent routes (a self-written reverse-reference tracer + Eclipse MAT's dominator tree, plus exact field/collection measurement). You act as a JVM memory forensics specialist.
|
|
14
|
+
|
|
15
|
+
The bundled `scripts/` are pure-stdlib streaming HPROF parsers (no third-party deps, memory usage independent of object count) and work on multi-GB dumps with tens of millions of objects. **MAT is optional** — the scripts alone can locate the holder chain; MAT is used as cross-validation when available.
|
|
16
|
+
|
|
17
|
+
**Out of scope (non-goals).** This skill analyzes JVM **heap dumps (`.hprof`)** for memory leaks only. It does **not** handle thread dumps (`jstack` / `*.tdump`), GC logs, or `hs_err_pid*` crash logs — those are different artifacts needing different tooling. The bundled parsers assume the HPROF binary format and now hard-fail with a clear `[err]` if the input doesn't start with the `JAVA PROFILE` magic. If the user hands you one of those other dump/log types, say so and stop rather than forcing it through this pipeline.
|
|
18
|
+
|
|
19
|
+
## Conventions
|
|
20
|
+
|
|
21
|
+
- Bare paths (e.g. `scripts/trace_referrers.py`, `resources/methodology.md`) resolve from this skill's root.
|
|
22
|
+
- All scripts run with the system `python3` (3.8+), no dependencies. Run any script with `--help` first.
|
|
23
|
+
- Class names accept dot or slash form: `com.foo.Bar` == `com/foo/Bar`.
|
|
24
|
+
- **Large dumps — run in background and poll.** Every script does a full streaming pass
|
|
25
|
+
(~1–2 min per few GB; `trace_referrers` does one pass per hop). Don't block the foreground:
|
|
26
|
+
|
|
27
|
+
```bash
|
|
28
|
+
nohup python3 scripts/trace_referrers.py <dump> <class> --hops 6 > /tmp/trace.out 2>&1 &
|
|
29
|
+
until grep -q '\[done\]' /tmp/trace.out 2>/dev/null; do sleep 5; done
|
|
30
|
+
cat /tmp/trace.out
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
Scripts emit progress to stderr (`[passA]`, `heap segment #N`, per-hop headers) and finish
|
|
34
|
+
with a `[done]` marker — poll for it.
|
|
35
|
+
|
|
36
|
+
## Prerequisites & Data Safety — DO THIS FIRST
|
|
37
|
+
|
|
38
|
+
Before any analysis, run the safety preflight. **These steps prevent the two failures that wasted the most time in practice.**
|
|
39
|
+
|
|
40
|
+
1. **Locate and verify the dump.** Confirm it is HPROF: the file starts with the 12-byte ASCII magic `JAVA PROFILE` (followed by a version such as ` 1.0.2` and a NUL byte). Note its size.
|
|
41
|
+
|
|
42
|
+
2. **⚠️ Special characters in the filename break tools.** Filenames containing `%`, spaces, or other shell/URI-sensitive characters (common when JVMs write `-XX:HeapDumpPath=app_%p.hprof`) cause Eclipse MAT and many launchers to fail silently. **Immediately create a hardlink with a clean name** and use it everywhere downstream:
|
|
43
|
+
```bash
|
|
44
|
+
ln '/path/to/app_%p.hprof' '/path/to/app_clean.hprof' # hardlink, instant, no extra disk
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
3. **⚠️ Protect the data — the dump may be the only copy.** A heap dump is irreplaceable, and external tooling (uploaders, splitters) may move or delete it mid-analysis. **Create a second hardlink as a backup** so the inode survives even if the original name is removed:
|
|
48
|
+
```bash
|
|
49
|
+
ln '/path/to/app_clean.hprof' ~/app_dump_backup.hprof
|
|
50
|
+
```
|
|
51
|
+
(A hardlink shares the inode — the data lives as long as ANY link exists, at zero extra disk cost.)
|
|
52
|
+
|
|
53
|
+
4. **Check tooling.** `python3 --version` (required, 3.8+). `java -version` (only needed if using MAT; JDK 11+ at minimum, and the runbook's standalone download asks for JDK 17+). MAT is optional; see `resources/mat-headless-runbook.md` for install and headless invocation.
|
|
54
|
+
|
|
55
|
+
5. **If the user named a suspected class**, note it for Stage 2. Otherwise Stage 1's histogram will surface candidates.
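
The HPROF check in step 1 needs no external tooling. A minimal sketch, assuming only that the file begins with the `JAVA PROFILE` magic and a NUL-terminated version string; the function name `hprof_version` is illustrative and is not part of the bundled scripts:

```python
from typing import Optional

MAGIC = b"JAVA PROFILE"  # 12 ASCII bytes at the start of an HPROF file


def hprof_version(path: str) -> Optional[str]:
    """Return the format string (e.g. 'JAVA PROFILE 1.0.2'), or None if not HPROF."""
    with open(path, "rb") as f:
        head = f.read(32)  # the header string is short; 32 bytes is ample
    if not head.startswith(MAGIC):
        return None
    nul = head.find(b"\x00")
    if nul == -1:
        return None  # no NUL terminator where the header string should end
    return head[:nul].decode("ascii", errors="replace")
```

A `None` result is the cue to stop and tell the user the file is some other artifact (thread dump, GC log, crash log) rather than pushing it through the pipeline.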
|
|
56
|
+
|
|
57
|
+
## Stages
|
|
58
|
+
|
|
59
|
+
| # | Stage | Tool | Purpose |
|
|
60
|
+
|---|-------|------|---------|
|
|
61
|
+
| 1 | Histogram | `scripts/hprof_histogram.py` | Find which classes are abnormally numerous / large — especially business & third-party classes |
|
|
62
|
+
| 2 | Reverse trace | `scripts/trace_referrers.py` | Walk *up* from a suspect class to its GC-root holder chain (the "path to GC roots") |
|
|
63
|
+
| 3 | MAT cross-check *(optional)* | `resources/mat-headless-runbook.md` | Run MAT's Leak Suspects + dominator tree headless; corroborate Stage 2 |
|
|
64
|
+
| 4 | Precise measure | `scripts/inspect_objects.py` | Nail the exact entry count of the suspect collection / the exact value of config fields |
|
|
65
|
+
| 5 | Report | — | Synthesize: symptom → multi-evidence → leak chain → mechanism → fix |
|
|
66
|
+
|
|
67
|
+
### Stage 1: Histogram — what is abnormal
|
|
68
|
+
|
|
69
|
+
```bash
|
|
70
|
+
python3 scripts/hprof_histogram.py <dump.hprof> --top 50
|
|
71
|
+
```
|
|
72
|
+
Read the four leaderboards. The **business/third-party leaderboards (JDK excluded)** are the most diagnostic — they point at *your* objects, not generic containers. Look for a domain object whose instance count is wildly higher than the number of live "real" things it should represent (e.g. session objects ≫ live TCP connections). That mismatch is the leak signature; that class is the Stage 2 suspect.
|
|
73
|
+
|
|
74
|
+
Sanity-anchor the count against reality: compare the suspect against `io.netty.channel.socket.nio.NioSocketChannel`, `sun.nio.ch.SocketChannelImpl`, `java.io.FileDescriptor` (live connections), or a domain "online user" object. A large gap = retained zombies.
|
|
75
|
+
|
|
76
|
+
### Stage 2: Reverse trace — who retains them
|
|
77
|
+
|
|
78
|
+
```bash
|
|
79
|
+
python3 scripts/trace_referrers.py <dump.hprof> <suspect-class> --hops 6
|
|
80
|
+
```
|
|
81
|
+
Each hop reports who references the current set, **by referrer class and by exact field name**, and flags two GC-root signals:
|
|
82
|
+
- **★ static field holder** — a `static` field of some class directly references the objects (classic static-cache / registry leak).
|
|
83
|
+
- **★ referrer is itself a GC root** — thread, JNI global, sticky class, etc.
|
|
84
|
+
|
|
85
|
+
Follow the chain until it converges on a single container (e.g. one `ConcurrentHashMap$Node[]` table, one singleton holder) or hits a ★ anchor. That holder + field is the leak's retention point — the collection that should have been pruned but wasn't.
|
|
86
|
+
|
|
87
|
+
**Expect the object graph to contain cycles** (e.g. `ClientHead.clientsBox → ClientsBox → map → ClientHead`); the reverse-BFS `next` set will balloon in later hops. That is normal — focus on the *first* hops where a single container/field clearly dominates, and on the ★ anchors.
|
|
88
|
+
|
|
89
|
+
### Stage 3: MAT cross-validation *(OPTIONAL — skip if MAT isn't installed)*
|
|
90
|
+
|
|
91
|
+
**Quick gate:** if `/Applications/MemoryAnalyzer.app` (or a `ParseHeapDump.sh` on PATH) is
|
|
92
|
+
absent, **skip this stage entirely** — the pure-Python Stages 1/2/4 already locate and prove
|
|
93
|
+
the holder chain. Do this stage only when MAT is available and you want the authoritative
|
|
94
|
+
dominator-tree + retained-size corroboration.
|
|
95
|
+
|
|
96
|
+
If Eclipse MAT is available, generate the official Leak Suspects report headless and confirm it names the same holder. **Follow `resources/mat-headless-runbook.md` exactly**: the macOS `.app` launcher fails silently (`exit 14`) from a headless shell, so you must invoke the Equinox launcher jar directly with a raised `-Xmx`, and the dump filename must be free of special characters (the Prerequisites section already handled this).
|
|
97
|
+
|
|
98
|
+
MAT's "Problem Suspect 1" (a dominator occupying the bulk of the heap) and its `Node[N]` accumulation point should match the container Stage 2 converged on. Two independent methods agreeing = high confidence.
|
|
99
|
+
|
|
100
|
+
### Stage 4: Precise measurement — nail it
|
|
101
|
+
|
|
102
|
+
Turn inference into hard numbers. First list the holder's fields, then measure the collection and/or read config fields:
|
|
103
|
+
```bash
|
|
104
|
+
# List the holder class's fields (decide what to read)
|
|
105
|
+
python3 scripts/inspect_objects.py <dump.hprof> --class <holder-class>
|
|
106
|
+
# Measure the suspect collection(s) — entry count via table capacity + non-empty buckets
|
|
107
|
+
python3 scripts/inspect_objects.py <dump.hprof> --class <holder-class> --map-fields <field1,field2>
|
|
108
|
+
# Read config / state fields (e.g. is a timeout misconfigured?)
|
|
109
|
+
python3 scripts/inspect_objects.py <dump.hprof> --class <config-class> --fields <field1,field2>
|
|
110
|
+
# Measure a STATIC cache/registry map (static fields are GC-root-level holders — a top leak source)
|
|
111
|
+
python3 scripts/inspect_objects.py <dump.hprof> --class <util-class> --static-fields <field1,field2>
|
|
112
|
+
```
|
|
113
|
+
This distinguishes the real leak collection from innocent ones (a sibling map with thousands of entries is not the 100k+ one), and reads the actual runtime config that governs cleanup (heartbeat/timeout values, flags). **Note:** `ConcurrentHashMap.baseCount`/`HashMap.size` can read low under concurrency; trust **non-empty bucket count** + **table capacity (a power of two)** for the real magnitude.
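
The "non-empty buckets plus capacity" measurement can be sketched as follows. This is an illustration only, assuming the table's element ids have already been read from its object-array record (0 meaning null); `measure_table` is a hypothetical helper, not the code inside `inspect_objects.py`:

```python
def measure_table(elements):
    """Summarise a hash table's bucket array given its element ids (0 = null)."""
    capacity = len(elements)
    non_empty = sum(1 for e in elements if e != 0)
    return {
        "capacity": capacity,
        # HashMap / ConcurrentHashMap tables are sized to a power of two
        "power_of_two": capacity > 0 and capacity & (capacity - 1) == 0,
        # each occupied bucket normally holds at least one entry,
        # so this is a floor on the entry count, not the exact size
        "non_empty_buckets": non_empty,
        "fill": non_empty / capacity if capacity else 0.0,
    }
```

For example, `measure_table([0, 5, 0, 9, 9, 0, 0, 3])` reports a capacity of 8 with 4 occupied buckets. Because buckets can chain several entries, the true entry count is at or above the occupied-bucket count, which is why it is a safer magnitude check than a counter field.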
|
|
114
|
+
|
|
115
|
+
### Stage 5: Report
|
|
116
|
+
|
|
117
|
+
Synthesize a report in the user's language (Chinese if the user wrote in Chinese). Required structure:
|
|
118
|
+
|
|
119
|
+
1. **One-line conclusion** — the true root cause in a sentence.
|
|
120
|
+
2. **Heap overview** — file size, object count, app stack (framework versions if visible).
|
|
121
|
+
3. **Evidence (multi-route)** — a table: histogram counts, reverse-trace holder chain, MAT suspect %, exact measured entry counts. Show they agree.
|
|
122
|
+
4. **Leak chain** — `GC root → … → holder.field (collection) → leaked objects → their retained subtree`. Quantify the retained subtree (what fills the heap).
|
|
123
|
+
5. **Mechanism root cause** — *why* the collection is never pruned (missing remove on disconnect, listener never deregistered, unbounded cache, misconfigured timeout, framework bug, …). Cite the measured config/code evidence. Be explicit about what is *proven* vs *inferred*.
|
|
124
|
+
6. **Fix recommendations** — ordered: permanent fix (code/version upgrade), config correction, guardrails (limits/monitoring/alerts), temporary mitigation (heap bump + rolling restart / scheduled cleanup).
|
|
125
|
+
7. **Data-safety note** — if the original file was renamed/deleted, tell the user which hardlink now holds the data.
|
|
126
|
+
|
|
127
|
+
## Graceful degradation & scaling
|
|
128
|
+
|
|
129
|
+
- **No MAT / MAT won't run** → Stages 1, 2, 4 (pure-Python) are fully sufficient to locate and prove the holder chain. MAT is corroboration, not a dependency.
|
|
130
|
+
- **Very large dumps / multiple suspects** → run Stage 2 on each suspect; the scripts are streaming and re-runnable. You may fan out independent traces as subagents and synthesize.
|
|
131
|
+
- **Unfamiliar framework** → after Stage 2 names the holder class + field, look up that library's source for the add/remove lifecycle of that collection to explain the *mechanism* (Stage 5). Quote method names.
|
|
132
|
+
|
|
133
|
+
## Resources
|
|
134
|
+
|
|
135
|
+
- `resources/methodology.md` — the full five-stage leak-hunting methodology, decision heuristics, and a worked end-to-end example.
|
|
136
|
+
- `resources/mat-headless-runbook.md` — installing Eclipse MAT and running it **headless** (the exit-14 pitfall, Equinox launcher invocation, `-Xmx`, reading the report zips with `textutil`).
|
|
137
|
+
- `resources/hprof-internals.md` — HPROF binary format reference and how the bundled scripts parse it (so you can extend them for a new question).
|
|
package/src/core-skills/xiaoma-heap-dump-analysis/resources/hprof-internals.md
@@ -0,0 +1,157 @@
|
|
|
1
|
+
# HPROF Binary Format — Reference & How the Scripts Parse It
|
|
2
|
+
|
|
3
|
+
The bundled scripts are self-contained streaming parsers of the HPROF binary format. This
|
|
4
|
+
note documents the format and the parsing strategy so you can **extend the scripts** for a
|
|
5
|
+
new question (e.g. "dump every field of object X", "list the keys of map Y", "histogram of
|
|
6
|
+
retained subtree of class Z").
|
|
7
|
+
|
|
8
|
+
## File layout
|
|
9
|
+
|
|
10
|
+
```
|
|
11
|
+
[format string, NUL-terminated] e.g. "JAVA PROFILE 1.0.1" / "1.0.2"
|
|
12
|
+
[identifier size : u4] usually 8 (64-bit, no compressed-oop in the dump format)
|
|
13
|
+
[timestamp : u8]
|
|
14
|
+
[record]* a flat sequence of top-level records
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
Each top-level record:
|
|
18
|
+
|
|
19
|
+
```
|
|
20
|
+
[tag : u1][time : u4][length : u4][body : <length> bytes]
|
|
21
|
+
```
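
The two layouts above translate directly into a streaming reader. A minimal sketch, assuming big-endian fields as HPROF uses; `read_header` and `iter_records` are illustrative names and not necessarily how the bundled scripts are organised:

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def read_header(f: BinaryIO) -> Tuple[str, int]:
    """Consume the file header; return (format string, identifier size)."""
    fmt = bytearray()
    while True:
        b = f.read(1)
        if not b:
            raise ValueError("truncated header")
        if b == b"\x00":
            break
        fmt += b
    rest = f.read(12)  # identifier size (u4) + timestamp (u8)
    if len(rest) != 12:
        raise ValueError("truncated header")
    id_size, _timestamp = struct.unpack(">IQ", rest)
    return fmt.decode("ascii"), id_size


def iter_records(f: BinaryIO) -> Iterator[Tuple[int, int, int]]:
    """Yield (tag, body_offset, length) per top-level record without reading bodies."""
    while True:
        hdr = f.read(9)  # tag (u1) + time (u4) + length (u4)
        if len(hdr) < 9:
            return
        tag, _time, length = struct.unpack(">BII", hdr)
        body_offset = f.tell()
        yield tag, body_offset, length
        f.seek(body_offset + length)  # skip the body, or whatever the caller left unread
```

Seeking to an absolute offset after each `yield` means the caller can read as much or as little of a body as it likes without losing alignment.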
|
|
22
|
+
|
|
23
|
+
### Top-level tags used
|
|
24
|
+
|
|
25
|
+
| tag | name | body |
|
|
26
|
+
|-----|------|------|
|
|
27
|
+
| 0x01 | STRING (UTF-8) | `id` + UTF-8 bytes → the string table (class names, field names) |
|
|
28
|
+
| 0x02 | LOAD_CLASS | class serial(u4) + **class object id** + stack serial(u4) + **name string id** |
|
|
29
|
+
| 0x0C | HEAP_DUMP | a container of heap sub-records |
|
|
30
|
+
| 0x1C | HEAP_DUMP_SEGMENT | same as 0x0C, just chunked (large dumps emit several) |
|
|
31
|
+
| 0x2C | HEAP_DUMP_END | marker |
|
|
32
|
+
|
|
33
|
+
Other tags (STACK_FRAME, STACK_TRACE, ALLOC_SITES, etc.) are skipped via `length`.
|
|
34
|
+
|
|
35
|
+
## Heap-dump sub-records
|
|
36
|
+
|
|
37
|
+
Inside a 0x0C/0x1C body is a sequence of sub-records, each led by a 1-byte sub-tag. To stay
|
|
38
|
+
aligned you must consume **exactly** the right number of bytes for every sub-tag — including
|
|
39
|
+
GC-root records you don't care about.
|
|
40
|
+
|
|
41
|
+
### GC roots (fixed sizes — must skip precisely)
|
|
42
|
+
|
|
43
|
+
| sub-tag | root kind | after the leading `id`, extra bytes |
|
|
44
|
+
|---------|-----------|-------------------------------------|
|
|
45
|
+
| 0xFF | UNKNOWN | 0 |
|
|
46
|
+
| 0x01 | JNI GLOBAL | + 1 `id` (JNI ref) |
|
|
47
|
+
| 0x02 | JNI LOCAL | + u4 + u4 (= 8) |
|
|
48
|
+
| 0x03 | JAVA FRAME | + u4 + u4 (= 8) |
|
|
49
|
+
| 0x04 | NATIVE STACK | + u4 (= 4) |
|
|
50
|
+
| 0x05 | STICKY CLASS | 0 |
|
|
51
|
+
| 0x06 | THREAD BLOCK | + u4 (= 4) |
|
|
52
|
+
| 0x07 | MONITOR USED | 0 |
|
|
53
|
+
| 0x08 | THREAD OBJECT | + u4 + u4 (= 8) |
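
The table above can be encoded as a lookup so that every GC-root sub-record is skipped by exactly the right amount. A sketch under the assumption that the heap-dump body is available as a `bytes` or `memoryview` object; the names `GC_ROOT_EXTRA`, `gc_root_size` and `collect_root` are illustrative:

```python
# Extra bytes after the leading id, per GC-root sub-tag.
# The string "id" means one additional identifier of id_size bytes.
GC_ROOT_EXTRA = {
    0xFF: 0,     # UNKNOWN
    0x01: "id",  # JNI GLOBAL
    0x02: 8,     # JNI LOCAL
    0x03: 8,     # JAVA FRAME
    0x04: 4,     # NATIVE STACK
    0x05: 0,     # STICKY CLASS
    0x06: 4,     # THREAD BLOCK
    0x07: 0,     # MONITOR USED
    0x08: 8,     # THREAD OBJECT
}


def gc_root_size(sub_tag, id_size):
    """Bytes occupied by a GC-root sub-record, excluding the 1-byte sub-tag."""
    extra = GC_ROOT_EXTRA[sub_tag]  # KeyError means: not a GC-root sub-tag
    return id_size + (id_size if extra == "id" else extra)


def collect_root(buf, pos, id_size, roots):
    """buf[pos] is a GC-root sub-tag: record the rooted id, return the next position."""
    sub_tag = buf[pos]
    start = pos + 1
    roots.add(int.from_bytes(buf[start:start + id_size], "big"))
    return start + gc_root_size(sub_tag, id_size)
```

Letting an unknown sub-tag raise `KeyError` mirrors the misalignment canary described at the end of this note.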
|
|
54
|
+
|
|
55
|
+
The leading `id` of each is the rooted object — collect these into a set to detect
|
|
56
|
+
"referrer is itself a GC root."
|
|
57
|
+
|
|
58
|
+
### 0x20 CLASS_DUMP (the tricky one)
|
|
59
|
+
|
|
60
|
+
```
|
|
61
|
+
class object id : id
|
|
62
|
+
stack trace serial : u4
|
|
63
|
+
super class object id : id
|
|
64
|
+
class loader id : id
|
|
65
|
+
signers id : id
|
|
66
|
+
protection domain id : id
|
|
67
|
+
reserved : id
|
|
68
|
+
reserved : id
|
|
69
|
+
instance size : u4
|
|
70
|
+
constant pool count : u2, then each: cp-index(u2) + type(u1) + value(size-by-type)
|
|
71
|
+
static fields count : u2, then each: name string id(id) + type(u1) + value(size-by-type)
|
|
72
|
+
instance fields count : u2, then each: name string id(id) + type(u1)
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
From CLASS_DUMP the scripts cache: `super` (to walk the chain), the **instance field list**
|
|
76
|
+
`(type, name-id)` in declaration order, and **static field references** (object-typed
|
|
77
|
+
statics — these are GC-root-level holders and how static-cache leaks are detected).
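
A CLASS_DUMP walk that keeps only those three things might look like the sketch below. It assumes `buf` holds the heap-dump body and `pos` points just past the `0x20` sub-tag; `parse_class_dump` and the returned dictionary shape are illustrative, not the bundled scripts' actual structures:

```python
import struct

# Sizes of primitive type tags; tag 2 (object reference) is id_size bytes.
TYPE_SIZE = {4: 1, 5: 2, 6: 4, 7: 8, 8: 1, 9: 2, 10: 4, 11: 8}


def parse_class_dump(buf, pos, id_size):
    """Parse one CLASS_DUMP body at buf[pos]; return (info, next_pos)."""
    def size_of(t):
        return id_size if t == 2 else TYPE_SIZE[t]

    def read_id(p):
        return int.from_bytes(buf[p:p + id_size], "big"), p + id_size

    class_id, pos = read_id(pos)
    pos += 4                    # stack trace serial
    super_id, pos = read_id(pos)
    pos += 5 * id_size          # loader, signers, protection domain, 2 reserved
    pos += 4                    # instance size

    (cp_count,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    for _ in range(cp_count):
        t = buf[pos + 2]        # cp-index (u2) comes first, then the type tag
        pos += 3 + size_of(t)

    (n_static,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    static_refs = []
    for _ in range(n_static):
        name_id, pos = read_id(pos)
        t = buf[pos]
        pos += 1
        if t == 2:              # object-typed static: a potential holder
            ref, _ = read_id(pos)
            if ref:
                static_refs.append((name_id, ref))
        pos += size_of(t)

    (n_inst,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    fields = []
    for _ in range(n_inst):
        name_id, pos = read_id(pos)
        fields.append((buf[pos], name_id))  # (type, name-id), declaration order
        pos += 1

    info = {"class_id": class_id, "super_id": super_id,
            "static_refs": static_refs, "fields": fields}
    return info, pos
```

The returned `next_pos` must equal the start of the next sub-record; checking that against a known-good dump is a quick way to validate any change to the skip logic.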
|
|
78
|
+
|
|
79
|
+
### 0x21 INSTANCE_DUMP
|
|
80
|
+
|
|
81
|
+
```
|
|
82
|
+
object id : id
|
|
83
|
+
stack trace serial : u4
|
|
84
|
+
class object id : id
|
|
85
|
+
num bytes that follow : u4
|
|
86
|
+
field values : <num bytes> (raw, laid out per the class's field layout)
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
**Field layout order:** the values are this class's instance fields in declaration order,
|
|
90
|
+
**then the super class's, then its super's**, … each field occupying `size-by-type` bytes.
|
|
91
|
+
To read a specific field you compute its byte offset by walking
|
|
92
|
+
`class → super → super…`, summing field sizes; reference fields (type 2) hold an `id`.
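
The offset computation can be sketched as a short walk over cached class metadata. This assumes field names have already been resolved through the string table and that `classes` has the hypothetical shape shown in the docstring; `field_offset` is an illustrative name:

```python
# Sizes of primitive type tags; tag 2 (object reference) is id_size bytes.
TYPE_SIZE = {4: 1, 5: 2, 6: 4, 7: 8, 8: 1, 9: 2, 10: 4, 11: 8}


def field_offset(classes, class_id, field_name, id_size=8):
    """Return (byte offset, type tag) of a field inside INSTANCE_DUMP field bytes.

    `classes` maps class id -> {"super": super id or 0,
                                "fields": [(type_tag, name), ...]}.
    """
    offset = 0
    cid = class_id
    while cid:
        info = classes[cid]
        for type_tag, name in info["fields"]:
            if name == field_name:
                return offset, type_tag
            offset += id_size if type_tag == 2 else TYPE_SIZE[type_tag]
        cid = info["super"]     # own fields come first, then the super class's
    raise KeyError(field_name)
```

With a class declaring `int count; Object table;` over a super class declaring `long stamp; Object owner;`, the `owner` reference sits at offset 4 + 8 + 8 = 20 when identifiers are 8 bytes wide.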
|
|
93
|
+
|
|
94
|
+
### 0x22 OBJECT_ARRAY_DUMP
|
|
95
|
+
|
|
96
|
+
```
|
|
97
|
+
array object id : id
|
|
98
|
+
stack serial : u4
|
|
99
|
+
num elements : u4
|
|
100
|
+
array class id : id
|
|
101
|
+
elements : num × id
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
### 0x23 PRIMITIVE_ARRAY_DUMP
|
|
105
|
+
|
|
106
|
+
```
|
|
107
|
+
array object id : id
|
|
108
|
+
stack serial : u4
|
|
109
|
+
num elements : u4
|
|
110
|
+
element type : u1
|
|
111
|
+
elements : num × size-by-type
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
## Basic type tags & sizes
|
|
115
|
+
|
|
116
|
+
| tag | type | size |
|
|
117
|
+
|-----|------|------|
|
|
118
|
+
| 2 | object (ref) | `id_size` (usually 8) |
|
|
119
|
+
| 4 | boolean | 1 |
|
|
120
|
+
| 5 | char | 2 |
|
|
121
|
+
| 6 | float | 4 |
|
|
122
|
+
| 7 | double | 8 |
|
|
123
|
+
| 8 | byte | 1 |
|
|
124
|
+
| 9 | short | 2 |
|
|
125
|
+
| 10 | int | 4 |
|
|
126
|
+
| 11 | long | 8 |
|
|
127
|
+
|
|
128
|
+
## Parsing strategy used by the scripts
|
|
129
|
+
|
|
130
|
+
- **Streaming, low memory.** Read each top-level record header (9 bytes), then read the
|
|
131
|
+
body only for records of interest (STRING, LOAD_CLASS, HEAP_DUMP); `seek()` past the rest.
|
|
132
|
+
Heap-dump bodies are walked sub-record by sub-record over a `memoryview` (no copies).
|
|
133
|
+
- **Class metadata is cached once**, object counts/fields are never all held in memory — so
|
|
134
|
+
memory use is independent of object count (works on 40M-object dumps).
|
|
135
|
+
- **Reference offsets** (`get_refoffs`) precompute, per class, the byte offsets of all
|
|
136
|
+
reference-typed fields (self + supers) for fast referrer scanning in `trace_referrers.py`.
|
|
137
|
+
- **Reverse tracing = repeated full streaming passes.** Each hop re-scans the file, testing
|
|
138
|
+
every object's references against the current target id-set. Slower than building an
|
|
139
|
+
in-memory graph but uses trivial memory and is robust on huge dumps.
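
The repeated-pass idea in the last bullet can be sketched independently of the binary format. The example assumes a hypothetical `scan()` callable that re-reads the dump and yields `(object id, class name, [(field, referenced id), ...])`; it also tracks visited ids, which is a choice of this sketch and not a statement about `trace_referrers.py`:

```python
from collections import Counter


def one_hop(objects, targets):
    """One full pass: who references any id in `targets`, grouped by (class, field)."""
    referrers = set()
    by_field = Counter()
    for obj_id, class_name, refs in objects:
        for field, ref in refs:
            if ref in targets:
                referrers.add(obj_id)
                by_field[(class_name, field)] += 1
    return referrers, by_field


def trace(scan, start_ids, hops):
    """Reverse BFS; `scan()` must return a fresh iterator over the dump per hop."""
    targets = set(start_ids)
    seen = set(targets)
    report = []
    for _ in range(hops):
        referrers, by_field = one_hop(scan(), targets)
        report.append(by_field)
        targets = referrers - seen   # do not revisit objects reached via a cycle
        if not targets:
            break
        seen |= targets
    return report
```

Only the current target id-set and the per-hop counters stay in memory, which is the trade the bullet describes: more passes over the file in exchange for a footprint that does not grow with the object graph.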
|
|
140
|
+
|
|
141
|
+
## Extending the scripts
|
|
142
|
+
|
|
143
|
+
Common one-off questions and where to hook in:
|
|
144
|
+
|
|
145
|
+
- **Print every field of a specific object id** → in a heap walk, when
|
|
146
|
+
`INSTANCE_DUMP.object id == target`, decode each field with the class's
|
|
147
|
+
`layout` (see `inspect_objects.py`'s `read_val` + `layout`).
|
|
148
|
+
- **Enumerate a map's keys/values** → read the map's `table` (a `Node[]`), then for each
|
|
149
|
+
non-null `Node` follow `key`/`val`/`next`; resolve value object classes via the
|
|
150
|
+
class-id → name table.
|
|
151
|
+
- **Retained-style grouping** → for a target class, in one pass collect each instance's
|
|
152
|
+
outgoing refs, then attribute their shallow sizes (approximate; real dominator/retained
|
|
153
|
+
needs MAT).
|
|
154
|
+
|
|
155
|
+
Keep the GC-root and CLASS_DUMP skip logic byte-exact — a single miscounted field
|
|
156
|
+
desynchronizes the entire stream. The scripts raise on an unknown sub-tag, which is the
|
|
157
|
+
canary for a misalignment.
|
|
package/src/core-skills/xiaoma-heap-dump-analysis/resources/mat-headless-runbook.md
@@ -0,0 +1,108 @@
|
|
|
1
|
+
# Eclipse MAT — Headless Runbook (and pitfalls)
|
|
2
|
+
|
|
3
|
+
Eclipse Memory Analyzer (MAT) is the gold standard for `.hprof` analysis: it builds a
|
|
4
|
+
dominator tree and produces a **Leak Suspects** report. This skill uses it only as
|
|
5
|
+
**cross-validation** for the pure-Python scripts — but when it runs, it is authoritative.
|
|
6
|
+
This runbook captures the exact invocation and the traps that make it fail *silently*.
|
|
7
|
+
|
|
8
|
+
## Install
|
|
9
|
+
|
|
10
|
+
macOS (Apple Silicon or Intel), via Homebrew:
|
|
11
|
+
|
|
12
|
+
```bash
|
|
13
|
+
brew install --cask mat # cask is named "mat" (Eclipse Memory Analyzer)
|
|
14
|
+
# installs to /Applications/MemoryAnalyzer.app
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
Other platforms: download the **Standalone / RCP** build from
|
|
18
|
+
<https://eclipse.dev/mat/download/> (needs a JDK 17+ on PATH).
|
|
19
|
+
|
|
20
|
+
## ⚠️ Pitfall 1 — the macOS `.app` launcher exits 14, silently
|
|
21
|
+
|
|
22
|
+
Running `…/MemoryAnalyzer.app/Contents/Eclipse/ParseHeapDump.sh` (which calls the
|
|
23
|
+
`Contents/MacOS/MemoryAnalyzer` binary) from a **headless shell** (no GUI/WindowServer
|
|
24
|
+
session) fails with **exit code 14, zero output, no index files**. The Cocoa launcher
|
|
25
|
+
forces `-XstartOnFirstThread` and cannot reach the window server.
|
|
26
|
+
|
|
27
|
+
**Do not fight `ParseHeapDump.sh`.** Invoke the Equinox launcher jar directly with your own
|
|
28
|
+
JDK — fully headless, no GUI bootstrap:
|
|
29
|
+
|
|
30
|
+
```bash
|
|
31
|
+
ECL=/Applications/MemoryAnalyzer.app/Contents/Eclipse
|
|
32
|
+
LAUNCHER=$(ls "$ECL"/plugins/org.eclipse.equinox.launcher_*.jar)
|
|
33
|
+
|
|
34
|
+
java -Xmx12g \
|
|
35
|
+
--add-exports=java.base/jdk.internal.org.objectweb.asm=ALL-UNNAMED \
|
|
36
|
+
-Djava.awt.headless=true \
|
|
37
|
+
-Dosgi.install.area="file:$ECL/" \
|
|
38
|
+
-Dosgi.configuration.area=file:/tmp/mat_cfg/ \
|
|
39
|
+
-jar "$LAUNCHER" \
|
|
40
|
+
-consoleLog -nosplash -data /tmp/mat_ws \
|
|
41
|
+
-application org.eclipse.mat.api.parse \
|
|
42
|
+
'/path/to/dump_clean.hprof' \
|
|
43
|
+
org.eclipse.mat.api:suspects org.eclipse.mat.api:top_components
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
Notes:
|
|
47
|
+
|
|
48
|
+
- `-Dosgi.install.area` / `-Dosgi.configuration.area` are **required** — without them the
|
|
49
|
+
launcher loads no MAT bundle and the application is a no-op (the same silent exit 14).
|
|
50
|
+
Point `configuration.area` at a writable dir (the install `configuration/` is read-only).
|
|
51
|
+
- `-Xmx`: give MAT roughly **the dump size or more** (12g for a 2.6 GB dump is comfortable;
|
|
52
|
+
MAT's indexing is efficient but needs headroom). If you instead use `ParseHeapDump.sh` on
|
|
53
|
+
Linux/CI, raise `-Xmx` by editing `MemoryAnalyzer.ini`'s `-vmargs` block.
|
|
54
|
+
- Report ids: `org.eclipse.mat.api:suspects` (Leak Suspects),
|
|
55
|
+
`org.eclipse.mat.api:top_components`, `org.eclipse.mat.api:overview`.
|
|
56
|
+
|
|
57
|
+
## ⚠️ Pitfall 2 — special characters in the dump filename
|
|
58
|
+
|
|
59
|
+
`%`, spaces, etc. (common from `-XX:HeapDumpPath=app_%p.hprof`) break Equinox/URI handling
|
|
60
|
+
and cause the *same* silent exit 14. **Always analyze a clean-named hardlink** (see the
|
|
61
|
+
skill's Prerequisites). This also protects the data if the original is later deleted.
|
|
62
|
+
|
|
63
|
+
## ⚠️ Pitfall 3 — Gatekeeper quarantine (macOS)
|
|
64
|
+
|
|
65
|
+
A freshly cask-installed app may be quarantined. If launch is blocked:
|
|
66
|
+
|
|
67
|
+
```bash
|
|
68
|
+
xattr -dr com.apple.quarantine /Applications/MemoryAnalyzer.app
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
## Outputs
|
|
72
|
+
|
|
73
|
+
A successful parse writes, next to the dump:
|
|
74
|
+
|
|
75
|
+
- Index files: `*.index`, `*.idx.index`, `*.domIn.index`, `*.domOut.index`,
|
|
76
|
+
`*.o2ret.index`, `*.threads`, … (the dominator tree + retained-size indices).
|
|
77
|
+
- Report zips: `<name>_Leak_Suspects.zip`, `<name>_Top_Components.zip`,
|
|
78
|
+
`<name>_System_Overview.zip`.
|
|
79
|
+
|
|
80
|
+
The presence of `*.index` files is the reliable "parsing actually started" signal when
|
|
81
|
+
watching progress.
|
|
82
|
+
|
|
83
|
+
## Reading the report headlessly
|
|
84
|
+
|
|
85
|
+
The zips contain HTML pages (`index.html` plus `pages/*.html`). After unzipping, convert them to text with macOS `textutil` (or any HTML-to-text tool):
|
|
86
|
+
|
|
87
|
+
```bash
|
|
88
|
+
cd /tmp && rm -rf leak && mkdir leak && cd leak
|
|
89
|
+
unzip -o '/path/to/<name>_Leak_Suspects.zip' >/dev/null
|
|
90
|
+
textutil -convert txt -stdout index.html | sed '/^[[:space:]]*$/d' | head -120
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
What to extract from **Leak Suspects**:
|
|
94
|
+
|
|
95
|
+
- **Problem Suspect 1**: "one instance of `<HolderClass>` … occupies `N` bytes (`P`%) …"
|
|
96
|
+
→ the dominator and its retained-heap share.
|
|
97
|
+
- "The memory is accumulated in one instance of `<X>[capacity]`" → the accumulation point
|
|
98
|
+
(often a `ConcurrentHashMap$Node[]` / `Object[]` table).
|
|
99
|
+
- "Thread `…` has a local variable … on the shortest path to …" → the GC-root path.
|
|
100
|
+
|
|
101
|
+
Confirm this holder matches what `trace_referrers.py` converged on. Detailed dominator
|
|
102
|
+
trees and per-object paths live in `pages/*.html` and `_Top_Components` / `_System_Overview`.
|
|
103
|
+
|
|
104
|
+
## Quick decision
|
|
105
|
+
|
|
106
|
+
- MAT runs → use it to corroborate + get retained sizes. Strongest evidence.
|
|
107
|
+
- MAT won't run after ~2 attempts → **don't keep fighting it.** The pure-Python Stages 1/2/4
|
|
108
|
+
fully locate and prove the holder chain on their own.
|
|
package/src/core-skills/xiaoma-heap-dump-analysis/resources/methodology.md
@@ -0,0 +1,148 @@
|
|
|
1
|
+
# Heap Leak Hunting — Methodology
|
|
2
|
+
|
|
3
|
+
The single idea behind this skill: **a histogram tells you *what* is numerous; it does not
|
|
4
|
+
tell you *who retains it* or *why it is never freed*.** Generic containers (`char[]`,
|
|
5
|
+
`String`, `HashMap$Node`, `Object[]`) always top the histogram and almost never name the
|
|
6
|
+
bug. Root-causing a leak means walking *up the reference graph* to the GC-root holder, then
|
|
7
|
+
explaining the *mechanism* that keeps the collection growing.
|
|
8
|
+
|
|
9
|
+
Every conclusion should be reached by **at least two independent routes** so it is not an
|
|
10
|
+
artifact of one tool.
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## Stage 1 — Histogram: find the abnormal class
|
|
15
|
+
|
|
16
|
+
Run `scripts/hprof_histogram.py`. Ignore the all-classes board first; read the
|
|
17
|
+
**business / third-party board (JDK excluded)**. You are hunting for a *domain* class whose
|
|
18
|
+
instance count is wildly larger than the number of real things it should represent.
|
|
19
|
+
|
|
20
|
+
**Anchor the count against reality.** Pick objects whose count equals "live work":
|
|
21
|
+
|
|
22
|
+
- `io.netty.channel.socket.nio.NioSocketChannel`, `sun.nio.ch.SocketChannelImpl`,
|
|
23
|
+
`java.io.FileDescriptor` → number of live TCP connections / sockets.
|
|
24
|
+
- a domain "online user / active session" object, thread-pool size, etc.
|
|
25
|
+
|
|
26
|
+
If a session/handler/listener/entry class has, say, 190k instances while live connections
|
|
27
|
+
are 40k and active users are 6k, the surplus (~150k over live connections, ~184k over active users) is **retained zombies**. That class is
|
|
28
|
+
your Stage-2 suspect.
|
|
29
|
+
|
|
30
|
+
Cross-check internal consistency: leaked parents usually drag fixed-ratio children. If
|
|
31
|
+
`SuspectA = N`, look for classes at `≈N`, `≈2N`, `≈k·N` — they confirm a whole subtree
|
|
32
|
+
leaks together (e.g. each leaked session keeps 1 handshake + 2 transport states + 1 store).
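
The fixed-ratio cross-check is easy to automate once the histogram counts are in a dictionary. A minimal sketch; `ratio_companions`, the 2% tolerance and the `max_k` cap are assumptions chosen for illustration, not values taken from `hprof_histogram.py`:

```python
def ratio_companions(counts, suspect, tolerance=0.02, max_k=8):
    """Return {class: k} for classes whose count is within `tolerance` of k * N,
    where N is the suspect's instance count and k is a small positive integer."""
    n = counts[suspect]
    out = {}
    for cls, c in counts.items():
        if cls == suspect or n == 0:
            continue
        k = round(c / n)
        if 1 <= k <= max_k and abs(c - k * n) <= tolerance * k * n:
            out[cls] = k
    return out
```

With a suspect at 190,000 instances, a class at 379,500 would be reported with `k = 2` and a class at 250,000 would be ignored. Matches are only a hint that a subtree leaks together; Stage 2 still has to show the references.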
|
|
33
|
+
|
|
34
|
+
---
|
|
35
|
+
|
|
36
|
+
## Stage 2 — Reverse-reference trace: who retains them
|
|
37
|
+
|
|
38
|
+
Run `scripts/trace_referrers.py <dump> <suspect-class>`. It implements MAT's
|
|
39
|
+
"path to GC roots" in pure Python: from every instance of the suspect, find referrers via
|
|
40
|
+
**instance fields, array elements, and static fields**, hop by hop, until reaching a
|
|
41
|
+
GC root or a single accumulation container.
|
|
42
|
+
|
|
43
|
+
Read each hop for:
|
|
44
|
+
|
|
45
|
+
- **The dominant referrer field.** `('java.util.concurrent.ConcurrentHashMap$Node', 'val')`
|
|
46
|
+
with a count ≈ suspect count means "they live as *values* of a ConcurrentHashMap."
|
|
47
|
+
Next hops climb `Node → Node[] table → ConcurrentHashMap → holder.field`.
|
|
48
|
+
- **★ static field holder** — a `static` field directly references the set. This is the
|
|
49
|
+
smoking gun for static-registry / static-cache leaks; the holder is a GC root by itself.
|
|
50
|
+
- **★ referrer is itself a GC root** — thread (local var / thread object), JNI global,
|
|
51
|
+
sticky class.
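Finding the dominant referrer field amounts to tallying `(referrer class, field)` pairs over everything that points at the suspect set. A minimal sketch, assuming the same hypothetical toy object model of `id -> (class_name, {field: target_id})`; the real script's data structures may differ.

```python
from collections import Counter


def dominant_referrers(objects, targets):
    """Count (referrer_class, field) pairs whose field points into `targets`,
    most frequent first."""
    tally = Counter()
    for _oid, (cls, fields) in objects.items():
        for field, target in fields.items():
            if target in targets:
                tally[(cls, field)] += 1
    return tally.most_common()


# Toy heap: three sessions held as map values, one also held by a handler.
toy = {
    1: ("java.util.concurrent.ConcurrentHashMap$Node", {"val": 10}),
    2: ("java.util.concurrent.ConcurrentHashMap$Node", {"val": 11}),
    3: ("java.util.concurrent.ConcurrentHashMap$Node", {"val": 12}),
    4: ("Handler", {"session": 10}),
    10: ("Session", {}),
    11: ("Session", {}),
    12: ("Session", {}),
}
ranking = dominant_referrers(toy, targets={10, 11, 12})
for (cls, field), n in ranking:
    print(f"{n:>3}  {cls}.{field}")
```

A top entry whose count is close to the suspect count is the one to climb from; low-count entries such as the handler here are usually incidental references.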
|
|
52
|
+
|
|
53
|
+
**Convergence is the goal.** You are looking for the hop where many objects funnel into
|
|
54
|
+
*one* container instance (one big `Node[]` table, one singleton holder). That container's
|
|
55
|
+
owning field is the retention point.
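One way to make "convergence" concrete is to ask, at each hop, what share of the frontier is held by the single most common referrer instance. The sketch below is an assumption about how to express that test; the function name and the 90% threshold are illustrative, not taken from the scripts.

```python
from collections import Counter


def convergence(referrer_ids, threshold=0.9):
    """Given the referrer instance id for each object in a hop, return
    (instance_id, share) if one instance holds at least `threshold` of
    them, else None."""
    if not referrer_ids:
        return None
    instance, n = Counter(referrer_ids).most_common(1)[0]
    share = n / len(referrer_ids)
    return (instance, share) if share >= threshold else None


# 95 nodes sit in table instance 7, 5 in an unrelated table 8.
funnel = convergence([7] * 95 + [8] * 5)
# An even split across many containers does not converge.
spread = convergence([1, 2, 3, 4])
print(funnel, spread)
```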
|
|
56
|
+
|
|
57
|
+
**Cycles are expected.** Frameworks wire back-references (child holds a pointer to the
|
|
58
|
+
registry that holds the child), so the reverse-BFS frontier *grows* after a few hops. Don't
|
|
59
|
+
chase the ballooning frontier — read the *early* hops and the ★ anchors.
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## Stage 3 — MAT cross-validation (optional)
|
|
64
|
+
|
|
65
|
+
If Eclipse MAT runs (see `mat-headless-runbook.md`), its **Leak Suspects** report should
|
|
66
|
+
independently name the same holder ("Problem Suspect 1: one instance of X occupies N% …
|
|
67
|
+
accumulated in one `Node[capacity]`"). Two unrelated methods agreeing turns a strong
|
|
68
|
+
inference into a confident finding. MAT also gives **retained size** (true cost of the
|
|
69
|
+
subtree), which the histogram's shallow size cannot.
|
|
70
|
+
|
|
71
|
+
---
|
|
72
|
+
|
|
73
|
+
## Stage 4 — Precise measurement: turn inference into numbers
|
|
74
|
+
|
|
75
|
+
Run `scripts/inspect_objects.py` to:
|
|
76
|
+
|
|
77
|
+
- **Measure the suspect collection's true size** — `--map-fields`. Distinguish the real
|
|
78
|
+
leak map (100k+ entries, huge power-of-two table) from innocent siblings (thousands).
|
|
79
|
+
- **Read the runtime config that governs cleanup** — `--fields`. Heartbeat/timeout values,
|
|
80
|
+
feature flags, max sizes. A misconfigured timeout (e.g. 100× the default) often explains
|
|
81
|
+
*why* cleanup never fires.
|
|
82
|
+
|
|
83
|
+
> **Trust non-empty bucket count + table capacity, not `size`/`baseCount`.**
|
|
84
|
+
> `ConcurrentHashMap` spreads its counter across `counterCells` under contention, so
|
|
85
|
+
> `baseCount` can read absurdly low (e.g. 552 for a map with ~190k entries). The
|
|
86
|
+
> `Node[]` table capacity (a power of two) and the non-empty bucket count are reliable.
|
|
87
|
+
> The script flags this automatically (⚠️) when `size` and non-empty buckets differ by >10×.
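To see why bucket occupancy is a usable proxy, the entry count can be estimated from the table capacity and the non-empty bucket count. The sketch assumes hashes spread roughly uniformly over the table (a balls-in-bins model), which is an assumption and not something the heap proves. The sample figures are the ones quoted in this document's worked example, and the 10× check is only a guess at the rule described above, not the script's actual code.

```python
import math


def estimate_entries(table_capacity, non_empty_buckets):
    """Estimate entry count n from occupancy, assuming uniform hashing:
    expected non-empty buckets = m * (1 - exp(-n / m))."""
    if not 0 <= non_empty_buckets < table_capacity:
        raise ValueError("non_empty_buckets must be in [0, table_capacity)")
    return -table_capacity * math.log(1 - non_empty_buckets / table_capacity)


def counter_looks_wrong(reported_size, non_empty_buckets, factor=10):
    """True when the two figures disagree by more than `factor` (illustrative rule)."""
    lo, hi = sorted((reported_size, non_empty_buckets))
    return hi > factor * max(lo, 1)


estimated = estimate_entries(524_288, 160_989)
print(f"estimated entries: {estimated:,.0f}")
print("flag baseCount=552:", counter_looks_wrong(552, 160_989))
```

Every non-empty bucket holds at least one entry, so the bucket count is a hard lower bound on the map size; the estimate above only adds an allowance for collisions.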
|
|
88
|
+
|
|
89
|
+
When comparing sibling collections, remember **they may hold different things**: a registry
|
|
90
|
+
keyed by session-id can be huge while a sibling keyed by live channel stays small (it only
|
|
91
|
+
holds currently-connected transports, not every session). Don't expect every map's size to
|
|
92
|
+
equal the live-connection count — judge each against *what it is supposed to hold*. The leak
|
|
93
|
+
is the one whose size has no business being that large.
|
|
94
|
+
|
|
95
|
+
---
|
|
96
|
+
|
|
97
|
+
## Stage 5 — Mechanism & report
|
|
98
|
+
|
|
99
|
+
Locating the holder is half the job; explain **why it grows unbounded**:
|
|
100
|
+
|
|
101
|
+
- **Static cache / registry never pruned** — entries added on event A, removal depends on
|
|
102
|
+
event B that doesn't always happen.
|
|
103
|
+
- **Listener / callback never deregistered** — observer holds the subject (or vice-versa).
|
|
104
|
+
- **Connection / session not removed on disconnect** — cleanup depends on a timeout or an
|
|
105
|
+
explicit close callback that some paths skip.
|
|
106
|
+
- **Unbounded queue / buffer** — producer outruns consumer.
|
|
107
|
+
- **`ThreadLocal` on a pooled thread** — value outlives the request.
|
|
108
|
+
- **`ClassLoader` leak** — a long-lived object pins a webapp/plugin classloader.
|
|
109
|
+
- **Misconfigured timeout / TTL** — cleanup mechanism exists but is effectively disabled
|
|
110
|
+
(set to 0) or set so large it never fires.
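The "added on event A, removed only on event B" pattern is easy to reproduce in miniature. The toy registry below is hypothetical and not modeled on any particular library; it only shows how a collection grows when some code paths never reach the removal call.

```python
class Registry:
    """Toy session registry: add on handshake, remove only on disconnect."""

    def __init__(self):
        self.sessions = {}

    def on_handshake(self, session_id):
        self.sessions[session_id] = object()  # stand-in for per-session state

    def on_disconnect(self, session_id):
        self.sessions.pop(session_id, None)


registry = Registry()
for session_id in range(1000):
    registry.on_handshake(session_id)
    # Assumption for the demo: only 1 client in 10 disconnects cleanly;
    # the rest vanish without ever triggering on_disconnect.
    if session_id % 10 == 0:
        registry.on_disconnect(session_id)

leaked = len(registry.sessions)
print(f"sessions still retained: {leaked}")
```

In a heap dump this shows up as a registry whose size tracks total handshakes ever seen rather than currently connected clients.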
|
|
111
|
+
|
|
112
|
+
To pin the mechanism, read the owning library's source for the **add/remove lifecycle** of
|
|
113
|
+
that exact collection, and quote the method names. State plainly what is *proven from the
|
|
114
|
+
heap* vs *inferred*.
|
|
115
|
+
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
## Worked example (anonymized, real case)
|
|
119
|
+
|
|
120
|
+
> This is a real, anonymized case shown to illustrate the flow end-to-end. For a *new* dump,
|
|
121
|
+
> derive every number yourself with the scripts — **do not pattern-match your conclusion to
|
|
122
|
+
> this example.** A different leak will have a different holder class, field, and mechanism;
|
|
123
|
+
> the *method* transfers, the answer does not.
|
|
124
|
+
|
|
125
|
+
**Symptom.** 2.6 GB dump, ~40M objects. Histogram: a Socket.IO session class `ClientHead`
|
|
126
|
+
= 192,639; live `NioSocketChannel` = 41,773; active session-wrapper = 6,530. → ~186k
|
|
127
|
+
zombie sessions.
|
|
128
|
+
|
|
129
|
+
**Reverse trace.** `ClientHead ← ConcurrentHashMap$Node.val (194,844) ← Node[] table ←
|
|
130
|
+
ClientsBox (singleton) ← GC root (Netty I/O thread → SocketIOChannelInitializer)`. Sibling
|
|
131
|
+
ratios confirmed: `TransportState ≈ 2× ClientHead`, message queues `≈ 2× ClientHead`.
|
|
132
|
+
|
|
133
|
+
**MAT.** Leak Suspects: "one `ClientsBox` occupies 64.67% … accumulated in
|
|
134
|
+
`ConcurrentHashMap$Node[524288]`." Same holder. ✓
|
|
135
|
+
|
|
136
|
+
**Measure.** `ClientsBox.uuid2clients`: table 524288, non-empty buckets 160,989 (consistent with ≈190k entries once collisions are counted);
|
|
137
|
+
`channel2clients`: 2,027; business static maps: 939 / 14,496. → leak is **only**
|
|
138
|
+
`uuid2clients`. Config: `pingTimeout=120000` (2× default), `upgradeTimeout=1000000` (100×
|
|
139
|
+
default).
|
|
140
|
+
|
|
141
|
+
**Mechanism.** `AuthorizeHandler.authorize()` does `clientsBox.addClient()` on handshake;
|
|
142
|
+
removal depends on ping-timeout or explicit disconnect. Only ~18k timeout tasks exist for
|
|
143
|
+
190k sessions → most zombies have no pending cleanup. The blown-up `upgradeTimeout`
|
|
144
|
+
amplifies retention of mid-upgrade/aborted polling connections.
|
|
145
|
+
|
|
146
|
+
**Fix.** Upgrade the library; restore `upgradeTimeout`/`pingTimeout` to sane values; add
|
|
147
|
+
ingress rate-limit/auth to stop invalid handshakes; monitor `uuid2clients.size`; mitigate
|
|
148
|
+
with heap bump + rolling restart / scheduled stale-session cleanup.
|