vmware-debug 1.6.1__tar.gz

@@ -0,0 +1,23 @@
1
+ __pycache__/
2
+ *.py[cod]
3
+ *$py.class
4
+ *.egg-info/
5
+ dist/
6
+ build/
7
+ .eggs/
8
+ *.egg
9
+ .venv/
10
+ venv/
11
+ .env
12
+ *.log
13
+ .pytest_cache/
14
+ .ruff_cache/
15
+ htmlcov/
16
+ .coverage
17
+ config.yaml
18
+ .agents/
19
+ .claude/
20
+ .trae/
21
+ skills-lock.json
22
+ tests/fixtures/token_corpus/
23
+ .DS_Store
@@ -0,0 +1,52 @@
1
+ Metadata-Version: 2.4
2
+ Name: vmware-debug
3
+ Version: 1.6.1
4
+ Summary: VMware diagnostic brain — read-only incident triage, log/event correlation, and root-cause routing across the VMware skill family
5
+ Author-email: Wei Zhou <wei-wz.zhou@broadcom.com>
6
+ License-Expression: MIT
7
+ Keywords: ai-ops,debug,diagnostics,mcp,rca,troubleshooting,vmware,vsphere
8
+ Classifier: Development Status :: 4 - Beta
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Topic :: System :: Monitoring
12
+ Requires-Python: >=3.10
13
+ Requires-Dist: mcp[cli]<2.0,>=1.10
14
+ Requires-Dist: rich<15.0,>=13.0
15
+ Requires-Dist: typer<1.0,>=0.12
16
+ Requires-Dist: vmware-policy<2.0,>=1.0.0
17
+ Description-Content-Type: text/markdown
18
+
19
+ <!-- mcp-name: io.github.zw008/vmware-debug -->
20
+
21
+ # VMware Debug
22
+
23
+ > ⚠️ **Work in progress** — the core (event correlation engine, MCP tools, CLI)
24
+ > is built and tested; README, `server.json`, full reference docs, and packaging
25
+ > polish are still landing. Not yet published to PyPI.
26
+
27
+ > **Disclaimer**: Community-maintained open-source project, **not affiliated with,
28
+ > endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and
29
+ > "vSphere" are trademarks of Broadcom. Source is publicly auditable under the MIT
30
+ > license.
31
+
32
+ The diagnostic brain of the VMware skill family. You bring the symptom (an error,
33
+ a log dump, a slow VM); this skill runs a systematic investigation, correlates
34
+ events from the other skills into one timeline, ranks root-cause hypotheses, and
35
+ tells you what to check next. It is **read-only** — it never changes anything and
36
+ never executes fixes. Remediation is routed to `vmware-aiops` (single op) or
37
+ `vmware-pilot` (multi-step, gated), mirroring the `vmware-harden → vmware-pilot`
38
+ advisor/executor split.
39
+
40
+ See [`skills/vmware-debug/SKILL.md`](skills/vmware-debug/SKILL.md) for the full
41
+ methodology, the event-envelope contract, and symptom routing.
42
+
43
+ ## MCP tools
44
+
45
+ | Tool | What |
46
+ |---|---|
47
+ | `incident_timeline` | [READ] Correlate pre-fetched events → timeline + spikes + ranked hypotheses + next-check ideas |
48
+ | `list_symptom_categories` | [READ] List recognised symptom categories + what to check for each |
49
+
50
+ ## License
51
+
52
+ MIT.
@@ -0,0 +1,45 @@
1
+ <!-- mcp-name: io.github.zw008/vmware-debug -->
2
+
3
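+ # VMware Debug (Chinese)
4
+
5
+ > **Disclaimer**: Community-maintained open-source project, **not affiliated with,
6
+ > endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and "vSphere" are trademarks of Broadcom. Source is publicly auditable under the MIT license.
7
+
8
+ The **diagnostic brain** of the VMware skill family. You bring the symptom (an error, a log dump, a slow VM); it runs a systematic investigation:
9
+ it correlates events fetched by the other skills into one timeline, detects spikes, ranks root-cause hypotheses, and tells you what to check next.
10
+ It is **read-only**: it never changes anything and never executes fixes. Remediation is always routed to vmware-aiops (single step) or
11
+ vmware-pilot (multi-step, approval-gated), mirroring the vmware-harden → vmware-pilot advisor/executor split.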
12
+
13
+ ## Companion Skills
14
+
15
+ | Need | Skill |
16
+ |---|---|
17
+ | Incident correlation / root cause | **vmware-debug** (this project) |
18
+ | Centralized log search | vmware-log-insight (feed its `log_search` results to debug) |
19
+ | vCenter events and alarms | vmware-monitor |
20
+ | Metrics / anomalies | vmware-aria |
21
+ | Executing fixes | vmware-aiops (single step) / vmware-pilot (multi-step, gated) |
22
+
23
+ ## Install
24
+
25
+ ```bash
26
+ uv tool install vmware-debug
27
+ vmware-debug categories # see which symptom categories it can diagnose
28
+ ```
29
+
30
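+ ## MCP tools (2, all read-only)
31
+
32
+ - `incident_timeline`: correlates already-fetched events into a timeline + spikes + ranked root-cause hypotheses + next-check suggestions
33
+ - `list_symptom_categories`: symptom categories and their troubleshooting routes (use it when you don't know what to check)
34
+
35
+ **Event envelope**: `{ts, source, severity, entity, text, fields}`. The agent normalises events from each source into this shape before handing them to
36
+ debug; debug therefore has zero runtime dependency on the other packages.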
37
+
38
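+ ## Security
39
+
40
+ Read-only, offline, and credential-free by construction: it connects to no vCenter/NSX/Aria, has no destructive surface, and holds no secrets to leak.
41
+ See [SECURITY.md](SECURITY.md) for details.
42
+
43
+ ## License
44
+
45
+ MIT.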
@@ -0,0 +1,34 @@
1
+ <!-- mcp-name: io.github.zw008/vmware-debug -->
2
+
3
+ # VMware Debug
4
+
5
+ > ⚠️ **Work in progress** — the core (event correlation engine, MCP tools, CLI)
6
+ > is built and tested; README, `server.json`, full reference docs, and packaging
7
+ > polish are still landing. Not yet published to PyPI.
8
+
9
+ > **Disclaimer**: Community-maintained open-source project, **not affiliated with,
10
+ > endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and
11
+ > "vSphere" are trademarks of Broadcom. Source is publicly auditable under the MIT
12
+ > license.
13
+
14
+ The diagnostic brain of the VMware skill family. You bring the symptom (an error,
15
+ a log dump, a slow VM); this skill runs a systematic investigation, correlates
16
+ events from the other skills into one timeline, ranks root-cause hypotheses, and
17
+ tells you what to check next. It is **read-only** — it never changes anything and
18
+ never executes fixes. Remediation is routed to `vmware-aiops` (single op) or
19
+ `vmware-pilot` (multi-step, gated), mirroring the `vmware-harden → vmware-pilot`
20
+ advisor/executor split.
21
+
22
+ See [`skills/vmware-debug/SKILL.md`](skills/vmware-debug/SKILL.md) for the full
23
+ methodology, the event-envelope contract, and symptom routing.
24
+
25
+ ## MCP tools
26
+
27
+ | Tool | What |
28
+ |---|---|
29
+ | `incident_timeline` | [READ] Correlate pre-fetched events → timeline + spikes + ranked hypotheses + next-check ideas |
30
+ | `list_symptom_categories` | [READ] List recognised symptom categories + what to check for each |
31
+
32
+ ## License
33
+
34
+ MIT.
@@ -0,0 +1,26 @@
1
+ ## v1.6.1 (2026-06-24) — initial release
2
+
3
+ First release of **vmware-debug**: the read-only diagnostic brain of the VMware
4
+ skill family. You bring the symptom; it runs the investigation, correlates
5
+ events from the other skills into one timeline, ranks root-cause hypotheses, and
6
+ routes remediation to vmware-aiops / vmware-pilot. It never writes and never
7
+ executes fixes (advisor/executor split, mirroring vmware-harden → vmware-pilot).
8
+
9
+ ### Added
10
+ - **2 read-only MCP tools**: `incident_timeline` (correlate pre-fetched events
11
+ into a timeline + z-score spikes + ranked hypotheses + next-check ideas) and
12
+ `list_symptom_categories` (the symptom→skill routing catalogue).
13
+ - **Unified event envelope** + tolerant normalizer so debug stays source-agnostic
14
+ with zero runtime dependency on the other skill packages — the agent fans out
15
+ to each skill's read tools and feeds events here (avoids cross-skill coupling,
16
+ pitfall #21/#32).
17
+ - **Pure correlation engine** (timeline merge, time-binning, spike detection,
18
+ hypothesis ranking, symptom routing) — fully unit-tested offline.
19
+ - **Typer CLI**: `triage`, `categories`, `version`, `mcp`. The `mcp` entry point
20
+ needs no network at startup (proxy-safe, pitfall #25).
21
+ - SKILL.md + references (event-envelope contract, symptom routing, playbooks).
22
+
23
+ ### Notes
24
+ - Read-only by construction; remediation is routed, never executed.
25
+ - `parse_timestamp` rejects implausible/garbage timestamps loudly rather than
26
+ silently landing at the epoch.
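
The note on `parse_timestamp` can be illustrated with a minimal sketch. This is not the package's implementation: the function name `parse_timestamp_sketch`, the seconds-versus-millis cutoff, the plausibility window, and the choice to treat naive strings as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Assumed plausibility window; the package's real bounds are not documented here.
_MIN_TS = datetime(2000, 1, 1, tzinfo=timezone.utc)
_MAX_TS = datetime(2100, 1, 1, tzinfo=timezone.utc)
# Assumed heuristic: numbers at least this large are treated as milliseconds.
_MILLIS_CUTOFF = 1e11


def parse_timestamp_sketch(value):
    """Return an aware UTC datetime, or raise ValueError for garbage input."""
    if isinstance(value, bool):
        raise ValueError(f"not a timestamp: {value!r}")
    if isinstance(value, (int, float)):
        seconds = value / 1000.0 if abs(value) >= _MILLIS_CUTOFF else float(value)
        try:
            parsed = datetime.fromtimestamp(seconds, tz=timezone.utc)
        except (OverflowError, OSError, ValueError) as exc:
            raise ValueError(f"implausible epoch timestamp: {value!r}") from exc
    elif isinstance(value, str) and value.strip():
        text = value.strip()
        if text.endswith("Z"):
            # datetime.fromisoformat on Python 3.10 does not accept a trailing "Z".
            text = text[:-1] + "+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError as exc:
            raise ValueError(f"unparseable timestamp: {value!r}") from exc
        if parsed.tzinfo is None:
            # Assumption: naive strings are interpreted as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        parsed = parsed.astimezone(timezone.utc)
    else:
        raise ValueError(f"not a timestamp: {value!r}")
    # Reject values such as 0 instead of quietly returning 1970-01-01.
    if not (_MIN_TS <= parsed <= _MAX_TS):
        raise ValueError(f"implausible timestamp: {value!r}")
    return parsed
```

Under these assumptions an ISO-8601 string, epoch seconds, and epoch millis for the same instant all parse to the same value, while `0`, empty strings, and free text raise `ValueError`.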
@@ -0,0 +1,49 @@
1
+ # Security Policy
2
+
3
+ ## Disclaimer
4
+
5
+ This is a community-maintained open-source project and is **not affiliated with,
6
+ endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and
7
+ "vSphere" are trademarks of Broadcom. Source code is publicly auditable at
8
+ [github.com/zw008/VMware-Debug](https://github.com/zw008/VMware-Debug) under the
9
+ MIT license.
10
+
11
+ ## Reporting Vulnerabilities
12
+
13
+ Report security issues via a GitHub private security advisory on the repository,
14
+ or by email to the maintainer. Please do not open public issues for security bugs.
15
+
16
+ ## Security Design
17
+
18
+ ### Read-only and offline by construction
19
+ vmware-debug has **no write tools, no network access, and no credentials**. It
20
+ does not connect to vCenter, NSX, Aria, or any appliance. Its tools are pure
21
+ functions over event data the orchestrating agent has already fetched with the
22
+ other skills' read tools. There is no destructive surface and no secret to leak.
23
+
24
+ ### No remediation execution
25
+ debug only *diagnoses* and *recommends*. Any fix is routed to vmware-aiops
26
+ (single op, with its own confirmation) or vmware-pilot (multi-step, approval-gated,
27
+ audited). The safety gates live in those skills, not here.
28
+
29
+ ### No cross-skill coupling
30
+ debug imports none of the other skill packages at runtime. Events arrive as plain
31
+ dicts (the unified event envelope), so there is no transitive dependency surface.
32
+
33
+ ### Prompt-injection consideration
34
+ debug operates on text the agent supplies. Its outputs are structured data
35
+ (timelines, hypotheses, routing strings); it does not execute or shell out to
36
+ anything based on event content.
37
+
38
+ ## Static Analysis
39
+
40
+ ```bash
41
+ uvx bandit -r vmware_debug/ mcp_server/
42
+ ```
43
+
44
+ Release bar: 0 Medium-or-higher severity findings.
45
+
46
+ ## Supported Versions
47
+
48
+ The latest released version receives fixes. Versions are kept aligned across the
49
+ VMware skill family.
@@ -0,0 +1 @@
1
+ """stdio MCP server package for vmware-debug."""
@@ -0,0 +1,78 @@
1
+ """vmware-debug MCP server entry point.
2
+
3
+ Tools are defined in vmware_debug.mcp.tools (so audit logs see skill=debug).
4
+ This module wires them into a FastMCP server and provides the stdio entry point.
5
+
6
+ Note: signatures here use typing.Optional, never PEP 604 ``X | None`` — FastMCP
7
+ reflects these at registration and ``X | None`` crashes on Python 3.10 + older
8
+ mcp/pydantic (CLAUDE.md pitfall #33).
9
+ """
10
+
11
+ import sys
12
+ from typing import Optional
13
+
14
+ from mcp.server.fastmcp import FastMCP
15
+
16
+ from vmware_debug.mcp import tools as t
17
+
18
+
19
+ def build_server() -> FastMCP:
20
+     """Construct and configure the MCP server."""
21
+     server = FastMCP("vmware-debug")
22
+
23
+     @server.tool(name="incident_timeline")
24
+     def _incident_timeline_impl(
25
+         events: list[dict],
26
+         bin_seconds: Optional[float] = None,
27
+         z_threshold: float = 2.0,
28
+         top_n: int = 5,
29
+     ) -> dict:
30
+         """[READ] Correlate already-fetched VMware events into one incident view.
31
+
32
+         WHEN: after you've pulled events for an incident from the data-source
33
+         skills (vmware-monitor event_list/alarm_list, vmware-aria alerts/anomaly,
34
+         vmware-log-insight log_search/log_aggregate, vmware-nsx) — feed them all
35
+         here to find what correlates and where to look next. This tool does NOT
36
+         fetch anything itself; it has no vCenter/network access.
37
+
38
+         INPUT: events = list of event envelopes, each {ts, source, severity,
39
+         entity, text, fields} (ts may be ISO-8601, epoch seconds, or millis;
40
+         severity is normalised). Optional: bin_seconds (time-bin width; auto if
41
+         omitted), z_threshold (spike sensitivity, default 2.0), top_n (max
42
+         hypotheses, default 5).
43
+
44
+         RETURNS: {event_count, window, spikes (anomalous time bins), hypotheses
45
+         (ranked root-cause candidates, each with a suggested_check), next_checks
46
+         (concrete ideas for what to investigate next, including which skill/tool
47
+         to run)}.
48
+
49
+         GOTCHAS: read-only and stateless — nothing is executed. Remediation is
50
+         routed to vmware-aiops (single fix) or vmware-pilot (multi-step, gated).
51
+         A malformed event raises ValueError naming its index."""
52
+         return t.incident_timeline(events, bin_seconds, z_threshold, top_n)
53
+
54
+     @server.tool(name="list_symptom_categories")
55
+     def _list_symptom_categories_impl() -> list[dict]:
56
+         """[READ] List the symptom categories vmware-debug recognises and, for
57
+         each, example keywords and the suggested next check (which skill/tool to
58
+         run). Takes no parameters. Use this when you don't yet know what to look
59
+         at — it turns "something's wrong" into concrete investigation steps.
60
+         Read-only; no network access."""
61
+         return t.list_symptom_categories()
62
+
63
+     return server
64
+
65
+
66
+ def main() -> None:
67
+     """Entry point for `vmware-debug-mcp` (stdio transport)."""
68
+     if sys.version_info < (3, 11):
69
+         sys.exit(
70
+             "vmware-debug-mcp requires Python >= 3.11 (FastMCP schema reflection "
71
+             "is unreliable on 3.10). Reinstall under 3.11+: "
72
+             "uv tool install --python 3.11 vmware-debug"
73
+         )
74
+     build_server().run()
75
+
76
+
77
+ if __name__ == "__main__":
78
+     main()
@@ -0,0 +1,43 @@
1
+ [build-system]
2
+ requires = ["hatchling"]
3
+ build-backend = "hatchling.build"
4
+
5
+ [project]
6
+ name = "vmware-debug"
7
+ version = "1.6.1"
8
+ description = "VMware diagnostic brain — read-only incident triage, log/event correlation, and root-cause routing across the VMware skill family"
9
+ readme = "README.md"
10
+ license = "MIT"
11
+ requires-python = ">=3.10"
12
+ authors = [{ name = "Wei Zhou", email = "wei-wz.zhou@broadcom.com" }]
13
+ keywords = ["vmware", "vsphere", "debug", "troubleshooting", "diagnostics", "rca", "mcp", "ai-ops"]
14
+ classifiers = [
15
+ "Development Status :: 4 - Beta",
16
+ "License :: OSI Approved :: MIT License",
17
+ "Programming Language :: Python :: 3",
18
+ "Topic :: System :: Monitoring",
19
+ ]
20
+ dependencies = [
21
+ "typer>=0.12,<1.0",
22
+ "rich>=13.0,<15.0",
23
+ "mcp[cli]>=1.10,<2.0",
24
+ "vmware-policy>=1.0.0,<2.0",
25
+ ]
26
+
27
+ [project.scripts]
28
+ vmware-debug = "vmware_debug.cli:app"
29
+ vmware-debug-mcp = "mcp_server.server:main"
30
+
31
+ [tool.hatch.build.targets.wheel]
32
+ packages = ["vmware_debug", "mcp_server"]
33
+
34
+ [dependency-groups]
35
+ dev = [
36
+ "pytest>=8.0,<10.0",
37
+ "pytest-cov>=5.0,<8.0",
38
+ "ruff>=0.5,<1.0",
39
+ ]
40
+
41
+ [tool.ruff]
42
+ line-length = 100
43
+ target-version = "py310"
@@ -0,0 +1,21 @@
1
+ {
2
+ "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
3
+ "name": "io.github.zw008/vmware-debug",
4
+ "title": "VMware Debug",
5
+ "description": "VMware diagnostic brain: read-only incident correlation (timeline, spikes, ranked root-cause hypotheses) and next-check routing across the VMware skill family — 2 MCP tools.",
6
+ "repository": {
7
+ "url": "https://github.com/zw008/VMware-Debug",
8
+ "source": "github"
9
+ },
10
+ "version": "1.6.1",
11
+ "packages": [
12
+ {
13
+ "registryType": "pypi",
14
+ "identifier": "vmware-debug",
15
+ "version": "1.6.1",
16
+ "transport": {
17
+ "type": "stdio"
18
+ }
19
+ }
20
+ ]
21
+ }
@@ -0,0 +1,138 @@
1
+ ---
2
+ name: vmware-debug
3
+ description: >
4
+   Use this skill whenever the user is troubleshooting a VMware/vSphere problem —
5
+   a reported error, an exception, a log dump, a slow or failed VM, a host that
6
+   went sideways — and needs help locating the root cause. It is the diagnostic
7
+   brain of the VMware family: it drives a systematic investigation, pulls the
8
+   right signals from the other skills, correlates events into one timeline,
9
+   ranks root-cause hypotheses, and tells you what to check next even when you
10
+   don't know where to start. Always use this skill for "diagnose this VMware
11
+   issue", "why is my VM slow", "troubleshoot this vSphere error", "what does
12
+   this log mean", "help me figure out what broke" when the context is explicitly
13
+   VMware/vSphere/ESXi/NSX. It is READ-ONLY: it never changes anything. Do NOT
14
+   use it to execute fixes — single fixes go to vmware-aiops, multi-step gated
15
+   remediation goes to vmware-pilot. Do NOT use it for routine inventory or
16
+   health checks with no problem to solve — use vmware-monitor.
17
+ installer:
18
+   kind: uv
19
+   package: vmware-debug
20
+ allowed-tools:
21
+   - Bash
22
+ metadata: {"openclaw":{"requires":{"bins":["vmware-debug"]},"primaryEnv":"NONE"}}
23
+ ---
24
+
25
+ # VMware Debug
26
+
27
+ > **Disclaimer**: Community-maintained open-source project, **not affiliated with,
28
+ > endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and "vSphere"
29
+ > are trademarks of Broadcom. Source is publicly auditable under the MIT license.
30
+
31
+ The diagnostic brain of the VMware skill family. You bring the symptom; this skill
32
+ runs the investigation and points at the root cause. It **reads and reasons** — it
33
+ never writes. Companion skills do the data collection and the fixing.
34
+
35
+ ## What This Skill Does
36
+
37
+ | Category | What | Read or Write |
38
+ |---|---|---|
39
+ | Incident correlation | Merge events from many sources into one timeline, detect spikes | Read |
40
+ | Root-cause ranking | Score symptom clusters, surface the most likely cause first | Read |
41
+ | Next-check ideas | Suggest exactly what to look at next (which skill/tool) when you're stuck | Read |
42
+ | Remediation routing | Hand the fix to vmware-aiops (single) or vmware-pilot (gated, multi-step) | Read (routes only) |
43
+
44
+ **Zero write tools. Zero network access of its own.** It correlates data the agent
45
+ has already gathered with the other skills' read tools.
46
+
47
+ ## Quick Install
48
+
49
+ ```bash
50
+ uv tool install vmware-debug
51
+ vmware-debug categories # see what it can diagnose
52
+ ```
53
+
54
+ ## When to Use This Skill
55
+
56
+ Use it when there is a **problem to solve**: an error message, a stack of logs, an
57
+ alarm storm, "my VM won't power on", "storage feels slow", "the host disconnected".
58
+
59
+ - Need raw inventory/health with no incident? → **vmware-monitor**
60
+ - Need to actually run a fix? → **vmware-aiops** (single op) or **vmware-pilot** (gated workflow)
61
+ - Need metrics/anomalies? → **vmware-aria**; centralized logs? → **vmware-log-insight**
62
+
63
+ **Do NOT use when** there is nothing wrong (routine listing → monitor), or when the
64
+ user wants the fix executed (→ aiops/pilot). This skill stops at the diagnosis and
65
+ a recommended plan.
66
+
67
+ ## Related Skills — Skill Routing
68
+
69
+ | Symptom touches | Pull signals from | Then |
70
+ |---|---|---|
71
+ | Storage / datastore / vSAN | vmware-storage, vmware-log-insight | rank → route fix to aiops/pilot |
72
+ | Network / firewall / vMotion | vmware-nsx, vmware-nsx-security | run traceflow, check DFW |
73
+ | CPU / memory contention | vmware-aria (metrics/anomalies) | rightsizing via pilot |
74
+ | HA / DRS / cluster | vmware-monitor, vmware-aiops | cluster remediation via pilot |
75
+ | Power / clone / snapshot | vmware-aiops, vmware-monitor | task status, then fix via aiops |
76
+ | Auth / cert / login | check creds & cert; (security) | fix config/.env |
77
+
78
+ ## Common Workflows
79
+
80
+ ### 1. "Here's a pile of logs / alarms — what broke?"
81
+ 1. Collect events with the data-source skills (e.g. `vmware-monitor event_list --vm web01 --since 1h`, `vmware-log-insight log_search ...`, `vmware-aria alert_query ...`).
82
+ 2. Pass them all to **`incident_timeline`** (envelope below). Read the top hypothesis + `next_checks`.
83
+ 3. Follow `next_checks` to pull more targeted data; re-run `incident_timeline` to confirm.
84
+ 4. **Failure branch — no events come back:** the affected target may be unreachable. Run the source skill's `doctor`/health first; a 503/timeout is a *signal* (platform not ready), not a dead end.
85
+ 5. Produce a diagnosis + recommended fix. Route execution to aiops/pilot. **Do not fix here.**
86
+
87
+ ### 2. "I don't even know what to check"
88
+ 1. Run **`list_symptom_categories`** (or `vmware-debug categories`) to see the catalogue.
89
+ 2. Describe the symptom; map it to a category; the `suggested_check` tells you which skill/tool to run first.
90
+ 3. Collect → `incident_timeline` → narrow. Loop until one hypothesis dominates.
91
+
92
+ ### 3. Hand off the fix (advisor → executor, like vmware-harden)
93
+ 1. Debug emits a structured diagnosis + a proposed remediation (steps).
94
+ 2. **Single, low-risk fix** → call the matching **vmware-aiops** tool (it has its own double-confirm).
95
+ 3. **Multi-step / needs approval / cross-skill** → submit the plan to **vmware-pilot**, which owns the state machine, approval gate, rollback, and audit.
96
+ 4. **Failure branch — fix is ambiguous or risky:** stop and present the hypotheses to the user; never guess-execute.
97
+
98
+ ## Usage Mode
99
+
100
+ - **MCP** (in an agent): the agent calls the other skills' read tools, then `incident_timeline` to correlate. This is the primary mode, where the cross-skill coordination happens.
101
+ - **CLI** (humans): `vmware-debug triage --events events.json` correlates a JSON array you collected yourself.
102
+
103
+ ## MCP Tools (2 — 2 read, 0 write)
104
+
105
+ | Tool | What |
106
+ |---|---|
107
+ | `incident_timeline` | [READ] Correlate pre-fetched events → timeline + spikes + ranked hypotheses + next-check ideas |
108
+ | `list_symptom_categories` | [READ] List recognised symptom categories + what to check for each |
109
+
110
+ **Event envelope** (input to `incident_timeline`): `{ts, source, severity, entity, text, fields}`.
111
+ See `references/event-envelope.md`. The agent normalises each source's events into this
112
+ shape; debug stays source-agnostic and has no dependency on the other packages.
113
+
114
+ ## CLI Quick Reference
115
+
116
+ ```bash
117
+ vmware-debug categories # what can it diagnose
118
+ vmware-debug triage --events events.json # correlate a collected event set
119
+ cat events.json | vmware-debug triage # or via stdin
120
+ vmware-debug mcp # start stdio MCP server (proxy-safe)
121
+ ```
122
+
123
+ ## Troubleshooting
124
+
125
+ - **`incident_timeline` raises "event[N] could not be normalised"** — event N is missing a timestamp or has an unparseable one. Every event needs `ts` (ISO-8601, epoch seconds, or millis).
126
+ - **All hypotheses come back "uncategorized"** — the symptom isn't in the catalogue yet; widen the window and pull from another source (aria anomalies, log-insight). Consider adding a signature (see `references/routing.md`).
127
+ - **No spikes detected on an obvious burst** — you need ≥3 time bins for a baseline; shrink `bin_seconds`.
128
+ - **It won't execute the fix** — by design. Route to vmware-aiops or vmware-pilot.
129
+
130
+ ## Audit & Safety
131
+
132
+ Read-only by construction: no write tools, no network, nothing executed. Remediation
133
+ is always routed to aiops/pilot, where the double-confirm / approval / audit gates live
134
+ (audit DB `~/.vmware/audit.db`). See `references/setup-guide.md`.
135
+
136
+ ## License
137
+
138
+ MIT.
@@ -0,0 +1,32 @@
1
+ # vmware-debug Capabilities
2
+
3
+ Read-only, offline incident correlation. No network, no credentials, no writes.
4
+
5
+ | Tool | What it returns | Typical response tokens |
6
+ |---|---|---|
7
+ | `incident_timeline` | `{event_count, window, spikes:[{start,end,count,zscore}], hypotheses:[{category, score, summary, evidence_count, first_seen, last_seen, sample_text, suggested_check}], next_checks:[...]}` | 300–2000 (scales with hypotheses) |
8
+ | `list_symptom_categories` | `[{category, example_keywords, suggested_check}]` | ~400 |
9
+
10
+ ## Correlation engine
11
+
12
+ - **Timeline**: events normalised to the unified envelope, sorted, and time-binned
13
+ (auto bin width ≈ span/30, or caller-specified).
14
+ - **Spike detection**: z-score over bin counts (≥3 bins required for a baseline;
15
+ flat series yields no false spikes).
16
+ - **Hypothesis ranking**: events clustered by symptom category (keyword match on
17
+ text + entity), scored by summed severity weight, tie-broken by recency.
18
+ Uncategorised events are kept visible, not dropped.
19
+ - **Next-check routing**: each category carries a concrete "which skill/tool to run
20
+ next" suggestion — the value when the user doesn't know what to check.
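
The time-binning and spike-detection bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's engine: the function names `bin_counts` and `find_spikes`, the use of the population standard deviation, and the 1-second floor on the auto bin width are all hypothetical choices. Timestamps are assumed to be epoch seconds.

```python
from statistics import mean, pstdev


def bin_counts(timestamps, bin_seconds=None):
    """Count events per fixed-width time bin; returns (bin_start, count) pairs."""
    if not timestamps:
        return []
    start, end = min(timestamps), max(timestamps)
    if bin_seconds is None:
        # Auto width of roughly span/30, as described above (1 s floor is assumed).
        bin_seconds = max((end - start) / 30.0, 1.0)
    n_bins = int((end - start) // bin_seconds) + 1
    counts = [0] * n_bins
    for ts in timestamps:
        counts[int((ts - start) // bin_seconds)] += 1
    return [(start + i * bin_seconds, c) for i, c in enumerate(counts)]


def find_spikes(bins, z_threshold=2.0):
    """Return bins whose count sits at least z_threshold deviations above the mean."""
    if len(bins) < 3:
        return []  # too few bins to form a baseline
    counts = [c for _, c in bins]
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []  # a flat series has no spikes
    spikes = []
    for bin_start, count in bins:
        z = (count - mu) / sigma
        if z >= z_threshold:
            spikes.append({"start": bin_start, "count": count, "zscore": round(z, 2)})
    return spikes
```

With one event every 10 seconds and a burst of extra events in a single bin, only the burst bin is reported; shrinking `bin_seconds` gives more bins and therefore a more stable baseline.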
21
+
22
+ ## Symptom categories
23
+
24
+ `storage`, `network`, `compute`, `ha_drs`, `power_lifecycle`, `auth`, `platform`.
25
+ See `references/routing.md` for keyword signatures and the skill each routes to.
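
The hypothesis-ranking rule (keyword match on text + entity, summed severity weight, recency as tie-break) can be sketched like this. The keyword lists and severity weights below are illustrative placeholders, not the signatures from `references/routing.md`, and `ts` is assumed to be numeric epoch seconds.

```python
# Assumed weights; the package's real values are not documented here.
SEVERITY_WEIGHT = {"critical": 4, "error": 3, "warning": 2, "info": 1, "unknown": 1}

# Illustrative keyword signatures only, covering three of the categories.
KEYWORDS = {
    "storage": ("datastore", "latency", "naa.", "vsan"),
    "network": ("vmotion", "firewall", "packet", "uplink"),
    "auth": ("login", "certificate", "permission"),
}


def classify(event):
    """Map an event to the first category whose keyword appears in text + entity."""
    haystack = f"{event.get('text', '')} {event.get('entity', '')}".lower()
    for category, words in KEYWORDS.items():
        if any(word in haystack for word in words):
            return category
    return "uncategorized"  # kept visible rather than dropped


def rank_hypotheses(events, top_n=5):
    """Cluster by category, score by summed severity weight, break ties by recency."""
    clusters = {}
    for event in events:
        category = classify(event)
        cluster = clusters.setdefault(
            category,
            {"category": category, "score": 0, "evidence_count": 0,
             "last_seen": event["ts"]},
        )
        cluster["score"] += SEVERITY_WEIGHT.get(event.get("severity", "unknown"), 1)
        cluster["evidence_count"] += 1
        cluster["last_seen"] = max(cluster["last_seen"], event["ts"])
    ranked = sorted(
        clusters.values(),
        key=lambda c: (c["score"], c["last_seen"]),
        reverse=True,
    )
    return ranked[:top_n]
```

Sorting on the `(score, last_seen)` pair means two clusters with equal scores are ordered by whichever saw evidence most recently.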
26
+
27
+ ## Design properties
28
+
29
+ - **Zero cross-skill runtime deps** — correlation is pure functions over plain
30
+ dicts; the agent fans out to other skills' read tools (pitfall #21/#32).
31
+ - **JSON-serialisable output** — suitable for direct MCP responses.
32
+ - **Immutable** — inputs are never mutated; every function returns new values.
@@ -0,0 +1,48 @@
1
+ # vmware-debug CLI Reference
2
+
3
+ All commands are read-only and offline (no network, no credentials).
4
+
5
+ ## triage — correlate a set of collected events
6
+
7
+ ```bash
8
+ vmware-debug triage [OPTIONS]
9
+ -e, --events PATH JSON file of event envelopes (reads stdin if omitted)
10
+ --bin-seconds N Time-bin width (auto if omitted)
11
+ --top-n N Max hypotheses to return [default: 5]
12
+ ```
13
+
14
+ Input is a JSON array of event envelopes (see `references/event-envelope.md`):
15
+
16
+ ```bash
17
+ cat events.json | vmware-debug triage
18
+ vmware-debug triage --events events.json --top-n 3
19
+ ```
20
+
21
+ Output (JSON): `{event_count, window, spikes, hypotheses, next_checks}`.
22
+
23
+ ## categories — list recognised symptom categories
24
+
25
+ ```bash
26
+ vmware-debug categories
27
+ ```
28
+
29
+ Prints each category, sample keywords, and the suggested next check (which
30
+ skill/tool to run). Use when you don't know what to look at.
31
+
32
+ ## version / mcp
33
+
34
+ ```bash
35
+ vmware-debug version # installed version
36
+ vmware-debug mcp # start the stdio MCP server (no network at startup)
37
+ ```
38
+
39
+ ## How the agent uses it
40
+
41
+ In an agent, the cross-skill correlation happens at the agent layer:
42
+
43
+ 1. Fetch events with the data-source skills (vmware-monitor `event_list`,
44
+ vmware-log-insight `log_search`/`log_aggregate`, vmware-aria alerts/anomaly,
45
+ vmware-nsx).
46
+ 2. Normalise each into the event envelope.
47
+ 3. Call the `incident_timeline` MCP tool to correlate and rank.
48
+ 4. Follow `next_checks`; route any fix to vmware-aiops / vmware-pilot.
@@ -0,0 +1,49 @@
1
+ # The Unified Event Envelope
2
+
3
+ This is the contract between `vmware-debug` and every data-source skill. The
4
+ orchestrating agent fetches events with each skill's own read tools, normalises
5
+ each into this shape, and passes the list to `incident_timeline`. Debug has **no
6
+ runtime dependency** on the other packages (no version lockstep, no heavy install).
7
+
8
+ ## Shape
9
+
10
+ ```json
11
+ {
12
+ "ts": "2026-06-23T10:15:30Z",
13
+ "source": "monitor",
14
+ "severity": "error",
15
+ "entity": "vm-web01",
16
+ "text": "Device naa.600... performance has deteriorated",
17
+ "fields": { "host": "esxi-03", "datastore": "ds1" }
18
+ }
19
+ ```
20
+
21
+ | Field | Type | Notes |
22
+ |---|---|---|
23
+ | `ts` | string \| number | ISO-8601, epoch **seconds**, or epoch **millis** (auto-detected). Required. |
24
+ | `source` | string | `monitor` \| `aria` \| `loginsight` \| `nsx` \| `nsx-security` \| `storage` \| ... |
25
+ | `severity` | string | Free text; normalised to `critical`/`error`/`warning`/`info`/`unknown`. |
26
+ | `entity` | string | The object the event is about (VM/host/datastore). May be empty. |
27
+ | `text` | string | Human-readable message — this is what the symptom classifier matches on. |
28
+ | `fields` | object | Any source-specific extras; preserved, never dropped. |
29
+
30
+ The normaliser is tolerant of common field-name variants (e.g. `timestamp`,
31
+ `createTime`, `startTimeUTC` for `ts`; `criticality`, `level` for `severity`;
32
+ `resourceName`, `vm_name`, `fullFormattedMessage` for entity/text), so most
33
+ sources map with little or no adaptation.
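
A tolerant normaliser of the kind described above might look like the sketch below. The alias tuples reuse the field-name variants listed in this section; the function name `normalise_event`, the severity synonym table, and the `"unknown"` defaults are assumptions for illustration, not the package's actual code. The raw timestamp is passed through unparsed here.

```python
TS_KEYS = ("ts", "timestamp", "createTime", "startTimeUTC")
SEVERITY_KEYS = ("severity", "criticality", "level")
ENTITY_KEYS = ("entity", "resourceName", "vm_name")
TEXT_KEYS = ("text", "fullFormattedMessage")

# Assumed synonym table mapping free-text severities onto the five levels.
SEVERITY_SYNONYMS = {
    "critical": "critical", "fatal": "critical",
    "error": "error", "err": "error",
    "warning": "warning", "warn": "warning",
    "info": "info", "information": "info",
}


def _first_present(raw, keys, default):
    """Return the first non-empty value found under any of the alias keys."""
    for key in keys:
        value = raw.get(key)
        if value is not None and value != "":
            return value
    return default


def normalise_event(raw, index=0):
    """Build a new envelope dict from a raw event without mutating the input."""
    ts = _first_present(raw, TS_KEYS, None)
    if ts is None:
        raise ValueError(f"event[{index}] could not be normalised: no timestamp field")
    severity = str(_first_present(raw, SEVERITY_KEYS, "unknown")).strip().lower()
    return {
        "ts": ts,
        "source": str(raw.get("source", "unknown")),
        "severity": SEVERITY_SYNONYMS.get(severity, "unknown"),
        "entity": str(_first_present(raw, ENTITY_KEYS, "")),
        "text": str(_first_present(raw, TEXT_KEYS, "")),
        "fields": dict(raw.get("fields") or {}),
    }
```

Trying the canonical key first and the aliases afterwards keeps already-normalised events unchanged while still accepting source-native records.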
34
+
35
+ ## Mapping cheatsheet per source
36
+
37
+ | Source tool (example) | ts | severity | entity | text |
38
+ |---|---|---|---|---|
39
+ | vmware-monitor `event_list` | `createdTime` | `severity` | `vm`/`host` | `fullFormattedMessage` |
40
+ | vmware-aria `alert_query` | `startTimeUTC` | `criticality` | `resourceName` | `alertDefinitionName` |
41
+ | vmware-aria `anomaly` | `timestamp` | (derive) | `resourceName` | stat + value |
42
+ | vmware-log-insight `log_search` | `timestamp` | `severity`/derive | `hostname` | `text` |
43
+ | vmware-nsx (firewall/traceflow) | `time` | (derive) | src/dst | rule/verdict |
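
For sources where explicit mapping is preferred over alias guessing, the cheatsheet rows can be encoded as data. The sketch below covers three of the rows; the field names are copied from the table, while the record shapes, the `SOURCE_FIELD_MAP` name, and the `to_envelope` helper are hypothetical and only illustrate agent-side glue code.

```python
# Field names copied from the cheatsheet above; the record shapes are assumptions.
SOURCE_FIELD_MAP = {
    "monitor": {"ts": "createdTime", "severity": "severity",
                "entity": ("vm", "host"), "text": "fullFormattedMessage"},
    "aria": {"ts": "startTimeUTC", "severity": "criticality",
             "entity": ("resourceName",), "text": "alertDefinitionName"},
    "loginsight": {"ts": "timestamp", "severity": "severity",
                   "entity": ("hostname",), "text": "text"},
}


def to_envelope(source, record):
    """Map one source-native record onto the envelope; raises KeyError if ts is absent."""
    mapping = SOURCE_FIELD_MAP[source]
    # Take the first entity field that carries a value (e.g. vm, else host).
    entity = next((record[k] for k in mapping["entity"] if record.get(k)), "")
    consumed = {mapping["ts"], mapping["severity"], mapping["text"], *mapping["entity"]}
    return {
        "ts": record[mapping["ts"]],
        "source": source,
        "severity": str(record.get(mapping["severity"], "unknown")),
        "entity": entity,
        "text": str(record.get(mapping["text"], "")),
        # Everything not mapped above is preserved as source-specific extras.
        "fields": {k: v for k, v in record.items() if k not in consumed},
    }
```

Rows marked "(derive)" in the table, such as anomaly or NSX events, would need a per-source rule for severity and are left out of this sketch.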
44
+
45
+ ## Why this design
46
+
47
+ - **Decoupling** — debug never imports monitor/aria/log-insight (CLAUDE.md pitfall #21/#32).
48
+ - **Testability** — correlation is pure functions over `Event`; unit tests feed synthetic events.
49
+ - **Transparency** — the cross-skill coordination happens at the agent layer, visibly, not hidden inside debug.