vmware-debug 1.6.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- vmware_debug-1.6.1/.gitignore +23 -0
- vmware_debug-1.6.1/PKG-INFO +52 -0
- vmware_debug-1.6.1/README-CN.md +45 -0
- vmware_debug-1.6.1/README.md +34 -0
- vmware_debug-1.6.1/RELEASE_NOTES.md +26 -0
- vmware_debug-1.6.1/SECURITY.md +49 -0
- vmware_debug-1.6.1/mcp_server/__init__.py +1 -0
- vmware_debug-1.6.1/mcp_server/server.py +78 -0
- vmware_debug-1.6.1/pyproject.toml +43 -0
- vmware_debug-1.6.1/server.json +21 -0
- vmware_debug-1.6.1/skills/vmware-debug/SKILL.md +138 -0
- vmware_debug-1.6.1/skills/vmware-debug/references/capabilities.md +32 -0
- vmware_debug-1.6.1/skills/vmware-debug/references/cli-reference.md +48 -0
- vmware_debug-1.6.1/skills/vmware-debug/references/event-envelope.md +49 -0
- vmware_debug-1.6.1/skills/vmware-debug/references/routing.md +30 -0
- vmware_debug-1.6.1/skills/vmware-debug/references/setup-guide.md +43 -0
- vmware_debug-1.6.1/tests/eval/regression/__init__.py +0 -0
- vmware_debug-1.6.1/tests/eval/regression/test_debug_regressions.py +49 -0
- vmware_debug-1.6.1/tests/test_timeline.py +193 -0
- vmware_debug-1.6.1/vmware_debug/__init__.py +8 -0
- vmware_debug-1.6.1/vmware_debug/cli.py +85 -0
- vmware_debug-1.6.1/vmware_debug/envelope.py +173 -0
- vmware_debug-1.6.1/vmware_debug/mcp/__init__.py +2 -0
- vmware_debug-1.6.1/vmware_debug/mcp/tools.py +33 -0
- vmware_debug-1.6.1/vmware_debug/ops/__init__.py +1 -0
- vmware_debug-1.6.1/vmware_debug/ops/timeline.py +312 -0
@@ -0,0 +1,23 @@
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+.venv/
+venv/
+.env
+*.log
+.pytest_cache/
+.ruff_cache/
+htmlcov/
+.coverage
+config.yaml
+.agents/
+.claude/
+.trae/
+skills-lock.json
+tests/fixtures/token_corpus/
+.DS_Store
@@ -0,0 +1,52 @@
+Metadata-Version: 2.4
+Name: vmware-debug
+Version: 1.6.1
+Summary: VMware diagnostic brain — read-only incident triage, log/event correlation, and root-cause routing across the VMware skill family
+Author-email: Wei Zhou <wei-wz.zhou@broadcom.com>
+License-Expression: MIT
+Keywords: ai-ops,debug,diagnostics,mcp,rca,troubleshooting,vmware,vsphere
+Classifier: Development Status :: 4 - Beta
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: System :: Monitoring
+Requires-Python: >=3.10
+Requires-Dist: mcp[cli]<2.0,>=1.10
+Requires-Dist: rich<15.0,>=13.0
+Requires-Dist: typer<1.0,>=0.12
+Requires-Dist: vmware-policy<2.0,>=1.0.0
+Description-Content-Type: text/markdown
+
+<!-- mcp-name: io.github.zw008/vmware-debug -->
+
+# VMware Debug
+
+> ⚠️ **Work in progress** — the core (event correlation engine, MCP tools, CLI)
+> is built and tested; README, `server.json`, full reference docs, and packaging
+> polish are still landing. Not yet published to PyPI.
+
+> **Disclaimer**: Community-maintained open-source project, **not affiliated with,
+> endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and
+> "vSphere" are trademarks of Broadcom. Source is publicly auditable under the MIT
+> license.
+
+The diagnostic brain of the VMware skill family. You bring the symptom (an error,
+a log dump, a slow VM); this skill runs a systematic investigation, correlates
+events from the other skills into one timeline, ranks root-cause hypotheses, and
+tells you what to check next. It is **read-only** — it never changes anything and
+never executes fixes. Remediation is routed to `vmware-aiops` (single op) or
+`vmware-pilot` (multi-step, gated), mirroring the `vmware-harden → vmware-pilot`
+advisor/executor split.
+
+See [`skills/vmware-debug/SKILL.md`](skills/vmware-debug/SKILL.md) for the full
+methodology, the event-envelope contract, and symptom routing.
+
+## MCP tools
+
+| Tool | What |
+|---|---|
+| `incident_timeline` | [READ] Correlate pre-fetched events → timeline + spikes + ranked hypotheses + next-check ideas |
+| `list_symptom_categories` | [READ] List recognised symptom categories + what to check for each |
+
+## License
+
+MIT.
@@ -0,0 +1,45 @@
+<!-- mcp-name: io.github.zw008/vmware-debug -->
+
+# VMware Debug (Chinese)
+
+> **Disclaimer**: This is a community-maintained open-source project, **not affiliated with,
+> endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and "vSphere" are trademarks of Broadcom. The source is publicly auditable under the MIT license.
+
+The **diagnostic brain** of the VMware skill family. You bring the symptom (an error, a log, a slow VM) and it runs a systematic investigation:
+it correlates events fetched by the other skills into one timeline, detects spikes, ranks root-cause hypotheses, and tells you what to check next.
+**Read-only**: it never changes anything and never executes fixes. Remediation is always routed to vmware-aiops (single step) or
+vmware-pilot (multi-step, approval-gated), fully mirroring the vmware-harden → vmware-pilot advisor/executor split.
+
+## Companion Skills
+
+| Need | Skill |
+|---|---|
+| Incident correlation / root cause | **vmware-debug** (this project) |
+| Centralized log search | vmware-log-insight (feed its `log_search` results to vmware-debug) |
+| vCenter events and alarms | vmware-monitor |
+| Metrics / anomalies | vmware-aria |
+| Executing fixes | vmware-aiops (single step) / vmware-pilot (multi-step, gated) |
+
+## Installation
+
+```bash
+uv tool install vmware-debug
+vmware-debug categories  # see which symptom categories it can diagnose
+```
+
+## MCP tools (2, all read-only)
+
+- `incident_timeline`: correlates already-fetched events into a timeline + spikes + ranked root-cause hypotheses + next-check suggestions
+- `list_symptom_categories`: symptom categories and the matching investigation routing (use it when you don't know what to check)
+
+**Event envelope**: `{ts, source, severity, entity, text, fields}`. The agent normalises events from each source into this shape before handing them to
+debug, so debug has zero runtime dependency on the other packages.
+
+## Security
+
+Structurally read-only, offline, and credential-free: it connects to no vCenter/NSX/Aria, has no destructive surface, and holds no secrets to leak.
+See [SECURITY.md](SECURITY.md) for details.
+
+## License
+
+MIT.
@@ -0,0 +1,34 @@
+<!-- mcp-name: io.github.zw008/vmware-debug -->
+
+# VMware Debug
+
+> ⚠️ **Work in progress** — the core (event correlation engine, MCP tools, CLI)
+> is built and tested; README, `server.json`, full reference docs, and packaging
+> polish are still landing. Not yet published to PyPI.
+
+> **Disclaimer**: Community-maintained open-source project, **not affiliated with,
+> endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and
+> "vSphere" are trademarks of Broadcom. Source is publicly auditable under the MIT
+> license.
+
+The diagnostic brain of the VMware skill family. You bring the symptom (an error,
+a log dump, a slow VM); this skill runs a systematic investigation, correlates
+events from the other skills into one timeline, ranks root-cause hypotheses, and
+tells you what to check next. It is **read-only** — it never changes anything and
+never executes fixes. Remediation is routed to `vmware-aiops` (single op) or
+`vmware-pilot` (multi-step, gated), mirroring the `vmware-harden → vmware-pilot`
+advisor/executor split.
+
+See [`skills/vmware-debug/SKILL.md`](skills/vmware-debug/SKILL.md) for the full
+methodology, the event-envelope contract, and symptom routing.
+
+## MCP tools
+
+| Tool | What |
+|---|---|
+| `incident_timeline` | [READ] Correlate pre-fetched events → timeline + spikes + ranked hypotheses + next-check ideas |
+| `list_symptom_categories` | [READ] List recognised symptom categories + what to check for each |
+
+## License
+
+MIT.
@@ -0,0 +1,26 @@
+## v1.6.1 (2026-06-24) — initial release
+
+First release of **vmware-debug**: the read-only diagnostic brain of the VMware
+skill family. You bring the symptom; it runs the investigation, correlates
+events from the other skills into one timeline, ranks root-cause hypotheses, and
+routes remediation to vmware-aiops / vmware-pilot. It never writes and never
+executes fixes (advisor/executor split, mirroring vmware-harden → vmware-pilot).
+
+### Added
+- **2 read-only MCP tools**: `incident_timeline` (correlate pre-fetched events
+  into a timeline + z-score spikes + ranked hypotheses + next-check ideas) and
+  `list_symptom_categories` (the symptom→skill routing catalogue).
+- **Unified event envelope** + tolerant normalizer so debug stays source-agnostic
+  with zero runtime dependency on the other skill packages — the agent fans out
+  to each skill's read tools and feeds events here (avoids cross-skill coupling,
+  pitfall #21/#32).
+- **Pure correlation engine** (timeline merge, time-binning, spike detection,
+  hypothesis ranking, symptom routing) — fully unit-tested offline.
+- **Typer CLI**: `triage`, `categories`, `version`, `mcp`. The `mcp` entry point
+  needs no network at startup (proxy-safe, pitfall #25).
+- SKILL.md + references (event-envelope contract, symptom routing, playbooks).
+
+### Notes
+- Read-only by construction; remediation is routed, never executed.
+- `parse_timestamp` rejects implausible/garbage timestamps loudly rather than
+  silently landing at the epoch.
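
The last note can be illustrated with a minimal sketch. This is not the package's `parse_timestamp` (its source is not part of the lines shown here); the function name `parse_timestamp_sketch`, the 2000 to 2100 plausibility window, and the "too large to be seconds, so treat as millis" rule are all assumptions chosen for illustration.

```python
from datetime import datetime, timezone

# Hypothetical plausibility window; the package's real bounds are not shown in this diff.
_MIN = datetime(2000, 1, 1, tzinfo=timezone.utc)
_MAX = datetime(2100, 1, 1, tzinfo=timezone.utc)


def parse_timestamp_sketch(value) -> datetime:
    """Parse ISO-8601, epoch seconds, or epoch millis into an aware UTC datetime.

    Raises ValueError for anything unparseable or outside the plausibility
    window, instead of quietly returning 1970-01-01.
    """
    if isinstance(value, bool):  # bool is an int subclass; never a timestamp
        raise ValueError(f"unparseable timestamp: {value!r}")
    if isinstance(value, (int, float)):
        seconds = float(value)
        # Assumption: a number too large to be plausible seconds is millis.
        if seconds > _MAX.timestamp():
            seconds /= 1000.0
        if not (_MIN.timestamp() <= seconds <= _MAX.timestamp()):
            raise ValueError(f"implausible timestamp: {value!r}")
        return datetime.fromtimestamp(seconds, tz=timezone.utc)
    if isinstance(value, str) and value.strip():
        text = value.strip()
        # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError as exc:
            raise ValueError(f"unparseable timestamp: {value!r}") from exc
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when naive
        if not (_MIN <= parsed <= _MAX):
            raise ValueError(f"implausible timestamp: {value!r}")
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unparseable timestamp: {value!r}")
```

Under these assumptions an input of `0`, `None`, or `"garbage"` fails loudly, which is the behaviour the note describes; a lenient parser would have placed such events at the epoch and distorted the timeline window.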
@@ -0,0 +1,49 @@
+# Security Policy
+
+## Disclaimer
+
+This is a community-maintained open-source project and is **not affiliated with,
+endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and
+"vSphere" are trademarks of Broadcom. Source code is publicly auditable at
+[github.com/zw008/VMware-Debug](https://github.com/zw008/VMware-Debug) under the
+MIT license.
+
+## Reporting Vulnerabilities
+
+Report security issues via a GitHub private security advisory on the repository,
+or by email to the maintainer. Please do not open public issues for security bugs.
+
+## Security Design
+
+### Read-only and offline by construction
+vmware-debug has **no write tools, no network access, and no credentials**. It
+does not connect to vCenter, NSX, Aria, or any appliance. Its tools are pure
+functions over event data the orchestrating agent has already fetched with the
+other skills' read tools. There is no destructive surface and no secret to leak.
+
+### No remediation execution
+debug only *diagnoses* and *recommends*. Any fix is routed to vmware-aiops
+(single op, with its own confirmation) or vmware-pilot (multi-step, approval-gated,
+audited). The safety gates live in those skills, not here.
+
+### No cross-skill coupling
+debug imports none of the other skill packages at runtime. Events arrive as plain
+dicts (the unified event envelope), so there is no transitive dependency surface.
+
+### Prompt-injection consideration
+debug operates on text the agent supplies. Its outputs are structured data
+(timelines, hypotheses, routing strings); it does not execute or shell out to
+anything based on event content.
+
+## Static Analysis
+
+```bash
+uvx bandit -r vmware_debug/ mcp_server/
+```
+
+Release bar: 0 Medium-or-higher severity findings.
+
+## Supported Versions
+
+The latest released version receives fixes. Versions are kept aligned across the
+VMware skill family.
@@ -0,0 +1 @@
+"""stdio MCP server package for vmware-debug."""
@@ -0,0 +1,78 @@
+"""vmware-debug MCP server entry point.
+
+Tools are defined in vmware_debug.mcp.tools (so audit logs see skill=debug).
+This module wires them into a FastMCP server and provides the stdio entry point.
+
+Note: signatures here use typing.Optional, never PEP 604 ``X | None`` — FastMCP
+reflects these at registration and ``X | None`` crashes on Python 3.10 + older
+mcp/pydantic (CLAUDE.md pitfall #33).
+"""
+
+import sys
+from typing import Optional
+
+from mcp.server.fastmcp import FastMCP
+
+from vmware_debug.mcp import tools as t
+
+
+def build_server() -> FastMCP:
+    """Construct and configure the MCP server."""
+    server = FastMCP("vmware-debug")
+
+    @server.tool(name="incident_timeline")
+    def _incident_timeline_impl(
+        events: list[dict],
+        bin_seconds: Optional[float] = None,
+        z_threshold: float = 2.0,
+        top_n: int = 5,
+    ) -> dict:
+        """[READ] Correlate already-fetched VMware events into one incident view.
+
+        WHEN: after you've pulled events for an incident from the data-source
+        skills (vmware-monitor event_list/alarm_list, vmware-aria alerts/anomaly,
+        vmware-log-insight log_search/log_aggregate, vmware-nsx) — feed them all
+        here to find what correlates and where to look next. This tool does NOT
+        fetch anything itself; it has no vCenter/network access.
+
+        INPUT: events = list of event envelopes, each {ts, source, severity,
+        entity, text, fields} (ts may be ISO-8601, epoch seconds, or millis;
+        severity is normalised). Optional: bin_seconds (time-bin width; auto if
+        omitted), z_threshold (spike sensitivity, default 2.0), top_n (max
+        hypotheses, default 5).
+
+        RETURNS: {event_count, window, spikes (anomalous time bins), hypotheses
+        (ranked root-cause candidates, each with a suggested_check), next_checks
+        (concrete ideas for what to investigate next, including which skill/tool
+        to run)}.
+
+        GOTCHAS: read-only and stateless — nothing is executed. Remediation is
+        routed to vmware-aiops (single fix) or vmware-pilot (multi-step, gated).
+        A malformed event raises ValueError naming its index."""
+        return t.incident_timeline(events, bin_seconds, z_threshold, top_n)
+
+    @server.tool(name="list_symptom_categories")
+    def _list_symptom_categories_impl() -> list[dict]:
+        """[READ] List the symptom categories vmware-debug recognises and, for
+        each, example keywords and the suggested next check (which skill/tool to
+        run). Takes no parameters. Use this when you don't yet know what to look
+        at — it turns "something's wrong" into concrete investigation steps.
+        Read-only; no network access."""
+        return t.list_symptom_categories()
+
+    return server
+
+
+def main() -> None:
+    """Entry point for `vmware-debug-mcp` (stdio transport)."""
+    if sys.version_info < (3, 11):
+        sys.exit(
+            "vmware-debug-mcp requires Python >= 3.11 (FastMCP schema reflection "
+            "is unreliable on 3.10). Reinstall under 3.11+: "
+            "uv tool install --python 3.11 vmware-debug"
+        )
+    build_server().run()
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,43 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[project]
+name = "vmware-debug"
+version = "1.6.1"
+description = "VMware diagnostic brain — read-only incident triage, log/event correlation, and root-cause routing across the VMware skill family"
+readme = "README.md"
+license = "MIT"
+requires-python = ">=3.10"
+authors = [{ name = "Wei Zhou", email = "wei-wz.zhou@broadcom.com" }]
+keywords = ["vmware", "vsphere", "debug", "troubleshooting", "diagnostics", "rca", "mcp", "ai-ops"]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Topic :: System :: Monitoring",
+]
+dependencies = [
+    "typer>=0.12,<1.0",
+    "rich>=13.0,<15.0",
+    "mcp[cli]>=1.10,<2.0",
+    "vmware-policy>=1.0.0,<2.0",
+]
+
+[project.scripts]
+vmware-debug = "vmware_debug.cli:app"
+vmware-debug-mcp = "mcp_server.server:main"
+
+[tool.hatch.build.targets.wheel]
+packages = ["vmware_debug", "mcp_server"]
+
+[dependency-groups]
+dev = [
+    "pytest>=8.0,<10.0",
+    "pytest-cov>=5.0,<8.0",
+    "ruff>=0.5,<1.0",
+]
+
+[tool.ruff]
+line-length = 100
+target-version = "py310"
@@ -0,0 +1,21 @@
+{
+  "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
+  "name": "io.github.zw008/vmware-debug",
+  "title": "VMware Debug",
+  "description": "VMware diagnostic brain: read-only incident correlation (timeline, spikes, ranked root-cause hypotheses) and next-check routing across the VMware skill family — 2 MCP tools.",
+  "repository": {
+    "url": "https://github.com/zw008/VMware-Debug",
+    "source": "github"
+  },
+  "version": "1.6.1",
+  "packages": [
+    {
+      "registryType": "pypi",
+      "identifier": "vmware-debug",
+      "version": "1.6.1",
+      "transport": {
+        "type": "stdio"
+      }
+    }
+  ]
+}
@@ -0,0 +1,138 @@
+---
+name: vmware-debug
+description: >
+  Use this skill whenever the user is troubleshooting a VMware/vSphere problem —
+  a reported error, an exception, a log dump, a slow or failed VM, a host that
+  went sideways — and needs help locating the root cause. It is the diagnostic
+  brain of the VMware family: it drives a systematic investigation, pulls the
+  right signals from the other skills, correlates events into one timeline,
+  ranks root-cause hypotheses, and tells you what to check next even when you
+  don't know where to start. Always use this skill for "diagnose this VMware
+  issue", "why is my VM slow", "troubleshoot this vSphere error", "what does
+  this log mean", "help me figure out what broke" when the context is explicitly
+  VMware/vSphere/ESXi/NSX. It is READ-ONLY: it never changes anything. Do NOT
+  use it to execute fixes — single fixes go to vmware-aiops, multi-step gated
+  remediation goes to vmware-pilot. Do NOT use it for routine inventory or
+  health checks with no problem to solve — use vmware-monitor.
+installer:
+  kind: uv
+  package: vmware-debug
+allowed-tools:
+  - Bash
+metadata: {"openclaw":{"requires":{"bins":["vmware-debug"]},"primaryEnv":"NONE"}}
+---
+
+# VMware Debug
+
+> **Disclaimer**: Community-maintained open-source project, **not affiliated with,
+> endorsed by, or sponsored by VMware, Inc. or Broadcom Inc.** "VMware" and "vSphere"
+> are trademarks of Broadcom. Source is publicly auditable under the MIT license.
+
+The diagnostic brain of the VMware skill family. You bring the symptom; this skill
+runs the investigation and points at the root cause. It **reads and reasons** — it
+never writes. Companion skills do the data collection and the fixing.
+
+## What This Skill Does
+
+| Category | What | Read or Write |
+|---|---|---|
+| Incident correlation | Merge events from many sources into one timeline, detect spikes | Read |
+| Root-cause ranking | Score symptom clusters, surface the most likely cause first | Read |
+| Next-check ideas | Suggest exactly what to look at next (which skill/tool) when you're stuck | Read |
+| Remediation routing | Hand the fix to vmware-aiops (single) or vmware-pilot (gated, multi-step) | Read (routes only) |
+
+**Zero write tools. Zero network access of its own.** It correlates data the agent
+has already gathered with the other skills' read tools.
+
+## Quick Install
+
+```bash
+uv tool install vmware-debug
+vmware-debug categories  # see what it can diagnose
+```
+
+## When to Use This Skill
+
+Use it when there is a **problem to solve**: an error message, a stack of logs, an
+alarm storm, "my VM won't power on", "storage feels slow", "the host disconnected".
+
+- Need raw inventory/health with no incident? → **vmware-monitor**
+- Need to actually run a fix? → **vmware-aiops** (single op) or **vmware-pilot** (gated workflow)
+- Need metrics/anomalies? → **vmware-aria**; centralized logs? → **vmware-log-insight**
+
+**Do NOT use when** there is nothing wrong (routine listing → monitor), or when the
+user wants the fix executed (→ aiops/pilot). This skill stops at the diagnosis and
+a recommended plan.
+
+## Related Skills — Skill Routing
+
+| Symptom touches | Pull signals from | Then |
+|---|---|---|
+| Storage / datastore / vSAN | vmware-storage, vmware-log-insight | rank → route fix to aiops/pilot |
+| Network / firewall / vMotion | vmware-nsx, vmware-nsx-security | run traceflow, check DFW |
+| CPU / memory contention | vmware-aria (metrics/anomalies) | rightsizing via pilot |
+| HA / DRS / cluster | vmware-monitor, vmware-aiops | cluster remediation via pilot |
+| Power / clone / snapshot | vmware-aiops, vmware-monitor | task status, then fix via aiops |
+| Auth / cert / login | check creds & cert; (security) | fix config/.env |
+
+## Common Workflows
+
+### 1. "Here's a pile of logs / alarms — what broke?"
+1. Collect events with the data-source skills (e.g. `vmware-monitor event_list --vm web01 --since 1h`, `vmware-log-insight log_search ...`, `vmware-aria alert_query ...`).
+2. Pass them all to **`incident_timeline`** (envelope below). Read the top hypothesis + `next_checks`.
+3. Follow `next_checks` to pull more targeted data; re-run `incident_timeline` to confirm.
+4. **Failure branch — no events come back:** the affected target may be unreachable. Run the source skill's `doctor`/health first; a 503/timeout is a *signal* (platform not ready), not a dead end.
+5. Produce a diagnosis + recommended fix. Route execution to aiops/pilot. **Do not fix here.**
+
+### 2. "I don't even know what to check"
+1. Run **`list_symptom_categories`** (or `vmware-debug categories`) to see the catalogue.
+2. Describe the symptom; map it to a category; the `suggested_check` tells you which skill/tool to run first.
+3. Collect → `incident_timeline` → narrow. Loop until one hypothesis dominates.
+
+### 3. Hand off the fix (advisor → executor, like vmware-harden)
+1. Debug emits a structured diagnosis + a proposed remediation (steps).
+2. **Single, low-risk fix** → call the matching **vmware-aiops** tool (it has its own double-confirm).
+3. **Multi-step / needs approval / cross-skill** → submit the plan to **vmware-pilot**, which owns the state machine, approval gate, rollback, and audit.
+4. **Failure branch — fix is ambiguous or risky:** stop and present the hypotheses to the user; never guess-execute.
+
+## Usage Mode
+
+- **MCP** (in an agent): the agent calls the other skills' read tools, then `incident_timeline` to correlate. This is the primary mode — that's where the cross-skill "coordination" happens.
+- **CLI** (humans): `vmware-debug triage --events events.json` correlates a JSON array you collected yourself.
+
+## MCP Tools (2 — 2 read, 0 write)
+
+| Tool | What |
+|---|---|
+| `incident_timeline` | [READ] Correlate pre-fetched events → timeline + spikes + ranked hypotheses + next-check ideas |
+| `list_symptom_categories` | [READ] List recognised symptom categories + what to check for each |
+
+**Event envelope** (input to `incident_timeline`): `{ts, source, severity, entity, text, fields}`.
+See `references/event-envelope.md`. The agent normalises each source's events into this
+shape; debug stays source-agnostic and has no dependency on the other packages.
+
+## CLI Quick Reference
+
+```bash
+vmware-debug categories                    # what can it diagnose
+vmware-debug triage --events events.json   # correlate a collected event set
+cat events.json | vmware-debug triage      # or via stdin
+vmware-debug mcp                           # start stdio MCP server (proxy-safe)
+```
+
+## Troubleshooting
+
+- **`incident_timeline` raises "event[N] could not be normalised"** — event N is missing a timestamp or has an unparseable one. Every event needs `ts` (ISO-8601, epoch seconds, or millis).
+- **All hypotheses come back "uncategorized"** — the symptom isn't in the catalogue yet; widen the window and pull from another source (aria anomalies, log-insight). Consider adding a signature (see `references/routing.md`).
+- **No spikes detected on an obvious burst** — you need ≥3 time bins for a baseline; shrink `bin_seconds`.
+- **It won't execute the fix** — by design. Route to vmware-aiops or vmware-pilot.
+
+## Audit & Safety
+
+Read-only by construction: no write tools, no network, nothing executed. Remediation
+is always routed to aiops/pilot, where the double-confirm / approval / audit gates live
+(audit DB `~/.vmware/audit.db`). See `references/setup-guide.md`.
+
+## License
+
+MIT.
@@ -0,0 +1,32 @@
+# vmware-debug Capabilities
+
+Read-only, offline incident correlation. No network, no credentials, no writes.
+
+| Tool | What it returns | Typical response tokens |
+|---|---|---|
+| `incident_timeline` | `{event_count, window, spikes:[{start,end,count,zscore}], hypotheses:[{category, score, summary, evidence_count, first_seen, last_seen, sample_text, suggested_check}], next_checks:[...]}` | 300–2000 (scales with hypotheses) |
+| `list_symptom_categories` | `[{category, example_keywords, suggested_check}]` | ~400 |
+
+## Correlation engine
+
+- **Timeline**: events normalised to the unified envelope, sorted, and time-binned
+  (auto bin width ≈ span/30, or caller-specified).
+- **Spike detection**: z-score over bin counts (≥3 bins required for a baseline;
+  flat series yields no false spikes).
+- **Hypothesis ranking**: events clustered by symptom category (keyword match on
+  text + entity), scored by summed severity weight, tie-broken by recency.
+  Uncategorised events are kept visible, not dropped.
+- **Next-check routing**: each category carries a concrete "which skill/tool to run
+  next" suggestion — the value when the user doesn't know what to check.
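
The time-binning and spike-detection bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `vmware_debug/ops/timeline.py`: the function names, the 1-second floor on the automatic bin width, and the use of the population standard deviation are hypothetical choices.

```python
from statistics import mean, pstdev


def bin_counts(timestamps, bin_seconds=None):
    """Count events per fixed-width time bin; returns (start, width, counts)."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0.0, bin_seconds or 1.0, []
    start, end = ordered[0], ordered[-1]
    if bin_seconds is None:
        # Auto width of roughly span/30; the 1-second floor is an assumption.
        bin_seconds = max((end - start) / 30.0, 1.0)
    n_bins = int((end - start) // bin_seconds) + 1
    counts = [0] * n_bins
    for ts in ordered:
        counts[int((ts - start) // bin_seconds)] += 1
    return start, bin_seconds, counts


def detect_spikes(counts, z_threshold=2.0):
    """Return bins whose count sits at least z_threshold deviations above the mean."""
    if len(counts) < 3:  # fewer than 3 bins: no usable baseline
        return []
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:  # flat series: nothing can be anomalous
        return []
    spikes = []
    for index, count in enumerate(counts):
        zscore = (count - mu) / sigma
        if zscore >= z_threshold:
            spikes.append({"index": index, "count": count, "zscore": zscore})
    return spikes
```

The two early returns in `detect_spikes` correspond to the guarantees stated in the list: at least three bins are needed for a baseline, and a flat series produces no spikes because its standard deviation is zero.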
+
+## Symptom categories
+
+`storage`, `network`, `compute`, `ha_drs`, `power_lifecycle`, `auth`, `platform`.
+See `references/routing.md` for keyword signatures and the skill each routes to.
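
To make the keyword-signature idea concrete, here is a minimal sketch of classifying events into categories and ranking the resulting clusters by summed severity weight with a recency tie-break. The keyword lists and the severity weights below are invented for illustration (the real signatures are in `references/routing.md`, which is not shown here), and `ts` is assumed to be already parsed to epoch seconds.

```python
from collections import defaultdict

# Hypothetical keyword signatures; only the category names come from the document.
SIGNATURES = {
    "storage": ("datastore", "latency", "naa.", "vsan"),
    "network": ("vmotion", "firewall", "packet", "uplink"),
    "auth": ("login", "certificate", "permission"),
}
# Hypothetical severity weights.
SEVERITY_WEIGHT = {"critical": 4.0, "error": 3.0, "warning": 2.0, "info": 1.0, "unknown": 0.5}


def classify(event):
    """Match keywords against text + entity; unmatched events stay visible."""
    haystack = f"{event.get('text', '')} {event.get('entity', '')}".lower()
    for category, keywords in SIGNATURES.items():
        if any(keyword in haystack for keyword in keywords):
            return category
    return "uncategorized"


def rank_hypotheses(events, top_n=5):
    """Cluster by category, score by summed severity weight, tie-break by recency."""
    clusters = defaultdict(list)
    for event in events:
        clusters[classify(event)].append(event)
    hypotheses = [
        {
            "category": category,
            "score": sum(SEVERITY_WEIGHT.get(e.get("severity", "unknown"), 0.5) for e in members),
            "evidence_count": len(members),
            "last_seen": max(e["ts"] for e in members),
        }
        for category, members in clusters.items()
    ]
    hypotheses.sort(key=lambda h: (h["score"], h["last_seen"]), reverse=True)
    return hypotheses[:top_n]
```

Keeping an explicit `uncategorized` bucket, rather than discarding unmatched events, mirrors the statement that uncategorised events are kept visible.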
+
+## Design properties
+
+- **Zero cross-skill runtime deps** — correlation is pure functions over plain
+  dicts; the agent fans out to other skills' read tools (pitfall #21/#32).
+- **JSON-serialisable output** — suitable for direct MCP responses.
+- **Immutable** — inputs are never mutated; every function returns new values.
@@ -0,0 +1,48 @@
+# vmware-debug CLI Reference
+
+All commands are read-only and offline (no network, no credentials).
+
+## triage — correlate a set of collected events
+
+```bash
+vmware-debug triage [OPTIONS]
+  -e, --events PATH      JSON file of event envelopes (reads stdin if omitted)
+  --bin-seconds N        Time-bin width (auto if omitted)
+  --top-n N              Max hypotheses to return [default: 5]
+```
+
+Input is a JSON array of event envelopes (see `references/event-envelope.md`):
+
+```bash
+cat events.json | vmware-debug triage
+vmware-debug triage --events events.json --top-n 3
+```
+
+Output (JSON): `{event_count, window, spikes, hypotheses, next_checks}`.
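
As a rough illustration of consuming that output, the sketch below parses a `triage` result and prints a one-line summary of the leading hypothesis. The helper name `summarise_triage` is hypothetical; the keys it reads (`event_count`, `hypotheses`, `category`, `score`, `suggested_check`) are the ones listed in the capabilities reference, and any sample values fed to it are made up rather than real tool output.

```python
import json


def summarise_triage(output_json: str) -> str:
    """Condense a triage JSON document into a single human-readable line."""
    result = json.loads(output_json)
    event_count = result.get("event_count", 0)
    hypotheses = result.get("hypotheses", [])
    if not hypotheses:
        return f"{event_count} events, no hypotheses"
    top = hypotheses[0]  # hypotheses are assumed to arrive ranked, best first
    return (
        f"{event_count} events; top hypothesis: {top['category']} "
        f"(score {top['score']}); next: {top['suggested_check']}"
    )
```

A typical use would be to capture the CLI's standard output and pass it straight to this function before deciding which `next_checks` entry to follow.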
+
+## categories — list recognised symptom categories
+
+```bash
+vmware-debug categories
+```
+
+Prints each category, sample keywords, and the suggested next check (which
+skill/tool to run). Use when you don't know what to look at.
+
+## version / mcp
+
+```bash
+vmware-debug version   # installed version
+vmware-debug mcp       # start the stdio MCP server (no network at startup)
+```
+
+## How the agent uses it
+
+In an agent, the cross-skill correlation happens at the agent layer:
+
+1. Fetch events with the data-source skills (vmware-monitor `event_list`,
+   vmware-log-insight `log_search`/`log_aggregate`, vmware-aria alerts/anomaly,
+   vmware-nsx).
+2. Normalise each into the event envelope.
+3. Call the `incident_timeline` MCP tool to correlate and rank.
+4. Follow `next_checks`; route any fix to vmware-aiops / vmware-pilot.
@@ -0,0 +1,49 @@
+# The Unified Event Envelope
+
+This is the contract between `vmware-debug` and every data-source skill. The
+orchestrating agent fetches events with each skill's own read tools, normalises
+each into this shape, and passes the list to `incident_timeline`. Debug has **no
+runtime dependency** on the other packages (no version lockstep, no heavy install).
+
+## Shape
+
+```json
+{
+  "ts": "2026-06-23T10:15:30Z",
+  "source": "monitor",
+  "severity": "error",
+  "entity": "vm-web01",
+  "text": "Device naa.600... performance has deteriorated",
+  "fields": { "host": "esxi-03", "datastore": "ds1" }
+}
+```
+
+| Field | Type | Notes |
+|---|---|---|
+| `ts` | string \| number | ISO-8601, epoch **seconds**, or epoch **millis** (auto-detected). Required. |
+| `source` | string | `monitor` \| `aria` \| `loginsight` \| `nsx` \| `nsx-security` \| `storage` \| ... |
+| `severity` | string | Free text; normalised to `critical`/`error`/`warning`/`info`/`unknown`. |
+| `entity` | string | The object the event is about (VM/host/datastore). May be empty. |
+| `text` | string | Human-readable message — this is what the symptom classifier matches on. |
+| `fields` | object | Any source-specific extras; preserved, never dropped. |
+
+The normaliser is tolerant of common field-name variants (e.g. `timestamp`,
+`createTime`, `startTimeUTC` for `ts`; `criticality`, `level` for `severity`;
+`resourceName`, `vm_name`, `fullFormattedMessage` for entity/text), so most
+sources map with little or no adaptation.
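
A tolerant normaliser of this kind could look roughly like the sketch below. It is not the contents of `vmware_debug/envelope.py`: the alias tuples list only the variants named above, the severity synonym table is an assumption, and `normalise_event` is a hypothetical name. The error message follows the "event[N] could not be normalised" wording quoted in the SKILL.md troubleshooting notes.

```python
# Field-name variants taken from the paragraph above, canonical name first.
TS_KEYS = ("ts", "timestamp", "createTime", "startTimeUTC")
SEVERITY_KEYS = ("severity", "criticality", "level")
ENTITY_KEYS = ("entity", "resourceName", "vm_name")
TEXT_KEYS = ("text", "fullFormattedMessage")

# Hypothetical synonym table; anything unrecognised becomes "unknown".
SEVERITY_MAP = {
    "critical": "critical", "fatal": "critical",
    "error": "error", "err": "error",
    "warning": "warning", "warn": "warning",
    "info": "info", "information": "info",
}


def _first(raw, keys):
    """Return (key, value) for the first alias that is present and non-empty."""
    for key in keys:
        if raw.get(key) not in (None, ""):
            return key, raw[key]
    return None, None


def normalise_event(raw, index=0):
    """Build a new envelope dict from a raw event without mutating the input."""
    ts_key, ts = _first(raw, TS_KEYS)
    if ts_key is None:
        raise ValueError(f"event[{index}] could not be normalised: missing timestamp")
    sev_key, severity = _first(raw, SEVERITY_KEYS)
    entity_key, entity = _first(raw, ENTITY_KEYS)
    text_key, text = _first(raw, TEXT_KEYS)
    consumed = {ts_key, sev_key, entity_key, text_key, "source", "fields"}
    # Source-specific extras are preserved under "fields", never dropped.
    extras = {k: v for k, v in raw.items() if k not in consumed}
    return {
        "ts": ts,
        "source": raw.get("source", "unknown"),
        "severity": SEVERITY_MAP.get(str(severity or "").strip().lower(), "unknown"),
        "entity": entity or "",
        "text": text or "",
        "fields": {**(raw.get("fields") or {}), **extras},
    }
```

Timestamp parsing is deliberately left out of this sketch; here `ts` is passed through unchanged so that the aliasing logic stays easy to follow.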
+
+## Mapping cheatsheet per source
+
+| Source tool (example) | ts | severity | entity | text |
+|---|---|---|---|---|
+| vmware-monitor `event_list` | `createdTime` | `severity` | `vm`/`host` | `fullFormattedMessage` |
+| vmware-aria `alert_query` | `startTimeUTC` | `criticality` | `resourceName` | `alertDefinitionName` |
+| vmware-aria `anomaly` | `timestamp` | (derive) | `resourceName` | stat + value |
+| vmware-log-insight `log_search` | `timestamp` | `severity`/derive | `hostname` | `text` |
+| vmware-nsx (firewall/traceflow) | `time` | (derive) | src/dst | rule/verdict |
+
+## Why this design
+
+- **Decoupling** — debug never imports monitor/aria/log-insight (CLAUDE.md pitfall #21/#32).
+- **Testability** — correlation is pure functions over `Event`; unit tests feed synthetic events.
+- **Transparency** — the cross-skill "coordination" happens at the agent layer, visibly, not hidden inside debug.