hatch3r 1.4.0 → 1.5.0
- package/README.md +10 -6
- package/agents/hatch3r-a11y-auditor.md +13 -2
- package/agents/hatch3r-architect.md +20 -1
- package/agents/hatch3r-ci-watcher.md +25 -1
- package/agents/hatch3r-context-rules.md +15 -3
- package/agents/hatch3r-dependency-auditor.md +23 -2
- package/agents/hatch3r-devops.md +11 -0
- package/agents/hatch3r-docs-writer.md +27 -2
- package/agents/hatch3r-fixer.md +46 -3
- package/agents/hatch3r-implementer.md +19 -1
- package/agents/hatch3r-learnings-loader.md +19 -0
- package/agents/hatch3r-lint-fixer.md +11 -0
- package/agents/hatch3r-perf-profiler.md +21 -1
- package/agents/hatch3r-researcher.md +51 -911
- package/agents/hatch3r-reviewer.md +24 -2
- package/agents/hatch3r-security-auditor.md +20 -0
- package/agents/hatch3r-test-writer.md +24 -0
- package/agents/modes/architecture.md +1 -0
- package/agents/modes/boundary-analysis.md +2 -1
- package/agents/modes/codebase-impact.md +1 -0
- package/agents/modes/complexity-risk.md +1 -0
- package/agents/modes/coverage-analysis.md +1 -0
- package/agents/modes/current-state.md +1 -0
- package/agents/modes/feature-design.md +1 -0
- package/agents/modes/impact-analysis.md +1 -0
- package/agents/modes/library-docs.md +2 -1
- package/agents/modes/migration-path.md +1 -0
- package/agents/modes/prior-art.md +1 -0
- package/agents/modes/refactoring-strategy.md +1 -0
- package/agents/modes/regression.md +1 -0
- package/agents/modes/requirements-elicitation.md +1 -0
- package/agents/modes/risk-assessment.md +1 -0
- package/agents/modes/risk-prioritization.md +1 -0
- package/agents/modes/root-cause.md +1 -0
- package/agents/modes/similar-implementation.md +2 -1
- package/agents/modes/symptom-trace.md +1 -0
- package/agents/modes/test-pattern.md +2 -1
- package/agents/shared/external-knowledge.md +10 -0
- package/agents/shared/quality-charter.md +18 -0
- package/checks/README.md +1 -0
- package/checks/accessibility.md +55 -0
- package/commands/board/pickup-azure-devops.md +1 -0
- package/commands/board/pickup-delegation-multi.md +6 -1
- package/commands/board/pickup-delegation.md +1 -0
- package/commands/board/pickup-github.md +1 -0
- package/commands/board/pickup-gitlab.md +1 -0
- package/commands/board/pickup-modes.md +1 -0
- package/commands/board/pickup-post-impl.md +2 -1
- package/commands/board/shared-azure-devops.md +1 -0
- package/commands/board/shared-board-overview.md +1 -0
- package/commands/board/shared-github.md +1 -0
- package/commands/board/shared-gitlab.md +1 -0
- package/commands/hatch3r-agent-customize.md +1 -0
- package/commands/hatch3r-api-spec.md +1 -0
- package/commands/hatch3r-benchmark.md +4 -3
- package/commands/hatch3r-board-fill.md +52 -9
- package/commands/hatch3r-board-groom.md +69 -5
- package/commands/hatch3r-board-init.md +2 -1
- package/commands/hatch3r-board-pickup.md +1 -0
- package/commands/hatch3r-board-refresh.md +1 -0
- package/commands/hatch3r-board-shared.md +34 -3
- package/commands/hatch3r-bug-plan.md +2 -1
- package/commands/hatch3r-codebase-map.md +4 -3
- package/commands/hatch3r-command-customize.md +2 -1
- package/commands/hatch3r-context-health.md +1 -0
- package/commands/hatch3r-cost-tracking.md +1 -0
- package/commands/hatch3r-debug.md +4 -3
- package/commands/hatch3r-dep-audit.md +3 -0
- package/commands/hatch3r-feature-plan.md +3 -2
- package/commands/hatch3r-healthcheck.md +1 -0
- package/commands/hatch3r-hooks.md +5 -0
- package/commands/hatch3r-learn.md +1 -0
- package/commands/hatch3r-migration-plan.md +3 -2
- package/commands/hatch3r-onboard.md +2 -1
- package/commands/hatch3r-project-spec.md +4 -3
- package/commands/hatch3r-quick-change.md +2 -0
- package/commands/hatch3r-recipe.md +1 -0
- package/commands/hatch3r-refactor-plan.md +2 -1
- package/commands/hatch3r-release.md +4 -1
- package/commands/hatch3r-revision.md +2 -1
- package/commands/hatch3r-roadmap.md +5 -4
- package/commands/hatch3r-rule-customize.md +1 -0
- package/commands/hatch3r-security-audit.md +1 -0
- package/commands/hatch3r-skill-customize.md +1 -0
- package/commands/hatch3r-test-plan.md +3 -2
- package/commands/hatch3r-workflow.md +5 -0
- package/dist/cli/index.js +7467 -4582
- package/dist/cli/index.js.map +1 -1
- package/hooks/hatch3r-ci-failure.md +1 -0
- package/hooks/hatch3r-file-save.md +1 -0
- package/hooks/hatch3r-post-merge.md +1 -0
- package/hooks/hatch3r-pre-commit.md +1 -0
- package/hooks/hatch3r-pre-push.md +1 -0
- package/hooks/hatch3r-session-start.md +1 -0
- package/package.json +19 -4
- package/rules/hatch3r-accessibility-standards.md +2 -1
- package/rules/hatch3r-accessibility-standards.mdc +1 -1
- package/rules/hatch3r-agent-orchestration-detail.md +49 -1
- package/rules/hatch3r-agent-orchestration-detail.mdc +47 -1
- package/rules/hatch3r-agent-orchestration.md +87 -5
- package/rules/hatch3r-agent-orchestration.mdc +85 -5
- package/rules/hatch3r-api-design.md +2 -1
- package/rules/hatch3r-api-design.mdc +1 -1
- package/rules/hatch3r-browser-verification.md +4 -2
- package/rules/hatch3r-browser-verification.mdc +1 -0
- package/rules/hatch3r-ci-cd.md +2 -1
- package/rules/hatch3r-ci-cd.mdc +1 -1
- package/rules/hatch3r-code-standards.md +15 -2
- package/rules/hatch3r-code-standards.mdc +12 -0
- package/rules/hatch3r-component-conventions.md +2 -1
- package/rules/hatch3r-component-conventions.mdc +1 -0
- package/rules/hatch3r-data-classification.md +2 -1
- package/rules/hatch3r-data-classification.mdc +1 -1
- package/rules/hatch3r-deep-context.md +26 -1
- package/rules/hatch3r-deep-context.mdc +25 -1
- package/rules/hatch3r-dependency-management.md +2 -1
- package/rules/hatch3r-dependency-management.mdc +1 -1
- package/rules/hatch3r-feature-flags.md +2 -0
- package/rules/hatch3r-feature-flags.mdc +1 -0
- package/rules/hatch3r-git-conventions.md +2 -1
- package/rules/hatch3r-git-conventions.mdc +2 -1
- package/rules/hatch3r-i18n.md +2 -1
- package/rules/hatch3r-i18n.mdc +1 -0
- package/rules/hatch3r-learning-consult.md +11 -1
- package/rules/hatch3r-learning-consult.mdc +11 -1
- package/rules/hatch3r-migrations.md +2 -1
- package/rules/hatch3r-migrations.mdc +1 -1
- package/rules/hatch3r-observability-logging.md +34 -0
- package/rules/hatch3r-observability-logging.mdc +30 -0
- package/rules/hatch3r-observability-metrics.md +74 -0
- package/rules/hatch3r-observability-metrics.mdc +70 -0
- package/rules/hatch3r-observability-tracing-detail.md +160 -0
- package/rules/hatch3r-observability-tracing-detail.mdc +63 -0
- package/rules/hatch3r-observability-tracing.md +86 -0
- package/rules/hatch3r-observability-tracing.mdc +77 -0
- package/rules/hatch3r-observability.md +9 -448
- package/rules/hatch3r-observability.mdc +7 -448
- package/rules/hatch3r-performance-budgets.md +2 -0
- package/rules/hatch3r-performance-budgets.mdc +1 -0
- package/rules/hatch3r-secrets-management.md +2 -1
- package/rules/hatch3r-secrets-management.mdc +1 -1
- package/rules/hatch3r-security-patterns.md +3 -2
- package/rules/hatch3r-security-patterns.mdc +1 -1
- package/rules/hatch3r-testing.md +12 -2
- package/rules/hatch3r-testing.mdc +10 -1
- package/rules/hatch3r-theming.md +3 -2
- package/rules/hatch3r-theming.mdc +1 -0
- package/rules/hatch3r-tooling-hierarchy.md +3 -2
- package/rules/hatch3r-tooling-hierarchy.mdc +1 -1
- package/skills/hatch3r-a11y-audit/SKILL.md +11 -4
- package/skills/hatch3r-agent-customize/SKILL.md +1 -0
- package/skills/hatch3r-api-spec/SKILL.md +9 -2
- package/skills/hatch3r-architecture-review/SKILL.md +7 -0
- package/skills/hatch3r-bug-fix/SKILL.md +16 -7
- package/skills/hatch3r-ci-pipeline/SKILL.md +8 -1
- package/skills/hatch3r-command-customize/SKILL.md +1 -0
- package/skills/hatch3r-context-health/SKILL.md +23 -2
- package/skills/hatch3r-cost-tracking/SKILL.md +16 -6
- package/skills/hatch3r-customize/SKILL.md +8 -1
- package/skills/hatch3r-dep-audit/SKILL.md +9 -2
- package/skills/hatch3r-feature/SKILL.md +12 -4
- package/skills/hatch3r-gh-agentic-workflows/SKILL.md +7 -0
- package/skills/hatch3r-incident-response/SKILL.md +7 -0
- package/skills/hatch3r-issue-workflow/SKILL.md +8 -1
- package/skills/hatch3r-logical-refactor/SKILL.md +8 -1
- package/skills/hatch3r-migration/SKILL.md +7 -0
- package/skills/hatch3r-perf-audit/SKILL.md +9 -2
- package/skills/hatch3r-pr-creation/SKILL.md +8 -1
- package/skills/hatch3r-qa-validation/SKILL.md +8 -1
- package/skills/hatch3r-recipe/SKILL.md +8 -1
- package/skills/hatch3r-refactor/SKILL.md +10 -2
- package/skills/hatch3r-release/SKILL.md +8 -1
- package/skills/hatch3r-rule-customize/SKILL.md +1 -0
- package/skills/hatch3r-skill-customize/SKILL.md +1 -0
- package/skills/hatch3r-visual-refactor/SKILL.md +12 -5
|
@@ -0,0 +1,160 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-observability-tracing-detail
|
|
3
|
+
type: rule
|
|
4
|
+
description: Extended tracing reference -- AI agent instrumentation, tool call audit trails, LLM request tracing, and correlation ID patterns
|
|
5
|
+
scope: conditional
|
|
6
|
+
globs: "**/*trac*,**/*span*,**/*telemetry*,**/*otel*,**/*agent*,**/observability/**"
|
|
7
|
+
tags: [devops]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
9
|
+
---
|
|
10
|
+
# Observability -- Tracing Extended Reference
|
|
11
|
+
|
|
12
|
+
On-demand companion to `hatch3r-observability-tracing`. Load when instrumenting AI agent systems, implementing tool call audit trails, or setting up correlation IDs for multi-agent workflows.
|
|
13
|
+
|
|
14
|
+
## GenAI Span Attributes
|
|
15
|
+
|
|
16
|
+
Use these attributes on all spans representing interactions with generative AI models:
|
|
17
|
+
| Attribute | Type | Description | Example |
|-----------|------|-------------|---------|
| `gen_ai.system` | string | GenAI provider system name | `openai`, `anthropic`, `azure_openai` |
| `gen_ai.request.model` | string | Model name as specified in the request | `gpt-4o`, `claude-sonnet-4-20250514` |
| `gen_ai.response.model` | string | Model name as returned in the response | `gpt-4o-2024-08-06` |
| `gen_ai.request.max_tokens` | int | Maximum tokens requested for generation | `4096` |
| `gen_ai.request.temperature` | float | Temperature parameter | `0.7` |
| `gen_ai.response.finish_reasons` | string[] | Reasons the model stopped generating | `["stop"]`, `["length"]` |
| `gen_ai.usage.input_tokens` | int | Tokens in the input/prompt | `1250` |
| `gen_ai.usage.output_tokens` | int | Tokens in the generated output | `530` |
|
|
28
|
+
|
|
29
|
+
- Always set `gen_ai.system` and `gen_ai.request.model` on every GenAI span.
|
|
30
|
+
- Record `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` from the API response for cost dashboards.
|
|
31
|
+
- Use `gen_ai.response.finish_reasons` to detect truncated outputs (`length`) and trigger re-prompting.
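
A minimal sketch of how the attributes above might be assembled from a provider response and used to detect truncation. The `ChatResult` shape and both helper names are hypothetical and not part of any SDK; only the `gen_ai.*` keys come from the table.

```typescript
// Hypothetical response shape; real provider SDKs name these fields differently.
interface ChatResult {
  model: string;
  finishReasons: string[];
  inputTokens: number;
  outputTokens: number;
}

type SpanAttributes = Record<string, string | number | string[]>;

// Build the gen_ai.* attributes for a single LLM call.
function genAiAttributes(
  system: string,
  requestModel: string,
  result: ChatResult,
): SpanAttributes {
  return {
    'gen_ai.system': system,
    'gen_ai.request.model': requestModel,
    'gen_ai.response.model': result.model,
    'gen_ai.response.finish_reasons': result.finishReasons,
    'gen_ai.usage.input_tokens': result.inputTokens,
    'gen_ai.usage.output_tokens': result.outputTokens,
  };
}

// A "length" finish reason means the output hit the token limit and was cut off.
function wasTruncated(attrs: SpanAttributes): boolean {
  const reasons = attrs['gen_ai.response.finish_reasons'];
  return Array.isArray(reasons) && reasons.includes('length');
}
```

The returned object is a plain attribute map, so it could be passed to a span's `setAttributes` call; a caller might re-prompt when `wasTruncated` returns `true`.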
|
|
32
|
+
|
|
33
|
+
## Agent Invocation Spans
|
|
34
|
+
|
|
35
|
+
Instrument the full lifecycle of an agent invocation with a dedicated span. This span is the parent for all LLM calls, tool executions, and sub-agent delegations.
|
|
36
|
+
|
|
37
|
+
- **Span name pattern:** `agent.{agent_name}.invoke`
|
|
38
|
+
- **Required attributes:** `agent.id`, `agent.name`, `agent.parent_id`, `agent.task`, `agent.framework`
|
|
39
|
+
- **Span events for state transitions:** `agent.planning`, `agent.tool_selection`, `agent.awaiting_human`, `agent.delegating`, `agent.completed`, `agent.error`
|
|
40
|
+
```typescript
import { trace } from '@opentelemetry/api';

// Assumes invocationId, parentAgentId, and prNumber are provided by the caller.
const tracer = trace.getTracer('agent-orchestrator');

const agentSpan = tracer.startSpan('agent.code_reviewer.invoke', {
  attributes: {
    'agent.id': invocationId,
    'agent.name': 'code_reviewer',
    'agent.parent_id': parentAgentId ?? '', // empty string for a root agent
    'agent.task': `review PR #${prNumber}`,
    'agent.framework': 'custom',
  },
});
agentSpan.addEvent('agent.planning');
// ... agent reasoning and tool calls happen as child spans ...
agentSpan.addEvent('agent.completed');
agentSpan.end();
```
|
|
56
|
+
|
|
57
|
+
## Tool Call Spans
|
|
58
|
+
|
|
59
|
+
Every tool invocation by an agent creates a child span of the agent invocation span.
|
|
60
|
+
|
|
61
|
+
- **Span name pattern:** `tool.{tool_name}.execute`
|
|
62
|
+
- **Required attributes:** `tool.name`, `tool.input_hash` (SHA-256), `tool.output_status`, `tool.duration_ms`, `tool.parameters_count`
|
|
63
|
+
- Tool spans must be children of the invoking agent span. Set span status to `ERROR` when `tool.output_status` is `error` or `timeout`.
|
|
64
|
+
- For tools performing I/O, create nested child spans using appropriate semantic conventions (`http.*`, `db.*`).
|
|
65
|
+
```typescript
// Assumes tracer, agentSpan, tools, params, and hashInput are in scope, and that
// context, trace, and SpanStatusCode are imported from '@opentelemetry/api'.
const startTime = performance.now();
const toolSpan = tracer.startSpan(
  'tool.git_diff.execute',
  {
    attributes: {
      'tool.name': 'git_diff',
      'tool.input_hash': hashInput(params), // SHA-256 of the input, never the raw input
      'tool.parameters_count': Object.keys(params).length,
    },
  },
  trace.setSpan(context.active(), agentSpan), // parent the tool span to the agent span
);
try {
  const result = await tools.gitDiff(params);
  toolSpan.setAttribute('tool.output_status', 'success');
} catch (err) {
  toolSpan.setAttribute('tool.output_status', 'error');
  toolSpan.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
  toolSpan.recordException(err as Error);
  throw err;
} finally {
  // Record the duration on both the success and the failure path.
  toolSpan.setAttribute('tool.duration_ms', performance.now() - startTime);
  toolSpan.end();
}
```
|
|
88
|
+
|
|
89
|
+
## LLM Request/Response Tracing
|
|
90
|
+
|
|
91
|
+
- **Span name pattern:** `gen_ai.{operation}` (e.g., `gen_ai.chat`, `gen_ai.completion`)
|
|
92
|
+
- **Token tracking:** Capture `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`. Aggregate in metrics: Counter `gen_ai.tokens_total` with labels `{direction, model, agent_name}`, Histogram `gen_ai.request_duration_ms`.
|
|
93
|
+
- **Model version tracking:** Record both `gen_ai.request.model` and `gen_ai.response.model` for drift detection.
|
|
94
|
+
- **Retry spans:** Each retry attempt is a separate child span. Set `gen_ai.request.retries` on the final span. Record `http.response.status_code` on failed spans (429 vs 500+).
|
|
95
|
+
- Never log raw prompt content or full model responses as span attributes. Use token counts for cost tracking and correlated logs for prompt debugging in non-production environments.
|
|
96
|
+
- Sample GenAI spans at 50-100% in production (higher than general spans) because each call is expensive and low volume.
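
The token counter and the retry classification described above can be sketched without any telemetry backend. This is an in-memory stand-in under stated assumptions: `TokenCounter`, `TokenLabels`, and `classifyFailure` are hypothetical names, and a real setup would use an OpenTelemetry metrics Counter rather than a `Map`.

```typescript
interface TokenLabels {
  direction: 'input' | 'output';
  model: string;
  agent_name: string;
}

// In-memory stand-in for the `gen_ai.tokens_total` counter, keyed by label set.
class TokenCounter {
  private readonly totals = new Map<string, number>();

  private static key(labels: TokenLabels): string {
    return `${labels.direction}|${labels.model}|${labels.agent_name}`;
  }

  add(tokens: number, labels: TokenLabels): void {
    const key = TokenCounter.key(labels);
    this.totals.set(key, (this.totals.get(key) ?? 0) + tokens);
  }

  get(labels: TokenLabels): number {
    return this.totals.get(TokenCounter.key(labels)) ?? 0;
  }
}

// Separate rate limiting (429) from provider-side failures (500+) on failed attempts.
function classifyFailure(statusCode: number): 'rate_limited' | 'server_error' | 'other' {
  if (statusCode === 429) return 'rate_limited';
  if (statusCode >= 500) return 'server_error';
  return 'other';
}
```

Keeping `direction`, `model`, and `agent_name` as the only labels keeps cardinality bounded, which is why identifiers such as the correlation ID are left off the counter in this sketch.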
|
|
97
|
+
|
|
98
|
+
## Tool Call Audit Trail
|
|
99
|
+
|
|
100
|
+
Maintain a structured audit log for every tool invocation in agentic workflows, separate from tracing spans.
|
|
101
|
+
| Field | Type | Description |
|-------|------|-------------|
| `tool.name` | string | Name of the tool invoked |
| `tool.input_hash` | string | SHA-256 hash of tool input (never log raw input) |
| `tool.output_status` | string | `success`, `error`, `timeout`, or `denied` |
| `tool.duration_ms` | float | Execution time in milliseconds |
| `agent.id` | string | ID of the invoking agent |
| `agent.name` | string | Human-readable agent name |
| `correlation.id` | string | Trace correlation ID |
| `timestamp` | string | ISO 8601 timestamp |
| `session.id` | string | Session identifier |
|
|
113
|
+
|
|
114
|
+
- Log tool invocations at `info` level, failures at `error` level with `error.type` and `error.message`.
|
|
115
|
+
- Aggregate tool call counts per agent per session for anomaly detection.
|
|
116
|
+
- Retain audit logs for a minimum of 90 days.
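
One way an audit record with the fields above might be built is sketched here, using Node's built-in `node:crypto` for the SHA-256 hash. `ToolAuditEntry`, `hashInput`, and `buildAuditEntry` are hypothetical names; hashing the JSON form of the input is an assumption, and it makes the hash sensitive to key order.

```typescript
import { createHash } from 'node:crypto';

interface ToolAuditEntry {
  'tool.name': string;
  'tool.input_hash': string;
  'tool.output_status': 'success' | 'error' | 'timeout' | 'denied';
  'tool.duration_ms': number;
  'agent.id': string;
  'agent.name': string;
  'correlation.id': string;
  timestamp: string;
  'session.id': string;
}

// SHA-256 over the JSON form of the input, so raw input never reaches the log.
function hashInput(input: unknown): string {
  const serialized = JSON.stringify(input) ?? 'null'; // JSON.stringify(undefined) is undefined
  return createHash('sha256').update(serialized).digest('hex');
}

// Fill in the derived fields (hash and ISO 8601 timestamp) around caller-supplied ones.
function buildAuditEntry(
  fields: Omit<ToolAuditEntry, 'tool.input_hash' | 'timestamp'>,
  rawInput: unknown,
  now: Date = new Date(),
): ToolAuditEntry {
  return {
    ...fields,
    'tool.input_hash': hashInput(rawInput),
    timestamp: now.toISOString(),
  };
}
```

The resulting object can be handed to whatever structured logger the project uses, at `info` level for successes and `error` level for failures.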
|
|
117
|
+
|
|
118
|
+
## Correlation IDs for Agent Workflows
|
|
119
|
+
|
|
120
|
+
- Use UUIDv4 with workflow-type prefix: `{workflow-type}-{uuid}` (e.g., `agent-run-550e8400-...`).
|
|
121
|
+
- Generate at the workflow entry point. Propagate to all sub-agents and tool calls.
|
|
122
|
+
- Every log entry, span, and metric must include `correlation.id`.
|
|
123
|
+
- Cross-process: propagate via `X-Correlation-ID` header alongside W3C Trace Context.
|
|
124
|
+
- Use OpenTelemetry `SpanLink` for cross-workflow references (e.g., agent run triggered by CI event).
|
|
125
|
+
```typescript
import { randomUUID } from 'node:crypto';
import { context, trace, SpanStatusCode } from '@opentelemetry/api';

// Application-defined delegation helper (not part of OpenTelemetry).
declare function delegateToSubAgent(
  agentName: string,
  options: { correlationId: string; parentSpanId: string; task: string },
): Promise<void>;

function generateCorrelationId(workflowType: string): string {
  return `${workflowType}-${randomUUID()}`;
}

async function runAgentWorkflow(task: string): Promise<void> {
  const correlationId = generateCorrelationId('agent-run');
  const tracer = trace.getTracer('agent-orchestrator');
  const rootSpan = tracer.startSpan('agent.orchestrator.invoke', {
    attributes: {
      'correlation.id': correlationId,
      'agent.name': 'orchestrator',
      'agent.task': task,
    },
  });
  try {
    await context.with(trace.setSpan(context.active(), rootSpan), async () => {
      await delegateToSubAgent('code_reviewer', {
        correlationId,
        parentSpanId: rootSpan.spanContext().spanId,
        task: 'review changes',
      });
    });
  } catch (err) {
    rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
    rootSpan.recordException(err as Error);
    throw err;
  } finally {
    rootSpan.end();
  }
}
```
|
|
@@ -0,0 +1,63 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Extended tracing reference -- AI agent instrumentation, tool call audit trails, LLM request tracing, and correlation ID patterns
|
|
3
|
+
globs: ["**/*trac*", "**/*span*", "**/*telemetry*", "**/*otel*", "**/*agent*", "**/observability/**"]
|
|
4
|
+
alwaysApply: false
|
|
5
|
+
---
|
|
6
|
+
# Observability -- Tracing Extended Reference
|
|
7
|
+
|
|
8
|
+
On-demand companion to `hatch3r-observability-tracing`. Load when instrumenting AI agent systems, implementing tool call audit trails, or setting up correlation IDs for multi-agent workflows.
|
|
9
|
+
|
|
10
|
+
## GenAI Span Attributes
|
|
11
|
+
|
|
12
|
+
Use these attributes on all spans representing interactions with generative AI models:
|
|
13
|
+
| Attribute | Type | Description | Example |
|-----------|------|-------------|---------|
| `gen_ai.system` | string | GenAI provider system name | `openai`, `anthropic`, `azure_openai` |
| `gen_ai.request.model` | string | Model name as specified in the request | `gpt-4o`, `claude-sonnet-4-20250514` |
| `gen_ai.response.model` | string | Model name as returned in the response | `gpt-4o-2024-08-06` |
| `gen_ai.request.max_tokens` | int | Maximum tokens requested for generation | `4096` |
| `gen_ai.request.temperature` | float | Temperature parameter | `0.7` |
| `gen_ai.response.finish_reasons` | string[] | Reasons the model stopped generating | `["stop"]`, `["length"]` |
| `gen_ai.usage.input_tokens` | int | Tokens in the input/prompt | `1250` |
| `gen_ai.usage.output_tokens` | int | Tokens in the generated output | `530` |
|
|
24
|
+
|
|
25
|
+
- Always set `gen_ai.system` and `gen_ai.request.model` on every GenAI span.
|
|
26
|
+
- Record `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` from the API response for cost dashboards.
|
|
27
|
+
- Use `gen_ai.response.finish_reasons` to detect truncated outputs (`length`) and trigger re-prompting.
|
|
28
|
+
|
|
29
|
+
## Agent Invocation Spans
|
|
30
|
+
|
|
31
|
+
Instrument the full lifecycle of an agent invocation with a dedicated span. This span is the parent for all LLM calls, tool executions, and sub-agent delegations.
|
|
32
|
+
|
|
33
|
+
- **Span name pattern:** `agent.{agent_name}.invoke`
|
|
34
|
+
- **Required attributes:** `agent.id`, `agent.name`, `agent.parent_id`, `agent.task`, `agent.framework`
|
|
35
|
+
- **Span events for state transitions:** `agent.planning`, `agent.tool_selection`, `agent.awaiting_human`, `agent.delegating`, `agent.completed`, `agent.error`
|
|
36
|
+
|
|
37
|
+
## Tool Call Spans
|
|
38
|
+
|
|
39
|
+
Every tool invocation by an agent creates a child span of the agent invocation span.
|
|
40
|
+
|
|
41
|
+
- **Span name pattern:** `tool.{tool_name}.execute`
|
|
42
|
+
- **Required attributes:** `tool.name`, `tool.input_hash` (SHA-256), `tool.output_status`, `tool.duration_ms`, `tool.parameters_count`
|
|
43
|
+
- Tool spans must be children of the invoking agent span. Set span status to `ERROR` when `tool.output_status` is `error` or `timeout`.
|
|
44
|
+
|
|
45
|
+
## LLM Request/Response Tracing
|
|
46
|
+
|
|
47
|
+
- **Span name pattern:** `gen_ai.{operation}` (e.g., `gen_ai.chat`, `gen_ai.completion`)
|
|
48
|
+
- **Token tracking:** Capture `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`. Aggregate in metrics.
|
|
49
|
+
- **Model version tracking:** Record both `gen_ai.request.model` and `gen_ai.response.model` for drift detection.
|
|
50
|
+
- **Retry spans:** Each retry attempt is a separate child span. Set `gen_ai.request.retries` on the final span.
|
|
51
|
+
- Never log raw prompt content or full model responses as span attributes.
|
|
52
|
+
|
|
53
|
+
## Tool Call Audit Trail
|
|
54
|
+
|
|
55
|
+
Maintain a structured audit log for every tool invocation in agentic workflows. Key fields: `tool.name`, `tool.input_hash`, `tool.output_status`, `tool.duration_ms`, `agent.id`, `agent.name`, `correlation.id`, `timestamp`, `session.id`. Retain for 90 days minimum.
|
|
56
|
+
|
|
57
|
+
## Correlation IDs for Agent Workflows
|
|
58
|
+
|
|
59
|
+
- Use UUIDv4 with workflow-type prefix: `{workflow-type}-{uuid}`.
|
|
60
|
+
- Generate at the workflow entry point. Propagate to all sub-agents and tool calls.
|
|
61
|
+
- Every log entry, span, and metric must include `correlation.id`.
|
|
62
|
+
- Cross-process: propagate via `X-Correlation-ID` header alongside W3C Trace Context.
|
|
63
|
+
- Use OpenTelemetry `SpanLink` for cross-workflow references.
|
|
@@ -0,0 +1,86 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-observability-tracing
|
|
3
|
+
type: rule
|
|
4
|
+
description: Distributed tracing and OpenTelemetry core conventions for the project
|
|
5
|
+
scope: conditional
|
|
6
|
+
globs: "**/*trac*,**/*span*,**/*telemetry*,**/*otel*,**/observability/**"
|
|
7
|
+
tags: [devops]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
9
|
+
---
|
|
10
|
+
# Observability -- Distributed Tracing & OpenTelemetry
|
|
11
|
+
|
|
12
|
+
Core distributed tracing and OpenTelemetry conventions. For structured logging see `hatch3r-observability-logging`. For metrics, SLOs, alerting, and dashboards see `hatch3r-observability-metrics`. For AI agent instrumentation, tool call audit trails, and correlation ID patterns see `hatch3r-observability-tracing-detail`.
|
|
13
|
+
|
|
14
|
+
## Distributed Tracing
|
|
15
|
+
|
|
16
|
+
- Use OpenTelemetry SDK for all tracing instrumentation. Initialize the TracerProvider once at application startup before any instrumented libraries load.
|
|
17
|
+
- Propagate trace context via W3C Trace Context headers (`traceparent`, `tracestate`) across all service boundaries, queues, and async workflows.
|
|
18
|
+
- Span naming conventions:
|
|
19
|
+
| Span Type | Pattern | Example |
|-----------|---------|---------|
| HTTP server | `HTTP {method} {route}` | `HTTP GET /api/users/:id` |
| HTTP client | `HTTP {method} {host}{path}` | `HTTP POST api.stripe.com/` |
| DB query | `{db.system} {operation}` | `firestore getDoc` |
| Queue | `{queue} {operation}` | `tasks-queue publish` |
| Internal | `{module}.{function}` | `auth.verifyToken` |
|
|
27
|
+
|
|
28
|
+
- Required span attributes: `service.name`, `service.version`, `deployment.environment.name` (set once as resource attributes; see Resource Semantic Conventions below). Add domain-specific attributes (e.g., `user.id`, `tenant.id`) where relevant.
|
|
29
|
+
- Parent-child span relationships: every outbound call (HTTP, DB, queue) creates a child span of the current context. Never create orphan spans.
|
|
30
|
+
- Sampling strategies: use `ParentBased(TraceIdRatioBased(0.1))` in production (10% sample rate). Always sample errors and slow requests (> p95 latency) at 100%.
|
|
31
|
+
- Use the OpenTelemetry Collector as a gateway between applications and backends to enable batching, retrying, and vendor-neutral export.
|
|
32
|
+
- Keep span event count low (< 32 per span). For high-volume events, use correlated logs or `SpanLink` instead.
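
The span naming patterns in this section can be centralized in small builder functions so that names stay consistent across services. The sketch below is illustrative only; the `spanName` object and its method names are hypothetical.

```typescript
// Span-name builders mirroring the naming patterns; all names here are hypothetical.
const spanName = {
  // Pass the route template (e.g. /api/users/:id), not the concrete URL,
  // so the span name stays low-cardinality.
  httpServer: (method: string, route: string): string =>
    `HTTP ${method.toUpperCase()} ${route}`,
  httpClient: (method: string, host: string, path: string): string =>
    `HTTP ${method.toUpperCase()} ${host}${path}`,
  dbQuery: (dbSystem: string, operation: string): string => `${dbSystem} ${operation}`,
  queue: (queueName: string, operation: string): string => `${queueName} ${operation}`,
  internal: (moduleName: string, functionName: string): string =>
    `${moduleName}.${functionName}`,
};
```

For example, `spanName.httpServer('get', '/api/users/:id')` yields `HTTP GET /api/users/:id`, matching the HTTP server pattern.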
|
|
33
|
+
|
|
34
|
+
## OpenTelemetry Semantic Conventions
|
|
35
|
+
|
|
36
|
+
Follow the [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/) (v1.29+) for consistent attribute naming across all telemetry signals.
|
|
37
|
+
|
|
38
|
+
### Standard Attribute Namespaces
|
|
39
|
+
| Namespace | Scope | Key Attributes |
|-----------|-------|----------------|
| `http.*` | HTTP client and server spans | `http.request.method`, `http.response.status_code`, `http.route`, `url.full`, `url.scheme` |
| `db.*` | Database client spans | `db.system`, `db.operation.name`, `db.collection.name`, `db.query.text` (sanitized) |
| `rpc.*` | RPC client and server spans | `rpc.system`, `rpc.service`, `rpc.method`, `rpc.grpc.status_code` |
| `messaging.*` | Message queue spans | `messaging.system`, `messaging.operation.type`, `messaging.destination.name` |
| `faas.*` | Serverless/FaaS invocations | `faas.trigger`, `faas.invoked_name`, `faas.coldstart` |
| `cloud.*` | Cloud provider context | `cloud.provider`, `cloud.region`, `cloud.availability_zone` |
| `k8s.*` | Kubernetes context | `k8s.namespace.name`, `k8s.pod.name`, `k8s.deployment.name` |
|
|
49
|
+
|
|
50
|
+
- Use semantic convention attribute names exactly as specified. Do not invent custom alternatives for concepts already covered.
|
|
51
|
+
- When semantic conventions are marked "Experimental," prefer them over project-specific names to ease future migration.
|
|
52
|
+
|
|
53
|
+
### Resource Semantic Conventions
|
|
54
|
+
|
|
55
|
+
Every telemetry-producing service must declare resource attributes at startup:
|
|
56
|
+
| Attribute | Requirement | Description |
|-----------|-------------|-------------|
| `service.name` | Required | Logical name of the service |
| `service.version` | Recommended | Semantic version of the service |
| `deployment.environment.name` | Recommended | Deployment environment (production, staging, development) |
| `service.instance.id` | Recommended | Unique instance identifier (pod name, container ID) |
|
|
63
|
+
|
|
64
|
+
- Configure via environment variables (`OTEL_SERVICE_NAME`, `OTEL_RESOURCE_ATTRIBUTES`) or programmatically at SDK initialization.
|
|
65
|
+
- Do not use the default `unknown_service` value in any deployed environment.
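
As a rough illustration of the environment-variable route, the helper below renders a set of resource attributes in the comma-separated `key=value` form that `OTEL_RESOURCE_ATTRIBUTES` expects, and rejects a missing or default service name. Both function names are hypothetical, and percent-encoding values with `encodeURIComponent` is an assumption about how to escape `,` and `=` safely.

```typescript
// Render attributes as "key1=value1,key2=value2" for OTEL_RESOURCE_ATTRIBUTES.
function toResourceAttributesEnv(attrs: Record<string, string>): string {
  return Object.keys(attrs)
    .map((key) => `${key}=${encodeURIComponent(attrs[key])}`)
    .join(',');
}

// Fail fast when the service name is missing or still the SDK default.
function requireServiceName(serviceName: string | undefined): string {
  if (!serviceName || serviceName.startsWith('unknown_service')) {
    throw new Error('OTEL_SERVICE_NAME must be set to a real service name');
  }
  return serviceName;
}
```

A deployment script could call `requireServiceName(process.env.OTEL_SERVICE_NAME)` during startup checks so a misconfigured service never reaches a deployed environment.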
|
|
66
|
+
|
|
67
|
+
### Span Status Codes
|
|
68
|
+
| Code | When to Set |
|------|-------------|
| `UNSET` | Default. Span completed without error indication. |
| `OK` | Set only when the application explicitly considers the operation successful and wants to override lower-level error signals. Use sparingly. |
| `ERROR` | Operation failed: exception caught, HTTP 5xx, or business-logic error visible in error rate metrics. |
|
|
74
|
+
|
|
75
|
+
- Set `ERROR` for server-side errors (5xx) and unhandled exceptions. Do not set `ERROR` for client errors (4xx) on the server span.
|
|
76
|
+
- Attach exceptions as span events (`exception.type`, `exception.message`, `exception.stacktrace`) when setting `ERROR`.
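
The status rules above reduce to a small decision function. In this sketch the server-side behavior follows the bullets; treating 4xx as `ERROR` on client spans is taken from the OpenTelemetry HTTP semantic conventions rather than from this rule, and the function and type names are hypothetical.

```typescript
type HttpSpanKind = 'server' | 'client';
type HttpSpanStatus = 'UNSET' | 'ERROR';

// Map an HTTP response status code to a span status.
function statusForHttp(kind: HttpSpanKind, statusCode: number): HttpSpanStatus {
  if (statusCode >= 500) return 'ERROR'; // server-side failure on either span kind
  if (statusCode >= 400 && kind === 'client') return 'ERROR'; // the caller's request failed
  return 'UNSET'; // includes 4xx on server spans: the server behaved correctly
}

// Attributes for the exception event attached when a span is marked ERROR.
function exceptionEventAttributes(err: Error): Record<string, string> {
  return {
    'exception.type': err.name,
    'exception.message': err.message,
    'exception.stacktrace': err.stack ?? '',
  };
}
```

`OK` is deliberately absent from the return type: per the table, it is reserved for explicit application-level overrides and should not be derived from a status code.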
|
|
77
|
+
|
|
78
|
+
### Attribute Naming Guidelines
|
|
79
|
+
|
|
80
|
+
- Use dot-separated namespaces: `http.request.method`, not `httpRequestMethod`.
|
|
81
|
+
- Attribute values should be low-cardinality. Never use unbounded values (full URLs with query params, raw SQL) as attribute values.
|
|
82
|
+
- Prefer semantic convention attributes over custom attributes. Prefix custom attributes with your project namespace (e.g., `myapp.feature.flag_key`).
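
A lint-style check for the naming guidelines might look like the sketch below. The regular expression is a project-level heuristic (lowercase, dot-separated segments with at least one dot), not an official OpenTelemetry rule, and both function names are hypothetical.

```typescript
// Lowercase segments separated by dots, with at least two segments.
const ATTRIBUTE_NAME = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$/;

function isWellFormedAttributeName(name: string): boolean {
  return ATTRIBUTE_NAME.test(name);
}

// Build a custom attribute name under the project namespace, e.g. "myapp".
function customAttribute(projectNamespace: string, name: string): string {
  const full = `${projectNamespace}.${name}`;
  if (!isWellFormedAttributeName(full)) {
    throw new Error(`Malformed attribute name: ${full}`);
  }
  return full;
}
```

With this check, `http.request.method` and `myapp.feature.flag_key` pass, while `httpRequestMethod` is rejected.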
|
|
83
|
+
|
|
84
|
+
### AI Agent Semantic Conventions (Summary)
|
|
85
|
+
|
|
86
|
+
Follow the [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) for AI/LLM agent instrumentation. Key attributes: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`. For full attribute tables, code examples, tool call audit trails, and correlation ID patterns, see `hatch3r-observability-tracing-detail`.
|
|
@@ -0,0 +1,77 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Distributed tracing and OpenTelemetry core conventions for the project
|
|
3
|
+
globs: ["**/*trac*", "**/*span*", "**/*telemetry*", "**/*otel*", "**/observability/**"]
|
|
4
|
+
alwaysApply: false
|
|
5
|
+
---
|
|
6
|
+
# Observability -- Distributed Tracing & OpenTelemetry
|
|
7
|
+
|
|
8
|
+
Core distributed tracing and OpenTelemetry conventions. For structured logging see `hatch3r-observability-logging`. For metrics, SLOs, alerting, and dashboards see `hatch3r-observability-metrics`. For AI agent instrumentation, tool call audit trails, and correlation ID patterns see `hatch3r-observability-tracing-detail`.
|
|
9
|
+
|
|
10
|
+
## Distributed Tracing
|
|
11
|
+
|
|
12
|
+
- Use OpenTelemetry SDK for all tracing instrumentation. Initialize the TracerProvider once at application startup before any instrumented libraries load.
|
|
13
|
+
- Propagate trace context via W3C Trace Context headers (`traceparent`, `tracestate`) across all service boundaries, queues, and async workflows.
|
|
14
|
+
- Span naming conventions:
|
|
15
|
+
| Span Type | Pattern | Example |
|-----------|---------|---------|
| HTTP server | `HTTP {method} {route}` | `HTTP GET /api/users/:id` |
| HTTP client | `HTTP {method} {host}{path}` | `HTTP POST api.stripe.com/` |
| DB query | `{db.system} {operation}` | `firestore getDoc` |
| Queue | `{queue} {operation}` | `tasks-queue publish` |
| Internal | `{module}.{function}` | `auth.verifyToken` |
|
|
23
|
+
|
|
24
|
+
- Required span attributes: `service.name`, `service.version`, `deployment.environment.name`. Add domain-specific attributes where relevant.
|
|
25
|
+
- Parent-child span relationships: every outbound call (HTTP, DB, queue) creates a child span of the current context. Never create orphan spans.
|
|
26
|
+
- Sampling strategies: use `ParentBased(TraceIdRatioBased(0.1))` in production (10% sample rate). Always sample errors and slow requests (> p95 latency) at 100%.
|
|
27
|
+
- Use the OpenTelemetry Collector as a gateway between applications and backends to enable batching, retrying, and vendor-neutral export.
|
|
28
|
+
- Keep span event count low (< 32 per span). For high-volume events, use correlated logs or `SpanLink` instead.
|
|
29
|
+
|
|
30
|
+
## OpenTelemetry Semantic Conventions
|
|
31
|
+
|
|
32
|
+
Follow the [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/) (v1.29+) for consistent attribute naming.
|
|
33
|
+
|
|
34
|
+
### Standard Attribute Namespaces
|
|
35
|
+
| Namespace | Scope | Key Attributes |
|-----------|-------|----------------|
| `http.*` | HTTP client and server spans | `http.request.method`, `http.response.status_code`, `http.route` |
| `db.*` | Database client spans | `db.system`, `db.operation.name`, `db.collection.name` |
| `rpc.*` | RPC client and server spans | `rpc.system`, `rpc.service`, `rpc.method` |
| `messaging.*` | Message queue spans | `messaging.system`, `messaging.operation.type`, `messaging.destination.name` |
| `faas.*` | Serverless/FaaS invocations | `faas.trigger`, `faas.invoked_name`, `faas.coldstart` |
| `cloud.*` | Cloud provider context | `cloud.provider`, `cloud.region`, `cloud.availability_zone` |
| `k8s.*` | Kubernetes context | `k8s.namespace.name`, `k8s.pod.name`, `k8s.deployment.name` |
|
|
45
|
+
|
|
46
|
+
- Use semantic convention attribute names exactly as specified.
|
|
47
|
+
- Prefer experimental conventions over project-specific names for future migration.
|
|
48
|
+
|
|
49
|
+
### Resource Semantic Conventions
|
|
50
|
+
| Attribute | Requirement | Description |
|-----------|-------------|-------------|
| `service.name` | Required | Logical name of the service |
| `service.version` | Recommended | Semantic version of the service |
| `deployment.environment.name` | Recommended | Deployment environment |
| `service.instance.id` | Recommended | Unique instance identifier |
|
|
57
|
+
|
|
58
|
+
### Span Status Codes
|
|
59
|
+
| Code | When to Set |
|------|-------------|
| `UNSET` | Default. Span completed without error indication. |
| `OK` | Explicitly override lower-level error signals. Use sparingly. |
| `ERROR` | Exception caught, HTTP 5xx, or business-logic error. |
|
|
65
|
+
|
|
66
|
+
- Set `ERROR` for server-side errors (5xx) and unhandled exceptions. Do not set `ERROR` for client errors (4xx).
|
|
67
|
+
- Attach exceptions as span events when setting `ERROR`.
|
|
68
|
+
|
|
69
|
+
### Attribute Naming Guidelines
|
|
70
|
+
|
|
71
|
+
- Use dot-separated namespaces: `http.request.method`, not `httpRequestMethod`.
|
|
72
|
+
- Attribute values should be low-cardinality. Never use unbounded values as attribute values.
|
|
73
|
+
- Prefix custom attributes with your project namespace.
|
|
74
|
+
|
|
75
|
+
### AI Agent Semantic Conventions (Summary)
|
|
76
|
+
|
|
77
|
+
Follow the [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) for AI/LLM agent instrumentation. Key attributes: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`. For full attribute tables, code examples, tool call audit trails, and correlation ID patterns, see `hatch3r-observability-tracing-detail`.
|