@sentry/junior-datadog 0.39.0 → 0.40.0
- package/README.md +29 -12
- package/package.json +1 -1
- package/plugin.yaml +49 -26
- package/skills/datadog/SKILL.md +25 -23
- package/skills/datadog/references/api-surface.md +43 -45
- package/skills/datadog/references/common-use-cases.md +40 -35
- package/skills/datadog/references/query-syntax.md +54 -36
- package/skills/datadog/references/troubleshooting-workarounds.md +21 -12
package/README.md
CHANGED
|
@@ -1,11 +1,6 @@
|
|
|
1
1
|
# @sentry/junior-datadog
|
|
2
2
|
|
|
3
|
-
|
|
4
|
-
> **This plugin does not currently work.** Datadog's hosted MCP server requires OAuth Dynamic Client Registration (DCR, [RFC 7591](https://www.rfc-editor.org/rfc/rfc7591)) for third-party clients like Junior, and DCR is locked down on Datadog's side. Until Datadog exposes DCR (or an equivalent registration path) on `mcp.datadoghq.com`, Junior cannot complete the OAuth handshake and every Datadog tool call will fail.
|
|
5
|
-
>
|
|
6
|
-
> The package is kept in-tree so the integration is ready to ship the moment Datadog unblocks DCR. Do not add it to a production deployment in the meantime.
|
|
7
|
-
|
|
8
|
-
`@sentry/junior-datadog` adds read-only Datadog telemetry workflows to Junior through Datadog's hosted MCP server.
|
|
3
|
+
`@sentry/junior-datadog` adds read-only Datadog telemetry workflows to Junior through Datadog's Pup CLI.
|
|
9
4
|
|
|
10
5
|
Install it alongside `@sentry/junior`:
|
|
11
6
|
|
|
@@ -21,13 +16,35 @@ juniorNitro({
|
|
|
21
16
|
});
|
|
22
17
|
```
|
|
23
18
|
|
|
24
|
-
|
|
19
|
+
Set Datadog credentials in the Junior deployment environment:
|
|
20
|
+
|
|
21
|
+
```bash
|
|
22
|
+
DATADOG_API_KEY=...
|
|
23
|
+
DATADOG_APP_KEY=...
|
|
24
|
+
DATADOG_SITE=datadoghq.com # optional; defaults to US1
|
|
25
|
+
```
|
|
26
|
+
|
|
27
|
+
The plugin maps these host-side `DATADOG_API_KEY`, `DATADOG_APP_KEY`, and `DATADOG_SITE` values to Datadog API headers and to the `DD_*` env values that Pup sees in the sandbox.
|
|
28
|
+
|
|
29
|
+
The real API and application keys stay host-side. Junior injects them into matching Datadog API requests as `DD-API-KEY` and `DD-APPLICATION-KEY` headers; the sandbox only receives non-secret placeholder values so Pup can perform its normal auth checks.
|
|
25
30
|
|
|
26
|
-
Junior
|
|
31
|
+
Junior keeps this package read-only by setting Pup's read-only mode and by guiding the skill to use `pup --read-only --agent` commands. The plugin is intended for searches, fetches, and analytics across logs, metrics, traces/spans, monitors, incidents, dashboards, hosts, services, and RUM.
|
|
27
32
|
|
|
28
33
|
## Datadog site
|
|
29
34
|
|
|
30
|
-
The packaged manifest defaults to the US1 endpoint
|
|
35
|
+
The packaged manifest defaults to the US1 API endpoint. Teams on other Datadog sites set `DATADOG_SITE` in their Junior deployment env to their site host. Setting only `DD_SITE` in the deployment env has no effect.
|
|
36
|
+
|
|
37
|
+
| Datadog site | `DATADOG_SITE` value |
|
|
38
|
+
| ------------ | ------------------------------------ |
|
|
39
|
+
| US1 | _unset_ (default) or `datadoghq.com` |
|
|
40
|
+
| US3 | `us3.datadoghq.com` |
|
|
41
|
+
| US5 | `us5.datadoghq.com` |
|
|
42
|
+
| EU | `datadoghq.eu` |
|
|
43
|
+
| AP1 | `ap1.datadoghq.com` |
|
|
44
|
+
| AP2 | `ap2.datadoghq.com` |
|
|
45
|
+
| GovCloud | `ddog-gov.com` |
|
|
46
|
+
|
|
47
|
+
The packaged API allowlist covers those standard Datadog sites. Custom or staging Datadog domains require a manifest change so the sandbox network header transform is allowed for that host.
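The relationship between `DATADOG_SITE`, the API host, and the packaged allowlist can be sketched in shell. This is an illustration only, not the plugin's implementation: `api_host`, `is_allowlisted`, and `ALLOWED_API_HOSTS` are hypothetical names, the `api.<site>` form follows the manifest comment about `api.${DATADOG_SITE}`, and the host list is copied from the manifest's `api-domains` entries.

```bash
# Sketch only: api_host, is_allowlisted, and ALLOWED_API_HOSTS are
# hypothetical names used for illustration.

# Standard-site API hosts, copied from the packaged api-domains list.
ALLOWED_API_HOSTS="api.datadoghq.com api.us3.datadoghq.com api.us5.datadoghq.com api.ap1.datadoghq.com api.ap2.datadoghq.com api.datadoghq.eu api.ddog-gov.com"

# Print the API host for a site value, defaulting to US1 when it is unset or empty.
api_host() {
  printf 'api.%s\n' "${1:-datadoghq.com}"
}

# Return 0 when the resolved host is in the standard allowlist, 1 otherwise.
is_allowlisted() {
  local host allowed
  host="$(api_host "${1:-}")"
  for allowed in $ALLOWED_API_HOSTS; do
    if [ "$host" = "$allowed" ]; then
      return 0
    fi
  done
  return 1
}
```

Under these assumptions, `is_allowlisted "$DATADOG_SITE"` succeeds for every site in the table and fails for a custom or staging domain, which is the case that needs a manifest change.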
|
|
31
48
|
|
|
32
49
|
## Optional channel defaults
|
|
33
50
|
|
|
@@ -42,8 +59,8 @@ These defaults are optional fallbacks. If a user names a different env or servic
|
|
|
42
59
|
|
|
43
60
|
## Auth model
|
|
44
61
|
|
|
45
|
-
-
|
|
46
|
-
-
|
|
47
|
-
-
|
|
62
|
+
- This package uses deployment-level Datadog API and application keys, not per-user OAuth.
|
|
63
|
+
- Use a Datadog application key with the smallest read scopes/role that covers the telemetry users need.
|
|
64
|
+
- Real key values never enter the sandbox env, files, or command arguments.
|
|
48
65
|
|
|
49
66
|
Full setup guide: https://junior.sentry.dev/extend/datadog-plugin/
|
package/package.json
CHANGED
package/plugin.yaml
CHANGED
|
@@ -1,36 +1,59 @@
|
|
|
1
1
|
name: datadog
|
|
2
|
-
description: Query Datadog telemetry (logs, metrics, traces, monitors, incidents, dashboards)
|
|
2
|
+
description: Query Datadog telemetry (logs, metrics, traces, monitors, incidents, dashboards) with Datadog's Pup CLI
|
|
3
3
|
|
|
4
4
|
config-keys:
|
|
5
5
|
- env
|
|
6
6
|
- service
|
|
7
7
|
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
#
|
|
8
|
+
capabilities:
|
|
9
|
+
- api
|
|
10
|
+
|
|
11
|
+
# Datadog orgs are region-pinned. Pup routes requests to api.${DATADOG_SITE}.
|
|
12
|
+
# Deployment env vars use DATADOG_* names; Pup receives DD_* command env.
|
|
13
|
+
# Non-US1 operators set DATADOG_SITE to their site host (e.g. us5.datadoghq.com,
|
|
14
|
+
# datadoghq.eu, ap1.datadoghq.com, ddog-gov.com). US1 operators can leave
|
|
15
|
+
# DATADOG_SITE unset and the default applies.
|
|
12
16
|
env-vars:
|
|
17
|
+
DATADOG_API_KEY:
|
|
18
|
+
DATADOG_APP_KEY:
|
|
13
19
|
DATADOG_SITE:
|
|
14
20
|
default: datadoghq.com
|
|
15
21
|
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
22
|
+
api-domains:
|
|
23
|
+
- api.datadoghq.com
|
|
24
|
+
- api.us3.datadoghq.com
|
|
25
|
+
- api.us5.datadoghq.com
|
|
26
|
+
- api.ap1.datadoghq.com
|
|
27
|
+
- api.ap2.datadoghq.com
|
|
28
|
+
- api.datadoghq.eu
|
|
29
|
+
- api.ddog-gov.com
|
|
30
|
+
|
|
31
|
+
api-headers:
|
|
32
|
+
DD-API-KEY: ${DATADOG_API_KEY}
|
|
33
|
+
DD-APPLICATION-KEY: ${DATADOG_APP_KEY}
|
|
34
|
+
|
|
35
|
+
command-env:
|
|
36
|
+
DD_API_KEY: host_managed_credential
|
|
37
|
+
DD_APP_KEY: host_managed_credential
|
|
38
|
+
DD_SITE: ${DATADOG_SITE}
|
|
39
|
+
DD_READ_ONLY: "1"
|
|
40
|
+
FORCE_AGENT_MODE: "1"
|
|
41
|
+
|
|
42
|
+
runtime-postinstall:
|
|
43
|
+
- cmd: bash
|
|
44
|
+
args:
|
|
45
|
+
- -lc
|
|
46
|
+
- |
|
|
47
|
+
set -euo pipefail
|
|
48
|
+
version=0.58.5
|
|
49
|
+
archive="pup_${version}_Linux_x86_64.tar.gz"
|
|
50
|
+
url="https://github.com/DataDog/pup/releases/download/v${version}/${archive}"
|
|
51
|
+
sha256="9543d968a6bd3b00da7ef20053717494beba7962e6cea01368d82857c8ea926b"
|
|
52
|
+
tmp="$(mktemp -d)"
|
|
53
|
+
trap 'rm -rf "$tmp"' EXIT
|
|
54
|
+
curl -fsSL "$url" -o "$tmp/$archive"
|
|
55
|
+
echo "${sha256} $tmp/$archive" | sha256sum -c -
|
|
56
|
+
tar -xzf "$tmp/$archive" -C "$tmp"
|
|
57
|
+
mkdir -p /vercel/sandbox/.junior/bin
|
|
58
|
+
install -m 0755 "$tmp/pup" /vercel/sandbox/.junior/bin/pup
|
|
59
|
+
pup --version
|
package/skills/datadog/SKILL.md
CHANGED
|
@@ -1,11 +1,11 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: datadog
|
|
3
|
-
description: Query live Datadog telemetry (logs, metrics, traces, spans, monitors, incidents, dashboards, services, hosts) through Datadog's
|
|
3
|
+
description: Query live Datadog telemetry (logs, metrics, traces, spans, monitors, incidents, dashboards, services, hosts) through Datadog's Pup CLI. Use when users ask to investigate production behavior in Datadog, including searching logs, checking monitor status, inspecting traces or spans, looking up incidents, finding services, or correlating metrics. Do not use it for Sentry issues, repository/source-code work, or ticketing.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
6
|
# Datadog Operations
|
|
7
7
|
|
|
8
|
-
Use this skill for Datadog observability investigations.
|
|
8
|
+
Use this skill for read-only Datadog observability investigations.
|
|
9
9
|
|
|
10
10
|
## Reference loading
|
|
11
11
|
|
|
@@ -25,41 +25,43 @@ Load references conditionally based on the request:
|
|
|
25
25
|
- Prefer explicit env, service, host, monitor/incident IDs, trace IDs, or Datadog URLs when the user provides them.
|
|
26
26
|
- When the user did not specify a scope, treat `datadog.env` and `datadog.service` conversation config as optional defaults. Explicit user input always wins over config.
|
|
27
27
|
- Only set or change `datadog.env` and `datadog.service` when the user explicitly asks to store a default for this conversation or channel.
|
|
28
|
-
- If the request refers to an earlier telemetry item indirectly
|
|
28
|
+
- If the request refers to an earlier telemetry item indirectly, inspect the current thread for the existing ID or URL before asking the user to restate it.
|
|
29
29
|
- Ask one concise follow-up only when a search is genuinely under-specified, for example when the user asks about "errors" with no env, service, or time window hint and the thread has no prior context.
|
|
30
30
|
|
|
31
|
-
2. Use
|
|
32
|
-
|
|
33
|
-
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
- Known
|
|
37
|
-
- Known
|
|
38
|
-
-
|
|
39
|
-
-
|
|
40
|
-
- For
|
|
41
|
-
-
|
|
42
|
-
-
|
|
31
|
+
2. Use Pup:
|
|
32
|
+
|
|
33
|
+
- Run Datadog commands with `pup --read-only --agent ...`. The plugin also sets read-only/agent env vars, but include the flags so command transcripts show the intended mode.
|
|
34
|
+
- If you are unsure about a command or flag, inspect Pup's schema with `pup --read-only --agent agent schema --compact` or the relevant `pup --read-only --agent <group> --help` output before guessing.
|
|
35
|
+
- Start narrow: pick the single most direct command for the request before broader search.
|
|
36
|
+
- Known incident ID: `pup --read-only --agent incidents get <incident_id>`
|
|
37
|
+
- Known monitor ID: `pup --read-only --agent monitors get <monitor_id>`
|
|
38
|
+
- Known notebook ID: `pup --read-only --agent notebooks get <notebook_id>`
|
|
39
|
+
- Known metric name: `pup --read-only --agent metrics query --query="avg:<metric>{...}" --from="15m" --to="now"`; use `metrics metadata get` or `metrics tags list` when the user wants available tags or dimensions.
|
|
40
|
+
- For exploratory questions, prefer one focused Pup search/list/aggregate command, then one follow-up fetch if needed.
|
|
41
|
+
- For "current error rate / log volume / top offenders" questions, prefer `pup logs aggregate` over pulling raw log pages back through `pup logs search`.
|
|
42
|
+
- For service-topology questions ("what calls checkout?", "what does the payment API depend on?"), prefer `pup apm dependencies list` or `pup apm flow-map` over stitching spans together manually.
|
|
43
|
+
- Use `pup monitors search` or `pup monitors list` for "is this alerting?" and `pup incidents list` / `pup incidents get` for incident context.
|
|
44
|
+
- Use RUM commands only when the user asks about real-user / browser telemetry, not for backend issues.
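One way to keep the read-only and agent flags on every invocation is a small shell wrapper. This is a sketch under stated assumptions: `pup_ro` and `PUP_DRY_RUN` are hypothetical names that are not part of Pup, and the wrapper only prepends the two flags the steps above already require.

```bash
# Sketch only: pup_ro and PUP_DRY_RUN are hypothetical, not part of Pup.

# Run a Pup command with --read-only --agent always prepended.
# With PUP_DRY_RUN=1 the command is printed instead of executed
# (arguments are joined with spaces, so original quoting is not shown).
pup_ro() {
  if [ "${PUP_DRY_RUN:-0}" = "1" ]; then
    echo "pup --read-only --agent $*"
  else
    pup --read-only --agent "$@"
  fi
}
```

With this in place, `pup_ro incidents get <incident_id>` expands to the "known incident ID" command listed above, and a dry run shows the exact command that would appear in a transcript.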
|
|
43
45
|
|
|
44
46
|
3. Bound every query:
|
|
45
47
|
|
|
46
48
|
- Always constrain time windows. Default to the last 15 minutes for "right now" questions and the last 24 hours for retrospective questions; otherwise use the window the user named.
|
|
47
49
|
- Always include `env:` when `datadog.env` is set or the user named an env.
|
|
48
|
-
- Always include `service:` when the user named a service or `datadog.service` is set and the
|
|
49
|
-
- Cap result size. Prefer the default or small page sizes; do not page through thousands of logs when an aggregate
|
|
50
|
+
- Always include `service:` when the user named a service or `datadog.service` is set and the command is service-scoped.
|
|
51
|
+
- Cap result size. Prefer the default or small page sizes; do not page through thousands of logs when an aggregate command answers the question.
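The scoping rules above can be sketched as a query builder. This is a minimal illustration, not part of the skill: `build_query` is a hypothetical helper, and `DATADOG_DEFAULT_ENV` / `DATADOG_DEFAULT_SERVICE` are assumed stand-ins for the `datadog.env` and `datadog.service` conversation config. It does not decide whether a command is service-scoped; the caller still has to do that.

```bash
# Sketch only: build_query, DATADOG_DEFAULT_ENV, and DATADOG_DEFAULT_SERVICE
# are hypothetical names standing in for the conversation config.

# Usage: build_query <user_env> <user_service> [extra query terms...]
# Explicit user input wins; config defaults fill gaps; empty scopes are omitted.
build_query() {
  local env_name service query
  env_name="${1:-${DATADOG_DEFAULT_ENV:-}}"
  service="${2:-${DATADOG_DEFAULT_SERVICE:-}}"
  # Drop the two scope arguments (or fewer, if fewer were passed).
  if [ "$#" -ge 2 ]; then shift 2; else shift "$#"; fi
  query="$*"
  if [ -n "$service" ]; then
    query="service:$service${query:+ $query}"
  fi
  if [ -n "$env_name" ]; then
    query="env:$env_name${query:+ $query}"
  fi
  printf '%s\n' "$query"
}
```

The result is meant to be passed as the `--query` value of a bounded command, for example together with `--from="15m" --to="now"` for a "right now" question.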
|
|
50
52
|
|
|
51
53
|
4. Report the result:
|
|
52
54
|
|
|
53
55
|
- Return the concrete answer first (counts, status, incident severity, trace timing, top offenders), then a short evidence block.
|
|
54
|
-
- Include Datadog deep links
|
|
55
|
-
- Preserve interesting spans, log lines, or metric values inline only when they are
|
|
56
|
-
- Keep routine tool chatter silent. Do not narrate
|
|
56
|
+
- Include Datadog deep links when Pup returns them or when you can construct a stable app link from an ID. Do not fabricate links from incomplete identifiers.
|
|
57
|
+
- Preserve interesting spans, log lines, or metric values inline only when they are evidence for the answer. Do not dump raw command output.
|
|
58
|
+
- Keep routine tool chatter silent. Do not narrate every Pup search or fetch step.
|
|
57
59
|
|
|
58
60
|
## Guardrails
|
|
59
61
|
|
|
60
|
-
- Read-only only in this skill. Do not create, edit, mute, or resolve monitors, incidents, notebooks, dashboards, SLOs,
|
|
62
|
+
- Read-only only in this skill. Do not create, edit, mute, delete, import, submit, or resolve monitors, incidents, notebooks, dashboards, SLOs, metrics, API keys, RUM resources, or other Datadog objects.
|
|
61
63
|
- Log, RUM, APM, and incident payloads can contain PII or sensitive customer data. Quote only the minimum needed to answer the question. Do not paste full raw log bodies or span payloads when a summary plus a deep link is enough.
|
|
62
|
-
- If
|
|
64
|
+
- If Pup returns `403`, `permission denied`, or similar, stop and tell the user the Datadog credentials could not access the requested resource. Do not guess at missing RBAC scopes.
|
|
63
65
|
- If Datadog responds with `429 Too Many Requests`, wait briefly and retry the same query once. If it still fails, report the throttle and stop.
|
|
64
|
-
- For large traces
|
|
66
|
+
- For large traces or span responses that are incomplete, report that fact; do not pretend the shown spans are complete.
|
|
65
67
|
- Do not use this skill for Sentry issues, Linear/GitHub ticketing, or source-code investigation. Hand those off to the matching skill.
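The "retry once on throttle" guardrail can be sketched as a shell wrapper. This is an illustration under explicit assumptions: `run_with_one_retry` and `RETRY_DELAY_SECONDS` are hypothetical names, and detecting a throttle by a failed command whose output mentions `429` is a guess about how the error surfaces, so the match should be adjusted to what Pup actually prints.

```bash
# Sketch only: run_with_one_retry and RETRY_DELAY_SECONDS are hypothetical.
# Assumption: a throttled call exits non-zero and its output mentions "429".

RETRY_DELAY_SECONDS="${RETRY_DELAY_SECONDS:-5}"

# Run the given command; if it fails with a 429 in its output, wait briefly
# and retry exactly once. Prints the final output and returns the final status.
run_with_one_retry() {
  local output status
  output="$("$@" 2>&1)" && status=0 || status=$?
  if [ "$status" -ne 0 ] && printf '%s' "$output" | grep -q '429'; then
    sleep "$RETRY_DELAY_SECONDS"
    output="$("$@" 2>&1)" && status=0 || status=$?
  fi
  printf '%s\n' "$output"
  return "$status"
}
```

A non-zero return after the single retry is the point at which the guardrail says to report the throttle and stop rather than loop.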
|
package/skills/datadog/references/api-surface.md
CHANGED
|
@@ -4,56 +4,54 @@ Use this reference for any Datadog operation.
|
|
|
4
4
|
|
|
5
5
|
## Provider surface
|
|
6
6
|
|
|
7
|
-
The packaged plugin
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
|
14
|
-
|
|
|
15
|
-
|
|
|
16
|
-
|
|
|
17
|
-
| `
|
|
18
|
-
|
|
|
19
|
-
|
|
|
20
|
-
|
|
|
21
|
-
|
|
|
22
|
-
|
|
|
23
|
-
|
|
|
24
|
-
|
|
|
25
|
-
|
|
|
26
|
-
|
|
|
27
|
-
| `
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
-
|
|
35
|
-
- Monitor, SLO, or incident mutations.
|
|
36
|
-
- Feature-flag, DBM, and security toolsets (the packaged URL does not request them).
|
|
7
|
+
The packaged plugin installs Datadog's `pup` CLI and configures it for agent-mode, read-only Datadog API access. Pup defaults to JSON output, which is the right format for analysis.
|
|
8
|
+
|
|
9
|
+
Run commands as `pup --read-only --agent ...`. If a command surface is unclear, inspect `pup --read-only --agent agent schema --compact` or `pup --read-only --agent <group> --help` before guessing.
|
|
10
|
+
|
|
11
|
+
### Read-oriented commands
|
|
12
|
+
|
|
13
|
+
| Need | Pup command pattern |
|
|
14
|
+
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
|
|
15
|
+
| Raw logs | `pup --read-only --agent logs search --query="service:checkout env:prod status:error" --from="15m" --limit=20` |
|
|
16
|
+
| Log aggregation | `pup --read-only --agent logs aggregate --query="service:checkout env:prod" --compute=count --group-by=status` |
|
|
17
|
+
| Metrics | `pup --read-only --agent metrics list`, `metrics search`, `metrics query`, `metrics metadata get`, `metrics tags list` |
|
|
18
|
+
| Spans / traces | `pup --read-only --agent traces search --query="service:checkout status:error" --from="15m" --limit=20` |
|
|
19
|
+
| Span aggregation | `pup --read-only --agent traces aggregate --query="service:checkout" --compute="percentile(@duration, 95)" --group-by=resource_name` |
|
|
20
|
+
| APM services | `pup --read-only --agent apm services list --env prod`, `apm services stats --env prod` |
|
|
21
|
+
| Service dependencies | `pup --read-only --agent apm dependencies list --env prod` or `apm flow-map --query="service:checkout"` |
|
|
22
|
+
| Monitors | `pup --read-only --agent monitors search --query="service:checkout"`, `monitors list --tags=service:checkout`, `monitors get <id>` |
|
|
23
|
+
| Incidents | `pup --read-only --agent incidents list --query="state:active" --limit=20`, `incidents get <id>` |
|
|
24
|
+
| Hosts | `pup --read-only --agent infrastructure hosts list --filter="env:prod" --count=50`, `infrastructure hosts get <host>` |
|
|
25
|
+
| Dashboards | `pup --read-only --agent dashboards list`, `dashboards get <id>`, `dashboards url <id>` |
|
|
26
|
+
| Notebooks | `pup --read-only --agent notebooks list`, `notebooks get <id>` |
|
|
27
|
+
| RUM events and sessions | `pup --read-only --agent rum events --query='@type:error'`, `rum aggregate`, `rum sessions search` |
|
|
28
|
+
|
|
29
|
+
### Commands to avoid
|
|
30
|
+
|
|
31
|
+
Do not run write commands, even with `--read-only` present:
|
|
32
|
+
|
|
33
|
+
- `create`, `update`, `delete`, `import`, `submit`, `cancel`, `mute`, `resolve`, or any command that writes a JSON file to Datadog.
|
|
34
|
+
- API key, app key, user, org policy, security, SLO, dashboard, monitor, incident, notebook, RUM metric, retention filter, playlist, or workflow mutations.
|
|
37
35
|
|
|
38
36
|
If a user asks for a mutation, stop and explain that this skill is read-only.
|
|
39
37
|
|
|
40
38
|
## Operation patterns
|
|
41
39
|
|
|
42
|
-
| Intent | Minimum
|
|
43
|
-
| ------------------------------------------------ |
|
|
44
|
-
| "Why is service X failing right now?" | `
|
|
45
|
-
| "Show me errors for service X in the last hour." | `
|
|
46
|
-
| "What is the status of monitor X?" | `
|
|
47
|
-
| "Tell me about incident INC-123." | `
|
|
48
|
-
| "What depends on
|
|
49
|
-
| "How did this trace spend its time?" | `
|
|
50
|
-
| "What tag values are valid for this metric?" | `
|
|
51
|
-
| "Which hosts are unhealthy?" | `
|
|
52
|
-
| "Find slow page loads." | `
|
|
40
|
+
| Intent | Minimum command pattern |
|
|
41
|
+
| ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- |
|
|
42
|
+
| "Why is service X failing right now?" | `monitors search/list` + `logs aggregate` for top errors + optionally `traces search` for representative failing spans. |
|
|
43
|
+
| "Show me errors for service X in the last hour." | `logs aggregate` for counts/top-N first; only use `logs search` if the user asked for specific log lines. |
|
|
44
|
+
| "What is the status of monitor X?" | `monitors search --query=...` or `monitors get <id>`, then cite state and last transition if present. |
|
|
45
|
+
| "Tell me about incident INC-123." | `incidents get <id>` directly. Only fall back to `incidents list --query=...` if no ID is known. |
|
|
46
|
+
| "What depends on checkout?" | `apm dependencies list --env <env>` or `apm flow-map --query="service:checkout" --env <env>`. |
|
|
47
|
+
| "How did this trace spend its time?" | `traces search --query="trace_id:<id>"`; cite slowest/error spans. Pup exposes span search, not a guaranteed full tree. |
|
|
48
|
+
| "What tag values are valid for this metric?" | `metrics metadata get <metric>` and `metrics tags list <metric> --from=... --to=...` before `metrics query`. |
|
|
49
|
+
| "Which hosts are unhealthy?" | `infrastructure hosts list --filter=...` with env/service/role filters. |
|
|
50
|
+
| "Find slow page loads." | `rum aggregate` or `rum events` with RUM facets and a bounded time window. |
|
|
53
51
|
|
|
54
52
|
## Content expectations
|
|
55
53
|
|
|
56
|
-
- Translate Slack-thread wording into stable observability language
|
|
57
|
-
- Preserve material URLs present in the conversation
|
|
58
|
-
- Include Datadog deep links
|
|
59
|
-
- Label assumptions clearly when the thread leaves important details uncertain
|
|
54
|
+
- Translate Slack-thread wording into stable observability language: env, service, status, span, monitor, incident, host.
|
|
55
|
+
- Preserve material URLs present in the conversation when they add evidence.
|
|
56
|
+
- Include Datadog deep links when Pup returns them or when a stable ID-specific link is obvious.
|
|
57
|
+
- Label assumptions clearly when the thread leaves important details uncertain: chosen env, chosen time window, chosen service.
|
package/skills/datadog/references/common-use-cases.md
CHANGED
|
@@ -5,77 +5,82 @@ Use these patterns to shape concrete Datadog requests.
|
|
|
5
5
|
## 1. Triage "service X is failing right now"
|
|
6
6
|
|
|
7
7
|
- Default the time window to the last 15 minutes unless the user gave a different one.
|
|
8
|
-
- Constrain by `service:` and `env
|
|
9
|
-
- `
|
|
10
|
-
- Then `
|
|
11
|
-
- If the user asks "why",
|
|
12
|
-
- Report monitor state, top error, and one
|
|
8
|
+
- Constrain by `service:` and `env:`. Explicit user input wins; fall back to `datadog.service` / `datadog.env`.
|
|
9
|
+
- Run `pup --read-only --agent monitors search --query="service:<x>"` or `monitors list --tags=service:<x>,env:<env>` first; a firing monitor usually names the failure mode.
|
|
10
|
+
- Then run `pup --read-only --agent logs aggregate --query="service:<x> env:<env>" --from="15m" --to="now" --compute=count --group-by=status` or group by an error facet such as `@error.kind`.
|
|
11
|
+
- If the user asks "why", search representative failing spans with `pup --read-only --agent traces search --query="service:<x> env:<env> status:error" --from="15m" --limit=20`.
|
|
12
|
+
- Report monitor state, top error, and one representative trace/span link when available.
|
|
13
13
|
|
|
14
14
|
## 2. "Is this monitor alerting?"
|
|
15
15
|
|
|
16
|
-
-
|
|
17
|
-
-
|
|
18
|
-
-
|
|
16
|
+
- If the user gave a monitor ID, run `pup --read-only --agent monitors get <id>`.
|
|
17
|
+
- Otherwise run `pup --read-only --agent monitors search --query="<name or tag>"` or `monitors list --name="<name>" --tags=...`.
|
|
18
|
+
- Report state (`OK`, `Warn`, `Alert`, `No Data`), last transition if present, and the monitor link.
|
|
19
|
+
- If the monitor is in `No Data`, note that explicitly; it is not the same as healthy.
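The reporting rule for monitor states can be sketched as a small lookup. This is illustrative only: `describe_monitor_state` is a hypothetical helper, and it assumes the state arrives as one of the four strings listed above, which may not match the exact field values in Pup's output.

```bash
# Sketch only: describe_monitor_state is a hypothetical helper, and the
# state strings are assumed to match the four states named above.

# Map a monitor state to the wording used when reporting it.
describe_monitor_state() {
  case "$1" in
    OK) echo "healthy" ;;
    Warn) echo "warning" ;;
    Alert) echo "alerting" ;;
    "No Data") echo "no data (not the same as healthy)" ;;
    *) echo "unknown state: $1" ;;
  esac
}
```

Keeping `No Data` as its own branch makes it hard to report a silent monitor as healthy by accident.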
|
|
19
20
|
|
|
20
21
|
## 3. "Tell me about incident INC-123" or "What is the status of the Redis incident?"
|
|
21
22
|
|
|
22
|
-
- If the user named the incident ID,
|
|
23
|
-
- If only a topic was named,
|
|
24
|
-
- Report severity, state, owner, and link to the incident.
|
|
25
|
-
-
|
|
23
|
+
- If the user named the incident ID, run `pup --read-only --agent incidents get <id>`.
|
|
24
|
+
- If only a topic was named, run `pup --read-only --agent incidents list --query="state:active <topic>" --limit=20` and scan for a match in the thread's time window.
|
|
25
|
+
- Report severity, state, owner/team if present, and link to the incident.
|
|
26
|
+
- Do not fabricate timeline entries if Pup does not return them.
|
|
26
27
|
|
|
27
28
|
## 4. Log search with a specific query
|
|
28
29
|
|
|
29
|
-
-
|
|
30
|
-
- Constrain with `service:`, `env:`, `status:`, `host:`, or `@<faceted_field>:` as appropriate
|
|
30
|
+
- Use `pup --read-only --agent logs search` only when the user explicitly wants raw log lines.
|
|
31
|
+
- Constrain with `service:`, `env:`, `status:`, `host:`, or `@<faceted_field>:` as appropriate.
|
|
31
32
|
- Cap page size and time window to avoid huge responses.
|
|
32
|
-
- Report a short summary plus a Datadog logs deep link. Quote only the minimum log content.
|
|
33
|
+
- Report a short summary plus a Datadog logs deep link when available. Quote only the minimum log content.
|
|
33
34
|
|
|
34
35
|
## 5. "What are the top errors for service X right now?"
|
|
35
36
|
|
|
36
|
-
- Prefer `
|
|
37
|
+
- Prefer `pup --read-only --agent logs aggregate --query="service:<x> env:<env> status:error" --compute=count --group-by=@error.kind --limit=10`.
|
|
38
|
+
- Use `--group-by=@http.status_code`, `status`, `service`, `host`, or another facet when it better matches the question.
|
|
37
39
|
- Report the top 3-5 buckets with counts, not an exhaustive table.
|
|
38
|
-
- Include the aggregated query link so the user can open the same view in Datadog.
|
|
39
40
|
|
|
40
41
|
## 6. Trace inspection by ID
|
|
41
42
|
|
|
42
|
-
- Use `
|
|
43
|
-
- Cite the top 3 slowest or error-tagged spans
|
|
44
|
-
- If the
|
|
43
|
+
- Pup exposes span search. Use `pup --read-only --agent traces search --query="trace_id:<id>" --from=<window> --to=<window> --limit=100`.
|
|
44
|
+
- Cite the top 3 slowest or error-tagged spans: service, resource/operation, duration, error state.
|
|
45
|
+
- If the returned spans look partial, say so. Do not claim a complete trace tree unless the output proves it.
|
|
45
46
|
|
|
46
47
|
## 7. Span search for a known error pattern
|
|
47
48
|
|
|
48
|
-
- Use `
|
|
49
|
-
-
|
|
49
|
+
- Use `pup --read-only --agent traces search --query='service:<x> env:<env> status:error resource_name:"..."' --from=... --to=...`.
|
|
50
|
+
- For counts or latency buckets, use `pup --read-only --agent traces aggregate --query="service:<x> env:<env>" --compute=count --group-by=resource_name`.
|
|
51
|
+
- Report counts plus the most illustrative span's trace link when available.
|
|
50
52
|
|
|
51
53
|
## 8. Service topology lookup
|
|
52
54
|
|
|
53
|
-
- Use `
|
|
54
|
-
-
|
|
55
|
+
- Use `pup --read-only --agent apm dependencies list --env <env> --from=... --to=...` to answer dependency questions.
|
|
56
|
+
- Use `pup --read-only --agent apm flow-map --query="service:<x>" --env <env> --from=... --to=...` when the question is centered on one service.
|
|
57
|
+
- Return the dependency list with service names and a Service Catalog/APM link when available.
|
|
55
58
|
|
|
56
59
|
## 9. Metric lookup
|
|
57
60
|
|
|
58
|
-
- Use `
|
|
59
|
-
- Once the metric name is known, use `
|
|
60
|
-
- Use `
|
|
61
|
-
- Report headline numbers
|
|
61
|
+
- Use `pup --read-only --agent metrics search --query="<pattern>"` or `metrics list --filter="<pattern>"` when the user is unsure of the metric name.
|
|
62
|
+
- Once the metric name is known, use `pup --read-only --agent metrics query --query="avg:<metric>{env:<env>,service:<service>}" --from=... --to=...`.
|
|
63
|
+
- Use `pup --read-only --agent metrics metadata get <metric>` and `metrics tags list <metric> --from=... --to=...` before querying if the user wants valid tags.
|
|
64
|
+
- Report headline numbers: current, peak, delta, or bucketed values as appropriate.
|
|
62
65
|
|
|
63
66
|
## 10. Host health
|
|
64
67
|
|
|
65
|
-
- Use `
|
|
66
|
-
-
|
|
68
|
+
- Use `pup --read-only --agent infrastructure hosts list --filter="env:<env> <role-or-service>" --count=50`.
|
|
69
|
+
- Use `pup --read-only --agent infrastructure hosts get <hostname>` for a specific host.
|
|
70
|
+
- Return counts, unhealthy host names/tags, and a host map link when available.
|
|
67
71
|
|
|
68
72
|
## 11. RUM / frontend slowness
|
|
69
73
|
|
|
70
|
-
- Use `
|
|
74
|
+
- Use `pup --read-only --agent rum aggregate` for top views/errors and `rum events` only when the user needs example events.
|
|
75
|
+
- Use `pup --read-only --agent rum sessions search` for session questions.
|
|
71
76
|
- Constrain to `@type:error`, slow page loads, or specific views; bound the time window.
|
|
72
|
-
- Do not use RUM for backend errors
|
|
77
|
+
- Do not use RUM for backend errors; those live in logs/APM.
|
|
73
78
|
|
|
74
79
|
## 12. Dashboards and notebooks
|
|
75
80
|
|
|
76
|
-
- `
|
|
77
|
-
- `
|
|
78
|
-
- This skill does not create or edit dashboards or notebooks.
|
|
81
|
+
- `pup --read-only --agent dashboards list` and `dashboards get <id>` are useful for "do we already have a dashboard for X?".
|
|
82
|
+
- `pup --read-only --agent notebooks list` and `notebooks get <id>` are for reading investigation notebooks.
|
|
83
|
+
- This skill does not create or edit dashboards or notebooks.
|
|
79
84
|
|
|
80
85
|
## 13. Storing channel defaults
|
|
81
86
|
|
package/skills/datadog/references/query-syntax.md
CHANGED
|
@@ -1,22 +1,22 @@
|
|
|
1
1
|
# Query Syntax
|
|
2
2
|
|
|
3
|
-
Use this reference when forming Datadog log queries, span queries, and
|
|
3
|
+
Use this reference when forming Datadog log queries, span queries, RUM queries, and Pup aggregate commands.
|
|
4
4
|
|
|
5
5
|
## Log search query syntax
|
|
6
6
|
|
|
7
7
|
Datadog log search queries are tag-and-facet based. Core building blocks:
|
|
8
8
|
|
|
9
|
-
| Form | Meaning
|
|
10
|
-
| ------------------ |
|
|
11
|
-
| `service:<name>` | Reserved attribute
|
|
12
|
-
| `env:<name>` | Reserved attribute
|
|
13
|
-
| `host:<name>` | Reserved attribute
|
|
14
|
-
| `status:<level>` | Log level: `error`, `warn`, `info`, `debug`, etc.
|
|
15
|
-
| `source:<name>` | Log source integration
|
|
16
|
-
| `@<field>:<value>` | Faceted attribute
|
|
17
|
-
| `"some phrase"` | Free-text phrase search.
|
|
18
|
-
| `AND`, `OR`, `-` | Boolean ops; `-` negates. Default operator between terms is `AND`.
|
|
19
|
-
| `(a OR b) AND c` | Parenthesized boolean expression.
|
|
9
|
+
| Form | Meaning |
|
|
10
|
+
| ------------------ | ------------------------------------------------------------------- |
|
|
11
|
+
| `service:<name>` | Reserved attribute: service emitting the log. |
|
|
12
|
+
| `env:<name>` | Reserved attribute: deployment environment tag. |
|
|
13
|
+
| `host:<name>` | Reserved attribute: emitting host. |
|
|
14
|
+
| `status:<level>` | Log level: `error`, `warn`, `info`, `debug`, etc. |
|
|
15
|
+
| `source:<name>` | Log source integration, for example `nginx` or `python`. |
|
|
16
|
+
| `@<field>:<value>` | Faceted attribute: custom JSON field, e.g. `@http.status_code:500`. |
|
|
17
|
+
| `"some phrase"` | Free-text phrase search. |
|
|
18
|
+
| `AND`, `OR`, `-` | Boolean ops; `-` negates. Default operator between terms is `AND`. |
|
|
19
|
+
| `(a OR b) AND c` | Parenthesized boolean expression. |
|
|
20
20
|
|
|
21
21
|
Common examples:
|
|
22
22
|
|
|
@@ -31,47 +31,65 @@ Tips:
|
|
|
31
31
|
- `status` and `@http.status_code` are different. `status` is the log level; `@http.status_code` is the HTTP response code.
|
|
32
32
|
- Reserved attributes (`service`, `env`, `host`, `status`, `source`) do not take the `@` prefix. Custom fields do.
|
|
33
33
|
|
|
34
|
+
## Pup log commands
|
|
35
|
+
|
|
36
|
+
- Raw logs: `pup --read-only --agent logs search --query="service:checkout env:prod status:error" --from="15m" --to="now" --limit=20`
|
|
37
|
+
- Alternate v2 listing: `pup --read-only --agent logs list --query="service:checkout env:prod" --from="1h" --limit=20`
|
|
38
|
+
- Aggregation: `pup --read-only --agent logs aggregate --query="service:checkout env:prod status:error" --compute=count --group-by=@error.kind --limit=10`
|
|
39
|
+
|
|
40
|
+
`logs aggregate` options to prefer for analytics:
|
|
41
|
+
|
|
42
|
+
- `--compute=count` for volume.
|
|
43
|
+
- `--compute="avg(@duration)"`, `sum(...)`, `min(...)`, `max(...)`, or `percentile(@duration, 95)` for numeric fields.
|
|
44
|
+
- `--group-by=status`, `service`, `host`, `@http.status_code`, `@error.kind`, or another facet.
|
|
45
|
+
- `--limit=10` unless the user needs more.
|
|
46
|
+
|
|
34
47
|
## Span / APM search
|
|
35
48
|
|
|
36
49
|
APM span search shares the same query language, plus a few APM-specific attributes:
|
|
37
50
|
|
|
38
|
-
| Attribute | Meaning
|
|
39
|
-
| ------------------ |
|
|
40
|
-
| `service:<name>` | Service emitting the span.
|
|
41
|
-
| `env:<name>` | Deployment environment tag.
|
|
42
|
-
| `operation_name:X` | Span operation name
|
|
43
|
-
| `resource_name:X` | Endpoint or handler.
|
|
44
|
-
| `status:error` | Span is marked as an error.
|
|
45
|
-
|
|
|
51
|
+
| Attribute | Meaning |
|
|
52
|
+
| ------------------ | ----------------------------------------- |
|
|
53
|
+
| `service:<name>` | Service emitting the span. |
|
|
54
|
+
| `env:<name>` | Deployment environment tag. |
|
|
55
|
+
| `operation_name:X` | Span operation name, e.g. `http.request`. |
|
|
56
|
+
| `resource_name:X` | Endpoint or handler. |
|
|
57
|
+
| `status:error` | Span is marked as an error. |
|
|
58
|
+
| `@duration:>...` | Duration filter in nanoseconds. |
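
Because the `@duration` filter is expressed in nanoseconds, a threshold given in milliseconds has to be converted first. A minimal sketch, with `duration_filter_over` as a hypothetical helper name:

```python
NS_PER_MS = 1_000_000  # 1 millisecond = 1,000,000 nanoseconds


def duration_filter_over(ms: float) -> str:
    """Return a span filter matching durations longer than `ms` milliseconds."""
    return f"@duration:>{int(ms * NS_PER_MS)}"


print(duration_filter_over(500))  # @duration:>500000000
```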
|
|
59
|
+
|
|
60
|
+
Commands:
|
|
61
|
+
|
|
62
|
+
- `pup --read-only --agent traces search --query="service:checkout env:prod status:error" --from="15m" --limit=20`
|
|
63
|
+
- `pup --read-only --agent traces aggregate --query="service:checkout env:prod" --compute="percentile(@duration, 95)" --group-by=resource_name`
|
|
64
|
+
- For a trace ID, use `traces search --query="trace_id:<id>"` with a window that brackets the trace. Pup returns matching spans; do not assume it returned a complete tree unless the output proves it.
|
|
65
|
+
|
|
66
|
+
## RUM queries
|
|
46
67
|
|
|
47
|
-
|
|
68
|
+
Use RUM only for browser/user-experience questions:
|
|
48
69
|
|
|
49
|
-
`
|
|
70
|
+
- `pup --read-only --agent rum events --query='@type:error @application.name:"Web"' --from="1h" --limit=20`
|
|
71
|
+
- `pup --read-only --agent rum aggregate --query='@type:view' --compute="percentile(@view.loading_time, 95)" --group-by=@view.name`
|
|
72
|
+
- `pup --read-only --agent rum sessions search --query='@session.type:user' --from="1h" --limit=20`
|
|
50
73
|
|
|
51
|
-
|
|
74
|
+
## Metric queries
|
|
52
75
|
|
|
53
|
-
|
|
54
|
-
- Use `COUNT(*)` for volume, `COUNT(DISTINCT <field>)` for unique cardinality.
|
|
55
|
-
- `GROUP BY` faceted fields (without `@` in the SQL form — the tool's schema specifies how to reference them; follow the tool's input schema exactly).
|
|
56
|
-
- Cap with `ORDER BY ... DESC LIMIT N` — top 5-10 is usually enough.
|
|
76
|
+
Datadog metric query strings follow the usual metric explorer shape:
|
|
57
77
|
|
|
58
|
-
|
|
78
|
+
- `avg:system.cpu.user{env:prod,service:checkout}`
|
|
79
|
+
- `sum:trace.http.request.errors{env:prod,service:checkout}.as_count()`
|
|
80
|
+
- `p95:trace.http.request.duration{env:prod,service:checkout}`
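
The metric query shape above (`<aggregator>:<metric>{<tag>:<value>,...}`) can be assembled mechanically. The sketch below is illustrative only; `metric_query` is a hypothetical helper and does not check that the metric or tags exist.

```python
def metric_query(aggregator: str, metric: str, tags: dict, as_count: bool = False) -> str:
    """Format a metric query string in the shape shown in the examples above."""
    scope = ",".join(f"{key}:{value}" for key, value in tags.items())
    query = f"{aggregator}:{metric}{{{scope}}}"  # doubled braces emit literal { }
    return query + ".as_count()" if as_count else query


tags = {"env": "prod", "service": "checkout"}
print(metric_query("avg", "system.cpu.user", tags))
# avg:system.cpu.user{env:prod,service:checkout}
print(metric_query("sum", "trace.http.request.errors", tags, as_count=True))
# sum:trace.http.request.errors{env:prod,service:checkout}.as_count()
```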
|
|
59
81
|
|
|
60
|
-
|
|
61
|
-
- HTTP 5xx count by status code in the last 15 minutes, grouped by `@http.status_code`.
|
|
62
|
-
- Log volume by `host` over the last hour to spot a noisy emitter.
|
|
82
|
+
Use `metrics search` or `metrics list` to find names, `metrics metadata get` for metadata, and `metrics tags list` for tag dimensions before querying when needed.
|
|
63
83
|
|
|
64
84
|
## Time windows
|
|
65
85
|
|
|
66
86
|
- For "right now" questions, default to the last 15 minutes.
|
|
67
87
|
- For "what happened earlier today" questions, default to the last 24 hours.
|
|
68
88
|
- For incident-linked questions, prefer a window that brackets the incident `created` time.
|
|
69
|
-
- Always include a time window
|
|
89
|
+
- Always include a time window. Unbounded queries are slow and easy to misinterpret.
|
|
70
90
|
|
|
71
91
|
## What to cite back
|
|
72
92
|
|
|
73
|
-
- The exact query string used
|
|
74
|
-
- A Datadog deep link that encodes the same filter:
|
|
75
|
-
- `https://app.datadoghq.com/logs?query=<url-encoded-query>&from_ts=<ms>&to_ts=<ms>`
|
|
76
|
-
- `https://app.datadoghq.com/apm/traces?query=<url-encoded-query>`
|
|
93
|
+
- The exact query string used, for example `service:checkout env:prod status:error`.
|
|
77
94
|
- The time window you used.
|
|
95
|
+
- A Datadog deep link when Pup returns one or when a stable ID-specific app link is available.
|
|
@@ -1,17 +1,23 @@
|
|
|
1
1
|
# Troubleshooting and Workarounds
|
|
2
2
|
|
|
3
|
-
Use this reference when
|
|
3
|
+
Use this reference when Pup commands fail or return unexpected results.
|
|
4
4
|
|
|
5
5
|
## Permission and scope errors
|
|
6
6
|
|
|
7
|
-
- A
|
|
8
|
-
- Stop and tell the user the current Datadog
|
|
7
|
+
- A `403 Forbidden` or `permission denied` response means the configured Datadog API/application keys cannot read the requested resource type (metrics, APM, incidents, RUM, and so on).
|
|
8
|
+
- Stop and tell the user the current Datadog integration could not access the requested data. Suggest the operator verify the Datadog application key scopes/role.
|
|
9
9
|
- Do not guess specific missing permission names unless Datadog explicitly named one in the error.
|
|
10
10
|
- Do not loop retrying a 403.
|
|
11
11
|
|
|
12
|
+
## Authentication errors
|
|
13
|
+
|
|
14
|
+
- A `401 Unauthorized`, `missing API key`, or `missing application key` error usually means `DATADOG_API_KEY` or `DATADOG_APP_KEY` is missing from the Junior deployment env, or the key was revoked.
|
|
15
|
+
- Pup receives placeholder env values in the sandbox so that it still issues HTTP requests; the host then injects the real `DD-API-KEY` and `DD-APPLICATION-KEY` headers for Datadog API domains.
|
|
16
|
+
- Do not ask the user to paste keys into Slack or the sandbox. Tell the operator to fix the deployment env and retry.
|
|
17
|
+
|
|
12
18
|
## Rate limits
|
|
13
19
|
|
|
14
|
-
- Datadog
|
|
20
|
+
- Datadog API endpoints can return `429 Too Many Requests`.
|
|
15
21
|
- Retry the same query once after a short wait.
|
|
16
22
|
- If it fails again, report the throttle and stop. Do not fall back to larger scans that will throttle harder.
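
The retry-once rule above can be sketched as follows. This is an assumption-laden illustration: `call` is a hypothetical zero-argument function that runs the same query and returns an `(http_status, body)` pair, and the 2-second wait is an arbitrary stand-in for "a short wait".

```python
import time
from typing import Callable, Tuple


def run_with_single_retry(
    call: Callable[[], Tuple[int, str]],
    wait_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Run `call`; on a 429, wait briefly and retry the same query exactly once."""
    status, body = call()
    if status != 429:
        return status, body
    sleep(wait_seconds)
    status, body = call()
    if status == 429:
        # Second throttle: report it and stop, rather than widening the scan.
        return status, "Datadog rate limit hit twice; stopping."
    return status, body
```

Injecting `sleep` keeps the helper testable without real delays.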
|
|
17
23
|
|
|
@@ -19,21 +25,24 @@ Use this reference when Datadog MCP calls fail or return unexpected results.
|
|
|
19
25
|
|
|
20
26
|
- Double-check that `env:` and `service:` match real values. Datadog tag values are case-sensitive.
|
|
21
27
|
- Widen the time window before widening the filter. Many "no results" cases are just too narrow a window.
|
|
22
|
-
- If searching logs with `@<field>:value`, confirm the field exists as a facet
|
|
23
|
-
- If an expected monitor or incident is missing, the
|
|
28
|
+
- If searching logs or RUM with `@<field>:value`, confirm the field exists as a facet.
|
|
29
|
+
- If an expected monitor or incident is missing, the application key may not have access to that team/resource.
|
|
24
30
|
|
|
25
31
|
## Too many results / large payloads
|
|
26
32
|
|
|
27
|
-
- Prefer `
|
|
28
|
-
- For
|
|
33
|
+
- Prefer `pup --read-only --agent logs aggregate` or `traces aggregate` with `--group-by` and `--limit` over paging through raw events.
|
|
34
|
+
- For span/trace responses that look partial, say so in the reply. Do not pretend the shown spans are complete.
|
|
29
35
|
- Quote only the minimum log / span / metric content needed as evidence. Link to Datadog for the rest.
|
|
30
36
|
|
|
31
37
|
## Multiple Datadog sites
|
|
32
38
|
|
|
33
|
-
- The packaged plugin defaults to
|
|
34
|
-
-
|
|
39
|
+
- The packaged plugin defaults to US1 (`datadoghq.com`) and sets Pup's `DD_SITE` from the manifest `DATADOG_SITE` env var.
|
|
40
|
+
- Non-US1 operators set `DATADOG_SITE` in their Junior deployment env to their site host, for example `us5.datadoghq.com`, `datadoghq.eu`, or `ddog-gov.com`.
|
|
41
|
+
- Setting deployment `DD_SITE` alone has no effect; the plugin owns Pup's sandbox `DD_SITE` through `DATADOG_SITE`.
|
|
42
|
+
- The packaged plugin allows the standard Datadog API hosts for US1, US3, US5, EU, AP1, AP2, and GovCloud. A custom or staging Datadog domain needs a manifest change so the API domain allowlist matches.
|
|
43
|
+
- If the user's Datadog account lives on a different site than the deployment is configured for, advise the operator to update `DATADOG_SITE`. Do not try to work around this silently inside a turn.
|
|
35
44
|
|
|
36
45
|
## Read-only scope
|
|
37
46
|
|
|
38
|
-
- This skill intentionally
|
|
39
|
-
- If the user asks to create a notebook, edit a monitor, mute an alert, or resolve an incident, stop and tell them those actions are not in scope.
|
|
47
|
+
- This skill intentionally uses only read-oriented Pup commands.
|
|
48
|
+
- If the user asks to create a notebook, edit a monitor, mute an alert, submit a metric, or resolve an incident, stop and tell them those actions are not in scope.
|