@kinetica/admin-agent 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -23,6 +23,18 @@ Severity order for filtering is `WARN < UERR < ERROR < FATAL`, so `min_severity=
23
23
  - Timestamps are plain local strings without a timezone; compare them lexically and treat cross-rank timing cautiously.
24
24
  - **Ranks vs. the host manager:** `rank` selects a numeric rank (`r0`, `r1`, …) only. The host manager (`core-gpudb-rolling-hm.log`) is a singleton service, NOT a rank — search or timeline it with `host_manager: true`, never `rank: "hm"`. By default both `log_timeline` and `search_logs` already cover the host manager along with the numeric ranks; `kinetica_bundle_list_files` lists it under `services_present`.
25
25
 
26
+ ### Finding a crash's triggering SQL
27
+
28
+ When a worker rank segfaults mid-query, that rank's log holds the **backtrace** but NOT the **SQL** — the query text and predicates are logged on **rank 0** (the coordinator), never on workers. Do not conclude the SQL is "only in `ki_query_history`" (a live table, unavailable offline) just because it is absent from the crashing rank.
29
+
30
+ Workflow, given a `JobId` from a worker's crash stack:
31
+
32
+ 1. `kinetica_bundle_search_logs` with `rank: "r0"` and `regex` = the JobId. r0 logs the `/execute/sql` receipt (submitting user), the `Sql/SqlDriver.cpp … Executing SQL:` line, and per-operation endpoint lines.
33
+ 2. The per-operation lines (`Endpoint_aggregate_group_by.cpp`, filter/join endpoints) carry `table:`, `column_names:`/`aliases:` (the SELECT list), and `expr:` (the full WHERE predicate) — reconstruct the query from these.
34
+ 3. **Quirk:** if `Found plan for the SQL in cache` precedes it, the `Executing SQL:` line is truncated to just `SELECT`. Use the per-operation endpoint lines (step 2) — their predicate survives a cache hit. A `datetime()`/timestamp filter showing up here often _is_ the input that triggered a parser segfault.
35
+
36
+ See `rank-architecture.md` (Where queries are logged) for why this locality holds.
37
+
26
38
  ### Files of interest
27
39
 
28
40
  `kinetica_bundle_list_files` annotates every file with a `description` of what it contains, so consult that first. The canonical OS-diagnostic / host files (each is `EXEC_CMD`-wrapped — read with `kinetica_bundle_read_sysinfo`):
@@ -37,4 +49,13 @@ Severity order for filtering is `WARN < UERR < ERROR < FATAL`, so `min_severity=
37
49
  - **Packages / accounts:** `deb.txt` / `rpm.txt` (installed packages), `user.txt` (users/groups, gpudb account), `ld.so.conf.txt`, `etc_*.txt` (system shell/host config).
38
50
  - **Evidence Gaps:** `errors.txt` / `proc-logs-erros.txt` — collection commands that FAILED. `logfiles.txt` — manifest of log dirs the collector enumerated.
39
51
 
40
- Rolling core logs under `logs-local/` are the primary source. The small last-2h Loki tails under `logs/` are searched only when no rolling core logs were collected. Each `*.txt` artifact records the exact shell command that produced it in its `EXEC_CMD:` header, so `kinetica_bundle_read_sysinfo` always shows you precisely what ran.
52
+ ### Two log families and why every rank is reachable
53
+
54
+ A bundle carries per-rank logs in up to two places, and the collector host usually holds only a couple of the cluster's ranks:
55
+
56
+ - **`logs-local/` — rolling core logs** (`core-gpudb-rolling-r{N}.log`, plus rotations `…​.log.1`): the full local history, but ONLY for the ranks running on the host the collector ran on (often just r0/r1). These are the primary, richest source for those ranks.
57
+ - **`logs/` — Loki/promtail exports** (`rank{N}.log` for every rank cluster-wide, `hostmanager.log`, and per-component tails like `sql.log`, `graph.log`): pulled from centralized logging, so these are the ONLY evidence for ranks on hosts the collector didn't run on (e.g. r2…r8). They are JSON-wrapped on disk, but the tools unwrap them transparently — you still filter by `min_severity`, `from_ts`/`to_ts`, and get clean messages.
58
+
59
+ You do not choose between families. `rank: "r{N}"` resolves to that rank's rolling log if present, else its Loki tail — so **all** ranks reported under `ranks_present` are searchable the same way, and a default (no-rank) search/timeline spans the whole cluster. `host_manager: true` likewise prefers `core-gpudb-rolling-hm.log`, falling back to `logs/hostmanager.log`. Trust `ranks_present` from `kinetica_bundle_list_files` for the true rank count; don't infer it from `logs-local/` alone.
60
+
61
+ Each `*.txt` artifact records the exact shell command that produced it in its `EXEC_CMD:` header, so `kinetica_bundle_read_sysinfo` always shows you precisely what ran.
@@ -31,6 +31,18 @@ resource-group reports.
31
31
  - All 16,384 shards map to worker ranks (rank 0 holds no shards).
32
32
  - RAM limits typically 5+ GB per rank.
33
33
 
34
+ ## Where queries are logged — rank 0 only (crash forensics)
35
+
36
+ The **SQL text and query predicates are logged on rank 0**, the coordinator that receives and plans every query. Worker ranks log only their slice of execution plus any fault. So when a worker rank crashes mid-query, its log holds the **stack trace** but NOT the **triggering SQL** — those are two different ranks.
37
+
38
+ To recover the triggering SQL for a crashing `JobId`, search **rank 0's** log (the rolling `core-gpudb-rolling-r0.log` or the Loki `rank0.log`) for that JobId. Rank 0 logs, in order:
39
+
40
+ - `Endpoint.cpp` — `Request URI: /execute/sql … user: …` (who submitted it)
41
+ - `Sql/SqlDriver.cpp … Executing SQL: <text>` — the SQL text
42
+ - per-operation endpoint lines (`Endpoint_aggregate_group_by.cpp`, filter/join endpoints) — `table:`, `column_names:`/`aliases:` (the SELECT list), and `expr:` (the full WHERE predicate)
43
+
44
+ **Quirk:** when the line `SqlDriver.cpp … Found plan for the SQL in cache` precedes it, the `Executing SQL:` line is **truncated to just `SELECT`** (a cached plan skips re-logging the text). Reconstruct the query from the per-operation endpoint lines instead — their `table` + `column_names` + `expr` survive regardless of plan cache state. This is the reliable path to a crash's exact query (including timestamp/`datetime()` filters that often trigger parser faults).
45
+
34
46
  ## Interpreting Metrics — Key Rule
35
47
 
36
48
  **Rank 0's low resource usage is normal — it is NOT a sign of
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@kinetica/admin-agent",
3
- "version": "0.2.0",
3
+ "version": "0.2.1",
4
4
  "description": "Autonomous diagnostic agent for Kinetica databases",
5
5
  "license": "Apache-2.0",
6
6
  "author": "Kinetica",