PyPI - holmesgpt - Versions diffs - 0.13.3a0__py3-none-any.whl → 0.14.1__py3-none-any.whl - Mend

holmesgpt 0.13.3a0__py3-none-any.whl → 0.14.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of holmesgpt might be problematic.

Files changed (86)

holmes/__init__.py +1 -1
holmes/clients/robusta_client.py +15 -4
holmes/common/env_vars.py +8 -1
holmes/config.py +66 -139
holmes/core/investigation.py +1 -2
holmes/core/llm.py +295 -52
holmes/core/models.py +2 -0
holmes/core/safeguards.py +4 -4
holmes/core/supabase_dal.py +14 -8
holmes/core/tool_calling_llm.py +202 -177
holmes/core/tools.py +260 -25
holmes/core/tools_utils/data_types.py +81 -0
holmes/core/tools_utils/tool_context_window_limiter.py +33 -0
holmes/core/tools_utils/tool_executor.py +2 -2
holmes/core/toolset_manager.py +150 -3
holmes/core/tracing.py +6 -1
holmes/core/transformers/__init__.py +23 -0
holmes/core/transformers/base.py +62 -0
holmes/core/transformers/llm_summarize.py +174 -0
holmes/core/transformers/registry.py +122 -0
holmes/core/transformers/transformer.py +31 -0
holmes/main.py +5 -0
holmes/plugins/prompts/_fetch_logs.jinja2 +10 -1
holmes/plugins/toolsets/aks-node-health.yaml +46 -0
holmes/plugins/toolsets/aks.yaml +64 -0
holmes/plugins/toolsets/atlas_mongodb/mongodb_atlas.py +17 -15
holmes/plugins/toolsets/azure_sql/tools/analyze_connection_failures.py +8 -4
holmes/plugins/toolsets/azure_sql/tools/analyze_database_connections.py +7 -3
holmes/plugins/toolsets/azure_sql/tools/analyze_database_health_status.py +3 -3
holmes/plugins/toolsets/azure_sql/tools/analyze_database_performance.py +3 -3
holmes/plugins/toolsets/azure_sql/tools/analyze_database_storage.py +7 -3
holmes/plugins/toolsets/azure_sql/tools/get_active_alerts.py +4 -4
holmes/plugins/toolsets/azure_sql/tools/get_slow_queries.py +7 -3
holmes/plugins/toolsets/azure_sql/tools/get_top_cpu_queries.py +7 -3
holmes/plugins/toolsets/azure_sql/tools/get_top_data_io_queries.py +7 -3
holmes/plugins/toolsets/azure_sql/tools/get_top_log_io_queries.py +7 -3
holmes/plugins/toolsets/bash/bash_toolset.py +6 -6
holmes/plugins/toolsets/bash/common/bash.py +7 -7
holmes/plugins/toolsets/coralogix/toolset_coralogix_logs.py +5 -3
holmes/plugins/toolsets/datadog/datadog_api.py +490 -24
holmes/plugins/toolsets/datadog/datadog_logs_instructions.jinja2 +21 -10
holmes/plugins/toolsets/datadog/toolset_datadog_general.py +345 -207
holmes/plugins/toolsets/datadog/toolset_datadog_logs.py +190 -19
holmes/plugins/toolsets/datadog/toolset_datadog_metrics.py +96 -32
holmes/plugins/toolsets/datadog/toolset_datadog_rds.py +10 -10
holmes/plugins/toolsets/datadog/toolset_datadog_traces.py +21 -22
holmes/plugins/toolsets/git.py +22 -22
holmes/plugins/toolsets/grafana/common.py +14 -2
holmes/plugins/toolsets/grafana/grafana_tempo_api.py +473 -0
holmes/plugins/toolsets/grafana/toolset_grafana.py +4 -4
holmes/plugins/toolsets/grafana/toolset_grafana_loki.py +5 -4
holmes/plugins/toolsets/grafana/toolset_grafana_tempo.jinja2 +246 -11
holmes/plugins/toolsets/grafana/toolset_grafana_tempo.py +662 -290
holmes/plugins/toolsets/grafana/trace_parser.py +1 -1
holmes/plugins/toolsets/internet/internet.py +3 -3
holmes/plugins/toolsets/internet/notion.py +3 -3
holmes/plugins/toolsets/investigator/core_investigation.py +3 -3
holmes/plugins/toolsets/kafka.py +18 -18
holmes/plugins/toolsets/kubernetes.yaml +58 -0
holmes/plugins/toolsets/kubernetes_logs.py +6 -6
holmes/plugins/toolsets/kubernetes_logs.yaml +32 -0
holmes/plugins/toolsets/logging_utils/logging_api.py +1 -1
holmes/plugins/toolsets/mcp/toolset_mcp.py +4 -4
holmes/plugins/toolsets/newrelic.py +8 -8
holmes/plugins/toolsets/opensearch/opensearch.py +5 -5
holmes/plugins/toolsets/opensearch/opensearch_logs.py +7 -7
holmes/plugins/toolsets/opensearch/opensearch_traces.py +10 -10
holmes/plugins/toolsets/prometheus/prometheus.py +841 -351
holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 +39 -2
holmes/plugins/toolsets/prometheus/utils.py +28 -0
holmes/plugins/toolsets/rabbitmq/toolset_rabbitmq.py +6 -4
holmes/plugins/toolsets/robusta/robusta.py +10 -10
holmes/plugins/toolsets/runbook/runbook_fetcher.py +4 -4
holmes/plugins/toolsets/servicenow/servicenow.py +6 -6
holmes/plugins/toolsets/utils.py +88 -0
holmes/utils/config_utils.py +91 -0
holmes/utils/env.py +7 -0
holmes/utils/holmes_status.py +2 -1
holmes/utils/sentry_helper.py +41 -0
holmes/utils/stream.py +9 -0
{holmesgpt-0.13.3a0.dist-info → holmesgpt-0.14.1.dist-info}/METADATA +11 -15
{holmesgpt-0.13.3a0.dist-info → holmesgpt-0.14.1.dist-info}/RECORD +85 -75
holmes/plugins/toolsets/grafana/tempo_api.py +0 -124
{holmesgpt-0.13.3a0.dist-info → holmesgpt-0.14.1.dist-info}/LICENSE.txt +0 -0
{holmesgpt-0.13.3a0.dist-info → holmesgpt-0.14.1.dist-info}/WHEEL +0 -0
{holmesgpt-0.13.3a0.dist-info → holmesgpt-0.14.1.dist-info}/entry_points.txt +0 -0

holmes/plugins/toolsets/grafana/toolset_grafana_tempo.jinja2 CHANGED

@@ -1,12 +1,247 @@
-Use Tempo when investigating latency or performance issues. Tempo provides traces information for application running on the cluster.
+Grafana Tempo provides distributed tracing data through its REST API. Each tool maps directly to a specific Tempo API endpoint.
 Assume every application provides tempo traces.
-1. Start by identifying an initial filter to use. This can be a pod name, a deployment name or a service name
-2. Call fetch_tempo_traces_comparative_sample first when investigating performance issues via traces. This tool provides comprehensive analysis for identifying patterns. For other issues not related to performance, you can start with fetch_tempo_traces.
-3. Use `fetch_tempo_traces` setting the appropriate query params
-    - Use the min_duration filter to ensure you get traces that trigger the alert when you are investigating a performance issue
-    - If possible, use start and end date to narrow down your search.
-        - Use fetch_finding_by_id if you are provided with a finding/alert id. It will contain details about when the alert was triggered
-    - Use at least one of the following argument to ensure you get relevant traces: `service_name`, `pod_name` or `deployment_name`.
-4. When you have a specific trace ID to investigate, use `fetch_tempo_trace_by_id` to get detailed information about that trace.
-5. Look at the duration of each span in any single trace and deduce any issues.
-6. ALWAYS fetch the logs for a pod once you identify a span that is taking a long time. There may be an explanation for the slowness in the logs.
+## API Endpoints and Tool Mapping
+1. **Trace Search** (GET /api/search)
+   - `tempo_search_traces_by_query`: Use with 'q' parameter for TraceQL queries
+   - `tempo_search_traces_by_tags`: Use with 'tags' parameter for logfmt queries
+2. **Trace Details** (GET /api/v2/traces/{trace_id})
+   - `tempo_query_trace_by_id`: Retrieve full trace data
+3. **Tag Discovery**
+   - `tempo_search_tag_names` (GET /api/v2/search/tags): List available tags
+   - `tempo_search_tag_values` (GET /api/v2/search/tag/{tag}/values): Get values for a tag
+4. **TraceQL Metrics**
+   - `tempo_query_metrics_instant` (GET /api/metrics/query): Single value computation
+   - `tempo_query_metrics_range` (GET /api/metrics/query_range): Time series data
+## Usage Workflow
+### 1. Discovering Available Data
+Start by understanding what tags and values exist:
+- Use `tempo_search_tag_names` to discover available tags
+- Use `tempo_search_tag_values` to see all values for a specific tag (e.g., service names)
+### 2. Searching for Traces
+**TraceQL Search (recommended):**
+Use `tempo_search_traces_by_query` with TraceQL syntax for powerful filtering.
+**TraceQL Capabilities:**
+TraceQL can select traces based on the following:
+- **Span and resource attributes** - Filter by any attribute on spans or resources
+- **Timing and duration** - Filter by trace/span duration
+- **Basic aggregates** - Use aggregate functions to compute values across spans
+**Supported Aggregate Functions:**
+- `count()` - Count the number of spans matching the criteria
+- `avg(attribute)` - Calculate average of a numeric attribute across spans
+- `min(attribute)` - Find minimum value of a numeric attribute
+- `max(attribute)` - Find maximum value of a numeric attribute
+- `sum(attribute)` - Sum values of a numeric attribute across spans
+**Aggregate Function Usage:**
+Aggregates are used with the pipe operator `|` to filter traces based on computed values across their spans.
+**Aggregate Examples:**
+- `{ span.http.status_code = 200 } | count() > 3` - Find traces with more than 3 spans having HTTP 200 status
+- `{ } | sum(span.bytesProcessed) > 1000000000` - Find traces where total processed bytes exceed 1 GB
+- `{ status = error } | by(resource.service.name) | count() > 1` - Find services with more than 1 error
+**Select Function:**
+- `{ status = error } | select(span.http.status_code, span.http.url)` - Select specific attributes from error spans
+**TraceQL Query Structure:**
+TraceQL queries follow the pattern: `{span-selectors} | aggregate`
+**TraceQL Query Examples (from official docs):**
+1. **Find traces of a specific operation:**
+   ```
+   {resource.service.name = "frontend" && name = "POST /api/orders"}
+   ```
+   ```
+   {
+     resource.service.namespace = "ecommerce" &&
+     resource.service.name = "frontend" &&
+     resource.deployment.environment = "production" &&
+     name = "POST /api/orders"
+   }
+   ```
+2. **Find traces with a particular outcome:**
+   ```
+   {
+     resource.service.name="frontend" &&
+     name = "POST /api/orders" &&
+     status = error
+   }
+   ```
+   ```
+   {
+     resource.service.name="frontend" &&
+     name = "POST /api/orders" &&
+     span.http.status_code >= 500
+   }
+   ```
+3. **Find traces with a particular behavior:**
+   ```
+   {span.service.name="frontend" && name = "GET /api/products/{id}"} && {span.db.system="postgresql"}
+   ```
+4. **Find traces across environments:**
+   ```
+   { resource.deployment.environment = "production" } && { resource.deployment.environment = "staging" }
+   ```
+5. **Structural operators (advanced):**
+   ```
+   { resource.service.name="frontend" } >> { status = error }  # Frontend spans followed by errors
+   { } !< { resource.service.name = "productcatalogservice" }  # Traces without productcatalog as child
+   { resource.service.name = "productcatalogservice" } ~ { resource.service.name="frontend" }  # Sibling spans
+   ```
+6. **Additional operator examples:**
+   ```
+   { span.http.method = "GET" && status = ok } && { span.http.method = "DELETE" && status != ok }  # && for multiple conditions
+   ```
+   ```
+   { resource.deployment.environment =~ "prod-.*" && span.http.status_code = 200 }  # =~ regex match
+   { span.http.method =~ "DELETE|GET" }  # Regex match multiple values
+   { trace:rootName !~ ".*perf.*" }  # !~ negated regex
+   { resource.cloud.region = "us-east-1" } || { resource.cloud.region = "us-west-1" }  # || OR operator
+   ```
+   ```
+   { span.http.status_code >= 400 && span.http.status_code < 500 }  # Client errors (4xx)
+   { span.http.url = "/path/of/api" } >> { span.db.name = "db-shard-001" }  # >> descendant
+   { span.http.status_code = 200 } | select(resource.service.name)  # Select specific attributes
+   ```
+**Common Attributes to Query:**
+- `resource.service.name` - Service name
+- `resource.k8s.*` - Kubernetes metadata (pod.name, namespace.name, deployment.name, etc.)
+- `span.http.*` - HTTP attributes (status_code, method, route, url, etc.)
+- `name` - Span name
+- `status` - Span status (error, ok)
+- `duration` - Span duration
+- `kind` - Span kind (server, client, producer, consumer, internal)
+**Tag-based Search (legacy):**
+Use `tempo_search_traces_by_tags` with logfmt format when you need min/max duration filters:
+- Example: `service.name="api" http.status_code="500"`
+- Supports `min_duration` and `max_duration` parameters
+### 3. Analyzing Specific Traces
+When you have trace IDs from search results:
+- Use `tempo_query_trace_by_id` to get full trace details
+- Examine spans for errors, slow operations, and bottlenecks
+### 4. Computing Metrics from Traces
+**TraceQL metrics** compute aggregated metrics from your trace data, helping you answer critical questions like:
+- How many database calls across all systems are downstream of your application?
+- What services beneath a given endpoint are failing?
+- What services beneath an endpoint are slow?
+TraceQL metrics parse your traces in aggregate to provide RED (Rate, Error, Duration) metrics from trace data.
+**Supported Functions:**
+- `rate` - Calculate rate of spans/traces
+- `count_over_time` - Count spans/traces over time
+- `sum_over_time` - Sum span attributes
+- `avg_over_time` - Average of span attributes
+- `max_over_time` - Maximum value over time
+- `min_over_time` - Minimum value over time
+- `quantile_over_time` - Calculate quantiles
+- `histogram_over_time` - Generate histogram data
+- `compare` - Compare metrics between time periods
+**Modifiers:**
+- `topk` - Return top N results
+- `bottomk` - Return bottom N results
+**TraceQL Metrics Query Examples:**
+1. **rate** - Calculate error rate by service and HTTP route:
+   ```
+   { resource.service.name = "foo" && status = error } | rate() by (span.http.route)
+   ```
+2. **count_over_time** - Count spans by HTTP status code:
+   ```
+   { name = "GET /:endpoint" } | count_over_time() by (span.http.status_code)
+   ```
+3. **sum_over_time** - Sum HTTP response sizes by service:
+   ```
+   { name = "GET /:endpoint" } | sum_over_time(span.http.response.size) by (resource.service.name)
+   ```
+4. **avg_over_time** - Average duration by HTTP status code:
+   ```
+   { name = "GET /:endpoint" } | avg_over_time(duration) by (span.http.status_code)
+   ```
+5. **max_over_time** - Maximum response size by HTTP target:
+   ```
+   { name = "GET /:endpoint" } | max_over_time(span.http.response.size) by (span.http.target)
+   ```
+6. **min_over_time** - Minimum duration by HTTP target:
+   ```
+   { name = "GET /:endpoint" } | min_over_time(duration) by (span.http.target)
+   ```
+7. **quantile_over_time** - Calculate multiple percentiles (99th, 90th, 50th) with exemplars:
+   ```
+   { span:name = "GET /:endpoint" } | quantile_over_time(duration, .99, .9, .5) by (span.http.target) with (exemplars=true)
+   ```
+8. **histogram_over_time** - Build duration histogram grouped by custom attribute:
+   ```
+   { name = "GET /:endpoint" } | histogram_over_time(duration) by (span.foo)
+   ```
+9. **compare** - Compare error spans against baseline (10 attributes):
+   ```
+   { resource.service.name="a" && span.http.path="/myapi" } | compare({status=error}, 10)
+   ```
+10. **Using topk modifier** - Find top 10 endpoints by request rate:
+   ```
+   { resource.service.name = "foo" } | rate() by (span.http.url) | topk(10)
+   ```
+**Choosing Between Instant and Range Queries:**
+**Instant Metrics** (`tempo_query_metrics_instant`) - Returns a single aggregated value for the entire time range. Use this when:
+- You need a total count or sum across the whole period
+- You want a single metric value (e.g., total error count, average latency)
+- You don't need to see how the metric changes over time
+- You're computing a KPI or summary statistic
+**Time Series Metrics** (`tempo_query_metrics_range`) - Returns values at regular intervals controlled by the 'step' parameter. Use this when:
+- You need to graph metrics over time or analyze trends
+- You want to see patterns, spikes, or changes in metrics
+- You're troubleshooting time-based issues
+- You need to correlate metrics with specific time periods
+## Special workflow for performance issues
+When investigating performance issues in kubernetes via traces, call tempo_fetch_traces_comparative_sample. This tool provides comprehensive analysis for identifying patterns.
+## Important Notes
+- TraceQL is the modern query language - prefer it over tag-based search
+- TraceQL metrics are computed from trace data, not traditional Prometheus metrics
+- TraceQL metrics is an experimental feature that computes RED (Rate, Error, Duration) metrics from trace data
+- Common attributes to use in queries: resource.service.name, span.http.route, span.http.status_code, span.http.target, status, name, duration
+- All timestamps can be Unix epoch seconds or RFC3339 format
+- Use time filters (start/end) to improve query performance
+- To get information about Kubernetes resources try these first: resource.service.name, resource.k8s.pod.name, resource.k8s.namespace.name, resource.k8s.deployment.name, resource.k8s.node.name, resource.k8s.container.name
+- TraceQL and TraceQL metrics language are complex. If you get empty data, try to simplify your query and try again!
+- IMPORTANT: TraceQL is not the same as 'TraceQL metrics' - Make sure you use the correct syntax and functions

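The endpoint mapping added in the hunk above can be illustrated with a minimal sketch that only builds request URLs and sends nothing over the network. The helper names (`search_url`, `trace_url`, `metrics_range_url`), the base URL, and the `60s` step value are hypothetical and are not taken from the holmesgpt code. The paths and the `q`, `start`, `end`, and `step` parameters follow the wording of the prompt text, and should be checked against the Tempo API documentation for the deployed version.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Hypothetical base URL (an assumption for illustration only).
TEMPO_URL = "http://tempo.example.internal:3200"


def search_url(base_url, traceql, start=None, end=None):
    """Build a URL for GET /api/search with a TraceQL query in 'q'."""
    params = {"q": traceql}
    # Optional time filters; the prompt notes they improve query performance.
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


def trace_url(base_url, trace_id):
    """Build a URL for GET /api/v2/traces/{trace_id}."""
    return f"{base_url.rstrip('/')}/api/v2/traces/{quote(trace_id, safe='')}"


def metrics_range_url(base_url, metrics_query, start, end, step):
    """Build a URL for GET /api/metrics/query_range (time series data).

    'step' controls the interval between returned values.
    """
    params = {"q": metrics_query, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/metrics/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Plain TraceQL: selects traces, goes to the search endpoint.
    query = '{ resource.service.name = "frontend" && status = error }'
    url = search_url(TEMPO_URL, query, start=1700000000, end=1700003600)
    print(url)
    # Decoding the query string recovers the original TraceQL text.
    print(parse_qs(urlsplit(url).query)["q"][0])

    # TraceQL metrics: a different syntax, sent to the metrics endpoint.
    metrics = '{ resource.service.name = "foo" && status = error } | rate() by (span.http.route)'
    print(metrics_range_url(TEMPO_URL, metrics, 1700000000, 1700003600, "60s"))
```

The two query strings are kept separate on purpose: as the prompt text stresses, TraceQL and TraceQL metrics are not the same language, so a `rate()` query belongs on the metrics endpoints and not on `/api/search`.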