holmesgpt 0.12.3a1__py3-none-any.whl → 0.12.4__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of holmesgpt might be problematic.
Files changed (52)
  1. holmes/__init__.py +1 -1
  2. holmes/config.py +75 -33
  3. holmes/core/config.py +5 -0
  4. holmes/core/conversations.py +17 -2
  5. holmes/core/investigation.py +1 -0
  6. holmes/core/llm.py +1 -2
  7. holmes/core/prompt.py +29 -4
  8. holmes/core/supabase_dal.py +49 -13
  9. holmes/core/tool_calling_llm.py +26 -1
  10. holmes/core/tools.py +2 -1
  11. holmes/core/tools_utils/tool_executor.py +1 -0
  12. holmes/core/toolset_manager.py +10 -3
  13. holmes/core/tracing.py +77 -10
  14. holmes/interactive.py +110 -20
  15. holmes/main.py +13 -18
  16. holmes/plugins/destinations/slack/plugin.py +19 -9
  17. holmes/plugins/prompts/_fetch_logs.jinja2 +11 -1
  18. holmes/plugins/prompts/_general_instructions.jinja2 +6 -37
  19. holmes/plugins/prompts/_permission_errors.jinja2 +6 -0
  20. holmes/plugins/prompts/_runbook_instructions.jinja2 +13 -5
  21. holmes/plugins/prompts/_toolsets_instructions.jinja2 +22 -14
  22. holmes/plugins/prompts/generic_ask.jinja2 +6 -0
  23. holmes/plugins/prompts/generic_ask_conversation.jinja2 +1 -0
  24. holmes/plugins/prompts/generic_ask_for_issue_conversation.jinja2 +1 -0
  25. holmes/plugins/prompts/generic_investigation.jinja2 +1 -0
  26. holmes/plugins/prompts/kubernetes_workload_ask.jinja2 +0 -2
  27. holmes/plugins/runbooks/__init__.py +20 -4
  28. holmes/plugins/toolsets/__init__.py +7 -9
  29. holmes/plugins/toolsets/aks-node-health.yaml +0 -8
  30. holmes/plugins/toolsets/argocd.yaml +4 -1
  31. holmes/plugins/toolsets/azure_sql/apis/azure_sql_api.py +1 -1
  32. holmes/plugins/toolsets/azure_sql/apis/connection_failure_api.py +2 -0
  33. holmes/plugins/toolsets/confluence.yaml +1 -1
  34. holmes/plugins/toolsets/datadog/datadog_metrics_instructions.jinja2 +54 -4
  35. holmes/plugins/toolsets/datadog/toolset_datadog_metrics.py +150 -6
  36. holmes/plugins/toolsets/kubernetes.yaml +6 -0
  37. holmes/plugins/toolsets/prometheus/prometheus.py +2 -6
  38. holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 +2 -2
  39. holmes/plugins/toolsets/runbook/runbook_fetcher.py +65 -6
  40. holmes/plugins/toolsets/service_discovery.py +1 -1
  41. holmes/plugins/toolsets/slab.yaml +1 -1
  42. holmes/utils/colors.py +7 -0
  43. holmes/utils/console/consts.py +5 -0
  44. holmes/utils/console/result.py +2 -1
  45. holmes/utils/keygen_utils.py +6 -0
  46. holmes/version.py +2 -2
  47. holmesgpt-0.12.4.dist-info/METADATA +258 -0
  48. {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/RECORD +51 -47
  49. holmesgpt-0.12.3a1.dist-info/METADATA +0 -400
  50. {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/LICENSE.txt +0 -0
  51. {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/WHEEL +0 -0
  52. {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/entry_points.txt +0 -0
@@ -1,5 +1,8 @@
  # In general

+ {% if cluster_name -%}
+ * You are running on cluster {{ cluster_name }}.
+ {%- endif %}
  * when it can provide extra information, first run as many tools as you need to gather more information, then respond.
  * if possible, do so repeatedly with different tool calls each time to gather more information.
  * do not stop investigating until you are at the final root cause you are able to find.
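The new `cluster_name` block above is a plain Jinja2 conditional, so the cluster line only appears when the variable is supplied at render time. A minimal stand-alone sketch of that behavior (the variable value is illustrative, not from the package):

```python
# Minimal sketch: how the cluster_name conditional renders, assuming jinja2 is installed.
from jinja2 import Template

fragment = Template(
    "{% if cluster_name -%}\n"
    "* You are running on cluster {{ cluster_name }}.\n"
    "{%- endif %}"
)

print(fragment.render(cluster_name="prod-eu-1"))  # illustrative name; the line is emitted
print(repr(fragment.render(cluster_name=None)))   # falsy value; renders to an empty string
```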
@@ -9,7 +12,8 @@
  * in this case, try to find substrings or search for the correct spellings
  * always provide detailed information like exact resource names, versions, labels, etc
  * even if you found the root cause, keep investigating to find other possible root causes and to gather data for the answer like exact names
- * if a runbook url is present as well as tool that can fetch it, you MUST fetch the runbook before beginning your investigation.
+ * if a runbook url is present you MUST fetch the runbook before beginning your investigation
+ * when the user mentions any operational issue (high CPU, memory issues, database down, application errors, etc.), ALWAYS check if there's a matching runbook in the catalog first
  * if you don't know, say that the analysis was inconclusive.
  * if there are multiple possible causes list them in a numbered list.
  * there will often be errors in the data that are not relevant or that do not have an impact - ignore them in your conclusion if you were not able to tie them to an actual error.
@@ -32,42 +36,7 @@

  {% include '_toolsets_instructions.jinja2' %}

- {% include '_fetch_logs.jinja2' %}
-
- # Handling Permission Errors
-
- If during the investigation you encounter a permissions error (e.g., `Error from server (Forbidden):`), **ALWAYS** follow these steps to ensure a thorough resolution:
- 1.**Analyze the Error Message**
- - Identify the missing resource, API group, and verbs from the error details.
- - Never stop at reporting the error
- - Proceed with an in-depth investigation.
- 2.**Locate the Relevant Helm Release**
- Check if Helm tools are available, if they are available always use Helm commands to help user find the release associated with the Holmes pod:
- - Run `helm list -A | grep holmes` to identify the release name.
- - Run `helm get values <RELEASE_NAME> -n <NAMESPACE>` to retrieve details such as `customClusterRoleRules` and `clusterName`.
- If Helm tools are unavailable, skip this step.
- 3. **Check for Missing Permissions**
- - Check for a cluster role with <RELEASE_NAME>-holmes-cluster-role in its name and a service account with <RELEASE_NAME>-holmes-service-account in its name to troubleshoot missing permissions where release name is the name you found earlier if helm tools are available (If the exact cluster role or service account isn't found, search for similar or related names, including variations or prefixes/suffixes that might be used in the cluster.)
- - Focus on identifying absent permissions that align with the error message.
- 4. **Update the Configuration**
- If necessary permissions are absent both in customClusterRoleRules and the cluster role mentioned previously, ALWAYS advise the user to update their configuration by modifying the `generated_values.yaml` file as follows:
- ```
- holmes:
- customClusterRoleRules:
- - apiGroups: ["<API_GROUP>"]
- resources: ["<RESOURCE_1>", "<RESOURCE_2>"]
- verbs: ["<VERB_1>", "<VERB_2>", "<VERB_3>"]
- ```
- After that instruct them to apply the changes with::
- ```
- helm upgrade <RELEASE_NAME> robusta/robusta --values=generated_values.yaml --set clusterName=<YOUR_CLUSTER_NAME>
- ```
- 5. **Fallback Guidelines**
- - If you cannot determine the release or cluster name, use placeholders `<RELEASE_NAME>` and `<YOUR_CLUSTER_NAME>`.
- - While you should attempt to retrieve details using Helm commands, do **not** direct the user to execute these commands themselves.
- Reminder:
- * Always adhere to this process, even if Helm tools are unavailable.
- * Strive for thoroughness and precision, ensuring the issue is fully addressed.
+ {% include '_permission_errors.jinja2' %}

  # Special cases and how to reply

@@ -0,0 +1,6 @@
+ # Handling Permission Errors
+
+ If during the investigation you encounter a permissions error (e.g., `Error from server (Forbidden):`), **ALWAYS** follow these steps to ensure a thorough resolution:
+ 1. Analyze the Error Message: Identify the missing resource, API group, and verbs from the error details.
+ 2. Check which user/service account you're running with and what permissions it has
+ 3. Report this to the user and refer them to https://robusta-dev.github.io/holmesgpt/data-sources/permissions/
@@ -1,13 +1,21 @@
  {% if runbooks and runbooks.catalog|length > 0 %}
  # Runbook Selection

- ## Available Runbooks
+ You (HolmesGPT) have access to a set of runbooks that provide step-by-step troubleshooting instructions for various known issues.
+ If one of the following runbooks relates to the user's issue, you MUST fetch it with the fetch_runbook tool.
+
+ ## Available Runbooks for fetch_runbook tool
  {% for runbook in runbooks.catalog %}
  ### description: {{ runbook.description }}
  link: {{ runbook.link }}
  {% endfor %}
- ALWAYS try to find the runbooks that can provide troubleshooting instructions when the user describes an operational issue, debugging scenario, or asks for step‑by‑step troubleshooting.
- To get the runbook details, use `fetch_runbook` tool by comparing the runbook description with the user prompt.
- ALWAYS follow the steps described in the runbook.
- If you decided not to follow one or more steps, ALWAYS explain why.
+
+ If there is a runbook that MIGHT match the user's issue, you MUST:
+ 1. Fetch the runbook with the `fetch_runbook` tool.
+ 2. Decide based on the runbook's contents if it is relevant or not.
+ 3. If it seems relevant, inform the user that you accesses a runbook and will use it to troubleshoot the issue.
+ 4. To the maximum extent possible, follow the runbook instructions step-by-step.
+ 5. Provide a detailed report of the steps you performed, including any findings or errors encountered.
+ 6. If a runbook step requires tools or integrations you don't have access to tell the user that you cannot perform that step due to missing tools.
+
  {%- endif -%}
@@ -1,3 +1,5 @@
+ # Toolset Setup and Configuration Instructions
+
  {%- set enabled_toolsets_with_instructions = [] -%}
  {%- set disabled_toolsets = [] -%}

@@ -9,8 +11,10 @@
  {%- endif -%}
  {%- endfor -%}

- {% if enabled_toolsets_with_instructions|list -%}
  # Available Toolsets
+ {% include '_fetch_logs.jinja2' %}
+
+ {% if enabled_toolsets_with_instructions|list %}
  {%- for toolset in enabled_toolsets_with_instructions -%}
  {% if toolset.llm_instructions %}

@@ -19,13 +23,13 @@
  {%- endif -%}
  {%- endfor -%}
  {%- endif -%}
- {% if disabled_toolsets %}
- # Disabled & failed Toolsets

+ # Disabled & failed Toolsets
+ {% if disabled_toolsets %}
  The following toolsets are either disabled or failed to initialize:
  {% for toolset in disabled_toolsets %}
  * toolset "{{ toolset.name }}": {{ toolset.description }}
- {%- if toolset.status == "failed" %}
+ {%- if toolset.status.value == "failed" %}
  * status: The toolset is enabled but misconfigured and failed to initialize.
  {%- if toolset.error %}
  * error: {{ toolset.error }}
@@ -37,20 +41,24 @@ The following toolsets are either disabled or failed to initialize:
  * setup instructions: {{ toolset.docs_url }}
  {%- endif -%}
  {%- endfor %}
+ {% else %}
+ <no toolsets are disabled or failed>
+ {% endif %}

  If you need a toolset to access a system that you don't otherwise have access to:
  - Check the list of toolsets above and see if any loosely match the needs
  - If the toolset has `status: failed`: Tell the user and copy the error in your response for the user to see
- - If the toolset has `status: disabled`: Ask the user to configure the it.
+ - If the toolset has `status: disabled`: Ask the user to configure it.
  - Share the setup instructions URL with the user
- - Invoke the tool fetch_webpage on the toolset URL and summarize setup steps
- - If there are no relevant toolsets in the list below, tell the user that you are missing an integration to access XYZ:
- you should give an answer similar to "I don't have access to <system>. Please add a Holmes integration for <system> so
- that I can investigate this."
- {% else %}
+ - If there are no relevant toolsets in the list above, tell the user that you are missing an integration to access XYZ:
+ You should give an answer similar to "I don't have access to <system>. To add a HolmesGPT integration for <system> you can [connect an MCP server](https://robusta-dev.github.io/holmesgpt/data-sources/remote-mcp-servers/) or add a [custom toolset](https://robusta-dev.github.io/holmesgpt/data-sources/custom-toolsets/)."

- # Disabled & failed Toolsets
+ Likewise, if users ask about setting up or configuring integrations (e.g., "How can I give you access to ArgoCD applications?"):
+ ALWAYS check if there's a disabled or failed toolset that matches what the user is asking about. If you find one:
+ 1. If the toolset has a specific documentation URL (toolset.docs_url), ALWAYS direct them to that URL first
+ 2. If no specific documentation exists, then direct them to the general Holmes documentation:
+ - For all toolset configurations: https://robusta-dev.github.io/holmesgpt/data-sources/
+ - For custom toolsets: https://robusta-dev.github.io/holmesgpt/data-sources/custom-toolsets/
+ - For remote MCP servers: https://robusta-dev.github.io/holmesgpt/data-sources/remote-mcp-servers/

- If you need a toolset to access a system that you don't otherwise have access to, tell the user that you are missing an integration to access XYZ.
- You should give an answer similar to "I don't have access to <system>. Please add a Holmes integration for <system> so that I can investigate this."
- {%- endif -%}
+ When providing configuration guidance, always prefer the specific toolset documentation URL when available.
@@ -1,8 +1,10 @@
  You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
  Whenever possible you MUST first use tools to investigate then answer the question.
+ Ask for multiple tool calls at the same time as it saves time for the user.
  Do not say 'based on the tool output' or explicitly refer to tools at all.
  If you output an answer and then realize you need to call more tools or there are possible next steps, you may do so by calling tools at that point in time.
  If you have a good and concrete suggestion for how the user can fix something, tell them even if not asked explicitly
+ {% include '_current_date_time.jinja2' %}

  Use conversation history to maintain continuity when appropriate, ensuring efficiency in your responses.

@@ -34,3 +36,7 @@ Relevant logs:
  ```

  Validation error led to unhandled Java exception causing a crash.
+
+ {% if system_prompt_additions %}
+ {{ system_prompt_additions }}
+ {% endif %}
@@ -1,5 +1,6 @@
  You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
  Whenever possible you MUST first use tools to investigate then answer the question.
+ Ask for multiple tool calls at the same time as it saves time for the user.
  Do not say 'based on the tool output' or explicitly refer to tools at all.
  If you output an answer and then realize you need to call more tools or there are possible next steps, you may do so by calling tools at that point in time.
  If you have a good and concrete suggestion for how the user can fix something, tell them even if not asked explicitly
@@ -1,5 +1,6 @@
  You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
  Whenever possible you MUST first use tools to investigate then answer the question.
+ Ask for multiple tool calls at the same time as it saves time for the user.
  Do not say 'based on the tool output' or explicitly refer to tools at all.
  If you output an answer and then realize you need to call more tools or there are possible next steps, you may do so by calling tools at that point in time.
  {% include '_current_date_time.jinja2' %}
@@ -1,5 +1,6 @@
  You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
  Whenever possible you MUST first use tools to investigate then answer the question.
+ Ask for multiple tool calls at the same time as it saves time for the user.
  Do not say 'based on the tool output'

  Provide an terse analysis of the following {{ issue.source_type }} alert/issue and why it is firing.
@@ -43,8 +43,6 @@ In general:

  {% include '_toolsets_instructions.jinja2' %}

- {% include '_fetch_logs.jinja2' %}
-
  Style guide:
  * Be painfully concise.
  * Leave out "the" and filler words when possible.
@@ -11,6 +11,7 @@ from pydantic import BaseModel, PrivateAttr
  from holmes.utils.pydantic_utils import RobustaBaseConfig, load_model_from_file

  THIS_DIR = os.path.abspath(os.path.dirname(__file__))
+ DEFAULT_RUNBOOK_SEARCH_PATH = THIS_DIR

  CATALOG_FILE = "catalog.json"

@@ -94,7 +95,22 @@ def load_runbook_catalog() -> Optional[RunbookCatalog]:
  return None


- def get_runbook_by_path(runbook_relative_path: str) -> str:
- runbook_folder = os.path.dirname(os.path.realpath(__file__))
- runbook_path = os.path.join(runbook_folder, runbook_relative_path)
- return runbook_path
+ def get_runbook_by_path(
+ runbook_relative_path: str, search_paths: List[str]
+ ) -> Optional[str]:
+ """
+ Find a runbook by searching through provided paths.
+
+ Args:
+ runbook_relative_path: The relative path to the runbook
+ search_paths: Optional list of directories to search. If None, uses default runbook folder.
+
+ Returns:
+ Full path to the runbook if found, None otherwise
+ """
+ for search_path in search_paths:
+ runbook_path = os.path.join(search_path, runbook_relative_path)
+ if os.path.exists(runbook_path):
+ return runbook_path
+
+ return None
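The reworked `get_runbook_by_path` now takes an explicit list of search directories and returns `None` on a miss, so callers have to handle that case. A hypothetical call site under those assumptions (only `DEFAULT_RUNBOOK_SEARCH_PATH` and `get_runbook_by_path` come from the diff; the runbook filename and extra directory are illustrative):

```python
# Hypothetical caller of the new signature; the file name and extra directory are made up.
from holmes.plugins.runbooks import DEFAULT_RUNBOOK_SEARCH_PATH, get_runbook_by_path

extra_dirs = ["/etc/holmes/runbooks"]  # e.g. a user-provided runbook directory
path = get_runbook_by_path("upgrade/rollback.md", [DEFAULT_RUNBOOK_SEARCH_PATH, *extra_dirs])

if path is None:
    print("Runbook not found in any configured search path")
else:
    with open(path) as f:
        print(f.read())
```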
@@ -3,14 +3,16 @@ import os
  import os.path
  from typing import Any, List, Optional, Union

- from holmes.common.env_vars import USE_LEGACY_KUBERNETES_LOGS
  import yaml # type: ignore
  from pydantic import ValidationError

- from holmes.plugins.toolsets.azure_sql.azure_sql_toolset import AzureSQLToolset
  import holmes.utils.env as env_utils
+ from holmes.common.env_vars import USE_LEGACY_KUBERNETES_LOGS
  from holmes.core.supabase_dal import SupabaseDal
  from holmes.core.tools import Toolset, ToolsetType, ToolsetYamlFromConfig, YAMLToolset
+ from holmes.plugins.toolsets.atlas_mongodb.mongodb_atlas import MongoDBAtlasToolset
+ from holmes.plugins.toolsets.azure_sql.azure_sql_toolset import AzureSQLToolset
+ from holmes.plugins.toolsets.bash.bash_toolset import BashExecutorToolset
  from holmes.plugins.toolsets.coralogix.toolset_coralogix_logs import (
  CoralogixLogsToolset,
  )
@@ -18,18 +20,15 @@ from holmes.plugins.toolsets.datadog.toolset_datadog_logs import DatadogLogsTool
  from holmes.plugins.toolsets.datadog.toolset_datadog_metrics import (
  DatadogMetricsToolset,
  )
- from holmes.plugins.toolsets.datadog.toolset_datadog_traces import (
- DatadogTracesToolset,
- )
- from holmes.plugins.toolsets.kubernetes_logs import KubernetesLogsToolset
+ from holmes.plugins.toolsets.datadog.toolset_datadog_traces import DatadogTracesToolset
  from holmes.plugins.toolsets.git import GitToolset
  from holmes.plugins.toolsets.grafana.toolset_grafana import GrafanaToolset
- from holmes.plugins.toolsets.bash.bash_toolset import BashExecutorToolset
  from holmes.plugins.toolsets.grafana.toolset_grafana_loki import GrafanaLokiToolset
  from holmes.plugins.toolsets.grafana.toolset_grafana_tempo import GrafanaTempoToolset
  from holmes.plugins.toolsets.internet.internet import InternetToolset
  from holmes.plugins.toolsets.internet.notion import NotionToolset
  from holmes.plugins.toolsets.kafka import KafkaToolset
+ from holmes.plugins.toolsets.kubernetes_logs import KubernetesLogsToolset
  from holmes.plugins.toolsets.mcp.toolset_mcp import RemoteMCPToolset
  from holmes.plugins.toolsets.newrelic import NewRelicToolset
  from holmes.plugins.toolsets.opensearch.opensearch import OpenSearchToolset
@@ -38,7 +37,6 @@ from holmes.plugins.toolsets.opensearch.opensearch_traces import OpenSearchTrace
  from holmes.plugins.toolsets.prometheus.prometheus import PrometheusToolset
  from holmes.plugins.toolsets.rabbitmq.toolset_rabbitmq import RabbitMQToolset
  from holmes.plugins.toolsets.robusta.robusta import RobustaToolset
- from holmes.plugins.toolsets.atlas_mongodb.mongodb_atlas import MongoDBAtlasToolset
  from holmes.plugins.toolsets.runbook.runbook_fetcher import RunbookToolset
  from holmes.plugins.toolsets.servicenow.servicenow import ServiceNowToolset

@@ -156,7 +154,7 @@ def load_toolsets_from_config(
  toolset_type = config.get("type", ToolsetType.BUILTIN.value)
  # MCP server is not a built-in toolset, so we need to set the type explicitly
  validated_toolset: Optional[Toolset] = None
- if toolset_type is ToolsetType.MCP:
+ if toolset_type == ToolsetType.MCP.value:
  validated_toolset = RemoteMCPToolset(**config, name=name)
  elif strict_check:
  validated_toolset = YAMLToolset(**config, name=name) # type: ignore
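The small change above matters because `config.get("type", ...)` yields the raw string from YAML, and an identity check (`is`) against an enum member never matches a string. A stand-alone sketch of the distinction, using a stand-in enum rather than the real `ToolsetType` (member values are illustrative):

```python
from enum import Enum


class ToolsetKind(str, Enum):  # stand-in for holmes.core.tools.ToolsetType; values are made up
    BUILTIN = "built_in"
    MCP = "mcp_server"


config = {"type": "mcp_server"}  # the value arrives from YAML as a plain string
toolset_type = config.get("type", ToolsetKind.BUILTIN.value)

print(toolset_type is ToolsetKind.MCP)        # False: a str is never identical to the enum member
print(toolset_type == ToolsetKind.MCP.value)  # True: compares the underlying string value
```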
@@ -55,11 +55,3 @@ toolsets:
  user_description: "lists all VMSS names in {{ NODE_RESOURCE_GROUP }}"
  command: |
  az vmss list -g {{ NODE_RESOURCE_GROUP }} --query '[*].name' -o tsv --only-show-errors
- - name: "vmss_run_command"
- description: |
- Execute a shell command on a specific VMSS VM instance using az vmss run-command.
- VM_ID is the instance ID of the VMSS, which can be derived from node names.
- Prerequisites: get_node_resource_group, list_vmss_names
- user_description: "run command {{ SHELL_COMMAND }} on VM #{{ VM_ID }} of VMSS {{ VMSS_NAME }}"
- command: |
- az vmss run-command invoke --resource-group {{ NODE_RESOURCE_GROUP }} --name {{ VMSS_NAME }} --instance-id {{ VM_ID }} --command-id RunShellScript --scripts {{ SHELL_COMMAND }}
@@ -6,13 +6,16 @@ toolsets:
  llm_instructions: |
  You have access to a set of ArgoCD tools for debugging Kubernetes application deployments.
  If an application's name does not exist in kubernetes, it may exist in argocd: call the tool `argocd_app_list` to find it.
+ IMPORTANT: If you are asked about health issues, ALWAYS check if the argo cd apps are in a healthy state.
+ If some resource is out of sync, ALWAYS show the diff, using the argocd_app_diff tool, between the desired state and the current state.
  These tools help you investigate issues with GitOps-managed applications in your Kubernetes clusters.
- ALWAYS follow these steps:
+ In addition to the general investigation steps, ALWAYS follow these steps as well:
  1. List the applications
  2. Retrieve the application status and its config
  3. Retrieve the application's manifests for issues
  4. Compare the ArgoCD config with the kubernetes status using kubectl tools
  5. Check for resources mismatch between argocd and kubernetes
+ 6. If an application is OutOfSync, pull the diff using the argocd_app_diff tool
  {% if tool_names|list|length > 0 %}
  The following commands are available to introspect into ArgoCD: {{ ", ".join(tool_names) }}
  {% endif %}
@@ -179,7 +179,7 @@ class AzureSQLAPIClient:
  server_name=server_name,
  database_name=database_name,
  )
- return tuning.as_dict()
+ return dict(tuning.as_dict())

  def get_top_cpu_queries(
  self,
@@ -134,6 +134,8 @@ class ConnectionFailureAPI:
  for metric in metrics.value:
  if metric.timeseries:
  for timeseries in metric.timeseries:
+ if timeseries.data is None:
+ continue
  for data_point in timeseries.data:
  if data_point.time_stamp:
  metric_values.append(
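The new `timeseries.data is None` guard keeps the inner loop from iterating a missing series. A self-contained sketch of the failure mode it avoids, with `SimpleNamespace` standing in for the Azure SDK metric objects:

```python
from types import SimpleNamespace

# Stand-ins for Azure Monitor SDK objects: one series carries data points, one carries None.
series_with_data = SimpleNamespace(data=[SimpleNamespace(time_stamp="2024-01-01T00:00:00Z")])
series_without_data = SimpleNamespace(data=None)

metric_values = []
for timeseries in (series_with_data, series_without_data):
    if timeseries.data is None:  # without this guard: TypeError: 'NoneType' object is not iterable
        continue
    for data_point in timeseries.data:
        if data_point.time_stamp:
            metric_values.append(data_point.time_stamp)

print(metric_values)  # ['2024-01-01T00:00:00Z']
```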
@@ -14,6 +14,6 @@ toolsets:

  tools:
  - name: "fetch_confluence_url"
- description: "Fetch a page in confluence. Use this to fetch confluence runbooks if they are present before starting your investigation."
+ description: "Fetch a page in confluence."
  user_description: "fetch confluence page {{ confluence_page_id }}"
  command: "curl -u ${CONFLUENCE_USER}:${CONFLUENCE_API_KEY} -X GET -H 'Content-Type: application/json' ${CONFLUENCE_BASE_URL}/wiki/rest/api/content/{{ confluence_page_id }}?expand=body.storage"
@@ -15,12 +15,62 @@ When investigating metrics-related issues:
  - Provides metric type (gauge/count/rate), unit, and description
  - Accepts comma-separated list for batch queries

+ 4. **Use `list_datadog_metric_tags`** to understand which tags are available for a given metric
+ - Provides a set of tags and aggregations
+ - Can help to build the correct `tag_filter`, to find which metrics are available for a given resource
+
+ ### General guideline
+ - This toolset is used to generate visualizations and graphs.
+ - Assume the resource should have metrics. If metrics not found, try to adjust tag filters
+ - IMPORTANT: This toolset DOES NOT support promql queries.
+
+ ### CRITICAL: Pod Name Resolution Workflow
+ When users ask for metrics about a deployment, service, or workload (e.g., "my-workload", "nginx-deployment"):
+
+ **ALWAYS follow this two-step process:**
+ 1. **First**: Use `kubectl_find_resource` to find the actual pod names
+ - Example: `kubectl_find_resource` with "my-workload" → finds pods like "my-workload-8f8cdfxyz-c7zdr"
+ 2. **Then**: Use those specific pod names in Datadog queries
+ - Correct: `container.cpu.usage{pod_name:my-workload-8f8cdfxyz-c7zdr}`
+ - WRONG: `container.cpu.usage{pod_name:my-workload}` ← This will return no data!
+
+ **Why this matters:**
+ - Pod names in Datadog are the actual Kubernetes pod names (with random suffixes)
+ - Deployment/service names are NOT pod names
+ - Using deployment names as pod_name filters will always return empty results
+
  ### Time Parameters
  - Use RFC3339 format: `2023-03-01T10:30:00Z`
  - Or relative seconds: `-3600` for 1 hour ago
  - Defaults to 1 hour window if not specified

- ### Common Patterns
- - CPU investigation: First list metrics with `tag_filter:kube_node_name:nodename`, then query specific metrics
- - Memory issues: Look for `system.mem.*` or `kubernetes.memory.*` metrics
- - Container metrics: Filter by pod/container tags
+ ### Common Investigation Patterns
+
+ **For Pod/Container Metrics (MOST COMMON):**
+ 1. User asks: "Show CPU for my-workload"
+ 2. Use `kubectl_find_resource` → find pod "my-workload-abc123-xyz"
+ 3. Query Datadog: `container.cpu.usage{pod_name:my-workload-abc123-xyz}`
+
+ **For Node-level Metrics:**
+ 1. Use `tag_filter:kube_node_name:nodename` to filter by node
+ 2. Query system-level metrics like `system.cpu.user{kube_node_name:worker-1}`
+
+ **For Service-level Metrics:**
+ 1. First resolve service to pods using `kubectl_find_resource`
+ 2. Query metrics for all pods belonging to that service
+ 3. Use namespace filtering: `tag_filter:kube_namespace:default`
+
+
+ # Handling queries results
+ * ALWAYS embed the execution results into your answer
+ * You only need to embed the partial result in your response. Include the "tool_name" and "random_key". For example: << {"type": "datadogql", "tool_name": "query_datadog_metrics", "random_key": "92jf2hf"} >>
+ * Post processing will parse your response, re-run the query from the tool output and create a chart visible to the user
+ * You MUST ensure that the query is successful.
+ * ALWAYS embed a DataDog graph in the response. The graph should visualize data related to the incident.
+ * Embed at most 2 graphs
+ * When embedding multiple graphs, always add line spacing between them
+ For example:
+
+ <<{"type": "datadogql", "tool_name": "query_datadog_metrics", "random_key": "lBaA"}>>
+
+ <<{"type": "datadogql", "tool_name": "query_datadog_metrics", "random_key": "IKtq"}>>