holmesgpt 0.12.3a1__py3-none-any.whl → 0.12.4__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
- holmes/__init__.py +1 -1
- holmes/config.py +75 -33
- holmes/core/config.py +5 -0
- holmes/core/conversations.py +17 -2
- holmes/core/investigation.py +1 -0
- holmes/core/llm.py +1 -2
- holmes/core/prompt.py +29 -4
- holmes/core/supabase_dal.py +49 -13
- holmes/core/tool_calling_llm.py +26 -1
- holmes/core/tools.py +2 -1
- holmes/core/tools_utils/tool_executor.py +1 -0
- holmes/core/toolset_manager.py +10 -3
- holmes/core/tracing.py +77 -10
- holmes/interactive.py +110 -20
- holmes/main.py +13 -18
- holmes/plugins/destinations/slack/plugin.py +19 -9
- holmes/plugins/prompts/_fetch_logs.jinja2 +11 -1
- holmes/plugins/prompts/_general_instructions.jinja2 +6 -37
- holmes/plugins/prompts/_permission_errors.jinja2 +6 -0
- holmes/plugins/prompts/_runbook_instructions.jinja2 +13 -5
- holmes/plugins/prompts/_toolsets_instructions.jinja2 +22 -14
- holmes/plugins/prompts/generic_ask.jinja2 +6 -0
- holmes/plugins/prompts/generic_ask_conversation.jinja2 +1 -0
- holmes/plugins/prompts/generic_ask_for_issue_conversation.jinja2 +1 -0
- holmes/plugins/prompts/generic_investigation.jinja2 +1 -0
- holmes/plugins/prompts/kubernetes_workload_ask.jinja2 +0 -2
- holmes/plugins/runbooks/__init__.py +20 -4
- holmes/plugins/toolsets/__init__.py +7 -9
- holmes/plugins/toolsets/aks-node-health.yaml +0 -8
- holmes/plugins/toolsets/argocd.yaml +4 -1
- holmes/plugins/toolsets/azure_sql/apis/azure_sql_api.py +1 -1
- holmes/plugins/toolsets/azure_sql/apis/connection_failure_api.py +2 -0
- holmes/plugins/toolsets/confluence.yaml +1 -1
- holmes/plugins/toolsets/datadog/datadog_metrics_instructions.jinja2 +54 -4
- holmes/plugins/toolsets/datadog/toolset_datadog_metrics.py +150 -6
- holmes/plugins/toolsets/kubernetes.yaml +6 -0
- holmes/plugins/toolsets/prometheus/prometheus.py +2 -6
- holmes/plugins/toolsets/prometheus/prometheus_instructions.jinja2 +2 -2
- holmes/plugins/toolsets/runbook/runbook_fetcher.py +65 -6
- holmes/plugins/toolsets/service_discovery.py +1 -1
- holmes/plugins/toolsets/slab.yaml +1 -1
- holmes/utils/colors.py +7 -0
- holmes/utils/console/consts.py +5 -0
- holmes/utils/console/result.py +2 -1
- holmes/utils/keygen_utils.py +6 -0
- holmes/version.py +2 -2
- holmesgpt-0.12.4.dist-info/METADATA +258 -0
- {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/RECORD +51 -47
- holmesgpt-0.12.3a1.dist-info/METADATA +0 -400
- {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/LICENSE.txt +0 -0
- {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/WHEEL +0 -0
- {holmesgpt-0.12.3a1.dist-info → holmesgpt-0.12.4.dist-info}/entry_points.txt +0 -0
holmes/plugins/prompts/_general_instructions.jinja2

@@ -1,5 +1,8 @@
 # In general
 
+{% if cluster_name -%}
+* You are running on cluster {{ cluster_name }}.
+{%- endif %}
 * when it can provide extra information, first run as many tools as you need to gather more information, then respond.
 * if possible, do so repeatedly with different tool calls each time to gather more information.
 * do not stop investigating until you are at the final root cause you are able to find.
@@ -9,7 +12,8 @@
 * in this case, try to find substrings or search for the correct spellings
 * always provide detailed information like exact resource names, versions, labels, etc
 * even if you found the root cause, keep investigating to find other possible root causes and to gather data for the answer like exact names
-* if a runbook url is present
+* if a runbook url is present you MUST fetch the runbook before beginning your investigation
+* when the user mentions any operational issue (high CPU, memory issues, database down, application errors, etc.), ALWAYS check if there's a matching runbook in the catalog first
 * if you don't know, say that the analysis was inconclusive.
 * if there are multiple possible causes list them in a numbered list.
 * there will often be errors in the data that are not relevant or that do not have an impact - ignore them in your conclusion if you were not able to tie them to an actual error.
@@ -32,42 +36,7 @@
 
 {% include '_toolsets_instructions.jinja2' %}
 
-{% include '
-
-# Handling Permission Errors
-
-If during the investigation you encounter a permissions error (e.g., `Error from server (Forbidden):`), **ALWAYS** follow these steps to ensure a thorough resolution:
-1.**Analyze the Error Message**
-   - Identify the missing resource, API group, and verbs from the error details.
-   - Never stop at reporting the error
-   - Proceed with an in-depth investigation.
-2.**Locate the Relevant Helm Release**
-   Check if Helm tools are available, if they are available always use Helm commands to help user find the release associated with the Holmes pod:
-   - Run `helm list -A | grep holmes` to identify the release name.
-   - Run `helm get values <RELEASE_NAME> -n <NAMESPACE>` to retrieve details such as `customClusterRoleRules` and `clusterName`.
-   If Helm tools are unavailable, skip this step.
-3. **Check for Missing Permissions**
-   - Check for a cluster role with <RELEASE_NAME>-holmes-cluster-role in its name and a service account with <RELEASE_NAME>-holmes-service-account in its name to troubleshoot missing permissions where release name is the name you found earlier if helm tools are available (If the exact cluster role or service account isn't found, search for similar or related names, including variations or prefixes/suffixes that might be used in the cluster.)
-   - Focus on identifying absent permissions that align with the error message.
-4. **Update the Configuration**
-   If necessary permissions are absent both in customClusterRoleRules and the cluster role mentioned previously, ALWAYS advise the user to update their configuration by modifying the `generated_values.yaml` file as follows:
-   ```
-   holmes:
-     customClusterRoleRules:
-       - apiGroups: ["<API_GROUP>"]
-         resources: ["<RESOURCE_1>", "<RESOURCE_2>"]
-         verbs: ["<VERB_1>", "<VERB_2>", "<VERB_3>"]
-   ```
-   After that instruct them to apply the changes with::
-   ```
-   helm upgrade <RELEASE_NAME> robusta/robusta --values=generated_values.yaml --set clusterName=<YOUR_CLUSTER_NAME>
-   ```
-5. **Fallback Guidelines**
-   - If you cannot determine the release or cluster name, use placeholders `<RELEASE_NAME>` and `<YOUR_CLUSTER_NAME>`.
-   - While you should attempt to retrieve details using Helm commands, do **not** direct the user to execute these commands themselves.
-Reminder:
-* Always adhere to this process, even if Helm tools are unavailable.
-* Strive for thoroughness and precision, ensuring the issue is fully addressed.
+{% include '_permission_errors.jinja2' %}
 
 # Special cases and how to reply
 
holmes/plugins/prompts/_permission_errors.jinja2 (new file)

@@ -0,0 +1,6 @@
+# Handling Permission Errors
+
+If during the investigation you encounter a permissions error (e.g., `Error from server (Forbidden):`), **ALWAYS** follow these steps to ensure a thorough resolution:
+1. Analyze the Error Message: Identify the missing resource, API group, and verbs from the error details.
+2. Check which user/service account you're running with and what permissions it has
+3. Report this to the user and refer them to https://robusta-dev.github.io/holmesgpt/data-sources/permissions/
holmes/plugins/prompts/_runbook_instructions.jinja2

@@ -1,13 +1,21 @@
 {% if runbooks and runbooks.catalog|length > 0 %}
 # Runbook Selection
 
-
+You (HolmesGPT) have access to a set of runbooks that provide step-by-step troubleshooting instructions for various known issues.
+If one of the following runbooks relates to the user's issue, you MUST fetch it with the fetch_runbook tool.
+
+## Available Runbooks for fetch_runbook tool
 {% for runbook in runbooks.catalog %}
 ### description: {{ runbook.description }}
 link: {{ runbook.link }}
 {% endfor %}
-
-
-
-
+
+If there is a runbook that MIGHT match the user's issue, you MUST:
+1. Fetch the runbook with the `fetch_runbook` tool.
+2. Decide based on the runbook's contents if it is relevant or not.
+3. If it seems relevant, inform the user that you accesses a runbook and will use it to troubleshoot the issue.
+4. To the maximum extent possible, follow the runbook instructions step-by-step.
+5. Provide a detailed report of the steps you performed, including any findings or errors encountered.
+6. If a runbook step requires tools or integrations you don't have access to tell the user that you cannot perform that step due to missing tools.
+
 {%- endif -%}
holmes/plugins/prompts/_toolsets_instructions.jinja2

@@ -1,3 +1,5 @@
+# Toolset Setup and Configuration Instructions
+
 {%- set enabled_toolsets_with_instructions = [] -%}
 {%- set disabled_toolsets = [] -%}
 
@@ -9,8 +11,10 @@
 {%- endif -%}
 {%- endfor -%}
 
-{% if enabled_toolsets_with_instructions|list -%}
 # Available Toolsets
+{% include '_fetch_logs.jinja2' %}
+
+{% if enabled_toolsets_with_instructions|list %}
 {%- for toolset in enabled_toolsets_with_instructions -%}
 {% if toolset.llm_instructions %}
 
@@ -19,13 +23,13 @@
 {%- endif -%}
 {%- endfor -%}
 {%- endif -%}
-{% if disabled_toolsets %}
-# Disabled & failed Toolsets
 
+# Disabled & failed Toolsets
+{% if disabled_toolsets %}
 The following toolsets are either disabled or failed to initialize:
 {% for toolset in disabled_toolsets %}
 * toolset "{{ toolset.name }}": {{ toolset.description }}
-{%- if toolset.status == "failed" %}
+{%- if toolset.status.value == "failed" %}
 * status: The toolset is enabled but misconfigured and failed to initialize.
 {%- if toolset.error %}
 * error: {{ toolset.error }}
@@ -37,20 +41,24 @@ The following toolsets are either disabled or failed to initialize:
 * setup instructions: {{ toolset.docs_url }}
 {%- endif -%}
 {%- endfor %}
+{% else %}
+<no toolsets are disabled or failed>
+{% endif %}
 
 If you need a toolset to access a system that you don't otherwise have access to:
 - Check the list of toolsets above and see if any loosely match the needs
 - If the toolset has `status: failed`: Tell the user and copy the error in your response for the user to see
-- If the toolset has `status: disabled`: Ask the user to configure
+- If the toolset has `status: disabled`: Ask the user to configure it.
 - Share the setup instructions URL with the user
-
-
-you should give an answer similar to "I don't have access to <system>. Please add a Holmes integration for <system> so
-that I can investigate this."
-{% else %}
+- If there are no relevant toolsets in the list above, tell the user that you are missing an integration to access XYZ:
+You should give an answer similar to "I don't have access to <system>. To add a HolmesGPT integration for <system> you can [connect an MCP server](https://robusta-dev.github.io/holmesgpt/data-sources/remote-mcp-servers/) or add a [custom toolset](https://robusta-dev.github.io/holmesgpt/data-sources/custom-toolsets/)."
 
-
+Likewise, if users ask about setting up or configuring integrations (e.g., "How can I give you access to ArgoCD applications?"):
+ALWAYS check if there's a disabled or failed toolset that matches what the user is asking about. If you find one:
+1. If the toolset has a specific documentation URL (toolset.docs_url), ALWAYS direct them to that URL first
+2. If no specific documentation exists, then direct them to the general Holmes documentation:
+- For all toolset configurations: https://robusta-dev.github.io/holmesgpt/data-sources/
+- For custom toolsets: https://robusta-dev.github.io/holmesgpt/data-sources/custom-toolsets/
+- For remote MCP servers: https://robusta-dev.github.io/holmesgpt/data-sources/remote-mcp-servers/
 
-
-You should give an answer similar to "I don't have access to <system>. Please add a Holmes integration for <system> so that I can investigate this."
-{%- endif -%}
+When providing configuration guidance, always prefer the specific toolset documentation URL when available.
holmes/plugins/prompts/generic_ask.jinja2

@@ -1,8 +1,10 @@
 You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
 Whenever possible you MUST first use tools to investigate then answer the question.
+Ask for multiple tool calls at the same time as it saves time for the user.
 Do not say 'based on the tool output' or explicitly refer to tools at all.
 If you output an answer and then realize you need to call more tools or there are possible next steps, you may do so by calling tools at that point in time.
 If you have a good and concrete suggestion for how the user can fix something, tell them even if not asked explicitly
+{% include '_current_date_time.jinja2' %}
 
 Use conversation history to maintain continuity when appropriate, ensuring efficiency in your responses.
 
@@ -34,3 +36,7 @@ Relevant logs:
 ```
 
 Validation error led to unhandled Java exception causing a crash.
+
+{% if system_prompt_additions %}
+{{ system_prompt_additions }}
+{% endif %}
holmes/plugins/prompts/generic_ask_conversation.jinja2

@@ -1,5 +1,6 @@
 You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
 Whenever possible you MUST first use tools to investigate then answer the question.
+Ask for multiple tool calls at the same time as it saves time for the user.
 Do not say 'based on the tool output' or explicitly refer to tools at all.
 If you output an answer and then realize you need to call more tools or there are possible next steps, you may do so by calling tools at that point in time.
 If you have a good and concrete suggestion for how the user can fix something, tell them even if not asked explicitly
holmes/plugins/prompts/generic_ask_for_issue_conversation.jinja2

@@ -1,5 +1,6 @@
 You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
 Whenever possible you MUST first use tools to investigate then answer the question.
+Ask for multiple tool calls at the same time as it saves time for the user.
 Do not say 'based on the tool output' or explicitly refer to tools at all.
 If you output an answer and then realize you need to call more tools or there are possible next steps, you may do so by calling tools at that point in time.
 {% include '_current_date_time.jinja2' %}
holmes/plugins/prompts/generic_investigation.jinja2

@@ -1,5 +1,6 @@
 You are a tool-calling AI assist provided with common devops and IT tools that you can use to troubleshoot problems or answer questions.
 Whenever possible you MUST first use tools to investigate then answer the question.
+Ask for multiple tool calls at the same time as it saves time for the user.
 Do not say 'based on the tool output'
 
 Provide an terse analysis of the following {{ issue.source_type }} alert/issue and why it is firing.
holmes/plugins/runbooks/__init__.py

@@ -11,6 +11,7 @@ from pydantic import BaseModel, PrivateAttr
 from holmes.utils.pydantic_utils import RobustaBaseConfig, load_model_from_file
 
 THIS_DIR = os.path.abspath(os.path.dirname(__file__))
+DEFAULT_RUNBOOK_SEARCH_PATH = THIS_DIR
 
 CATALOG_FILE = "catalog.json"
 
@@ -94,7 +95,22 @@ def load_runbook_catalog() -> Optional[RunbookCatalog]:
     return None
 
 
-def get_runbook_by_path(
-
-
-
+def get_runbook_by_path(
+    runbook_relative_path: str, search_paths: List[str]
+) -> Optional[str]:
+    """
+    Find a runbook by searching through provided paths.
+
+    Args:
+        runbook_relative_path: The relative path to the runbook
+        search_paths: Optional list of directories to search. If None, uses default runbook folder.
+
+    Returns:
+        Full path to the runbook if found, None otherwise
+    """
+    for search_path in search_paths:
+        runbook_path = os.path.join(search_path, runbook_relative_path)
+        if os.path.exists(runbook_path):
+            return runbook_path
+
+    return None
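The added `get_runbook_by_path` helper is small enough to check in isolation. A standalone sketch of its first-match semantics, exercised against temporary directories rather than the real runbook folder (the `cpu.md` file name is illustrative only):

```python
import os
import tempfile
from typing import List, Optional

def get_runbook_by_path(runbook_relative_path: str, search_paths: List[str]) -> Optional[str]:
    # Return the first existing join of a search path and the relative path, else None.
    for search_path in search_paths:
        runbook_path = os.path.join(search_path, runbook_relative_path)
        if os.path.exists(runbook_path):
            return runbook_path
    return None

with tempfile.TemporaryDirectory() as custom, tempfile.TemporaryDirectory() as default:
    with open(os.path.join(custom, "cpu.md"), "w") as f:
        f.write("# High CPU runbook")
    # Found via the first (custom) search path
    found = get_runbook_by_path("cpu.md", [custom, default])
    assert found == os.path.join(custom, "cpu.md")
    # A missing runbook resolves to None rather than raising
    assert get_runbook_by_path("missing.md", [custom, default]) is None
```

Search order matters here: an earlier path shadows a later one, which is what lets a custom runbook directory override the bundled default.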
holmes/plugins/toolsets/__init__.py

@@ -3,14 +3,16 @@ import os
 import os.path
 from typing import Any, List, Optional, Union
 
-from holmes.common.env_vars import USE_LEGACY_KUBERNETES_LOGS
 import yaml  # type: ignore
 from pydantic import ValidationError
 
-from holmes.plugins.toolsets.azure_sql.azure_sql_toolset import AzureSQLToolset
 import holmes.utils.env as env_utils
+from holmes.common.env_vars import USE_LEGACY_KUBERNETES_LOGS
 from holmes.core.supabase_dal import SupabaseDal
 from holmes.core.tools import Toolset, ToolsetType, ToolsetYamlFromConfig, YAMLToolset
+from holmes.plugins.toolsets.atlas_mongodb.mongodb_atlas import MongoDBAtlasToolset
+from holmes.plugins.toolsets.azure_sql.azure_sql_toolset import AzureSQLToolset
+from holmes.plugins.toolsets.bash.bash_toolset import BashExecutorToolset
 from holmes.plugins.toolsets.coralogix.toolset_coralogix_logs import (
     CoralogixLogsToolset,
 )
@@ -18,18 +20,15 @@ from holmes.plugins.toolsets.datadog.toolset_datadog_logs import DatadogLogsTool
 from holmes.plugins.toolsets.datadog.toolset_datadog_metrics import (
     DatadogMetricsToolset,
 )
-from holmes.plugins.toolsets.datadog.toolset_datadog_traces import (
-    DatadogTracesToolset,
-)
-from holmes.plugins.toolsets.kubernetes_logs import KubernetesLogsToolset
+from holmes.plugins.toolsets.datadog.toolset_datadog_traces import DatadogTracesToolset
 from holmes.plugins.toolsets.git import GitToolset
 from holmes.plugins.toolsets.grafana.toolset_grafana import GrafanaToolset
-from holmes.plugins.toolsets.bash.bash_toolset import BashExecutorToolset
 from holmes.plugins.toolsets.grafana.toolset_grafana_loki import GrafanaLokiToolset
 from holmes.plugins.toolsets.grafana.toolset_grafana_tempo import GrafanaTempoToolset
 from holmes.plugins.toolsets.internet.internet import InternetToolset
 from holmes.plugins.toolsets.internet.notion import NotionToolset
 from holmes.plugins.toolsets.kafka import KafkaToolset
+from holmes.plugins.toolsets.kubernetes_logs import KubernetesLogsToolset
 from holmes.plugins.toolsets.mcp.toolset_mcp import RemoteMCPToolset
 from holmes.plugins.toolsets.newrelic import NewRelicToolset
 from holmes.plugins.toolsets.opensearch.opensearch import OpenSearchToolset
@@ -38,7 +37,6 @@ from holmes.plugins.toolsets.opensearch.opensearch_traces import OpenSearchTrace
 from holmes.plugins.toolsets.prometheus.prometheus import PrometheusToolset
 from holmes.plugins.toolsets.rabbitmq.toolset_rabbitmq import RabbitMQToolset
 from holmes.plugins.toolsets.robusta.robusta import RobustaToolset
-from holmes.plugins.toolsets.atlas_mongodb.mongodb_atlas import MongoDBAtlasToolset
 from holmes.plugins.toolsets.runbook.runbook_fetcher import RunbookToolset
 from holmes.plugins.toolsets.servicenow.servicenow import ServiceNowToolset
 
@@ -156,7 +154,7 @@ def load_toolsets_from_config(
     toolset_type = config.get("type", ToolsetType.BUILTIN.value)
     # MCP server is not a built-in toolset, so we need to set the type explicitly
     validated_toolset: Optional[Toolset] = None
-    if toolset_type
+    if toolset_type == ToolsetType.MCP.value:
         validated_toolset = RemoteMCPToolset(**config, name=name)
     elif strict_check:
         validated_toolset = YAMLToolset(**config, name=name)  # type: ignore
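The corrected condition compares the raw string coming from a YAML config against the enum's `.value`. A minimal sketch of why that matters, using stand-in classes and values rather than the real Holmes types (the enum values and return strings here are hypothetical):

```python
from enum import Enum

class ToolsetType(Enum):
    # Stand-in enum; real values live in holmes.core.tools
    BUILTIN = "built-in"
    MCP = "mcp"

def pick_toolset_kind(config: dict) -> str:
    # YAML configs carry plain strings, so compare against the enum's .value
    toolset_type = config.get("type", ToolsetType.BUILTIN.value)
    if toolset_type == ToolsetType.MCP.value:
        return "RemoteMCPToolset"
    return "YAMLToolset"

assert pick_toolset_kind({"type": "mcp"}) == "RemoteMCPToolset"
assert pick_toolset_kind({}) == "YAMLToolset"
# Comparing the string to the enum member itself would never match:
assert ("mcp" == ToolsetType.MCP) is False
```

The last assertion is the pitfall: `str == Enum` is always `False` in Python, so only the `.value` comparison dispatches MCP configs correctly.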
holmes/plugins/toolsets/aks-node-health.yaml

@@ -55,11 +55,3 @@ toolsets:
       user_description: "lists all VMSS names in {{ NODE_RESOURCE_GROUP }}"
      command: |
         az vmss list -g {{ NODE_RESOURCE_GROUP }} --query '[*].name' -o tsv --only-show-errors
-    - name: "vmss_run_command"
-      description: |
-        Execute a shell command on a specific VMSS VM instance using az vmss run-command.
-        VM_ID is the instance ID of the VMSS, which can be derived from node names.
-        Prerequisites: get_node_resource_group, list_vmss_names
-      user_description: "run command {{ SHELL_COMMAND }} on VM #{{ VM_ID }} of VMSS {{ VMSS_NAME }}"
-      command: |
-        az vmss run-command invoke --resource-group {{ NODE_RESOURCE_GROUP }} --name {{ VMSS_NAME }} --instance-id {{ VM_ID }} --command-id RunShellScript --scripts {{ SHELL_COMMAND }}
holmes/plugins/toolsets/argocd.yaml

@@ -6,13 +6,16 @@ toolsets:
     llm_instructions: |
       You have access to a set of ArgoCD tools for debugging Kubernetes application deployments.
       If an application's name does not exist in kubernetes, it may exist in argocd: call the tool `argocd_app_list` to find it.
+      IMPORTANT: If you are asked about health issues, ALWAYS check if the argo cd apps are in a healthy state.
+      If some resource is out of sync, ALWAYS show the diff, using the argocd_app_diff tool, between the desired state and the current state.
       These tools help you investigate issues with GitOps-managed applications in your Kubernetes clusters.
-      ALWAYS follow these steps:
+      In addition to the general investigation steps, ALWAYS follow these steps as well:
       1. List the applications
       2. Retrieve the application status and its config
       3. Retrieve the application's manifests for issues
       4. Compare the ArgoCD config with the kubernetes status using kubectl tools
       5. Check for resources mismatch between argocd and kubernetes
+      6. If an application is OutOfSync, pull the diff using the argocd_app_diff tool
       {% if tool_names|list|length > 0 %}
       The following commands are available to introspect into ArgoCD: {{ ", ".join(tool_names) }}
       {% endif %}
holmes/plugins/toolsets/azure_sql/apis/connection_failure_api.py

@@ -134,6 +134,8 @@ class ConnectionFailureAPI:
         for metric in metrics.value:
             if metric.timeseries:
                 for timeseries in metric.timeseries:
+                    if timeseries.data is None:
+                        continue
                     for data_point in timeseries.data:
                         if data_point.time_stamp:
                             metric_values.append(
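The two added lines guard against a timeseries whose `data` attribute is `None`, which would otherwise fail when the inner loop tries to iterate it. A self-contained sketch of the pattern using stand-in objects (the attribute names mimic the shapes seen in the diff; they are illustrative, not the actual Azure SDK types):

```python
from types import SimpleNamespace

# Stand-ins for SDK timeseries objects; one with data=None, one with a data point
timeseries_list = [
    SimpleNamespace(data=None),  # iterating this directly would raise TypeError
    SimpleNamespace(data=[SimpleNamespace(time_stamp="2024-01-01T00:00:00Z", average=3.5)]),
]

metric_values = []
for timeseries in timeseries_list:
    if timeseries.data is None:  # the guard added in this release
        continue
    for data_point in timeseries.data:
        if data_point.time_stamp:
            metric_values.append(data_point.average)

assert metric_values == [3.5]
```

Without the guard, the first element would raise `TypeError: 'NoneType' object is not iterable`; with it, empty series are simply skipped.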
holmes/plugins/toolsets/confluence.yaml

@@ -14,6 +14,6 @@ toolsets:
 
     tools:
       - name: "fetch_confluence_url"
-        description: "Fetch a page in confluence.
+        description: "Fetch a page in confluence."
         user_description: "fetch confluence page {{ confluence_page_id }}"
         command: "curl -u ${CONFLUENCE_USER}:${CONFLUENCE_API_KEY} -X GET -H 'Content-Type: application/json' ${CONFLUENCE_BASE_URL}/wiki/rest/api/content/{{ confluence_page_id }}?expand=body.storage"
holmes/plugins/toolsets/datadog/datadog_metrics_instructions.jinja2

@@ -15,12 +15,62 @@ When investigating metrics-related issues:
    - Provides metric type (gauge/count/rate), unit, and description
    - Accepts comma-separated list for batch queries
 
+4. **Use `list_datadog_metric_tags`** to understand which tags are available for a given metric
+   - Provides a set of tags and aggregations
+   - Can help to build the correct `tag_filter`, to find which metrics are available for a given resource
+
+### General guideline
+- This toolset is used to generate visualizations and graphs.
+- Assume the resource should have metrics. If metrics not found, try to adjust tag filters
+- IMPORTANT: This toolset DOES NOT support promql queries.
+
+### CRITICAL: Pod Name Resolution Workflow
+When users ask for metrics about a deployment, service, or workload (e.g., "my-workload", "nginx-deployment"):
+
+**ALWAYS follow this two-step process:**
+1. **First**: Use `kubectl_find_resource` to find the actual pod names
+   - Example: `kubectl_find_resource` with "my-workload" → finds pods like "my-workload-8f8cdfxyz-c7zdr"
+2. **Then**: Use those specific pod names in Datadog queries
+   - Correct: `container.cpu.usage{pod_name:my-workload-8f8cdfxyz-c7zdr}`
+   - WRONG: `container.cpu.usage{pod_name:my-workload}` ← This will return no data!
+
+**Why this matters:**
+- Pod names in Datadog are the actual Kubernetes pod names (with random suffixes)
+- Deployment/service names are NOT pod names
+- Using deployment names as pod_name filters will always return empty results
+
 ### Time Parameters
 - Use RFC3339 format: `2023-03-01T10:30:00Z`
 - Or relative seconds: `-3600` for 1 hour ago
 - Defaults to 1 hour window if not specified
 
-### Common Patterns
-
-
-
+### Common Investigation Patterns
+
+**For Pod/Container Metrics (MOST COMMON):**
+1. User asks: "Show CPU for my-workload"
+2. Use `kubectl_find_resource` → find pod "my-workload-abc123-xyz"
+3. Query Datadog: `container.cpu.usage{pod_name:my-workload-abc123-xyz}`
+
+**For Node-level Metrics:**
+1. Use `tag_filter:kube_node_name:nodename` to filter by node
+2. Query system-level metrics like `system.cpu.user{kube_node_name:worker-1}`
+
+**For Service-level Metrics:**
+1. First resolve service to pods using `kubectl_find_resource`
+2. Query metrics for all pods belonging to that service
+3. Use namespace filtering: `tag_filter:kube_namespace:default`
+
+
+# Handling queries results
+* ALWAYS embed the execution results into your answer
+* You only need to embed the partial result in your response. Include the "tool_name" and "random_key". For example: << {"type": "datadogql", "tool_name": "query_datadog_metrics", "random_key": "92jf2hf"} >>
+* Post processing will parse your response, re-run the query from the tool output and create a chart visible to the user
+* You MUST ensure that the query is successful.
+* ALWAYS embed a DataDog graph in the response. The graph should visualize data related to the incident.
+* Embed at most 2 graphs
+* When embedding multiple graphs, always add line spacing between them
+For example:
+
+<<{"type": "datadogql", "tool_name": "query_datadog_metrics", "random_key": "lBaA"}>>
+
+<<{"type": "datadogql", "tool_name": "query_datadog_metrics", "random_key": "IKtq"}>>