hud-python 0.2.0__tar.gz → 0.2.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release. This version of hud-python might be problematic.
- {hud_python-0.2.0 → hud_python-0.2.2}/PKG-INFO +52 -39
- hud_python-0.2.2/README.md +136 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/advanced/cla-details.mdx +5 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/environment.mdx +17 -2
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/task.mdx +37 -6
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/docs.json +13 -4
- hud_python-0.2.2/docs/environments/hud-browser.mdx +67 -0
- hud_python-0.2.2/docs/environments/hud-ubuntu.mdx +55 -0
- hud_python-0.2.2/docs/environments/qa.mdx +68 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/quickstart.mdx +4 -14
- hud_python-0.2.2/docs/running-your-agent.mdx +237 -0
- hud_python-0.2.2/examples/WebVoyager_data.jsonl +643 -0
- hud_python-0.2.2/examples/browser_use.ipynb +120 -0
- hud_python-0.2.2/examples/ds_upload.ipynb +2313 -0
- hud_python-0.2.2/examples/inspect.ipynb +2091 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/examples/jobs.ipynb +4 -6
- {hud_python-0.2.0 → hud_python-0.2.2}/examples/local.ipynb +9 -13
- {hud_python-0.2.0 → hud_python-0.2.2}/examples/osworld.ipynb +3 -3
- hud_python-0.2.2/examples/tasks.ipynb +259 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/__init__.py +3 -2
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/__init__.py +2 -1
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/claude/adapter.py +15 -2
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/common/types.py +7 -3
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/operator/adapter.py +10 -6
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/agent/__init__.py +2 -1
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/agent/claude.py +22 -2
- hud_python-0.2.2/hud/agent/langchain.py +198 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/agent/operator.py +35 -17
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/env/docker_client.py +1 -1
- hud_python-0.2.2/hud/env/environment.py +354 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/env/local_docker_client.py +3 -1
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/env/remote_client.py +4 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/gym.py +3 -3
- hud_python-0.2.2/hud/job.py +593 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/task.py +41 -30
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/taskset.py +8 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/types.py +5 -3
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/utils/common.py +31 -1
- hud_python-0.2.2/hud/utils/config.py +94 -0
- hud_python-0.2.2/hud/utils/progress.py +136 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/pyproject.toml +10 -4
- {hud_python-0.2.0 → hud_python-0.2.2}/tests/test_import.py +1 -1
- hud_python-0.2.0/README.md +0 -129
- hud_python-0.2.0/examples/browser_use.ipynb +0 -324
- hud_python-0.2.0/examples/tasks.ipynb +0 -117
- hud_python-0.2.0/hud/env/environment.py +0 -181
- hud_python-0.2.0/hud/job.py +0 -185
- hud_python-0.2.0/hud/utils/config.py +0 -185
- {hud_python-0.2.0 → hud_python-0.2.2}/.env.example +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/.github/workflows/ci.yml +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/.github/workflows/release.yml +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/.gitignore +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/LICENSE +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/MANIFEST.in +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/advanced/custom-environments.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/advanced/environment-control.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api/reference/adapters.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api-reference/adapters.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api-reference/env.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api-reference/gym.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api-reference/job.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api-reference/task.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api-reference/taskset.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/api-reference/trajectory.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/adapter.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/agent.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/job.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/trajectory.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/examples/basic.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/examples/claude-agent.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/examples/custom-agent.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/favicon.png +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/installation.mdx +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/logo/HUD-light-optimized.svg +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/docs/logo/HUD.svg +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/novnc_ubuntu/Dockerfile +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/novnc_ubuntu/pyproject.toml +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/novnc_ubuntu/src/novnc_ubuntu/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/novnc_ubuntu/src/novnc_ubuntu/pyautogui_rosetta.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/novnc_ubuntu/src/novnc_ubuntu/step.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/Dockerfile +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/pyproject.toml +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/evaluate/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/evaluate/matchers.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/info.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/setup/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/setup/question.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/step.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/utils/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/environments/qa_controller/src/qa_controller/utils/state.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/examples/README.md +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/claude/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/common/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/common/adapter.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/adapters/operator/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/agent/base.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/env/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/env/client.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/env/remote_docker_client.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/evaluators/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/evaluators/base.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/evaluators/inspect.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/evaluators/judge.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/evaluators/match.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/evaluators/remote.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/py.typed +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/server/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/server/requests.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/settings.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/trajectory.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/utils/__init__.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/hud/utils/telemetry.py +0 -0
- {hud_python-0.2.0 → hud_python-0.2.2}/tests/__init__.py +0 -0
{hud_python-0.2.0 → hud_python-0.2.2}/PKG-INFO

@@ -1,11 +1,11 @@
 Metadata-Version: 2.4
 Name: hud-python
-Version: 0.2.0
+Version: 0.2.2
 Summary: SDK for the HUD evaluation platform.
-Project-URL: Homepage, https://github.com/
-Project-URL: Bug Tracker, https://github.com/
+Project-URL: Homepage, https://github.com/hud-evals/hud-sdk
+Project-URL: Bug Tracker, https://github.com/hud-evals/hud-sdk/issues
 Project-URL: Documentation, https://hud.so
-Author-email:
+Author-email: HUD SDK <founders@hud.so>
 License: MIT License
 
 Copyright (c) 2025 Human Union Data, Inc

@@ -37,8 +37,14 @@ Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
 Requires-Python: <3.14,>=3.10
 Requires-Dist: aiodocker>=0.24.0
+Requires-Dist: anthropic
 Requires-Dist: httpx<1,>=0.23.0
 Requires-Dist: inspect-ai>=0.3.80
+Requires-Dist: ipykernel
+Requires-Dist: langchain
+Requires-Dist: langchain-openai
+Requires-Dist: numpy
+Requires-Dist: openai
 Requires-Dist: pillow>=11.1.0
 Requires-Dist: pydantic-settings<3,>=2
 Requires-Dist: pydantic<3,>=2

@@ -57,7 +63,7 @@ Requires-Dist: pytest<9,>=8.1.1; extra == 'dev'
 Requires-Dist: ruff==0.9.8; extra == 'dev'
 Description-Content-Type: text/markdown
 
-# HUD
+# HUD
 
 A Python SDK for creating, evaluating, and benchmarking agent interactions with web browsers and OS environments.
 

@@ -86,21 +92,20 @@ export HUD_API_KEY=your_api_key_here
 pip install hud-python
 ```
 
-### Simple Browser Example with
+### Simple Browser Example with Claude Computer Use
 
 > This example uses the `@job("test-run")` decorator, so the results of this run will appear under the job named "test-run" on your [HUD Jobs page](https://app.hud.so/jobs).
 
+Make sure you have defined your `ANTHROPIC_API_KEY` in environment variables to run Claude.
+
 ```python
-import os
 import asyncio
 from hud import gym, job
 from hud.task import Task
-from hud.
-from hud.agent import OperatorAgent
+from hud.agent import ClaudeAgent
 
 @job("test-run")
 async def main():
-    # Define a simple task
     task = Task(
         prompt="Insert the text 'capybara' into the search bar",
         gym="hud-browser",

@@ -108,26 +113,19 @@ async def main():
         evaluate=("contains_text", "capybara")
     )
 
-    # Create environment
+    # Create environment using the gym module
     env = await gym.make(task)
 
-    # Get URLs and display live view (optional)
-    # urls = await env.get_urls()
-    # stream(urls["live_url"])
-
     # Initialize Operator agent (API key is loaded automatically)
-    agent =
+    agent = ClaudeAgent()
 
-    # Agent loop
-    obs, _ = env.reset()
+    # Agent loop with predict and step functions
+    obs, _ = await env.reset() # Gets first observation
     for i in range(5):
         actions, done = await agent.predict(obs)
-
-            break
 
         obs, reward, terminated, info = await env.step(actions)
-        if terminated:
-            break
+        if done or terminated: break
 
     # Evaluate and close
     result = await env.evaluate()

@@ -139,35 +137,50 @@ if __name__ == "__main__":
 
 ```
 
+Alternatively, run a full evaluation set via the ```run_job``` command:
+
+```python
+from hud import load_taskset, run_job, ClaudeAgent
+
+# load
+taskset = load_taskset("GAIA")
+
+# evaluate
+job = await run_job(ClaudeAgent, taskset, "test-gaia-job")
+
+# get results OR view them in app.hud.so
+print(await job.get_analytics())
+```
+
 ## Documentation Sections
 
 Explore the core concepts and features of the SDK:
 
-* **[Tasks and TaskSets](/concepts/task)**: Define goals, context, setup, and evaluation criteria for agent scenarios.
-* **[Environments](/concepts/environment)**: Understand the browser and OS runtimes where agents interact.
-* **[Agents](/concepts/agent)**: Learn about the agent architecture (Claude, Operator) and how they process observations and predict actions.
-* **[Adapters](/concepts/adapter)**: See how actions and observations are translated between agents and environments.
-* **[Jobs](/concepts/job)**: Group related runs for analysis and viewing on the HUD platform.
-* **[Trajectories](/concepts/trajectory)**: Understand the recorded data from each agent run.
+* **[Tasks and TaskSets](https://documentation.hud.so/concepts/task)**: Define goals, context, setup, and evaluation criteria for agent scenarios. This includes both interactive and **question-answering (QA)** style tasks.
+* **[Environments](https://documentation.hud.so/concepts/environment)**: Understand the browser and OS runtimes where agents interact.
+* **[Agents](https://documentation.hud.so/concepts/agent)**: Learn about the agent architecture (Claude, Operator) and how they process observations and predict actions.
+* **[Adapters](https://documentation.hud.so/concepts/adapter)**: See how actions and observations are translated between agents and environments.
+* **[Jobs](https://documentation.hud.so/concepts/job)**: Group related runs for analysis and viewing on the HUD platform.
+* **[Trajectories](https://documentation.hud.so/concepts/trajectory)**: Understand the recorded data from each agent run.
 * **Advanced Topics**:
-  * **[
-  * **[
-  * **[
+  * **[CLA Action Details](https://documentation.hud.so/advanced/cla-details)**: Explore the standardized action format.
+  * **[Custom Environments](https://documentation.hud.so/advanced/custom-environments)**: Build your own Docker-based local or remote environments.
+  * **[Advanced Environment Control](https://documentation.hud.so/advanced/environment-control)**: Use `invoke`, `execute`, and `_setup` for finer control.
 
-* **[Full API Reference](/api-reference/gym)**: Detailed specifications for all modules and classes.
+* **[Full API Reference](https://documentation.hud.so/api-reference/gym)**: Detailed specifications for all modules and classes.
 
 ## [Examples](examples/)
 
-We
+We recommend you first take a look at the example notebooks showing how to use the HUD SDK:
 
 1. [Browser Basics](examples/browser_use.ipynb) - Simple browser interaction with live view
 2. [Task Design](examples/tasks.ipynb) - Creating and customizing tasks
-3. [OSWorld](examples/osworld.ipynb) -
+3. [OSWorld](examples/osworld.ipynb) - Running the OSWorld benchmark
 4. [Local Development](examples/local.ipynb) - Setting up local custom environments
 
 ## Documentation
 
-For comprehensive guides, examples, and API reference, visit [our docs](https://
+For comprehensive guides, examples, and API reference, visit [our docs](https://documentation.hud.so/introduction)
 
 ## License
 

@@ -180,9 +193,9 @@ If you use this SDK in your research, please cite it as follows:
 ```bibtex
 @software{hud2025agentevalplatform,
   author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Max Muoto and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
-  title = {{HUD: An Evaluation Platform for
-  date = {2025-
-  url = {https://github.com/
+  title = {{HUD: An Evaluation Platform for Agents}},
+  date = {2025-04},
+  url = {https://github.com/hud-evals/hud-sdk},
   langid = {en}
 }
 ```
hud_python-0.2.2/README.md (new file, 136 lines)

# HUD

A Python SDK for creating, evaluating, and benchmarking agent interactions with web browsers and OS environments.

> **Early Release Notice**: This SDK is currently in early release status. The API is evolving and may change in future releases as we gather feedback and improve functionality.

[](https://pypi.org/project/hud-python/)

[📚 Documentation](https://documentation.hud.so) | [🏠 Homepage](https://hud.so)

## API Key Setup

Before getting started, you'll need to obtain an API key:

1. Visit [app.hud.so](https://app.hud.so) to create a free account and generate your API key
2. Set it in your environment or .env file:

```bash
export HUD_API_KEY=your_api_key_here
```

## Quick Start

### Installation

```bash
pip install hud-python
```

### Simple Browser Example with Claude Computer Use

> This example uses the `@job("test-run")` decorator, so the results of this run will appear under the job named "test-run" on your [HUD Jobs page](https://app.hud.so/jobs).

Make sure you have defined your `ANTHROPIC_API_KEY` in environment variables to run Claude.

```python
import asyncio
from hud import gym, job
from hud.task import Task
from hud.agent import ClaudeAgent

@job("test-run")
async def main():
    task = Task(
        prompt="Insert the text 'capybara' into the search bar",
        gym="hud-browser",
        setup=("goto", "google.com"),
        evaluate=("contains_text", "capybara")
    )

    # Create environment using the gym module
    env = await gym.make(task)

    # Initialize Operator agent (API key is loaded automatically)
    agent = ClaudeAgent()

    # Agent loop with predict and step functions
    obs, _ = await env.reset() # Gets first observation
    for i in range(5):
        actions, done = await agent.predict(obs)

        obs, reward, terminated, info = await env.step(actions)
        if done or terminated: break

    # Evaluate and close
    result = await env.evaluate()
    print(f"Evaluation result: {result}")
    await env.close()

if __name__ == "__main__":
    asyncio.run(main())

```

Alternatively, run a full evaluation set via the ```run_job``` command:

```python
from hud import load_taskset, run_job, ClaudeAgent

# load
taskset = load_taskset("GAIA")

# evaluate
job = await run_job(ClaudeAgent, taskset, "test-gaia-job")

# get results OR view them in app.hud.so
print(await job.get_analytics())
```

## Documentation Sections

Explore the core concepts and features of the SDK:

* **[Tasks and TaskSets](https://documentation.hud.so/concepts/task)**: Define goals, context, setup, and evaluation criteria for agent scenarios. This includes both interactive and **question-answering (QA)** style tasks.
* **[Environments](https://documentation.hud.so/concepts/environment)**: Understand the browser and OS runtimes where agents interact.
* **[Agents](https://documentation.hud.so/concepts/agent)**: Learn about the agent architecture (Claude, Operator) and how they process observations and predict actions.
* **[Adapters](https://documentation.hud.so/concepts/adapter)**: See how actions and observations are translated between agents and environments.
* **[Jobs](https://documentation.hud.so/concepts/job)**: Group related runs for analysis and viewing on the HUD platform.
* **[Trajectories](https://documentation.hud.so/concepts/trajectory)**: Understand the recorded data from each agent run.
* **Advanced Topics**:
  * **[CLA Action Details](https://documentation.hud.so/advanced/cla-details)**: Explore the standardized action format.
  * **[Custom Environments](https://documentation.hud.so/advanced/custom-environments)**: Build your own Docker-based local or remote environments.
  * **[Advanced Environment Control](https://documentation.hud.so/advanced/environment-control)**: Use `invoke`, `execute`, and `_setup` for finer control.

* **[Full API Reference](https://documentation.hud.so/api-reference/gym)**: Detailed specifications for all modules and classes.

## [Examples](examples/)

We recommend you first take a look at the example notebooks showing how to use the HUD SDK:

1. [Browser Basics](examples/browser_use.ipynb) - Simple browser interaction with live view
2. [Task Design](examples/tasks.ipynb) - Creating and customizing tasks
3. [OSWorld](examples/osworld.ipynb) - Running the OSWorld benchmark
4. [Local Development](examples/local.ipynb) - Setting up local custom environments

## Documentation

For comprehensive guides, examples, and API reference, visit [our docs](https://documentation.hud.so/introduction)

## License

[MIT License](LICENSE)

## Citation

If you use this SDK in your research, please cite it as follows:

```bibtex
@software{hud2025agentevalplatform,
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Max Muoto and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
  title = {{HUD: An Evaluation Platform for Agents}},
  date = {2025-04},
  url = {https://github.com/hud-evals/hud-sdk},
  langid = {en}
}
```
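The `run_job` snippet in the README above uses top-level `await`. For reference, a minimal runnable sketch of the same flow wrapped in `asyncio.run` (awaiting `load_taskset` follows the pattern shown in `docs/concepts/task.mdx` later in this diff and is otherwise an assumption):

```python
import asyncio
from hud import load_taskset, run_job, ClaudeAgent

async def main():
    # Load a hosted TaskSet by name (GAIA is one of the sets named in these docs)
    taskset = await load_taskset("GAIA")

    # Evaluate every task in the set with the Claude agent, grouped under one job
    job = await run_job(ClaudeAgent, taskset, "test-gaia-job")

    # Fetch aggregate results; the same job is also visible at app.hud.so
    print(await job.get_analytics())

if __name__ == "__main__":
    asyncio.run(main())
```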
{hud_python-0.2.0 → hud_python-0.2.2}/docs/advanced/cla-details.mdx

@@ -69,6 +69,11 @@ Here are some key CLA types grouped by category:
 * **`ScreenshotFetch`**: Requests a screenshot (used internally, typically not sent by agents directly).
 * **`PositionFetch`**: Requests the current cursor position (used internally).
 
+### Response Actions
+
+* **`ResponseAction`**: Used to submit a final text answer.
+  * `text: str`: The final textual response from the agent.
+
 ### Custom Actions
 
 * **`CustomAction`**: Allows defining arbitrary actions specific to a custom environment controller.
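As a small illustration of the new action type, a hedged sketch of constructing a `ResponseAction`; only the `text: str` field is documented above, and the import path is an assumption based on where the other common CLA types live (`hud/adapters/common/types.py` in this diff):

```python
# Assumed import path for the common CLA types
from hud.adapters.common.types import ResponseAction

# The agent's final textual answer for a QA-style task
final_answer = ResponseAction(text="Paris")

# In an agent loop, the action is passed to the environment like any other CLA:
# obs, reward, terminated, info = await env.step([final_answer])
```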
{hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/environment.mdx

@@ -48,6 +48,20 @@ env_os = await gym.make("OSWorld-Ubuntu")
 
 Environments created this way won't have a default `Task` associated unless you explicitly reset them with one later using `env.reset()`. The `gym.make()` function also automatically links the environment to an active [Job](/concepts/job) if one was defined using the `@job` decorator.
 
+## Available Environment Types
+
+The HUD SDK provides several standard environment types, specified via the `gym` attribute in a [Task](/concepts/task) or directly in `hud.gym.make()`:
+
+* **`"hud-browser"`**: Provides a remote Chromium browser instance managed via Playwright. Ideal for web navigation, form interaction, and testing web applications.
+  * [See `hud-browser` Details](../environments/hud-browser.mdx)
+* **`"hud-ubuntu"`**: Provides a remote Ubuntu desktop environment accessed via VNC. Suitable for tasks involving GUI applications, file system interaction, or running Linux software.
+  * [See `hud-ubuntu` Details](../environments/hud-ubuntu.mdx)
+* **`"qa"`**: A non-interactive environment for question-answering tasks where the agent provides a direct textual response.
+  * [See `qa` Environment Details](../environments/qa.mdx)
+* **`CustomGym`**: Allows defining and running your own [Custom Environments](../advanced/custom-environments.mdx) using Docker, either locally or remotely. This provides maximum flexibility for specific testing needs.
+
+The `gym` attribute in a Task tells `hud.gym.make()` which environment to instantiate.
+
 ## Interaction Loop
 
 The standard interaction flow involves the [Agent](/concepts/agent) and the Environment:

@@ -67,13 +81,14 @@ obs, _ = await env.reset()
 for _ in range(10):
     # 2. Agent predicts action(s)
     actions, done = await agent.predict(obs)
-    if done: break
 
     # 3. Execute action(s) in environment
     obs, reward, terminated, info = await env.step(actions)
-    if terminated: break
+    if done or terminated: break
 ```
 
+* **Note on QA Tasks:** For [Question-Answering Tasks](/concepts/task#defining-question-answering-qa-tasks), the agent might only need one `predict` call. The agent should output a `ResponseAction`, which the environment stores. The subsequent `env.evaluate()` call then checks this stored response. The environment itself remains largely passive for QA.
+
 ## Key Methods
 
 * **`env.step(actions: list[CLA] | None = None)`**: Executes actions (or gets initial state). Returns `(Observation, reward, terminated, info)`.
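To make the QA note above concrete, a minimal sketch of the single-shot flow it describes, assuming a Task whose `evaluate` config checks the stored response (e.g. `("response_includes", "Paris")`):

```python
from hud import gym

async def run_qa(qa_task, agent):
    env = await gym.make(qa_task)
    obs, _ = await env.reset()

    # One predict call is usually enough for QA; the agent returns a ResponseAction
    actions, done = await agent.predict(obs)

    # The environment stores the submitted text rather than acting on it
    await env.step(actions)

    # evaluate() then checks the stored response against the Task's criteria
    result = await env.evaluate()
    await env.close()
    return result
```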
{hud_python-0.2.0 → hud_python-0.2.2}/docs/concepts/task.mdx

@@ -30,11 +30,10 @@ task = Task(
     prompt="Log in to example.com with username 'test'",
     gym="hud-browser", # Request a browser environment
     setup=[ # Actions run by gym.make(task)
-        ("goto", "https://example.com/login")
-        {"function": "wait_for_element", "args": ["#username"]}
+        ("goto", "https://example.com/login")
     ],
     evaluate={ # Logic run by env.evaluate()
-        "function": "
+        "function": "page_contains",
         "args": ["test"]
     }
 )

@@ -60,7 +59,10 @@ Both `setup` and `evaluate` accept configurations defining function calls within
 * **Purpose:** Determines task success after the agent finishes.
 * **Execution:** Triggered by `await env.evaluate()`.
 * **Result:** The return value of `env.evaluate()`, often a reward score (e.g., `1.0` or `0.0`). This is stored in the `reward` field of the [Trajectory](/concepts/trajectory) if linked to a [Job](/concepts/job).
-* **Examples:**
+* **Examples:**
+  * Interactive: `("contains_text", "Success!")`, `("file_exists", "/path/to/output.txt")`. These typically call functions *within* the active environment controller.
+  * QA: `("response_includes", "Paris")`. These functions often check the text stored in `env.final_response` (which comes from the agent's `ResponseAction`).
+* **Note:** Check specific environment or evaluation service documentation for available functions.
 
 ## TaskSet
 

@@ -79,11 +81,13 @@ Load predefined sets from the HUD platform:
 ```python
 from hud import load_taskset
 
-taskset = await load_taskset("OSWorld-Ubuntu
+taskset = await load_taskset("OSWorld-Ubuntu")
 print(f"Number of tasks: {len(taskset)}") # TaskSet acts like a list
 first_task = taskset[0]
 ```
 
+Currently supported TaskSets available via `load_taskset` include OSWorld, GAIA, and WebVoyager subsets.
+
 ### Creating a TaskSet Manually
 
 ```python

@@ -99,4 +103,31 @@ my_taskset = TaskSet(tasks=[task1, task2], description="My set")
 * [Environment](/concepts/environment): Where Tasks are executed and evaluated.
 * [Agent](/concepts/agent): Aims to complete the Task `prompt`.
 * [Job](/concepts/job): Groups runs of different Tasks.
-* [Trajectory](/concepts/trajectory): Records the execution of a Task.
+* [Trajectory](/concepts/trajectory): Records the execution of a Task.
+
+### Defining Question-Answering (QA) Tasks
+
+While HUD excels at interactive tasks, you can also define tasks that are primarily question-answering. The key differences are:
+
+* **`gym`:** You might still use an existing environment type like `"hud-browser"` if you want the QA to happen *within* that context (e.g., asking the agent to answer based on a webpage). For pure QA without environment interaction, a future specific `"qa"` gym type might be introduced, but currently, you'd use an existing type.
+* **`prompt`:** Contains the question for the agent.
+* **`setup`:** Often minimal or unnecessary for pure QA.
+* **`evaluate`:** Defines how to check the agent's final text answer. This typically involves calling a specific evaluation function that compares the agent's final submitted response (see `ResponseAction` in [CLA Details](/advanced/cla-details)) against expected criteria. The `env.final_response` attribute holds the text submitted by the agent via `ResponseAction`.
+* **`target`:** (Recommended) Store the ground truth answer in the `metadata` or potentially a dedicated `target` field for clarity during evaluation function design.
+
+```python
+from hud.task import Task
+
+qa_task = Task(
+    prompt="What is the powerhouse of the cell?",
+    gym="hud-browser", # Or potentially a future "qa" type
+    # No complex setup needed for pure QA
+    setup=(),
+    # Evaluation checks the agent's final submitted text response
+    evaluate=("response_includes", "mitochondria"), # Assumes a function checking env.final_response
+)
+```
+
+The [Agent](/concepts/agent) handling such a task should recognize it doesn't need complex interaction and output a `ResponseAction` containing the final answer. The `env.evaluate()` call then triggers the specified check (like `response_includes`) against the stored response.
+
+### <a name="configuration-styles"></a>Configuration Styles (`setup` and `evaluate`)
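Building on the QA task example added in this hunk, a hedged sketch of assembling a small manual TaskSet of QA tasks and running it with `run_job`; the `TaskSet` import path is an assumption based on the `hud/taskset.py` module listed in this diff:

```python
from hud import run_job
from hud.agent import ClaudeAgent
from hud.task import Task
from hud.taskset import TaskSet  # assumed import path

qa_tasks = [
    Task(
        prompt="What is the capital of France?",
        gym="hud-browser",
        setup=(),
        evaluate=("response_includes", "Paris"),
    ),
    Task(
        prompt="What is the powerhouse of the cell?",
        gym="hud-browser",
        setup=(),
        evaluate=("response_includes", "mitochondria"),
    ),
]

my_qa_set = TaskSet(tasks=qa_tasks, description="Tiny QA smoke-test set")

# Inside an async context:
# job = await run_job(ClaudeAgent, my_qa_set, "qa-smoke-test")
```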
{hud_python-0.2.0 → hud_python-0.2.2}/docs/docs.json

@@ -14,6 +14,7 @@
       "group": "Getting Started",
       "pages": [
         "quickstart",
+        "running-your-agent",
         "installation"
       ]
     },

@@ -28,12 +29,20 @@
         "concepts/trajectory"
       ]
     },
+    {
+      "group": "Environments",
+      "pages": [
+        "environments/hud-browser",
+        "environments/hud-ubuntu",
+        "environments/qa"
+      ]
+    },
     {
       "group": "Advanced Topics",
       "pages": [
+        "advanced/cla-details",
         "advanced/custom-environments",
-        "advanced/environment-control"
-        "advanced/cla-details"
+        "advanced/environment-control"
       ]
     },
     {

@@ -59,13 +68,13 @@
     "links": [
       {
         "label": "GitHub",
-        "href": "https://github.com/
+        "href": "https://github.com/hud-evals/hud-sdk"
       }
     ]
   },
   "footer": {
     "socials": {
-      "github": "https://github.com/
+      "github": "https://github.com/hud-evals/hud-sdk",
       "website": "https://hud.so"
     }
   }
hud_python-0.2.2/docs/environments/hud-browser.mdx (new file, 67 lines)

# HUD Browser Environment

## Introduction

The `hud-browser` environment provides a remote Chromium browser instance, managed by Playwright, for agents to interact with websites. It's ideal for tasks involving web navigation, form filling, information retrieval, and testing web applications.

## Setup

Setup actions for the `hud-browser` are defined in the `setup` attribute of a [Task](../concepts/task.mdx) and executed by `hud.gym.make()`. They typically involve browser controller functions.

* **`goto(url: str)`**: Navigates the browser to the specified `url`. Automatically prepends `http://` if no scheme is provided. Waits for `domcontentloaded` (up to 10s timeout) and adds a 1s wait for rendering.
  ```python
  # Example Task Setup:
  setup=[("goto", "https://google.com")]
  ```
* **Other common setup functions coming soon:** `wait_for_element`, `click`, `type`, `set_cookies` etc.

Refer to [Task Setup Configuration](../concepts/task.mdx#setup-configuration) for how to define these.

## Step Interaction

Agents interact with the browser environment by sending a list of [CLA Actions](../advanced/cla-details.mdx) to `env.step()`. An [Adapter](../concepts/adapter.mdx) typically handles the conversion from the agent model's output to the CLA format.

Common CLAs used with `hud-browser`:
* [`ClickAction`](../advanced/cla-details.mdx#mouse-actions)
* [`MoveAction`](../advanced/cla-details.mdx#mouse-actions)
* [`TypeAction`](../advanced/cla-details.mdx#keyboard-actions)
* [`PressAction`](../advanced/cla-details.mdx#keyboard-actions)
* [`ScrollAction`](../advanced/cla-details.mdx#mouse-actions)
* [`DragAction`](../advanced/cla-details.mdx#mouse-actions)
* [`ResponseAction`](../advanced/cla-details.mdx#response-actions) (to submit a final text answer)

*See [CLA Action Details](../advanced/cla-details.mdx) for the full specification.*

## Evaluate

The `evaluate` attribute of a [Task](../concepts/task.mdx) defines how success is measured using `env.evaluate()`. This calls functions within the browser controller.

Built-in evaluation functions for `hud-browser`:

* **`url_match(expected_url: str)`**: Checks if the current browser URL exactly matches `expected_url`. Returns `1.0` for a match, `0.0` otherwise.
  ```python
  # Example Task Evaluation:
  evaluate=("url_match", "https://google.com/search?q=expected")
  ```
* **`page_contains(texts: list[str])`** (alias `contains_text`): Checks if *all* strings in `texts` are present in `page.content()`. Returns `1.0` if all texts are found, `0.0` otherwise.
  ```python
  # Example Task Evaluation:
  evaluate=("page_contains", ["Search Results", "About 1,000,000 results"])
  ```
* **`sheet_contains(texts: list[str])`**: Custom function for Google Sheets. Returns `1.0` if any text is found, `0.0` otherwise.
  ```python
  # Example Task Evaluation:
  evaluate=("sheet_contains", ["Expected value in cell A1"])
  ```
* **`cookie_exists(cookie_names: list[str])`**: Checks if all cookies in `cookie_names` exist in `context.cookies()`. Returns `1.0` if all exist, `0.0` otherwise.
  ```python
  # Example Task Evaluation:
  evaluate=("cookie_exists", ["session_id", "user_pref"])
  ```
* **`cookie_match(name_value_pairs: list[str])`**: Checks if cookies exist *and* match expected values. `name_value_pairs` format: `[name1, value1, name2, value2, ...]`. Returns `1.0` if all match, `0.0` otherwise.
  ```python
  # Example Task Evaluation:
  evaluate=("cookie_match", ["user_id", "12345", "theme", "dark"])
  ```

Refer to [Task Evaluation Configuration](../concepts/task.mdx#evaluation-configuration) for more details.
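For reference, a short sketch of how the setup and evaluation functions listed above combine in Task definitions; the prompts and URLs are illustrative:

```python
from hud.task import Task

# Navigate, then check that expected text appears on the final page
search_task = Task(
    prompt="Search for 'capybara' and open the results page",
    gym="hud-browser",
    setup=[("goto", "https://google.com")],
    evaluate=("page_contains", ["capybara"]),  # alias: contains_text
)

# Navigate, then check that logging in left the expected cookie behind
login_task = Task(
    prompt="Log in to example.com with username 'test'",
    gym="hud-browser",
    setup=[("goto", "https://example.com/login")],
    evaluate=("cookie_exists", ["session_id"]),
)
```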
hud_python-0.2.2/docs/environments/hud-ubuntu.mdx (new file, 55 lines)

# HUD Ubuntu Environment

## Introduction

The `hud-ubuntu` environment provides a remote Ubuntu OS instance with a graphical desktop, accessed via a VNC connection displayed in the browser. It's suitable for tasks requiring interaction with GUI applications, the file system, or running specific software within a Linux desktop environment.

## Setup

The environment setup simply launches the Ubuntu desktop session within the VNC viewer.

*Specific pre-launch setup functions (e.g., pre-installing packages, setting environment variables) are planned for future releases.*

Currently, any necessary setup (like installing software or creating files) must be performed by the agent *after* the environment starts, using standard interaction actions.

Refer to [Task Setup Configuration](../concepts/task.mdx#setup-configuration) for the general concept of how setup steps *could* be defined in a Task, although they are not currently implemented for this specific environment.

## Step Interaction

Agents interact with the Ubuntu desktop environment by sending a list of [CLA Actions](../advanced/cla-details.mdx) to `env.step()`. An [Adapter](../concepts/adapter.mdx) typically handles the conversion from the agent model's output to the CLA format.

Available CLA actions for interacting with the graphical desktop:

**Keyboard Actions:**
* [`TypeAction`](../advanced/cla-details.mdx#keyboard-actions): Simulates typing text into the focused application or window element.
  ```python
  # Example: Typing into a text editor
  TypeAction(text="Hello, Ubuntu!")
  ```
* [`PressAction`](../advanced/cla-details.mdx#keyboard-actions): For sending hotkeys (e.g., `Ctrl+C`, `Alt+F4`).
* [`KeyDownAction` / `KeyUpAction`](../advanced/cla-details.mdx#keyboard-actions): For holding/releasing modifier keys (e.g., holding Shift while clicking).

**Mouse Actions:**
* [`ClickAction`](../advanced/cla-details.mdx#mouse-actions): To click on GUI elements (buttons, icons, menus, etc.).
* [`MoveAction`](../advanced/cla-details.mdx#mouse-actions): To move the mouse cursor to specific coordinates or elements.
* [`ScrollAction`](../advanced/cla-details.mdx#mouse-actions): To scroll within windows or applications.
* [`DragAction`](../advanced/cla-details.mdx#mouse-actions): To perform drag-and-drop operations.

**Control & Response Actions:**
* [`WaitAction`](../advanced/cla-details.mdx#control-actions): To introduce pauses if needed.
* [`ResponseAction`](../advanced/cla-details.mdx#response-actions): Used by the agent to submit its final answer or result text.

**Other Actions (less common for direct agent use):**
* [`ScreenshotFetch`](../advanced/cla-details.mdx#fetch-actions-get-information): Usually handled internally by the environment/agent loop.
* [`PositionFetch`](../advanced/cla-details.mdx#fetch-actions-get-information): Usually handled internally.
* [`CustomAction`](../advanced/cla-details.mdx#custom-actions): For potential future custom environment extensions.

*See [CLA Action Details](../advanced/cla-details.mdx) for the full specification of each action and its parameters.*

## Evaluate

*Specific evaluation functions for `hud-ubuntu` (e.g., checking file content, application state, process status) are planned for future releases.*

Currently, evaluation often relies on the agent submitting a final answer via `ResponseAction`, which can then be checked using generic QA evaluators defined in the [Task](../concepts/task.mdx) (like `response_includes`, `response_matches`). Alternatively, evaluation might involve visually inspecting the final state via the VNC connection or checking logs if the agent was tasked with producing specific output.

Refer to [Task Evaluation Configuration](../concepts/task.mdx#evaluation-configuration) for the general concept of defining evaluation steps.