hud-python 0.3.0__tar.gz → 0.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

This version of hud-python might be problematic.

Files changed (301)
  1. {hud_python-0.3.0 → hud_python-0.3.1}/.gitignore +5 -1
  2. {hud_python-0.3.0 → hud_python-0.3.1}/PKG-INFO +20 -14
  3. {hud_python-0.3.0 → hud_python-0.3.1}/README.md +3 -3
  4. hud_python-0.3.1/environments/README.md +407 -0
  5. hud_python-0.3.1/environments/docker_debug.py +701 -0
  6. hud_python-0.3.1/environments/remote_browser/Dockerfile +23 -0
  7. hud_python-0.3.1/environments/remote_browser/README.md +62 -0
  8. hud_python-0.3.1/environments/remote_browser/pyproject.toml +26 -0
  9. hud_python-0.3.1/environments/remote_browser/src/hud_controller/__init__.py +3 -0
  10. hud_python-0.3.1/environments/remote_browser/src/hud_controller/__main__.py +13 -0
  11. hud_python-0.3.1/environments/remote_browser/src/hud_controller/browser_computer_tool.py +335 -0
  12. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/__init__.py +22 -0
  13. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/context.py +77 -0
  14. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/cookie_exists.py +107 -0
  15. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/cookie_match.py +142 -0
  16. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/history_length.py +78 -0
  17. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/page_contains.py +106 -0
  18. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/raw_last_action_is.py +81 -0
  19. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/registry.py +157 -0
  20. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/selector_history.py +69 -0
  21. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/sheet_contains.py +123 -0
  22. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/sheets_cell_values.py +176 -0
  23. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/url_match.py +84 -0
  24. hud_python-0.3.1/environments/remote_browser/src/hud_controller/evaluators/verify_type_action.py +102 -0
  25. hud_python-0.3.1/environments/remote_browser/src/hud_controller/playwright_with_memory.py +144 -0
  26. hud_python-0.3.1/environments/remote_browser/src/hud_controller/problems/__init__.py +11 -0
  27. hud_python-0.3.1/environments/remote_browser/src/hud_controller/problems/navigate_and_verify.py +28 -0
  28. hud_python-0.3.1/environments/remote_browser/src/hud_controller/problems/registry.py +91 -0
  29. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/README.md +110 -0
  30. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/__init__.py +33 -0
  31. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/anchorbrowser.py +164 -0
  32. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/base.py +96 -0
  33. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/browserbase.py +176 -0
  34. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/hyperbrowser.py +244 -0
  35. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/kernel.py +13 -0
  36. hud_python-0.3.1/environments/remote_browser/src/hud_controller/providers/steel.py +203 -0
  37. hud_python-0.3.1/environments/remote_browser/src/hud_controller/runtime.py +210 -0
  38. hud_python-0.3.1/environments/remote_browser/src/hud_controller/server.py +336 -0
  39. hud_python-0.3.1/environments/remote_browser/src/hud_controller/setup/__init__.py +15 -0
  40. hud_python-0.3.1/environments/remote_browser/src/hud_controller/setup/cookies.py +95 -0
  41. hud_python-0.3.1/environments/remote_browser/src/hud_controller/setup/interact.py +154 -0
  42. hud_python-0.3.1/environments/remote_browser/src/hud_controller/setup/load_html.py +66 -0
  43. hud_python-0.3.1/environments/remote_browser/src/hud_controller/setup/navigate.py +54 -0
  44. hud_python-0.3.1/environments/remote_browser/src/hud_controller/setup/registry.py +104 -0
  45. hud_python-0.3.1/environments/remote_browser/src/hud_controller/setup/sheets.py +303 -0
  46. hud_python-0.3.1/environments/remote_browser/test_mcp.sh +4 -0
  47. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/Dockerfile +2 -1
  48. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/README.md +121 -3
  49. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/pyproject.toml +2 -1
  50. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/server.py +9 -1
  51. {hud_python-0.3.0 → hud_python-0.3.1}/examples/agents_tools/mcp_claude_agent.py +7 -12
  52. {hud_python-0.3.0 → hud_python-0.3.1}/examples/agents_tools/mcp_openai_agent.py +4 -4
  53. {hud_python-0.3.0 → hud_python-0.3.1}/examples/agents_tools/mcp_test.ipynb +1 -1
  54. {hud_python-0.3.0 → hud_python-0.3.1}/examples/agents_tools/mcp_use_agent.py +2 -2
  55. {hud_python-0.3.0 → hud_python-0.3.1}/examples/agents_tools/simple_task_example.py +12 -11
  56. hud_python-0.3.1/examples/environments/gmail_local.py +74 -0
  57. hud_python-0.3.1/examples/environments/gmail_remote.py +74 -0
  58. {hud_python-0.3.0 → hud_python-0.3.1}/examples/environments/resources_example.py +1 -1
  59. {hud_python-0.3.0 → hud_python-0.3.1}/examples/environments/simple_browser_example.py +4 -4
  60. hud_python-0.3.1/examples/evaluations/eval.py +124 -0
  61. hud_python-0.3.1/examples/evaluations/telemetry_and_datasets.ipynb +350 -0
  62. {hud_python-0.3.0 → hud_python-0.3.1}/hud/__init__.py +7 -4
  63. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/common/adapter.py +14 -3
  64. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/common/tests/test_adapter.py +16 -4
  65. hud_python-0.3.1/hud/datasets.py +188 -0
  66. {hud_python-0.3.0 → hud_python-0.3.1}/hud/env/docker_client.py +14 -2
  67. {hud_python-0.3.0 → hud_python-0.3.1}/hud/env/local_docker_client.py +28 -6
  68. {hud_python-0.3.0 → hud_python-0.3.1}/hud/gym.py +0 -9
  69. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.1/hud/mcp}/__init__.py +2 -0
  70. hud_python-0.3.1/hud/mcp/base.py +631 -0
  71. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.1/hud/mcp}/claude.py +52 -47
  72. hud_python-0.3.1/hud/mcp/client.py +312 -0
  73. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.1/hud/mcp}/langchain.py +52 -33
  74. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.1/hud/mcp}/openai.py +56 -40
  75. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.1/hud/mcp}/tests/test_base.py +129 -54
  76. hud_python-0.3.1/hud/mcp/tests/test_claude.py +294 -0
  77. hud_python-0.3.1/hud/mcp/tests/test_client.py +324 -0
  78. hud_python-0.3.1/hud/mcp/tests/test_openai.py +238 -0
  79. {hud_python-0.3.0 → hud_python-0.3.1}/hud/settings.py +6 -0
  80. {hud_python-0.3.0 → hud_python-0.3.1}/hud/task.py +1 -88
  81. {hud_python-0.3.0 → hud_python-0.3.1}/hud/taskset.py +2 -23
  82. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/__init__.py +5 -0
  83. hud_python-0.3.1/hud/telemetry/_trace.py +347 -0
  84. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/context.py +79 -0
  85. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/exporter.py +165 -6
  86. hud_python-0.3.1/hud/telemetry/job.py +141 -0
  87. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/tests/test_trace.py +36 -25
  88. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/__init__.py +14 -1
  89. hud_python-0.3.1/hud/tools/executors/__init__.py +30 -0
  90. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/executors/pyautogui.py +84 -50
  91. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/executors/tests/test_pyautogui_executor.py +4 -1
  92. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/playwright_tool.py +73 -67
  93. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_edit.py +8 -1
  94. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_tools.py +3 -0
  95. {hud_python-0.3.0 → hud_python-0.3.1}/hud/trajectory.py +5 -1
  96. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/tests/test_version.py +1 -1
  97. {hud_python-0.3.0 → hud_python-0.3.1}/hud/version.py +1 -1
  98. {hud_python-0.3.0 → hud_python-0.3.1}/pyproject.toml +31 -17
  99. hud_python-0.3.0/environments/README.md +0 -163
  100. hud_python-0.3.0/environments/simple_browser/DEPLOYMENT.md +0 -159
  101. hud_python-0.3.0/examples/environments/gmail_local.py +0 -66
  102. hud_python-0.3.0/hud/evaluators/__init__.py +0 -9
  103. hud_python-0.3.0/hud/evaluators/base.py +0 -32
  104. hud_python-0.3.0/hud/evaluators/inspect.py +0 -24
  105. hud_python-0.3.0/hud/evaluators/judge.py +0 -189
  106. hud_python-0.3.0/hud/evaluators/match.py +0 -156
  107. hud_python-0.3.0/hud/evaluators/remote.py +0 -65
  108. hud_python-0.3.0/hud/evaluators/tests/test_inspect.py +0 -12
  109. hud_python-0.3.0/hud/evaluators/tests/test_judge.py +0 -231
  110. hud_python-0.3.0/hud/evaluators/tests/test_match.py +0 -115
  111. hud_python-0.3.0/hud/evaluators/tests/test_remote.py +0 -98
  112. hud_python-0.3.0/hud/mcp_agent/base.py +0 -723
  113. hud_python-0.3.0/hud/telemetry/_trace.py +0 -184
  114. hud_python-0.3.0/hud/tools/executors/__init__.py +0 -13
  115. hud_python-0.3.0/hud/utils/tests/__init__.py +0 -0
  116. {hud_python-0.3.0 → hud_python-0.3.1}/.env.example +0 -0
  117. {hud_python-0.3.0 → hud_python-0.3.1}/.github/workflows/ci.yml +0 -0
  118. {hud_python-0.3.0 → hud_python-0.3.1}/.github/workflows/release.yml +0 -0
  119. {hud_python-0.3.0 → hud_python-0.3.1}/LICENSE +0 -0
  120. {hud_python-0.3.0 → hud_python-0.3.1}/MANIFEST.in +0 -0
  121. {hud_python-0.3.0 → hud_python-0.3.1}/docs/advanced/cla-details.mdx +0 -0
  122. {hud_python-0.3.0 → hud_python-0.3.1}/docs/advanced/environment-control.mdx +0 -0
  123. {hud_python-0.3.0 → hud_python-0.3.1}/docs/advanced/tracing.mdx +0 -0
  124. {hud_python-0.3.0 → hud_python-0.3.1}/docs/advanced/uploading.mdx +0 -0
  125. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/adapters.mdx +0 -0
  126. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/env.mdx +0 -0
  127. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/gym.mdx +0 -0
  128. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/job.mdx +0 -0
  129. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/task.mdx +0 -0
  130. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/taskset.mdx +0 -0
  131. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/telemetry.mdx +0 -0
  132. {hud_python-0.3.0 → hud_python-0.3.1}/docs/api-reference/trajectory.mdx +0 -0
  133. {hud_python-0.3.0 → hud_python-0.3.1}/docs/concepts/adapter.mdx +0 -0
  134. {hud_python-0.3.0 → hud_python-0.3.1}/docs/concepts/agent.mdx +0 -0
  135. {hud_python-0.3.0 → hud_python-0.3.1}/docs/concepts/environment.mdx +0 -0
  136. {hud_python-0.3.0 → hud_python-0.3.1}/docs/concepts/job.mdx +0 -0
  137. {hud_python-0.3.0 → hud_python-0.3.1}/docs/concepts/task.mdx +0 -0
  138. {hud_python-0.3.0 → hud_python-0.3.1}/docs/concepts/trajectory.mdx +0 -0
  139. {hud_python-0.3.0 → hud_python-0.3.1}/docs/docs.json +0 -0
  140. {hud_python-0.3.0 → hud_python-0.3.1}/docs/environment-creation.mdx +0 -0
  141. {hud_python-0.3.0 → hud_python-0.3.1}/docs/environments/browser.mdx +0 -0
  142. {hud_python-0.3.0 → hud_python-0.3.1}/docs/environments/custom-environments.mdx +0 -0
  143. {hud_python-0.3.0 → hud_python-0.3.1}/docs/environments/custom.mdx +0 -0
  144. {hud_python-0.3.0 → hud_python-0.3.1}/docs/environments/osworld-ubuntu.mdx +0 -0
  145. {hud_python-0.3.0 → hud_python-0.3.1}/docs/environments/qa.mdx +0 -0
  146. {hud_python-0.3.0 → hud_python-0.3.1}/docs/examples/alignment-evaluation.mdx +0 -0
  147. {hud_python-0.3.0 → hud_python-0.3.1}/docs/examples/benchmarking-agents.mdx +0 -0
  148. {hud_python-0.3.0 → hud_python-0.3.1}/docs/examples/custom-os-env.mdx +0 -0
  149. {hud_python-0.3.0 → hud_python-0.3.1}/docs/examples/mcp-agent-tracing.mdx +0 -0
  150. {hud_python-0.3.0 → hud_python-0.3.1}/docs/examples/web-app-testing.mdx +0 -0
  151. {hud_python-0.3.0 → hud_python-0.3.1}/docs/examples/web-mocks.mdx +0 -0
  152. {hud_python-0.3.0 → hud_python-0.3.1}/docs/favicon.png +0 -0
  153. {hud_python-0.3.0 → hud_python-0.3.1}/docs/logo/hud_logo.svg +0 -0
  154. {hud_python-0.3.0 → hud_python-0.3.1}/docs/logo/hud_logo_dark.svg +0 -0
  155. {hud_python-0.3.0 → hud_python-0.3.1}/docs/quickstart.mdx +0 -0
  156. {hud_python-0.3.0 → hud_python-0.3.1}/docs/running-your-agent.mdx +0 -0
  157. {hud_python-0.3.0 → hud_python-0.3.1}/docs/task-creation.mdx +0 -0
  158. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/Dockerfile +0 -0
  159. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/pyproject.toml +0 -0
  160. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/__init__.py +0 -0
  161. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/display_adapters.py +0 -0
  162. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/emulator.py +0 -0
  163. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/evaluator.py +0 -0
  164. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/kill.py +0 -0
  165. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/main.py +0 -0
  166. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/setup.py +0 -0
  167. {hud_python-0.3.0 → hud_python-0.3.1}/environments/pokemon_controller/src/hud_controller/step.py +0 -0
  168. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/Dockerfile +0 -0
  169. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/pyproject.toml +0 -0
  170. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/__init__.py +0 -0
  171. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/evaluate/__init__.py +0 -0
  172. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/evaluate/matchers.py +0 -0
  173. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/info.py +0 -0
  174. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/setup/__init__.py +0 -0
  175. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/setup/question.py +0 -0
  176. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/step.py +0 -0
  177. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/utils/__init__.py +0 -0
  178. {hud_python-0.3.0 → hud_python-0.3.1}/environments/qa_controller/src/hud_controller/utils/state.py +0 -0
  179. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/.dockerignore +0 -0
  180. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/.gitignore +0 -0
  181. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/README.md +0 -0
  182. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/README.md +0 -0
  183. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/backend/main.py +0 -0
  184. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/backend/pyproject.toml +0 -0
  185. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/app/globals.css +0 -0
  186. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/app/layout.tsx +0 -0
  187. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/app/page.tsx +0 -0
  188. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/next.config.js +0 -0
  189. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/package-lock.json +0 -0
  190. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/package.json +0 -0
  191. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/postcss.config.js +0 -0
  192. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/tailwind.config.js +0 -0
  193. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/frontend/tsconfig.json +0 -0
  194. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/apps/todo/launch.py +0 -0
  195. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/docker-compose.yml +0 -0
  196. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/README.md +0 -0
  197. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/__init__.py +0 -0
  198. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/__main__.py +0 -0
  199. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/evaluators/__init__.py +0 -0
  200. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/evaluators/context.py +0 -0
  201. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/evaluators/registry.py +0 -0
  202. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/evaluators/todo.py +0 -0
  203. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/problems/__init__.py +0 -0
  204. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/problems/registry.py +0 -0
  205. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/problems/todo.py +0 -0
  206. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/runtime.py +0 -0
  207. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/services.py +0 -0
  208. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/setup/__init__.py +0 -0
  209. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/setup/registry.py +0 -0
  210. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/src/hud_controller/setup/todo.py +0 -0
  211. {hud_python-0.3.0 → hud_python-0.3.1}/environments/simple_browser/start.sh +0 -0
  212. {hud_python-0.3.0 → hud_python-0.3.1}/examples/README.md +0 -0
  213. {hud_python-0.3.0 → hud_python-0.3.1}/examples/agents_tools/browser_use.ipynb +0 -0
  214. {hud_python-0.3.0 → hud_python-0.3.1}/examples/agents_tools/sensitive_data.ipynb +0 -0
  215. {hud_python-0.3.0 → hud_python-0.3.1}/examples/environments/pokemon_local.ipynb +0 -0
  216. {hud_python-0.3.0 → hud_python-0.3.1}/examples/environments/pokemon_remote.ipynb +0 -0
  217. {hud_python-0.3.0 → hud_python-0.3.1}/examples/environments/remote.ipynb +0 -0
  218. {hud_python-0.3.0 → hud_python-0.3.1}/examples/evaluations/osworld.ipynb +0 -0
  219. {hud_python-0.3.0 → hud_python-0.3.1}/examples/evaluations/sheetbench_direct_example.ipynb +0 -0
  220. {hud_python-0.3.0 → hud_python-0.3.1}/examples/evaluations/tasks.ipynb +0 -0
  221. {hud_python-0.3.0 → hud_python-0.3.1}/examples/evaluations/wordle_example.ipynb +0 -0
  222. {hud_python-0.3.0 → hud_python-0.3.1}/examples/sheets_bench_cua_example.ipynb +0 -0
  223. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/__init__.py +0 -0
  224. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/claude/__init__.py +0 -0
  225. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/claude/adapter.py +0 -0
  226. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/claude/tests/__init__.py +0 -0
  227. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/claude/tests/test_adapter.py +0 -0
  228. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/common/__init__.py +0 -0
  229. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/common/tests/__init__.py +0 -0
  230. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/common/types.py +0 -0
  231. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/operator/__init__.py +0 -0
  232. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/operator/adapter.py +0 -0
  233. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/operator/tests/__init__.py +0 -0
  234. {hud_python-0.3.0 → hud_python-0.3.1}/hud/adapters/operator/tests/test_adapter.py +0 -0
  235. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/__init__.py +0 -0
  236. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/base.py +0 -0
  237. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/claude.py +0 -0
  238. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/claude_plays_pokemon.py +0 -0
  239. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/langchain.py +0 -0
  240. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/misc/__init__.py +0 -0
  241. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/misc/response_agent.py +0 -0
  242. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/operator.py +0 -0
  243. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/tests/__init__.py +0 -0
  244. {hud_python-0.3.0 → hud_python-0.3.1}/hud/agent/tests/test_base.py +0 -0
  245. {hud_python-0.3.0 → hud_python-0.3.1}/hud/env/__init__.py +0 -0
  246. {hud_python-0.3.0 → hud_python-0.3.1}/hud/env/client.py +0 -0
  247. {hud_python-0.3.0 → hud_python-0.3.1}/hud/env/environment.py +0 -0
  248. {hud_python-0.3.0 → hud_python-0.3.1}/hud/env/remote_client.py +0 -0
  249. {hud_python-0.3.0 → hud_python-0.3.1}/hud/env/remote_docker_client.py +0 -0
  250. {hud_python-0.3.0 → hud_python-0.3.1}/hud/exceptions.py +0 -0
  251. {hud_python-0.3.0 → hud_python-0.3.1}/hud/job.py +0 -0
  252. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.1/hud/mcp}/tests/__init__.py +0 -0
  253. {hud_python-0.3.0 → hud_python-0.3.1}/hud/py.typed +0 -0
  254. {hud_python-0.3.0 → hud_python-0.3.1}/hud/server/__init__.py +0 -0
  255. {hud_python-0.3.0 → hud_python-0.3.1}/hud/server/requests.py +0 -0
  256. {hud_python-0.3.0/hud/evaluators → hud_python-0.3.1/hud/server}/tests/__init__.py +0 -0
  257. {hud_python-0.3.0 → hud_python-0.3.1}/hud/server/tests/test_requests.py +0 -0
  258. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/instrumentation/__init__.py +0 -0
  259. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/instrumentation/mcp.py +0 -0
  260. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/instrumentation/registry.py +0 -0
  261. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/mcp_models.py +0 -0
  262. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/tests/__init__.py +0 -0
  263. {hud_python-0.3.0 → hud_python-0.3.1}/hud/telemetry/tests/test_context.py +0 -0
  264. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/base.py +0 -0
  265. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/bash.py +0 -0
  266. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/computer/__init__.py +0 -0
  267. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/computer/anthropic.py +0 -0
  268. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/computer/hud.py +0 -0
  269. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/computer/openai.py +0 -0
  270. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/edit.py +0 -0
  271. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/executors/base.py +0 -0
  272. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/executors/tests/__init__.py +0 -0
  273. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/executors/tests/test_base_executor.py +0 -0
  274. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/executors/xdo.py +0 -0
  275. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/helper/README.md +0 -0
  276. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/helper/__init__.py +0 -0
  277. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/helper/mcp_server.py +0 -0
  278. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/helper/server_initialization.py +0 -0
  279. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/helper/utils.py +0 -0
  280. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/__init__.py +0 -0
  281. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_bash.py +0 -0
  282. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_computer.py +0 -0
  283. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_computer_actions.py +0 -0
  284. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_init.py +0 -0
  285. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_playwright_tool.py +0 -0
  286. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/tests/test_utils.py +0 -0
  287. {hud_python-0.3.0 → hud_python-0.3.1}/hud/tools/utils.py +0 -0
  288. {hud_python-0.3.0 → hud_python-0.3.1}/hud/types.py +0 -0
  289. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/__init__.py +0 -0
  290. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/agent.py +0 -0
  291. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/common.py +0 -0
  292. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/config.py +0 -0
  293. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/misc.py +0 -0
  294. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/progress.py +0 -0
  295. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/telemetry.py +0 -0
  296. {hud_python-0.3.0/hud/server → hud_python-0.3.1/hud/utils}/tests/__init__.py +0 -0
  297. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/tests/test_common.py +0 -0
  298. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/tests/test_config.py +0 -0
  299. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/tests/test_init.py +0 -0
  300. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/tests/test_progress.py +0 -0
  301. {hud_python-0.3.0 → hud_python-0.3.1}/hud/utils/tests/test_telemetry.py +0 -0
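Entries 69 to 75 in the list above show the `hud/mcp_agent` package being moved to `hud/mcp`. Code that has to run against both versions could guard its import along the following lines. This is only a sketch: `import_first_available` is a hypothetical helper, not part of the SDK, and nothing in this listing shows which names each package exports, so the example imports the module itself and nothing more.

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that can be imported."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {module_names} could be imported") from last_error


# Assumed usage, based only on the rename shown in the file list:
# "hud.mcp" is the 0.3.1 location, "hud.mcp_agent" the 0.3.0 location.
# mcp_pkg = import_first_available("hud.mcp", "hud.mcp_agent")
```

Trying the new path first keeps the fallback inert once every environment has been upgraded.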
@@ -28,4 +28,8 @@ TODO.md
 
  .coverage
 
- *.log
+ *.log
+
+ /ref/
+
+ .cursor/
@@ -1,9 +1,9 @@
  Metadata-Version: 2.4
  Name: hud-python
- Version: 0.3.0
+ Version: 0.3.1
  Summary: SDK for the HUD platform.
- Project-URL: Homepage, https://github.com/hud-evals/hud-sdk
- Project-URL: Bug Tracker, https://github.com/hud-evals/hud-sdk/issues
+ Project-URL: Homepage, https://github.com/hud-evals/hud-python
+ Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
  Project-URL: Documentation, https://docs.hud.so
  Author-email: HUD SDK <founders@hud.so>
  License: MIT License
@@ -35,28 +35,22 @@ Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Programming Language :: Python :: 3.13
  Requires-Python: <3.14,>=3.11
- Requires-Dist: aiodocker>=0.24.0
  Requires-Dist: anthropic
+ Requires-Dist: datasets>=4.0.0
  Requires-Dist: dotenv>=0.9.9
  Requires-Dist: httpx<1,>=0.23.0
- Requires-Dist: inspect-ai>=0.3.80
- Requires-Dist: ipykernel
  Requires-Dist: langchain
  Requires-Dist: langchain-anthropic
  Requires-Dist: langchain-openai
  Requires-Dist: mcp-use>=1.3.7
  Requires-Dist: mcp==1.12.2
- Requires-Dist: numpy
  Requires-Dist: openai
  Requires-Dist: pathspec>=0.12.1
- Requires-Dist: pillow>=11.1.0
- Requires-Dist: pyautogui>=0.9.54
  Requires-Dist: pydantic-settings<3,>=2
  Requires-Dist: pydantic<3,>=2
- Requires-Dist: textdistance<5,>=4.5.0
- Requires-Dist: toml>=0.10.2
  Requires-Dist: wrapt>=1.14.0
  Provides-Extra: dev
+ Requires-Dist: aiodocker>=0.24.0; extra == 'dev'
  Requires-Dist: anthropic; extra == 'dev'
  Requires-Dist: dotenv; extra == 'dev'
  Requires-Dist: ipykernel; extra == 'dev'
@@ -64,17 +58,29 @@ Requires-Dist: ipython<9; extra == 'dev'
  Requires-Dist: jupyter-client; extra == 'dev'
  Requires-Dist: jupyter-core; extra == 'dev'
  Requires-Dist: openai; extra == 'dev'
+ Requires-Dist: pillow>=11.1.0; extra == 'dev'
  Requires-Dist: playwright; extra == 'dev'
+ Requires-Dist: pyautogui>=0.9.54; extra == 'dev'
  Requires-Dist: pyright==1.1.401; extra == 'dev'
  Requires-Dist: pytest-asyncio; extra == 'dev'
  Requires-Dist: pytest-cov; extra == 'dev'
  Requires-Dist: pytest-mock; extra == 'dev'
  Requires-Dist: pytest<9,>=8.1.1; extra == 'dev'
  Requires-Dist: ruff==0.11.8; extra == 'dev'
+ Requires-Dist: toml>=0.10.2; extra == 'dev'
+ Provides-Extra: v2
+ Requires-Dist: aiodocker>=0.24.0; extra == 'v2'
+ Requires-Dist: inspect-ai>=0.3.80; extra == 'v2'
+ Requires-Dist: ipykernel; extra == 'v2'
+ Requires-Dist: numpy; extra == 'v2'
+ Requires-Dist: pillow>=11.1.0; extra == 'v2'
+ Requires-Dist: pyautogui>=0.9.54; extra == 'v2'
+ Requires-Dist: textdistance<5,>=4.5.0; extra == 'v2'
+ Requires-Dist: toml>=0.10.2; extra == 'v2'
  Description-Content-Type: text/markdown
 
  <div align="left">
- <img src="https://raw.githubusercontent.com/hud-evals/hud-sdk/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
+ <img src="https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
  </div>
 
  <h3>
@@ -88,7 +94,7 @@ Evaluate your Computer Use AI agents across web browsers, desktop environments,
  We're here to help with eval strategies, custom environments, or improving your agent architecture!
 
 
- > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
+ > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-python/issues), as the SDK is still evolving!
 
  [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
 
@@ -272,7 +278,7 @@ If you use this SDK in your research, please cite it as follows:
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
  title = {{HUD: An Evaluation Platform for Agents}},
  date = {2025-04},
- url = {https://github.com/hud-evals/hud-sdk},
+ url = {https://github.com/hud-evals/hud-python},
  langid = {en}
  }
  ```
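The PKG-INFO hunks above move several requirements (for example `pillow`, `pyautogui`, `toml`) out of the unconditional dependency list and behind the `dev` and new `v2` extras. One way to see that split at a glance is to group `Requires-Dist` values by their `extra` marker. The following is a rough sketch rather than a full marker parser: `group_by_extra` and `EXTRA_RE` are hypothetical names, the sample strings are copied from the 0.3.1 metadata shown above, and only simple `extra == '...'` markers are handled.

```python
import re
from collections import defaultdict

# Matches markers such as: extra == 'dev'  or  extra == "v2"
EXTRA_RE = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def group_by_extra(requires_dist):
    """Group requirement strings by extra name; None means unconditional."""
    groups = defaultdict(list)
    for line in requires_dist:
        requirement, _, marker = line.partition(";")
        match = EXTRA_RE.search(marker)
        groups[match.group(1) if match else None].append(requirement.strip())
    return dict(groups)


# Sample values taken from the 0.3.1 side of the diff above.
sample = [
    "anthropic",
    "datasets>=4.0.0",
    "pillow>=11.1.0; extra == 'dev'",
    "pyautogui>=0.9.54; extra == 'v2'",
    "textdistance<5,>=4.5.0; extra == 'v2'",
]
grouped = group_by_extra(sample)
```

For an installed copy, the standard library call `importlib.metadata.requires("hud-python")` returns strings of this shape (or `None`), which can be passed to the same function. Code that still relies on the relocated packages would presumably need to install the matching extra explicitly.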
@@ -1,5 +1,5 @@
  <div align="left">
- <img src="https://raw.githubusercontent.com/hud-evals/hud-sdk/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
+ <img src="https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
  </div>
 
  <h3>
@@ -13,7 +13,7 @@ Evaluate your Computer Use AI agents across web browsers, desktop environments,
  We're here to help with eval strategies, custom environments, or improving your agent architecture!
 
 
- > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
+ > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-python/issues), as the SDK is still evolving!
 
  [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
 
@@ -197,7 +197,7 @@ If you use this SDK in your research, please cite it as follows:
197
197
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
198
198
  title = {{HUD: An Evaluation Platform for Agents}},
199
199
  date = {2025-04},
200
- url = {https://github.com/hud-evals/hud-sdk},
200
+ url = {https://github.com/hud-evals/hud-python},
201
201
  langid = {en}
202
202
  }
203
203
  ```
@@ -0,0 +1,407 @@
1
+ # How to Build HUD-Compatible MCP Environments
2
+
3
+ This document is a step-by-step guide for turning *any* piece of software that can run in a Docker container into a **Model Context Protocol (MCP)** environment that the HUD SDK can evaluate or control. We’ll move through six short phases, each with a clear checkpoint.
4
+
5
+ The official MCP lifecycle specification is an excellent companion reference – skim it now, keep it open while you work: [modelcontextprotocol.io › Lifecycle](https://modelcontextprotocol.io/specification/2025-06-18/basic/lifecycle).
6
+
7
+ ---
8
+
9
+ ## Phase Overview
10
+
11
+ | Phase | Goal |
12
+ |-------|------|
13
+ | 1 | A Docker image that *starts* and prints to **stderr** |
14
+ | 2 | A minimal MCP server that responds to `initialize` over **stdio** |
15
+ | 3 | Working `setup`, `evaluate`, and **interaction** tools |
16
+ | 4 | Image launches remotely on the HUD platform & exposes live telemetry |
17
+ | 5 | Fast local iteration with **cursor-mcp** and a tiny `mcp.json` |
18
+ | 6 | Optional polish – registries, optimisation, security, creative ideas |
19
+
20
+ Take the phases one at a time; do **not** jump ahead. Each stage’s checkpoint is the foundation for the next.
21
+
22
+ ---
23
+
24
+ ## Phase 1 – Write a *Simple* Dockerfile
25
+
26
+ **Goal →** the container starts, prints a message to **stderr**, and exits cleanly. Nothing else.
27
+
28
+ Why stderr? In Phase 2 the MCP server will reserve **stdout** for JSON-RPC traffic, so *all* human-readable logs should already go to the other stream.
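The split between the two streams can be sketched in a few lines of Python. This is only an illustration: the helper names `log` and `send_message` are hypothetical and not part of the HUD SDK or the MCP library.

```python
import json
import sys


def log(message: str) -> None:
    """Human-readable diagnostics go to stderr only."""
    print(f"[log] {message}", file=sys.stderr, flush=True)


def send_message(payload: dict) -> None:
    """Protocol traffic goes to stdout, one compact JSON object per line."""
    sys.stdout.write(json.dumps(payload, separators=(",", ":")) + "\n")
    sys.stdout.flush()


if __name__ == "__main__":
    log("hello from the container")
    send_message({"jsonrpc": "2.0", "id": 1, "result": {}})
```

Running the script with `2>/dev/null` should leave only the JSON line, which is what a client reading stdout needs to see.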
29
+
30
+ ### Minimal example
31
+
32
+ ```dockerfile
33
+ FROM python:3.11-slim
34
+
35
+ WORKDIR /app
36
+ COPY . .
37
+
38
+ # Optional: install requirements
39
+ # RUN pip install --no-cache-dir -r requirements.txt
40
+
41
+ # ‼️ Send logs to stderr (stdout remains untouched for MCP)
42
+ CMD [ \
43
+   "python", \
44
+   "-c", \
45
+   "import sys, time; print('hello from the container', file=sys.stderr); time.sleep(1)" \
46
+ ]
47
+ ```
48
+
49
+ Build & run:
50
+
51
+ ```bash
52
+ docker build -t my-environment .
53
+ docker run --rm -it my-environment # look for the log line on stderr
54
+ ```
55
+
56
+ • **One Dockerfile only** – no docker-compose.
57
+ • If you’re building a GUI environment, start from `hudpython/novnc-base:latest` instead and leave VNC configuration for later phases.
58
+
59
+ Checkpoint reached? Congratulations – move on.
60
+
61
+ Need inspiration? Skim the real Dockerfiles used in the example browser environments:
62
+ • [`simple_browser/Dockerfile`](./simple_browser/Dockerfile)
63
+ • [`remote_browser/Dockerfile`](./remote_browser/Dockerfile)
64
+ They follow the exact same pattern – a single file, logs to stderr, nothing fancy.
65
+
66
+ ---
67
+
68
+ ## Phase 2 – Create the MCP Server
69
+
70
+ **Goal →** a Python process that:
71
+ 1. Speaks MCP over **stdio**.
72
+ 2. Responds correctly to the `initialize` request.
73
+ 3. Logs everything to **stderr**.
74
+
75
+ The MCP lifecycle is *initialize → operate → shutdown* (see spec link above).
76
+
77
+ ### Skeleton server (FastMCP)
78
+
79
+ ```python
80
+ import sys
81
+ import logging
82
+ from mcp.server.fastmcp import FastMCP
83
+
84
+ # 1️⃣ Always log to stderr – stdout is reserved for JSON-RPC
85
+ logging.basicConfig(
86
+     stream=sys.stderr,
87
+     level=logging.INFO,
88
+     format='[%(levelname)s] %(asctime)s | %(name)s | %(message)s'
89
+ )
90
+
91
+ mcp = FastMCP("My Environment")
92
+
93
+ from hud.tools.helper import mcp_intialize_wrapper
94
+
95
+ @mcp_intialize_wrapper()
96
+ async def initialize_environment():
97
+     """Heavy one-time setup – start databases, launch background apps, etc."""
98
+     logging.info("starting core services…")
99
+     await start_services()  # your coroutine
100
+     logging.info("services ready")
101
+
102
+ if __name__ == "__main__":
103
+     mcp.run()
104
+ ```
105
+
106
+ *(Replace `start_services()` with whatever takes noticeable startup time – browsers, DBs, X servers, …)*
107
+
108
+ ### Adapt Dockerfile
109
+
110
+ At the end of your Dockerfile, you must launch the MCP server as the container's main process, ensuring it communicates over stdio (stdin/stdout). This is typically done by setting the `CMD` or `ENTRYPOINT` to run your server module directly, for example:
111
+
112
+
113
+ ```dockerfile
114
+ FROM python:3.11-slim
115
+
116
+ WORKDIR /app
117
+ COPY . .
118
+
119
+ # Optional: install requirements
120
+ # RUN pip install --no-cache-dir -r requirements.txt
121
+
122
+ # Replace 'your_module_name' with your actual entrypoint module
+ CMD ["python", "-m", "your_module_name"]
123
+ ```
124
+
125
+ ### Three validation steps (run them **in order**)
126
+
127
+ | # | What you do | Why it matters |
128
+ |---|-------------|----------------|
129
+ | 1 | **Direct stdio test** – pipe the JSON below into your script | Proves the Python code handles `initialize` without any client or Docker noise |
130
+ | 2 | **MCP Inspector** – `npx @modelcontextprotocol/inspector python -m my_package.server` | Lets you click around: view capabilities, tools, resources |
131
+ | 3 | **Inside Docker** – rebuild the image and run it | This is *exactly* how HUD will execute the server |
132
+
133
+ #### JSON for step 1
134
+
135
+ ```jsonc
136
+ { "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {
137
+ "protocolVersion": "2024-11-05",
138
+ "capabilities": {"roots": {"listChanged": true}},
139
+ "clientInfo": {"name": "DevClient", "title": "Dev", "version": "0.0.0"}
140
+ }}
141
+ ```
142
+
143
+ Pipe it:
144
+
145
+ ```bash
146
+ echo '<the-json-above>' | python -m my_package.server
147
+ ```
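The direct stdio test can also be scripted. The sketch below assumes the server reads newline-delimited JSON-RPC from stdin, so the request is serialised onto a single line before it is piped in. The helper names `build_initialize_request` and `first_response` are hypothetical, and `my_package.server` stands in for your own module.

```python
import json
import subprocess
import sys


def build_initialize_request(request_id: int = 1) -> str:
    """Return the initialize request as a single line of compact JSON."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {"roots": {"listChanged": True}},
            "clientInfo": {"name": "DevClient", "title": "Dev", "version": "0.0.0"},
        },
    }
    return json.dumps(request, separators=(",", ":"))


def first_response(server_cmd: list[str], timeout: float = 30.0) -> dict:
    """Pipe the request into the server and parse the first stdout line."""
    proc = subprocess.run(
        server_cmd,
        input=build_initialize_request() + "\n",
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    lines = proc.stdout.splitlines()
    if not lines:
        # Nothing on stdout usually means a crash; stderr holds the logs.
        raise RuntimeError(f"no response on stdout; stderr was:\n{proc.stderr}")
    return json.loads(lines[0])


# Example (assumes your server module is importable):
# print(first_response([sys.executable, "-m", "my_package.server"]))
```

If the first stdout line is not valid JSON, something is probably printing logs to stdout instead of stderr.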
148
+
149
+ If all three validations succeed, you have a real MCP server – time to make it useful.
150
+
151
+ ---
152
+
153
+ ## Phase 3 – Add Setup / Evaluate / Interaction Tools
154
+
155
+ **Goal →** tools are discoverable in the Inspector *and* callable from the HUD SDK.
156
+
157
+ 1. Write **`setup`** and **`evaluate`** tools first – they are *lifecycle* tools and never shown to the LLM.
158
+ 2. Register at least one **interaction** tool (`computer`, `playwright`, or your own).
159
+
160
+ ### Example
161
+
162
+ ```python
163
+ from hud.tools.helper import register_instance_tool
164
+ from hud.tools import HudComputerTool
165
+
166
+ @mcp.tool()
167
+ async def setup(config: dict) -> dict:
168
+     ...  # prepare environment
169
+
170
+ @mcp.tool()
171
+ async def evaluate(config: dict) -> dict:
172
+     ...  # return {"reward": <0-1>, "done": bool}
173
+
174
+ @mcp_intialize_wrapper()  # from hud.tools.helper, as in the Phase 2 skeleton
175
+ async def init():
176
+     register_instance_tool(mcp, "computer", HudComputerTool())
177
+ ```
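To make the `evaluate` return shape concrete, here is a minimal sketch of the scoring logic an evaluator might delegate to. The function name, the `completed` item key, and the partial-credit rule are assumptions for illustration. The guide itself only fixes the result format: a `reward` between 0 and 1 and a boolean `done`.

```python
def todo_completed(items: list[dict], expected_count: int) -> dict:
    """Score a todo list: partial credit up to expected_count completed items."""
    completed = sum(1 for item in items if item.get("completed"))
    if expected_count <= 0:
        reward = 1.0  # nothing was required, so the task is trivially solved
    else:
        reward = min(completed / expected_count, 1.0)  # clamp into [0, 1]
    return {"reward": reward, "done": completed >= expected_count}
```

Keeping the scoring in a plain function like this makes it easy to unit-test without starting the MCP server.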
178
+
179
+ ### Test workflow
180
+
181
+ 1. **Inspector first** – restart the server, refresh the *Tools* tab, confirm the new tools appear.
182
+ 2. **Rebuild the image** – `docker build -t my-environment .`.
183
+ 3. **HUD SDK test** – run a short script like the one below. GUI environments built from `hudpython/novnc-base` still expose a VNC viewer on <http://localhost:8080/vnc.html> – keep it open while testing.
184
+
185
+ ```python
186
+ import asyncio
187
+ from hud import Task
188
+ from hud.mcp import ClaudeMCPAgent
189
+ from hud.telemetry import trace
190
+ from mcp_use import MCPClient
191
+
192
+ async def main():
193
+     # `trace` captures *everything* that happens and sends it to app.hud.so
194
+     with trace("local_test"):
195
+         cfg = {
196
+             "mcp_config": {
197
+                 "local": {"command": "docker", "args": ["run", "--rm", "-i", "my-environment:latest"]}
198
+             }
199
+         }
200
+         client = MCPClient.from_dict(cfg)
201
+
202
+         agent = ClaudeMCPAgent(
203
+             client=client,
204
+             model="claude-3-5-sonnet-20241022",
205
+             allowed_tools=["computer"]
206
+         )
207
+
208
+         task = Task(
209
+             prompt="Mark two todo items as done",
210
+             setup={"function": "todo_seed", "args": {"num_items": 5}},
211
+             evaluate={"function": "todo_completed", "args": {"expected_count": 2}}
212
+         )
213
+
214
+         result = await agent.run(task)
215
+         print(result)
216
+
217
+         await client.close_all_sessions()
218
+
219
+ asyncio.run(main())
220
+ ```
221
+
222
+ The `trace` context manager sends a full timeline of agent actions, tool calls, and rewards to app.hud.so – perfect for debugging.
223
+
224
+ See `examples/agents_tools/simple_task_example.py` and `examples/environments/gmail_local.py` for larger end-to-end demos.
225
+
226
+ ---
227
+
228
+ ## Phase 4 – Remote Deployment & HUD Runner
229
+
230
+ **Goal →** the exact same image runs in parallel on hundreds of instances and exposes additional telemetry so that app.hud.so can visualise the whole lifecycle.
231
+
232
+ ### 1. Publish your image
233
+
234
+ Log in to Docker Hub (or any registry HUD can pull from) and push a tagged build:
235
+
236
+ ```bash
237
+ docker tag my-environment yourdockerhubuser/my-environment:latest
238
+ docker push yourdockerhubuser/my-environment:latest
239
+ ```
240
+
241
+ *(If you’re using a private registry, make sure the HUD worker has pull credentials.)*
242
+
243
+ ### 2. Launch it remotely (gmail_remote pattern)
244
+
245
+ `examples/environments/gmail_remote.py` shows the canonical pattern – a remote MCP server entry that simply runs **the same Docker image**:
246
+
247
+ ```python
248
+ from hud import settings
+ from mcp_use import MCPClient
249
+ # Your image is in a registry, now tell HUD to pull & run it on demand
250
+ config = {
251
+     "mcp_config": {
252
+         "hud": {
253
+             "url": settings.mcp_url,  # Provided by HUD when you create an evaluation run
254
+             "headers": {
255
+                 "Authorization": f"Bearer {settings.api_key}",
256
+                 "Mcp-Image": "yourdockerhubuser/my-environment:latest",  # which image to launch
257
+             },
258
+         }
259
+     }
260
+ }
261
+
262
+ client = MCPClient.from_dict(config)
263
+ ```
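The local and remote configurations differ only in how the server is reached, which a pair of small builders makes explicit. This is a sketch: `local_config` and `remote_config` are hypothetical helpers, and the URL and key in the usage note are placeholders rather than real HUD values.

```python
def local_config(image: str) -> dict:
    """Config that launches the image as a local stdio MCP server via Docker."""
    return {
        "mcp_config": {
            "local": {"command": "docker", "args": ["run", "--rm", "-i", image]}
        }
    }


def remote_config(image: str, mcp_url: str, api_key: str) -> dict:
    """Config that asks the remote runner to launch the same image on demand."""
    return {
        "mcp_config": {
            "hud": {
                "url": mcp_url,
                "headers": {
                    "Authorization": f"Bearer {api_key}",
                    "Mcp-Image": image,  # which image to launch
                },
            }
        }
    }
```

For example, `remote_config("yourdockerhubuser/my-environment:latest", "https://example.invalid/mcp", "sk-placeholder")` yields a dictionary with the same shape as the one above and can be handed to `MCPClient.from_dict`.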
264
+
265
+ _Steps 3 and 4 below are **optional but highly recommended** once the image boots successfully._
266
+
267
+ Spin up **many** agents in parallel by just launching multiple tasks – HUD will queue and start as many containers as resources allow.
268
+
269
+ ### 3. Progress updates during `initialize` (Optional)
270
+
271
+ At remote scale it can take 10-30 s for heavy services to boot. Use `mcp_intialize_wrapper()` with a *progress token* to stream status messages:
272
+
273
+ ```python
274
+ from hud.tools.helper import mcp_intialize_wrapper
275
+
276
+ @mcp_intialize_wrapper()
277
+ async def initialize_environment(session=None, progress_token=None):
278
+     async def send(p, msg):
279
+         if session and progress_token:
280
+             await session.send_progress_notification(progress_token, p, 100, msg)
281
+     await send(10, "starting X11…")
282
+     await start_x11()
283
+     await send(50, "launching browser…")
284
+     await launch_browser()
285
+     await send(100, "ready")
286
+ ```
287
+
288
+ Those messages are displayed live on app.hud.so alongside resource graphs – perfect feedback while you wait.
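The progress pattern can be exercised locally without a real MCP session. In the sketch below, `RecordingSession` is a hypothetical stand-in that only records what would have been sent. The `send_progress_notification` argument order mirrors the snippet above and is otherwise an assumption.

```python
import asyncio


class RecordingSession:
    """Hypothetical stand-in for the MCP session; it only records calls."""

    def __init__(self) -> None:
        self.notifications = []

    async def send_progress_notification(self, token, progress, total, message):
        self.notifications.append((token, progress, total, message))


async def initialize_environment(session=None, progress_token=None):
    async def send(progress, message):
        # Skip silently when the client did not ask for progress updates.
        if session and progress_token:
            await session.send_progress_notification(progress_token, progress, 100, message)

    await send(10, "starting services")
    await asyncio.sleep(0)  # placeholder for real startup work
    await send(100, "ready")


def run_demo():
    """Run the initializer against the stub and return what it recorded."""
    session = RecordingSession()
    asyncio.run(initialize_environment(session, progress_token="tok-1"))
    return session.notifications
```

Calling `initialize_environment()` with no arguments still works, which matters because not every client supplies a progress token.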
289
+
290
+ ### 4. Live telemetry (`telemetry://live`) (Optional)
291
+
292
+ Expose a resource named `telemetry://live`, exactly as in `environments/simple_browser/src/hud_controller/server.py`, that returns a live URL to display on app.hud.so.
293
+
294
+ Once all of the above works you can unleash *hundreds* of concurrent agents on your new environment.
295
+
296
+ ---
297
+
298
+ ## Phase 5 – Automated Iteration with *cursor-mcp*
299
+
300
+ [`cursor-mcp`](https://github.com/hud-evals/cursor-mcp) turns the edit → build → restart → test loop into a single key-press and adds tools to Cursor Agent that can drive the whole workflow for you. The agent reads the MCP spec, your code, and the live server state, then proposes fixes or new tests on its own. It then has access to the MCP tools the environment provides, enabling it to test all functionality, which completes the iteration loop.
301
+
302
+ 1. Add an entry to `.cursor/mcp.json`:
303
+
304
+ ```jsonc
305
+ {
306
+   "mcpServers": {
307
+ "env": {
308
+ "command": "docker",
309
+ "args": ["run", "--rm", "-i", "my-environment:latest"]
310
+ },
311
+ "cursor-manager": {
312
+ "command": "uvx",
313
+ "args": ["cursor-mcp"]
314
+ }
315
+ }
316
+ }
317
+ ```
318
+
319
+ 2. Follow the cursor rules below: rebuild, refresh, test, reflect, repeat.
320
+ 3. Keep the agent open for any messages or issues.
321
+
322
+ ### Cursor rules – paste this once
323
+
324
+ Inside `.cursor/rules/mcp_environment_iteration.mdc` add (or verify) the following so the agent always knows the expected loop:
325
+
326
+ ```mdc
327
+ ---
328
+ description: When making an environment that launches an MCP server, this is the iteration loop
329
+ alwaysApply: false
330
+ ---
331
+ Setting up (also refer to environments/README.md):
332
+ 1. Follow each environment's README.md or any other steps to set it up for the MCP server to be able to directly launch it (such as building the dockerfile)
333
+ 2. Run local tests to make sure the server initializes without immediate errors and stays alive until properly closed. If the server crashes within the first few seconds, the manager will not pick up on it. In that case, go back and debug either the docker run directly or the MCP server by piping an initialization request.
334
+ 3. When the server initialization is stable, use the cursor-manager tool to see the current list of tools and add it if necessary. Take note of the name.
335
+ 4. When working, tell the user to send another message to refresh your list of tools.
336
+
337
+ After setting up, when iterating (will not require a user message ever):
338
+ 1. Look at the environment project and refine/edit/fix files
339
+ 2. Follow its README to set it up for the MCP server (such as building the dockerfile)
340
+ 3. Use the cursor-manager tool to refresh this server (by name)
341
+ 4. See its status using cursor-manager, if it's running then follow with step 5. If it fails, then check the logs using cursor-manager and go back to step 1, but ask the user to reset.
342
+ 5. Use the tools from that server (by name) to test the functionality and edge cases, reflect on the success of your TODOs and think of new things to fix. If the tools are unavailable but the status is running, then ask the user to refresh the user message.
343
+ 6. Review your TODOs, update with new TODOs
344
+ 7. Repeat until the user's high-level goals are reached, or until you are generally extremely happy with the final result
345
+
346
+ In general:
347
+ 1. Try to avoid running direct docker or mcp commands and use the tools. If you want to run a docker command or python mcp server command then ask permission and only use if otherwise completely impossible.
348
+ 2. If at any point the docker build starts breaking on initialize, return to setting up properly
349
+ ```
350
+
351
+ The result: fast, autonomous turnaround times even for complex GUI environments.
352
+
353
+ ---
354
+
355
+ ## Phase 6 – Optional Polish & Extensions
356
+
357
+ ### Deeper dive into registries
358
+
359
+ An environment often needs *structured knowledge* about tasks, evaluation logic, or problem definitions. The browser examples keep these in three explicit registries:
360
+
361
+ | Registry | Purpose | Example resource URI |
362
+ |----------|---------|----------------------|
363
+ | **Setup** | How to seed the environment before the agent starts | `setup://registry` & `setup://{env}` |
364
+ | **Evaluators** | Functions that decide success & reward | `evaluators://registry` |
365
+ | **Problems** | Bundled benchmarks / tasks with their own setup & evaluate pairs | `problems://registry` |
366
+
367
+ Each registry is just a dictionary mapping a *name* to a *class*. Use a **decorator** to register classes:
368
+
369
+ ```python
370
+ from .registry import setup, evaluator, problem
371
+
372
+ @setup("todo_seed")
373
+ class TodoSeed:
374
+     ...
375
+
376
+ @evaluator("todo_completed")
377
+ class TodoCompleted:
378
+     ...
379
+
380
+ @problem("todo_basic", description="Complete two todo items", difficulty="easy")
381
+ class TodoBasic:
382
+     def get_setup(self):
383
+         return {"function": "todo_seed", "args": {"num_items": 5}}
384
+     def get_evaluation(self):
385
+         return {"function": "todo_completed", "args": {"expected_count": 2}}
386
+ ```
387
+
388
+ Decorators keep registration *next to the implementation* and avoid manual bookkeeping. The server simply exposes the combined metadata through an MCP **resource**. Follow `environments/simple_browser/src/hud_controller/problems/registry.py` as a template and expose the JSON with `@mcp.resource("problems://registry")`.
389
+
390
+ ### Other finishing touches
391
+
392
+ * **Performance** – lazy-load heavy resources, pool DB connections, cache expensive calls.
393
+ * **Security** – sandbox untrusted code, keep secrets in env vars, audit-log every tool call.
394
+ * **Creative ideas** – API simulators, network test-beds, game worlds… if it fits in Docker it can be an MCP environment.
395
+
396
+ ---
397
+
398
+ ## Summary
399
+
400
+ 1. Start with a *plain* Dockerfile – verify it runs.
401
+ 2. Add a minimal FastMCP server – verify with stdio, Inspector, Docker.
402
+ 3. Implement tools – verify discovery + execution.
403
+ 4. Run the same image remotely – verify telemetry.
404
+ 5. Automate the loop with cursor-mcp.
405
+ 6. Polish and extend as inspiration strikes.
406
+
407
+ Happy building – and remember: **stderr is your friend, stdout belongs to MCP.** 🚀