hud-python 0.3.0__tar.gz → 0.3.2__tar.gz

Potentially problematic release.

Files changed (301)
  1. {hud_python-0.3.0 → hud_python-0.3.2}/.gitignore +5 -1
  2. {hud_python-0.3.0 → hud_python-0.3.2}/PKG-INFO +20 -14
  3. {hud_python-0.3.0 → hud_python-0.3.2}/README.md +3 -3
  4. hud_python-0.3.2/environments/README.md +433 -0
  5. hud_python-0.3.2/environments/docker_debug.py +743 -0
  6. hud_python-0.3.2/environments/remote_browser/Dockerfile +23 -0
  7. hud_python-0.3.2/environments/remote_browser/README.md +62 -0
  8. hud_python-0.3.2/environments/remote_browser/pyproject.toml +26 -0
  9. hud_python-0.3.2/environments/remote_browser/src/hud_controller/__init__.py +3 -0
  10. hud_python-0.3.2/environments/remote_browser/src/hud_controller/__main__.py +13 -0
  11. hud_python-0.3.2/environments/remote_browser/src/hud_controller/browser_computer_tool.py +335 -0
  12. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/__init__.py +22 -0
  13. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/context.py +77 -0
  14. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/cookie_exists.py +107 -0
  15. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/cookie_match.py +142 -0
  16. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/history_length.py +78 -0
  17. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/page_contains.py +106 -0
  18. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/raw_last_action_is.py +81 -0
  19. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/registry.py +157 -0
  20. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/selector_history.py +69 -0
  21. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/sheet_contains.py +123 -0
  22. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/sheets_cell_values.py +176 -0
  23. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/url_match.py +84 -0
  24. hud_python-0.3.2/environments/remote_browser/src/hud_controller/evaluators/verify_type_action.py +102 -0
  25. hud_python-0.3.2/environments/remote_browser/src/hud_controller/playwright_with_memory.py +144 -0
  26. hud_python-0.3.2/environments/remote_browser/src/hud_controller/problems/__init__.py +11 -0
  27. hud_python-0.3.2/environments/remote_browser/src/hud_controller/problems/navigate_and_verify.py +28 -0
  28. hud_python-0.3.2/environments/remote_browser/src/hud_controller/problems/registry.py +91 -0
  29. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/README.md +110 -0
  30. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/__init__.py +33 -0
  31. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/anchorbrowser.py +164 -0
  32. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/base.py +96 -0
  33. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/browserbase.py +176 -0
  34. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/hyperbrowser.py +244 -0
  35. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/kernel.py +13 -0
  36. hud_python-0.3.2/environments/remote_browser/src/hud_controller/providers/steel.py +203 -0
  37. hud_python-0.3.2/environments/remote_browser/src/hud_controller/runtime.py +210 -0
  38. hud_python-0.3.2/environments/remote_browser/src/hud_controller/server.py +336 -0
  39. hud_python-0.3.2/environments/remote_browser/src/hud_controller/setup/__init__.py +15 -0
  40. hud_python-0.3.2/environments/remote_browser/src/hud_controller/setup/cookies.py +95 -0
  41. hud_python-0.3.2/environments/remote_browser/src/hud_controller/setup/interact.py +154 -0
  42. hud_python-0.3.2/environments/remote_browser/src/hud_controller/setup/load_html.py +66 -0
  43. hud_python-0.3.2/environments/remote_browser/src/hud_controller/setup/navigate.py +54 -0
  44. hud_python-0.3.2/environments/remote_browser/src/hud_controller/setup/registry.py +104 -0
  45. hud_python-0.3.2/environments/remote_browser/src/hud_controller/setup/sheets.py +303 -0
  46. hud_python-0.3.2/environments/remote_browser/test_mcp.sh +4 -0
  47. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/Dockerfile +2 -1
  48. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/README.md +122 -4
  49. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/pyproject.toml +2 -1
  50. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/server.py +9 -1
  51. {hud_python-0.3.0 → hud_python-0.3.2}/examples/agents_tools/mcp_claude_agent.py +7 -12
  52. {hud_python-0.3.0 → hud_python-0.3.2}/examples/agents_tools/mcp_openai_agent.py +4 -4
  53. {hud_python-0.3.0 → hud_python-0.3.2}/examples/agents_tools/mcp_test.ipynb +1 -1
  54. {hud_python-0.3.0 → hud_python-0.3.2}/examples/agents_tools/mcp_use_agent.py +2 -2
  55. {hud_python-0.3.0 → hud_python-0.3.2}/examples/agents_tools/simple_task_example.py +12 -11
  56. hud_python-0.3.2/examples/environments/gmail_local.py +74 -0
  57. hud_python-0.3.2/examples/environments/gmail_remote.py +74 -0
  58. {hud_python-0.3.0 → hud_python-0.3.2}/examples/environments/resources_example.py +1 -1
  59. {hud_python-0.3.0 → hud_python-0.3.2}/examples/environments/simple_browser_example.py +4 -4
  60. hud_python-0.3.2/examples/evaluations/eval.py +124 -0
  61. hud_python-0.3.2/examples/evaluations/telemetry_and_datasets.ipynb +350 -0
  62. {hud_python-0.3.0 → hud_python-0.3.2}/hud/__init__.py +7 -4
  63. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/common/adapter.py +14 -3
  64. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/common/tests/test_adapter.py +16 -4
  65. hud_python-0.3.2/hud/datasets.py +188 -0
  66. {hud_python-0.3.0 → hud_python-0.3.2}/hud/env/docker_client.py +14 -2
  67. {hud_python-0.3.0 → hud_python-0.3.2}/hud/env/local_docker_client.py +28 -6
  68. {hud_python-0.3.0 → hud_python-0.3.2}/hud/gym.py +0 -9
  69. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.2/hud/mcp}/__init__.py +2 -0
  70. hud_python-0.3.2/hud/mcp/base.py +631 -0
  71. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.2/hud/mcp}/claude.py +52 -47
  72. hud_python-0.3.2/hud/mcp/client.py +312 -0
  73. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.2/hud/mcp}/langchain.py +52 -33
  74. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.2/hud/mcp}/openai.py +56 -40
  75. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.2/hud/mcp}/tests/test_base.py +129 -54
  76. hud_python-0.3.2/hud/mcp/tests/test_claude.py +294 -0
  77. hud_python-0.3.2/hud/mcp/tests/test_client.py +324 -0
  78. hud_python-0.3.2/hud/mcp/tests/test_openai.py +238 -0
  79. {hud_python-0.3.0 → hud_python-0.3.2}/hud/settings.py +6 -0
  80. {hud_python-0.3.0 → hud_python-0.3.2}/hud/task.py +2 -88
  81. {hud_python-0.3.0 → hud_python-0.3.2}/hud/taskset.py +2 -23
  82. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/__init__.py +5 -0
  83. hud_python-0.3.2/hud/telemetry/_trace.py +347 -0
  84. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/context.py +79 -0
  85. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/exporter.py +165 -6
  86. hud_python-0.3.2/hud/telemetry/job.py +141 -0
  87. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/tests/test_trace.py +36 -25
  88. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/__init__.py +14 -1
  89. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/computer/hud.py +13 -0
  90. hud_python-0.3.2/hud/tools/executors/__init__.py +30 -0
  91. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/executors/pyautogui.py +84 -50
  92. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/executors/tests/test_pyautogui_executor.py +4 -1
  93. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/playwright_tool.py +73 -67
  94. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_edit.py +8 -1
  95. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_tools.py +3 -0
  96. {hud_python-0.3.0 → hud_python-0.3.2}/hud/trajectory.py +5 -1
  97. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/tests/test_version.py +1 -1
  98. {hud_python-0.3.0 → hud_python-0.3.2}/hud/version.py +1 -1
  99. {hud_python-0.3.0 → hud_python-0.3.2}/pyproject.toml +31 -17
  100. hud_python-0.3.0/environments/README.md +0 -163
  101. hud_python-0.3.0/environments/simple_browser/DEPLOYMENT.md +0 -159
  102. hud_python-0.3.0/examples/environments/gmail_local.py +0 -66
  103. hud_python-0.3.0/hud/evaluators/__init__.py +0 -9
  104. hud_python-0.3.0/hud/evaluators/base.py +0 -32
  105. hud_python-0.3.0/hud/evaluators/inspect.py +0 -24
  106. hud_python-0.3.0/hud/evaluators/judge.py +0 -189
  107. hud_python-0.3.0/hud/evaluators/match.py +0 -156
  108. hud_python-0.3.0/hud/evaluators/remote.py +0 -65
  109. hud_python-0.3.0/hud/evaluators/tests/test_inspect.py +0 -12
  110. hud_python-0.3.0/hud/evaluators/tests/test_judge.py +0 -231
  111. hud_python-0.3.0/hud/evaluators/tests/test_match.py +0 -115
  112. hud_python-0.3.0/hud/evaluators/tests/test_remote.py +0 -98
  113. hud_python-0.3.0/hud/mcp_agent/base.py +0 -723
  114. hud_python-0.3.0/hud/telemetry/_trace.py +0 -184
  115. hud_python-0.3.0/hud/tools/executors/__init__.py +0 -13
  116. hud_python-0.3.0/hud/utils/tests/__init__.py +0 -0
  117. {hud_python-0.3.0 → hud_python-0.3.2}/.env.example +0 -0
  118. {hud_python-0.3.0 → hud_python-0.3.2}/.github/workflows/ci.yml +0 -0
  119. {hud_python-0.3.0 → hud_python-0.3.2}/.github/workflows/release.yml +0 -0
  120. {hud_python-0.3.0 → hud_python-0.3.2}/LICENSE +0 -0
  121. {hud_python-0.3.0 → hud_python-0.3.2}/MANIFEST.in +0 -0
  122. {hud_python-0.3.0 → hud_python-0.3.2}/docs/advanced/cla-details.mdx +0 -0
  123. {hud_python-0.3.0 → hud_python-0.3.2}/docs/advanced/environment-control.mdx +0 -0
  124. {hud_python-0.3.0 → hud_python-0.3.2}/docs/advanced/tracing.mdx +0 -0
  125. {hud_python-0.3.0 → hud_python-0.3.2}/docs/advanced/uploading.mdx +0 -0
  126. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/adapters.mdx +0 -0
  127. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/env.mdx +0 -0
  128. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/gym.mdx +0 -0
  129. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/job.mdx +0 -0
  130. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/task.mdx +0 -0
  131. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/taskset.mdx +0 -0
  132. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/telemetry.mdx +0 -0
  133. {hud_python-0.3.0 → hud_python-0.3.2}/docs/api-reference/trajectory.mdx +0 -0
  134. {hud_python-0.3.0 → hud_python-0.3.2}/docs/concepts/adapter.mdx +0 -0
  135. {hud_python-0.3.0 → hud_python-0.3.2}/docs/concepts/agent.mdx +0 -0
  136. {hud_python-0.3.0 → hud_python-0.3.2}/docs/concepts/environment.mdx +0 -0
  137. {hud_python-0.3.0 → hud_python-0.3.2}/docs/concepts/job.mdx +0 -0
  138. {hud_python-0.3.0 → hud_python-0.3.2}/docs/concepts/task.mdx +0 -0
  139. {hud_python-0.3.0 → hud_python-0.3.2}/docs/concepts/trajectory.mdx +0 -0
  140. {hud_python-0.3.0 → hud_python-0.3.2}/docs/docs.json +0 -0
  141. {hud_python-0.3.0 → hud_python-0.3.2}/docs/environment-creation.mdx +0 -0
  142. {hud_python-0.3.0 → hud_python-0.3.2}/docs/environments/browser.mdx +0 -0
  143. {hud_python-0.3.0 → hud_python-0.3.2}/docs/environments/custom-environments.mdx +0 -0
  144. {hud_python-0.3.0 → hud_python-0.3.2}/docs/environments/custom.mdx +0 -0
  145. {hud_python-0.3.0 → hud_python-0.3.2}/docs/environments/osworld-ubuntu.mdx +0 -0
  146. {hud_python-0.3.0 → hud_python-0.3.2}/docs/environments/qa.mdx +0 -0
  147. {hud_python-0.3.0 → hud_python-0.3.2}/docs/examples/alignment-evaluation.mdx +0 -0
  148. {hud_python-0.3.0 → hud_python-0.3.2}/docs/examples/benchmarking-agents.mdx +0 -0
  149. {hud_python-0.3.0 → hud_python-0.3.2}/docs/examples/custom-os-env.mdx +0 -0
  150. {hud_python-0.3.0 → hud_python-0.3.2}/docs/examples/mcp-agent-tracing.mdx +0 -0
  151. {hud_python-0.3.0 → hud_python-0.3.2}/docs/examples/web-app-testing.mdx +0 -0
  152. {hud_python-0.3.0 → hud_python-0.3.2}/docs/examples/web-mocks.mdx +0 -0
  153. {hud_python-0.3.0 → hud_python-0.3.2}/docs/favicon.png +0 -0
  154. {hud_python-0.3.0 → hud_python-0.3.2}/docs/logo/hud_logo.svg +0 -0
  155. {hud_python-0.3.0 → hud_python-0.3.2}/docs/logo/hud_logo_dark.svg +0 -0
  156. {hud_python-0.3.0 → hud_python-0.3.2}/docs/quickstart.mdx +0 -0
  157. {hud_python-0.3.0 → hud_python-0.3.2}/docs/running-your-agent.mdx +0 -0
  158. {hud_python-0.3.0 → hud_python-0.3.2}/docs/task-creation.mdx +0 -0
  159. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/Dockerfile +0 -0
  160. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/pyproject.toml +0 -0
  161. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/__init__.py +0 -0
  162. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/display_adapters.py +0 -0
  163. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/emulator.py +0 -0
  164. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/evaluator.py +0 -0
  165. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/kill.py +0 -0
  166. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/main.py +0 -0
  167. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/setup.py +0 -0
  168. {hud_python-0.3.0 → hud_python-0.3.2}/environments/pokemon_controller/src/hud_controller/step.py +0 -0
  169. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/Dockerfile +0 -0
  170. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/pyproject.toml +0 -0
  171. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/__init__.py +0 -0
  172. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/evaluate/__init__.py +0 -0
  173. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/evaluate/matchers.py +0 -0
  174. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/info.py +0 -0
  175. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/setup/__init__.py +0 -0
  176. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/setup/question.py +0 -0
  177. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/step.py +0 -0
  178. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/utils/__init__.py +0 -0
  179. {hud_python-0.3.0 → hud_python-0.3.2}/environments/qa_controller/src/hud_controller/utils/state.py +0 -0
  180. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/.dockerignore +0 -0
  181. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/.gitignore +0 -0
  182. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/README.md +0 -0
  183. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/README.md +0 -0
  184. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/backend/main.py +0 -0
  185. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/backend/pyproject.toml +0 -0
  186. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/app/globals.css +0 -0
  187. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/app/layout.tsx +0 -0
  188. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/app/page.tsx +0 -0
  189. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/next.config.js +0 -0
  190. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/package-lock.json +0 -0
  191. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/package.json +0 -0
  192. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/postcss.config.js +0 -0
  193. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/tailwind.config.js +0 -0
  194. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/frontend/tsconfig.json +0 -0
  195. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/apps/todo/launch.py +0 -0
  196. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/docker-compose.yml +0 -0
  197. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/README.md +0 -0
  198. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/__init__.py +0 -0
  199. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/__main__.py +0 -0
  200. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/evaluators/__init__.py +0 -0
  201. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/evaluators/context.py +0 -0
  202. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/evaluators/registry.py +0 -0
  203. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/evaluators/todo.py +0 -0
  204. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/problems/__init__.py +0 -0
  205. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/problems/registry.py +0 -0
  206. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/problems/todo.py +0 -0
  207. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/runtime.py +0 -0
  208. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/services.py +0 -0
  209. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/setup/__init__.py +0 -0
  210. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/setup/registry.py +0 -0
  211. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/src/hud_controller/setup/todo.py +0 -0
  212. {hud_python-0.3.0 → hud_python-0.3.2}/environments/simple_browser/start.sh +0 -0
  213. {hud_python-0.3.0 → hud_python-0.3.2}/examples/README.md +0 -0
  214. {hud_python-0.3.0 → hud_python-0.3.2}/examples/agents_tools/browser_use.ipynb +0 -0
  215. {hud_python-0.3.0 → hud_python-0.3.2}/examples/agents_tools/sensitive_data.ipynb +0 -0
  216. {hud_python-0.3.0 → hud_python-0.3.2}/examples/environments/pokemon_local.ipynb +0 -0
  217. {hud_python-0.3.0 → hud_python-0.3.2}/examples/environments/pokemon_remote.ipynb +0 -0
  218. {hud_python-0.3.0 → hud_python-0.3.2}/examples/environments/remote.ipynb +0 -0
  219. {hud_python-0.3.0 → hud_python-0.3.2}/examples/evaluations/osworld.ipynb +0 -0
  220. {hud_python-0.3.0 → hud_python-0.3.2}/examples/evaluations/sheetbench_direct_example.ipynb +0 -0
  221. {hud_python-0.3.0 → hud_python-0.3.2}/examples/evaluations/tasks.ipynb +0 -0
  222. {hud_python-0.3.0 → hud_python-0.3.2}/examples/evaluations/wordle_example.ipynb +0 -0
  223. {hud_python-0.3.0 → hud_python-0.3.2}/examples/sheets_bench_cua_example.ipynb +0 -0
  224. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/__init__.py +0 -0
  225. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/claude/__init__.py +0 -0
  226. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/claude/adapter.py +0 -0
  227. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/claude/tests/__init__.py +0 -0
  228. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/claude/tests/test_adapter.py +0 -0
  229. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/common/__init__.py +0 -0
  230. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/common/tests/__init__.py +0 -0
  231. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/common/types.py +0 -0
  232. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/operator/__init__.py +0 -0
  233. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/operator/adapter.py +0 -0
  234. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/operator/tests/__init__.py +0 -0
  235. {hud_python-0.3.0 → hud_python-0.3.2}/hud/adapters/operator/tests/test_adapter.py +0 -0
  236. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/__init__.py +0 -0
  237. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/base.py +0 -0
  238. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/claude.py +0 -0
  239. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/claude_plays_pokemon.py +0 -0
  240. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/langchain.py +0 -0
  241. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/misc/__init__.py +0 -0
  242. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/misc/response_agent.py +0 -0
  243. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/operator.py +0 -0
  244. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/tests/__init__.py +0 -0
  245. {hud_python-0.3.0 → hud_python-0.3.2}/hud/agent/tests/test_base.py +0 -0
  246. {hud_python-0.3.0 → hud_python-0.3.2}/hud/env/__init__.py +0 -0
  247. {hud_python-0.3.0 → hud_python-0.3.2}/hud/env/client.py +0 -0
  248. {hud_python-0.3.0 → hud_python-0.3.2}/hud/env/environment.py +0 -0
  249. {hud_python-0.3.0 → hud_python-0.3.2}/hud/env/remote_client.py +0 -0
  250. {hud_python-0.3.0 → hud_python-0.3.2}/hud/env/remote_docker_client.py +0 -0
  251. {hud_python-0.3.0 → hud_python-0.3.2}/hud/exceptions.py +0 -0
  252. {hud_python-0.3.0 → hud_python-0.3.2}/hud/job.py +0 -0
  253. {hud_python-0.3.0/hud/mcp_agent → hud_python-0.3.2/hud/mcp}/tests/__init__.py +0 -0
  254. {hud_python-0.3.0 → hud_python-0.3.2}/hud/py.typed +0 -0
  255. {hud_python-0.3.0 → hud_python-0.3.2}/hud/server/__init__.py +0 -0
  256. {hud_python-0.3.0 → hud_python-0.3.2}/hud/server/requests.py +0 -0
  257. {hud_python-0.3.0/hud/evaluators → hud_python-0.3.2/hud/server}/tests/__init__.py +0 -0
  258. {hud_python-0.3.0 → hud_python-0.3.2}/hud/server/tests/test_requests.py +0 -0
  259. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/instrumentation/__init__.py +0 -0
  260. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/instrumentation/mcp.py +0 -0
  261. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/instrumentation/registry.py +0 -0
  262. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/mcp_models.py +0 -0
  263. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/tests/__init__.py +0 -0
  264. {hud_python-0.3.0 → hud_python-0.3.2}/hud/telemetry/tests/test_context.py +0 -0
  265. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/base.py +0 -0
  266. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/bash.py +0 -0
  267. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/computer/__init__.py +0 -0
  268. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/computer/anthropic.py +0 -0
  269. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/computer/openai.py +0 -0
  270. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/edit.py +0 -0
  271. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/executors/base.py +0 -0
  272. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/executors/tests/__init__.py +0 -0
  273. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/executors/tests/test_base_executor.py +0 -0
  274. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/executors/xdo.py +0 -0
  275. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/helper/README.md +0 -0
  276. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/helper/__init__.py +0 -0
  277. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/helper/mcp_server.py +0 -0
  278. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/helper/server_initialization.py +0 -0
  279. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/helper/utils.py +0 -0
  280. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/__init__.py +0 -0
  281. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_bash.py +0 -0
  282. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_computer.py +0 -0
  283. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_computer_actions.py +0 -0
  284. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_init.py +0 -0
  285. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_playwright_tool.py +0 -0
  286. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/tests/test_utils.py +0 -0
  287. {hud_python-0.3.0 → hud_python-0.3.2}/hud/tools/utils.py +0 -0
  288. {hud_python-0.3.0 → hud_python-0.3.2}/hud/types.py +0 -0
  289. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/__init__.py +0 -0
  290. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/agent.py +0 -0
  291. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/common.py +0 -0
  292. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/config.py +0 -0
  293. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/misc.py +0 -0
  294. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/progress.py +0 -0
  295. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/telemetry.py +0 -0
  296. {hud_python-0.3.0/hud/server → hud_python-0.3.2/hud/utils}/tests/__init__.py +0 -0
  297. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/tests/test_common.py +0 -0
  298. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/tests/test_config.py +0 -0
  299. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/tests/test_init.py +0 -0
  300. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/tests/test_progress.py +0 -0
  301. {hud_python-0.3.0 → hud_python-0.3.2}/hud/utils/tests/test_telemetry.py +0 -0
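The file list shows the `hud/mcp_agent` package moving to `hud/mcp` (entries 69 to 75 and 253). Code that has to run against both 0.3.0 and 0.3.2 could resolve whichever path exists at import time. The following is a minimal sketch, not part of the package: `import_first` is a hypothetical helper, and this diff does not show which names each module exports.

```python
import importlib
from types import ModuleType


def import_first(*candidates: str) -> ModuleType:
    """Return the first module in `candidates` that can be imported."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
            errors.append(f"{name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


if __name__ == "__main__":
    # Demonstrated with standard-library names so the sketch runs without hud-python.
    module = import_first("a_module_that_does_not_exist", "json")
    print(module.__name__)
```

With hud-python installed, the call would presumably be `import_first("hud.mcp", "hud.mcp_agent")`. Catching `ImportError` also hides failures raised while a candidate module is itself importing, so pinning a single version is the simpler option when that is possible.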
{hud_python-0.3.0 → hud_python-0.3.2}/.gitignore
@@ -28,4 +28,8 @@ TODO.md
 
  .coverage
 
- *.log
+ *.log
+
+ /ref/
+
+ .cursor/
{hud_python-0.3.0 → hud_python-0.3.2}/PKG-INFO
@@ -1,9 +1,9 @@
  Metadata-Version: 2.4
  Name: hud-python
- Version: 0.3.0
+ Version: 0.3.2
  Summary: SDK for the HUD platform.
- Project-URL: Homepage, https://github.com/hud-evals/hud-sdk
- Project-URL: Bug Tracker, https://github.com/hud-evals/hud-sdk/issues
+ Project-URL: Homepage, https://github.com/hud-evals/hud-python
+ Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
  Project-URL: Documentation, https://docs.hud.so
  Author-email: HUD SDK <founders@hud.so>
  License: MIT License
@@ -35,28 +35,22 @@ Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Programming Language :: Python :: 3.13
  Requires-Python: <3.14,>=3.11
- Requires-Dist: aiodocker>=0.24.0
  Requires-Dist: anthropic
+ Requires-Dist: datasets>=4.0.0
  Requires-Dist: dotenv>=0.9.9
  Requires-Dist: httpx<1,>=0.23.0
- Requires-Dist: inspect-ai>=0.3.80
- Requires-Dist: ipykernel
  Requires-Dist: langchain
  Requires-Dist: langchain-anthropic
  Requires-Dist: langchain-openai
  Requires-Dist: mcp-use>=1.3.7
  Requires-Dist: mcp==1.12.2
- Requires-Dist: numpy
  Requires-Dist: openai
  Requires-Dist: pathspec>=0.12.1
- Requires-Dist: pillow>=11.1.0
- Requires-Dist: pyautogui>=0.9.54
  Requires-Dist: pydantic-settings<3,>=2
  Requires-Dist: pydantic<3,>=2
- Requires-Dist: textdistance<5,>=4.5.0
- Requires-Dist: toml>=0.10.2
  Requires-Dist: wrapt>=1.14.0
  Provides-Extra: dev
+ Requires-Dist: aiodocker>=0.24.0; extra == 'dev'
  Requires-Dist: anthropic; extra == 'dev'
  Requires-Dist: dotenv; extra == 'dev'
  Requires-Dist: ipykernel; extra == 'dev'
@@ -64,17 +58,29 @@ Requires-Dist: ipython<9; extra == 'dev'
  Requires-Dist: jupyter-client; extra == 'dev'
  Requires-Dist: jupyter-core; extra == 'dev'
  Requires-Dist: openai; extra == 'dev'
+ Requires-Dist: pillow>=11.1.0; extra == 'dev'
  Requires-Dist: playwright; extra == 'dev'
+ Requires-Dist: pyautogui>=0.9.54; extra == 'dev'
  Requires-Dist: pyright==1.1.401; extra == 'dev'
  Requires-Dist: pytest-asyncio; extra == 'dev'
  Requires-Dist: pytest-cov; extra == 'dev'
  Requires-Dist: pytest-mock; extra == 'dev'
  Requires-Dist: pytest<9,>=8.1.1; extra == 'dev'
  Requires-Dist: ruff==0.11.8; extra == 'dev'
+ Requires-Dist: toml>=0.10.2; extra == 'dev'
+ Provides-Extra: v2
+ Requires-Dist: aiodocker>=0.24.0; extra == 'v2'
+ Requires-Dist: inspect-ai>=0.3.80; extra == 'v2'
+ Requires-Dist: ipykernel; extra == 'v2'
+ Requires-Dist: numpy; extra == 'v2'
+ Requires-Dist: pillow>=11.1.0; extra == 'v2'
+ Requires-Dist: pyautogui>=0.9.54; extra == 'v2'
+ Requires-Dist: textdistance<5,>=4.5.0; extra == 'v2'
+ Requires-Dist: toml>=0.10.2; extra == 'v2'
  Description-Content-Type: text/markdown
 
  <div align="left">
- <img src="https://raw.githubusercontent.com/hud-evals/hud-sdk/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
+ <img src="https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
  </div>
 
  <h3>
@@ -88,7 +94,7 @@ Evaluate your Computer Use AI agents across web browsers, desktop environments,
  We're here to help with eval strategies, custom environments, or improving your agent architecture!
 
 
- > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
+ > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-python/issues), as the SDK is still evolving!
 
  [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
 
@@ -272,7 +278,7 @@ If you use this SDK in your research, please cite it as follows:
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
  title = {{HUD: An Evaluation Platform for Agents}},
  date = {2025-04},
- url = {https://github.com/hud-evals/hud-sdk},
+ url = {https://github.com/hud-evals/hud-python},
  langid = {en}
  }
  ```
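The PKG-INFO hunks above remove several `Requires-Dist` entries from the base install and add them back under a new `v2` extra (and partly under `dev`). As a rough way to check this, the sketch below recomputes which packages moved, using only the changed requirement strings copied from those hunks. The helper names `parse` and `moved_to_extra` are illustrative, and the name pattern is a simplification rather than a full requirement parser.

```python
from __future__ import annotations

import re

# Base requirements removed in 0.3.2 (the "-" lines of the hunks above).
REMOVED_FROM_BASE = [
    "aiodocker>=0.24.0", "inspect-ai>=0.3.80", "ipykernel", "numpy",
    "pillow>=11.1.0", "pyautogui>=0.9.54", "textdistance<5,>=4.5.0", "toml>=0.10.2",
]

# Requirements added in 0.3.2 (the "+" lines of the hunks above).
ADDED = [
    "datasets>=4.0.0",
    "aiodocker>=0.24.0; extra == 'dev'", "pillow>=11.1.0; extra == 'dev'",
    "pyautogui>=0.9.54; extra == 'dev'", "toml>=0.10.2; extra == 'dev'",
    "aiodocker>=0.24.0; extra == 'v2'", "inspect-ai>=0.3.80; extra == 'v2'",
    "ipykernel; extra == 'v2'", "numpy; extra == 'v2'",
    "pillow>=11.1.0; extra == 'v2'", "pyautogui>=0.9.54; extra == 'v2'",
    "textdistance<5,>=4.5.0; extra == 'v2'", "toml>=0.10.2; extra == 'v2'",
]

_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")           # simplified project name
_EXTRA = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")   # extra marker, if any


def parse(requirement: str) -> tuple[str, str | None]:
    """Split a Requires-Dist value into (project name, extra or None)."""
    name = _NAME.match(requirement.strip()).group(0).lower()
    extra = _EXTRA.search(requirement)
    return name, (extra.group(1) if extra else None)


def moved_to_extra(removed: list[str], added: list[str], extra: str) -> list[str]:
    """Names dropped from the base install that reappear only under `extra`."""
    parsed = [parse(r) for r in added]
    still_base = {name for name, ex in parsed if ex is None}
    in_extra = {name for name, ex in parsed if ex == extra}
    dropped = {parse(r)[0] for r in removed}
    return sorted((dropped & in_extra) - still_base)


if __name__ == "__main__":
    print("v2: ", moved_to_extra(REMOVED_FROM_BASE, ADDED, "v2"))
    print("dev:", moved_to_extra(REMOVED_FROM_BASE, ADDED, "dev"))
```

Under this metadata, a plain `pip install hud-python` would no longer pull in those packages, while requesting the extra with the standard pip syntax, `pip install "hud-python[v2]"`, would.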
{hud_python-0.3.0 → hud_python-0.3.2}/README.md
@@ -1,5 +1,5 @@
  <div align="left">
- <img src="https://raw.githubusercontent.com/hud-evals/hud-sdk/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
+ <img src="https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 20px;"/>
  </div>
 
  <h3>
@@ -13,7 +13,7 @@ Evaluate your Computer Use AI agents across web browsers, desktop environments,
  We're here to help with eval strategies, custom environments, or improving your agent architecture!
 
 
- > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-sdk/issues), as the SDK is still evolving!
+ > **Early Release Notice**: We'd love to hear your feedback in [Issues](https://github.com/hud-evals/hud-python/issues), as the SDK is still evolving!
 
  [![PyPI version](https://img.shields.io/pypi/v/hud-python)](https://pypi.org/project/hud-python/)
 
@@ -197,7 +197,7 @@ If you use this SDK in your research, please cite it as follows:
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
  title = {{HUD: An Evaluation Platform for Agents}},
  date = {2025-04},
- url = {https://github.com/hud-evals/hud-sdk},
+ url = {https://github.com/hud-evals/hud-python},
  langid = {en}
  }
  ```
@@ -0,0 +1,433 @@
1
+ # How to Build HUD-Compatible MCP Environments
2
+
3
+ This document is a step-by-step guide for turning *any* piece of software that can run in a Docker container into a **Model Context Protocol (MCP)** environment that the HUD SDK can evaluate or control. We’ll move through six short phases, each with a clear checkpoint.
4
+
5
+ The official MCP lifecycle specification is an excellent companion reference – skim it now, keep it open while you work: [modelcontextprotocol.io › Lifecycle](https://modelcontextprotocol.io/specification/2025-06-18/basic/lifecycle).
6
+
7
+ ---
8
+
9
+ ## Phase Overview
10
+
11
+ | Phase | Goal |
12
+ |-------|------|
13
+ | 1 | A Docker image that *starts* and prints to **stderr** |
14
+ | 2 | A minimal MCP server that responds to `initialize` over **stdio** |
15
+ | 3 | Working `setup`, `evaluate`, and **interaction** tools |
16
+ | 4 | Image launches remotely on the HUD platform & exposes live telemetry |
17
+ | 5 | Fast local iteration with **cursor-mcp** and a tiny `mcp.json` |
18
+ | 6 | Optional polish – registries, optimisation, security, creative ideas |
19
+
20
+ Take the phases one at a time; do **not** jump ahead. Each stage’s checkpoint is the foundation for the next.
21
+
22
+ ### One-command sanity check (`docker_debug.py`)
23
+
24
+ While you move through the phases it’s handy to run the **interactive checker** to make sure nothing broke:
25
+
26
+ ```bash
27
+ python environments/docker_debug.py my-environment:latest
28
+ ```
29
+
30
+ The script walks the *same* checklist and prints coloured, human-friendly hints whenever something fails.
31
+
32
+ | What it validates | Phase |
33
+ |-------------------|-------|
34
+ | Container starts & logs to **stderr** | 1 |
35
+ | MCP server responds to an `initialize` request | 2 |
36
+ | Discovers `setup`, `evaluate`, and interaction tools | 3 |
37
+ | Calls `setup` / `evaluate`, checks telemetry & startup time | 4 |
38
+ | Spawns three concurrent clients to stress-test resources | 5 |
39
+
40
+ 💡 **Run it after finishing each phase.** If the checker exits with a red ❌, scroll up for the gold-coloured *hint* block – it usually points directly to the root cause.
41
+
42
+ ---
43
+
44
+ ## Phase 1 – Write a *Simple* Dockerfile
45
+
46
+ **Goal →** the container starts, prints a message to **stderr**, and exits cleanly. Nothing else.
47
+
48
+ Why stderr? In Phase 2 the MCP server will reserve **stdout** for JSON-RPC traffic, so *all* human-readable logs should already go to the other stream.
49
+
50
+ ### Minimal example
51
+
52
+ ```dockerfile
53
+ FROM python:3.11-slim
54
+
55
+ WORKDIR /app
56
+
57
+ COPY . .
58
+
59
+ # Optional: install requirements
60
+ # RUN pip install --no-cache-dir -r requirements.txt
61
+
62
+ # ‼️ Send logs to stderr (stdout remains untouched for MCP)
63
+ CMD ["python", "-c", "import sys, time; print('hello from the container', file=sys.stderr); time.sleep(1)"]
68
+ ```
69
+
70
+ Build & run:
71
+
72
+ ```bash
73
+ docker build -t my-environment .
74
+ docker run --rm -it my-environment # look for the log line on stderr
75
+ ```
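The Phase 1 checkpoint can also be checked programmatically by capturing the two streams separately and confirming that the greeting lands on stderr while stdout stays empty. The sketch below runs the same `python -c` one-liner directly instead of through Docker, so it needs no image; `run_and_split` is a hypothetical helper name, not part of the HUD SDK.

```python
import subprocess
import sys


def run_and_split(argv):
    """Run a command and return its (stdout, stderr) as decoded text."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return proc.stdout, proc.stderr


# Same one-liner as the Dockerfile CMD, minus the sleep.
CMD = [
    sys.executable,
    "-c",
    "import sys; print('hello from the container', file=sys.stderr)",
]

if __name__ == "__main__":
    out, err = run_and_split(CMD)
    print("stdout empty:", out == "")
    print("stderr:", err.strip())
```

To apply the same check to the image, `CMD` could be swapped for `["docker", "run", "--rm", "my-environment"]`. Leave out `-t` in that case, because allocating a TTY merges both streams into one.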
76
+
77
+ • **One Dockerfile only** – no docker-compose.
78
+ • If you’re building a GUI environment, start from `hudpython/novnc-base:latest` instead and leave VNC configuration for later phases.
79
+
80
+ Checkpoint reached? Congratulations – move on.
81
+
82
+ 👉 Quick sanity check: `python environments/docker_debug.py my-environment:latest` (verifies Phase 1 automatically)
83
+
84
+ Need inspiration? Skim the real Dockerfiles used in the example browser environments:
85
+ • [`simple_browser/Dockerfile`](./simple_browser/Dockerfile)
86
+ • [`remote_browser/Dockerfile`](./remote_browser/Dockerfile)
87
+ They follow the exact same pattern – a single file, logs to stderr, nothing fancy.
88
+
89
+ ---
90
+
91
+ ## Phase 2 – Create the MCP Server
92
+
93
+ **Goal →** a Python process that:
94
+ 1. Speaks MCP over **stdio**.
95
+ 2. Responds correctly to the `initialize` request.
96
+ 3. Logs everything to **stderr**.
97
+
98
+ The MCP lifecycle is *initialize → operate → shutdown* (see spec link above).
99
+
100
+ ### Skeleton server (FastMCP)
101
+
102
+ ```python
103
+ import sys
+ import logging
+ from mcp.server.fastmcp import FastMCP
+
+ # 1️⃣ Always log to stderr – stdout is reserved for JSON-RPC
+ logging.basicConfig(
+     stream=sys.stderr,
+     level=logging.INFO,
+     format='[%(levelname)s] %(asctime)s | %(name)s | %(message)s'
+ )
+
+ mcp = FastMCP("My Environment")
+
+ from hud.tools.helper import mcp_intialize_wrapper
+
+ @mcp_intialize_wrapper()
+ async def initialize_environment():
+     """Heavy one-time setup – start databases, launch background apps, etc."""
+     logging.info("starting core services…")
+     await start_services()  # your coroutine
+     logging.info("services ready")
+
+ if __name__ == "__main__":
+     mcp.run()
127
+ ```
128
+
129
+ *(Replace `start_services()` with whatever takes noticeable startup time – browsers, DBs, X servers, …)*
130
+
131
+ ### Adapt Dockerfile
132
+
133
+ At the end of your Dockerfile, you must launch the MCP server as the container's main process, ensuring it communicates over stdio (stdin/stdout). This is typically done by setting the `CMD` or `ENTRYPOINT` to run your server module directly, for example:
134
+
135
+
136
+ ```dockerfile
137
+ FROM python:3.11-slim
138
+
139
+ WORKDIR /app
140
+ COPY . .
141
+
142
+ # Optional: install requirements
143
+ # RUN pip install --no-cache-dir -r requirements.txt
144
+
145
+ CMD ["python", "-m", "your_module_name"] # Replace 'your_module_name' with your actual entrypoint module
146
+ ```
147
+
148
+ ### Four validation steps (run them **in order**)
149
+
150
+ | # | What you do | Why it matters |
151
+ |---|-------------|----------------|
152
+ | 1 | **Direct stdio test** – pipe the JSON below into your script | Proves the Python code handles `initialize` without any client or Docker noise |
153
+ | 2 | **MCP Inspector** – `npx @modelcontextprotocol/inspector python -m my_package.server` | Lets you click around: view capabilities, tools, resources |
154
+ | 3 | **Inside Docker** – rebuild the image and run it | This is *exactly* how HUD will execute the server |
155
+ | 4 | **Run `docker_debug.py`** – `python environments/docker_debug.py my-environment:latest` | Combines the above checks & points out common mistakes |
156
+
157
+ #### JSON for step 1
158
+
159
+ ```jsonc
160
+ { "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {
+     "protocolVersion": "2024-11-05",
+     "capabilities": {"roots": {"listChanged": true}},
+     "clientInfo": {"name": "DevClient", "title": "Dev", "version": "0.0.0"}
+ }}
165
+ ```
166
+
167
+ Pipe it:
168
+
169
+ ```bash
170
+ echo '<the-json-above>' | python -m my_package.server
171
+ ```
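The stdio transport frames messages as newline-delimited JSON, so the request has to reach the server as a single line. A minimal sketch of building that line and loosely checking the reply is shown here; the helper names `build_initialize_request` and `looks_like_initialize_result` are hypothetical, and the check only covers the fields a quick smoke test needs.

```python
import json


def build_initialize_request(request_id=1):
    """Return the initialize request as one newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {"roots": {"listChanged": True}},
            "clientInfo": {"name": "DevClient", "title": "Dev", "version": "0.0.0"},
        },
    }
    # With default settings json.dumps emits no raw newlines,
    # so the message stays on a single line.
    return json.dumps(message) + "\n"


def looks_like_initialize_result(line, request_id=1):
    """Loosely check a reply: valid JSON, matching id, result with protocolVersion."""
    try:
        reply = json.loads(line)
    except json.JSONDecodeError:
        return False  # e.g. a log line that leaked onto stdout
    if not isinstance(reply, dict):
        return False
    result = reply.get("result")
    return (
        reply.get("jsonrpc") == "2.0"
        and reply.get("id") == request_id
        and isinstance(result, dict)
        and "protocolVersion" in result
    )


if __name__ == "__main__":
    print(build_initialize_request(), end="")
```

Printing the request and piping it into the server is equivalent to the `echo` command above. A reply that fails the check is often a log line written to stdout by mistake.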
172
+
173
+ If all four validations succeed, you have a real MCP server – time to make it useful.
174
+
175
+ ---
176
+
177
+ ## Phase 3 – Add Setup / Evaluate / Interaction Tools
178
+
179
+ **Goal →** tools are discoverable in the Inspector *and* callable from the HUD SDK.
180
+
181
+ 👉 After wiring in the tools, confirm with `python environments/docker_debug.py my-environment:latest` – it now checks for their presence and basic execution.
182
+
183
+ 1. Write **`setup`** and **`evaluate`** tools first – they are *lifecycle* tools and never shown to the LLM.
184
+ 2. Register at least one **interaction** tool (`computer`, `playwright`, or your own).
185
+
186
+ ### Example
187
+
188
+ ```python
189
+ from hud.tools.helper import mcp_intialize_wrapper, register_instance_tool
+ from hud.tools import HudComputerTool
+
+ @mcp.tool()
+ async def setup(config: dict) -> dict:
+     ...  # prepare environment
+
+ @mcp.tool()
+ async def evaluate(config: dict) -> dict:
+     ...  # return {"reward": <0-1>, "done": bool}
+
+ # Same initialize wrapper as in the Phase 2 skeleton
+ @mcp_intialize_wrapper()
+ async def init():
+     register_instance_tool(mcp, "computer", HudComputerTool())
203
+ ```
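As an illustration of what an `evaluate` body might compute, here is a small sketch that scores a todo list and returns the `{"reward", "done"}` shape mentioned in the comment above. The in-memory list of dicts, the helper name `score_todos`, and the partial-credit rule are all assumptions made for this example; a real evaluator would read state from the running application.

```python
def score_todos(todos, expected_count):
    """Return an evaluate-style result for a list of {"title", "done"} dicts.

    Reward is the fraction of the expected completions reached, capped at 1.0.
    """
    completed = sum(1 for item in todos if item.get("done"))
    if expected_count <= 0:
        reward = 1.0  # nothing was required, so the task is trivially satisfied
    else:
        reward = min(completed / expected_count, 1.0)
    return {"reward": reward, "done": completed >= expected_count}


if __name__ == "__main__":
    todos = [
        {"title": "buy milk", "done": True},
        {"title": "write tests", "done": False},
        {"title": "ship release", "done": True},
    ]
    print(score_todos(todos, expected_count=2))
```

Keeping the scoring logic in a plain function like this makes it easy to unit-test without starting the MCP server.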
204
+
205
+ ### Test workflow
206
+
207
+ 1. **Inspector first** – restart the server, refresh the *Tools* tab, confirm the new tools appear.
208
+ 2. **Rebuild the image** – `docker build -t my-environment .`.
209
+ 3. **HUD SDK test** – run a short script like the one below. GUI environments built from `hudpython/novnc-base` still expose a VNC viewer on <http://localhost:8080/vnc.html> – keep it open while testing.
210
+
211
+ ```python
212
+ import asyncio
+ from hud import Task
+ from hud.mcp import ClaudeMCPAgent
+ from hud.telemetry import trace
+ from mcp_use import MCPClient
+
+ async def main():
+     # `trace` captures *everything* that happens and sends it to app.hud.so
+     with trace("local_test"):
+         cfg = {
+             "mcp_config": {
+                 "local": {"command": "docker", "args": ["run", "--rm", "-i", "my-environment:latest"]}
+             }
+         }
+         client = MCPClient.from_dict(cfg)
+
+         agent = ClaudeMCPAgent(
+             client=client,
+             model="claude-3-5-sonnet-20241022",
+             allowed_tools=["computer"]
+         )
+
+         task = Task(
+             prompt="Mark two todo items as done",
+             setup={"function": "todo_seed", "args": {"num_items": 5}},
+             evaluate={"function": "todo_completed", "args": {"expected_count": 2}}
+         )
+
+         result = await agent.run(task)
+         print(result)
+
+         await client.close_all_sessions()
+
+ asyncio.run(main())
246
+ ```
247
+
248
+ The `trace` context manager sends a full timeline of agent actions, tool calls, and rewards to app.hud.so – perfect for debugging.
249
+
250
+ See `examples/agents_tools/simple_task_example.py` and `examples/environments/gmail_local.py` for larger end-to-end demos.
251
+
252
+ ---
253
+
254
+ ## Phase 4 – Remote Deployment & HUD Runner
255
+
256
+ **Goal →** the exact same image runs in parallel on hundreds of instances and exposes more telemetry so that app.hud.so can visualise the whole lifecycle.
257
+
258
+ ### 1. Publish your image
259
+
260
+ Log in to Docker Hub (or any registry HUD can pull from) and push a tagged build:
261
+
262
+ ```bash
263
+ docker tag my-environment yourdockerhubuser/my-environment:latest
264
+ docker push yourdockerhubuser/my-environment:latest
265
+ ```
266
+
267
+ *(If you’re using a private registry, make sure the HUD worker has pull credentials.)*
268
+
269
+ ### 2. Launch it remotely (gmail_remote pattern)
270
+
271
+ `examples/environments/gmail_remote.py` shows the canonical pattern – a remote MCP server entry that simply runs **the same Docker image**:
272
+
273
+ ```python
274
+ from hud import settings
+ from mcp_use import MCPClient
+
+ # Your image is in a registry, now tell HUD to pull & run it on demand
+ config = {
+     "mcp_config": {
+         "hud": {
+             "url": settings.mcp_url,  # Provided by HUD when you create an evaluation run
+             "headers": {
+                 "Authorization": f"Bearer {settings.api_key}",
+                 "Mcp-Image": "yourdockerhubuser/my-environment:latest",  # which image to launch
+             },
+         }
+     }
+ }
+
+ client = MCPClient.from_dict(config)
289
+ ```
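When several images need to be launched this way, the config dict can be produced by a small helper. The sketch below is one possible shape: `remote_mcp_config` is a hypothetical name, and the URL and key are passed in as plain arguments (rather than read from `hud.settings`) so the function can be exercised without the SDK installed.

```python
def remote_mcp_config(image, mcp_url, api_key, name="hud"):
    """Build the remote-server config dict for one Docker image.

    Mirrors the structure shown above: a single server entry whose headers
    carry the bearer token and the image to launch.
    """
    if not image:
        raise ValueError("image must be a non-empty registry reference")
    return {
        "mcp_config": {
            name: {
                "url": mcp_url,
                "headers": {
                    "Authorization": f"Bearer {api_key}",
                    "Mcp-Image": image,
                },
            }
        }
    }


if __name__ == "__main__":
    # Placeholder values for illustration only
    cfg = remote_mcp_config(
        "yourdockerhubuser/my-environment:latest",
        mcp_url="https://example.invalid/mcp",
        api_key="YOUR_API_KEY",
    )
    print(cfg["mcp_config"]["hud"]["headers"]["Mcp-Image"])
```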
290
+
291
+ _Steps 3 and 4 below are **optional but highly recommended** once the image boots successfully._
292
+
293
+ Spin up **many** agents in parallel by just launching multiple tasks – HUD will queue and start as many containers as resources allow.
294
+
295
+ ### 3. Progress updates during `initialize` (Optional)
296
+
297
+ At remote scale it can take 10-30 s for heavy services to boot. Use `mcp_intialize_wrapper()` with a *progress token* to stream status messages:
298
+
299
+ ```python
300
+ from hud.tools.helper import mcp_intialize_wrapper
+
+ @mcp_intialize_wrapper()
+ async def initialize_environment(session=None, progress_token=None):
+     async def send(p, msg):
+         if session and progress_token:
+             await session.send_progress_notification(progress_token, p, 100, msg)
+
+     await send(10, "starting X11…")
+     await start_x11()
+     await send(50, "launching browser…")
+     await launch_browser()
+     await send(100, "ready")
312
+ ```
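The guard inside `send` means the same initializer works both locally (no session, no token) and remotely. A self-contained sketch of that behaviour follows; `RecordingSession` is a hypothetical stand-in that only mimics the positional call used above, and `make_reporter` is an illustrative helper, not an SDK function.

```python
import asyncio


class RecordingSession:
    """Stand-in for the MCP session: records calls instead of sending them."""

    def __init__(self):
        self.sent = []

    async def send_progress_notification(self, progress_token, progress, total, message):
        self.sent.append((progress_token, progress, total, message))


def make_reporter(session=None, progress_token=None, total=100):
    """Return an async send(progress, message) that is a no-op without a session or token."""

    async def send(progress, message):
        if session and progress_token:
            await session.send_progress_notification(progress_token, progress, total, message)

    return send


async def demo():
    session = RecordingSession()
    send = make_reporter(session, "tok-1")
    await send(10, "starting X11")
    await send(100, "ready")

    silent = make_reporter()  # local run: nothing to report to
    await silent(50, "ignored")
    return session.sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```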
313
+
314
+ Those messages are displayed live on app.hud.so alongside resource graphs – perfect feedback while you wait.
315
+
316
+ ### 4. Live telemetry (`telemetry://live`) (Optional)
317
+
318
+ Expose a resource named `telemetry://live`, as in `environments/simple_browser/src/hud_controller/server.py`, that returns a live URL for app.hud.so to display.
319
+
320
+ Once all of the above works you can unleash *hundreds* of concurrent agents on your new environment.
321
+
322
+ ---
323
+
324
+ ## Phase 5 – Automated Iteration with *cursor-mcp*
325
+
326
+ [`cursor-mcp`](https://github.com/hud-evals/cursor-mcp) turns the edit → build → restart → test loop into a single key-press and adds tools to Cursor Agent that can drive the whole workflow for you. The agent reads the MCP spec, your code, and the live server state, then proposes fixes or new tests on its own. It then has access to the MCP tools the environment provides, enabling it to test all functionality, which completes the iteration loop.
327
+
328
+ 1. Add an entry to `.cursor/mcp.json`:
329
+
330
+ ```jsonc
331
+ {
+   "mcpServers": {
+     "env": {
+       "command": "docker",
+       "args": ["run", "--rm", "-i", "my-environment:latest"]
+     },
+     "cursor-manager": {
+       "command": "uvx",
+       "args": ["cursor-mcp"]
+     }
+   }
+ }
343
+ ```
344
+
345
+ 2. Follow the cursor rules below: rebuild, refresh, test, reflect, repeat.
346
+ 3. Keep the agent open for any messages or issues.
347
+
348
+ ### Cursor rules – paste this once
349
+
350
+ Inside `.cursor/rules/mcp_environment_iteration.mdc` add (or verify) the following so the agent always knows the expected loop:
351
+
352
+ ```mdc
353
+ ---
354
+ description: When making an environment that launches an MCP server, this is the iteration loop
355
+ alwaysApply: false
356
+ ---
357
+ Setting up (also refer to environments/README.md):
358
+ 1. Follow each environment's README.md or any other steps to set it up for the MCP server to be able to directly launch it (such as building the dockerfile)
359
+ 2. Run local tests to make sure the server initializes without immediate errors and stays alive until properly closed. If the server crashes within the first few seconds, the manager will not pick up on it. In that case, go back and debug either the docker run directly or the MCP server by piping an initialization request.
360
+ 3. When the server initialization is stable, use the cursor-manager tool to see the current list of tools and add it if necessary. Take note of the name.
361
+ 4. When working, tell the user to send another message to refresh your list of tools.
362
+
363
+ After setting up, when iterating (will not require a user message ever):
364
+ 1. Look at the environment project and refine/edit/fix files
365
+ 2. Follow its README to set it up for the MCP server (such as building the dockerfile)
366
+ 3. Use the cursor-manager tool to refresh this server (by name)
367
+ 4. See its status using cursor-manager, if it's running then follow with step 5. If it fails, then check the logs using cursor-manager and go back to step 1, but ask the user to reset.
368
+ 5. Use the tools from that server (by name) to test the functionality and edge cases, reflect on the success of your TODOs and think of new things to fix. If the tools are unavailable but the status is running, then ask the user to refresh the user message.
369
+ 6. Review your TODOs, update with new TODOs
370
+ 7. Repeat until you have reached the user's high-level goals or are extremely happy with the final result
371
+
372
+ In general:
373
+ 1. Try to avoid running direct docker or mcp commands and use the tools. If you want to run a docker command or python mcp server command then ask permission and only use if otherwise completely impossible.
374
+ 2. If at any point the docker build starts breaking on initialize, return to setting up properly
375
+ ```
376
+
377
+ The result: fast, autonomous turnaround times even for complex GUI environments.
378
+
379
+ ---
380
+
381
+ ## Phase 6 – Optional Polish & Extensions
382
+
383
+ ### Deeper dive into registries
384
+
385
+ An environment often needs *structured knowledge* about tasks, evaluation logic, or problem definitions. The browser examples keep these in three explicit registries:
386
+
387
+ | Registry | Purpose | Example resource URI |
388
+ |----------|---------|----------------------|
389
+ | **Setup** | How to seed the environment before the agent starts | `setup://registry` & `setup://{env}` |
390
+ | **Evaluators** | Functions that decide success & reward | `evaluators://registry` |
391
+ | **Problems** | Bundled benchmarks / tasks with their own setup & evaluate pairs | `problems://registry` |
392
+
393
+ Each registry is just a dictionary mapping a *name* to a *class*. Use a **decorator** to register classes:
394
+
395
+ ```python
396
+ from .registry import setup, evaluator, problem
+
+ @setup("todo_seed")
+ class TodoSeed:
+     ...
+
+ @evaluator("todo_completed")
+ class TodoCompleted:
+     ...
+
+ @problem("todo_basic", description="Complete two todo items", difficulty="easy")
+ class TodoBasic:
+     def get_setup(self):
+         return {"function": "todo_seed", "args": {"num_items": 5}}
+     def get_evaluation(self):
+         return {"function": "todo_completed", "args": {"expected_count": 2}}
412
+ ```
413
+
414
+ Decorators keep registration *next to the implementation* and avoid manual bookkeeping. The server simply exposes the combined metadata through an MCP **resource**. Follow `environments/simple_browser/src/hud_controller/problems/registry.py` as a template and expose the JSON with `@mcp.resource("problems://registry")`.
415
+
416
+ ### Other finishing touches
417
+
418
+ * **Performance** – lazy-load heavy resources, pool DB connections, cache expensive calls.
419
+ * **Security** – sandbox untrusted code, keep secrets in env vars, audit-log every tool call.
420
+ * **Creative ideas** – API simulators, network test-beds, game worlds… if it fits in Docker it can be an MCP environment.
421
+
422
+ ---
423
+
424
+ ## Summary
425
+
426
+ 1. Start with a *plain* Dockerfile – verify it runs.
427
+ 2. Add a minimal FastMCP server – verify with stdio, Inspector, Docker.
428
+ 3. Implement tools – verify discovery + execution.
429
+ 4. Run the same image remotely – verify telemetry.
430
+ 5. Automate the loop with cursor-mcp.
431
+ 6. Polish and extend as inspiration strikes.
432
+
433
+ Happy building – and remember: **stderr is your friend, stdout belongs to MCP.** 🚀