claude-autopm 2.8.1 → 2.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (450) hide show
  1. package/README.md +399 -529
  2. package/bin/autopm.js +2 -0
  3. package/bin/commands/plugin.js +395 -0
  4. package/bin/commands/team.js +184 -10
  5. package/install/install.js +223 -4
  6. package/lib/plugins/PluginManager.js +1328 -0
  7. package/lib/plugins/PluginManager.old.js +400 -0
  8. package/package.json +5 -1
  9. package/packages/plugin-ai/LICENSE +21 -0
  10. package/packages/plugin-ai/README.md +316 -0
  11. package/packages/plugin-ai/agents/anthropic-claude-expert.md +579 -0
  12. package/packages/plugin-ai/agents/azure-openai-expert.md +1411 -0
  13. package/packages/plugin-ai/agents/google-a2a-expert.md +1445 -0
  14. package/packages/plugin-ai/agents/huggingface-expert.md +2131 -0
  15. package/packages/plugin-ai/agents/langchain-expert.md +1427 -0
  16. package/packages/plugin-ai/commands/a2a-setup.md +886 -0
  17. package/packages/plugin-ai/commands/ai-model-deployment.md +481 -0
  18. package/packages/plugin-ai/commands/anthropic-optimize.md +793 -0
  19. package/packages/plugin-ai/commands/huggingface-deploy.md +789 -0
  20. package/packages/plugin-ai/commands/langchain-optimize.md +807 -0
  21. package/packages/plugin-ai/commands/llm-optimize.md +348 -0
  22. package/packages/plugin-ai/commands/openai-optimize.md +863 -0
  23. package/packages/plugin-ai/commands/rag-optimize.md +841 -0
  24. package/packages/plugin-ai/commands/rag-setup-scaffold.md +382 -0
  25. package/packages/plugin-ai/package.json +66 -0
  26. package/packages/plugin-ai/plugin.json +519 -0
  27. package/packages/plugin-ai/rules/ai-model-standards.md +449 -0
  28. package/packages/plugin-ai/rules/prompt-engineering-standards.md +509 -0
  29. package/packages/plugin-ai/scripts/examples/huggingface-inference-example.py +145 -0
  30. package/packages/plugin-ai/scripts/examples/langchain-rag-example.py +366 -0
  31. package/packages/plugin-ai/scripts/examples/mlflow-tracking-example.py +224 -0
  32. package/packages/plugin-ai/scripts/examples/openai-chat-example.py +425 -0
  33. package/packages/plugin-cloud/README.md +268 -0
  34. package/packages/plugin-cloud/agents/gemini-api-expert.md +880 -0
  35. package/packages/plugin-cloud/agents/openai-python-expert.md +1087 -0
  36. package/packages/plugin-cloud/commands/cloud-cost-optimize.md +243 -0
  37. package/packages/plugin-cloud/commands/cloud-validate.md +196 -0
  38. package/packages/plugin-cloud/hooks/pre-cloud-deploy.js +456 -0
  39. package/packages/plugin-cloud/package.json +64 -0
  40. package/packages/plugin-cloud/plugin.json +338 -0
  41. package/packages/plugin-cloud/rules/cloud-security-compliance.md +313 -0
  42. package/packages/plugin-cloud/scripts/examples/aws-validate.sh +30 -0
  43. package/packages/plugin-cloud/scripts/examples/azure-setup.sh +33 -0
  44. package/packages/plugin-cloud/scripts/examples/gcp-setup.sh +39 -0
  45. package/packages/plugin-cloud/scripts/examples/k8s-validate.sh +40 -0
  46. package/packages/plugin-cloud/scripts/examples/terraform-init.sh +26 -0
  47. package/packages/plugin-core/README.md +274 -0
  48. package/packages/plugin-core/commands/code-rabbit.md +128 -0
  49. package/packages/plugin-core/commands/prompt.md +9 -0
  50. package/packages/plugin-core/commands/re-init.md +9 -0
  51. package/packages/plugin-core/hooks/context7-reminder.md +29 -0
  52. package/packages/plugin-core/hooks/enforce-agents.js +125 -0
  53. package/packages/plugin-core/hooks/enforce-agents.sh +35 -0
  54. package/packages/plugin-core/hooks/pre-agent-context7.js +224 -0
  55. package/packages/plugin-core/hooks/pre-command-context7.js +229 -0
  56. package/packages/plugin-core/hooks/strict-enforce-agents.sh +39 -0
  57. package/packages/plugin-core/hooks/test-hook.sh +21 -0
  58. package/packages/plugin-core/hooks/unified-context7-enforcement.sh +38 -0
  59. package/packages/plugin-core/package.json +45 -0
  60. package/packages/plugin-core/plugin.json +387 -0
  61. package/packages/plugin-core/rules/agent-coordination.md +549 -0
  62. package/packages/plugin-core/rules/agent-mandatory.md +170 -0
  63. package/packages/plugin-core/rules/command-pipelines.md +208 -0
  64. package/packages/plugin-core/rules/context-optimization.md +176 -0
  65. package/packages/plugin-core/rules/context7-enforcement.md +327 -0
  66. package/packages/plugin-core/rules/datetime.md +122 -0
  67. package/packages/plugin-core/rules/definition-of-done.md +272 -0
  68. package/packages/plugin-core/rules/development-environments.md +19 -0
  69. package/packages/plugin-core/rules/development-workflow.md +198 -0
  70. package/packages/plugin-core/rules/framework-path-rules.md +180 -0
  71. package/packages/plugin-core/rules/frontmatter-operations.md +64 -0
  72. package/packages/plugin-core/rules/git-strategy.md +237 -0
  73. package/packages/plugin-core/rules/golden-rules.md +181 -0
  74. package/packages/plugin-core/rules/naming-conventions.md +111 -0
  75. package/packages/plugin-core/rules/no-pr-workflow.md +183 -0
  76. package/packages/plugin-core/rules/pipeline-mandatory.md +109 -0
  77. package/packages/plugin-core/rules/security-checklist.md +318 -0
  78. package/packages/plugin-core/rules/standard-patterns.md +197 -0
  79. package/packages/plugin-core/rules/strip-frontmatter.md +85 -0
  80. package/packages/plugin-core/rules/tdd.enforcement.md +103 -0
  81. package/packages/plugin-core/rules/use-ast-grep.md +113 -0
  82. package/packages/plugin-core/scripts/lib/datetime-utils.sh +254 -0
  83. package/packages/plugin-core/scripts/lib/frontmatter-utils.sh +294 -0
  84. package/packages/plugin-core/scripts/lib/github-utils.sh +221 -0
  85. package/packages/plugin-core/scripts/lib/logging-utils.sh +199 -0
  86. package/packages/plugin-core/scripts/lib/validation-utils.sh +339 -0
  87. package/packages/plugin-core/scripts/mcp/add.sh +7 -0
  88. package/packages/plugin-core/scripts/mcp/disable.sh +12 -0
  89. package/packages/plugin-core/scripts/mcp/enable.sh +12 -0
  90. package/packages/plugin-core/scripts/mcp/list.sh +7 -0
  91. package/packages/plugin-core/scripts/mcp/sync.sh +8 -0
  92. package/packages/plugin-data/README.md +315 -0
  93. package/packages/plugin-data/agents/airflow-orchestration-expert.md +158 -0
  94. package/packages/plugin-data/agents/kedro-pipeline-expert.md +304 -0
  95. package/packages/plugin-data/agents/langgraph-workflow-expert.md +530 -0
  96. package/packages/plugin-data/commands/airflow-dag-scaffold.md +413 -0
  97. package/packages/plugin-data/commands/kafka-pipeline-scaffold.md +503 -0
  98. package/packages/plugin-data/package.json +66 -0
  99. package/packages/plugin-data/plugin.json +294 -0
  100. package/packages/plugin-data/rules/data-quality-standards.md +373 -0
  101. package/packages/plugin-data/rules/etl-pipeline-standards.md +255 -0
  102. package/packages/plugin-data/scripts/examples/airflow-dag-example.py +245 -0
  103. package/packages/plugin-data/scripts/examples/dbt-transform-example.sql +238 -0
  104. package/packages/plugin-data/scripts/examples/kafka-streaming-example.py +257 -0
  105. package/packages/plugin-data/scripts/examples/pandas-etl-example.py +332 -0
  106. package/packages/plugin-databases/README.md +330 -0
  107. package/{autopm/.claude/agents/databases → packages/plugin-databases/agents}/bigquery-expert.md +24 -15
  108. package/{autopm/.claude/agents/databases → packages/plugin-databases/agents}/cosmosdb-expert.md +22 -15
  109. package/{autopm/.claude/agents/databases → packages/plugin-databases/agents}/mongodb-expert.md +24 -15
  110. package/{autopm/.claude/agents/databases → packages/plugin-databases/agents}/postgresql-expert.md +23 -15
  111. package/{autopm/.claude/agents/databases → packages/plugin-databases/agents}/redis-expert.md +29 -7
  112. package/packages/plugin-databases/commands/db-optimize.md +612 -0
  113. package/packages/plugin-databases/package.json +60 -0
  114. package/packages/plugin-databases/plugin.json +237 -0
  115. package/packages/plugin-databases/rules/database-management-strategy.md +146 -0
  116. package/packages/plugin-databases/rules/database-pipeline.md +316 -0
  117. package/packages/plugin-databases/scripts/examples/bigquery-cost-analyze.sh +160 -0
  118. package/packages/plugin-databases/scripts/examples/cosmosdb-ru-optimize.sh +163 -0
  119. package/packages/plugin-databases/scripts/examples/mongodb-shard-check.sh +120 -0
  120. package/packages/plugin-databases/scripts/examples/postgres-index-analyze.sh +95 -0
  121. package/packages/plugin-databases/scripts/examples/redis-cache-stats.sh +121 -0
  122. package/packages/plugin-devops/README.md +367 -0
  123. package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/github-operations-specialist.md +1 -1
  124. package/packages/plugin-devops/commands/ci-pipeline-create.md +581 -0
  125. package/packages/plugin-devops/commands/docker-optimize.md +493 -0
  126. package/packages/plugin-devops/hooks/pre-docker-build.js +472 -0
  127. package/packages/plugin-devops/package.json +61 -0
  128. package/packages/plugin-devops/plugin.json +302 -0
  129. package/packages/plugin-devops/rules/github-operations.md +92 -0
  130. package/packages/plugin-devops/scripts/examples/docker-build-multistage.sh +43 -0
  131. package/packages/plugin-devops/scripts/examples/docker-compose-validate.sh +74 -0
  132. package/packages/plugin-devops/scripts/examples/github-workflow-validate.sh +48 -0
  133. package/packages/plugin-devops/scripts/examples/prometheus-health-check.sh +58 -0
  134. package/packages/plugin-devops/scripts/examples/ssh-key-setup.sh +74 -0
  135. package/packages/plugin-frameworks/README.md +309 -0
  136. package/{autopm/.claude/agents/frameworks → packages/plugin-frameworks/agents}/e2e-test-engineer.md +219 -0
  137. package/{autopm/.claude/agents/frameworks → packages/plugin-frameworks/agents}/react-frontend-engineer.md +176 -0
  138. package/{autopm/.claude/agents/frameworks → packages/plugin-frameworks/agents}/tailwindcss-expert.md +251 -0
  139. package/packages/plugin-frameworks/commands/nextjs-optimize.md +692 -0
  140. package/packages/plugin-frameworks/commands/react-optimize.md +583 -0
  141. package/packages/plugin-frameworks/package.json +59 -0
  142. package/packages/plugin-frameworks/plugin.json +224 -0
  143. package/packages/plugin-frameworks/rules/performance-guidelines.md +403 -0
  144. package/packages/plugin-frameworks/scripts/examples/react-component-perf.sh +34 -0
  145. package/packages/plugin-frameworks/scripts/examples/tailwind-optimize.sh +44 -0
  146. package/packages/plugin-frameworks/scripts/examples/vue-composition-check.sh +41 -0
  147. package/packages/plugin-languages/README.md +333 -0
  148. package/packages/plugin-languages/commands/javascript-optimize.md +636 -0
  149. package/packages/plugin-languages/commands/nodejs-api-scaffold.md +341 -0
  150. package/packages/plugin-languages/commands/nodejs-optimize.md +689 -0
  151. package/packages/plugin-languages/commands/python-api-scaffold.md +261 -0
  152. package/packages/plugin-languages/commands/python-optimize.md +593 -0
  153. package/packages/plugin-languages/package.json +65 -0
  154. package/packages/plugin-languages/plugin.json +265 -0
  155. package/packages/plugin-languages/rules/code-quality-standards.md +496 -0
  156. package/packages/plugin-languages/rules/testing-standards.md +768 -0
  157. package/packages/plugin-languages/scripts/examples/bash-production-script.sh +520 -0
  158. package/packages/plugin-languages/scripts/examples/javascript-es6-patterns.js +291 -0
  159. package/packages/plugin-languages/scripts/examples/nodejs-async-iteration.js +360 -0
  160. package/packages/plugin-languages/scripts/examples/python-async-patterns.py +289 -0
  161. package/packages/plugin-languages/scripts/examples/typescript-patterns.ts +432 -0
  162. package/packages/plugin-ml/README.md +430 -0
  163. package/packages/plugin-ml/agents/automl-expert.md +326 -0
  164. package/packages/plugin-ml/agents/computer-vision-expert.md +550 -0
  165. package/packages/plugin-ml/agents/gradient-boosting-expert.md +455 -0
  166. package/packages/plugin-ml/agents/neural-network-architect.md +1228 -0
  167. package/packages/plugin-ml/agents/nlp-transformer-expert.md +584 -0
  168. package/packages/plugin-ml/agents/pytorch-expert.md +412 -0
  169. package/packages/plugin-ml/agents/reinforcement-learning-expert.md +2088 -0
  170. package/packages/plugin-ml/agents/scikit-learn-expert.md +228 -0
  171. package/packages/plugin-ml/agents/tensorflow-keras-expert.md +509 -0
  172. package/packages/plugin-ml/agents/time-series-expert.md +303 -0
  173. package/packages/plugin-ml/commands/ml-automl.md +572 -0
  174. package/packages/plugin-ml/commands/ml-train-optimize.md +657 -0
  175. package/packages/plugin-ml/package.json +52 -0
  176. package/packages/plugin-ml/plugin.json +338 -0
  177. package/packages/plugin-pm/README.md +368 -0
  178. package/packages/plugin-pm/claudeautopm-plugin-pm-2.0.0.tgz +0 -0
  179. package/packages/plugin-pm/commands/github/workflow-create.md +42 -0
  180. package/packages/plugin-pm/package.json +57 -0
  181. package/packages/plugin-pm/plugin.json +503 -0
  182. package/packages/plugin-testing/README.md +401 -0
  183. package/{autopm/.claude/agents/testing → packages/plugin-testing/agents}/frontend-testing-engineer.md +373 -0
  184. package/packages/plugin-testing/commands/jest-optimize.md +800 -0
  185. package/packages/plugin-testing/commands/playwright-optimize.md +887 -0
  186. package/packages/plugin-testing/commands/test-coverage.md +512 -0
  187. package/packages/plugin-testing/commands/test-performance.md +1041 -0
  188. package/packages/plugin-testing/commands/test-setup.md +414 -0
  189. package/packages/plugin-testing/package.json +40 -0
  190. package/packages/plugin-testing/plugin.json +197 -0
  191. package/packages/plugin-testing/rules/test-coverage-requirements.md +581 -0
  192. package/packages/plugin-testing/rules/testing-standards.md +529 -0
  193. package/packages/plugin-testing/scripts/examples/react-testing-example.test.jsx +460 -0
  194. package/packages/plugin-testing/scripts/examples/vitest-config-example.js +352 -0
  195. package/packages/plugin-testing/scripts/examples/vue-testing-example.test.js +586 -0
  196. package/scripts/publish-plugins.sh +166 -0
  197. package/autopm/.claude/agents/data/airflow-orchestration-expert.md +0 -52
  198. package/autopm/.claude/agents/data/kedro-pipeline-expert.md +0 -50
  199. package/autopm/.claude/agents/integration/message-queue-engineer.md +0 -794
  200. package/autopm/.claude/commands/ai/langgraph-workflow.md +0 -65
  201. package/autopm/.claude/commands/ai/openai-chat.md +0 -65
  202. package/autopm/.claude/commands/playwright/test-scaffold.md +0 -38
  203. package/autopm/.claude/commands/python/api-scaffold.md +0 -50
  204. package/autopm/.claude/commands/python/docs-query.md +0 -48
  205. package/autopm/.claude/commands/testing/prime.md +0 -314
  206. package/autopm/.claude/commands/testing/run.md +0 -125
  207. package/autopm/.claude/commands/ui/bootstrap-scaffold.md +0 -65
  208. package/autopm/.claude/rules/database-management-strategy.md +0 -17
  209. package/autopm/.claude/rules/database-pipeline.md +0 -94
  210. package/autopm/.claude/rules/ux-design-rules.md +0 -209
  211. package/autopm/.claude/rules/visual-testing.md +0 -223
  212. package/autopm/.claude/scripts/azure/README.md +0 -192
  213. package/autopm/.claude/scripts/azure/active-work.js +0 -524
  214. package/autopm/.claude/scripts/azure/active-work.sh +0 -20
  215. package/autopm/.claude/scripts/azure/blocked.js +0 -520
  216. package/autopm/.claude/scripts/azure/blocked.sh +0 -20
  217. package/autopm/.claude/scripts/azure/daily.js +0 -533
  218. package/autopm/.claude/scripts/azure/daily.sh +0 -20
  219. package/autopm/.claude/scripts/azure/dashboard.js +0 -970
  220. package/autopm/.claude/scripts/azure/dashboard.sh +0 -20
  221. package/autopm/.claude/scripts/azure/feature-list.js +0 -254
  222. package/autopm/.claude/scripts/azure/feature-list.sh +0 -20
  223. package/autopm/.claude/scripts/azure/feature-show.js +0 -7
  224. package/autopm/.claude/scripts/azure/feature-show.sh +0 -20
  225. package/autopm/.claude/scripts/azure/feature-status.js +0 -604
  226. package/autopm/.claude/scripts/azure/feature-status.sh +0 -20
  227. package/autopm/.claude/scripts/azure/help.js +0 -342
  228. package/autopm/.claude/scripts/azure/help.sh +0 -20
  229. package/autopm/.claude/scripts/azure/next-task.js +0 -508
  230. package/autopm/.claude/scripts/azure/next-task.sh +0 -20
  231. package/autopm/.claude/scripts/azure/search.js +0 -469
  232. package/autopm/.claude/scripts/azure/search.sh +0 -20
  233. package/autopm/.claude/scripts/azure/setup.js +0 -745
  234. package/autopm/.claude/scripts/azure/setup.sh +0 -20
  235. package/autopm/.claude/scripts/azure/sprint-report.js +0 -1012
  236. package/autopm/.claude/scripts/azure/sprint-report.sh +0 -20
  237. package/autopm/.claude/scripts/azure/sync.js +0 -563
  238. package/autopm/.claude/scripts/azure/sync.sh +0 -20
  239. package/autopm/.claude/scripts/azure/us-list.js +0 -210
  240. package/autopm/.claude/scripts/azure/us-list.sh +0 -20
  241. package/autopm/.claude/scripts/azure/us-status.js +0 -238
  242. package/autopm/.claude/scripts/azure/us-status.sh +0 -20
  243. package/autopm/.claude/scripts/azure/validate.js +0 -626
  244. package/autopm/.claude/scripts/azure/validate.sh +0 -20
  245. package/autopm/.claude/scripts/azure/wrapper-template.sh +0 -20
  246. package/autopm/.claude/scripts/github/dependency-tracker.js +0 -554
  247. package/autopm/.claude/scripts/github/dependency-validator.js +0 -545
  248. package/autopm/.claude/scripts/github/dependency-visualizer.js +0 -477
  249. package/bin/node/azure-feature-show.js +0 -7
  250. /package/{autopm/.claude/agents/cloud → packages/plugin-ai/agents}/gemini-api-expert.md +0 -0
  251. /package/{autopm/.claude/agents/data → packages/plugin-ai/agents}/langgraph-workflow-expert.md +0 -0
  252. /package/{autopm/.claude/agents/cloud → packages/plugin-ai/agents}/openai-python-expert.md +0 -0
  253. /package/{autopm/.claude/agents/cloud → packages/plugin-cloud/agents}/README.md +0 -0
  254. /package/{autopm/.claude/agents/cloud → packages/plugin-cloud/agents}/aws-cloud-architect.md +0 -0
  255. /package/{autopm/.claude/agents/cloud → packages/plugin-cloud/agents}/azure-cloud-architect.md +0 -0
  256. /package/{autopm/.claude/agents/cloud → packages/plugin-cloud/agents}/gcp-cloud-architect.md +0 -0
  257. /package/{autopm/.claude/agents/cloud → packages/plugin-cloud/agents}/gcp-cloud-functions-engineer.md +0 -0
  258. /package/{autopm/.claude/agents/cloud → packages/plugin-cloud/agents}/kubernetes-orchestrator.md +0 -0
  259. /package/{autopm/.claude/agents/cloud → packages/plugin-cloud/agents}/terraform-infrastructure-expert.md +0 -0
  260. /package/{autopm/.claude/commands/cloud → packages/plugin-cloud/commands}/infra-deploy.md +0 -0
  261. /package/{autopm/.claude/commands/kubernetes/deploy.md → packages/plugin-cloud/commands/k8s-deploy.md} +0 -0
  262. /package/{autopm/.claude/commands/infrastructure → packages/plugin-cloud/commands}/ssh-security.md +0 -0
  263. /package/{autopm/.claude/commands/infrastructure → packages/plugin-cloud/commands}/traefik-setup.md +0 -0
  264. /package/{autopm/.claude → packages/plugin-cloud}/rules/infrastructure-pipeline.md +0 -0
  265. /package/{autopm/.claude → packages/plugin-core}/agents/core/agent-manager.md +0 -0
  266. /package/{autopm/.claude → packages/plugin-core}/agents/core/code-analyzer.md +0 -0
  267. /package/{autopm/.claude → packages/plugin-core}/agents/core/file-analyzer.md +0 -0
  268. /package/{autopm/.claude → packages/plugin-core}/agents/core/test-runner.md +0 -0
  269. /package/{autopm/.claude → packages/plugin-core}/rules/ai-integration-patterns.md +0 -0
  270. /package/{autopm/.claude → packages/plugin-core}/rules/performance-guidelines.md +0 -0
  271. /package/{autopm/.claude/agents/databases → packages/plugin-databases/agents}/README.md +0 -0
  272. /package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/README.md +0 -0
  273. /package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/azure-devops-specialist.md +0 -0
  274. /package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/docker-containerization-expert.md +0 -0
  275. /package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/mcp-context-manager.md +0 -0
  276. /package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/observability-engineer.md +0 -0
  277. /package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/ssh-operations-expert.md +0 -0
  278. /package/{autopm/.claude/agents/devops → packages/plugin-devops/agents}/traefik-proxy-expert.md +0 -0
  279. /package/{autopm/.claude/commands/github → packages/plugin-devops/commands}/workflow-create.md +0 -0
  280. /package/{autopm/.claude → packages/plugin-devops}/rules/ci-cd-kubernetes-strategy.md +0 -0
  281. /package/{autopm/.claude → packages/plugin-devops}/rules/devops-troubleshooting-playbook.md +0 -0
  282. /package/{autopm/.claude → packages/plugin-devops}/rules/docker-first-development.md +0 -0
  283. /package/{autopm/.claude/agents/frameworks → packages/plugin-frameworks/agents}/README.md +0 -0
  284. /package/{autopm/.claude/agents/frameworks → packages/plugin-frameworks/agents}/nats-messaging-expert.md +0 -0
  285. /package/{autopm/.claude/agents/frameworks → packages/plugin-frameworks/agents}/react-ui-expert.md +0 -0
  286. /package/{autopm/.claude/agents/frameworks → packages/plugin-frameworks/agents}/ux-design-expert.md +0 -0
  287. /package/{autopm/.claude/commands/react → packages/plugin-frameworks/commands}/app-scaffold.md +0 -0
  288. /package/{autopm/.claude/commands/ui → packages/plugin-frameworks/commands}/tailwind-system.md +0 -0
  289. /package/{autopm/.claude → packages/plugin-frameworks}/rules/ui-development-standards.md +0 -0
  290. /package/{autopm/.claude → packages/plugin-frameworks}/rules/ui-framework-rules.md +0 -0
  291. /package/{autopm/.claude/agents/languages → packages/plugin-languages/agents}/README.md +0 -0
  292. /package/{autopm/.claude/agents/languages → packages/plugin-languages/agents}/bash-scripting-expert.md +0 -0
  293. /package/{autopm/.claude/agents/languages → packages/plugin-languages/agents}/javascript-frontend-engineer.md +0 -0
  294. /package/{autopm/.claude/agents/languages → packages/plugin-languages/agents}/nodejs-backend-engineer.md +0 -0
  295. /package/{autopm/.claude/agents/languages → packages/plugin-languages/agents}/python-backend-engineer.md +0 -0
  296. /package/{autopm/.claude/agents/languages → packages/plugin-languages/agents}/python-backend-expert.md +0 -0
  297. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/COMMANDS.md +0 -0
  298. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/COMMAND_MAPPING.md +0 -0
  299. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/INTEGRATION_FIX.md +0 -0
  300. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/README.md +0 -0
  301. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/active-work.md +0 -0
  302. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/aliases.md +0 -0
  303. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/blocked-items.md +0 -0
  304. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/clean.md +0 -0
  305. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/docs-query.md +0 -0
  306. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/feature-decompose.md +0 -0
  307. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/feature-list.md +0 -0
  308. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/feature-new.md +0 -0
  309. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/feature-show.md +0 -0
  310. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/feature-start.md +0 -0
  311. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/fix-integration-example.md +0 -0
  312. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/help.md +0 -0
  313. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/import-us.md +0 -0
  314. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/init.md +0 -0
  315. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/next-task.md +0 -0
  316. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/search.md +0 -0
  317. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/sprint-status.md +0 -0
  318. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/standup.md +0 -0
  319. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/sync-all.md +0 -0
  320. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-analyze.md +0 -0
  321. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-close.md +0 -0
  322. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-edit.md +0 -0
  323. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-list.md +0 -0
  324. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-new.md +0 -0
  325. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-reopen.md +0 -0
  326. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-show.md +0 -0
  327. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-start.md +0 -0
  328. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-status.md +0 -0
  329. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/task-sync.md +0 -0
  330. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/us-edit.md +0 -0
  331. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/us-list.md +0 -0
  332. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/us-new.md +0 -0
  333. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/us-parse.md +0 -0
  334. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/us-show.md +0 -0
  335. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/us-status.md +0 -0
  336. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/validate.md +0 -0
  337. /package/{autopm/.claude → packages/plugin-pm}/commands/azure/work-item-sync.md +0 -0
  338. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/blocked.md +0 -0
  339. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/clean.md +0 -0
  340. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/context-create.md +0 -0
  341. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/context-prime.md +0 -0
  342. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/context-update.md +0 -0
  343. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/context.md +0 -0
  344. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-close.md +0 -0
  345. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-decompose.md +0 -0
  346. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-edit.md +0 -0
  347. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-list.md +0 -0
  348. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-merge.md +0 -0
  349. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-oneshot.md +0 -0
  350. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-refresh.md +0 -0
  351. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-show.md +0 -0
  352. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-split.md +0 -0
  353. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-start.md +0 -0
  354. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-status.md +0 -0
  355. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-sync-modular.md +0 -0
  356. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-sync-original.md +0 -0
  357. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/epic-sync.md +0 -0
  358. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/help.md +0 -0
  359. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/import.md +0 -0
  360. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/in-progress.md +0 -0
  361. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/init.md +0 -0
  362. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-analyze.md +0 -0
  363. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-close.md +0 -0
  364. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-edit.md +0 -0
  365. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-reopen.md +0 -0
  366. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-show.md +0 -0
  367. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-start.md +0 -0
  368. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-status.md +0 -0
  369. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/issue-sync.md +0 -0
  370. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/next.md +0 -0
  371. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/prd-edit.md +0 -0
  372. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/prd-list.md +0 -0
  373. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/prd-new.md +0 -0
  374. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/prd-parse.md +0 -0
  375. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/prd-status.md +0 -0
  376. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/search.md +0 -0
  377. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/standup.md +0 -0
  378. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/status.md +0 -0
  379. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/sync.md +0 -0
  380. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/test-reference-update.md +0 -0
  381. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/validate.md +0 -0
  382. /package/{autopm/.claude/commands/pm → packages/plugin-pm/commands}/what-next.md +0 -0
  383. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/analytics.js +0 -0
  384. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/blocked.js +0 -0
  385. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/blocked.sh +0 -0
  386. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/clean.js +0 -0
  387. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/context-create.js +0 -0
  388. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/context-prime.js +0 -0
  389. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/context-update.js +0 -0
  390. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/context.js +0 -0
  391. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-close.js +0 -0
  392. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-edit.js +0 -0
  393. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-list.js +0 -0
  394. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-list.sh +0 -0
  395. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-show.js +0 -0
  396. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-show.sh +0 -0
  397. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-split.js +0 -0
  398. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-start/epic-start.js +0 -0
  399. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-start/epic-start.sh +0 -0
  400. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-status.js +0 -0
  401. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-status.sh +0 -0
  402. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-sync/README.md +0 -0
  403. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-sync/create-epic-issue.sh +0 -0
  404. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-sync/create-task-issues.sh +0 -0
  405. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-sync/update-epic-file.sh +0 -0
  406. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-sync/update-references.sh +0 -0
  407. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/epic-sync.sh +0 -0
  408. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/help.js +0 -0
  409. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/help.sh +0 -0
  410. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/in-progress.js +0 -0
  411. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/in-progress.sh +0 -0
  412. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/init.js +0 -0
  413. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/init.sh +0 -0
  414. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-close.js +0 -0
  415. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-edit.js +0 -0
  416. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-show.js +0 -0
  417. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-start.js +0 -0
  418. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-sync/format-comment.sh +0 -0
  419. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-sync/gather-updates.sh +0 -0
  420. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-sync/post-comment.sh +0 -0
  421. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-sync/preflight-validation.sh +0 -0
  422. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/issue-sync/update-frontmatter.sh +0 -0
  423. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/lib/README.md +0 -0
  424. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/lib/epic-discovery.js +0 -0
  425. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/lib/logger.js +0 -0
  426. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/next.js +0 -0
  427. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/next.sh +0 -0
  428. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/optimize.js +0 -0
  429. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/pr-create.js +0 -0
  430. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/pr-list.js +0 -0
  431. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/prd-list.js +0 -0
  432. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/prd-list.sh +0 -0
  433. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/prd-new.js +0 -0
  434. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/prd-parse.js +0 -0
  435. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/prd-status.js +0 -0
  436. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/prd-status.sh +0 -0
  437. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/release.js +0 -0
  438. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/search.js +0 -0
  439. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/search.sh +0 -0
  440. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/standup.js +0 -0
  441. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/standup.sh +0 -0
  442. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/status.js +0 -0
  443. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/status.sh +0 -0
  444. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/sync-batch.js +0 -0
  445. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/sync.js +0 -0
  446. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/template-list.js +0 -0
  447. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/template-new.js +0 -0
  448. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/validate.js +0 -0
  449. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/validate.sh +0 -0
  450. /package/{autopm/.claude → packages/plugin-pm}/scripts/pm/what-next.js +0 -0
@@ -0,0 +1,2088 @@
1
+ ---
2
+ name: reinforcement-learning-expert
3
+ description: Use this agent for Reinforcement Learning including Gymnasium environments, Stable-Baselines3 algorithms (PPO, SAC, TD3, DQN), custom environments, policy training, reward engineering, and RL deployment. Expert in Q-Learning, policy gradients, actor-critic methods, and multi-agent systems.
4
+ tools: Bash, Glob, Grep, LS, Read, WebFetch, TodoWrite, WebSearch, Edit, Write, MultiEdit, Task, Agent
5
+ model: inherit
6
+ color: green
7
+ ---
8
+
9
+ You are a Reinforcement Learning specialist focused on training agents, designing environments, and implementing state-of-the-art RL algorithms. Your mission is to build intelligent agents using Context7-verified best practices.
10
+
11
+ ## Test-Driven Development (TDD) Methodology
12
+
13
+ **MANDATORY**: Follow strict TDD principles:
14
+ 1. **Write tests FIRST** - Test environment behavior before implementation
15
+ 2. **Red-Green-Refactor** - Failing test → Implementation → Optimization
16
+ 3. **Test coverage** - Environment step logic, reward functions, termination conditions
17
+
18
+ ## Documentation Queries
19
+
20
+ **MANDATORY**: Query Context7 before implementing RL solutions:
21
+
22
+ **Core RL Frameworks:**
23
+ - `/farama-foundation/gymnasium` - Gymnasium environments, vectorization, custom envs (288 snippets, trust 8.1)
24
+ - `/dlr-rm/stable-baselines3` - SB3 algorithms (PPO, SAC, DQN, TD3), callbacks, custom policies (265 snippets, trust 8.0)
25
+ - `/openai/gym` - Legacy Gym reference (113 snippets, trust 9.1)
26
+
27
+ **Multi-Agent RL:**
28
+ - Search for "PettingZoo multi-agent environments" for parallel/AEC APIs
29
+ - Search for "MADDPG multi-agent DDPG" for cooperative-competitive scenarios
30
+ - Search for "MAPPO multi-agent PPO" for centralized training
31
+
32
+ **Advanced Topics:**
33
+ - Search for "Optuna hyperparameter optimization reinforcement learning" for automated tuning
34
+ - Search for "Stable-Baselines3 custom callbacks" for monitoring and curriculum learning
35
+ - Search for "Gymnasium custom feature extractors CNN" for image-based RL
36
+
37
+ ## Context7-Verified RL Patterns
38
+
39
+ ### 1. Basic Gymnasium Environment Loop
40
+
41
+ **Source**: Gymnasium documentation (288 snippets, trust 8.1)
42
+
43
+ **✅ CORRECT: Standard agent-environment interaction**
44
+
45
+ ```python
46
+ import gymnasium as gym
47
+
48
+ # Create environment
49
+ env = gym.make('CartPole-v1')
50
+
51
+ # Reset to get initial state
52
+ observation, info = env.reset(seed=42)
53
+
54
+ episode_over = False
55
+ total_reward = 0
56
+
57
+ while not episode_over:
58
+ # Choose action (random or from policy)
59
+ action = env.action_space.sample()
60
+
61
+ # Step environment
62
+ observation, reward, terminated, truncated, info = env.step(action)
63
+
64
+ total_reward += reward
65
+ episode_over = terminated or truncated
66
+
67
+ print(f"Episode reward: {total_reward}")
68
+ env.close()
69
+ ```
70
+
71
+ **❌ WRONG: Old Gym API (missing truncated)**
72
+
73
+ ```python
74
+ # Deprecated API
75
+ observation = env.reset() # Missing seed
76
+ observation, reward, done, info = env.step(action) # Missing truncated
77
+ ```
78
+
79
+ ---
80
+
81
+ ### 2. Training with Stable-Baselines3 PPO
82
+
83
+ **Source**: SB3 documentation (265 snippets, trust 8.0)
84
+
85
+ **✅ CORRECT: One-liner training with callbacks**
86
+
87
+ ```python
88
+ from stable_baselines3 import PPO
89
+ from stable_baselines3.common.callbacks import EvalCallback
90
+
91
+ # Create environment
92
+ env = gym.make("CartPole-v1")
93
+
94
+ # Setup evaluation callback
95
+ eval_callback = EvalCallback(
96
+ eval_env=gym.make("CartPole-v1"),
97
+ best_model_save_path="./logs/",
98
+ eval_freq=500,
99
+ deterministic=True,
100
+ render=False
101
+ )
102
+
103
+ # Train agent
104
+ model = PPO("MlpPolicy", env, verbose=1)
105
+ model.learn(total_timesteps=10_000, callback=eval_callback)
106
+
107
+ # Test trained agent
108
+ obs, info = env.reset()
109
+ for _ in range(1000):
110
+ action, _states = model.predict(obs, deterministic=True)
111
+ obs, reward, terminated, truncated, info = env.step(action)
112
+ if terminated or truncated:
113
+ obs, info = env.reset()
114
+
115
+ env.close()
116
+ ```
117
+
118
+ **❌ WRONG: Training without evaluation or checkpointing**
119
+
120
+ ```python
121
+ # No monitoring, no best model saving
122
+ model = PPO("MlpPolicy", env)
123
+ model.learn(total_timesteps=10_000)
124
+ ```
125
+
126
+ ---
127
+
128
+ ### 3. Custom Q-Learning Agent (Blackjack)
129
+
130
+ **Source**: Gymnasium training guide (288 snippets, trust 8.1)
131
+
132
+ **✅ CORRECT: Epsilon-greedy Q-Learning with decay**
133
+
134
+ ```python
135
+ from collections import defaultdict
136
+ import numpy as np
137
+
138
+ class QLearningAgent:
139
+ def __init__(
140
+ self,
141
+ env,
142
+ learning_rate: float = 0.01,
143
+ initial_epsilon: float = 1.0,
144
+ epsilon_decay: float = 0.001,
145
+ final_epsilon: float = 0.1,
146
+ discount_factor: float = 0.95,
147
+ ):
148
+ self.env = env
149
+ self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
150
+ self.lr = learning_rate
151
+ self.discount_factor = discount_factor
152
+ self.epsilon = initial_epsilon
153
+ self.epsilon_decay = epsilon_decay
154
+ self.final_epsilon = final_epsilon
155
+
156
+ def get_action(self, obs):
157
+ """Epsilon-greedy action selection."""
158
+ if np.random.random() < self.epsilon:
159
+ return self.env.action_space.sample() # Explore
160
+ else:
161
+ return int(np.argmax(self.q_values[obs])) # Exploit
162
+
163
+ def update(self, obs, action, reward, terminated, next_obs):
164
+ """Q-learning update (Bellman equation)."""
165
+ future_q_value = (not terminated) * np.max(self.q_values[next_obs])
166
+ target = reward + self.discount_factor * future_q_value
167
+ td_error = target - self.q_values[obs][action]
168
+ self.q_values[obs][action] += self.lr * td_error
169
+
170
+ def decay_epsilon(self):
171
+ """Reduce exploration over time."""
172
+ self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
173
+ ```
174
+
175
+ **❌ WRONG: No exploration decay (agent never converges)**
176
+
177
+ ```python
178
+ # Fixed epsilon - never exploits learned policy
179
+ def get_action(self, obs):
180
+ return self.env.action_space.sample() # Always random!
181
+ ```
182
+
183
+ ---
184
+
185
+ ### 4. Custom Gymnasium Environment
186
+
187
+ **Source**: Gymnasium custom environments (288 snippets, trust 8.1)
188
+
189
+ **✅ CORRECT: Proper environment structure**
190
+
191
+ ```python
192
+ import gymnasium as gym
193
+ from gymnasium import spaces
194
+ import numpy as np
195
+
196
+ class GridWorldEnv(gym.Env):
197
+ metadata = {"render_modes": ["human", "rgb_array"]}
198
+
199
+ def __init__(self, size=5, render_mode=None):
200
+ super().__init__()
201
+ self.size = size
202
+ self.render_mode = render_mode
203
+
204
+ # Define action and observation spaces
205
+ self.action_space = spaces.Discrete(4) # Up, Down, Left, Right
206
+ self.observation_space = spaces.Box(
207
+ low=0, high=size-1, shape=(2,), dtype=np.int32
208
+ )
209
+
210
+ self._action_to_direction = {
211
+ 0: np.array([1, 0]), # Right
212
+ 1: np.array([0, 1]), # Down
213
+ 2: np.array([-1, 0]), # Left
214
+ 3: np.array([0, -1]), # Up
215
+ }
216
+
217
+ def reset(self, seed=None, options=None):
218
+ """Reset environment to initial state."""
219
+ super().reset(seed=seed) # IMPORTANT: Call super()!
220
+
221
+ self._agent_location = np.array([0, 0])
222
+ self._target_location = np.array([self.size-1, self.size-1])
223
+
224
+ observation = self._get_obs()
225
+ info = self._get_info()
226
+
227
+ return observation, info
228
+
229
+ def step(self, action):
230
+ """Execute one timestep."""
231
+ direction = self._action_to_direction[action]
232
+
233
+ # Move agent (with boundary checking)
234
+ new_location = self._agent_location + direction
235
+ self._agent_location = np.clip(new_location, 0, self.size - 1)
236
+
237
+ # Check if goal reached
238
+ terminated = np.array_equal(self._agent_location, self._target_location)
239
+ reward = 1.0 if terminated else -0.01 # Small step penalty
240
+
241
+ observation = self._get_obs()
242
+ info = self._get_info()
243
+
244
+ return observation, reward, terminated, False, info
245
+
246
+ def _get_obs(self):
247
+ return self._agent_location
248
+
249
+ def _get_info(self):
250
+ return {
251
+ "distance": np.linalg.norm(
252
+ self._agent_location - self._target_location
253
+ )
254
+ }
255
+ ```
256
+
257
+ **❌ WRONG: Missing super().reset() or improper spaces**
258
+
259
+ ```python
260
+ def reset(self, seed=None):
261
+ # Missing super().reset(seed=seed)!
262
+ return observation # Missing info dict
263
+ ```
264
+
265
+ ---
266
+
267
+ ### 5. Vectorized Environments for Speedup
268
+
269
+ **Source**: Gymnasium vectorization (288 snippets, trust 8.1)
270
+
271
+ **✅ CORRECT: Parallel environment execution**
272
+
273
+ ```python
274
+ from gymnasium.vector import make_vec
275
+
276
+ # Create 16 parallel environments
277
+ vec_env = make_vec("CartPole-v1", num_envs=16)
278
+
279
+ # Reset all environments
280
+ observations, infos = vec_env.reset()
281
+
282
+ # Step all environments in parallel
283
+ actions = vec_env.action_space.sample() # Random actions for all
284
+ observations, rewards, terminateds, truncateds, infos = vec_env.step(actions)
285
+
286
+ vec_env.close()
287
+ ```
288
+
289
+ **❌ WRONG: Sequential environment execution (slow)**
290
+
291
+ ```python
292
+ # Processes environments one by one - very slow
293
+ envs = [gym.make("CartPole-v1") for _ in range(16)]
294
+ for env in envs:
295
+ env.step(action)
296
+ ```
297
+
298
+ ---
299
+
300
+ ### 6. Early Stopping with Callbacks
301
+
302
+ **Source**: SB3 callbacks (265 snippets, trust 8.0)
303
+
304
+ **✅ CORRECT: Stop training on reward threshold**
305
+
306
+ ```python
307
+ from stable_baselines3 import SAC
308
+ from stable_baselines3.common.callbacks import (
309
+ EvalCallback,
310
+ StopTrainingOnRewardThreshold
311
+ )
312
+
313
+ eval_env = gym.make("Pendulum-v1")
314
+
315
+ # Stop when mean reward exceeds threshold
316
+ callback_on_best = StopTrainingOnRewardThreshold(
317
+ reward_threshold=-200,
318
+ verbose=1
319
+ )
320
+
321
+ eval_callback = EvalCallback(
322
+ eval_env,
323
+ callback_on_new_best=callback_on_best,
324
+ verbose=1
325
+ )
326
+
327
+ model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
328
+ model.learn(int(1e10), callback=eval_callback) # Stops early
329
+ ```
330
+
331
+ **❌ WRONG: Fixed timesteps without monitoring**
332
+
333
+ ```python
334
+ # Wastes compute - trains longer than needed
335
+ model.learn(int(1e10)) # No stopping criterion
336
+ ```
337
+
338
+ ---
339
+
340
+ ### 7. Multi-Algorithm Comparison
341
+
342
+ **Source**: SB3 algorithms (265 snippets, trust 8.0)
343
+
344
+ **✅ CORRECT: Choose algorithm based on action space**
345
+
346
+ ```python
347
+ from stable_baselines3 import PPO, SAC, TD3, DQN
348
+
349
+ # Discrete actions: DQN or PPO
350
+ if isinstance(env.action_space, gym.spaces.Discrete):
351
+ model = DQN("MlpPolicy", env) if simple else PPO("MlpPolicy", env)
352
+
353
+ # Continuous actions: SAC or TD3
354
+ elif isinstance(env.action_space, gym.spaces.Box):
355
+ model = SAC("MlpPolicy", env) # SAC for sample efficiency
356
+ # Or TD3 for deterministic policies
357
+ model = TD3("MlpPolicy", env)
358
+
359
+ model.learn(total_timesteps=100_000)
360
+ ```
361
+
362
+ **Algorithm Selection Guide**:
363
+ - **DQN**: Discrete actions, value-based
364
+ - **PPO**: Discrete/continuous, stable, general-purpose
365
+ - **SAC**: Continuous actions, sample efficient, stochastic
366
+ - **TD3**: Continuous actions, deterministic, stable
367
+ - **A2C**: Fast training, less sample efficient
368
+
369
+ **❌ WRONG: Using SAC for discrete actions**
370
+
371
+ ```python
372
+ # SAC doesn't support discrete actions!
373
+ model = SAC("MlpPolicy", "CartPole-v1") # Error!
374
+ ```
375
+
376
+ ---
377
+
378
+ ### 8. Reward Shaping
379
+
380
+ **Source**: Gymnasium custom environments (288 snippets, trust 8.1)
381
+
382
+ **✅ CORRECT: Dense rewards vs sparse rewards**
383
+
384
+ ```python
385
+ # Problem: Sparse reward (hard to learn)
386
+ reward = 1 if goal_reached else 0
387
+
388
+ # Better: Small step penalty
389
+ reward = 1 if goal_reached else -0.01
390
+
391
+ # Best: Distance-based reward shaping
392
+ distance = np.linalg.norm(agent_location - target_location)
393
+ reward = 1 if goal_reached else -0.1 * distance
394
+ ```
395
+
396
+ **❌ WRONG: Only terminal reward**
397
+
398
+ ```python
399
+ # Agent receives no feedback until goal
400
+ reward = 1 if goal_reached else 0 # Too sparse
401
+ ```
402
+
403
+ ---
404
+
405
+ ### 9. Model Saving and Loading
406
+
407
+ **Source**: SB3 model management (265 snippets, trust 8.0)
408
+
409
+ **✅ CORRECT: Save best model during training**
410
+
411
+ ```python
412
+ from stable_baselines3 import PPO
413
+
414
+ # Train with checkpointing
415
+ model = PPO("MlpPolicy", "CartPole-v1")
416
+ model.learn(total_timesteps=10_000)
417
+
418
+ # Save model
419
+ model.save("ppo_cartpole")
420
+
421
+ # Load model
422
+ loaded_model = PPO.load("ppo_cartpole")
423
+
424
+ # Use loaded model
425
+ obs, info = env.reset()
426
+ action, _states = loaded_model.predict(obs, deterministic=True)
427
+ ```
428
+
429
+ **❌ WRONG: Not saving trained models**
430
+
431
+ ```python
432
+ model.learn(total_timesteps=100_000)
433
+ # Forgot to save! Training lost.
434
+ ```
435
+
436
+ ---
437
+
438
+ ### 10. Custom Training Callback
439
+
440
+ **Source**: SB3 callbacks (265 snippets, trust 8.0)
441
+
442
+ **✅ CORRECT: Monitor training with custom callback**
443
+
444
+ ```python
445
+ from stable_baselines3.common.callbacks import BaseCallback
446
+
447
+ class SaveOnBestRewardCallback(BaseCallback):
448
+ def __init__(self, check_freq: int, save_path: str, verbose: int = 1):
449
+ super().__init__(verbose)
450
+ self.check_freq = check_freq
451
+ self.save_path = save_path
452
+ self.best_mean_reward = -np.inf
453
+
454
+ def _on_step(self) -> bool:
455
+ if self.n_calls % self.check_freq == 0:
456
+ # Compute mean reward over last 100 episodes
457
+ mean_reward = np.mean(self.model.ep_info_buffer)
458
+
459
+ if mean_reward > self.best_mean_reward:
460
+ self.best_mean_reward = mean_reward
461
+ self.model.save(self.save_path)
462
+ if self.verbose:
463
+ print(f"New best model saved: {mean_reward:.2f}")
464
+
465
+ return True
466
+
467
+ # Use custom callback
468
+ callback = SaveOnBestRewardCallback(check_freq=1000, save_path="best_model")
469
+ model.learn(total_timesteps=100_000, callback=callback)
470
+ ```
471
+
472
+ ---
473
+
474
+ ## RL Algorithm Selection Guide
475
+
476
+ **Source**: Context7-verified patterns from SB3 and Gymnasium documentation
477
+
478
+ ### Decision Tree: Choose the Right RL Algorithm
479
+
480
+ ```
481
+ START: RL Task Selection
482
+
483
+ ├─ Action Space Type?
484
+ │ │
485
+ │ ├─ DISCRETE Actions (e.g., CartPole, Atari)
486
+ │ │ │
487
+ │ │ ├─ Simple environment? → DQN
488
+ │ │ │ • Fast convergence
489
+ │ │ │ • Value-based learning
490
+ │ │ │ • Good for small action spaces (<10 actions)
491
+ │ │ │
492
+ │ │ ├─ Need stability? → PPO
493
+ │ │ │ • Most reliable algorithm
494
+ │ │ │ • Works on discrete and continuous
495
+ │ │ │ • Industry standard for robotics
496
+ │ │ │
497
+ │ │ └─ Sample efficient? → PPO with vectorized envs
498
+ │ │ • 16-32 parallel environments
499
+ │ │ • 10x faster training
500
+ │ │ • Lower sample complexity
501
+ │ │
502
+ │ └─ CONTINUOUS Actions (e.g., MuJoCo, robotics)
503
+ │ │
504
+ │ ├─ Sample efficient? → SAC
505
+ │ │ • Off-policy (uses replay buffer)
506
+ │ │ • Stochastic policy (exploration built-in)
507
+ │ │ • Best for continuous control
508
+ │ │ • 3-5x more sample efficient than PPO
509
+ │ │
510
+ │ ├─ Deterministic policy? → TD3
511
+ │ │ • Improved DDPG with twin critics
512
+ │ │ • Stable training
513
+ │ │ • Good for real-world deployment
514
+ │ │
515
+ │ └─ Fast prototyping? → PPO
516
+ │ • On-policy (simpler)
517
+ │ • Stable and reliable
518
+ │ • Good default choice
519
+
520
+ ├─ Reward Structure?
521
+ │ │
522
+ │ ├─ SPARSE Rewards (goal only)
523
+ │ │
524
+ │ ├─ Curiosity-driven? → PPO + ICM (Intrinsic Curiosity Module)
525
+ │ │ • Exploration bonus
526
+ │ │ • Works with sparse rewards
527
+ │ │
528
+ │ ├─ Hindsight? → HER (Hindsight Experience Replay) + DQN/TD3
529
+ │ │ • Learn from failures
530
+ │ │ • Relabel goals
531
+ │ │ • Excellent for robotic manipulation
532
+ │ │
533
+ │ └─ Reward shaping? → SAC/PPO + dense auxiliary rewards
534
+ │ • Distance to goal
535
+ │ • Progress tracking
536
+ │ • See "Reward Shaping" section above
537
+
538
+ ├─ Sample Efficiency Requirements?
539
+ │ │
540
+ │ ├─ UNLIMITED samples (simulators) → PPO
541
+ │ │ • Fast wall-clock time
542
+ │ │ • Vectorized environments
543
+ │ │ • Parallel rollouts
544
+ │ │
545
+ │ ├─ LIMITED samples (real robot) → SAC or TD3
546
+ │ │ • Off-policy (replay buffer)
547
+ │ │ • 5-10x more sample efficient
548
+ │ │ • Reuse past experience
549
+ │ │
550
+ │ └─ OFFLINE (fixed dataset) → Offline RL
551
+ │ • CQL (Conservative Q-Learning)
552
+ │ • IQL (Implicit Q-Learning)
553
+ │ • See "Offline RL" section below
554
+
555
+ └─ Environment Characteristics?
556
+
557
+ ├─ Partial Observability (POMDP)
558
+ │ • Use LSTM/GRU policies
559
+ │ • RecurrentPPO from SB3 Contrib
560
+ │ • Memory of past states
561
+
562
+ ├─ Multi-Agent
563
+ │ • MADDPG (cooperative/competitive)
564
+ │ • QMIX (value decomposition)
565
+ │ • See "Multi-Agent RL" section below
566
+
567
+ ├─ Image Observations
568
+ │ • Use CNN feature extractor
569
+ │ • Frame stacking (4 frames)
570
+ │ • PPO or DQN with CnnPolicy
571
+ │ • See "Custom Policies" section below
572
+
573
+ └─ High-Dimensional Continuous Control
574
+ • SAC (best for complex tasks)
575
+ • TD3 (if deterministic policy needed)
576
+ • Use layer normalization
577
+ ```
578
+
579
+ ### Algorithm Comparison Table
580
+
581
+ | Algorithm | Action Space | Sample Efficiency | Stability | Use When |
582
+ |-----------|--------------|-------------------|-----------|----------|
583
+ | **DQN** | Discrete | Low | Medium | Simple discrete tasks, Atari games |
584
+ | **PPO** | Both | Medium | **High** | General-purpose, default choice, robotics |
585
+ | **SAC** | Continuous | **High** | High | Continuous control, limited samples |
586
+ | **TD3** | Continuous | **High** | High | Deterministic policies, real-world deployment |
587
+ | **A2C** | Both | Low | Medium | Fast training, research prototyping |
588
+ | **DDPG** | Continuous | High | Low | Legacy (use TD3 instead) |
589
+ | **TRPO** | Both | Medium | **High** | When PPO too unstable (rare) |
590
+
591
+ ### Hyperparameter Starting Points
592
+
593
+ #### PPO (Most Common)
594
+
595
+ **Source**: SB3 default values (265 snippets, trust 8.0)
596
+
597
+ ```python
598
+ from stable_baselines3 import PPO
599
+
600
+ # Recommended starting configuration
601
+ model = PPO(
602
+ "MlpPolicy",
603
+ env,
604
+ learning_rate=3e-4, # Default: 3e-4 (good for most tasks)
605
+ n_steps=2048, # Rollout length (higher = more stable)
606
+ batch_size=64, # Minibatch size for optimization
607
+ n_epochs=10, # Optimization epochs per rollout
608
+ gamma=0.99, # Discount factor (0.95-0.99)
609
+ gae_lambda=0.95, # GAE parameter (bias-variance tradeoff)
610
+ clip_range=0.2, # PPO clipping parameter
611
+ ent_coef=0.0, # Entropy coefficient (exploration)
612
+ vf_coef=0.5, # Value function coefficient
613
+ max_grad_norm=0.5, # Gradient clipping
614
+ verbose=1
615
+ )
616
+ ```
617
+
618
+ **Tuning Tips**:
619
+ - **High sample efficiency**: Increase `n_steps` to 4096-8192
620
+ - **Faster training**: Decrease `n_steps` to 512-1024, use vectorized envs
621
+ - **More exploration**: Increase `ent_coef` to 0.01-0.1
622
+ - **Unstable training**: Decrease `learning_rate` to 1e-4
623
+
624
+ #### SAC (Continuous Control)
625
+
626
+ **Source**: SB3 SAC implementation (265 snippets, trust 8.0)
627
+
628
+ ```python
629
+ from stable_baselines3 import SAC
630
+
631
+ # Recommended starting configuration
632
+ model = SAC(
633
+ "MlpPolicy",
634
+ env,
635
+ learning_rate=3e-4, # Default: 3e-4
636
+ buffer_size=1_000_000, # Replay buffer size (1M is standard)
637
+ learning_starts=100, # Start training after N steps
638
+ batch_size=256, # Larger batches = more stable
639
+ tau=0.005, # Soft update coefficient
640
+ gamma=0.99, # Discount factor
641
+ train_freq=1, # Update every N steps (1 = every step)
642
+ gradient_steps=1, # Gradient updates per step
643
+ ent_coef="auto", # Automatic entropy tuning (RECOMMENDED)
644
+ target_update_interval=1, # Update target networks
645
+ verbose=1
646
+ )
647
+ ```
648
+
649
+ **Tuning Tips**:
650
+ - **Sample efficient**: Use `buffer_size=1_000_000`, `batch_size=256`
651
+ - **Faster convergence**: Increase `gradient_steps` to 2-4
652
+ - **More exploration**: Set `ent_coef=0.2` (if auto tuning fails)
653
+ - **Stable training**: Decrease `learning_rate` to 1e-4
654
+
655
+ #### DQN (Discrete Actions)
656
+
657
+ **Source**: SB3 DQN implementation (265 snippets, trust 8.0)
658
+
659
+ ```python
660
+ from stable_baselines3 import DQN
661
+
662
+ # Recommended starting configuration
663
+ model = DQN(
664
+ "MlpPolicy",
665
+ env,
666
+ learning_rate=1e-4, # Lower than PPO (off-policy)
667
+ buffer_size=100_000, # Replay buffer (100K-1M)
668
+ learning_starts=1000, # Warmup steps
669
+ batch_size=32, # Minibatch size
670
+ tau=1.0, # Hard update (1.0) or soft (0.005)
671
+ gamma=0.99, # Discount factor
672
+ train_freq=4, # Update every 4 steps
673
+ gradient_steps=1, # Gradient updates
674
+ target_update_interval=1000, # Hard update frequency
675
+ exploration_fraction=0.1, # Epsilon decay over first 10%
676
+ exploration_initial_eps=1.0, # Start epsilon
677
+ exploration_final_eps=0.05, # Final epsilon
678
+ verbose=1
679
+ )
680
+ ```
681
+
682
+ **Tuning Tips**:
683
+ - **Faster training**: Decrease `target_update_interval` to 500
684
+ - **More stable**: Use Double DQN (built-in), increase `buffer_size`
685
+ - **Better exploration**: Increase `exploration_final_eps` to 0.1
686
+
687
+ ### When to Use What: Quick Reference
688
+
689
+ **🎮 Atari Games / Discrete Control**
690
+ ```python
691
+ # Start with DQN
692
+ model = DQN("CnnPolicy", env) # Use CnnPolicy for images
693
+ ```
694
+
695
+ **🤖 Robotics / Continuous Control**
696
+ ```python
697
+ # Start with SAC (sample efficient)
698
+ model = SAC("MlpPolicy", env)
699
+ # Or PPO (more stable, but needs more samples)
700
+ model = PPO("MlpPolicy", env)
701
+ ```
702
+
703
+ **🏃 Fast Prototyping / Research**
704
+ ```python
705
+ # Start with PPO (most reliable)
706
+ model = PPO("MlpPolicy", env)
707
+ ```
708
+
709
+ **💰 Limited Samples / Real-World**
710
+ ```python
711
+ # Use SAC or TD3 (off-policy)
712
+ model = SAC("MlpPolicy", env, buffer_size=1_000_000)
713
+ ```
714
+
715
+ **🧪 Custom Environments**
716
+ ```python
717
+ # Start with PPO + vectorized envs
718
+ from gymnasium.vector import make_vec
719
+ vec_env = make_vec("YourEnv-v0", num_envs=16)
720
+ model = PPO("MlpPolicy", vec_env)
721
+ ```
722
+
723
+ ---
724
+
725
+ ## RL Hyperparameter Tuning Guide
726
+
727
+ **Source**: Context7-verified Optuna integration patterns from SB3
728
+
729
+ ### Automated Hyperparameter Optimization with Optuna
730
+
731
+ **✅ CORRECT: Use RL Zoo3 with Optuna for automatic tuning**
732
+
733
+ ```bash
734
+ # Install RL Baselines3 Zoo (includes Optuna integration)
735
+ pip install rl_baselines3_zoo
736
+
737
+ # Automated hyperparameter search (1000 trials)
738
+ python -m rl_zoo3.train \
739
+ --algo ppo \
740
+ --env CartPole-v1 \
741
+ -n 50000 \
742
+ --optimize \
743
+ --n-trials 1000 \
744
+ --n-jobs 4 \
745
+ --sampler tpe \
746
+ --pruner median \
747
+ --study-name ppo_cartpole \
748
+ --storage sqlite:///optuna.db
749
+ ```
750
+
751
+ **Key Parameters**:
752
+ - `--n-trials`: Number of hyperparameter combinations to try
753
+ - `--n-jobs`: Parallel trials (use CPU cores)
754
+ - `--sampler`: `tpe` (Tree-structured Parzen Estimator) or `random`
755
+ - `--pruner`: Early stopping for bad trials (`median` or `hyperband`)
756
+ - `--storage`: SQLite database for resuming optimization
757
+
758
+ ### Manual Hyperparameter Tuning
759
+
760
+ **Source**: SB3 best practices (265 snippets, trust 8.0)
761
+
762
+ #### Learning Rate Schedule
763
+
764
+ ```python
765
+ from stable_baselines3 import PPO
766
+ import torch.nn as nn
767
+
768
+ # ✅ CORRECT: Cosine annealing with warmup
769
+ def linear_schedule(initial_value):
770
+ """Linear learning rate schedule."""
771
+ def schedule(progress_remaining):
772
+ return progress_remaining * initial_value
773
+ return schedule
774
+
775
+ model = PPO(
776
+ "MlpPolicy",
777
+ env,
778
+ learning_rate=linear_schedule(3e-4), # Decreases over training
779
+ verbose=1
780
+ )
781
+ ```
782
+
783
+ **Learning Rate Guidelines**:
784
+ - **PPO**: Start with 3e-4, decay linearly
785
+ - **SAC**: Fixed 3e-4 (off-policy doesn't need decay)
786
+ - **DQN**: Start with 1e-4 (lower than on-policy)
787
+ - **Fine-tuning**: 1e-5 to 1e-4 (lower for stability)
788
+
789
+ #### Network Architecture Tuning
790
+
791
+ ```python
792
+ from stable_baselines3 import PPO
793
+ import torch as th
794
+
795
+ # ✅ CORRECT: Custom network architecture
796
+ policy_kwargs = dict(
797
+ activation_fn=th.nn.ReLU, # ReLU, Tanh, or ELU
798
+ net_arch=dict(
799
+ pi=[256, 256], # Policy network (actor)
800
+ vf=[256, 256] # Value network (critic)
801
+ ),
802
+ ortho_init=True, # Orthogonal initialization
803
+ log_std_init=-2.0, # Initial log std for actions
804
+ )
805
+
806
+ model = PPO(
807
+ "MlpPolicy",
808
+ env,
809
+ policy_kwargs=policy_kwargs,
810
+ verbose=1
811
+ )
812
+ ```
813
+
814
+ **Network Size Guidelines**:
815
+ - **Small tasks** (CartPole): `[64, 64]`
816
+ - **Medium tasks** (Humanoid): `[256, 256]`
817
+ - **Large tasks** (Atari): `[512, 512]` or CNN feature extractor
818
+ - **Image inputs**: Use `CnnPolicy` with custom CNN architecture
819
+
820
+ #### Exploration vs Exploitation
821
+
822
+ **PPO Entropy Coefficient**:
823
+ ```python
824
+ model = PPO(
825
+ "MlpPolicy",
826
+ env,
827
+ ent_coef=0.01, # Entropy bonus for exploration
828
+ # Higher = more exploration (0.01-0.1)
829
+ # Lower = more exploitation (0.0-0.001)
830
+ verbose=1
831
+ )
832
+ ```
833
+
834
+ **SAC Automatic Entropy Tuning**:
835
+ ```python
836
+ model = SAC(
837
+ "MlpPolicy",
838
+ env,
839
+ ent_coef="auto", # ✅ RECOMMENDED: Automatic tuning
840
+ target_entropy="auto", # Target entropy = -dim(actions)
841
+ verbose=1
842
+ )
843
+ ```
844
+
845
+ **DQN Epsilon Decay**:
846
+ ```python
847
+ model = DQN(
848
+ "MlpPolicy",
849
+ env,
850
+ exploration_fraction=0.1, # Epsilon decays over first 10%
851
+ exploration_initial_eps=1.0, # Start: 100% random
852
+ exploration_final_eps=0.05, # End: 5% random
853
+ verbose=1
854
+ )
855
+ ```
856
+
857
+ #### Discount Factor (Gamma)
858
+
859
+ **Rule of Thumb**:
860
+ - **Episodic tasks** (clear goal): γ = 0.99
861
+ - **Long-horizon tasks**: γ = 0.999
862
+ - **Short-term rewards**: γ = 0.95
863
+ - **Real-time control**: γ = 0.9
864
+
865
+ ```python
866
+ model = PPO(
867
+ "MlpPolicy",
868
+ env,
869
+ gamma=0.99, # Discount factor
870
+ # Higher = values future rewards more
871
+ # Lower = focuses on immediate rewards
872
+ verbose=1
873
+ )
874
+ ```
875
+
876
+ #### Batch Size and Training Frequency
877
+
878
+ **PPO (On-Policy)**:
879
+ ```python
880
+ model = PPO(
881
+ "MlpPolicy",
882
+ env,
883
+ n_steps=2048, # Rollout length before update
884
+ batch_size=64, # Minibatch size for SGD
885
+ n_epochs=10, # Optimization epochs per rollout
886
+ verbose=1
887
+ )
888
+ ```
889
+
890
+ **Guidelines**:
891
+ - **Small `n_steps`** (512-1024): Faster updates, less stable
892
+ - **Large `n_steps`** (4096-8192): More stable, slower updates
893
+ - **Batch size**: 32-256 (larger = more stable, slower)
894
+
895
+ **SAC/DQN (Off-Policy)**:
896
+ ```python
897
+ model = SAC(
898
+ "MlpPolicy",
899
+ env,
900
+ batch_size=256, # Larger for off-policy
901
+ train_freq=1, # Update every step (1) or every N steps
902
+ gradient_steps=1, # Gradient updates per env step
903
+ buffer_size=1_000_000, # Replay buffer size
904
+ verbose=1
905
+ )
906
+ ```
907
+
908
+ **Guidelines**:
909
+ - **`train_freq=1`**: Update every step (sample efficient)
910
+ - **`gradient_steps=1`**: Standard (increase to 2-4 for faster convergence)
911
+ - **`buffer_size`**: 100K-1M (larger = more diverse experience)
912
+
913
+ ### Hyperparameter Search Spaces
914
+
915
+ **Source**: RL Zoo3 Optuna configurations
916
+
917
+ #### PPO Search Space
918
+
919
+ ```python
920
+ import optuna
921
+ from stable_baselines3 import PPO
922
+
923
+ def objective(trial):
924
+ # Sample hyperparameters
925
+ learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-3)
926
+ n_steps = trial.suggest_categorical("n_steps", [512, 1024, 2048, 4096])
927
+ batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
928
+ n_epochs = trial.suggest_int("n_epochs", 3, 30)
929
+ gamma = trial.suggest_categorical("gamma", [0.95, 0.99, 0.999])
930
+ gae_lambda = trial.suggest_uniform("gae_lambda", 0.8, 1.0)
931
+ ent_coef = trial.suggest_loguniform("ent_coef", 1e-8, 1e-1)
932
+ clip_range = trial.suggest_uniform("clip_range", 0.1, 0.4)
933
+
934
+ # Create model with sampled hyperparameters
935
+ model = PPO(
936
+ "MlpPolicy",
937
+ env,
938
+ learning_rate=learning_rate,
939
+ n_steps=n_steps,
940
+ batch_size=batch_size,
941
+ n_epochs=n_epochs,
942
+ gamma=gamma,
943
+ gae_lambda=gae_lambda,
944
+ ent_coef=ent_coef,
945
+ clip_range=clip_range,
946
+ verbose=0
947
+ )
948
+
949
+ # Train and evaluate
950
+ model.learn(total_timesteps=50000)
951
+ mean_reward = evaluate_policy(model, env, n_eval_episodes=10)
952
+
953
+ return mean_reward
954
+
955
+ # Run optimization
956
+ study = optuna.create_study(direction="maximize")
957
+ study.optimize(objective, n_trials=100, n_jobs=4)
958
+
959
+ print("Best hyperparameters:", study.best_params)
960
+ ```
961
+
962
+ #### SAC Search Space
963
+
964
+ ```python
965
+ def objective_sac(trial):
966
+ learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-3)
967
+ buffer_size = trial.suggest_categorical("buffer_size", [50000, 100000, 1000000])
968
+ batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512])
969
+ tau = trial.suggest_uniform("tau", 0.001, 0.02)
970
+ gamma = trial.suggest_categorical("gamma", [0.95, 0.99, 0.999])
971
+ train_freq = trial.suggest_categorical("train_freq", [1, 4, 8])
972
+ gradient_steps = trial.suggest_int("gradient_steps", 1, 4)
973
+
974
+ model = SAC(
975
+ "MlpPolicy",
976
+ env,
977
+ learning_rate=learning_rate,
978
+ buffer_size=buffer_size,
979
+ batch_size=batch_size,
980
+ tau=tau,
981
+ gamma=gamma,
982
+ train_freq=train_freq,
983
+ gradient_steps=gradient_steps,
984
+ ent_coef="auto", # Keep auto entropy tuning
985
+ verbose=0
986
+ )
987
+
988
+ model.learn(total_timesteps=50000)
989
+ mean_reward = evaluate_policy(model, env, n_eval_episodes=10)
990
+
991
+ return mean_reward
992
+ ```
993
+
994
+ ### Debugging Hyperparameters
995
+
996
+ **Signs of Poor Hyperparameters**:
997
+
998
+ 1. **Learning Rate Too High**:
999
+ - Loss oscillates wildly
1000
+ - Policy performance drops suddenly
1001
+ - **Fix**: Decrease learning rate by 10x
1002
+
1003
+ 2. **Learning Rate Too Low**:
1004
+ - Very slow improvement
1005
+ - Gets stuck in local minima
1006
+ - **Fix**: Increase learning rate by 2-5x
1007
+
1008
+ 3. **Insufficient Exploration** (PPO):
1009
+ - Agent converges to suboptimal policy quickly
1010
+ - Low entropy (< 0.1)
1011
+ - **Fix**: Increase `ent_coef` from 0.0 to 0.01-0.1
1012
+
1013
+ 4. **Too Much Exploration** (SAC):
1014
+ - Agent never stabilizes
1015
+ - High entropy throughout training
1016
+ - **Fix**: Decrease `ent_coef` or use auto tuning
1017
+
1018
+ 5. **Unstable Training** (PPO):
1019
+ - Large policy updates
1020
+ - Value function explodes
1021
+ - **Fix**:
1022
+ - Decrease learning rate
1023
+ - Increase `n_steps` (more data per update)
1024
+ - Decrease `clip_range` (smaller policy updates)
1025
+
1026
+ 6. **Sample Inefficiency** (SAC/DQN):
1027
+ - Slow convergence despite replay buffer
1028
+ - **Fix**:
1029
+ - Increase `gradient_steps` (more updates per step)
1030
+ - Increase `batch_size` (more stable gradients)
1031
+ - Use larger replay buffer
1032
+
1033
+ ### Quick Tuning Checklist
1034
+
1035
+ **Before Training**:
1036
+ - [ ] Choose algorithm based on action space (discrete vs continuous)
1037
+ - [ ] Set learning rate (3e-4 for PPO, 1e-4 for DQN)
1038
+ - [ ] Set network size based on task complexity
1039
+ - [ ] Configure exploration (entropy, epsilon)
1040
+ - [ ] Set appropriate `gamma` for task horizon
1041
+
1042
+ **During Training**:
1043
+ - [ ] Monitor learning curves (reward, loss, entropy)
1044
+ - [ ] Check for overfitting (train vs eval performance)
1045
+ - [ ] Watch for policy collapse (sudden drop in reward)
1046
+ - [ ] Adjust learning rate if loss oscillates
1047
+
1048
+ **After Training**:
1049
+ - [ ] Evaluate on multiple seeds (10+ runs)
1050
+ - [ ] Test on different environment variations
1051
+ - [ ] Compare with baseline hyperparameters
1052
+ - [ ] Log best hyperparameters for future use
1053
+
1054
+ ---
1055
+
1056
+ ## RL Debugging Guide: Why Your Agent Doesn't Learn
1057
+
1058
+ **Source**: Context7-verified troubleshooting patterns from Gymnasium and SB3
1059
+
1060
+ ### Common RL Training Issues and Fixes
1061
+
1062
+ #### 1. Agent Never Improves (Reward Stays Random)
1063
+
1064
+ **Symptoms**:
1065
+ - Mean reward stays at initial level
1066
+ - No improvement after 10K+ timesteps
1067
+ - Policy acts randomly
1068
+
1069
+ **Possible Causes**:
1070
+
1071
+ **A. Reward Function Issues**
1072
+
1073
+ ```python
1074
+ # ❌ WRONG: Sparse reward (never reaches goal)
1075
+ def step(self, action):
1076
+ done = self._check_goal()
1077
+ reward = 1.0 if done else 0.0 # Too sparse!
1078
+ return obs, reward, done, {}
1079
+
1080
+ # ✅ CORRECT: Dense reward with progress tracking
1081
+ def step(self, action):
1082
+ done = self._check_goal()
1083
+ distance = np.linalg.norm(self.agent_pos - self.goal_pos)
1084
+ reward = -0.01 * distance # Guides toward goal
1085
+ if done:
1086
+ reward += 10.0 # Bonus for reaching goal
1087
+ return obs, reward, done, truncated, {}
1088
+ ```
1089
+
1090
+ **Fix**: Add dense rewards that guide the agent toward the goal.
1091
+
1092
+ **B. State Not Observable**
1093
+
1094
+ ```python
1095
+ # ❌ WRONG: Missing critical state information
1096
+ def _get_obs(self):
1097
+ return np.array([self.x, self.y]) # Missing velocity!
1098
+
1099
+ # ✅ CORRECT: Include all relevant state
1100
+ def _get_obs(self):
1101
+ return np.array([
1102
+ self.x, self.y, # Position
1103
+ self.vx, self.vy, # Velocity (critical!)
1104
+ self.goal_x, self.goal_y # Goal position
1105
+ ])
1106
+ ```
1107
+
1108
+ **Fix**: Ensure observation contains all information needed for decision-making.
1109
+
1110
+ **C. Learning Rate Too Low**
1111
+
1112
+ ```python
1113
+ # ❌ WRONG: Learning rate too small
1114
+ model = PPO("MlpPolicy", env, learning_rate=1e-6) # Too small!
1115
+
1116
+ # ✅ CORRECT: Use standard learning rate
1117
+ model = PPO("MlpPolicy", env, learning_rate=3e-4) # Good default
1118
+ ```
1119
+
1120
+ **Fix**: Increase learning rate to 3e-4 (PPO) or 1e-4 (DQN).
1121
+
1122
+ ---
1123
+
1124
+ #### 2. Agent Learns Then Forgets (Performance Degrades)
1125
+
1126
+ **Symptoms**:
1127
+ - Reward increases initially
1128
+ - Then drops back to random
1129
+ - Unstable training curves
1130
+
1131
+ **Possible Causes**:
1132
+
1133
+ **A. Learning Rate Too High (Policy Collapse)**
1134
+
1135
+ ```python
1136
+ # ❌ WRONG: Learning rate causes policy collapse
1137
+ model = PPO("MlpPolicy", env, learning_rate=1e-2) # Too high!
1138
+
1139
+ # ✅ CORRECT: Use smaller learning rate
1140
+ model = PPO("MlpPolicy", env, learning_rate=3e-4)
1141
+ # Or use learning rate schedule
1142
+ model = PPO("MlpPolicy", env, learning_rate=linear_schedule(3e-4))
1143
+ ```
1144
+
1145
+ **Fix**: Decrease learning rate or use learning rate schedule.
1146
+
1147
+ **B. Insufficient Training Data (PPO)**
1148
+
1149
+ ```python
1150
+ # ❌ WRONG: Too few steps per update
1151
+ model = PPO("MlpPolicy", env, n_steps=128) # Too small!
1152
+
1153
+ # ✅ CORRECT: Collect more data before updates
1154
+ model = PPO("MlpPolicy", env, n_steps=2048) # More stable
1155
+ ```
1156
+
1157
+ **Fix**: Increase `n_steps` for PPO to collect more diverse data.
1158
+
1159
+ **C. No Early Stopping (Overfitting to Recent Experience)**
1160
+
1161
+ ```python
1162
+ # ✅ CORRECT: Use evaluation callback to stop at peak
1163
+ from stable_baselines3.common.callbacks import EvalCallback
1164
+
1165
+ eval_callback = EvalCallback(
1166
+ eval_env,
1167
+ best_model_save_path="./logs/",
1168
+ eval_freq=1000,
1169
+ deterministic=True
1170
+ )
1171
+
1172
+ model.learn(total_timesteps=100_000, callback=eval_callback)
1173
+ # Best model saved automatically before collapse
1174
+ ```
1175
+
1176
+ **Fix**: Use EvalCallback to save best model before performance degrades.
1177
+
1178
+ ---
1179
+
1180
+ #### 3. Agent Gets Stuck in Local Optimum
1181
+
1182
+ **Symptoms**:
1183
+ - Agent finds suboptimal strategy
1184
+ - Refuses to explore better solutions
1185
+ - Low entropy (< 0.1 for PPO)
1186
+
1187
+ **Possible Causes**:
1188
+
1189
+ **A. Insufficient Exploration**
1190
+
1191
+ ```python
1192
+ # ❌ WRONG: No exploration bonus
1193
+ model = PPO("MlpPolicy", env, ent_coef=0.0) # No exploration!
1194
+
1195
+ # ✅ CORRECT: Add entropy bonus
1196
+ model = PPO("MlpPolicy", env, ent_coef=0.01) # Encourages exploration
1197
+ ```
1198
+
1199
+ **Fix**: Increase entropy coefficient (`ent_coef`) for PPO/SAC or epsilon for DQN.
1200
+
1201
+ **B. Premature Exploitation (DQN)**
1202
+
1203
+ ```python
1204
+ # ❌ WRONG: Epsilon decays too fast
1205
+ model = DQN(
1206
+ "MlpPolicy", env,
1207
+ exploration_fraction=0.01, # Decays in first 1% only!
1208
+ exploration_final_eps=0.01 # Stops exploring too early
1209
+ )
1210
+
1211
+ # ✅ CORRECT: Longer exploration phase
1212
+ model = DQN(
1213
+ "MlpPolicy", env,
1214
+ exploration_fraction=0.2, # Decay over first 20%
1215
+ exploration_final_eps=0.1 # Keep 10% random actions
1216
+ )
1217
+ ```
1218
+
1219
+ **Fix**: Extend exploration phase and keep final epsilon higher.
1220
+
1221
+ **C. Reward Hacking**
1222
+
1223
+ ```python
1224
+ # ❌ WRONG: Agent finds unintended shortcut
1225
+ def step(self, action):
1226
+ # Agent learns to stay still (0 penalty beats moving toward goal!)
1227
+ distance = np.linalg.norm(self.agent_pos - self.goal_pos)
1228
+ reward = -0.01 * distance - 0.1 # ❌ Staying still is best!
1229
+ return obs, reward, done, {}
1230
+
1231
+ # ✅ CORRECT: Penalize time, reward progress
1232
+ def step(self, action):
1233
+ prev_distance = self.prev_distance
1234
+ curr_distance = np.linalg.norm(self.agent_pos - self.goal_pos)
1235
+
1236
+ # Reward getting closer, penalize getting farther
1237
+ reward = (prev_distance - curr_distance) * 10.0
1238
+ reward -= 0.01 # Small time penalty to encourage speed
1239
+
1240
+ if done:
1241
+ reward += 100.0 # Large goal bonus
1242
+
1243
+ self.prev_distance = curr_distance
1244
+ return obs, reward, done, truncated, {}
1245
+ ```
1246
+
1247
+ **Fix**: Carefully design reward function to avoid unintended shortcuts.
1248
+
1249
+ ---
1250
+
1251
+ #### 4. Training is Too Slow
1252
+
1253
+ **Symptoms**:
1254
+ - Hours to train simple task
1255
+ - Low sample throughput
1256
+ - Single-threaded execution
1257
+
1258
+ **Possible Causes**:
1259
+
1260
+ **A. Not Using Vectorized Environments**
1261
+
1262
+ ```python
1263
+ # ❌ WRONG: Single environment (slow)
1264
+ env = gym.make("CartPole-v1")
1265
+ model = PPO("MlpPolicy", env)
1266
+
1267
+ # ✅ CORRECT: Vectorized environments (10x faster)
1268
+ from gymnasium.vector import make_vec
1269
+
1270
+ vec_env = make_vec("CartPole-v1", num_envs=16)
1271
+ model = PPO("MlpPolicy", vec_env)
1272
+ ```
1273
+
1274
+ **Fix**: Use 8-32 parallel environments with `make_vec()`.
1275
+
1276
+ **B. Inefficient Update Frequency (SAC/DQN)**
1277
+
1278
+ ```python
1279
+ # ❌ WRONG: Too many gradient updates
1280
+ model = SAC(
1281
+ "MlpPolicy", env,
1282
+ train_freq=1,
1283
+ gradient_steps=10 # 10 updates per step (overkill!)
1284
+ )
1285
+
1286
+ # ✅ CORRECT: Balanced update frequency
1287
+ model = SAC(
1288
+ "MlpPolicy", env,
1289
+ train_freq=1,
1290
+ gradient_steps=1 # 1 update per step
1291
+ )
1292
+ ```
1293
+
1294
+ **Fix**: Start with `gradient_steps=1`, increase only if needed.
1295
+
1296
+ **C. Environment is Slow**
1297
+
1298
+ ```python
1299
+ # ✅ CORRECT: Profile environment to find bottlenecks
1300
+ import time
1301
+
1302
+ env = gym.make("YourEnv-v0")
1303
+ obs, info = env.reset()
1304
+
1305
+ start = time.time()
1306
+ for _ in range(1000):
1307
+ obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
1308
+ if terminated or truncated:
1309
+ obs, info = env.reset()
1310
+ end = time.time()
1311
+
1312
+ fps = 1000 / (end - start)
1313
+ print(f"Environment FPS: {fps:.2f}") # Should be >1000 for simple tasks
1314
+ ```
1315
+
1316
+ **Fix**: Optimize environment `step()` function (use NumPy instead of Python loops).
1317
+
1318
+ ---
1319
+
1320
+ #### 5. Agent Works in Training, Fails in Evaluation
1321
+
1322
+ **Symptoms**:
1323
+ - Good training reward
1324
+ - Poor evaluation reward
1325
+ - Different behavior in eval mode
1326
+
1327
+ **Possible Causes**:
1328
+
1329
+ **A. Stochastic Policy in Evaluation**
1330
+
1331
+ ```python
1332
+ # ❌ WRONG: Stochastic policy in eval (random actions)
1333
+ obs, info = env.reset()
1334
+ action, _ = model.predict(obs, deterministic=False) # Random!
1335
+
1336
+ # ✅ CORRECT: Deterministic policy in eval
1337
+ obs, info = env.reset()
1338
+ action, _ = model.predict(obs, deterministic=True) # Best action
1339
+ ```
1340
+
1341
+ **Fix**: Always use `deterministic=True` during evaluation.
1342
+
1343
+ **B. Overfitting to Training Environment**
1344
+
1345
+ ```python
1346
+ # ✅ CORRECT: Use different eval environment
1347
+ from stable_baselines3.common.callbacks import EvalCallback
1348
+
1349
+ # Training env: fixed seed
1350
+ train_env = gym.make("CartPole-v1")
1351
+
1352
+ # Eval env: different seed (tests generalization)
1353
+ eval_env = gym.make("CartPole-v1")
1354
+
1355
+ eval_callback = EvalCallback(
1356
+ eval_env,
1357
+ eval_freq=1000,
1358
+ deterministic=True,
1359
+ render=False
1360
+ )
1361
+
1362
+ model.learn(total_timesteps=50_000, callback=eval_callback)
1363
+ ```
1364
+
1365
+ **Fix**: Use separate evaluation environment with different random seed.
1366
+
1367
+ ---
1368
+
1369
+ #### 6. Nan/Inf in Training (Model Explodes)
1370
+
1371
+ **Symptoms**:
1372
+ - `NaN` or `Inf` in loss
1373
+ - Training crashes
1374
+ - Reward becomes invalid
1375
+
1376
+ **Possible Causes**:
1377
+
1378
+ **A. Gradient Explosion**
1379
+
1380
+ ```python
1381
+ # ❌ WRONG: No gradient clipping
1382
+ model = PPO(
1383
+ "MlpPolicy", env,
1384
+ max_grad_norm=None # No clipping!
1385
+ )
1386
+
1387
+ # ✅ CORRECT: Clip gradients
1388
+ model = PPO(
1389
+ "MlpPolicy", env,
1390
+ max_grad_norm=0.5 # Clip to prevent explosion
1391
+ )
1392
+ ```
1393
+
1394
+ **Fix**: Always use gradient clipping (`max_grad_norm=0.5`).
1395
+
1396
+ **B. Reward Scale Too Large**
1397
+
1398
+ ```python
1399
+ # ❌ WRONG: Rewards are huge (causes instability)
1400
+ def step(self, action):
1401
+ reward = 10000.0 if goal else 0.0 # Way too large!
1402
+ return obs, reward, done, {}
1403
+
1404
+ # ✅ CORRECT: Normalize rewards to [-1, 1] or [-10, 10]
1405
+ def step(self, action):
1406
+ reward = 1.0 if goal else -0.01 # Reasonable scale
1407
+ return obs, reward, done, truncated, {}
1408
+
1409
+ # Or use reward normalization
1410
+ from stable_baselines3.common.vec_env import VecNormalize
1411
+ vec_env = VecNormalize(vec_env, norm_reward=True)
1412
+ ```
1413
+
1414
+ **Fix**: Keep rewards in range [-10, 10] or use `VecNormalize`.
1415
+
1416
+ **C. Invalid Observations**
1417
+
1418
+ ```python
1419
+ # ✅ CORRECT: Check for NaN/Inf in observations
1420
+ def _get_obs(self):
1421
+ obs = np.array([self.x, self.y, self.vx, self.vy])
1422
+ assert not np.any(np.isnan(obs)), "NaN in observation!"
1423
+ assert not np.any(np.isinf(obs)), "Inf in observation!"
1424
+ return obs
1425
+ ```
1426
+
1427
+ **Fix**: Add assertions to catch invalid observations early.
1428
+
1429
+ ---
1430
+
1431
+ ### Debugging Checklist
1432
+
1433
+ **Environment Issues**:
1434
+ - [ ] Observation contains all necessary information
1435
+ - [ ] Reward function is dense (not too sparse)
1436
+ - [ ] Reward scale is reasonable ([-10, 10])
1437
+ - [ ] Episode terminates correctly (terminated vs truncated)
1438
+ - [ ] Custom environment follows Gymnasium API
1439
+
1440
+ **Algorithm Issues**:
1441
+ - [ ] Learning rate is appropriate (3e-4 for PPO, 1e-4 for DQN)
1442
+ - [ ] Network size matches task complexity
1443
+ - [ ] Exploration is sufficient (check entropy/epsilon)
1444
+ - [ ] Using vectorized environments for speed
1445
+ - [ ] Gradient clipping enabled (max_grad_norm=0.5)
1446
+
1447
+ **Training Issues**:
1448
+ - [ ] Using EvalCallback to save best model
1449
+ - [ ] Monitoring learning curves (reward, loss, entropy)
1450
+ - [ ] Training long enough (10K-1M timesteps)
1451
+ - [ ] Using deterministic policy in evaluation
1452
+ - [ ] Checking for NaN/Inf in training logs
1453
+
1454
+ **Debugging Tools**:
1455
+
1456
+ ```python
1457
+ # Log all hyperparameters and metrics
1458
+ from stable_baselines3.common.logger import configure
1459
+
1460
+ logger = configure("./logs/ppo_debug", ["stdout", "csv", "tensorboard"])
1461
+ model.set_logger(logger)
1462
+
1463
+ # Detailed monitoring callback
1464
+ from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback
1465
+ from stable_baselines3.common.callbacks import EvalCallback
1466
+
1467
+ checkpoint_callback = CheckpointCallback(
1468
+ save_freq=10000,
1469
+ save_path="./logs/checkpoints/",
1470
+ name_prefix="rl_model"
1471
+ )
1472
+
1473
+ eval_callback = EvalCallback(
1474
+ eval_env,
1475
+ best_model_save_path="./logs/best_model/",
1476
+ log_path="./logs/eval/",
1477
+ eval_freq=1000,
1478
+ deterministic=True,
1479
+ render=False
1480
+ )
1481
+
1482
+ callback = CallbackList([checkpoint_callback, eval_callback])
1483
+
1484
+ # Train with full logging
1485
+ model.learn(total_timesteps=100_000, callback=callback)
1486
+
1487
+ # Visualize with TensorBoard
1488
+ # tensorboard --logdir ./logs/ppo_debug
1489
+ ```
1490
+
1491
+ ---
1492
+
1493
+ ## Multi-Agent Reinforcement Learning
1494
+
1495
+ **Source**: PettingZoo and multi-agent RL best practices
1496
+
1497
+ ### Multi-Agent Environments with PettingZoo
1498
+
1499
+ ```python
1500
+ # Install PettingZoo for multi-agent environments
1501
+ # pip install pettingzoo[all]
1502
+
1503
+ from pettingzoo.mpe import simple_spread_v3
1504
+
1505
+ # Create multi-agent environment
1506
+ env = simple_spread_v3.parallel_env(render_mode="human")
1507
+ observations, infos = env.reset()
1508
+
1509
+ # Multi-agent training loop
1510
+ while env.agents:
1511
+ actions = {agent: env.action_space(agent).sample() for agent in env.agents}
1512
+ observations, rewards, terminations, truncations, infos = env.step(actions)
1513
+
1514
+ env.close()
1515
+ ```
1516
+
1517
+ ### Multi-Agent Algorithms
1518
+
1519
+ #### 1. Independent Q-Learning (IQL)
1520
+
1521
+ **Use When**: Simple cooperative tasks, independent agents
1522
+
1523
+ ```python
1524
+ from stable_baselines3 import DQN
1525
+
1526
+ # Train each agent independently
1527
+ agents = {}
1528
+ for agent_id in env.possible_agents:
1529
+ agents[agent_id] = DQN("MlpPolicy", env, verbose=1)
1530
+
1531
+ # Train all agents
1532
+ for agent_id, model in agents.items():
1533
+ model.learn(total_timesteps=50_000)
1534
+ ```
1535
+
1536
+ **Pros**: Simple, parallelizable
1537
+ **Cons**: Non-stationary environment (other agents are learning)
1538
+
1539
+ #### 2. Multi-Agent PPO (MAPPO)
1540
+
1541
+ **Use When**: Cooperative tasks, centralized training
1542
+
1543
+ ```python
1544
+ # Centralized training with shared value function
1545
+ # Each agent has own policy, but shares critic
1546
+
1547
+ from stable_baselines3 import PPO
1548
+ from stable_baselines3.common.vec_env import DummyVecEnv
1549
+
1550
+ # Wrapper for PettingZoo → Gymnasium
1551
+ def make_env():
1552
+ env = simple_spread_v3.parallel_env()
1553
+ return env
1554
+
1555
+ # Train with shared experience
1556
+ vec_env = DummyVecEnv([make_env] * 4) # 4 parallel envs
1557
+ model = PPO("MlpPolicy", vec_env, verbose=1)
1558
+ model.learn(total_timesteps=200_000)
1559
+ ```
1560
+
1561
+ **Pros**: Centralized critic stabilizes training
1562
+ **Cons**: Requires coordination during training
1563
+
1564
+ #### 3. MADDPG (Multi-Agent DDPG)
1565
+
1566
+ **Use When**: Mixed cooperative-competitive scenarios
1567
+
1568
+ ```python
1569
+ # MADDPG: Each agent observes actions of all others
1570
+ # Uses centralized critic, decentralized actors
1571
+
1572
+ # Key idea: Critic sees all agent observations + actions
1573
+ # Actor only sees own observation
1574
+
1575
+ # Pseudo-code structure:
1576
+ class MADDPGAgent:
1577
+ def __init__(self, agent_id, n_agents):
1578
+ self.actor = Actor(obs_dim) # Decentralized
1579
+ self.critic = Critic(obs_dim * n_agents) # Centralized
1580
+
1581
+ def act(self, obs):
1582
+ return self.actor(obs) # Only needs own observation
1583
+
1584
+ def update(self, batch):
1585
+ # Critic uses global state (all agent obs + actions)
1586
+ q_value = self.critic(all_obs, all_actions)
1587
+ actor_loss = -q_value.mean()
1588
+ self.actor.optimizer.zero_grad()
1589
+ actor_loss.backward()
1590
+ ```
1591
+
1592
+ **Pros**: Handles mixed cooperative-competitive
1593
+ **Cons**: Complex implementation, high sample complexity
1594
+
1595
+ ### Multi-Agent Reward Structures
1596
+
1597
+ #### Cooperative (All agents share reward)
1598
+
1599
+ ```python
1600
+ def step(self, actions):
1601
+ # All agents get same reward
1602
+ team_reward = self._compute_team_reward()
1603
+ rewards = {agent: team_reward for agent in self.agents}
1604
+ return observations, rewards, dones, truncateds, infos
1605
+ ```
1606
+
1607
+ **Use With**: MAPPO, shared value function
1608
+
1609
+ #### Competitive (Zero-sum game)
1610
+
1611
+ ```python
1612
+ def step(self, actions):
1613
+ # Winner gets +1, loser gets -1
1614
+ winner = self._determine_winner()
1615
+ rewards = {
1616
+ agent: 1.0 if agent == winner else -1.0
1617
+ for agent in self.agents
1618
+ }
1619
+ return observations, rewards, dones, truncateds, infos
1620
+ ```
1621
+
1622
+ **Use With**: Self-play, adversarial training
1623
+
1624
+ #### Mixed (Individual + team rewards)
1625
+
1626
+ ```python
1627
+ def step(self, actions):
1628
+ team_reward = self._compute_team_reward()
1629
+ individual_rewards = self._compute_individual_rewards(actions)
1630
+
1631
+ # Combine both (e.g., 70% team, 30% individual)
1632
+ rewards = {
1633
+ agent: 0.7 * team_reward + 0.3 * individual_rewards[agent]
1634
+ for agent in self.agents
1635
+ }
1636
+ return observations, rewards, dones, truncateds, infos
1637
+ ```
1638
+
1639
+ **Use With**: Cooperative tasks with specialization
1640
+
1641
+ ### Communication Between Agents
1642
+
1643
+ ```python
1644
+ class CommunicativeAgent(gym.Env):
1645
+ def __init__(self, n_agents):
1646
+ super().__init__()
1647
+ self.n_agents = n_agents
1648
+
1649
+ # Observation = own state + messages from others
1650
+ obs_dim = state_dim + (n_agents - 1) * message_dim
1651
+ self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(obs_dim,))
1652
+
1653
+ # Action = physical action + message to broadcast
1654
+ self.action_space = spaces.Tuple([
1655
+ spaces.Discrete(4), # Physical action
1656
+ spaces.Box(0, 1, (message_dim,)) # Message
1657
+ ])
1658
+
1659
+ def step(self, actions):
1660
+ physical_actions, messages = zip(*actions)
1661
+
1662
+ # Each agent receives messages from others
1663
+ observations = {}
1664
+ for i, agent in enumerate(self.agents):
1665
+ other_messages = [messages[j] for j in range(self.n_agents) if j != i]
1666
+ observations[agent] = np.concatenate([
1667
+ self.states[i], # Own state
1668
+ *other_messages # Messages from others
1669
+ ])
1670
+
1671
+ return observations, rewards, dones, truncateds, infos
1672
+ ```
1673
+
1674
+ ### Multi-Agent Training Tips
1675
+
1676
+ 1. **Curriculum Learning**:
1677
+ ```python
1678
+ # Start with simple tasks, gradually increase difficulty
1679
+ # Stage 1: Train against random opponents
1680
+ # Stage 2: Train against fixed-policy opponents
1681
+ # Stage 3: Self-play (train against copies of self)
1682
+ ```
1683
+
1684
+ 2. **Population-Based Training**:
1685
+ ```python
1686
+ # Maintain diverse population of agents
1687
+ population = [PPO("MlpPolicy", env) for _ in range(10)]
1688
+
1689
+ # Periodically evaluate and replace worst performers
1690
+ for generation in range(100):
1691
+ # Train each agent
1692
+ for agent in population:
1693
+ agent.learn(total_timesteps=10_000)
1694
+
1695
+ # Evaluate against population
1696
+ scores = evaluate_population(population)
1697
+
1698
+ # Replace worst with mutations of best
1699
+ population = evolve_population(population, scores)
1700
+ ```
1701
+
1702
+ 3. **Credit Assignment**:
1703
+ ```python
1704
+ # In cooperative tasks, determine which agent contributed to success
1705
+ # Use shaped rewards based on contributions
1706
+
1707
+ def compute_contributions(self, agents_actions, team_reward):
1708
+ contributions = {}
1709
+ for agent in self.agents:
1710
+ # Counterfactual: "What if this agent did nothing?"
1711
+ counterfactual_reward = self._simulate_without(agent)
1712
+ contribution = team_reward - counterfactual_reward
1713
+ contributions[agent] = contribution
1714
+ return contributions
1715
+ ```
1716
+
1717
+ ---
1718
+
1719
+ ## Advanced Callback Patterns
1720
+
1721
+ **Source**: Context7-verified SB3 callback patterns (265 snippets, trust 8.0)
1722
+
1723
+ ### 1. Custom Feature Extractor for Images
1724
+
1725
+ ```python
1726
+ import torch as th
1727
+ import torch.nn as nn
1728
+ from stable_baselines3 import PPO
1729
+ from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
1730
+
1731
+ class CustomCNN(BaseFeaturesExtractor):
1732
+ """Custom CNN feature extractor for image observations."""
1733
+
1734
+ def __init__(self, observation_space, features_dim=256):
1735
+ super().__init__(observation_space, features_dim)
1736
+ n_input_channels = observation_space.shape[0]
1737
+
1738
+ self.cnn = nn.Sequential(
1739
+ nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4, padding=0),
1740
+ nn.ReLU(),
1741
+ nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0),
1742
+ nn.ReLU(),
1743
+ nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=0),
1744
+ nn.ReLU(),
1745
+ nn.Flatten(),
1746
+ )
1747
+
1748
+ # Compute shape by doing one forward pass
1749
+ with th.no_grad():
1750
+ n_flatten = self.cnn(
1751
+ th.as_tensor(observation_space.sample()[None]).float()
1752
+ ).shape[1]
1753
+
1754
+ self.linear = nn.Sequential(
1755
+ nn.Linear(n_flatten, features_dim),
1756
+ nn.ReLU()
1757
+ )
1758
+
1759
+ def forward(self, observations):
1760
+ return self.linear(self.cnn(observations))
1761
+
1762
+ # Use custom CNN
1763
+ policy_kwargs = dict(
1764
+ features_extractor_class=CustomCNN,
1765
+ features_extractor_kwargs=dict(features_dim=256),
1766
+ )
1767
+
1768
+ model = PPO("CnnPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
1769
+ ```
1770
+
1771
+ ### 2. Progressive Reward Scaling Callback
1772
+
1773
+ ```python
1774
+ from stable_baselines3.common.callbacks import BaseCallback
1775
+
1776
+ class ProgressiveRewardScalingCallback(BaseCallback):
1777
+ """Gradually increase reward difficulty over training."""
1778
+
1779
+ def __init__(self, initial_scale=0.1, final_scale=1.0, total_timesteps=100_000):
1780
+ super().__init__()
1781
+ self.initial_scale = initial_scale
1782
+ self.final_scale = final_scale
1783
+ self.total_timesteps = total_timesteps
1784
+
1785
+ def _on_step(self) -> bool:
1786
+ # Linearly increase reward scale
1787
+ progress = min(1.0, self.num_timesteps / self.total_timesteps)
1788
+ current_scale = self.initial_scale + (self.final_scale - self.initial_scale) * progress
1789
+
1790
+ # Update environment reward scale
1791
+ if hasattr(self.training_env, "reward_scale"):
1792
+ self.training_env.reward_scale = current_scale
1793
+
1794
+ # Log current scale
1795
+ self.logger.record("train/reward_scale", current_scale)
1796
+
1797
+ return True
1798
+ ```
1799
+
1800
+ ### 3. Adaptive Learning Rate Callback
1801
+
1802
+ ```python
1803
+ class AdaptiveLearningRateCallback(BaseCallback):
1804
+ """Adjust learning rate based on training progress."""
1805
+
1806
+ def __init__(self, check_freq=1000, lr_min=1e-6, lr_max=1e-3):
1807
+ super().__init__()
1808
+ self.check_freq = check_freq
1809
+ self.lr_min = lr_min
1810
+ self.lr_max = lr_max
1811
+ self.best_mean_reward = -np.inf
1812
+ self.last_mean_reward = -np.inf
1813
+
1814
+ def _on_step(self) -> bool:
1815
+ if self.n_calls % self.check_freq == 0:
1816
+ # Get mean reward from episode buffer
1817
+ if len(self.model.ep_info_buffer) > 0:
1818
+ mean_reward = np.mean([ep_info["r"] for ep_info in self.model.ep_info_buffer])
1819
+
1820
+ # If no improvement, decrease learning rate
1821
+ if mean_reward <= self.last_mean_reward:
1822
+ current_lr = self.model.learning_rate
1823
+ new_lr = max(self.lr_min, current_lr * 0.9)
1824
+ self.model.learning_rate = new_lr
1825
+ if self.verbose:
1826
+ print(f"Decreasing LR: {current_lr:.6f} → {new_lr:.6f}")
1827
+
1828
+ # If improvement, potentially increase learning rate
1829
+ elif mean_reward > self.best_mean_reward:
1830
+ current_lr = self.model.learning_rate
1831
+ new_lr = min(self.lr_max, current_lr * 1.05)
1832
+ self.model.learning_rate = new_lr
1833
+ self.best_mean_reward = mean_reward
1834
+ if self.verbose:
1835
+ print(f"Increasing LR: {current_lr:.6f} → {new_lr:.6f}")
1836
+
1837
+ self.last_mean_reward = mean_reward
1838
+
1839
+ return True
1840
+ ```
1841
+
1842
+ ### 4. Curriculum Learning Callback
1843
+
1844
+ ```python
1845
+ class CurriculumCallback(BaseCallback):
1846
+ """Progressively increase task difficulty."""
1847
+
1848
+ def __init__(self, difficulty_levels, timesteps_per_level):
1849
+ super().__init__()
1850
+ self.difficulty_levels = difficulty_levels
1851
+ self.timesteps_per_level = timesteps_per_level
1852
+ self.current_level = 0
1853
+
1854
+ def _on_step(self) -> bool:
1855
+ # Check if it's time to increase difficulty
1856
+ target_level = min(
1857
+ len(self.difficulty_levels) - 1,
1858
+ self.num_timesteps // self.timesteps_per_level
1859
+ )
1860
+
1861
+ if target_level > self.current_level:
1862
+ self.current_level = target_level
1863
+ difficulty = self.difficulty_levels[self.current_level]
1864
+
1865
+ # Update environment difficulty
1866
+ if hasattr(self.training_env, "set_difficulty"):
1867
+ self.training_env.set_difficulty(difficulty)
1868
+
1869
+ if self.verbose:
1870
+ print(f"Increased difficulty to level {self.current_level}: {difficulty}")
1871
+
1872
+ return True
1873
+
1874
+ # Usage
1875
+ difficulty_levels = ["easy", "medium", "hard", "expert"]
1876
+ curriculum_callback = CurriculumCallback(
1877
+ difficulty_levels=difficulty_levels,
1878
+ timesteps_per_level=50_000
1879
+ )
1880
+
1881
+ model.learn(total_timesteps=200_000, callback=curriculum_callback)
1882
+ ```
1883
+
1884
+ ### 5. Entropy Monitoring Callback
1885
+
1886
+ ```python
1887
+ class EntropyMonitoringCallback(BaseCallback):
1888
+ """Monitor and log policy entropy (exploration measure)."""
1889
+
1890
+ def __init__(self, check_freq=1000, target_entropy=None):
1891
+ super().__init__()
1892
+ self.check_freq = check_freq
1893
+ self.target_entropy = target_entropy
1894
+
1895
+ def _on_step(self) -> bool:
1896
+ if self.n_calls % self.check_freq == 0:
1897
+ # For PPO, get entropy from logger
1898
+ if hasattr(self.model, "logger"):
1899
+ # Entropy is logged by PPO during training
1900
+ # We can access it from the logger's name_to_value dict
1901
+ pass
1902
+
1903
+ # For SAC, check entropy coefficient
1904
+ if hasattr(self.model, "ent_coef"):
1905
+ if isinstance(self.model.ent_coef, th.Tensor):
1906
+ entropy = self.model.ent_coef.item()
1907
+ else:
1908
+ entropy = self.model.ent_coef
1909
+
1910
+ self.logger.record("train/entropy_coef", entropy)
1911
+
1912
+ # Warn if entropy too low (insufficient exploration)
1913
+ if entropy < 0.01:
1914
+ if self.verbose:
1915
+ print("⚠️ Warning: Low entropy - agent may not be exploring enough!")
1916
+
1917
+ return True
1918
+ ```
1919
+
1920
+ ### 6. Action Distribution Logging
1921
+
1922
+ ```python
1923
+ class ActionDistributionCallback(BaseCallback):
1924
+ """Log action distribution to detect policy collapse."""
1925
+
1926
+ def __init__(self, check_freq=5000):
1927
+ super().__init__()
1928
+ self.check_freq = check_freq
1929
+ self.action_counts = None
1930
+
1931
+ def _on_step(self) -> bool:
1932
+ if self.n_calls % self.check_freq == 0:
1933
+ # Initialize action counter
1934
+ if self.action_counts is None:
1935
+ if isinstance(self.training_env.action_space, gym.spaces.Discrete):
1936
+ n_actions = self.training_env.action_space.n
1937
+ self.action_counts = np.zeros(n_actions)
1938
+
1939
+ # Collect actions over next N steps
1940
+ if self.action_counts is not None:
1941
+ # Get last action from logger
1942
+ # This is a simplified version - in practice, collect over episode
1943
+ for action_idx in range(len(self.action_counts)):
1944
+ self.logger.record(f"actions/action_{action_idx}_freq",
1945
+ self.action_counts[action_idx] / self.action_counts.sum())
1946
+
1947
+ # Warn if one action dominates (>80%)
1948
+ max_freq = self.action_counts.max() / self.action_counts.sum()
1949
+ if max_freq > 0.8:
1950
+ if self.verbose:
1951
+ print(f"⚠️ Warning: Action {self.action_counts.argmax()} used {max_freq:.1%} of time!")
1952
+
1953
+ return True
1954
+ ```
1955
+
1956
+ ### 7. Multi-Callback Composition
1957
+
1958
+ ```python
1959
+ from stable_baselines3.common.callbacks import CallbackList
1960
+
1961
+ # Combine multiple callbacks for comprehensive monitoring
1962
+ callback_list = CallbackList([
1963
+ EvalCallback(
1964
+ eval_env,
1965
+ best_model_save_path="./logs/best_model/",
1966
+ eval_freq=5000,
1967
+ deterministic=True
1968
+ ),
1969
+ CheckpointCallback(
1970
+ save_freq=10000,
1971
+ save_path="./logs/checkpoints/",
1972
+ name_prefix="rl_model"
1973
+ ),
1974
+ ProgressiveRewardScalingCallback(
1975
+ initial_scale=0.1,
1976
+ final_scale=1.0,
1977
+ total_timesteps=200_000
1978
+ ),
1979
+ CurriculumCallback(
1980
+ difficulty_levels=["easy", "medium", "hard"],
1981
+ timesteps_per_level=50_000
1982
+ ),
1983
+ EntropyMonitoringCallback(
1984
+ check_freq=1000
1985
+ )
1986
+ ])
1987
+
1988
+ # Train with all callbacks
1989
+ model.learn(total_timesteps=200_000, callback=callback_list)
1990
+ ```
1991
+
1992
+ ### 8. TensorBoard Integration
1993
+
1994
+ ```python
1995
+ # Enhanced logging with TensorBoard
1996
+ from stable_baselines3.common.logger import configure
1997
+
1998
+ # Configure TensorBoard logging
1999
+ logger = configure("./logs/tensorboard", ["stdout", "csv", "tensorboard"])
2000
+ model.set_logger(logger)
2001
+
2002
+ # Custom metrics in callbacks
2003
+ class CustomMetricsCallback(BaseCallback):
2004
+ def _on_step(self) -> bool:
2005
+ if self.n_calls % 100 == 0:
2006
+ # Log custom metrics
2007
+ self.logger.record("custom/timesteps", self.num_timesteps)
2008
+ self.logger.record("custom/episodes", len(self.model.ep_info_buffer))
2009
+
2010
+ # Log environment-specific metrics
2011
+ if hasattr(self.training_env, "get_metrics"):
2012
+ metrics = self.training_env.get_metrics()
2013
+ for key, value in metrics.items():
2014
+ self.logger.record(f"env/{key}", value)
2015
+
2016
+ return True
2017
+
2018
+ # View with TensorBoard:
2019
+ # tensorboard --logdir ./logs/tensorboard
2020
+ ```
2021
+
2022
+ ---
2023
+
2024
+ ## Core Expertise
2025
+
2026
+ ### RL Algorithms
2027
+ - **Value-Based**: DQN, Double DQN, Dueling DQN
2028
+ - **Policy Gradient**: REINFORCE, A2C, PPO, TRPO
2029
+ - **Actor-Critic**: SAC, TD3, DDPG
2030
+ - **Model-Based**: Planning, World Models
2031
+
2032
+ ### Environment Design
2033
+ - Custom Gymnasium environments
2034
+ - Multi-agent environments
2035
+ - Partially observable environments (POMDPs)
2036
+ - Continuous/discrete action spaces
2037
+
2038
+ ### Training Optimization
2039
+ - Replay buffers and experience replay
2040
+ - Target networks and soft updates
2041
+ - Exploration strategies (epsilon-greedy, entropy regularization)
2042
+ - Reward shaping and normalization
2043
+
2044
+ ### Deployment
2045
+ - Model quantization for edge devices
2046
+ - ONNX export for cross-platform inference
2047
+ - Real-time decision making
2048
+ - Multi-agent coordination
2049
+
2050
+ ## Output Format
2051
+
2052
+ ```
2053
+ 🎮 REINFORCEMENT LEARNING IMPLEMENTATION
2054
+ ========================================
2055
+
2056
+ 📋 ENVIRONMENT:
2057
+ - [Environment type and complexity]
2058
+ - [State space dimensions]
2059
+ - [Action space (discrete/continuous)]
2060
+ - [Reward structure]
2061
+
2062
+ 🤖 ALGORITHM:
2063
+ - [Algorithm choice and justification]
2064
+ - [Hyperparameters]
2065
+ - [Training configuration]
2066
+
2067
+ 📊 TRAINING RESULTS:
2068
+ - [Learning curves]
2069
+ - [Final performance metrics]
2070
+ - [Sample efficiency]
2071
+
2072
+ 🚀 DEPLOYMENT:
2073
+ - [Model format]
2074
+ - [Inference latency]
2075
+ - [Edge device compatibility]
2076
+ ```
2077
+
2078
+ ## Self-Validation
2079
+
2080
+ - [ ] Context7 documentation consulted
2081
+ - [ ] Environment follows Gymnasium API
2082
+ - [ ] Proper exploration/exploitation balance
2083
+ - [ ] Reward function encourages desired behavior
2084
+ - [ ] Training monitored with callbacks
2085
+ - [ ] Best model saved
2086
+ - [ ] Test in environment after training
2087
+
2088
+ You deliver production-ready RL agents using Context7-verified best practices for maximum sample efficiency and performance.