rl-expert-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (175)
  1. package/README.md +65 -0
  2. package/bin/cli.js +66 -0
  3. package/lib/claude.js +37 -0
  4. package/lib/codex.js +108 -0
  5. package/lib/copy.js +37 -0
  6. package/package.json +33 -0
  7. package/skills/analyzing-saliency-and-values/SKILL.md +83 -0
  8. package/skills/analyzing-saliency-and-values/references/saliency_math.md +48 -0
  9. package/skills/blueprinting-system-architecture/SKILL.md +95 -0
  10. package/skills/blueprinting-system-architecture/references/distributed-patterns.md +64 -0
  11. package/skills/blueprinting-system-architecture/references/hardware-placement-rules.md +79 -0
  12. package/skills/blueprinting-system-architecture/scripts/validate_architecture.py +168 -0
  13. package/skills/conducting-adversarial-rl-testing/SKILL.md +74 -0
  14. package/skills/conducting-adversarial-rl-testing/references/adversarial_theory.md +54 -0
  15. package/skills/configuring-checkpoint-strategy/SKILL.md +99 -0
  16. package/skills/configuring-checkpoint-strategy/references/checkpoint_patterns.md +108 -0
  17. package/skills/configuring-checkpoint-strategy/scripts/checkpoint_manager.py +101 -0
  18. package/skills/configuring-distributed-rollouts/SKILL.md +104 -0
  19. package/skills/configuring-distributed-rollouts/references/distributed_patterns.md +95 -0
  20. package/skills/configuring-distributed-rollouts/scripts/batch_size_calculator.py +123 -0
  21. package/skills/configuring-hyperparameter-tuning/SKILL.md +131 -0
  22. package/skills/configuring-hyperparameter-tuning/references/hyperparameter_ranges.md +88 -0
  23. package/skills/configuring-hyperparameter-tuning/scripts/optuna_rl_study.py +130 -0
  24. package/skills/configuring-replay-buffers/SKILL.md +89 -0
  25. package/skills/configuring-replay-buffers/references/her_and_memory_patterns.md +153 -0
  26. package/skills/configuring-replay-buffers/scripts/per_sumtree.py +150 -0
  27. package/skills/creating-ood-robustness-tests/SKILL.md +79 -0
  28. package/skills/creating-ood-robustness-tests/references/ood_test_patterns.md +79 -0
  29. package/skills/defining-action-space/SKILL.md +78 -0
  30. package/skills/defining-action-space/references/action-space-patterns.md +88 -0
  31. package/skills/defining-action-space/scripts/validate_action_space.py +145 -0
  32. package/skills/defining-mdp-transition/SKILL.md +97 -0
  33. package/skills/defining-mdp-transition/references/mdp-formalism.md +75 -0
  34. package/skills/defining-mdp-transition/scripts/validate_mdp_spec.py +208 -0
  35. package/skills/defining-state-space/SKILL.md +91 -0
  36. package/skills/defining-state-space/references/observation-design-patterns.md +112 -0
  37. package/skills/defining-state-space/scripts/validate_state_space.py +175 -0
  38. package/skills/designing-adversarial-environments/SKILL.md +133 -0
  39. package/skills/designing-adversarial-environments/references/paired-algorithm.md +82 -0
  40. package/skills/designing-deterministic-eval-loop/SKILL.md +105 -0
  41. package/skills/designing-deterministic-eval-loop/references/eval_loop_patterns.md +95 -0
  42. package/skills/designing-feature-extractor/SKILL.md +141 -0
  43. package/skills/designing-feature-extractor/references/resnet_backbone_patterns.md +114 -0
  44. package/skills/designing-feature-extractor/scripts/multimodal_extractor.py +133 -0
  45. package/skills/designing-hierarchical-rl/SKILL.md +93 -0
  46. package/skills/designing-hierarchical-rl/references/hrl-algorithms.md +92 -0
  47. package/skills/designing-hierarchical-rl/scripts/validate_hrl_schema.py +185 -0
  48. package/skills/designing-policy-actor/SKILL.md +147 -0
  49. package/skills/designing-policy-actor/references/torch_distributions_api.md +97 -0
  50. package/skills/designing-policy-actor/scripts/gaussian_actor.py +134 -0
  51. package/skills/designing-reward-function/SKILL.md +113 -0
  52. package/skills/designing-reward-function/references/reward-shaping-theory.md +57 -0
  53. package/skills/designing-reward-function/scripts/validate_reward_function.py +157 -0
  54. package/skills/designing-value-critic/SKILL.md +143 -0
  55. package/skills/designing-value-critic/references/bellman_equations.md +134 -0
  56. package/skills/designing-value-critic/scripts/twin_critic_polyak.py +148 -0
  57. package/skills/diagnosing-training-failures/SKILL.md +89 -0
  58. package/skills/diagnosing-training-failures/references/failure_modes.md +73 -0
  59. package/skills/evaluating-marl-equilibria/SKILL.md +105 -0
  60. package/skills/evaluating-marl-equilibria/references/marl_equilibria_theory.md +70 -0
  61. package/skills/formulating-irl-and-imitation/SKILL.md +103 -0
  62. package/skills/formulating-irl-and-imitation/references/imitation-algorithms.md +124 -0
  63. package/skills/formulating-irl-and-imitation/scripts/validate_demo_dataset.py +192 -0
  64. package/skills/formulating-marl-architecture/SKILL.md +98 -0
  65. package/skills/formulating-marl-architecture/references/marl-algorithms.md +104 -0
  66. package/skills/formulating-marl-architecture/scripts/validate_marl_spec.py +196 -0
  67. package/skills/implementing-domain-randomization/SKILL.md +120 -0
  68. package/skills/implementing-domain-randomization/references/adr-guide.md +85 -0
  69. package/skills/implementing-domain-randomization/scripts/adr_controller.py +81 -0
  70. package/skills/implementing-gymnasium-env/SKILL.md +138 -0
  71. package/skills/implementing-gymnasium-env/references/api-reference.md +113 -0
  72. package/skills/implementing-gymnasium-env/scripts/validate_gymnasium_env.py +99 -0
  73. package/skills/implementing-imitation-learning/SKILL.md +169 -0
  74. package/skills/implementing-imitation-learning/references/dataset_and_bce_patterns.md +142 -0
  75. package/skills/implementing-imitation-learning/scripts/gail_discriminator.py +173 -0
  76. package/skills/implementing-loss-functions/SKILL.md +174 -0
  77. package/skills/implementing-loss-functions/references/gradient_clipping_guide.md +98 -0
  78. package/skills/implementing-loss-functions/scripts/ppo_losses.py +179 -0
  79. package/skills/implementing-marl-algorithms/SKILL.md +170 -0
  80. package/skills/implementing-marl-algorithms/references/pytorch_marl_broadcasting.md +116 -0
  81. package/skills/implementing-marl-algorithms/scripts/qmix_mixing_network.py +181 -0
  82. package/skills/implementing-memory-models/SKILL.md +172 -0
  83. package/skills/implementing-memory-models/references/lstm_and_causal_masking.md +142 -0
  84. package/skills/implementing-memory-models/scripts/bptt_iterator.py +177 -0
  85. package/skills/implementing-observation-wrappers/SKILL.md +138 -0
  86. package/skills/implementing-observation-wrappers/references/normalization-patterns.md +113 -0
  87. package/skills/implementing-observation-wrappers/scripts/running_mean_std.py +113 -0
  88. package/skills/implementing-offline-rl/SKILL.md +187 -0
  89. package/skills/implementing-offline-rl/references/hdf5_dataset_ingestion.md +171 -0
  90. package/skills/implementing-offline-rl/scripts/cql_offline.py +227 -0
  91. package/skills/implementing-pettingzoo-marl-env/SKILL.md +164 -0
  92. package/skills/implementing-pettingzoo-marl-env/references/ctde-integration.md +110 -0
  93. package/skills/implementing-rendering-and-plotting/SKILL.md +130 -0
  94. package/skills/implementing-rendering-and-plotting/references/plotting_patterns.md +114 -0
  95. package/skills/implementing-temporal-wrappers/SKILL.md +161 -0
  96. package/skills/implementing-temporal-wrappers/references/frame-patterns.md +116 -0
  97. package/skills/integrating-physics-simulator/SKILL.md +151 -0
  98. package/skills/integrating-physics-simulator/references/engine-guide.md +178 -0
  99. package/skills/rl-algo-engineer/SKILL.md +7 -0
  100. package/skills/rl-algo-engineer/steps/step-01-init.md +137 -0
  101. package/skills/rl-algo-engineer/steps/step-02-feature-extractor.md +97 -0
  102. package/skills/rl-algo-engineer/steps/step-03-policy-actor.md +90 -0
  103. package/skills/rl-algo-engineer/steps/step-04-value-critic.md +102 -0
  104. package/skills/rl-algo-engineer/steps/step-05-memory-models.md +92 -0
  105. package/skills/rl-algo-engineer/steps/step-06-replay-buffers.md +90 -0
  106. package/skills/rl-algo-engineer/steps/step-07-loss-functions.md +109 -0
  107. package/skills/rl-algo-engineer/steps/step-08-marl-algorithms.md +88 -0
  108. package/skills/rl-algo-engineer/steps/step-09-offline-rl.md +90 -0
  109. package/skills/rl-algo-engineer/steps/step-10-imitation-learning.md +95 -0
  110. package/skills/rl-algo-engineer/steps/step-11-rlhf-reward-model.md +93 -0
  111. package/skills/rl-algo-engineer/steps/step-12-synthesize.md +140 -0
  112. package/skills/rl-algo-engineer/workflow.md +82 -0
  113. package/skills/rl-architect/SKILL.md +7 -0
  114. package/skills/rl-architect/steps/step-01-init.md +131 -0
  115. package/skills/rl-architect/steps/step-02-define-state-space.md +116 -0
  116. package/skills/rl-architect/steps/step-03-define-action-space.md +89 -0
  117. package/skills/rl-architect/steps/step-04-design-reward-function.md +94 -0
  118. package/skills/rl-architect/steps/step-05-define-mdp-transition.md +95 -0
  119. package/skills/rl-architect/steps/step-06-select-rl-strategy.md +102 -0
  120. package/skills/rl-architect/steps/step-07-blueprint-system-architecture.md +113 -0
  121. package/skills/rl-architect/steps/step-08-formulate-marl-architecture.md +88 -0
  122. package/skills/rl-architect/steps/step-09-design-hierarchical-rl.md +93 -0
  123. package/skills/rl-architect/steps/step-10-formulate-irl-and-imitation.md +86 -0
  124. package/skills/rl-architect/steps/step-11-synthesize.md +142 -0
  125. package/skills/rl-architect/workflow.md +78 -0
  126. package/skills/rl-env-engineer/SKILL.md +7 -0
  127. package/skills/rl-env-engineer/steps/step-01-init.md +132 -0
  128. package/skills/rl-env-engineer/steps/step-02-gymnasium-env.md +99 -0
  129. package/skills/rl-env-engineer/steps/step-03-physics-simulator.md +99 -0
  130. package/skills/rl-env-engineer/steps/step-04-observation-wrappers.md +106 -0
  131. package/skills/rl-env-engineer/steps/step-05-temporal-wrappers.md +95 -0
  132. package/skills/rl-env-engineer/steps/step-06-domain-randomization.md +100 -0
  133. package/skills/rl-env-engineer/steps/step-07-validate-env-compatibility.md +111 -0
  134. package/skills/rl-env-engineer/steps/step-08-pettingzoo-marl-env.md +104 -0
  135. package/skills/rl-env-engineer/steps/step-09-adversarial-environments.md +96 -0
  136. package/skills/rl-env-engineer/steps/step-10-synthesize.md +126 -0
  137. package/skills/rl-env-engineer/workflow.md +80 -0
  138. package/skills/rl-evaluator/SKILL.md +7 -0
  139. package/skills/rl-evaluator/steps/step-01-init.md +135 -0
  140. package/skills/rl-evaluator/steps/step-02-deterministic-eval-loop.md +120 -0
  141. package/skills/rl-evaluator/steps/step-03-ood-robustness-tests.md +92 -0
  142. package/skills/rl-evaluator/steps/step-04-rendering-and-plotting.md +104 -0
  143. package/skills/rl-evaluator/steps/step-05-saliency-and-values.md +109 -0
  144. package/skills/rl-evaluator/steps/step-06-diagnose-training-failures.md +109 -0
  145. package/skills/rl-evaluator/steps/step-07-evaluate-marl-equilibria.md +93 -0
  146. package/skills/rl-evaluator/steps/step-08-adversarial-rl-testing.md +112 -0
  147. package/skills/rl-evaluator/steps/step-09-synthesize.md +136 -0
  148. package/skills/rl-evaluator/workflow.md +80 -0
  149. package/skills/rl-mlops-engineer/SKILL.md +7 -0
  150. package/skills/rl-mlops-engineer/steps/step-01-init.md +135 -0
  151. package/skills/rl-mlops-engineer/steps/step-02-experiment-tracking.md +121 -0
  152. package/skills/rl-mlops-engineer/steps/step-03-hyperparameter-tuning.md +124 -0
  153. package/skills/rl-mlops-engineer/steps/step-04-distributed-rollouts.md +121 -0
  154. package/skills/rl-mlops-engineer/steps/step-05-checkpoint-strategy.md +140 -0
  155. package/skills/rl-mlops-engineer/steps/step-06-offline-rl-datasets.md +128 -0
  156. package/skills/rl-mlops-engineer/steps/step-07-rlhf-data-pipeline.md +120 -0
  157. package/skills/rl-mlops-engineer/steps/step-08-synthesize.md +144 -0
  158. package/skills/rl-mlops-engineer/workflow.md +78 -0
  159. package/skills/selecting-rl-strategy/SKILL.md +97 -0
  160. package/skills/selecting-rl-strategy/references/algorithm-deep-dives.md +96 -0
  161. package/skills/selecting-rl-strategy/scripts/algorithm_selector.py +190 -0
  162. package/skills/setting-up-experiment-tracking/SKILL.md +146 -0
  163. package/skills/setting-up-experiment-tracking/references/metric_interpretations.md +93 -0
  164. package/skills/setting-up-experiment-tracking/scripts/wandb_rl_callback.py +136 -0
  165. package/skills/setting-up-offline-rl-datasets/SKILL.md +151 -0
  166. package/skills/setting-up-offline-rl-datasets/references/offline_rl_formats.md +117 -0
  167. package/skills/setting-up-offline-rl-datasets/scripts/offline_dataset_builder.py +207 -0
  168. package/skills/setting-up-rlhf-data-pipeline/SKILL.md +207 -0
  169. package/skills/setting-up-rlhf-data-pipeline/references/rlhf_architecture.md +123 -0
  170. package/skills/setting-up-rlhf-data-pipeline/scripts/rlhf_pipeline.py +251 -0
  171. package/skills/training-rlhf-reward-model/SKILL.md +195 -0
  172. package/skills/training-rlhf-reward-model/references/reward_bounding_patterns.md +162 -0
  173. package/skills/training-rlhf-reward-model/scripts/bradley_terry_loss.py +197 -0
  174. package/skills/validating-env-compatibility/SKILL.md +145 -0
  175. package/skills/validating-env-compatibility/scripts/validate_env_suite.py +170 -0
package/README.md ADDED
@@ -0,0 +1,65 @@
+ # rl-expert-skills
+
+ 45 reinforcement learning skills for **Claude Code** and **OpenAI Codex**.
+
+ Covers architecture, algorithms, environments, evaluation, and MLOps.
+
+ ## Usage
+
+ From any project folder:
+
+ ```bash
+ npx rl-expert-skills
+ ```
+
+ The installer will ask:
+
+ 1. **Which tool?** — Claude Code, OpenAI Codex, or both
+ 2. **Scope?** — current project or current user
+
+ ### What gets installed
+
+ | Tool | Project scope | User scope |
+ |------|--------------|------------|
+ | Claude Code | `.claude/skills/` | `~/.claude/skills/` |
+ | OpenAI Codex | `.codex/skills/` + `AGENTS.md` | `~/.codex/skills/` + `~/.codex/instructions.md` |
+
+ ## Skills included
+
+ **Orchestrator agents**
+ - `rl-architect` — MDP formulation, algorithm selection, system blueprint
+ - `rl-algo-engineer` — neural networks, training loop, loss functions
+ - `rl-env-engineer` — Gymnasium/PettingZoo environments and wrappers
+ - `rl-evaluator` — evaluation, OOD testing, failure diagnosis
+ - `rl-mlops-engineer` — experiment tracking, distributed training, checkpointing
+
+ **Individual skills** (40 skills covering reward design, policy/critic architecture, MARL, offline RL, RLHF, domain randomization, and more)
+
+ ## Publishing
+
+ ### First time
+
+ ```bash
+ cd F:/rl-expert-skills
+ npm login
+ npm publish --access public
+ ```
+
+ ### Updating skills
+
+ After editing skills in `F:\RL-Expert\skills\`, re-bundle and republish:
+
+ ```bash
+ cd F:/rl-expert-skills
+ npm run prepare    # re-copies skills from ../RL-Expert/skills/
+ npm version patch  # bump version (patch / minor / major)
+ npm publish --access public
+ ```
+
+ ### Using before publishing (local test)
+
+ From any project folder:
+
+ ```bash
+ npx F:/rl-expert-skills
+ ```
package/bin/cli.js ADDED
@@ -0,0 +1,66 @@
+ #!/usr/bin/env node
+
+ import { checkbox, select } from '@inquirer/prompts';
+ import path from 'path';
+ import { fileURLToPath } from 'url';
+ import { installClaude } from '../lib/claude.js';
+ import { installCodex } from '../lib/codex.js';
+
+ const __dirname = path.dirname(fileURLToPath(import.meta.url));
+ const SKILLS_DIR = path.join(__dirname, '..', 'skills');
+
+ async function main() {
+   console.log('\nRL Expert Skills Installer');
+   console.log('==========================');
+   console.log('45 reinforcement learning skills for Claude Code and OpenAI Codex\n');
+
+   const tools = await checkbox({
+     message: 'Which tool(s) do you want to install the skills for?',
+     choices: [
+       { name: 'Claude Code', value: 'claude', checked: true },
+       { name: 'OpenAI Codex', value: 'codex', checked: false },
+     ],
+     validate: (choices) => choices.length > 0 || 'Please select at least one tool.',
+   });
+
+   const scope = await select({
+     message: 'Install scope?',
+     choices: [
+       {
+         name: 'Current project (.claude/skills/ or .codex/skills/)',
+         value: 'project',
+       },
+       {
+         name: 'Current user (~/.claude/skills/ or ~/.codex/skills/)',
+         value: 'user',
+       },
+     ],
+   });
+
+   console.log('');
+
+   const results = [];
+
+   for (const tool of tools) {
+     if (tool === 'claude') {
+       const result = await installClaude(SKILLS_DIR, scope);
+       results.push(result);
+     } else if (tool === 'codex') {
+       const result = await installCodex(SKILLS_DIR, scope);
+       results.push(result);
+     }
+   }
+
+   console.log('\nDone');
+   console.log('----');
+   for (const r of results) {
+     console.log(`  ${r.tool}: ${r.count} skills -> ${r.path}`);
+     if (r.extra) console.log(`    ${r.extra}`);
+   }
+   console.log('');
+ }
+
+ main().catch((err) => {
+   console.error('\nError:', err.message);
+   process.exit(1);
+ });
package/lib/claude.js ADDED
@@ -0,0 +1,37 @@
+ import fs from 'fs';
+ import path from 'path';
+ import os from 'os';
+ import { copyDir } from './copy.js';
+
+ /**
+  * Install skills for Claude Code.
+  *
+  * Project scope: <cwd>/.claude/skills/<skill>/
+  * User scope:    ~/.claude/skills/<skill>/
+  */
+ export async function installClaude(skillsDir, scope) {
+   const targetDir =
+     scope === 'project'
+       ? path.join(process.cwd(), '.claude', 'skills')
+       : path.join(os.homedir(), '.claude', 'skills');
+
+   fs.mkdirSync(targetDir, { recursive: true });
+
+   const skills = fs.readdirSync(skillsDir, { withFileTypes: true })
+     .filter((e) => e.isDirectory())
+     .map((e) => e.name);
+
+   for (const skill of skills) {
+     const src = path.join(skillsDir, skill);
+     const dest = path.join(targetDir, skill);
+     copyDir(src, dest);
+   }
+
+   process.stdout.write(`  Claude Code installed ${skills.length} skills\n`);
+
+   return {
+     tool: 'Claude Code ',
+     count: skills.length,
+     path: targetDir,
+   };
+ }
package/lib/codex.js ADDED
@@ -0,0 +1,108 @@
+ import fs from 'fs';
+ import path from 'path';
+ import os from 'os';
+ import { copyDir, readSkillMeta } from './copy.js';
+
+ const AGENTS_HEADER = '<!-- rl-expert-skills -->';
+ const AGENTS_FOOTER = '<!-- /rl-expert-skills -->';
+
+ /**
+  * Install skills for OpenAI Codex.
+  *
+  * Project scope:
+  *   - Skills copied to <cwd>/.codex/skills/<skill>/
+  *   - <cwd>/AGENTS.md updated with skill index block
+  *
+  * User scope:
+  *   - Skills copied to ~/.codex/skills/<skill>/
+  *   - ~/.codex/instructions.md updated with skill index block
+  */
+ export async function installCodex(skillsDir, scope) {
+   const isProject = scope === 'project';
+   const baseDir = isProject ? process.cwd() : os.homedir();
+   const targetDir = path.join(baseDir, '.codex', 'skills');
+
+   fs.mkdirSync(targetDir, { recursive: true });
+
+   const skills = fs.readdirSync(skillsDir, { withFileTypes: true })
+     .filter((e) => e.isDirectory())
+     .map((e) => e.name);
+
+   for (const skill of skills) {
+     const src = path.join(skillsDir, skill);
+     const dest = path.join(targetDir, skill);
+     copyDir(src, dest);
+   }
+
+   // Build the skill index block to inject into AGENTS.md / instructions.md
+   const relPath = isProject ? '.codex/skills' : '~/.codex/skills';
+   const skillLines = skills.map((skill) => {
+     const meta = readSkillMeta(path.join(skillsDir, skill));
+     const label = meta?.name || skill;
+     const desc = meta?.description
+       ? ` — ${meta.description.slice(0, 120)}${meta.description.length > 120 ? '…' : ''}`
+       : '';
+     return `- **${label}** (\`${relPath}/${skill}/SKILL.md\`)${desc}`;
+   });
+
+   const block = [
+     AGENTS_HEADER,
+     '## RL Expert Skills',
+     '',
+     `The following reinforcement learning skills are installed in \`${relPath}\`.`,
+     'Reference a skill by name to load its instructions into your context.',
+     '',
+     ...skillLines,
+     AGENTS_FOOTER,
+   ].join('\n');
+
+   // Determine the target instructions file
+   const instructionsFile = isProject
+     ? path.join(baseDir, 'AGENTS.md')
+     : path.join(baseDir, '.codex', 'instructions.md');
+
+   if (!isProject) {
+     fs.mkdirSync(path.join(baseDir, '.codex'), { recursive: true });
+   }
+
+   upsertBlock(instructionsFile, block);
+
+   const extra = `updated ${isProject ? 'AGENTS.md' : '~/.codex/instructions.md'}`;
+   process.stdout.write(`  OpenAI Codex installed ${skills.length} skills\n`);
+
+   return {
+     tool: 'OpenAI Codex',
+     count: skills.length,
+     path: targetDir,
+     extra,
+   };
+ }
+
+ /**
+  * Insert or replace the rl-expert-skills block in a markdown file.
+  */
+ function upsertBlock(filePath, block) {
+   let existing = '';
+   if (fs.existsSync(filePath)) {
+     existing = fs.readFileSync(filePath, 'utf8');
+   }
+
+   const startIdx = existing.indexOf(AGENTS_HEADER);
+   const endIdx = existing.indexOf(AGENTS_FOOTER);
+
+   let updated;
+   if (startIdx !== -1 && endIdx !== -1) {
+     // Replace the existing block
+     updated =
+       existing.slice(0, startIdx) +
+       block +
+       existing.slice(endIdx + AGENTS_FOOTER.length);
+   } else {
+     // Append the block
+     updated = existing
+       ? existing.trimEnd() + '\n\n' + block + '\n'
+       : block + '\n';
+   }
+
+   fs.writeFileSync(filePath, updated, 'utf8');
+ }
package/lib/copy.js ADDED
@@ -0,0 +1,37 @@
+ import fs from 'fs';
+ import path from 'path';
+
+ /**
+  * Recursively copy a directory from src to dest.
+  * Overwrites existing files.
+  */
+ export function copyDir(src, dest) {
+   fs.mkdirSync(dest, { recursive: true });
+   for (const entry of fs.readdirSync(src, { withFileTypes: true })) {
+     const srcPath = path.join(src, entry.name);
+     const destPath = path.join(dest, entry.name);
+     if (entry.isDirectory()) {
+       copyDir(srcPath, destPath);
+     } else {
+       fs.copyFileSync(srcPath, destPath);
+     }
+   }
+ }
+
+ /**
+  * Read SKILL.md in a skill directory and return its frontmatter fields.
+  * Returns { name, description } or null on failure.
+  */
+ export function readSkillMeta(skillDir) {
+   const skillMdPath = path.join(skillDir, 'SKILL.md');
+   if (!fs.existsSync(skillMdPath)) return null;
+   const content = fs.readFileSync(skillMdPath, 'utf8');
+   const match = content.match(/^---\n([\s\S]*?)\n---/);
+   if (!match) return null;
+   const front = match[1];
+   const name = (front.match(/^name:\s*(.+)$/m) || [])[1]?.trim();
+   // description may be single-line or quoted multi-line — grab everything after "description: "
+   const descMatch = front.match(/^description:\s*(['"]?)([\s\S]*?)(\1)\s*(?:\n\w|$)/m);
+   const description = descMatch ? descMatch[2].replace(/\n\s*/g, ' ').trim() : '';
+   return { name, description };
+ }
package/package.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "name": "rl-expert-skills",
+   "version": "1.0.0",
+   "description": "RL Expert skills for Claude Code and OpenAI Codex — 45 reinforcement learning skills covering architecture, algorithms, environments, evaluation, and MLOps.",
+   "type": "module",
+   "bin": {
+     "rl-expert-skills": "./bin/cli.js"
+   },
+   "files": [
+     "bin/",
+     "lib/",
+     "skills/"
+   ],
+   "scripts": {
+     "prepare": "node scripts/copy-skills.js"
+   },
+   "dependencies": {
+     "@inquirer/prompts": "^7.5.0"
+   },
+   "devDependencies": {},
+   "engines": {
+     "node": ">=18.0.0"
+   },
+   "keywords": [
+     "reinforcement-learning",
+     "rl",
+     "claude",
+     "codex",
+     "skills",
+     "ai"
+   ],
+   "license": "MIT"
+ }
package/skills/analyzing-saliency-and-values/SKILL.md ADDED
@@ -0,0 +1,83 @@
+ ---
+ name: Analyze Saliency and Values
+ description: This skill should be used when the user asks to "analyze saliency", "compute feature saliency", "run Grad-CAM", "apply integrated gradients", "extract attention weights", "visualize what the agent looks at", "inspect temporal attention maps", or "check if the memory architecture is useful". Do NOT hallucinate parameters outside the boundaries of Analyze Saliency and Values.
+ version: 0.1.0
+ ---
+
+ # Analyze Saliency and Values
+
+ Isolate attention and feature saliency using Grad-CAM or Integrated Gradients to reveal which inputs drive the agent's decisions and whether the memory architecture is earning its compute.
+
+ ## Step 1 — Gradient Saliency Extraction
+
+ Run a specific observation forward through the Actor network, then backpropagate the gradient of the chosen continuous action magnitude back to the input:
+
+ ```python
+ obs_tensor = torch.tensor(obs, dtype=torch.float32, requires_grad=True)
+ action = actor(obs_tensor)
+ action[target_dim].backward()
+ saliency = obs_tensor.grad.abs()
+ ```
+
+ For image-based policies, apply Grad-CAM to the final convolutional feature map:
+
+ ```python
+ # See scripts/grad_cam.py for full implementation
+ gradients = ...   # hooked during backward, shape (B, C, H, W)
+ activations = ... # hooked during forward, shape (B, C, H, W)
+ weights = gradients.mean(dim=(2, 3), keepdim=True)       # alpha_k: GAP over space
+ cam = (weights * activations).sum(dim=1, keepdim=True).relu()
+ cam = F.interpolate(cam, size=obs.shape[-2:], mode='bilinear')
+ ```
+
+ Overlay the resulting heatmap on the original frame to highlight which pixels or vector indices drove the decision most.
+
+ For continuous vector observations, produce a bar chart of per-dimension gradient magnitudes ranked by importance.
+
+ ## Step 2 — Temporal Attention Weights
+
+ If the Algo Engineer implemented a Transformer or LSTM, extract the attention weight matrix:
+
+ **Transformer attention:**
+ ```python
+ # Hook into multi-head attention output weights
+ attn_weights = model.transformer.layers[i].self_attn(..., need_weights=True)
+ # Shape: (batch, heads, seq_len, seq_len)
+ ```
+
+ **LSTM hidden state analysis:**
+ ```python
+ # Track hidden state norm over time — large norm = high reliance on that step
+ hidden_norms = [h.norm().item() for h in hidden_sequence]
+ ```
+
+ **Decision rule — evaluate temporal range:**
+ - If the agent relies exclusively on the most recent frame (`attn_weights[:, :, -1, -1] ≈ 1.0`), the memory architecture is useless.
+ - Inform the Architect: the LSTM/Transformer should be replaced with a simple MLP to save compute and reduce variance.
+ - Document the finding in the saliency report with the attention matrix visualized as a heatmap.
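The decision rule above can be checked numerically. A minimal NumPy sketch, with a synthetic attention tensor standing in for hooked weights (names and shapes here are illustrative):

```python
import numpy as np

def recency_score(attn):
    """attn: (batch, heads, seq_len, seq_len) attention weights.
    Returns the mean attention the last query position pays to the last key."""
    w_avg = attn.mean(axis=1)  # average across heads -> (batch, seq, seq)
    return float(w_avg[:, -1, -1].mean())

batch, heads, seq = 4, 2, 8

# Healthy case: uniform attention over the full 8-step window
uniform = np.full((batch, heads, seq, seq), 1.0 / seq)
assert abs(recency_score(uniform) - 1.0 / seq) < 1e-9  # history is actually used

# Degenerate case: each position attends only to itself (identity attention)
degenerate = np.zeros((batch, heads, seq, seq))
degenerate[..., np.arange(seq), np.arange(seq)] = 1.0
assert recency_score(degenerate) == 1.0  # memory is not being used
```

A score near `1/seq_len` means attention is spread across history; a score near `1.0` triggers the "replace with an MLP" recommendation.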
+
+ ## Step 3 — Integrated Gradients (Baseline Comparison)
+
+ Integrated Gradients attributes importance relative to a neutral baseline (e.g., a zero observation):
+
+ $$\text{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} d\alpha$$
+
+ Approximate the integral with 50 interpolation steps. See `scripts/integrated_gradients.py` for the full loop.
+
+ This is useful for checking whether the agent responds to physically meaningful features or to spurious correlations.
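The 50-step approximation is a short loop. A self-contained sketch, using a toy analytic model in place of the actor network (the packaged implementation lives in `scripts/integrated_gradients.py`; the names below are illustrative):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Riemann-sum approximation of Integrated Gradients.
    grad_fn(x) must return dF/dx evaluated at x."""
    diff = x - baseline
    total = np.zeros_like(x)
    for k in range(1, steps + 1):
        total += grad_fn(baseline + (k / steps) * diff)
    return diff * total / steps

# Toy model F(x) = sum(w * x**2), with analytic gradient 2*w*x
w = np.array([1.0, 3.0, 0.5])
F = lambda x: float(np.sum(w * x**2))
grad_fn = lambda x: 2 * w * x

x = np.array([1.0, -2.0, 4.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(grad_fn, x, baseline)

# Completeness axiom holds up to discretization error:
# sum(IG) ~= F(x) - F(baseline)
assert abs(ig.sum() - (F(x) - F(baseline))) / abs(F(x)) < 0.05
```

In a real run, `grad_fn` would be a single forward/backward pass through the actor or critic at the interpolated input.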
+
+ ## Step 4 — Report
+
+ Produce a **Saliency Analysis Report** containing:
+ 1. Grad-CAM / saliency heatmaps for at least 5 representative states (goal-approach, obstacle-avoidance, failure state)
+ 2. Temporal attention matrix visualization (if memory architecture present)
+ 3. Verdict on memory architecture utility
+ 4. Top-5 most influential observation dimensions with physical interpretation
+
+ ## Additional Resources
+
+ ### Reference Files
+ - **`references/saliency_math.md`** — Grad-CAM and Integrated Gradients derivations, formulas, and edge cases
+
+ ### Scripts
+ - **`scripts/grad_cam.py`** — Full Grad-CAM implementation with forward/backward hooks
+ - **`scripts/integrated_gradients.py`** — Integrated Gradients loop with interpolation
package/skills/analyzing-saliency-and-values/references/saliency_math.md ADDED
@@ -0,0 +1,48 @@
+ # Saliency Math Reference
+
+ ## Grad-CAM Derivation
+
+ Grad-CAM (Gradient-weighted Class Activation Mapping) computes importance weights for each feature map channel $k$ by global-average-pooling the gradients of the target output $y^c$ with respect to the feature map activations $A^k$:
+
+ $$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$
+
+ The class activation map is then:
+
+ $$L_{\text{Grad-CAM}}^c = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$
+
+ The ReLU is critical: only features with a positive influence on the target class are retained.
+
+ **For RL actor networks:** Replace the class score $y^c$ with the action magnitude $\pi_\theta(s)[\text{dim}]$ or the Q-value $Q(s, a)$.
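The two formulas above reduce to a few lines of NumPy. A sketch with synthetic gradient and activation tensors standing in for hooked values (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W = 1, 8, 7, 7
activations = rng.normal(size=(B, C, H, W))  # A^k from the forward hook
gradients = rng.normal(size=(B, C, H, W))    # dy^c/dA^k from the backward hook

# alpha_k^c: global average pool of gradients over spatial dims (Z = H*W)
alpha = gradients.mean(axis=(2, 3), keepdims=True)        # (B, C, 1, 1)

# L^c = ReLU(sum_k alpha_k^c * A^k)
cam = np.maximum((alpha * activations).sum(axis=1), 0.0)  # (B, H, W)

# Normalize to [0, 1] before overlaying (see Edge Cases)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

The result is upsampled to the input resolution (e.g. bilinear interpolation) before being overlaid on the frame.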
+
+ ## Integrated Gradients Derivation
+
+ Integrated Gradients (IG) computes the contribution of each input feature $x_i$ relative to a baseline $x'$ (typically zeros or Gaussian noise):
+
+ $$\text{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} \, d\alpha$$
+
+ **Numerical approximation (Riemann sum, 50 steps):**
+
+ $$\text{IG}_i(x) \approx (x_i - x'_i) \times \sum_{k=1}^{m} \frac{\partial F\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_i} \times \frac{1}{m}$$
+
+ IG satisfies the **completeness axiom**: $\sum_i \text{IG}_i(x) = F(x) - F(x')$. This means attributions sum to the actual output difference — a correctness guarantee Grad-CAM lacks.
+
+ ## Transformer Attention Extraction
+
+ Multi-head attention computes:
+
+ $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
+
+ The attention weight matrix $W = \text{softmax}(QK^T / \sqrt{d_k})$ has shape `(batch, heads, seq_len, seq_len)`. Entry $W[b, h, i, j]$ represents how much position $i$ attends to position $j$.
+
+ **Temporal range analysis:**
+ - Average across heads: `W_avg = W.mean(dim=1)` — shape `(batch, seq_len, seq_len)`
+ - If `W_avg[:, -1, -1] > 0.9` consistently, the agent is almost exclusively attending to the most recent frame
+
+ **LSTM hidden state norm as attention proxy:**
+ An LSTM does not produce an explicit attention matrix, but the hidden state norm $\|h_t\|_2$ over time approximates the "importance" the network assigns to each step of history. A flat norm curve suggests the LSTM gates are near saturation and the network is not selectively reading history.
+
+ ## Edge Cases
+
+ - **Vanishing gradients in deep networks:** Saliency computed at the input layer through many ReLU activations may be near-zero even for influential features. Prefer Grad-CAM on intermediate layers or use SmoothGrad (average over noisy copies of the input).
+ - **Discrete action spaces:** Grad-CAM w.r.t. `argmax` is undefined (not differentiable). Use the pre-softmax logit of the chosen action instead.
+ - **Normalization sensitivity:** Always normalize the saliency map to `[0, 1]` before overlaying. Raw gradient magnitudes vary across observations by orders of magnitude.
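The SmoothGrad fix mentioned under vanishing gradients amounts to averaging the input gradient over noisy copies of the observation. A toy sketch, with an analytic gradient standing in for a network backward pass:

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=25, sigma=0.1, seed=0):
    """Average input gradients over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + rng.normal(scale=sigma, size=x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)

# Toy stand-in for the actor's input gradient: dF/dx for F(x) = sum(x**2)
grad_fn = lambda x: 2 * x

x = np.array([1.0, -0.5, 2.0])
saliency = np.abs(smoothgrad(grad_fn, x))

# With zero-mean noise, the smoothed gradient stays close to the clean one,
# while noise in the raw per-sample gradients is averaged away
assert np.allclose(saliency, np.abs(2 * x), atol=0.2)
```

In practice `grad_fn` is one backward pass per noisy sample, so SmoothGrad costs `n_samples` forward/backward passes per saliency map.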
---
name: Blueprint System Architecture
description: This skill should be used when the user asks to "blueprint the system architecture", "map the RL data flow", "design the RL pipeline", "define environment-agent interaction", "architect replay buffer placement", "design rollout and learner layers", or "draw the end-to-end RL system diagram". Do NOT hallucinate parameters outside the boundaries of Blueprint System Architecture.
version: 0.1.0
---

# Blueprint System Architecture

Map the end-to-end data flow between the Environment, the Core RL Loop, and Replay Memory. This skill produces a structured architectural diagram and component breakdown for any Reinforcement Learning system, covering process topology, hardware placement, and data flow direction.

## Core Components to Define

Every RL system architecture consists of four layers that must be explicitly specified:

| Layer | Role | Key Questions |
|---|---|---|
| **Rollout Worker Layer** | Generates experience | Async or sync? SubprocVecEnv or single process? |
| **Experience Buffer Storage Layer** | Stores transitions | RAM-local or distributed (e.g. Ape-X)? |
| **Learner / Optimizer Layer** | Computes gradients | CPU→GPU batch transfer or GPU-native? |
| **Evaluation / Logging Hook Layer** | Tracks metrics | Frequency, checkpointing, W&B/TensorBoard? |

## Step 1 — Map Component Interaction

Define the process topology first:

- **Single-process**: The environment runs in the same process as the agent. Simple and debuggable. Use for prototyping.
- **SubprocVecEnv**: Multiple environment instances run as separate OS processes. Use when environment computation is the bottleneck.
- **Distributed (Ape-X / IMPALA style)**: Rollout workers run on separate machines and push experience to a central Replay Buffer. The Learner pulls batches remotely. Use when scale exceeds one machine.

Specify the communication mechanism between components: shared memory, pipes, sockets, or gRPC.

## Step 2 — Define Hardware Data Placement

Data placement critically impacts throughput. Apply the following rules:

**Standard CPU/GPU split (most systems):**
1. Environment `step()` executes on CPU.
2. Experience `(s, a, r, s', done)` is stored in CPU RAM as NumPy arrays.
3. Sampled mini-batches are transferred to GPU via `.to(device)` only during gradient computation.
4. Policy inference during rollout runs on CPU (single sample, low batch size) or GPU (vectorized envs).

**GPU-native physics simulation (Isaac Gym / Brax):**
- Environment state tensors live permanently in GPU memory.
- Policy inference and gradient computation both occur on GPU.
- **Critical rule:** Never call `.cpu()` or `.numpy()` on these tensors inside the rollout loop — this triggers a GPU→CPU bus transfer that destroys throughput. Keep tensors on-device until logging.

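A minimal NumPy sketch of the standard CPU/GPU split; the actual device transfer appears only as a comment, since it assumes PyTorch and a CUDA device:

```python
import numpy as np

# Rule 2: transitions live in CPU RAM as preallocated NumPy arrays
capacity, obs_dim = 10_000, 4
obs_store = np.zeros((capacity, obs_dim), dtype=np.float32)
act_store = np.zeros(capacity, dtype=np.int64)
rew_store = np.zeros(capacity, dtype=np.float32)

# ... the rollout loop fills the arrays on CPU (rule 1) ...
size = 256
obs_store[:size] = np.random.default_rng(0).normal(size=(size, obs_dim))

# Rule 3: the only device transfer is per sampled mini-batch, e.g. in
# PyTorch: torch.as_tensor(batch_obs, device="cuda") inside the update step
idx = np.random.default_rng(1).integers(0, size, size=64)
batch_obs = obs_store[idx]
print(batch_obs.shape)  # (64, 4)
```

Preallocating fixed-dtype arrays keeps the storage layer free of per-step Python allocation, so the only per-update cost is the one indexed copy plus the single host-to-device transfer.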
## Step 3 — Specify Replay Buffer Architecture

| System Type | Buffer Location | Buffer Class |
|---|---|---|
| On-Policy (PPO, A2C) | In-process RAM | `RolloutBuffer` (cleared after each update) |
| Off-Policy single machine (SAC, DQN) | In-process RAM | `ReplayBuffer` (circular, persists across updates) |
| Distributed Off-Policy (Ape-X) | Centralized Redis / Ray store | `PrioritizedReplayBuffer` (remote) |

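The circular `ReplayBuffer` row can be sketched as follows — a simplified uniform-sampling version; a real `PrioritizedReplayBuffer` would add a sum-tree over priorities:

```python
import numpy as np

class ReplayBuffer:
    """Circular off-policy buffer: persists across updates and
    overwrites the oldest transitions once capacity is reached."""

    def __init__(self, capacity, obs_dim):
        self.capacity = capacity
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=bool)
        self.ptr, self.size = 0, 0

    def add(self, s, a, r, s2, done):
        i = self.ptr
        self.obs[i], self.actions[i], self.rewards[i] = s, a, r
        self.next_obs[i], self.dones[i] = s2, done
        self.ptr = (self.ptr + 1) % self.capacity   # circular overwrite
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, self.size, size=batch_size)
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])

buf = ReplayBuffer(capacity=4, obs_dim=2)
for t in range(6):                  # six adds into capacity four -> wraps
    buf.add([t, t], t % 2, 1.0, [t + 1, t + 1], False)
print(buf.size, buf.ptr)  # 4 2
batch = buf.sample(8)
print(batch[0].shape)  # (8, 2)
```

A `RolloutBuffer` differs only in lifecycle: it is filled sequentially, consumed whole, and reset after every update rather than overwritten circularly.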
## Output Format

Generate an ASCII diagram specifying all four layers, then follow with a structured breakdown:

```
┌─────────────────────────────────────────────┐
│            ROLLOUT WORKER LAYER             │
│  [Env 0] [Env 1] ... [Env N] (CPU/GPU)      │
│  Generates: (s, a, r, s', done) tuples      │
└────────────────────┬────────────────────────┘
                     │ push experience
┌────────────────────▼────────────────────────┐
│       EXPERIENCE BUFFER STORAGE LAYER       │
│ ReplayBuffer / RolloutBuffer (RAM / Remote) │
│  Capacity: N transitions                    │
└────────────────────┬────────────────────────┘
                     │ sample mini-batch → GPU
┌────────────────────▼────────────────────────┐
│          LEARNER / OPTIMIZER LAYER          │
│  Actor + Critic forward pass (GPU)          │
│  Loss computation + backprop (GPU)          │
│  Parameter sync back to workers             │
└────────────────────┬────────────────────────┘
                     │ metrics
┌────────────────────▼────────────────────────┐
│       EVALUATION / LOGGING HOOK LAYER       │
│  Episode return, entropy, loss, FPS         │
│  Checkpoint saves, W&B / TensorBoard        │
└─────────────────────────────────────────────┘
```

## Additional Resources

### Reference Files

- **`references/distributed-patterns.md`** — Ape-X, IMPALA, and R2D2 architectural patterns
- **`references/hardware-placement-rules.md`** — GPU-native simulation rules (Isaac Gym / Brax)

### Scripts

- **`scripts/validate_architecture.py`** — Validates that a proposed architecture specification dict contains all required layers and fields
# Distributed RL Architecture Patterns

## Ape-X (Distributed Prioritized Experience Replay)

```
Architecture:
  N Actor workers (CPU) — each runs a copy of the policy
  1 Central Replay Buffer (Redis / Ray store) — receives prioritized transitions
  1 Learner (GPU) — pulls mini-batches, computes gradients, broadcasts weights back

Data flow:
  Actor → (s, a, r, s', done, priority) → Replay Buffer
  Learner → sample(batch) → Replay Buffer
  Learner → updated weights → Actors (async broadcast)
```

Key properties:
- Actors run old policy parameters (lag of 1–100 gradient steps)
- Prioritization is done at the actor level using local TD-error estimates
- Learner runs at maximum throughput, decoupled from rollout speed
- Scale: 360 actors in the original DeepMind paper; scales linearly

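Actor-level prioritization can be sketched with the usual proportional scheme; the Q-values and the `eps`/`alpha` constants below are illustrative:

```python
import numpy as np

def td_error_priority(q_sa, r, q_next_max, gamma=0.99, eps=1e-3, alpha=0.6):
    """Actor-side priority from the 1-step TD error:
       delta = r + gamma * max_a' Q(s', a') - Q(s, a)
       priority = (|delta| + eps) ** alpha
    """
    delta = r + gamma * q_next_max - q_sa
    return (np.abs(delta) + eps) ** alpha

# Two transitions: one well-predicted, one surprising
q_sa = np.array([1.0, 0.5])
r = np.array([0.0, 1.0])
q_next_max = np.array([1.0, 0.5])
priorities = td_error_priority(q_sa, r, q_next_max)
print(priorities)  # the surprising transition gets the higher priority
```

Computing priorities on the actor (from its stale network) avoids a round-trip to the learner; the learner later refreshes priorities for sampled transitions.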
## IMPALA (Importance Weighted Actor-Learner Architecture)

```
Architecture:
  N Actor workers — generate trajectories, send to queue
  1 Learner — pulls trajectory chunks from queue, applies V-trace correction

V-trace correction:
  Corrects for off-policy data via importance sampling ratio clipping:
  ρ_t = min(ρ̄, π_target(a_t|s_t) / π_behavior(a_t|s_t))
```

Key properties:
- Designed for CPU-only actors with a GPU learner
- V-trace makes it robust to large policy lag (stable despite stale actors)
- Policy-gradient based (not Q-learning), so compatible with continuous actions

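The truncation rule above can be written directly; the per-step log-probabilities below are hypothetical:

```python
import numpy as np

def vtrace_rhos(logp_target, logp_behavior, rho_bar=1.0):
    """Truncated importance ratios used by V-trace:
       rho_t = min(rho_bar, pi_target(a_t|s_t) / pi_behavior(a_t|s_t))"""
    ratios = np.exp(logp_target - logp_behavior)
    return np.minimum(rho_bar, ratios)

# Hypothetical action probabilities along a 4-step trajectory chunk
logp_target = np.log(np.array([0.5, 0.2, 0.9, 0.1]))
logp_behavior = np.log(np.array([0.25, 0.4, 0.3, 0.1]))
print(vtrace_rhos(logp_target, logp_behavior))  # [1.  0.5 1.  1. ]
```

Working in log-probabilities (as actors typically ship them) keeps the ratio numerically stable; the cap ρ̄ bounds the variance contributed by any single step.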
## R2D2 (Recurrent Replay Distributed DQN)

```
Architecture:
  Like Ape-X but:
  - Actor/Learner networks use an LSTM for POMDPs
  - Replay Buffer stores sequences (not individual transitions)
  - Hidden state h_t is stored alongside the (s, a, r, done) sequences

Burn-in: the first K steps of each stored sequence are used to warm up the LSTM state
```

Use R2D2 when the environment requires memory (POMDP) and distributed scale is needed.

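Burn-in is just sequence slicing; `burn_in=5` and the shapes below are illustrative:

```python
import numpy as np

def split_burn_in(seq_obs, burn_in=5):
    """Split a stored sequence into a burn-in prefix and a training suffix.

    In R2D2-style training the LSTM is first unrolled over the burn-in
    prefix, starting from the stored hidden state h_t (no gradient), and
    the loss is then computed only on the remaining steps.
    """
    return seq_obs[:burn_in], seq_obs[burn_in:]

seq = np.zeros((40, 8))           # stored sequence of 40 observations
warmup, train = split_burn_in(seq, burn_in=5)
print(warmup.shape, train.shape)  # (5, 8) (35, 8)
```

The stored hidden state is stale relative to the learner's current weights; the burn-in unroll lets the fresh network recompute a consistent state before the loss is applied.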
## Single-Machine Multi-GPU

For smaller scale (8–32 environments, 1–4 GPUs):

```
Process layout:
  Main process: Learner on GPU 0
  Subprocess pool: N env workers on CPU
  Optional: separate GPU for inference during rollout (if batch size > 256)
```

Recommended tools: `torch.multiprocessing`, `ray.remote`, or `stable-baselines3` VecEnv.