rl-expert-skills 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +65 -0
- package/bin/cli.js +66 -0
- package/lib/claude.js +37 -0
- package/lib/codex.js +108 -0
- package/lib/copy.js +37 -0
- package/package.json +33 -0
- package/skills/analyzing-saliency-and-values/SKILL.md +83 -0
- package/skills/analyzing-saliency-and-values/references/saliency_math.md +48 -0
- package/skills/blueprinting-system-architecture/SKILL.md +95 -0
- package/skills/blueprinting-system-architecture/references/distributed-patterns.md +64 -0
- package/skills/blueprinting-system-architecture/references/hardware-placement-rules.md +79 -0
- package/skills/blueprinting-system-architecture/scripts/validate_architecture.py +168 -0
- package/skills/conducting-adversarial-rl-testing/SKILL.md +74 -0
- package/skills/conducting-adversarial-rl-testing/references/adversarial_theory.md +54 -0
- package/skills/configuring-checkpoint-strategy/SKILL.md +99 -0
- package/skills/configuring-checkpoint-strategy/references/checkpoint_patterns.md +108 -0
- package/skills/configuring-checkpoint-strategy/scripts/checkpoint_manager.py +101 -0
- package/skills/configuring-distributed-rollouts/SKILL.md +104 -0
- package/skills/configuring-distributed-rollouts/references/distributed_patterns.md +95 -0
- package/skills/configuring-distributed-rollouts/scripts/batch_size_calculator.py +123 -0
- package/skills/configuring-hyperparameter-tuning/SKILL.md +131 -0
- package/skills/configuring-hyperparameter-tuning/references/hyperparameter_ranges.md +88 -0
- package/skills/configuring-hyperparameter-tuning/scripts/optuna_rl_study.py +130 -0
- package/skills/configuring-replay-buffers/SKILL.md +89 -0
- package/skills/configuring-replay-buffers/references/her_and_memory_patterns.md +153 -0
- package/skills/configuring-replay-buffers/scripts/per_sumtree.py +150 -0
- package/skills/creating-ood-robustness-tests/SKILL.md +79 -0
- package/skills/creating-ood-robustness-tests/references/ood_test_patterns.md +79 -0
- package/skills/defining-action-space/SKILL.md +78 -0
- package/skills/defining-action-space/references/action-space-patterns.md +88 -0
- package/skills/defining-action-space/scripts/validate_action_space.py +145 -0
- package/skills/defining-mdp-transition/SKILL.md +97 -0
- package/skills/defining-mdp-transition/references/mdp-formalism.md +75 -0
- package/skills/defining-mdp-transition/scripts/validate_mdp_spec.py +208 -0
- package/skills/defining-state-space/SKILL.md +91 -0
- package/skills/defining-state-space/references/observation-design-patterns.md +112 -0
- package/skills/defining-state-space/scripts/validate_state_space.py +175 -0
- package/skills/designing-adversarial-environments/SKILL.md +133 -0
- package/skills/designing-adversarial-environments/references/paired-algorithm.md +82 -0
- package/skills/designing-deterministic-eval-loop/SKILL.md +105 -0
- package/skills/designing-deterministic-eval-loop/references/eval_loop_patterns.md +95 -0
- package/skills/designing-feature-extractor/SKILL.md +141 -0
- package/skills/designing-feature-extractor/references/resnet_backbone_patterns.md +114 -0
- package/skills/designing-feature-extractor/scripts/multimodal_extractor.py +133 -0
- package/skills/designing-hierarchical-rl/SKILL.md +93 -0
- package/skills/designing-hierarchical-rl/references/hrl-algorithms.md +92 -0
- package/skills/designing-hierarchical-rl/scripts/validate_hrl_schema.py +185 -0
- package/skills/designing-policy-actor/SKILL.md +147 -0
- package/skills/designing-policy-actor/references/torch_distributions_api.md +97 -0
- package/skills/designing-policy-actor/scripts/gaussian_actor.py +134 -0
- package/skills/designing-reward-function/SKILL.md +113 -0
- package/skills/designing-reward-function/references/reward-shaping-theory.md +57 -0
- package/skills/designing-reward-function/scripts/validate_reward_function.py +157 -0
- package/skills/designing-value-critic/SKILL.md +143 -0
- package/skills/designing-value-critic/references/bellman_equations.md +134 -0
- package/skills/designing-value-critic/scripts/twin_critic_polyak.py +148 -0
- package/skills/diagnosing-training-failures/SKILL.md +89 -0
- package/skills/diagnosing-training-failures/references/failure_modes.md +73 -0
- package/skills/evaluating-marl-equilibria/SKILL.md +105 -0
- package/skills/evaluating-marl-equilibria/references/marl_equilibria_theory.md +70 -0
- package/skills/formulating-irl-and-imitation/SKILL.md +103 -0
- package/skills/formulating-irl-and-imitation/references/imitation-algorithms.md +124 -0
- package/skills/formulating-irl-and-imitation/scripts/validate_demo_dataset.py +192 -0
- package/skills/formulating-marl-architecture/SKILL.md +98 -0
- package/skills/formulating-marl-architecture/references/marl-algorithms.md +104 -0
- package/skills/formulating-marl-architecture/scripts/validate_marl_spec.py +196 -0
- package/skills/implementing-domain-randomization/SKILL.md +120 -0
- package/skills/implementing-domain-randomization/references/adr-guide.md +85 -0
- package/skills/implementing-domain-randomization/scripts/adr_controller.py +81 -0
- package/skills/implementing-gymnasium-env/SKILL.md +138 -0
- package/skills/implementing-gymnasium-env/references/api-reference.md +113 -0
- package/skills/implementing-gymnasium-env/scripts/validate_gymnasium_env.py +99 -0
- package/skills/implementing-imitation-learning/SKILL.md +169 -0
- package/skills/implementing-imitation-learning/references/dataset_and_bce_patterns.md +142 -0
- package/skills/implementing-imitation-learning/scripts/gail_discriminator.py +173 -0
- package/skills/implementing-loss-functions/SKILL.md +174 -0
- package/skills/implementing-loss-functions/references/gradient_clipping_guide.md +98 -0
- package/skills/implementing-loss-functions/scripts/ppo_losses.py +179 -0
- package/skills/implementing-marl-algorithms/SKILL.md +170 -0
- package/skills/implementing-marl-algorithms/references/pytorch_marl_broadcasting.md +116 -0
- package/skills/implementing-marl-algorithms/scripts/qmix_mixing_network.py +181 -0
- package/skills/implementing-memory-models/SKILL.md +172 -0
- package/skills/implementing-memory-models/references/lstm_and_causal_masking.md +142 -0
- package/skills/implementing-memory-models/scripts/bptt_iterator.py +177 -0
- package/skills/implementing-observation-wrappers/SKILL.md +138 -0
- package/skills/implementing-observation-wrappers/references/normalization-patterns.md +113 -0
- package/skills/implementing-observation-wrappers/scripts/running_mean_std.py +113 -0
- package/skills/implementing-offline-rl/SKILL.md +187 -0
- package/skills/implementing-offline-rl/references/hdf5_dataset_ingestion.md +171 -0
- package/skills/implementing-offline-rl/scripts/cql_offline.py +227 -0
- package/skills/implementing-pettingzoo-marl-env/SKILL.md +164 -0
- package/skills/implementing-pettingzoo-marl-env/references/ctde-integration.md +110 -0
- package/skills/implementing-rendering-and-plotting/SKILL.md +130 -0
- package/skills/implementing-rendering-and-plotting/references/plotting_patterns.md +114 -0
- package/skills/implementing-temporal-wrappers/SKILL.md +161 -0
- package/skills/implementing-temporal-wrappers/references/frame-patterns.md +116 -0
- package/skills/integrating-physics-simulator/SKILL.md +151 -0
- package/skills/integrating-physics-simulator/references/engine-guide.md +178 -0
- package/skills/rl-algo-engineer/SKILL.md +7 -0
- package/skills/rl-algo-engineer/steps/step-01-init.md +137 -0
- package/skills/rl-algo-engineer/steps/step-02-feature-extractor.md +97 -0
- package/skills/rl-algo-engineer/steps/step-03-policy-actor.md +90 -0
- package/skills/rl-algo-engineer/steps/step-04-value-critic.md +102 -0
- package/skills/rl-algo-engineer/steps/step-05-memory-models.md +92 -0
- package/skills/rl-algo-engineer/steps/step-06-replay-buffers.md +90 -0
- package/skills/rl-algo-engineer/steps/step-07-loss-functions.md +109 -0
- package/skills/rl-algo-engineer/steps/step-08-marl-algorithms.md +88 -0
- package/skills/rl-algo-engineer/steps/step-09-offline-rl.md +90 -0
- package/skills/rl-algo-engineer/steps/step-10-imitation-learning.md +95 -0
- package/skills/rl-algo-engineer/steps/step-11-rlhf-reward-model.md +93 -0
- package/skills/rl-algo-engineer/steps/step-12-synthesize.md +140 -0
- package/skills/rl-algo-engineer/workflow.md +82 -0
- package/skills/rl-architect/SKILL.md +7 -0
- package/skills/rl-architect/steps/step-01-init.md +131 -0
- package/skills/rl-architect/steps/step-02-define-state-space.md +116 -0
- package/skills/rl-architect/steps/step-03-define-action-space.md +89 -0
- package/skills/rl-architect/steps/step-04-design-reward-function.md +94 -0
- package/skills/rl-architect/steps/step-05-define-mdp-transition.md +95 -0
- package/skills/rl-architect/steps/step-06-select-rl-strategy.md +102 -0
- package/skills/rl-architect/steps/step-07-blueprint-system-architecture.md +113 -0
- package/skills/rl-architect/steps/step-08-formulate-marl-architecture.md +88 -0
- package/skills/rl-architect/steps/step-09-design-hierarchical-rl.md +93 -0
- package/skills/rl-architect/steps/step-10-formulate-irl-and-imitation.md +86 -0
- package/skills/rl-architect/steps/step-11-synthesize.md +142 -0
- package/skills/rl-architect/workflow.md +78 -0
- package/skills/rl-env-engineer/SKILL.md +7 -0
- package/skills/rl-env-engineer/steps/step-01-init.md +132 -0
- package/skills/rl-env-engineer/steps/step-02-gymnasium-env.md +99 -0
- package/skills/rl-env-engineer/steps/step-03-physics-simulator.md +99 -0
- package/skills/rl-env-engineer/steps/step-04-observation-wrappers.md +106 -0
- package/skills/rl-env-engineer/steps/step-05-temporal-wrappers.md +95 -0
- package/skills/rl-env-engineer/steps/step-06-domain-randomization.md +100 -0
- package/skills/rl-env-engineer/steps/step-07-validate-env-compatibility.md +111 -0
- package/skills/rl-env-engineer/steps/step-08-pettingzoo-marl-env.md +104 -0
- package/skills/rl-env-engineer/steps/step-09-adversarial-environments.md +96 -0
- package/skills/rl-env-engineer/steps/step-10-synthesize.md +126 -0
- package/skills/rl-env-engineer/workflow.md +80 -0
- package/skills/rl-evaluator/SKILL.md +7 -0
- package/skills/rl-evaluator/steps/step-01-init.md +135 -0
- package/skills/rl-evaluator/steps/step-02-deterministic-eval-loop.md +120 -0
- package/skills/rl-evaluator/steps/step-03-ood-robustness-tests.md +92 -0
- package/skills/rl-evaluator/steps/step-04-rendering-and-plotting.md +104 -0
- package/skills/rl-evaluator/steps/step-05-saliency-and-values.md +109 -0
- package/skills/rl-evaluator/steps/step-06-diagnose-training-failures.md +109 -0
- package/skills/rl-evaluator/steps/step-07-evaluate-marl-equilibria.md +93 -0
- package/skills/rl-evaluator/steps/step-08-adversarial-rl-testing.md +112 -0
- package/skills/rl-evaluator/steps/step-09-synthesize.md +136 -0
- package/skills/rl-evaluator/workflow.md +80 -0
- package/skills/rl-mlops-engineer/SKILL.md +7 -0
- package/skills/rl-mlops-engineer/steps/step-01-init.md +135 -0
- package/skills/rl-mlops-engineer/steps/step-02-experiment-tracking.md +121 -0
- package/skills/rl-mlops-engineer/steps/step-03-hyperparameter-tuning.md +124 -0
- package/skills/rl-mlops-engineer/steps/step-04-distributed-rollouts.md +121 -0
- package/skills/rl-mlops-engineer/steps/step-05-checkpoint-strategy.md +140 -0
- package/skills/rl-mlops-engineer/steps/step-06-offline-rl-datasets.md +128 -0
- package/skills/rl-mlops-engineer/steps/step-07-rlhf-data-pipeline.md +120 -0
- package/skills/rl-mlops-engineer/steps/step-08-synthesize.md +144 -0
- package/skills/rl-mlops-engineer/workflow.md +78 -0
- package/skills/selecting-rl-strategy/SKILL.md +97 -0
- package/skills/selecting-rl-strategy/references/algorithm-deep-dives.md +96 -0
- package/skills/selecting-rl-strategy/scripts/algorithm_selector.py +190 -0
- package/skills/setting-up-experiment-tracking/SKILL.md +146 -0
- package/skills/setting-up-experiment-tracking/references/metric_interpretations.md +93 -0
- package/skills/setting-up-experiment-tracking/scripts/wandb_rl_callback.py +136 -0
- package/skills/setting-up-offline-rl-datasets/SKILL.md +151 -0
- package/skills/setting-up-offline-rl-datasets/references/offline_rl_formats.md +117 -0
- package/skills/setting-up-offline-rl-datasets/scripts/offline_dataset_builder.py +207 -0
- package/skills/setting-up-rlhf-data-pipeline/SKILL.md +207 -0
- package/skills/setting-up-rlhf-data-pipeline/references/rlhf_architecture.md +123 -0
- package/skills/setting-up-rlhf-data-pipeline/scripts/rlhf_pipeline.py +251 -0
- package/skills/training-rlhf-reward-model/SKILL.md +195 -0
- package/skills/training-rlhf-reward-model/references/reward_bounding_patterns.md +162 -0
- package/skills/training-rlhf-reward-model/scripts/bradley_terry_loss.py +197 -0
- package/skills/validating-env-compatibility/SKILL.md +145 -0
- package/skills/validating-env-compatibility/scripts/validate_env_suite.py +170 -0
package/README.md
ADDED
@@ -0,0 +1,65 @@
+# rl-expert-skills
+
+45 reinforcement learning skills for **Claude Code** and **OpenAI Codex**.
+
+Covers architecture, algorithms, environments, evaluation, and MLOps.
+
+## Usage
+
+From any project folder:
+
+```bash
+npx rl-expert-skills
+```
+
+The installer will ask:
+
+1. **Which tool?** — Claude Code, OpenAI Codex, or both
+2. **Scope?** — current project or current user
+
+### What gets installed
+
+| Tool | Project scope | User scope |
+|------|--------------|------------|
+| Claude Code | `.claude/skills/` | `~/.claude/skills/` |
+| OpenAI Codex | `.codex/skills/` + `AGENTS.md` | `~/.codex/skills/` + `~/.codex/instructions.md` |
+
+## Skills included
+
+**Orchestrator agents**
+- `rl-architect` — MDP formulation, algorithm selection, system blueprint
+- `rl-algo-engineer` — neural networks, training loop, loss functions
+- `rl-env-engineer` — Gymnasium/PettingZoo environments and wrappers
+- `rl-evaluator` — evaluation, OOD testing, failure diagnosis
+- `rl-mlops-engineer` — experiment tracking, distributed training, checkpointing
+
+**Individual skills** (40 skills covering reward design, policy/critic architecture, MARL, offline RL, RLHF, domain randomization, and more)
+
+## Publishing
+
+### First time
+
+```bash
+cd F:/rl-expert-skills
+npm login
+npm publish --access public
+```
+
+### Updating skills
+
+After editing skills in `F:\RL-Expert\skills\`, re-bundle and republish:
+
+```bash
+cd F:/rl-expert-skills
+npm run prepare   # re-copies skills from ../RL-Expert/skills/
+npm version patch # bump version (patch / minor / major)
+npm publish --access public
+```
+
+### Using before publishing (local test)
+
+From any project folder:
+
+```bash
+npx F:/rl-expert-skills
+```
package/bin/cli.js
ADDED
@@ -0,0 +1,66 @@
+#!/usr/bin/env node
+
+import { checkbox, select, confirm } from '@inquirer/prompts';
+import path from 'path';
+import { fileURLToPath } from 'url';
+import { installClaude } from '../lib/claude.js';
+import { installCodex } from '../lib/codex.js';
+
+const __dirname = path.dirname(fileURLToPath(import.meta.url));
+const SKILLS_DIR = path.join(__dirname, '..', 'skills');
+
+async function main() {
+  console.log('\nRL Expert Skills Installer');
+  console.log('==========================');
+  console.log('45 reinforcement learning skills for Claude Code and OpenAI Codex\n');
+
+  const tools = await checkbox({
+    message: 'Which tool(s) do you want to install the skills for?',
+    choices: [
+      { name: 'Claude Code', value: 'claude', checked: true },
+      { name: 'OpenAI Codex', value: 'codex', checked: false },
+    ],
+    validate: (choices) => choices.length > 0 || 'Please select at least one tool.',
+  });
+
+  const scope = await select({
+    message: 'Install scope?',
+    choices: [
+      {
+        name: 'Current project (.claude/skills/ or .codex/skills/)',
+        value: 'project',
+      },
+      {
+        name: 'Current user (~/.claude/skills/ or ~/.codex/skills/)',
+        value: 'user',
+      },
+    ],
+  });
+
+  console.log('');
+
+  const results = [];
+
+  for (const tool of tools) {
+    if (tool === 'claude') {
+      const result = await installClaude(SKILLS_DIR, scope);
+      results.push(result);
+    } else if (tool === 'codex') {
+      const result = await installCodex(SKILLS_DIR, scope);
+      results.push(result);
+    }
+  }
+
+  console.log('\nDone');
+  console.log('----');
+  for (const r of results) {
+    console.log(`  ${r.tool}: ${r.count} skills -> ${r.path}`);
+    if (r.extra) console.log(`    ${r.extra}`);
+  }
+  console.log('');
+}
+
+main().catch((err) => {
+  console.error('\nError:', err.message);
+  process.exit(1);
+});
package/lib/claude.js
ADDED
@@ -0,0 +1,37 @@
+import fs from 'fs';
+import path from 'path';
+import os from 'os';
+import { copyDir } from './copy.js';
+
+/**
+ * Install skills for Claude Code.
+ *
+ * Project scope: <cwd>/.claude/skills/<skill>/
+ * User scope:    ~/.claude/skills/<skill>/
+ */
+export async function installClaude(skillsDir, scope) {
+  const targetDir =
+    scope === 'project'
+      ? path.join(process.cwd(), '.claude', 'skills')
+      : path.join(os.homedir(), '.claude', 'skills');
+
+  fs.mkdirSync(targetDir, { recursive: true });
+
+  const skills = fs.readdirSync(skillsDir, { withFileTypes: true })
+    .filter((e) => e.isDirectory())
+    .map((e) => e.name);
+
+  for (const skill of skills) {
+    const src = path.join(skillsDir, skill);
+    const dest = path.join(targetDir, skill);
+    copyDir(src, dest);
+  }
+
+  process.stdout.write(`  Claude Code installed ${skills.length} skills\n`);
+
+  return {
+    tool: 'Claude Code ',
+    count: skills.length,
+    path: targetDir,
+  };
+}
package/lib/codex.js
ADDED
@@ -0,0 +1,108 @@
+import fs from 'fs';
+import path from 'path';
+import os from 'os';
+import { copyDir, readSkillMeta } from './copy.js';
+
+const AGENTS_HEADER = '<!-- rl-expert-skills -->';
+const AGENTS_FOOTER = '<!-- /rl-expert-skills -->';
+
+/**
+ * Install skills for OpenAI Codex.
+ *
+ * Project scope:
+ *   - Skills copied to <cwd>/.codex/skills/<skill>/
+ *   - <cwd>/AGENTS.md updated with skill index block
+ *
+ * User scope:
+ *   - Skills copied to ~/.codex/skills/<skill>/
+ *   - ~/.codex/instructions.md updated with skill index block
+ */
+export async function installCodex(skillsDir, scope) {
+  const isProject = scope === 'project';
+  const baseDir = isProject ? process.cwd() : os.homedir();
+  const targetDir = path.join(baseDir, '.codex', 'skills');
+
+  fs.mkdirSync(targetDir, { recursive: true });
+
+  const skills = fs.readdirSync(skillsDir, { withFileTypes: true })
+    .filter((e) => e.isDirectory())
+    .map((e) => e.name);
+
+  for (const skill of skills) {
+    const src = path.join(skillsDir, skill);
+    const dest = path.join(targetDir, skill);
+    copyDir(src, dest);
+  }
+
+  // Build the skill index block to inject into AGENTS.md / instructions.md
+  const skillLines = skills.map((skill) => {
+    const meta = readSkillMeta(path.join(skillsDir, skill));
+    const label = meta?.name || skill;
+    const desc = meta?.description
+      ? ` — ${meta.description.slice(0, 120)}${meta.description.length > 120 ? '…' : ''}`
+      : '';
+    return `- **${label}** (\`.codex/skills/${skill}/SKILL.md\`)${desc}`;
+  });
+
+  const relPath = isProject ? '.codex/skills' : '~/.codex/skills';
+  const block = [
+    AGENTS_HEADER,
+    '## RL Expert Skills',
+    '',
+    `The following reinforcement learning skills are installed in \`${relPath}\`.`,
+    'Reference a skill by name to load its instructions into your context.',
+    '',
+    ...skillLines,
+    AGENTS_FOOTER,
+  ].join('\n');
+
+  // Determine target instructions file
+  const instructionsFile = isProject
+    ? path.join(baseDir, 'AGENTS.md')
+    : path.join(baseDir, '.codex', 'instructions.md');
+
+  if (!isProject) {
+    fs.mkdirSync(path.join(baseDir, '.codex'), { recursive: true });
+  }
+
+  upsertBlock(instructionsFile, block);
+
+  const extra = `updated ${isProject ? 'AGENTS.md' : '~/.codex/instructions.md'}`;
+  process.stdout.write(`  OpenAI Codex installed ${skills.length} skills\n`);
+
+  return {
+    tool: 'OpenAI Codex',
+    count: skills.length,
+    path: targetDir,
+    extra,
+  };
+}
+
+/**
+ * Insert or replace the rl-expert-skills block in a markdown file.
+ */
+function upsertBlock(filePath, block) {
+  let existing = '';
+  if (fs.existsSync(filePath)) {
+    existing = fs.readFileSync(filePath, 'utf8');
+  }
+
+  const startIdx = existing.indexOf(AGENTS_HEADER);
+  const endIdx = existing.indexOf(AGENTS_FOOTER);
+
+  let updated;
+  if (startIdx !== -1 && endIdx !== -1) {
+    // Replace existing block
+    updated =
+      existing.slice(0, startIdx) +
+      block +
+      existing.slice(endIdx + AGENTS_FOOTER.length);
+  } else {
+    // Append block
+    updated = existing
+      ? existing.trimEnd() + '\n\n' + block + '\n'
+      : block + '\n';
+  }
+
+  fs.writeFileSync(filePath, updated, 'utf8');
+}
package/lib/copy.js
ADDED
@@ -0,0 +1,37 @@
+import fs from 'fs';
+import path from 'path';
+
+/**
+ * Recursively copy a directory from src to dest.
+ * Overwrites existing files.
+ */
+export function copyDir(src, dest) {
+  fs.mkdirSync(dest, { recursive: true });
+  for (const entry of fs.readdirSync(src, { withFileTypes: true })) {
+    const srcPath = path.join(src, entry.name);
+    const destPath = path.join(dest, entry.name);
+    if (entry.isDirectory()) {
+      copyDir(srcPath, destPath);
+    } else {
+      fs.copyFileSync(srcPath, destPath);
+    }
+  }
+}
+
+/**
+ * Read SKILL.md in a skill directory and return its frontmatter fields.
+ * Returns { name, description } or null on failure.
+ */
+export function readSkillMeta(skillDir) {
+  const skillMdPath = path.join(skillDir, 'SKILL.md');
+  if (!fs.existsSync(skillMdPath)) return null;
+  const content = fs.readFileSync(skillMdPath, 'utf8');
+  const match = content.match(/^---\n([\s\S]*?)\n---/);
+  if (!match) return null;
+  const front = match[1];
+  const name = (front.match(/^name:\s*(.+)$/m) || [])[1]?.trim();
+  // description may be single-line or quoted multi-line — grab everything after "description: "
+  const descMatch = front.match(/^description:\s*(['"]?)([\s\S]*?)(\1)\s*(?:\n\w|$)/m);
+  const description = descMatch ? descMatch[2].replace(/\n\s*/g, ' ').trim() : '';
+  return { name, description };
+}
package/package.json
ADDED
@@ -0,0 +1,33 @@
+{
+  "name": "rl-expert-skills",
+  "version": "1.0.0",
+  "description": "RL Expert skills for Claude Code and OpenAI Codex — 45 reinforcement learning skills covering architecture, algorithms, environments, evaluation, and MLOps.",
+  "type": "module",
+  "bin": {
+    "rl-expert-skills": "./bin/cli.js"
+  },
+  "files": [
+    "bin/",
+    "lib/",
+    "skills/"
+  ],
+  "scripts": {
+    "prepare": "node scripts/copy-skills.js"
+  },
+  "dependencies": {
+    "@inquirer/prompts": "^7.5.0"
+  },
+  "devDependencies": {},
+  "engines": {
+    "node": ">=18.0.0"
+  },
+  "keywords": [
+    "reinforcement-learning",
+    "rl",
+    "claude",
+    "codex",
+    "skills",
+    "ai"
+  ],
+  "license": "MIT"
+}
package/skills/analyzing-saliency-and-values/SKILL.md
ADDED
@@ -0,0 +1,83 @@
+---
+name: Analyze Saliency and Values
+description: This skill should be used when the user asks to "analyze saliency", "compute feature saliency", "run Grad-CAM", "apply integrated gradients", "extract attention weights", "visualize what the agent looks at", "inspect temporal attention maps", or "check if the memory architecture is useful". Do NOT hallucinate parameters outside the boundaries of Analyze Saliency and Values.
+version: 0.1.0
+---
+
+# Analyze Saliency and Values
+
+Isolate attention and feature saliency using Grad-CAM or Integrated Gradients to reveal which inputs drive the agent's decisions and whether the memory architecture is earning its compute.
+
+## Step 1 — Gradient Saliency Extraction
+
+Push a specific observation through the Actor network, then backpropagate the chosen continuous action magnitude to obtain gradients with respect to the input:
+
+```python
+obs_tensor = torch.tensor(obs, dtype=torch.float32, requires_grad=True)
+action = actor(obs_tensor)
+action[target_dim].backward()
+saliency = obs_tensor.grad.abs()
+```
+
+For image-based policies, apply Grad-CAM to the final convolutional feature map:
+
+```python
+# See scripts/grad_cam.py for the full implementation
+# activations/gradients: (B, K, H, W) tensors captured by forward/backward hooks
+alpha = gradients.mean(dim=(2, 3), keepdim=True)  # per-channel GAP of gradients
+cam = (alpha * activations).sum(dim=1, keepdim=True).relu()
+cam = F.interpolate(cam, size=obs.shape[-2:], mode='bilinear')
+```
+
+Overlay the resulting heatmap on the original frame to highlight which pixels or vector indices drove the decision most.
+
+For continuous vector observations, produce a bar chart of per-dimension gradient magnitudes ranked by importance.
+
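As an editorial aside, the vector-observation case above amounts to ranking dimensions by gradient magnitude. A minimal runnable sketch, where the toy linear actor and the `per_dim_saliency` helper name are illustrative rather than part of the package:

```python
import torch
import torch.nn as nn

def per_dim_saliency(actor: nn.Module, obs, target_dim: int = 0) -> torch.Tensor:
    """|d action[target_dim] / d obs_i| for each observation dimension i."""
    obs_tensor = torch.as_tensor(obs, dtype=torch.float32).clone().requires_grad_(True)
    action = actor(obs_tensor)
    action[target_dim].backward()
    return obs_tensor.grad.abs()

# Toy deterministic actor: action[0] = 3*obs[0] + 0*obs[1] + 1*obs[2]
actor = nn.Linear(3, 1, bias=False)
with torch.no_grad():
    actor.weight.copy_(torch.tensor([[3.0, 0.0, 1.0]]))

saliency = per_dim_saliency(actor, [1.0, 1.0, 1.0])
ranking = torch.argsort(saliency, descending=True)  # dims ordered by influence
```

For the toy actor the saliency recovers the absolute weights `[3.0, 0.0, 1.0]`, so the ranking places dimension 0 first and the inert dimension 1 last; the ranked magnitudes are exactly what the suggested bar chart would display.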
+## Step 2 — Temporal Attention Weights
+
+If the Algo Engineer implemented a Transformer or LSTM, extract the attention matrix:
+
+**Transformer attention:**
+```python
+# Hook into multi-head attention output weights
+attn_output, attn_weights = model.transformer.layers[i].self_attn(..., need_weights=True, average_attn_weights=False)
+# Shape: (batch, heads, seq_len, seq_len)
+```
+
+**LSTM hidden state analysis:**
+```python
+# Track hidden state norm over time — large norm = high reliance on that step
+hidden_norms = [h.norm().item() for h in hidden_sequence]
+```
+
+**Decision rule — evaluate temporal range:**
+- If the agent relies exclusively on the most recent frame (`attn_weights[:, :, -1, -1] ≈ 1.0`), the memory architecture is useless.
+- Inform the Architect: the LSTM/Transformer should be replaced with a simple MLP to save compute and reduce variance.
+- Document the finding in the saliency report with the attention matrix visualized as a heatmap.
+
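The decision rule above can be automated. A sketch, assuming attention weights of shape `(batch, heads, seq_len, seq_len)` and reusing the 0.9 recency threshold from the reference notes; `memory_is_useful` is our name, not a package API:

```python
import torch

def memory_is_useful(attn_weights: torch.Tensor, threshold: float = 0.9) -> bool:
    """attn_weights: (batch, heads, seq_len, seq_len), rows softmax-normalized.

    Returns False when the newest query position attends almost exclusively
    to the newest key position, i.e. the agent ignores its history.
    """
    w_avg = attn_weights.mean(dim=1)       # average over heads
    last_frame_weight = w_avg[:, -1, -1]   # newest step attending to itself
    return bool(last_frame_weight.mean() < threshold)

# Degenerate case: every query attends only to the most recent key
degenerate = torch.zeros(2, 4, 8, 8)
degenerate[..., -1] = 1.0
```

`memory_is_useful(degenerate)` returns `False` (flag the memory architecture for removal), while uniform attention over all eight steps returns `True`.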
+## Step 3 — Integrated Gradients (Baseline Comparison)
+
+Integrated Gradients attribute importance relative to a neutral baseline (e.g., zero observation):
+
+$$\text{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} d\alpha$$
+
+Approximate with 50 interpolation steps. See `scripts/integrated_gradients.py` for the full loop.
+
+Useful for checking whether the agent responds to physically meaningful features or spurious correlations.
+
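The 50-step approximation reads directly off the Riemann-sum formula in the reference notes. A self-contained sketch; the bundled `scripts/integrated_gradients.py` is the authoritative version and may differ in detail:

```python
import torch

def integrated_gradients(f, x: torch.Tensor, baseline: torch.Tensor = None,
                         steps: int = 50) -> torch.Tensor:
    """IG_i(x) ≈ (x_i - x'_i) * (1/m) * sum_k dF(x' + (k/m)(x - x'))/dx_i."""
    if baseline is None:
        baseline = torch.zeros_like(x)  # neutral zero-observation baseline
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).requires_grad_(True)
        f(point).backward()             # f must return a scalar
        total += point.grad
    return (x - baseline) * total / steps

# Sanity check on F(x) = sum(x^2): exact attributions are x_i^2,
# and by completeness they sum to F(x) - F(0)
x = torch.tensor([1.0, 2.0])
ig = integrated_gradients(lambda v: (v ** 2).sum(), x)
```

With the right-endpoint Riemann sum the attributions come out as `x_i^2 * (m+1)/m`, so `ig.sum()` lands within a few percent of `F(x) - F(0) = 5`, which is a quick way to verify an IG implementation against the completeness axiom.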
+## Step 4 — Report
+
+Produce a **Saliency Analysis Report** containing:
+1. Grad-CAM / saliency heatmaps for at least 5 representative states (goal-approach, obstacle-avoidance, failure state)
+2. Temporal attention matrix visualization (if memory architecture present)
+3. Verdict on memory architecture utility
+4. Top-5 most influential observation dimensions with physical interpretation
+
+## Additional Resources
+
+### Reference Files
+- **`references/saliency_math.md`** — Grad-CAM and Integrated Gradients derivations, formulas, and edge cases
+
+### Scripts
+- **`scripts/grad_cam.py`** — Full Grad-CAM implementation with forward/backward hooks
+- **`scripts/integrated_gradients.py`** — Integrated Gradients loop with interpolation
package/skills/analyzing-saliency-and-values/references/saliency_math.md
ADDED
@@ -0,0 +1,48 @@
+# Saliency Math Reference
+
+## Grad-CAM Derivation
+
+Grad-CAM (Gradient-weighted Class Activation Mapping) computes importance weights for each feature map channel $k$ by global-average-pooling the gradients of the target output $y^c$ with respect to the feature map activations $A^k$:
+
+$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$
+
+The class activation map is then:
+
+$$L_{\text{Grad-CAM}}^c = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$
+
+The ReLU is critical: only features with a positive influence on the target class are retained.
+
+**For RL actor networks:** Replace the class score $y^c$ with the action magnitude $\pi_\theta(s)[\text{dim}]$ or the Q-value $Q(s, a)$.
+
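The two equations above translate into a pair of hooks on the last convolutional layer. A minimal sketch under the stated formulas; the tiny policy head is illustrative, and the package's own `scripts/grad_cam.py` remains the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model: nn.Module, conv_layer: nn.Module,
             obs: torch.Tensor, target_index: int) -> torch.Tensor:
    """L^c = ReLU(sum_k alpha_k^c A^k), alpha_k^c = GAP of dy^c/dA^k."""
    acts, grads = [], []
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        model(obs)[:, target_index].sum().backward()
    finally:
        h1.remove()
        h2.remove()
    A, dY = acts[0], grads[0]                     # both (batch, K, H, W)
    alpha = dY.mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=obs.shape[-2:], mode='bilinear',
                         align_corners=False)

# Toy image policy: one conv block feeding a 2-way action head
model = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(4 * 8 * 8, 2))
cam = grad_cam(model, model[0], torch.randn(1, 1, 8, 8), target_index=0)
```

The returned map is non-negative by construction (the ReLU) and upsampled to the observation resolution, ready to overlay on the input frame.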
|
17
|
+
## Integrated Gradients Derivation

Integrated Gradients (IG) computes the contribution of each input feature $x_i$ relative to a baseline $x'$ (typically zeros or Gaussian noise):

$$\text{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} \, d\alpha$$

**Numerical approximation (Riemann sum, 50 steps):**

$$\text{IG}_i(x) \approx (x_i - x'_i) \times \sum_{k=1}^{m} \frac{\partial F\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_i} \times \frac{1}{m}$$

IG satisfies the **completeness axiom**: $\sum_i \text{IG}_i(x) = F(x) - F(x')$. This means attributions sum to the actual output difference — a correctness guarantee Grad-CAM lacks.
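The Riemann sum above can be sketched directly, assuming a callable that returns $\partial F / \partial x$ at a point (in practice this comes from the framework's autograd; here any gradient function works):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Riemann-sum approximation of IG along the straight-line path x' -> x.

    grad_fn(point) must return dF/dx evaluated at `point`.
    """
    diff = x - baseline
    total = np.zeros_like(x)
    for k in range(1, steps + 1):
        # gradient at the k/m interpolation point
        total += grad_fn(baseline + (k / steps) * diff)
    # (x_i - x'_i) times the averaged path gradient
    return diff * total / steps
```

For $F(x) = \sum_i x_i^2$ (gradient $2x$) and a zero baseline, the exact attributions are $x_i^2$ and their sum equals $F(x) - F(x')$, so the completeness axiom can be checked numerically up to discretization error.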
## Transformer Attention Extraction

Multi-head attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The attention weight matrix $W = \text{softmax}(QK^T / \sqrt{d_k})$ has shape `(batch, heads, seq_len, seq_len)`. Entry $W[b, h, i, j]$ represents how much position $i$ attends to position $j$.

**Temporal range analysis:**

- Average across heads: `W_avg = W.mean(dim=1)` — shape `(batch, seq_len, seq_len)`
- If `W_avg[:, -1, -1] > 0.9` consistently, the agent is almost exclusively attending to the most recent frame
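The recency check above can be sketched as a small NumPy helper (the function name and the batch-averaging are illustrative choices):

```python
import numpy as np

def recency_bias(attn):
    """Head-averaged weight the final query places on the most recent key.

    attn: attention weights of shape (batch, heads, seq_len, seq_len).
    Values near 1.0 mean the agent is effectively ignoring history.
    """
    w_avg = attn.mean(axis=1)           # average across heads -> (batch, S, S)
    return float(w_avg[:, -1, -1].mean())
```

A value consistently above 0.9 across evaluation batches is the signal, per the bullet above, that the memory architecture is not being used.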
**LSTM hidden state norm as attention proxy:**

An LSTM does not produce an explicit attention matrix, but the hidden state norm $\|h_t\|_2$ over time approximates the "importance" the network assigns to history. A flat norm curve suggests the LSTM gates are near saturation and the network is not selectively reading history.
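A minimal sketch of this proxy, assuming the per-step hidden states have already been collected from a rollout (the flatness threshold is an illustrative assumption, not a standard constant):

```python
import numpy as np

def hidden_norm_curve(hidden_states, flat_tol=0.05):
    """Per-step L2 norms of LSTM hidden states over a rollout.

    hidden_states: array of shape (T, hidden_dim).
    flat_tol is a heuristic: if the coefficient of variation of the norms
    falls below it, the curve is flagged as flat (assumed threshold).
    """
    norms = np.linalg.norm(hidden_states, axis=1)
    is_flat = norms.std() / (norms.mean() + 1e-8) < flat_tol
    return norms, bool(is_flat)
```

A `True` flag feeds directly into the "verdict on memory architecture utility" output item.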
## Edge Cases

- **Vanishing gradients in deep networks:** Saliency computed at the input layer through many ReLU activations may be near-zero even for influential features. Prefer Grad-CAM on intermediate layers or use SmoothGrad (average over noisy copies of the input).
- **Discrete action spaces:** Grad-CAM w.r.t. `argmax` is undefined (not differentiable). Use the pre-softmax logit of the chosen action instead.
- **Normalization sensitivity:** Always normalize the saliency map to `[0, 1]` before overlaying. Raw gradient magnitudes vary across observations by orders of magnitude.
@@ -0,0 +1,95 @@
---
name: Blueprint System Architecture
description: This skill should be used when the user asks to "blueprint the system architecture", "map the RL data flow", "design the RL pipeline", "define environment-agent interaction", "architect replay buffer placement", "design rollout and learner layers", or "draw the end-to-end RL system diagram". Do NOT hallucinate parameters outside the boundaries of Blueprint System Architecture.
version: 0.1.0
---

# Blueprint System Architecture

Map the end-to-end data flow between the Environment, the Core RL Loop, and Replay memory. This skill produces a structured architectural diagram and component breakdown for any Reinforcement Learning system, covering process topology, hardware placement, and data flow direction.

## Core Components to Define

Every RL system architecture consists of four layers that must be explicitly specified:

| Layer | Role | Key Questions |
|---|---|---|
| **Rollout Worker Layer** | Generates experience | Async or sync? SubprocVecEnv or single process? |
| **Experience Buffer Storage Layer** | Stores transitions | RAM-local or distributed (e.g. Ape-X)? |
| **Learner / Optimizer Layer** | Computes gradients | CPU→GPU batch transfer or GPU-native? |
| **Evaluation / Logging Hook Layer** | Tracks metrics | Frequency, checkpointing, W&B/TensorBoard? |
## Step 1 — Map Component Interactivity

Define the process topology first:

- **Single-process**: Environment runs in the same process as the agent. Simple, debuggable. Use for prototyping.
- **SubprocVecEnv**: Multiple environment instances run as separate OS processes. Use when environment computation is the bottleneck.
- **Distributed (Ape-X / IMPALA style)**: Rollout workers run on separate machines and push experience to a central Replay Buffer. The Learner pulls batches remotely. Use when scale exceeds one machine.

Specify the communication mechanism between components: shared memory, pipes, sockets, or gRPC.
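The key design point is that all three topologies expose the same batched `reset`/`step` interface, so the agent code is topology-agnostic. A minimal single-process sketch of that interface (illustrative stand-ins, not the stable-baselines3 classes):

```python
class ToyEnv:
    """Minimal stand-in environment: counts steps, terminates after 3."""
    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t >= 3, {}

class SingleProcessVecEnv:
    """Batched interface over N in-process envs. SubprocVecEnv keeps the
    same interface but runs each env in its own OS process."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        obs, rews, dones, infos = map(list, zip(*results))
        return obs, rews, dones, infos
```

Swapping this class for a subprocess-backed one changes only the constructor, which is why the topology decision can be deferred until the environment becomes the bottleneck.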
## Step 2 — Define Hardware Data Placement

Data placement critically impacts throughput. Apply the following rules:

**Standard CPU/GPU split (most systems):**
1. Environment `step()` executes on CPU.
2. Experience `(s, a, r, s', done)` is stored in CPU RAM as NumPy arrays.
3. Sampled mini-batches are transferred to GPU via `.to(device)` only during gradient computation.
4. Policy inference during rollout: run on CPU (single sample, low batch size) or GPU (vectorized envs).

**GPU-native physics simulation (Isaac Gym / Brax):**
- Environment state tensors live permanently in GPU memory.
- Policy inference and gradient computation both occur on GPU.
- **Critical rule:** Never call `.cpu()` or `.numpy()` on these tensors inside the rollout loop — this triggers a GPU→CPU bus transfer that destroys throughput. Keep tensors on-device until logging.
## Step 3 — Specify Replay Buffer Architecture

| System Type | Buffer Location | Buffer Class |
|---|---|---|
| On-Policy (PPO, A2C) | In-process RAM | `RolloutBuffer` (cleared after each update) |
| Off-Policy, single machine (SAC, DQN) | In-process RAM | `ReplayBuffer` (circular, persists across updates) |
| Distributed Off-Policy (Ape-X) | Centralized Redis / Ray store | `PrioritizedReplayBuffer` (remote) |
## Output Format

Generate an ASCII diagram specifying all four layers, then follow with a structured breakdown:

```
┌─────────────────────────────────────────────┐
│ ROLLOUT WORKER LAYER │
│ [Env 0] [Env 1] ... [Env N] (CPU/GPU) │
│ Generates: (s, a, r, s', done) tuples │
└────────────────────┬────────────────────────┘
                     │ push experience
┌────────────────────▼────────────────────────┐
│ EXPERIENCE BUFFER STORAGE LAYER │
│ ReplayBuffer / RolloutBuffer (RAM / Remote) │
│ Capacity: N transitions │
└────────────────────┬────────────────────────┘
                     │ sample mini-batch → GPU
┌────────────────────▼────────────────────────┐
│ LEARNER / OPTIMIZER LAYER │
│ Actor + Critic forward pass (GPU) │
│ Loss computation + backprop (GPU) │
│ Parameter sync back to workers │
└────────────────────┬────────────────────────┘
                     │ metrics
┌────────────────────▼────────────────────────┐
│ EVALUATION / LOGGING HOOK LAYER │
│ Episode return, entropy, loss, FPS │
│ Checkpoint saves, W&B / TensorBoard │
└─────────────────────────────────────────────┘
```
## Additional Resources

### Reference Files

- **`references/distributed-patterns.md`** — Ape-X, IMPALA, and R2D2 architectural patterns
- **`references/hardware-placement-rules.md`** — GPU-native simulation rules (Isaac Gym / Brax)

### Scripts

- **`scripts/validate_architecture.py`** — Validates that a proposed architecture specification dict contains all required layers and fields
@@ -0,0 +1,64 @@
# Distributed RL Architecture Patterns

## Ape-X (Distributed Prioritized Experience Replay)

```
Architecture:
N Actor workers (CPU) — each runs a copy of the policy
1 Central Replay Buffer (Redis / Ray store) — receives prioritized transitions
1 Learner (GPU) — pulls mini-batches, computes gradients, broadcasts weights back

Data flow:
Actor → (s, a, r, s', done, priority) → Replay Buffer
Learner → sample(batch) → Replay Buffer
Learner → updated weights → Actors (async broadcast)
```

Key properties:
- Actors run old policy parameters (lag of 1–100 gradient steps)
- Prioritization is done at the actor level using local TD-error estimates
- Learner runs at maximum throughput, decoupled from rollout speed
- Scale: 360 actors in the original DeepMind paper; scales linearly
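The actor-level prioritization amounts to a one-step TD-error each actor computes locally before pushing a transition. A minimal sketch (the function and the epsilon constant are illustrative assumptions):

```python
import numpy as np

def td_error_priority(q_sa, reward, q_next_max, gamma=0.99, eps=1e-3):
    """Actor-side priority from the local one-step TD-error estimate."""
    td = reward + gamma * q_next_max - q_sa
    return np.abs(td) + eps   # eps keeps zero-error transitions sampleable
```

Because this uses the actor's stale copy of the network, the learner typically recomputes priorities after training on a batch; the actor-side value only needs to be good enough for initial insertion.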
## IMPALA (Importance Weighted Actor-Learner Architecture)

```
Architecture:
N Actor workers — generate trajectories, send to queue
1 Learner — pulls trajectory chunks from queue, applies V-trace correction

V-trace correction:
Corrects for off-policy data via importance sampling ratio clipping:
ρ_t = min(ρ̄, π_target(a_t|s_t) / π_behavior(a_t|s_t))
```

Key properties:
- Designed for CPU-only actors with a GPU learner
- V-trace makes it robust to large policy lag (stable despite stale actors)
- Policy-gradient-based (not Q-learning), so compatible with continuous actions
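The clipped ratio ρ_t above translates directly to code; computing it from log-probabilities avoids under/overflow, and ρ̄ = 1.0 is used here as a common default (an assumption, not prescribed by this document):

```python
import numpy as np

def clipped_rho(log_pi_target, log_pi_behavior, rho_bar=1.0):
    """V-trace importance ratio: rho_t = min(rho_bar, pi_target / pi_behavior).

    Computed from log-probabilities for numerical stability.
    """
    return np.minimum(rho_bar, np.exp(log_pi_target - log_pi_behavior))
```

When the target policy assigns more probability to the taken action than the behavior policy did, the ratio is clipped, which bounds the variance contributed by stale actors.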
## R2D2 (Recurrent Replay Distributed DQN)

```
Architecture:
Like Ape-X but:
- Actor/Learner networks use an LSTM for POMDPs
- Replay Buffer stores sequences (not individual transitions)
- Stores the hidden state h_t alongside (s, a, r, done) sequences

Burn-in: the first K steps of each stored sequence are used to warm up the LSTM state
```

Use R2D2 when the environment requires memory (POMDP) and distributed scale is needed.
## Single-Machine Multi-GPU

For smaller scale (8–32 environments, 1–4 GPUs):

```
Process layout:
Main process: Learner on GPU 0
Subprocess pool: N env workers on CPU
Optional: separate GPU for inference during rollout (if batch size > 256)
```

Recommended tools: `torch.multiprocessing`, `ray.remote`, or `stable-baselines3` VecEnv.