academic-army 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (68)
  1. package/.editorconfig +9 -0
  2. package/.github/workflows/publish.yml +44 -0
  3. package/.prettierrc.json +3 -0
  4. package/LICENSE +21 -0
  5. package/README.md +172 -0
  6. package/README.zh-CN.md +172 -0
  7. package/agent-forge.yaml +83 -0
  8. package/eslint.config.js +28 -0
  9. package/install_mcp.py +85 -0
  10. package/mcp-server/__main__.py +33 -0
  11. package/mcp-server/deepresearch/__init__.py +3 -0
  12. package/mcp-server/deepresearch/tools.py +33 -0
  13. package/mcp-server/requirements.txt +4 -0
  14. package/metaskills/README.md +131 -0
  15. package/metaskills/README.zh-CN.md +131 -0
  16. package/metaskills/academic-army-architect/METASKILL.md +91 -0
  17. package/metaskills/academic-army-architect/envolve.sh +9 -0
  18. package/metaskills/academic-army-coding-plan/ENVOLVETASK.md +1 -0
  19. package/metaskills/academic-army-coding-plan/METASKILL.md +118 -0
  20. package/metaskills/academic-army-coding-plan/envolve.sh +9 -0
  21. package/metaskills/academic-army-coding-style/METASKILL.md +292 -0
  22. package/metaskills/academic-army-experiment-plan/ENVOLVETASK.md +1 -0
  23. package/metaskills/academic-army-experiment-plan/METASKILL.md +82 -0
  24. package/metaskills/academic-army-experiment-plan/envolve.sh +9 -0
  25. package/metaskills/academic-army-repo-scaffold/ENVOLVETASK.md +1 -0
  26. package/metaskills/academic-army-repo-scaffold/METASKILL.md +223 -0
  27. package/metaskills/academic-army-repo-scaffold/envolve.sh +9 -0
  28. package/package.json +35 -0
  29. package/runs/develop-skill.sh +17 -0
  30. package/runs/develop.sh +16 -0
  31. package/skills/academic-army-architect/SKILL.md +336 -0
  32. package/skills/academic-army-architect/agents/openai.yaml +11 -0
  33. package/skills/academic-army-architect/references/blueprint-schema.md +345 -0
  34. package/skills/academic-army-coding-plan/SKILL.md +491 -0
  35. package/skills/academic-army-coding-plan/agents/openai.yaml +11 -0
  36. package/skills/academic-army-coding-style/SKILL.md +915 -0
  37. package/skills/academic-army-coding-style/agents/openai.yaml +11 -0
  38. package/skills/academic-army-experiment-plan/SKILL.md +517 -0
  39. package/skills/academic-army-experiment-plan/agents/openai.yaml +11 -0
  40. package/skills/academic-army-repo-scaffold/SKILL.md +756 -0
  41. package/skills/academic-army-repo-scaffold/agents/openai.yaml +10 -0
  42. package/src/README.md +79 -0
  43. package/src/README.zh-CN.md +79 -0
  44. package/src/cli.ts +55 -0
  45. package/src/developing/README.md +146 -0
  46. package/src/developing/README.zh-CN.md +146 -0
  47. package/src/developing/agents/developer.ts +40 -0
  48. package/src/developing/agents/factory.ts +11 -0
  49. package/src/developing/agents/index.ts +8 -0
  50. package/src/developing/agents/manager.ts +74 -0
  51. package/src/developing/agents/prompts.ts +12 -0
  52. package/src/developing/agents/reviewer.ts +44 -0
  53. package/src/developing/agents/trajectory-optimizer.ts +70 -0
  54. package/src/developing/agents/types.ts +41 -0
  55. package/src/developing/index.ts +2 -0
  56. package/src/developing/pipeline.ts +306 -0
  57. package/src/developing/pipelineskill.ts +169 -0
  58. package/src/evolve-skill/README.md +116 -0
  59. package/src/evolve-skill/README.zh-CN.md +116 -0
  60. package/src/evolve-skill/agents/evaluator.ts +28 -0
  61. package/src/evolve-skill/agents/factory.ts +11 -0
  62. package/src/evolve-skill/agents/index.ts +4 -0
  63. package/src/evolve-skill/agents/modifier.ts +27 -0
  64. package/src/evolve-skill/agents/runner.ts +19 -0
  65. package/src/evolve-skill/index.ts +1 -0
  66. package/src/evolve-skill/pipeline.ts +140 -0
  67. package/src/pipeline.ts +65 -0
  68. package/tsconfig.json +22 -0
@@ -0,0 +1,491 @@
1
+ ---
2
+ name: academic-army-coding-plan
3
+ description: >-
4
+ Create an English coding_plan.md and a Chinese coding_plan.explain.md from a
5
+ paper blueprint and experiment plan. Use when Codex needs to translate paper
6
+ goals, candidate methods, baselines, datasets, metrics, harnesses, tests, and
7
+ result requirements into a detailed logical coding plan for downstream
8
+ implementation, without writing code or deciding physical file layout.
9
+ ---
10
+
11
+ # Academic Army Coding Plan
12
+
13
+ ## Purpose
14
+
15
+ Produce exactly two Markdown files in the requested output directory:
16
+
17
+ - `coding_plan.md`: English, AI-facing, and only the coding plan.
18
+ - `coding_plan.explain.md`: Chinese, human-facing, and only the explanation and decision rationale for the coding plan.
19
+
20
+ This skill writes planning artifacts only. Code implementation, physical file placement, plotting, paper prose, and final figure/table formatting belong to later skills.
21
+
22
+ The coding plan is a downstream implementation contract at the logical level. It describes components, interfaces, entrypoint semantics, execution stages, harnesses, tests, and result artifact schemas. It does not prescribe where the downstream coding skill must create files, unless the user-provided inputs already name an existing path that must be cited as a fact.
23
+
24
+ ## Review Feedback Intake
25
+
26
+ Classify feedback before changing or regenerating artifacts:
27
+
28
+ - `Artifact-content feedback`: the reviewer quotes file contents or names a concrete defect in `coding_plan.md` or `coding_plan.explain.md`. Convert the defect into stronger generation, readability, validation, or self-audit rules.
29
+ - `Artifact-access feedback`: the reviewer could not inspect files because local commands, the Node REPL, MCP resources, mounted paths, connectors, local browser or `file://` access, or sandbox startup failed. Process-spawn errors such as `windows sandbox: spawn setup refresh` are access feedback.
30
+
31
+ Access feedback changes delivery behavior, not the planning schema. Preserve the plan requirements unless concrete artifact contents show a real content problem. When access feedback occurs, treat handoff mode as sticky for the next artifact generation in the thread: the pasted read-back contents are the primary review artifact, and path-only delivery is incomplete.
32
+
33
+ When the only feedback is access feedback, do not invent content critiques, redundancy fixes, or prompt changes about plan substance. Tighten delivery, read-back, and self-contained review rules instead. The reviewer can only evaluate language, boundaries, redundancy, harness/test separation, and artifact schemas after the generated files are pasted.
34
+
35
+ ## Artifact Delivery
36
+
37
+ Write both files to the requested output directory and read them back before responding.
38
+
39
+ For `output/evolve-*` outputs, or whenever artifact-access feedback is active or sticky in the thread, the final response must be directly reviewable without local filesystem access:
40
+
41
+ 1. Start with one concise validation sentence.
42
+ 2. Add `Review Handoff` immediately after that sentence.
43
+ 3. Paste the complete read-back contents of both files under exact relative path headings, including the output directory.
44
+ 4. Put optional status notes only after the handoff.
45
+
46
+ Use five-backtick fences for full-file handoffs so embedded command fences remain readable:
47
+
48
+ ````markdown
49
+ ## output/evolve-.../coding_plan.md
50
+
51
+ `````markdown
52
+ <full coding_plan.md content>
53
+ `````
54
+
55
+ ## output/evolve-.../coding_plan.explain.md
56
+
57
+ `````markdown
58
+ <full coding_plan.explain.md content>
59
+ `````
60
+ ````
61
+
62
+ When files are long, read each file with a complete read method or bounded chunks before composing the handoff. Paste the read-back contents, not a regenerated approximation. If read-back fails after writing, try another local read mechanism. If read-back remains impossible, report the read-back failure clearly and mark delivery blocked rather than presenting unverified contents.
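The write, read-back, and handoff steps above can be sketched roughly as follows. This is a minimal illustration, not part of the package: `fence_for`, `write_and_read_back`, and `review_handoff` are hypothetical helper names, and the fence rule assumed here is "at least five backticks, and longer than any backtick run in the file".

```python
import re
import tempfile
from pathlib import Path
from typing import Dict


def fence_for(text: str, minimum: int = 5) -> str:
    """Return a backtick fence longer than any backtick run inside text."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    return "`" * max(minimum, longest + 1)


def write_and_read_back(path: Path, content: str) -> str:
    """Write content, then return what is actually stored on disk."""
    path.write_text(content, encoding="utf-8")
    read_back = path.read_text(encoding="utf-8")
    if read_back != content:
        # Assumption: a mismatch is treated as a blocked delivery.
        raise RuntimeError(f"read-back mismatch for {path}; delivery is blocked")
    return read_back


def review_handoff(files: Dict[str, str]) -> str:
    """Format read-back contents under relative path headings."""
    sections = ["Review Handoff"]
    for relative_path, body in files.items():
        fence = fence_for(body)
        sections.append(f"## {relative_path}\n\n{fence}markdown\n{body}\n{fence}")
    return "\n\n".join(sections)


# Demo in a temporary directory; the heading keeps the elided output path as written above.
with tempfile.TemporaryDirectory() as directory:
    plan = write_and_read_back(Path(directory) / "coding_plan.md", "# Plan\n")
    handoff = review_handoff({"output/evolve-.../coding_plan.md": plan})
```

The handoff is built from the string returned by the read, so the pasted text is what was stored rather than a regenerated approximation.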
63
+
64
+ When a reviewer reports repeated sandbox, PowerShell, Node REPL, connector, mounted-path, browser-open, or `file://` failures, keep the final response concise before `Review Handoff`; do not ask the reviewer to retry local access as the main remedy after producing the artifacts.
65
+
66
+ If the generated artifact is too long for a comfortable final response, still prefer the complete read-back handoff for `output/evolve-*` or active access-feedback tasks. If a platform limit prevents pasting both files, paste as much as possible, clearly mark the truncation point, and state that review is blocked until the remaining read-back content can be provided.
67
+
68
+ ## Output Style
69
+
70
+ Use natural, readable Markdown. Organize both files with clear semantic headings, short paragraphs, bullets, and compact tables when they clarify parallel entities.
71
+
72
+ Use numbered lists only for real sequence, such as implementation order, experiment stages, priority, or step-by-step entrypoint semantics. Do not use global abstract ID systems such as `C1`, `B2`, `H3`, or `T4`. If an existing repository already uses short registry keys, preserve them only as aliases beside semantic names and use the semantic names for headings and cross-references.
73
+
74
+ Prefer names that stand alone:
75
+
76
+ - `Candidate Method Selection Harness`
77
+ - `Reference Lifecycle Deadline Harness`
78
+ - `Full-System Deadline-Hit QoE Harness`
79
+ - `Data Loading Tests`
80
+ - `Metric Computation Tests`
81
+ - `Result Export Tests`
82
+ - `CLI Smoke Tests`
83
+ - `Result Export Layer`
84
+ - `Method Adapter Layer`
85
+
86
+ Use positive, task-facing language. State what the coding plan includes, which logical component owns each concern, what each harness evaluates, and how artifacts flow to later skills. Keep runtime, sandbox, tool-call, fallback-path, and local execution troubleshooting details out of both generated artifacts.
87
+
88
+ ### Chinese Explanation Style
89
+
90
+ `coding_plan.explain.md` must be Chinese-first natural prose. Preserve English method names, repository names, dataset names, benchmark names, metric names, command semantics, and code identifiers when exact spelling matters. Preserve existing paths only when the user input or project context explicitly gave them and the explanation needs that fact.
91
+
92
+ When explaining a design choice, first summarize the corresponding plan content, then explain why it supports the paper blueprint and experiment plan. Use natural references such as “method替换模块”, “candidate筛选harness”, “result export layer”, and “CLI smoke tests”, not abstract IDs or “见H2”.
93
+
94
+ ## Workflow
95
+
96
+ ### Gather Minimal Local Context
97
+
98
+ Locate the two required local inputs from explicit user paths, conventional names, or the closest semantic match:
99
+
100
+ - paper blueprint
101
+ - experiment plan
102
+
103
+ After locating them, read only those two files. Do not inspect old plans, logs, README files, source trees, notebooks, package metadata, previous outputs, or nearby drafts merely because they are nearby.
104
+
105
+ Read nearby local files only when the blueprint or experiment plan explicitly references them as required implementation context. If a required local dependency is missing, record it as an open coding question. Supply general engineering patterns through DeepResearch, not unrelated local files.
106
+
107
+ Before drafting, perform an input-hygiene check: the planned artifacts should depend only on the blueprint, experiment plan, user-provided task constraints, and necessary DeepResearch evidence. Remove unrelated local context if it slipped in.
108
+
109
+ ### Run Pre-Planning DeepResearch
110
+
111
+ Before drafting `coding_plan.md`, run `academic_army_mcp_tools.deepresearch` unless the provided context already contains a fresh lookup artifact covering the current paper domain, method family, experiment style, and repository-design questions.
112
+
113
+ Use DeepResearch to inspect high-quality related codebases, official benchmark artifacts, evaluation harnesses, experiment frameworks, paper artifacts, configuration systems, and result-logging conventions relevant to the current domain. Let the lookup choose sources; do not hardcode a fixed source list into the skill output.
114
+
115
+ Prompt shape:
116
+
117
+ ```text
118
+ You are supporting a coding-plan generator for a research paper.
119
+
120
+ Research brief:
121
+ [paper goal, system, candidate methods, baselines, datasets, metrics,
122
+ experiment-plan requirements, and any explicit local context]
123
+
124
+ Return concise implementation-planning evidence:
125
+
126
+ - Highly engineered related repositories or official artifacts and how they structure logical modules, configs, registries, evaluation harnesses, tests, and result exports.
127
+ - Canonical implementation shape for the candidate methods and baselines.
128
+ - Current benchmark or dataset protocol details that affect loaders, evaluators, metrics, or comparators.
129
+ - Harness implications from traditional test harnesses and modern evaluation harnesses: controlled inputs, drivers, fixtures, evaluator separation, metrics, raw result records, smoke/full protocols, frozen variables, and decision rules.
130
+ - Raw result fields needed for later tables, figures, and paper claims.
131
+ - Source table with title, link, date, version, or commit when visible; role; whether the takeaway is a confirmed source fact or inferred design pattern; and the planning decision it affects.
132
+ ```
133
+
134
+ Put planning consequences in `coding_plan.md`. Put lookup topic, sources, dates or versions when visible, takeaways, evidence type, affected design choices, confidence, and remaining uncertainty in `coding_plan.explain.md`.
135
+
136
+ ## Planning Object: Logical Design Over File Layout
137
+
138
+ Plan logical components, not physical files. The coding plan may name:
139
+
140
+ - logical modules
141
+ - components
142
+ - interfaces
143
+ - adapter layers
144
+ - registries as concepts
145
+ - entrypoint semantics
146
+ - configuration concepts
147
+ - artifact types and schemas
148
+ - test capabilities
149
+
150
+ The coding plan must not invent concrete file paths, directory trees, or filenames for implementation code, configs, scripts, tests, or outputs. Existing input paths explicitly provided by the user may be cited as input facts. The requested output directory and required artifact filenames may be used for delivery.
151
+
152
+ Describe code organization with phrases such as:
153
+
154
+ - “Implement a Method Adapter Layer that exposes a common scheduling interface.”
155
+ - “Provide a harness runner entrypoint that accepts harness name, method, dataset split, seed, and configuration identifier.”
156
+ - “Emit raw lifecycle records with object ID, event type, timestamp, deadline, useful flag, and drop reason.”
157
+
158
+ Avoid physical-layout statements such as “put this class in `src/...`”, “create `tests/...`”, or “write metrics to `output/...`” unless the user-provided inputs already require those paths.
159
+
160
+ ## Draft `coding_plan.md`
161
+
162
+ Write `coding_plan.md` as an engineering contract for the downstream coding skill. Include the sections that apply to the project:
163
+
164
+ - scope and planning assumptions
165
+ - inputs read and input-hygiene summary
166
+ - execution assumptions and reusable entrypoint semantics
167
+ - logical architecture overview
168
+ - core domain model and shared interfaces
169
+ - semantic logical modules and ownership boundaries
170
+ - replaceable candidate method and baseline interfaces
171
+ - workload, dataset, trace, and configuration concepts
172
+ - metric definitions and executable decision rules
173
+ - staged experiment pipeline
174
+ - harness structure for paper goals
175
+ - testing structure for functional correctness
176
+ - method selection and freeze protocol when needed
177
+ - run matrix or staged comparison matrix
178
+ - raw-first result export contract
179
+ - derivation path from raw artifacts to paper tables, figures, and claims
180
+ - implementation order for the downstream coding skill
181
+ - acceptance criteria
182
+ - assumptions and open coding questions
183
+
184
+ Keep the plan specific enough to implement: define interfaces, inputs, outputs, dependencies, artifact schemas, and entrypoint parameters. Leave file placement and directory layout to the downstream coding skill.
185
+
186
+ ## Draft `coding_plan.explain.md`
187
+
188
+ Write `coding_plan.explain.md` as a Chinese explanation of the coding plan and its decision rationale. It should be understandable without repeatedly checking `coding_plan.md`.
189
+
190
+ Explain:
191
+
192
+ - which user-provided inputs were read and what requirements were extracted
193
+ - what DeepResearch found and how it affected the design
194
+ - why the logical modules are separated this way
195
+ - why candidate methods and baselines use replaceable interfaces
196
+ - why the staged experiment flow matches the experiment plan
197
+ - why each harness exists and what paper claim, method-selection question, or optimization question it supports
198
+ - why testing is separate from harness execution
199
+ - why raw-first exports support later plotting, tables, and writing
200
+ - why physical file layout is left to the downstream coding skill
201
+ - which assumptions remain and what they block
202
+ - how a downstream coding skill should use the plan
203
+
204
+ Recommended shape:
205
+
206
+ ```markdown
207
+ # 编码计划说明:<Paper/System Name>
208
+
209
+ ## 已读取输入与需求提取
210
+ ## 预规划研究(DeepResearch)
211
+ ## 主要逻辑模块设计
212
+ ## 方法与基线替换结构
213
+ ## 实验阶段设计理由
214
+ ## Harness Structure 设计理由
215
+ ## Testing Structure 设计理由
216
+ ## 原始结果导出理由
217
+ ## 文件布局边界说明
218
+ ## 假设与不确定性
219
+ ## 下游 Coding Skill 使用方式
220
+ ```
221
+
222
+ This outline is a guide. Add, merge, or rename sections when semantic headings would be clearer.
223
+
224
+ ## Core Domain Model And Shared Interfaces
225
+
226
+ When the system has interacting loaders, replay, controllers, methods, baselines, evaluators, harnesses, and exporters, include a shared-domain-model section before module details.
227
+
228
+ For each shared type, specify:
229
+
230
+ - type name
231
+ - owning logical component
232
+ - purpose
233
+ - key fields
234
+ - producers
235
+ - consumers
236
+ - raw export mapping when applicable
237
+
238
+ Use shared domain types to keep schemas consistent across loaders, methods, evaluators, harnesses, and export writers.
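As one possible illustration of a shared domain type, the sketch below models the raw lifecycle record mentioned earlier (object ID, event type, timestamp, deadline, useful flag, drop reason). The class name `LifecycleRecord`, the field types, and the `to_jsonl` method are assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class LifecycleRecord:
    """Hypothetical shared type passed between replay, evaluators, and exporters."""

    object_id: str
    event_type: str
    timestamp: float  # assumed unit: seconds
    deadline: float   # assumed unit: seconds
    useful: bool
    drop_reason: Optional[str] = None

    def to_jsonl(self) -> str:
        """Raw export mapping: one JSON object per line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


record = LifecycleRecord("obj-1", "delivered", 1.5, 2.0, True)
row = json.loads(record.to_jsonl())
```

Because producers and consumers share one frozen type, the raw export has the same fields no matter which component emitted the record.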
239
+
240
+ ## Logical Modules
241
+
242
+ Map the system into logical modules. For each module, specify:
243
+
244
+ - semantic module name
245
+ - responsibility
246
+ - inputs
247
+ - outputs
248
+ - dependencies on other logical modules
249
+ - implementation requirements for the downstream coding skill
250
+ - relevant interfaces or artifact schemas
251
+
252
+ Typical module families include:
253
+
254
+ - data preparation and workload manifest module
255
+ - substrate or external-system adapter module
256
+ - method interface and candidate method adapter module
257
+ - baseline adapter module
258
+ - replay or execution environment module
259
+ - metric computation module
260
+ - harness execution module
261
+ - testing support module
262
+ - result export layer
263
+ - paper-output derivation interface
264
+
265
+ Use the project’s paper and experiment requirements to choose the actual modules.
266
+
267
+ ## Methods And Baselines As Replaceable Components
268
+
269
+ Map every candidate method, modified variant, baseline, ablation, and oracle to a replaceable logical boundary.
270
+
271
+ For each method or baseline, specify:
272
+
273
+ - semantic method name
274
+ - role, such as proposed candidate, candidate route, headline baseline, diagnostic baseline, ablation, support estimator, or oracle
275
+ - shared interface it implements
276
+ - configuration concepts it needs
277
+ - observations it may access
278
+ - actions it may select
279
+ - raw outputs needed for comparison
280
+ - harnesses or experiment stages that use it
281
+
282
+ When two baselines overlap, explain the behavioral difference and why both are included. Candidate selection harnesses should be able to compare naive methods, modified methods, baseline methods, and oracles under the same input protocol and metrics.
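A replaceable method boundary could look something like the following sketch. The `Method` protocol, the `register` decorator, and the two toy baselines are hypothetical names invented for illustration; a real project would take the interface from its own blueprint.

```python
from typing import Callable, Dict, Protocol, Sequence


class Method(Protocol):
    """Assumed shared interface: map an observation to an action index."""

    name: str
    role: str

    def select_action(self, observation: Sequence[float]) -> int: ...


METHOD_REGISTRY: Dict[str, Callable[[], Method]] = {}


def register(factory):
    """Register a method class under its semantic name."""
    METHOD_REGISTRY[factory.name] = factory
    return factory


@register
class GreedyBaseline:
    name = "greedy-baseline"
    role = "headline baseline"

    def select_action(self, observation: Sequence[float]) -> int:
        # Pick the index with the largest observed value.
        return max(range(len(observation)), key=observation.__getitem__)


@register
class FirstOptionBaseline:
    name = "first-option-baseline"
    role = "diagnostic baseline"

    def select_action(self, observation: Sequence[float]) -> int:
        # Naive method: always choose the first option.
        return 0


observation = [0.1, 0.9, 0.3]
choices = {name: make().select_action(observation) for name, make in METHOD_REGISTRY.items()}
```

Every entry is called with the same observation, so a candidate-selection harness can compare methods under one input protocol.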
283
+
284
+ ## Metrics And Decision Rules
285
+
286
+ For every metric used by a harness, method-selection rule, acceptance criterion, or paper-output derivation, define:
287
+
288
+ - metric name
289
+ - definition
290
+ - unit
291
+ - direction: `higher_is_better` or `lower_is_better`
292
+ - computation procedure or formula
293
+ - numerator and denominator for ratio metrics
294
+ - required raw fields
295
+ - upstream metric dependencies when any
296
+ - derived outputs
297
+ - aggregation rule
298
+ - missing-data behavior
299
+ - harnesses and paper outputs that use it
300
+
301
+ Decision rules should be executable. If a threshold is unknown, record a high-blocking open question that states which harness can compute metrics but cannot automatically select or promote a method yet.
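An executable decision rule with an explicit direction and missing-data behavior might be sketched as below. `MetricDefinition`, `ratio`, and `promotes` are hypothetical names, and the choice to return `None` when the threshold is unknown is an assumption that mirrors the open-question rule above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    unit: str
    direction: str  # "higher_is_better" or "lower_is_better"


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Ratio metric with explicit missing-data behavior for an empty denominator."""
    return None if denominator == 0 else numerator / denominator


def promotes(metric: MetricDefinition, candidate: Optional[float],
             baseline: Optional[float], min_improvement: Optional[float]) -> Optional[bool]:
    """Return True/False when the rule can be evaluated, None when it cannot."""
    if min_improvement is None or candidate is None or baseline is None:
        # The metric may be computable, but no automatic promotion is possible yet.
        return None
    delta = candidate - baseline
    if metric.direction == "lower_is_better":
        delta = -delta
    return delta >= min_improvement


hit_rate = MetricDefinition("deadline_hit_rate", "fraction", "higher_is_better")
latency = MetricDefinition("latency", "ms", "lower_is_better")
```

For example, `promotes(hit_rate, 0.80, 0.75, None)` returns `None`, which corresponds to a harness that computes the metric but cannot yet select a method.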
302
+
303
+ ## Harness Structure
304
+
305
+ Create a dedicated `Harness Structure` section. A harness is a controlled experiment execution environment for paper goals, method selection, module optimization, ablation, stress, robustness, scalability, latency, quality, cost, or other metrics named by the blueprint and experiment plan.
306
+
307
+ Each harness should have a semantic name and a clear research purpose. For each harness, specify:
308
+
309
+ - purpose and associated paper claim, experiment question, method-selection question, or optimization question
310
+ - role, such as development, candidate selection, final validation, diagnostic analysis, regression, or claim calibration
311
+ - target logical module or replaceable method area
312
+ - allowed modification scope
313
+ - stable interfaces and frozen variables
314
+ - entrypoint semantics and parameter meanings
315
+ - input dataset, workload, trace, split, seed, and configuration protocol
316
+ - methods, modified methods, naive methods, baselines, ablations, and oracles compared
317
+ - metrics and decision rule
318
+ - raw result artifact types and minimum fields
319
+ - derived metric artifact types
320
+ - comparison procedure
321
+ - smoke, pilot, and full modes when useful
322
+ - relationship to other harnesses
323
+ - failure modes that should be visible in artifacts
324
+
325
+ Harnesses should support the development loop:
326
+
327
+ ```text
328
+ modify logical module -> run harness -> inspect parseable results -> refine module
329
+ ```
330
+
331
+ Harness outputs should include the least processed records needed to audit the run: per-example predictions or decisions, raw scores, timing traces, resource usage, intermediate decisions, error cases, method/config identifiers, dataset, split, seed, run ID, timestamp, source metadata, metric values, and raw artifact references.
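The idea of running several methods under one controlled protocol and keeping per-example raw records can be sketched as follows. The function `run_harness`, its parameters, and the record keys are illustrative assumptions; the toy methods are plain callables and the seed is only recorded here, not consumed.

```python
from typing import Callable, Dict, List, Sequence


def run_harness(
    harness_name: str,
    methods: Dict[str, Callable[[float], int]],
    inputs: Sequence[float],
    split: str,
    seed: int,
    config_id: str,
) -> List[dict]:
    """Run every method on the same inputs and return per-example raw records."""
    records = []
    for method_name, decide in methods.items():
        run_id = f"{harness_name}:{method_name}:{split}:{seed}"
        for index, value in enumerate(inputs):
            records.append({
                "run_id": run_id,
                "harness": harness_name,
                "method": method_name,
                "split": split,
                "seed": seed,
                "config_id": config_id,
                "example_index": index,
                "input": value,
                "decision": decide(value),
            })
    return records


records = run_harness(
    "Candidate Method Selection Harness",
    {"threshold": lambda x: int(x > 0.5), "always-zero": lambda x: 0},
    inputs=[0.2, 0.9],
    split="dev",
    seed=7,
    config_id="cfg-a",
)
```

Each record carries its method, split, seed, and configuration identifier, so later metric computation can group and audit runs without rerunning them.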
332
+
333
+ ## Testing Structure
334
+
335
+ Create a dedicated `Testing Structure` section separate from harnesses. Testing answers whether code behaves according to its interfaces. Harnesses answer whether a method or module change helps paper metrics.
336
+
337
+ Plan test capabilities by function, using semantic names such as:
338
+
339
+ - `Data Loading Tests`
340
+ - `Configuration Parsing Tests`
341
+ - `Method Interface Tests`
342
+ - `Metric Computation Tests`
343
+ - `Result Export Tests`
344
+ - `CLI Smoke Tests`
345
+
346
+ For each test capability, specify:
347
+
348
+ - functional behavior under test
349
+ - logical module, interface, or entrypoint semantics under test
350
+ - fixture, toy input, mock data, or minimum example used
351
+ - expected behavior, output schema, or exception
352
+ - pass/fail criterion
353
+ - temporary artifact, terminal output, test report, or minimal debug-log behavior
354
+ - harness dependency protected by the test
355
+
356
+ Tests should use small fixtures or mock data and keep debug outputs separate from paper experiment results. They should make it clear whether a harness failure comes from a bad method result or broken code behavior.
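A `Metric Computation Tests` capability could be sketched with a four-row toy fixture like the one below. The metric `deadline_hit_rate` and its definition (an event counts as a hit when its timestamp is at or before its deadline) are assumptions made for this example only.

```python
from typing import List, Optional


def deadline_hit_rate(records: List[dict]) -> Optional[float]:
    """Fraction of records whose timestamp is at or before the deadline."""
    if not records:
        return None  # missing-data behavior: no records, no value
    hits = sum(1 for record in records if record["timestamp"] <= record["deadline"])
    return hits / len(records)


# Toy fixture: two hits (one exactly on the deadline) and two misses.
TOY_RECORDS = [
    {"timestamp": 1.0, "deadline": 2.0},
    {"timestamp": 3.0, "deadline": 2.0},
    {"timestamp": 2.0, "deadline": 2.0},
    {"timestamp": 5.0, "deadline": 4.0},
]


def test_known_fixture_gives_expected_rate():
    assert deadline_hit_rate(TOY_RECORDS) == 0.5


def test_empty_input_is_reported_as_missing():
    assert deadline_hit_rate([]) is None
```

If these tests pass and a harness still reports a poor hit rate, the likelier cause is the method rather than the metric code.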
357
+
358
+ ## Experiment Stages And Entrypoint Semantics
359
+
360
+ For complex experiments, plan staged execution. Typical stages include:
361
+
362
+ - data or asset preparation
363
+ - workload or task-instance construction
364
+ - candidate method run
365
+ - module-level optimization run
366
+ - full-system evaluation
367
+ - ablation run
368
+ - robustness or stress run
369
+ - metric computation
370
+ - method freeze
371
+ - paper-output derivation interface
372
+
373
+ For each stage, describe:
374
+
375
+ - stage purpose
376
+ - entrypoint semantics
377
+ - required parameters such as method, dataset, split, seed, configuration identifier, resource budget, and run mode
378
+ - input artifact types
379
+ - output artifact types
380
+ - validation checks
381
+
382
+ The same stage should be reusable across methods, datasets, splits, seeds, and configurations through parameter semantics or configuration concepts.
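Reusable entrypoint semantics might be expressed as a single parameterized command, as in this sketch. The program name `run-stage` and the flag names are hypothetical; only the parameter meanings (method, dataset, split, seed, configuration identifier, run mode) come from the text above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build one parser that is shared by every experiment stage."""
    parser = argparse.ArgumentParser(prog="run-stage")
    parser.add_argument("--stage", required=True, help="semantic stage name")
    parser.add_argument("--method", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--split", required=True)
    parser.add_argument("--seed", type=int, required=True)
    parser.add_argument("--config-id", required=True)
    parser.add_argument("--run-mode", choices=["smoke", "pilot", "full"], default="smoke")
    return parser


args = build_parser().parse_args([
    "--stage", "candidate-method-run",
    "--method", "greedy-baseline",
    "--dataset", "toy-trace",
    "--split", "dev",
    "--seed", "7",
    "--config-id", "cfg-a",
])
```

Changing the method, split, or seed changes only the arguments, which keeps the stage itself reusable.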
383
+
384
+ ## Method Selection And Freeze Protocol
385
+
386
+ When candidate methods, learned variants, modified variants, or stress-tuned variants exist, include a method-selection and freeze protocol:
387
+
388
+ - which harnesses may influence method design
389
+ - which harness selects the final method
390
+ - what information the frozen method manifest records
391
+ - which final-validation runs use the frozen method
392
+ - how diagnostic or stress-tuned variants are labeled separately
393
+ - how final split contamination is prevented
394
+
395
+ Paper-facing final evaluation should use a frozen method. Development, calibration, and candidate-selection harnesses can inform the method, while final validation results stay separated from unrestricted tuning runs.
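A frozen method manifest and a contamination check could be sketched as follows. The manifest keys, the use of a SHA-256 digest over the sorted configuration, and the function names are assumptions chosen for illustration.

```python
import hashlib
import json
from typing import Iterable


def freeze_method(method_name: str, config: dict, selected_by: str,
                  selection_splits: Iterable[str]) -> dict:
    """Record what was selected, by which harness, and on which splits."""
    config_text = json.dumps(config, sort_keys=True)
    return {
        "method": method_name,
        "config": config,
        "config_sha256": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "selected_by": selected_by,
        "selection_splits": sorted(selection_splits),
    }


def assert_final_split_is_clean(manifest: dict, final_split: str) -> None:
    """Refuse final validation on a split that influenced method selection."""
    if final_split in manifest["selection_splits"]:
        raise ValueError(f"split {final_split!r} was used during selection")


manifest = freeze_method(
    "candidate-a", {"window": 4, "alpha": 0.5},
    selected_by="Candidate Method Selection Harness",
    selection_splits=["dev", "train"],
)
assert_final_split_is_clean(manifest, "test")
```

The digest lets a final-validation run confirm that it used exactly the frozen configuration, whatever order the configuration keys were written in.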
396
+
397
+ ## Raw-First Result Export
398
+
399
+ Plan export artifacts so later analysis, plotting, and writing skills can work without rerunning experiments. Describe artifact types and schemas, not output paths.
400
+
401
+ Use these classifications:
402
+
403
+ - `raw_observation`: observed events, identifiers, timestamps, states, labels supplied by data, component outputs, directly measured values, and per-example decisions or predictions
404
+ - `metadata`: run manifests, resolved configs, environment details, dependency versions, source commits, command or entrypoint text, and orchestration records
405
+ - `metric`: derived scores, rates, deltas, deadline statistics, quality scores, aggregate summaries, statistical summaries, and decision-rule results
406
+ - `analysis`: counterfactuals, attributions, simulated alternatives, oracle analyses, and generated analytical records
407
+ - `summary`: human-readable reports and validation summaries
408
+
409
+ For each important artifact type, specify:
410
+
411
+ - artifact name
412
+ - classification
413
+ - purpose
414
+ - producing stage
415
+ - required fields
416
+ - granularity
417
+ - format tendency, such as JSONL for raw per-event records or JSON/CSV for aggregates
418
+ - source raw artifacts for metrics, summaries, and analyses
419
+ - downstream consumer, such as plotting, paper writing, or coding validation
420
+ - validation checks
421
+
422
+ Raw observations should be exported before aggregation. Paper-specific plotting and table formatting consume exported artifacts later.
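The raw-first ordering can be illustrated with a small sketch that writes per-event JSONL first and derives an aggregate only from the exported file. The field `latency_ms`, the file name `raw.jsonl`, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def export_raw(path: Path, records: Iterable[dict]) -> None:
    """Write one raw observation per line before any aggregation happens."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")


def derive_mean_latency(raw_path: Path) -> dict:
    """Compute a metric artifact from the exported raw file only."""
    with raw_path.open(encoding="utf-8") as handle:
        rows = [json.loads(line) for line in handle if line.strip()]
    values = [row["latency_ms"] for row in rows]
    return {
        "classification": "metric",
        "name": "mean_latency_ms",
        "value": sum(values) / len(values) if values else None,
        "source_raw_artifacts": [raw_path.name],
    }


# Demo in a temporary directory so no output path is prescribed.
with tempfile.TemporaryDirectory() as directory:
    raw_file = Path(directory) / "raw.jsonl"
    export_raw(raw_file, [
        {"classification": "raw_observation", "latency_ms": 10.0},
        {"classification": "raw_observation", "latency_ms": 30.0},
    ])
    metric = derive_mean_latency(raw_file)
```

The metric artifact names its source raw artifact, so a later plotting or writing step can trace a number back to the records behind it.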
423
+
424
+ ## Paper Result Derivations
425
+
426
+ Map each required paper table, figure, or claim to exported artifacts:
427
+
428
+ - paper output name
429
+ - claim or evidence role
430
+ - raw artifact types
431
+ - metric artifact types
432
+ - grouping or filtering
433
+ - derived quantities
434
+ - statistical summary
435
+ - expected downstream artifact type
436
+ - notes for plotting or writing skills
437
+
438
+ Keep paper-specific plotting and final table formatting outside the core experiment system.
439
+
440
+ ## Readability And Path Hygiene Pass
441
+
442
+ Before writing files, revise for readability and logical-design hygiene:
443
+
444
+ - Use semantic names as primary anchors for methods, modules, harnesses, tests, exports, and stages.
445
+ - Replace alias-only or abstract-ID cross-references with natural references.
446
+ - Use numbered lists only for real sequence or priority.
447
+ - Make `coding_plan.explain.md` understandable without repeatedly checking `coding_plan.md`.
448
+ - Confirm the plan describes logical modules and interfaces rather than invented file paths, directory trees, or filenames.
449
+ - Preserve only user-provided existing input paths or project facts that must be cited.
450
+ - Express boundaries as ownership rules, such as `Code implementation belongs to the downstream coding skill` and `Paper plotting consumes exported artifacts later`.
451
+
452
+ ## Artifact Quality Self-Audit
453
+
454
+ Before writing files and again after read-back, check:
455
+
456
+ - `coding_plan.md` contains only the English coding plan.
457
+ - `coding_plan.explain.md` is a Chinese-first explanation written in natural sentences.
458
+ - Neither file relies on global abstract IDs such as `C1`, `B2`, `H3`, or `T4`.
459
+ - Neither file invents implementation file paths, directory layout, script paths, test file paths, or output paths beyond the requested artifact delivery location and explicitly provided input paths.
460
+ - Harnesses are research-facing evaluation loops with explicit goals, controlled inputs, modification scope, entrypoint semantics, metrics, raw artifacts, and comparison logic.
461
+ - Testing remains separate from harness structure and focuses on functional correctness of loaders, interfaces, configs, metrics, exports, and entrypoint wiring.
462
+ - Candidate methods and baselines map to replaceable logical interfaces.
463
+ - Result exports are raw and parseable first; paper-figure/table derivations are downstream analysis artifacts.
464
+ - The files are project-specific and avoid repeating skill rules as defensive boilerplate.
465
+
466
+ ## Validation
467
+
468
+ Before the final response, confirm:
469
+
470
+ - `coding_plan.md` exists and contains only English coding-plan content.
471
+ - `coding_plan.explain.md` exists and is Chinese-first explanation content.
472
+ - The output directory contains exactly these two files unless the user explicitly requested additional artifacts.
473
+ - DeepResearch was run or a fresh lookup artifact was reused.
474
+ - The plan includes logical modules, shared interfaces, replaceable methods and baselines, metrics and decision rules, harness structure, testing structure, staged entrypoint semantics, method freeze protocol when needed, raw-first exports, paper-output derivations, implementation order, acceptance criteria, and open coding questions.
475
+ - Every harness has a semantic name, paper-goal mapping, modification scope, stable inputs, entrypoint semantics, parseable raw artifact schema, metric rule, and relationship to other harnesses.
476
+ - Every test capability has a semantic name, small fixture or mock input, expected behavior, pass/fail criteria, and debug-output behavior separated from paper results.
477
+ - Paper outputs can be derived from raw and metric artifacts without rerunning experiments.
478
+ - The readability and path hygiene pass succeeds.
479
+ - For `output/evolve-*` outputs or artifact-access feedback, the final response includes a `Review Handoff` section with both complete read-back files under relative path headings.
480
+ - For repeated artifact-access feedback, the handoff is self-contained enough for a reviewer to evaluate language, content boundaries, path hygiene, harness/testing separation, result artifact schema quality, redundancy, and defensive wording without opening local files.
481
+
482
+ ## Final Response
483
+
484
+ After writing and validating the files, summarize:
485
+
486
+ - paths written
487
+ - major plan components
488
+ - high-blocking open questions
489
+ - validation performed, including read-back result
490
+
491
+ For `output/evolve-*` outputs or when artifact-access feedback requests pasted contents, add a `Review Handoff` heading immediately after the concise validation sentence and paste the complete read-back contents of both files using the five-backtick handoff format. A path-only response is incomplete for access-limited review.
@@ -0,0 +1,11 @@
1
+ interface:
2
+ display_name: "Academic Army Coding Plan"
3
+ short_description: "Readable coding plan with semantic harnesses, tests, and raw exports"
4
+ default_prompt: "Create an English coding_plan.md and Chinese coding_plan.explain.md with $academic-army-coding-plan from the paper blueprint, experiment plan, and mandatory pre-planning deepresearch. Use only those local task inputs unless they explicitly reference another required file. Use project-relative paths, semantic module/method/harness/test names, natural cross-references instead of abstract global IDs, separate paper-goal harness structure from functional testing structure, include a Chinese decision-rationale explanation, separate raw, metadata, metric, analysis, and summary outputs, and read both artifacts back before the final response. If the request writes to output/evolve-* or prior feedback says a reviewer cannot read local artifacts, add a Review Handoff section and paste the complete read-back contents of both generated files under clear path headings after the validation summary; paths-only, summary-only, or partial-excerpt final responses are incomplete."
5
+
6
+ dependencies:
7
+ tools:
8
+ - type: "mcp"
9
+ value: "academic_army_mcp_tools"
10
+ description: "Provides academic_army_mcp_tools.deepresearch for current method, baseline, dataset, benchmark, metric, artifact, and evaluation-harness evidence."
11
+ transport: "stdio"