npm - gm-qwen - Versions diffs - 2.0.781 → 2.0.783 - Mend

gm-qwen 2.0.781 → 2.0.783

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4) hide show

package/gm.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "gm",
-  "version": "2.0.781",
+  "version": "2.0.783",
   "description": "State machine agent with hooks, skills, and automated git enforcement",
   "author": "AnEntrypoint",
   "license": "MIT",

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "gm-qwen",
-  "version": "2.0.781",
+  "version": "2.0.783",
   "description": "State machine agent with hooks, skills, and automated git enforcement",
   "author": "AnEntrypoint",
   "license": "MIT",

package/skills/gm/SKILL.md CHANGED Viewed

@@ -85,6 +85,16 @@ Pure-prose edits to static documents with no JS/canvas/DOM behavior change are e
 This rule fires in EXECUTE (witness on edit), EMIT (post-emit verify), and VERIFY (final gate). All three.
+## NOTHING FAKE — HARD RULE
+What ships runs against real services, real data, real binaries. Stubs, mocks, placeholder returns, fixture-only paths, "TODO: implement", `return null /* fake */`, hardcoded sample responses, and demo-mode fallbacks are forbidden in source the user will run. They produce green checks that survive into production and lie about what works.
+Scaffolding and shims are permitted when they call through to real behavior — an empty file laid down before its body, a thin adapter wrapping an upstream API, a build target that compiles but is wired to nothing yet *and is the only callsite of itself*. The test is whether the artifact, executed, would do the thing it claims. If it would not, it is a stub.
+Before writing a shim or adapter, the agent asks whether an existing library or tool already provides the same surface. Maintaining a local reimplementation of something an upstream package solves is its own failure mode — the shim drifts, ages, and accumulates the bugs the upstream already fixed. If a published package fits, the shim becomes one line of import.
+Stub detection is by behavior, not by keyword: code paths that always succeed, always return the same value regardless of input, or short-circuit a real call to satisfy a type signature, are stubs. Comments asserting realness do not make code real. The witnessing rule that closes a mutable also closes this one — until real input has produced real output through the new code, it is provisional, and shipping provisional code as done is forced closure.
 ## EXECUTION ORDER
 1. Recall — `plugkit recall` for any familiar-feeling unknown (cheapest, 200 tokens)

package/skills/planning/SKILL.md CHANGED Viewed

@@ -75,6 +75,16 @@ The trigger is functional, not a path-list: any change whose effect is observabl
 Propagation: EXECUTE witnesses on edit, EMIT re-witnesses post-write, VERIFY runs the final gate. The plan must encode the rule so all three layers fire.
+## NOTHING FAKE — HARD RULE
+Plan items resolve when real input flows through real code into real output. Stubs, mocks, placeholder returns, fixture-only branches, "TODO: implement", and demo-mode short-circuits do not count as resolution — they are mutables wearing closed-status disguise.
+Acceptance criteria must witness behavior, not the existence of a function with the right name. "X is implemented" is not acceptance; "X called with real Y produces real Z" is. The agent that satisfies the criterion via a stub has built something that will lie when production calls it.
+Scaffolding and shims are permitted only when the shim *delegates* to real behavior — wraps an upstream API, calls a real subprocess, hits a real disk. Before adding a shim, the plan asks whether a published library or tool already provides that surface; maintaining a local reimplementation of an upstream solution is its own failure mode and the shim line should usually become an import line.
+The fake-detection test is behavioral: would the code, executed against the inputs it claims to accept, produce the outputs it claims to produce? If the answer requires "after we fill in the body" or "once X is wired up", the plan item is open, not done.
 ## ORIENT — HARD RULE
 Open every plan with a parallel pack of `exec:recall` and `exec:codesearch` against the request's nouns. Hits land as `weak_prior`; misses confirm the unknown is fresh. The pack runs in one message — never serially. The agent that skips orient pays the same cost in fresh probes a turn later, plus the price of disagreeing with its own prior witness.
@@ -164,10 +174,24 @@ No comments. No scattered test files. 200-line limit per file. Fail loud. No dup
 ## SINGLE INTEGRATION TEST POLICY
-One `test.js` at project root. 200-line max. No fixtures, mocks, or scattered test files under any naming convention. Plain assertions, real data, real system. `gm-complete` runs it. Failure = regression to EXECUTE.
+One `test.js` at project root. **200-line hard cap.** No fixtures, mocks, or scattered test files under any naming convention. Plain assertions, real data, real system. `gm-complete` runs it. Failure = regression to EXECUTE.
 Any second test runner — under any name, in any directory — is a smuggled parallel surface and fights the discipline. If a behavior needs to be exercised in-page, register it in `window.__debug` and assert via `test.js`.
+**Purpose: maximum surface coverage in 200 lines.** test.js is a budget, not a target. Every line should witness a load-bearing behavior; redundant assertions are dead weight. Subsystems get *one* group each — combined groups (e.g. `profiles+observability+auth+context+cron+batch`) are the norm, not the exception. As thoth grew from 17 → 21 → 14 named groups while the surface tripled, the win came from collapsing per-subsystem groups into multi-subsystem ones.
+**Use overlap to exclude.** When subsystem A's test exercises B as a side effect, B does not need its own group — drop the redundant assertion. Examples that have proven out:
+- The agent-machine tool-loop test exercises bash dispatch → no separate bash test needed beyond a smoke-call inside the tools+toolsets group.
+- The dashboard test asserts the API surface AND that the registry has ≥N tools → covers tool registration coverage.
+- The plugins+memory group exercises observability metrics + achievements → no need for a separate plugins-extra group.
+- The gateway test exercising one platform plus a platform-stub-shape loop covers all 18 adapters in one group.
+**Adding a new subsystem:** first try to fold its assertion into the closest existing group. Only create a new group when the subsystem's failure mode is genuinely orthogonal (e.g. compressor's iterative-update behavior is not exercised by any other group). Test surface should grow linearly with subsystem count, not multiplicatively, and the line budget is the forcing function.
+**Pattern that works:** name combined groups by joining their subsystems with `+`, e.g. `home+config+skin`, `mcp+swe+distributions+account+credpool`, `env+pi+cli+tui+setup+website`. Future readers see the coverage at a glance. A group title with 4–6 components is healthy; a group with 1 component should be questioned.
+**Hygiene at edit time:** every change to test.js prefers compaction over expansion. If `wc -l test.js > 200`, the discipline is *not* "split" — it's "merge groups + drop redundancy" until it fits. If the budget is genuinely insufficient for the load-bearing surface, the right move is to question whether the assertion is load-bearing, not to lift the cap.
 ## RESPONSE POLICY
 Terse. Drop filler. Fragments OK. Pattern: `[thing] [action] [reason]. [next step].` Code/commits/PRs = normal prose.