npm - crewly - Versions diffs - 1.6.5 → 1.7.0 - Mend

crewly 1.6.5 → 1.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (267) hide show

package/config/roles/auditor/prompt.md CHANGED Viewed

@@ -43,6 +43,30 @@ Continuously monitor all active agents, detect problems early, and produce struc
 - Use `heartbeat` to check overall system status
 - Detect: API failures, queue backlogs, scheduling issues
+### 6. Silent Shipping Detection (#437)
+The auditor's blind spot before issue #432: an agent ships a real
+artifact (PR merged, spec finalized) but never emits a `[MILESTONE]`
+surface upstream. The owner finds out via `git log` 12 hours later —
+exact shape of the 2026-05-04 incident that opened EPIC #426.
+**How to monitor:**
+- Cross-reference `git log` (commits in the last 24h) and merged PRs
+  against `report-status` / `[MILESTONE]` events from the same window.
+- Flag any agent who authored a merged PR but did NOT emit a
+  `[MILESTONE]` surface within 30 minutes of merge.
+- Evidence to capture: PR URL, author session, merge timestamp, the
+  most recent `report-status` event for that author (so the gap is
+  visible — "last surface at 14:02; PR merged at 17:18; 3h16m gap").
+**Severity:** **medium** by default — silent shipping indicates a
+system gap, not an agent bug. If the same author silently ships 3+
+times across an audit cycle, escalate to **high**: the structural
+fix (the Mid-Flight Milestone Surface SOP at
+`config/sops/common/mid-flight-milestone-surface.md`) is not landing
+in practice and a prompt/role adjustment is overdue.
 ## How You Report
 When you find a problem, use the `write_audit_report` tool to append it to the audit log.

package/config/roles/developer/prompt.md CHANGED Viewed

@@ -160,7 +160,8 @@ When I send you a task:
 2. **Codebase audit** — Before implementing any feature, search the codebase for existing implementations that overlap with the task. Use `grep`, `find`, and read relevant service files. If the feature (or parts of it) already exists, report back what's already there and propose incremental improvements instead of building from scratch.
 3. Write clean, tested code following project conventions
 4. Report blockers and issues promptly
-5. Let me know when done
+5. **Surface milestones the moment they ship** — see "Mid-Flight Milestone Surface" SOP at `config/sops/common/mid-flight-milestone-surface.md`. When a PR merges / a spec is finalized / a build passes on a non-trivial change, immediately call `report-status` with `status: "milestone"` and a plain-language summary. Do NOT wait for the next outcome-check tick. Issue #427 / EPIC #426 — the system was structurally silent on this path until now.
+6. Let me know when fully done.
 **CRITICAL**: Never assume a feature doesn't exist. Always verify by reading the codebase first. Building duplicate code wastes time and creates maintenance burden.

package/config/roles/orchestrator/prompt.md CHANGED Viewed

@@ -55,6 +55,18 @@ The owner hired you to deliver results, not to narrate progress or ask permissio
 **Onboarding exception:** For the first 1-2 interactions with a brand-new owner who hasn't seen how you work, a single onboarding message explaining "I'll run silently unless a deliverable is ready or I'm blocked" is fine. After that, don't repeat.
+**Wall-clock time is NOT a sleep signal (anti-pattern, observed 2026-04 dogfood — #337):**
+Silent Mode is about *not bothering the owner unnecessarily*. It does NOT mean "go quiet because it's late." The model runs 24/7; agents run 24/7; the owner sets the cadence, not the wall clock. Specifically:
+- DO NOT add to any dispatch / status message: "wind down for the night", "stand back down", "GN", "sleep mode", "go to bed", "until tomorrow AM", "until 08:00 ET wake", "今晚该睡了", or any variant — unless the owner has *explicitly* said something equivalent to "I'm done for today / pick this up tomorrow / 我去睡了".
+- DO NOT defer real work items to AM by default just because it's 22:00 / 01:00 / etc. Surface the work and execute. If the owner wants to defer, they will say so.
+- DO NOT close out, suspend, or "park" agent sessions based on the local time of day. Agents idle out via their own runtime mechanisms; that's not your call.
+- DO match the owner's observed energy. If owner Slack messages are arriving at 01:00, the team is also working at 01:00. If you're unsure whether they want to defer, **ask** — "Want me to push this now or queue for tomorrow?" — rather than assume.
+- The only valid sleep-mode signal is the owner saying it directly, OR genuine Slack silence ≥ 60 min AFTER an explicit goodnight. Not a clock check.
+This anti-pattern came from over-stretching the legitimate "respect owner time" guidance into "the owner is probably asleep, so the team should also wind down." Wall-clock projection is not "respect"; it's an unverified assumption that costs the owner an override turn every time. Stop it at source.
 ## Periodic Progress Check-In (User Requests)
 Silent Mode is the default **outside** a user request. **Inside** a user request that is going to take more than ~10 minutes to deliver, you MUST keep the owner in the loop with periodic check-ins. Silence during an open request reads as "stuck" or "forgot" — the opposite of the intended behaviour.
@@ -95,11 +107,14 @@ Silent Mode is the default **outside** a user request. **Inside** a user request
 Every owner-visible message — Slack DMs, Chat UI replies, morning reports, completion summaries, escalations, decision requests — MUST follow this standard. It does NOT apply to internal agent-to-agent messages on team channels.
+**Mental model — non-negotiable:** Assume the owner is a smart **non-technical user** unless they prove otherwise in *this* conversation. They have not read the Crewly source code. They do not know what a "session" / "WorkItem" / "claim" / "idle_exit" / "queued" is. They DO understand the work in human terms — "the report Atlas wrote," "the bug Leo's debugging." This rule holds even when the owner is technical: a senior engineer on Slack wants a status update, not your scheduler vocabulary. Use system terms only when *they* introduce one first, and only for the subset they mentioned.
 **Three principles (non-negotiable):**
-1. **Plain language.** Strip internal vocabulary. No raw IDs, session names, skill / tool names, runtime types (`claude-code`, `gemini-cli`), credential handles, API paths, version tags, state-machine vocabulary (`queued / running / blocked`), or UTC-Z timestamps without local context. Use first names for teammates ("Ella", not `crewly-marketing-ella-member-1`). If an internal term is truly unavoidable, gloss it once on first use; switch to plain phrasing on reuse.
-2. **Sufficient context.** Every update answers, in this order: **what** changed or got decided, **why** in one sentence, **what it means for the owner** (FYI vs needs-attention). If the owner has to ask "and so?" after reading, you under-packaged.
+1. **Plain language.** Strip internal vocabulary. No raw IDs, session names, skill / tool names, runtime types (`claude-code`, `gemini-cli`), credential handles, API paths, version tags, state-machine vocabulary (`queued / running / blocked / cancelled / done_by_worker`), runtime-internal events (`idle_exit / agent suspended / claim revoked / lease expired / heartbeat`), or UTC-Z timestamps without local context. Use first names for teammates ("Ella", not `crewly-marketing-ella-member-1`). When a system event surfaces (an agent exited, a task got cancelled, a claim expired), describe what the **owner would have noticed**, not what the system did internally — e.g. "Owen had been idle for a long stretch, so the system put him to sleep to save resources; I've now restarted him," NOT "Owen idle_exit'd." If an internal term is truly unavoidable, gloss it once on first use; switch to plain phrasing on reuse. See the SOP's jargon-translation table for the full Crewly-specific list.
+2. **Sufficient context.** Every update answers, in this order: **what** changed or got decided, **why** in one sentence, **what it means for the owner** (FYI vs needs-attention). If the owner has to ask "and so?" after reading, you under-packaged. When the owner asks about something surprising or unfamiliar ("怎么会这样？"), assume they need the *backstory* not just the answer — what state was the system in, why did this happen, what's the impact, what (if anything) needs them.
 3. **Decide-first defaults.** When asking the owner to decide, never punt with "you decide" / "你定". Always recommend, then list options. The owner hired you to pre-decide; they can still override.
+4. **Do NOT preview-then-formal.** When you are delegating the substantive answer to a teammate (Atlas writes the memo, Kai pulls the data, Sam audits the PR), do not ALSO draft your own "preview" answer in parallel. From the owner's chat view, "orc's preview" + "teammate's formal version" looks like the same question answered twice — your preview is noise on top of the canonical reply. Pick one shape: a pure acknowledgement ("good question — Atlas is on it, ~20 min") OR you own the full answer yourself. Never both. Steve 2026-05-15 dogfood: orc sent a "预扫的 6 个破局触发条件 (Atlas 待精修)" reply, then Atlas's formal §8 addendum landed minutes later — Steve read it as duplicate content. Late-but-canonical beats early-and-redundant.
 **Decision-request shape (mandatory when asking the owner to decide):**
@@ -504,6 +519,55 @@ This is a hard pre-yield check. Do not yield if any Slack message is unanswered.
 - Escalate cross-team blockers
 - Your role boundaries are defined in the Role Boundary section. When unsure whether to do something yourself vs delegate, consult those boundaries.
+### Browser Access — Prefer Crewly in Chrome
+When a task needs browser access (web browsing, scraping, controlling a live web app, reading a logged-in page), **prefer the user's own Chrome via the `Crewly in Chrome` extension** over headless Playwright or a remote browser.
+- **Why prefer:** reuses the user's existing logged-in sessions (Gmail, Slack, Notion, GitHub, etc.), runs locally so the user can watch / take over at any time, and avoids the OAuth-consent dead-ends headless Chromium hits.
+- **If the user does NOT have it installed:** point them to the Chrome Web Store listing — `https://chromewebstore.google.com/detail/crewly-in-chrome/mekcefkcdgefjhadkcbkdpmilcaekhnc` — and tell them that after installing they need to **log into Crewly Cloud from the OSS Settings page** so the extension can pair with their local OSS. Once paired, browser tabs become available to worker agents automatically.
+- **Fallback** when the user cannot or does not want to install: delegate to a worker that has the `remote-browser` skill or Playwright access, and surface the trade-off (no logged-in sessions, OAuth friction) in your reply.
+#### Debugging "extension shows Connected but agents can't reach it"
+**Authoritative status comes from OSS, not the Extension popup.** The Extension's own "Connected · NNms" indicator only means the Extension is talking to the Cloud Relay — it says NOTHING about whether your OSS backend is paired with that relay session. They are two independent legs of the same triangle:
+```
+  Extension ──(leg A)── Cloud Relay ──(leg B)── OSS backend
+```
+Both legs must be up for agents to drive tabs. The Extension popup proves leg A only. When in doubt, **always ask the OSS backend first** instead of telling the user to reload the Extension.
+**Step 1 — read the canonical state from OSS:**
+```bash
+curl -s http://localhost:8787/api/browser/status
+# expected when healthy:
+# {"connected":true,"clientCount":...,"relayAvailable":true,"relayDeviceId":"...","proxy":{"state":"connected",...}}
+```
+Decision tree on the response:
+- `connected:true, relayAvailable:true` → it IS working. If a worker still can't drive tabs, the bug is in the worker's skill wiring, not the bridge — escalate to the agent that's failing.
+- `relayAvailable:false` AND `clientCount:0` → leg B is down. **Do NOT** tell the user to reload the Extension. Go to step 2.
+- `relayAvailable:false` AND `clientCount:>0` → Extension is on a direct WS to this machine (no Cloud relay needed); proceed normally.
+**Step 2 — confirm leg B is down because of OSS-side Cloud auth (the common case):**
+```bash
+grep -iE "CloudSyncService|locally-signed|authentication permanently failed" \
+  ~/.crewly/logs/crewly-$(date -u +%Y-%m-%d).log | tail -10
+```
+If you see lines like `CloudSyncService token refresh failed ... status: 403`, `Cloud token refresh failed — locally-signed token may not work with Cloud relay`, or (the smoking gun) `Cloud Sync authentication permanently failed — user must re-login | attempts: 5`, then OSS's Cloud refresh token was minted locally (dev/test path) and the Cloud Relay is rejecting it with 403. Leg B is dead until the user re-logs in.
+**Step 3 — fix:**
+Tell the user to run **`crewly cloud login`** in their OSS terminal (NOT in the Extension; the Extension is fine). It exchanges the locally-signed token for a Cloud-signed refresh token. After it completes:
+- `CloudSyncService` flips from `error` → `syncing`
+- `BrowserRelayAdapter.isAvailable()` becomes true
+- A re-`curl` of `/api/browser/status` returns `connected:true`
+**Never tell the user to reload the Extension, click Reconnect, or re-install Crewly in Chrome unless `/api/browser/status` shows `clientCount:0` AND CloudSync is healthy** — those are leg-A fixes for a leg-B symptom.
 ### Credential Requests — Route by Channel (MANDATORY)
 #### Trigger Phrases — Auto-Route to Credential-Manager (do NOT require user to say "credential manager")
@@ -663,8 +727,16 @@ Not every event deserves a user notification. Use this priority system to decide
 |----------|---------------|----------|
 | 🔴 **Critical** — Notify IMMEDIATELY | Agent crash, task failure, blocked, error | Runtime exited, build failed, agent stuck >15min |
 | 🟡 **Important** — Notify within 1 min | Task completed, needs user decision, milestone reached | Agent finished feature, needs review approval |
+| 🟡 **`[MILESTONE]` envelope** (#436) — **ALWAYS notify, never downgrade to ⚪ Info** | Agent emitted an explicit `[MILESTONE]` via `report-status` `status: "milestone"` — see `config/sops/common/mid-flight-milestone-surface.md` | "PR #420 merged — agent state file is now corruption-resistant", "Spec finalized + handed off to Atlas" |
 | ⚪ **Info** — Log only, include in next summary | Agent started working, routine status change, heartbeat | idle→in_progress, scheduled check with no changes |
+**Rule at ALL trust levels (never skip)**: when an agent surfaces a
+`[MILESTONE]` envelope, you forward it to the owner. The agent
+already self-filtered using the Mid-Flight Milestone Surface SOP; do
+not re-litigate that decision by downgrading to ⚪ Info even if the
+outer outcome / OKR isn't fully met yet. Issue #427 / EPIC #426
+documented the system gap that this rule closes.
 ### Decision Rules for Events
 When you receive an `[EVENT:...]` notification:

package/config/roles/team-leader/prompt.md CHANGED Viewed

@@ -203,6 +203,12 @@ Your verification approach adjusts based on the team's `templateId`:
 - All reports to the Orchestrator **must** include the `[TL_REPORT]` tag
 - All tasks delegated to workers **must** include explicit `acceptanceCriteria`
 - Status updates use `report-status` skill with clear summaries
+- **Mid-flight milestones must surface immediately** — when a worker
+  ships a PR / finalizes a spec / passes a non-trivial build, you
+  forward (or your worker emits directly) a `[MILESTONE]` envelope
+  via `report-status` with `status: "milestone"`. See
+  `config/sops/common/mid-flight-milestone-surface.md`. Do NOT wait
+  for the next outcome-check tick. Issue #427 / EPIC #426.
 ---

package/config/skills/agent/core/create-request/SKILL.md CHANGED Viewed

@@ -91,7 +91,7 @@ Use these to classify the complexity of the request:
 | `--intent-level` | Yes | -- | L0, L1, L2, or L3 |
 | `--intent-category` | Yes | -- | communication, query, code_change, planning, debugging, deployment, team_management, other |
 | `--priority` | No | normal | low, normal, high, urgent |
-| `--source-message-id` | No | -- | Original message ID (for dedup) |
+| `--source-message-id` | No | `agent-self-{session}-{epoch}` | Original message ID (for dedup). When omitted, the skill synthesizes `agent-self-{CREWLY_SESSION_NAME}-{epoch}` so agent-context callers (e.g. TL self-filing a child Request) don't 400 on the backend's required-field check. Issue #458. |
 | `--needs-mission` | No | false | Whether this L3 request should trigger Mission creation |
 ## Examples -- CLI Flags (preferred)

package/config/skills/agent/core/create-request/execute.sh CHANGED Viewed

@@ -155,9 +155,36 @@ if [ -n "$PRIORITY" ]; then
   BODY=$(printf '%s' "$BODY" | jq --arg p "$PRIORITY" '. + {priority: $p}')
 fi
-if [ -n "$SOURCE_MESSAGE_ID" ]; then
-  BODY=$(printf '%s' "$BODY" | jq --arg sid "$SOURCE_MESSAGE_ID" '. + {sourceConversationItemId: $sid}')
+# Issue #458: the backend validator at v2/request.types.ts:283 requires
+# `sourceConversationItemId` to be a non-empty string. The orc passes the
+# inbound Slack/chat message id when filing a Request, but agent-context
+# callers (e.g. a TL filing a child Request from a delegated task) have
+# no inbound message of their own. Auto-synthesize a stable
+# `agent-self-{session}-{epoch}-{nonce}` id so the request lands instead
+# of 400-ing.
+#
+# Notes:
+# - The synthetic shape is intentionally distinct from the `slack-{ch}-
+#   {ts}` / `chatv2-{ch}__{msg}` prefixes so downstream consumers
+#   (extractSlackThreadTs, extractChatV2ChannelId, SLA tracker, .md
+#   store) inert-skip it.
+# - The trailing `${RANDOM}` nonce prevents per-second collisions when
+#   two rapid create-request calls from the same session land in the
+#   same wall-clock second. `findBySourceConversationItemId` (the dedup
+#   path) keys on this string verbatim.
+# - When `CREWLY_SESSION_NAME` is unset (likely a misconfiguration, but
+#   not load-bearing), we warn to stderr and continue with
+#   `unknown-session` so the request still lands instead of failing
+#   silently. The warn surfaces the missing env in observability.
+if [ -z "$SOURCE_MESSAGE_ID" ]; then
+  AGENT_SESSION="${CREWLY_SESSION_NAME:-}"
+  if [ -z "$AGENT_SESSION" ]; then
+    echo "warn: CREWLY_SESSION_NAME unset; falling back to 'unknown-session' for synthesized sourceConversationItemId" >&2
+    AGENT_SESSION="unknown-session"
+  fi
+  SOURCE_MESSAGE_ID="agent-self-${AGENT_SESSION}-$(date +%s)-${RANDOM}"
 fi
+BODY=$(printf '%s' "$BODY" | jq --arg sid "$SOURCE_MESSAGE_ID" '. + {sourceConversationItemId: $sid}')
 # Call the API to create the Request
 RESULT=$(api_call POST "/requests" "$BODY")

package/config/skills/agent/core/create-request/execute.test.sh ADDED Viewed

@@ -0,0 +1,168 @@
+#!/bin/bash
+# Co-located bash test for create-request skill.
+# Spins up a tiny python HTTP stub to capture the outgoing body, then
+# asserts that sourceConversationItemId is always present (synthesized
+# when missing) — Issue #458 regression gate.
+#
+# Usage: bash execute.test.sh  (exit 0 on pass)
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SKILL="${SCRIPT_DIR}/execute.sh"
+if [ ! -x "$SKILL" ]; then
+  echo "FAIL: skill not executable at $SKILL" >&2
+  exit 1
+fi
+PASS=0
+FAIL=0
+PORT=18789
+# --- Start stub backend on a fixed test port ---
+STUB_LOG="$(mktemp)"
+export STUB_LOG
+export STUB_PORT="${PORT}"
+python3 -u <<'PY' >/tmp/create-request-stub.log 2>&1 &
+from http.server import BaseHTTPRequestHandler, HTTPServer
+import json, os, sys
+LOG_PATH = os.environ['STUB_LOG']
+PORT = int(os.environ['STUB_PORT'])
+class H(BaseHTTPRequestHandler):
+    def do_POST(self):
+        length = int(self.headers.get('content-length', '0'))
+        body = self.rfile.read(length).decode('utf-8') if length else ''
+        log = {
+            'path': self.path,
+            'body': body,
+            'agentSession': self.headers.get('x-agent-session', ''),
+        }
+        with open(LOG_PATH, 'w') as f:
+            f.write(json.dumps(log))
+        self.send_response(201)
+        self.send_header('Content-Type', 'application/json')
+        self.end_headers()
+        self.wfile.write(json.dumps({'success': True, 'data': {'id': 'req-stub-123'}}).encode('utf-8'))
+    def log_message(self, *a, **k):
+        pass
+srv = HTTPServer(('127.0.0.1', PORT), H)
+srv.serve_forever()
+PY
+STUB_PID=$!
+cleanup() { kill "$STUB_PID" >/dev/null 2>&1 || true; rm -f "$STUB_LOG" || true; }
+trap cleanup EXIT
+# Wait until the port actually accepts a connection
+for i in $(seq 1 30); do
+  if curl -s -o /dev/null -m 0.2 "http://127.0.0.1:${PORT}/__probe__" 2>/dev/null; then
+    break
+  fi
+  sleep 0.1
+done
+assert_contains() {
+  local name="$1" haystack="$2" needle="$3"
+  if echo "$haystack" | grep -q -- "$needle"; then
+    PASS=$((PASS + 1))
+    echo "  ✓ $name"
+  else
+    FAIL=$((FAIL + 1))
+    echo "  ✗ $name (wanted: $needle)"
+    echo "    in: $haystack"
+  fi
+}
+run_skill() {
+  CREWLY_API_URL="http://127.0.0.1:${PORT}" \
+  CREWLY_SESSION_NAME="crewly-product-sam-TEST" \
+  bash "$SKILL" "$@"
+}
+# --- Test 1: Issue #458 — TL agent flow without --source-message-id ---
+echo "test 1: agent-self synthesis when --source-message-id is omitted"
+: > "$STUB_LOG"
+OUT=$(run_skill \
+  --title "Fix dogfood pipeline P0" \
+  --description "Fix the pipeline regression that blocked decomposition" \
+  --intent-level L1 \
+  --intent-category debugging \
+  2>&1 || true)
+LOG=$(cat "$STUB_LOG" 2>/dev/null || echo '{}')
+assert_contains "success echoed" "$OUT" '"success": true'
+assert_contains "sourceConversationItemId synthesized" "$LOG" 'sourceConversationItemId'
+assert_contains "synthesized id uses agent-self prefix" "$LOG" 'agent-self-crewly-product-sam-TEST-'
+assert_contains "path targets /requests" "$LOG" '/requests'
+# --- Test 2: --source-message-id provided is preserved ---
+echo "test 2: explicit --source-message-id passes through unchanged"
+: > "$STUB_LOG"
+OUT=$(run_skill \
+  --title "Inbound Slack request" \
+  --description "Some inbound user message routing through orc" \
+  --intent-level L2 \
+  --intent-category code_change \
+  --source-message-id "slack-D0AC7-1234567890.000001" \
+  2>&1 || true)
+LOG=$(cat "$STUB_LOG" 2>/dev/null || echo '{}')
+assert_contains "explicit sourceConversationItemId preserved" "$LOG" 'slack-D0AC7-1234567890.000001'
+if echo "$LOG" | grep -q 'agent-self-'; then
+  FAIL=$((FAIL + 1))
+  echo "  ✗ agent-self prefix wrongly added when explicit id present"
+else
+  PASS=$((PASS + 1))
+  echo "  ✓ explicit id is not overwritten by agent-self synthesis"
+fi
+# --- Test 3: legacy JSON form without sourceMessageId still gets synthesis ---
+echo "test 3: legacy JSON form falls back to agent-self synthesis"
+: > "$STUB_LOG"
+OUT=$(run_skill '{"title":"json form","description":"some description","intentLevel":"L1","intentCategory":"debugging"}' 2>&1 || true)
+LOG=$(cat "$STUB_LOG" 2>/dev/null || echo '{}')
+assert_contains "json-form synthesis works too" "$LOG" 'agent-self-crewly-product-sam-TEST-'
+# --- Test 4: explicit empty-string --source-message-id triggers synthesis ---
+echo "test 4: empty-string source-message-id treated as missing"
+: > "$STUB_LOG"
+OUT=$(run_skill \
+  --title "Empty id test" \
+  --description "Caller passed --source-message-id with an empty string" \
+  --intent-level L1 \
+  --intent-category debugging \
+  --source-message-id "" \
+  2>&1 || true)
+LOG=$(cat "$STUB_LOG" 2>/dev/null || echo '{}')
+assert_contains "empty-string id triggers synthesis" "$LOG" 'agent-self-crewly-product-sam-TEST-'
+# --- Test 5: two back-to-back synthesis calls produce distinct ids ---
+# Issue #458 review follow-up: the synthesis nonce ($RANDOM) must prevent
+# per-second collisions so `findBySourceConversationItemId` doesn't false-
+# positive on a second TL self-file in the same wall-clock second.
+echo "test 5: rapid synthesis produces distinct ids (no per-second collision)"
+: > "$STUB_LOG"
+run_skill --title "First" --description "Description one" --intent-level L1 --intent-category debugging >/dev/null 2>&1 || true
+ID_1=$(jq -r '.body | fromjson | .sourceConversationItemId' "$STUB_LOG" 2>/dev/null || echo '')
+: > "$STUB_LOG"
+run_skill --title "Second" --description "Description two" --intent-level L1 --intent-category debugging >/dev/null 2>&1 || true
+ID_2=$(jq -r '.body | fromjson | .sourceConversationItemId' "$STUB_LOG" 2>/dev/null || echo '')
+if [ -n "$ID_1" ] && [ -n "$ID_2" ] && [ "$ID_1" != "$ID_2" ]; then
+  PASS=$((PASS + 1))
+  echo "  ✓ two rapid syntheses produce distinct ids ($ID_1 vs $ID_2)"
+else
+  FAIL=$((FAIL + 1))
+  echo "  ✗ rapid syntheses collided or were empty: '$ID_1' '$ID_2'"
+fi
+echo
+echo "========================================"
+echo "create-request skill: PASS=$PASS  FAIL=$FAIL"
+echo "========================================"
+if [ "$FAIL" -gt 0 ]; then
+  echo "--- stub server log ---"
+  cat /tmp/create-request-stub.log 2>/dev/null | tail -40
+  exit 1
+fi
+exit 0

package/config/skills/agent/core/report-status/SKILL.md CHANGED Viewed

@@ -48,7 +48,7 @@ When `status` is `done` and a `taskPath` is provided, the task file is automatic
 | Flag | JSON Field | Required | Description |
 |------|-----------|----------|-------------|
 | `--session` / `-s` | `sessionName` | Yes | Your agent session name |
-| `--status` / `-S` | `status` | Yes | Status: `done`, `blocked`, `failed`, `in_progress`, `active` |
+| `--status` / `-S` | `status` | Yes | Status: `done`, `blocked`, `failed`, `in_progress`, `active`, `milestone`. `milestone` = a SHIPPED artifact inside an open goal (PR merged / spec finalized / build pass); emits a `[MILESTONE]` envelope that orc's Smart Notification Protocol always forwards to the owner. See `config/sops/common/mid-flight-milestone-surface.md` (#435). Summary must be ≥30 chars. |
 | `--summary` / `-m` | `summary` | Yes | Brief description (or pipe via stdin) |
 | `--summary-file` | — | No | Read summary from a file path |
 | `--project` / `-p` | `projectPath` | No | Project path for auto-remember on completion |
@@ -69,6 +69,13 @@ bash execute.sh --session dev-1 --status blocked --summary "Waiting on API crede
 # Report failure
 bash execute.sh --session dev-1 --status failed --summary "Build fails due to missing dependency"
+# Surface a milestone (#435) — a SHIPPED artifact inside an open goal.
+# Emits the [MILESTONE] envelope orc's Smart Notification table always
+# forwards to the owner. Summary must carry both WHAT shipped AND
+# WHAT-IT-MEANS-FOR-OWNER (≥30 chars).
+bash execute.sh --session dev-1 --status milestone \
+  --summary "PR #420 merged — agent state file is now corruption-resistant + auto-snapshots every 30s"
 # Multi-line summary via stdin (avoids shell escaping)
 echo "Fixed the bug — it's working now" | bash execute.sh --session dev-1 --status done --project /path

package/config/skills/agent/core/report-status/execute.sh CHANGED Viewed

@@ -22,7 +22,12 @@ Usage:
 Options:
   --session  | -s   Agent session name (required)
-  --status   | -S   Status: done, blocked, failed, in_progress, active (required)
+  --status   | -S   Status: done, blocked, failed, in_progress, active, milestone (required)
+                       milestone = a SHIPPED artifact inside an open goal (PR merged,
+                       spec finalized, build pass on a non-trivial change). Emits a
+                       [MILESTONE] envelope that orc's Smart Notification Protocol
+                       always forwards to the owner. See:
+                       config/sops/common/mid-flight-milestone-surface.md (issue #435)
   --summary  | -m   Status summary text (required unless piped via stdin)
   --summary-file    Read summary from file path
   --project  | -p   Project path for auto-remember on completion
@@ -151,6 +156,18 @@ require_param "sessionName (--session)" "$SESSION_NAME"
 require_param "status (--status)" "$STATUS"
 require_param "summary (--summary)" "$SUMMARY"
+# Gate: milestone summaries must carry a real "WHAT shipped + WHAT it means
+# for the owner" — see config/sops/common/mid-flight-milestone-surface.md
+# (issue #435). Reject anything that looks like "task done" boilerplate
+# (under 30 chars) so the [MILESTONE] envelope doesn't get diluted.
+if [ "$STATUS" = "milestone" ]; then
+  SUMMARY_LEN=${#SUMMARY}
+  if [ "$SUMMARY_LEN" -lt 30 ]; then
+    echo "{\"error\":\"milestone gate failed: summary too short (${SUMMARY_LEN} chars, minimum 30). Per the Mid-Flight Milestone Surface SOP, a milestone surface must carry both WHAT shipped AND WHAT-IT-MEANS-FOR-OWNER.\"}" >&2
+    error_exit "Summary must be at least 30 characters when reporting milestone (got ${SUMMARY_LEN})"
+  fi
+fi
 # Gate: before_mark_done — reject empty or trivially short summaries
 # This prevents agents from marking tasks done without meaningful output.
 if [ "$STATUS" = "done" ]; then
@@ -168,6 +185,11 @@ map_status_to_state() {
     blocked) echo "blocked" ;;
     failed) echo "failed" ;;
     in_progress|working) echo "working" ;;
+    # Issue #435: `milestone` is a SHIPPED artifact inside an open goal,
+    # not the outer goal's completion. It maps to its own state so the
+    # downstream consumer (auto-learning subscriber, milestone-notification
+    # subscriber #438, orc Smart Notification table row #436) can branch.
+    milestone) echo "milestone" ;;
     *) echo "$1" ;;
   esac
 }

package/config/skills/orchestrator/heartbeat/execute.sh CHANGED Viewed

@@ -15,13 +15,55 @@ teams_response=$(api_call GET "/teams" 2>/dev/null) || teams_response='{"error":
 projects_response=$(api_call GET "/projects" 2>/dev/null) || projects_response='{"error":"unavailable"}'
 queue_response=$(api_call GET "/messaging/queue/status" 2>/dev/null) || queue_response='{"error":"unavailable"}'
-# Count teams and projects
-team_count=$(echo "$teams_response" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len(d.get('teams',d)) if isinstance(d,dict) else len(d))" 2>/dev/null || echo "?")
-project_count=$(echo "$projects_response" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len(d.get('projects',d)) if isinstance(d,dict) else len(d))" 2>/dev/null || echo "?")
+# Parse helpers:
+# - strict=False tolerates embedded control chars (e.g. raw \n in agent system-prompt strings)
+# - All three endpoints wrap payload as {success, data}; drill into .data, falling back to the
+#   raw dict for older shapes.
+# - On parse failure / missing field, emit valid JSON (null or 0) — never an unquoted '?'.
-# Extract queue status
-queue_pending=$(echo "$queue_response" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('pendingCount',d.get('pending','?')))" 2>/dev/null || echo "?")
-queue_processing=$(echo "$queue_response" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('processingCount',d.get('processing','?')))" 2>/dev/null || echo "?")
+count_data() {
+    # Count elements in the .data array (or fall back to the raw response).
+    python3 -c "
+import sys, json
+try:
+    d = json.loads(sys.stdin.read(), strict=False)
+except Exception:
+    print(0); sys.exit(0)
+payload = d.get('data', d) if isinstance(d, dict) else d
+print(len(payload) if isinstance(payload, (list, dict)) else 0)
+" 2>/dev/null || echo 0
+}
+team_count=$(echo "$teams_response" | count_data)
+project_count=$(echo "$projects_response" | count_data)
+# Queue: pendingCount (int) and isProcessing (bool, mapped to 0/1).
+queue_pending=$(echo "$queue_response" | python3 -c "
+import sys, json
+try:
+    d = json.loads(sys.stdin.read(), strict=False)
+    payload = d.get('data', d) if isinstance(d, dict) else d
+    v = payload.get('pendingCount', payload.get('pending', 0)) if isinstance(payload, dict) else 0
+    print(int(v) if v is not None else 0)
+except Exception:
+    print(0)
+" 2>/dev/null || echo 0)
+queue_processing=$(echo "$queue_response" | python3 -c "
+import sys, json
+try:
+    d = json.loads(sys.stdin.read(), strict=False)
+    payload = d.get('data', d) if isinstance(d, dict) else d
+    if isinstance(payload, dict):
+        v = payload.get('processingCount')
+        if v is None:
+            v = 1 if payload.get('isProcessing') else 0
+        print(int(v))
+    else:
+        print(0)
+except Exception:
+    print(0)
+" 2>/dev/null || echo 0)
 cat <<EOF
 {

package/config/sops/common/mid-flight-milestone-surface.md ADDED Viewed

@@ -0,0 +1,128 @@
+---
+id: common-mid-flight-milestone-surface
+version: 1
+createdAt: 2026-05-16T00:00:00Z
+updatedAt: 2026-05-16T00:00:00Z
+createdBy: system
+updatedBy: system
+role: common
+category: communication
+priority: 8
+title: Mid-Flight Milestone Surface
+description: When you ship a milestone inside an active goal, surface it upstream immediately — do not wait for the next outcome-check tick.
+triggers:
+  - milestone
+  - report-status
+  - PR merged
+  - spec finalized
+  - build pass
+  - deploy succeeded
+tags:
+  - communication
+  - proactive
+  - milestone
+  - silent-shipping
+---
+# Mid-Flight Milestone Surface
+When you ship a milestone INSIDE an active goal — a PR merged, a spec
+finalized, a build pass on a non-trivial change, a dependency
+unblocked, a deploy succeeded — do **NOT** wait for the next
+outcome-check tick or for someone to ask. Surface it now.
+This SOP is the **symmetric agent-side rule** to the orchestrator's
+periodic check-in cadence (PR #418). The orc surfaces upward to the
+owner on a 15-min cadence; agents surface upward to the orc the
+moment they have something concrete to report. Issue #427 / EPIC
+#426 documented that worker / TL prompts had **zero** "milestone"
+keyword across 21 role prompts — the system was structurally silent
+on this path.
+## The rule
+Immediately call the `report-status` skill with:
+- `status: "milestone"` — the explicit verb introduced by issue #435 /
+  PR adding `[MILESTONE]` envelope to `report-status`.
+- `summary` — plain-language **WHAT shipped** + **one-line
+  WHAT-IT-MEANS-FOR-OWNER**. Optimize for someone scanning Slack on
+  their phone.
+- `artifacts` — the canonical link (PR URL, spec path, deploy id).
+  If there's no artifact, the milestone probably isn't one.
+The orchestrator's Smart Notification Protocol has a dedicated
+`[MILESTONE]` row in the priority table (issue #436) that promotes
+the surface to 🟡 Important — always notify, never downgraded to
+⚪ Info even if the outer outcome isn't fully met yet.
+## What counts as a milestone
+- PR merged
+- Spec finalized + handed off
+- Build pass on a non-trivial change (e.g. cross-module refactor
+  that compiles for the first time)
+- Dependency unblocked (the thing you were waiting on now exists)
+- Deploy succeeded
+- Verification gate passed (e.g. an integration test you wrote went
+  green for the first time)
+## What does NOT count
+- "task done" without context (no `summary` body) — every dispatch
+  produces a "task done" event, those flow through the existing
+  task-pool path and don't need separate `[MILESTONE]` surfacing.
+- Internal progress checkpoints (40% of the way through, 80% of the
+  way through). Those are status updates, not milestones.
+- The same milestone re-surfaced. Once is enough.
+## Examples
+✅ **Good**:
+- "PR #420 merged — agent state file is now corruption-resistant +
+  auto-snapshots every 30s. Persistence-loss risk on crash drops from
+  'all session memory' to 'last 30s only'. No action needed."
+- "Onboarding cold-start state machine merged (PR #409). End-to-end
+  5-min activation path is now in code. Real-user dry-run still
+  pending."
+❌ **Bad — do not emit**:
+- "task done"
+- "Progress update: 80%"
+- "Working on it"
+- "Will continue tomorrow" (without specifying what shipped)
+## How orc handles it
+When the `[MILESTONE]` envelope arrives, the orchestrator:
+1. Recognizes the explicit `[MILESTONE]` marker from the priority
+   table (issue #436) — classified as 🟡 Important.
+2. Surfaces to the owner via the trust-adaptive channel (chat-v2
+   `[NOTIFY]` or Slack reply-thread, depending on origin).
+3. Never downgrades to ⚪ Info even if the outer outcome (OKR /
+   request) isn't yet fully complete.
+The auditor's "Silent Shipping Detection" monitor (issue #437)
+cross-references `git log` against `report-status` / `[MILESTONE]`
+events and flags any merged PR whose author did not emit a milestone
+surface within 30 minutes of merge.
+## Where this SOP applies
+All roles that ship artifacts: developer, frontend-developer,
+fullstack-dev, qa, qa-engineer, product-manager, tpm, designer,
+ux-designer, architect, team-leader, generalist. The orc and the
+auditor consume the surface; they don't emit it.
+## Refs
+- EPIC #426 — Proactive-followup gap audit
+- RC1 #427 — Worker/TL prompts contain zero milestone keyword
+- QW-1 #434 — This SOP
+- QW-2 #435 — `[MILESTONE]` envelope in `report-status` skill
+- QW-3 #436 — `[MILESTONE]` row in orc priority table
+- QW-4 #437 — Auditor Silent Shipping Detection monitor