@kontourai/flow-agents 0.1.1
- package/.githooks/pre-push +11 -0
- package/.github/workflows/ci.yml +210 -0
- package/.github/workflows/docs-pages.yml +52 -0
- package/.github/workflows/publish-npm.yml +104 -0
- package/AGENTS.md +26 -0
- package/CHANGELOG.md +66 -0
- package/CODE_OF_CONDUCT.md +25 -0
- package/CONTEXT.md +300 -0
- package/CONTRIBUTING.md +44 -0
- package/LICENSE +201 -0
- package/README.md +129 -0
- package/SECURITY.md +33 -0
- package/agent-cards/dev.json +19 -0
- package/agents/dev.json +127 -0
- package/agents/tool-code-reviewer.json +61 -0
- package/agents/tool-dependencies-updater.json +118 -0
- package/agents/tool-explore-config.json +92 -0
- package/agents/tool-explore-deps.json +92 -0
- package/agents/tool-explore-entry.json +92 -0
- package/agents/tool-explore-patterns.json +92 -0
- package/agents/tool-explore-structure.json +92 -0
- package/agents/tool-explore-tests.json +92 -0
- package/agents/tool-planner.json +57 -0
- package/agents/tool-playwright.json +145 -0
- package/agents/tool-security-reviewer.json +56 -0
- package/agents/tool-verifier.json +61 -0
- package/agents/tool-worker.json +58 -0
- package/build/src/cli/console-learning-projection.js +123 -0
- package/build/src/cli/docs-preview.js +39 -0
- package/build/src/cli/effective-backlog-settings.js +102 -0
- package/build/src/cli/export-bookmarks.js +38 -0
- package/build/src/cli/fixture-retirement-audit.js +140 -0
- package/build/src/cli/flow-kit.js +138 -0
- package/build/src/cli/import-bookmarks.js +50 -0
- package/build/src/cli/init.js +239 -0
- package/build/src/cli/instinct-cli.js +93 -0
- package/build/src/cli/promote-workflow-artifact.js +63 -0
- package/build/src/cli/publish-change-helper.js +154 -0
- package/build/src/cli/pull-work-provider.js +469 -0
- package/build/src/cli/runtime-adapter.js +23 -0
- package/build/src/cli/telemetry-doctor.js +221 -0
- package/build/src/cli/usage-feedback.js +443 -0
- package/build/src/cli/validate-hook-influence.js +152 -0
- package/build/src/cli/validate-source-tree.js +31 -0
- package/build/src/cli/validate-workflow-artifacts.js +486 -0
- package/build/src/cli/veritas-governance.js +262 -0
- package/build/src/cli/workflow-artifact-cleanup-audit.js +272 -0
- package/build/src/cli/workflow-sidecar.js +816 -0
- package/build/src/cli.js +89 -0
- package/build/src/flow-kit/validate.js +75 -0
- package/build/src/lib/args.js +45 -0
- package/build/src/lib/fs.js +62 -0
- package/build/src/lib/workflow-learning-projection.js +334 -0
- package/build/src/runtime-adapters.js +146 -0
- package/build/src/tools/build-universal-bundles.js +397 -0
- package/build/src/tools/common.js +56 -0
- package/build/src/tools/filter-installed-packs.js +132 -0
- package/build/src/tools/generate-context-map.js +198 -0
- package/build/src/tools/validate-package.js +64 -0
- package/build/src/tools/validate-source-tree.js +622 -0
- package/console.telemetry.json +176 -0
- package/context/base-rules.md +17 -0
- package/context/code-review-standards.md +62 -0
- package/context/coding-standards.md +42 -0
- package/context/common/orchestrators.md +12 -0
- package/context/common/subagents.md +28 -0
- package/context/contracts/artifact-contract.md +182 -0
- package/context/contracts/builder-kit-workflow-state-contract.md +319 -0
- package/context/contracts/delivery-contract.md +69 -0
- package/context/contracts/execution-contract.md +53 -0
- package/context/contracts/governance-adapter-contract.md +67 -0
- package/context/contracts/planning-contract.md +85 -0
- package/context/contracts/review-contract.md +104 -0
- package/context/contracts/sandbox-policy.md +52 -0
- package/context/contracts/verification-contract.md +134 -0
- package/context/contracts/work-item-contract.md +215 -0
- package/context/deferred/demo-mode.md +33 -0
- package/context/deferred/languages/go.md +31 -0
- package/context/deferred/languages/python.md +31 -0
- package/context/deferred/languages/typescript.md +34 -0
- package/context/deferred/parallelization.md +35 -0
- package/context/deferred/worktree-isolation.md +24 -0
- package/context/development-workflow.md +50 -0
- package/context/scripts/context-budget/budget-scan.sh +166 -0
- package/context/scripts/detect-tools.sh +3 -0
- package/context/scripts/discover-agents.sh +28 -0
- package/context/scripts/git-status.sh +49 -0
- package/context/scripts/hooks/config-protection.js +79 -0
- package/context/scripts/hooks/desktop-notify.sh +39 -0
- package/context/scripts/hooks/governance-audit.sh +135 -0
- package/context/scripts/hooks/lib/audit-transport.sh +40 -0
- package/context/scripts/hooks/lib/hook-flags.js +49 -0
- package/context/scripts/hooks/lib/patterns.sh +57 -0
- package/context/scripts/hooks/lib/resolve-formatter.js +80 -0
- package/context/scripts/hooks/post-edit-accumulator.js +66 -0
- package/context/scripts/hooks/pre-commit-quality.js +194 -0
- package/context/scripts/hooks/quality-gate.js +93 -0
- package/context/scripts/hooks/report-only-guard.js +21 -0
- package/context/scripts/hooks/run-hook.js +136 -0
- package/context/scripts/hooks/stop-format-typecheck.js +141 -0
- package/context/scripts/hooks/stop-goal-fit.js +337 -0
- package/context/scripts/hooks/workflow-steering.js +250 -0
- package/context/scripts/telemetry/console-presets.sh +14 -0
- package/context/scripts/telemetry/install-console-config.sh +214 -0
- package/context/scripts/telemetry/lib/config.sh +85 -0
- package/context/scripts/telemetry/lib/enrich.sh +115 -0
- package/context/scripts/telemetry/lib/redact.sh +22 -0
- package/context/scripts/telemetry/lib/session.sh +63 -0
- package/context/scripts/telemetry/lib/transport.sh +183 -0
- package/context/scripts/telemetry/lib/usage.sh +29 -0
- package/context/scripts/telemetry/sync-agents.sh +173 -0
- package/context/scripts/telemetry/telemetry.conf +23 -0
- package/context/scripts/telemetry/telemetry.sh +387 -0
- package/context/scripts/validate-package.sh +89 -0
- package/context/settings/backlog-provider-settings.json +54 -0
- package/context/templates/core/identity.md +26 -0
- package/context/templates/core/user.md +15 -0
- package/docs/_config.yml +15 -0
- package/docs/_layouts/default.html +87 -0
- package/docs/adr/0001-flow-agents-consumes-flow.md +77 -0
- package/docs/adr/0002-flow-kits-as-extension-unit.md +13 -0
- package/docs/adr/0003-flow-agents-coordinates-kits-and-adapters.md +13 -0
- package/docs/adr/0004-gates-expect-surface-claims.md +15 -0
- package/docs/adr/0005-kubernetes-inspired-resource-contracts.md +48 -0
- package/docs/adr/0006-typescript-first-source-policy.md +98 -0
- package/docs/agent-system-guidebook.md +391 -0
- package/docs/agent-usage-feedback-loop.md +351 -0
- package/docs/assets/favicon.svg +13 -0
- package/docs/assets/og-image.png +0 -0
- package/docs/assets/site.css +774 -0
- package/docs/assets/site.js +139 -0
- package/docs/configurable-workflow-routing.md +174 -0
- package/docs/context-map.md +145 -0
- package/docs/developer-architecture.md +145 -0
- package/docs/developer-hook-setup.md +61 -0
- package/docs/fixture-ownership.md +44 -0
- package/docs/flow-kit-repository-contract.md +180 -0
- package/docs/index.md +129 -0
- package/docs/kontour-resource-contract.md +358 -0
- package/docs/migrations.md +64 -0
- package/docs/north-star.md +322 -0
- package/docs/operating-layers.md +110 -0
- package/docs/repository-structure.md +132 -0
- package/docs/sandbox-policy.md +56 -0
- package/docs/skills-map.md +203 -0
- package/docs/standards-register.md +96 -0
- package/docs/veritas-integration.md +165 -0
- package/docs/work-item-adapters.md +72 -0
- package/docs/workflow-artifact-lifecycle.md +141 -0
- package/docs/workflow-eval-strategy.md +295 -0
- package/docs/workflow-shared-contracts.md +51 -0
- package/docs/workflow-usage-guide.md +443 -0
- package/evals/ARCHITECTURE.md +143 -0
- package/evals/CONVENTIONS.md +58 -0
- package/evals/README.md +128 -0
- package/evals/acceptance/run.sh +29 -0
- package/evals/acceptance/test_claude_harness.sh +242 -0
- package/evals/acceptance/test_codex_harness.sh +108 -0
- package/evals/acceptance/test_kiro_harness.sh +128 -0
- package/evals/cases/dev/404.html +97 -0
- package/evals/cases/dev/code-review.yaml +44 -0
- package/evals/cases/dev/dashboard.html +300 -0
- package/evals/cases/dev/deliver.yaml +66 -0
- package/evals/cases/dev/dependency-update.yaml +16 -0
- package/evals/cases/dev/explore.yaml +20 -0
- package/evals/cases/dev/index.html +370 -0
- package/evals/cases/dev/package-lock.json +28 -0
- package/evals/cases/dev/package.json +16 -0
- package/evals/cases/dev/plan-work.yaml +20 -0
- package/evals/cases/dev/promptfooconfig.yaml +666 -0
- package/evals/cases/dev/search-first.yaml +20 -0
- package/evals/cases/dev/tdd-workflow.yaml +48 -0
- package/evals/cases/dev/verify-work.yaml +44 -0
- package/evals/cases/dev/workflow.yaml +34 -0
- package/evals/ci/run-baseline.sh +283 -0
- package/evals/fixtures/backlog-provider-settings/global-default.json +44 -0
- package/evals/fixtures/backlog-provider-settings/project-override.json +53 -0
- package/evals/fixtures/builder-kit-workflow-state/baseline-freshness-resolution-hint.json +139 -0
- package/evals/fixtures/builder-kit-workflow-state/direct-primitive-stop.json +59 -0
- package/evals/fixtures/builder-kit-workflow-state/empty-board-route-shape.json +55 -0
- package/evals/fixtures/builder-kit-workflow-state/happy-path.json +71 -0
- package/evals/fixtures/builder-kit-workflow-state/mid-work-resume.json +80 -0
- package/evals/fixtures/builder-kit-workflow-state/missing-prestep-recovery.json +65 -0
- package/evals/fixtures/builder-kit-workflow-state/product-build-chaining.json +60 -0
- package/evals/fixtures/builder-kit-workflow-state/stale-continuation-requires-new-probe.json +57 -0
- package/evals/fixtures/console-learning-projection/artifacts/console-learning-correction/learning.json +50 -0
- package/evals/fixtures/console-learning-projection/artifacts/console-learning-open-route/learning.json +41 -0
- package/evals/fixtures/flow-kit-repository/invalid-absolute-path/kit.json +8 -0
- package/evals/fixtures/flow-kit-repository/invalid-asset-section/flows/review.flow.json +6 -0
- package/evals/fixtures/flow-kit-repository/invalid-asset-section/kit.json +11 -0
- package/evals/fixtures/flow-kit-repository/invalid-duplicate-flow/flows/review.flow.json +6 -0
- package/evals/fixtures/flow-kit-repository/invalid-duplicate-flow/kit.json +9 -0
- package/evals/fixtures/flow-kit-repository/invalid-id/flows/review.flow.json +6 -0
- package/evals/fixtures/flow-kit-repository/invalid-id/kit.json +8 -0
- package/evals/fixtures/flow-kit-repository/invalid-malformed-json/kit.json +8 -0
- package/evals/fixtures/flow-kit-repository/invalid-missing-flow/kit.json +8 -0
- package/evals/fixtures/flow-kit-repository/invalid-missing-id/flows/review.flow.json +6 -0
- package/evals/fixtures/flow-kit-repository/invalid-missing-id/kit.json +7 -0
- package/evals/fixtures/flow-kit-repository/invalid-missing-schema-version/flows/review.flow.json +6 -0
- package/evals/fixtures/flow-kit-repository/invalid-missing-schema-version/kit.json +7 -0
- package/evals/fixtures/flow-kit-repository/invalid-name/flows/review.flow.json +6 -0
- package/evals/fixtures/flow-kit-repository/invalid-name/kit.json +8 -0
- package/evals/fixtures/flow-kit-repository/invalid-schema-version/flows/review.flow.json +6 -0
- package/evals/fixtures/flow-kit-repository/invalid-schema-version/kit.json +8 -0
- package/evals/fixtures/flow-kit-repository/invalid-traversal/kit.json +8 -0
- package/evals/fixtures/flow-kit-repository/mixed-runtime-kit/adapters/example.json +3 -0
- package/evals/fixtures/flow-kit-repository/mixed-runtime-kit/assets/example.txt +1 -0
- package/evals/fixtures/flow-kit-repository/mixed-runtime-kit/docs/README.md +3 -0
- package/evals/fixtures/flow-kit-repository/mixed-runtime-kit/flows/runtime.flow.json +26 -0
- package/evals/fixtures/flow-kit-repository/mixed-runtime-kit/kit-evals/example.json +3 -0
- package/evals/fixtures/flow-kit-repository/mixed-runtime-kit/kit-skills/mixed/SKILL.md +3 -0
- package/evals/fixtures/flow-kit-repository/mixed-runtime-kit/kit.json +44 -0
- package/evals/fixtures/flow-kit-repository/valid-local-kit/docs/README.md +3 -0
- package/evals/fixtures/flow-kit-repository/valid-local-kit/flows/review.flow.json +26 -0
- package/evals/fixtures/flow-kit-repository/valid-local-kit/kit.json +20 -0
- package/evals/fixtures/hook-influence/cases.json +336 -0
- package/evals/fixtures/pull-work-provider/github-issues.json +170 -0
- package/evals/fixtures/pull-work-wip-shepherding/global-wip-informs.json +43 -0
- package/evals/fixtures/pull-work-wip-shepherding/personal-wip-blocks.json +42 -0
- package/evals/fixtures/surface-trust/accepted-claim-trust-report.json +31 -0
- package/evals/fixtures/surface-trust/artifact-absent.json +19 -0
- package/evals/fixtures/surface-trust/integrity-mismatch-trust-report.json +32 -0
- package/evals/fixtures/surface-trust/missing-authority-trust-report.json +27 -0
- package/evals/fixtures/surface-trust/provider-absent.json +19 -0
- package/evals/fixtures/surface-trust/rejected-claim-trust-report.json +30 -0
- package/evals/fixtures/surface-trust/stale-claim-trust-snapshot.json +31 -0
- package/evals/fixtures/usage-feedback/sample-full.jsonl +11 -0
- package/evals/fixtures/usage-feedback/sample-outcomes.jsonl +1 -0
- package/evals/fixtures/veritas-governance-adapter/fake-veritas-pass.sh +18 -0
- package/evals/fixtures/veritas-governance-adapter/fake-veritas-secret-fail.sh +10 -0
- package/evals/fixtures/veritas-governance-adapter/fake-veritas-unconfigured.sh +4 -0
- package/evals/integration/test_bundle_install.sh +541 -0
- package/evals/integration/test_console_learning_projection.sh +192 -0
- package/evals/integration/test_context_map.sh +65 -0
- package/evals/integration/test_effective_backlog_settings.sh +58 -0
- package/evals/integration/test_fixture_retirement_audit.sh +58 -0
- package/evals/integration/test_flow_agents_statusline.sh +93 -0
- package/evals/integration/test_flow_kit_repository.sh +90 -0
- package/evals/integration/test_goal_fit_hook.sh +482 -0
- package/evals/integration/test_hook_category_behaviors.sh +190 -0
- package/evals/integration/test_hook_influence_cases.sh +69 -0
- package/evals/integration/test_local_flow_kit_install.sh +145 -0
- package/evals/integration/test_publish_change_helper.sh +176 -0
- package/evals/integration/test_pull_work_provider.sh +140 -0
- package/evals/integration/test_runtime_adapter_activation.sh +106 -0
- package/evals/integration/test_telemetry.sh +485 -0
- package/evals/integration/test_telemetry_doctor.sh +193 -0
- package/evals/integration/test_usage_feedback_dashboard.sh +169 -0
- package/evals/integration/test_usage_feedback_global.sh +117 -0
- package/evals/integration/test_usage_feedback_import.sh +227 -0
- package/evals/integration/test_usage_feedback_outcomes.sh +165 -0
- package/evals/integration/test_usage_feedback_report.sh +263 -0
- package/evals/integration/test_veritas_governance_adapter.sh +235 -0
- package/evals/integration/test_workflow_artifact_cleanup_audit.sh +287 -0
- package/evals/integration/test_workflow_artifacts.sh +1247 -0
- package/evals/integration/test_workflow_sidecar_writer.sh +2112 -0
- package/evals/integration/test_workflow_steering_hook.sh +337 -0
- package/evals/lib/assertions/delegated-to.js +40 -0
- package/evals/lib/assertions/max-tool-calls.js +15 -0
- package/evals/lib/assertions/no-write-tools.js +27 -0
- package/evals/lib/assertions/pass-at-k.js +39 -0
- package/evals/lib/assertions/telemetry-utils.js +105 -0
- package/evals/lib/assertions/tool-called.js +39 -0
- package/evals/lib/assertions/verify-after-fix.js +61 -0
- package/evals/lib/claude-judge.sh +40 -0
- package/evals/lib/claude-provider.sh +74 -0
- package/evals/lib/codex-judge.sh +39 -0
- package/evals/lib/codex-provider.sh +81 -0
- package/evals/lib/eval-dev.sh +5 -0
- package/evals/lib/eval-judge.sh +22 -0
- package/evals/lib/eval-provider.sh +26 -0
- package/evals/lib/eval-report.sh +73 -0
- package/evals/lib/kiro-dev.sh +4 -0
- package/evals/lib/kiro-judge.sh +17 -0
- package/evals/lib/kiro-provider.sh +62 -0
- package/evals/lib/node.sh +111 -0
- package/evals/promptfooconfig.yaml +70 -0
- package/evals/run.sh +309 -0
- package/evals/static/test_evidence_refs.sh +141 -0
- package/evals/static/test_package.sh +407 -0
- package/evals/static/test_repo_hooks.sh +68 -0
- package/evals/static/test_universal_bundles.sh +274 -0
- package/evals/static/test_workflow_skills.sh +1207 -0
- package/install.sh +64 -0
- package/integrations/veritas/flow-agents.adapter.json +138 -0
- package/integrations/veritas/flow-agents.authority-settings.json +26 -0
- package/integrations/veritas/flow-agents.repo-standards.json +82 -0
- package/kits/builder/flows/build.flow.json +218 -0
- package/kits/builder/flows/shape.flow.json +127 -0
- package/kits/builder/kit.json +19 -0
- package/kits/catalog.json +11 -0
- package/package.json +130 -0
- package/packaging/README.md +60 -0
- package/packaging/manifest.json +173 -0
- package/packaging/packs.json +69 -0
- package/powers/dependency-checker/POWER.md +20 -0
- package/powers/dependency-checker/mcp.json +20 -0
- package/powers/playwright/POWER.md +25 -0
- package/powers/playwright/mcp.json +12 -0
- package/prompts/code-audit.md +123 -0
- package/prompts/kcommit.md +88 -0
- package/schemas/backlog-provider-settings.schema.json +138 -0
- package/schemas/workflow-acceptance.schema.json +216 -0
- package/schemas/workflow-critique.schema.json +113 -0
- package/schemas/workflow-evidence.schema.json +357 -0
- package/schemas/workflow-handoff.schema.json +52 -0
- package/schemas/workflow-learning.schema.json +223 -0
- package/schemas/workflow-release.schema.json +172 -0
- package/schemas/workflow-state.schema.json +80 -0
- package/scripts/README.md +111 -0
- package/scripts/build-universal-bundles.js +3 -0
- package/scripts/check-content-boundary.cjs +99 -0
- package/scripts/context-budget/budget-scan.sh +166 -0
- package/scripts/detect-tools.sh +3 -0
- package/scripts/discover-agents.sh +28 -0
- package/scripts/effective-backlog-settings.js +2 -0
- package/scripts/filter-installed-packs.js +2 -0
- package/scripts/flow-kit.js +2 -0
- package/scripts/generate-context-map.js +2 -0
- package/scripts/git-status.sh +49 -0
- package/scripts/hooks/claude-hook-adapter.js +174 -0
- package/scripts/hooks/claude-telemetry-hook.js +115 -0
- package/scripts/hooks/codex-hook-adapter.js +176 -0
- package/scripts/hooks/codex-telemetry-hook.js +95 -0
- package/scripts/hooks/config-protection.js +79 -0
- package/scripts/hooks/desktop-notify.sh +39 -0
- package/scripts/hooks/governance-audit.sh +135 -0
- package/scripts/hooks/lib/audit-transport.sh +40 -0
- package/scripts/hooks/lib/hook-flags.js +49 -0
- package/scripts/hooks/lib/patterns.sh +57 -0
- package/scripts/hooks/lib/resolve-formatter.js +80 -0
- package/scripts/hooks/post-edit-accumulator.js +66 -0
- package/scripts/hooks/pre-commit-quality.js +194 -0
- package/scripts/hooks/quality-gate.js +93 -0
- package/scripts/hooks/report-only-guard.js +21 -0
- package/scripts/hooks/run-hook.js +136 -0
- package/scripts/hooks/stop-format-typecheck.js +141 -0
- package/scripts/hooks/stop-goal-fit.js +337 -0
- package/scripts/hooks/workflow-steering.js +250 -0
- package/scripts/install-codex-home.sh +106 -0
- package/scripts/package.json +3 -0
- package/scripts/promote-workflow-artifact.js +2 -0
- package/scripts/publish-change-helper.js +2 -0
- package/scripts/pull-work-provider.js +2 -0
- package/scripts/setup-repo-hooks.sh +8 -0
- package/scripts/statusline/flow-agents-statusline.js +157 -0
- package/scripts/telemetry/console-presets.sh +14 -0
- package/scripts/telemetry/install-console-config.sh +214 -0
- package/scripts/telemetry/lib/config.sh +85 -0
- package/scripts/telemetry/lib/enrich.sh +115 -0
- package/scripts/telemetry/lib/redact.sh +22 -0
- package/scripts/telemetry/lib/session.sh +63 -0
- package/scripts/telemetry/lib/transport.sh +183 -0
- package/scripts/telemetry/lib/usage.sh +29 -0
- package/scripts/telemetry/sync-agents.sh +173 -0
- package/scripts/telemetry/telemetry.conf +23 -0
- package/scripts/telemetry/telemetry.sh +387 -0
- package/scripts/usage-feedback.js +2 -0
- package/scripts/validate-hook-influence-cases.js +2 -0
- package/scripts/validate-package.sh +89 -0
- package/scripts/validate-source-tree.js +9 -0
- package/skills/agentic-engineering/SKILL.md +62 -0
- package/skills/browser-test/SKILL.md +51 -0
- package/skills/builder-shape/SKILL.md +76 -0
- package/skills/context-budget/SKILL.md +40 -0
- package/skills/deliver/SKILL.md +241 -0
- package/skills/dependency-update/SKILL.md +68 -0
- package/skills/design-probe/SKILL.md +107 -0
- package/skills/eval-rebuild/SKILL.md +39 -0
- package/skills/evidence-gate/SKILL.md +186 -0
- package/skills/execute-plan/SKILL.md +110 -0
- package/skills/explore/SKILL.md +137 -0
- package/skills/feedback-loop/SKILL.md +87 -0
- package/skills/fix-bug/SKILL.md +133 -0
- package/skills/frontend-design/SKILL.md +80 -0
- package/skills/github-cli/SKILL.md +63 -0
- package/skills/idea-to-backlog/SKILL.md +267 -0
- package/skills/knowledge-capture/SKILL.md +55 -0
- package/skills/learning-review/SKILL.md +115 -0
- package/skills/pickup-probe/SKILL.md +114 -0
- package/skills/plan-work/SKILL.md +176 -0
- package/skills/pull-work/SKILL.md +309 -0
- package/skills/release-readiness/SKILL.md +121 -0
- package/skills/review-work/SKILL.md +161 -0
- package/skills/search-first/SKILL.md +66 -0
- package/skills/tdd-workflow/SKILL.md +140 -0
- package/skills/verify-work/SKILL.md +109 -0
- package/src/cli/console-learning-projection.ts +140 -0
- package/src/cli/effective-backlog-settings.ts +99 -0
- package/src/cli/fixture-retirement-audit.ts +154 -0
- package/src/cli/flow-kit.ts +139 -0
- package/src/cli/init.ts +248 -0
- package/src/cli/promote-workflow-artifact.ts +64 -0
- package/src/cli/publish-change-helper.ts +143 -0
- package/src/cli/pull-work-provider.ts +481 -0
- package/src/cli/runtime-adapter.ts +24 -0
- package/src/cli/telemetry-doctor.ts +243 -0
- package/src/cli/usage-feedback.ts +418 -0
- package/src/cli/validate-hook-influence.ts +119 -0
- package/src/cli/validate-source-tree.ts +30 -0
- package/src/cli/validate-workflow-artifacts.ts +411 -0
- package/src/cli/veritas-governance.ts +322 -0
- package/src/cli/workflow-artifact-cleanup-audit.ts +281 -0
- package/src/cli/workflow-sidecar.ts +676 -0
- package/src/cli.ts +95 -0
- package/src/flow-kit/validate.ts +74 -0
- package/src/lib/args.ts +43 -0
- package/src/lib/fs.ts +62 -0
- package/src/lib/workflow-learning-projection.ts +491 -0
- package/src/runtime-adapters.ts +154 -0
- package/src/tools/build-universal-bundles.ts +366 -0
- package/src/tools/common.ts +61 -0
- package/src/tools/filter-installed-packs.ts +129 -0
- package/src/tools/generate-context-map.ts +199 -0
- package/src/tools/validate-package.ts +57 -0
- package/src/tools/validate-source-tree.ts +488 -0
- package/tsconfig.json +19 -0
- package/veritas.claims.json +6 -0
@@ -0,0 +1,141 @@
---
title: Workflow Artifact Lifecycle
---

# Workflow Artifact Lifecycle

Flow Agents treats task artifacts as useful working memory, not permanent product documentation. Feature branches should promote durable planning, decisions, evidence pointers, and acceptance notes into normal project docs, source, schemas, or provider records instead of carrying `.flow-agents/` runtime files.

The local artifact root is a current-state dashboard first and a short-lived recovery cache second. It should answer "what needs attention now?" without forcing agents to sift through old successful deliveries.

## Audit Command

Use the read-only cleanup audit before making any local retention decision:

```bash
npm run workflow-artifact-cleanup-audit -- --artifact-root .flow-agents
npm run workflow-artifact-cleanup-audit -- --artifact-root .flow-agents --json
```

The command scans immediate workflow directories, skips non-workflow lanes such as `archive/`, and reports active WIP separately from cleanup candidates, terminal done records, active learning follow-ups, and invalid sidecars. This first slice is dry-run classification only: it does not delete, archive, move, or rewrite runtime artifacts by default, and it has no apply mode.

Use the Current-State Semantics and Local Retention Policy sections below to interpret each bucket. In particular, learning records with `learning.status: followup_required` or any `routing[].status: open` remain active learning follow-ups until every route is completed, opened elsewhere, deferred with a trigger, accepted, or rejected.

## Artifact Lanes

Use one local lane under `.flow-agents/`:

| Lane | Path | Commit Policy | Purpose |
| --- | --- | --- | --- |
| Runtime workspace | `.flow-agents/<slug>/` | Do not commit | Local session state, sidecars, delegate events, scratch evidence, and recovery notes. |

The runtime workspace stays local because it may contain stale session state, machine-specific paths, or noisy intermediate artifacts. When a branch needs cross-session or cross-person traceability, promote the durable summary, decisions, evidence pointers, and acceptance notes into docs, source, schemas, or provider records instead of committing runtime artifacts.

## Current-State Semantics

Treat `state.json` as the active-work signal for local users and `pull-work`.

| State shape | Meaning | Queue treatment |
| --- | --- | --- |
| `planning`, `planned`, `in_progress`, `verifying`, `blocked`, `failed`, `not_verified`, or `needs_decision` | Work still needs agent or user attention. | Active WIP or shepherding candidate. |
| `verified` with `next_action.status: continue` | Local evidence passed, but release, final acceptance, or learning is not closed. | Active shepherding candidate. |
| `verified` with `next_action.status: done` | Evidence passed and the next phase was completed outside the state machine or by a provider record. | Cleanup candidate; should be advanced to a terminal state during final acceptance. |
| `accepted` with `phase: learning` and `learning.status: followup_required` | Learning was captured but at least one routed follow-up is still open or undecided. | Active learning follow-up until routed to backlog, docs, evals, skills, knowledge, or an explicit deferred trigger. |
| `delivered`, `accepted`, or `archived` with `phase: done`, or `accepted`/`archived` with closed learning routing | Completed local workflow. | Not active WIP; retain only while useful for recovery or audit. |

`verified` is not a terminal state. It means the verifier supplied evidence. Final acceptance must still record the provider change, CI/release result, docs promotion decision, and any learning route before the workflow stops being active.
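
The table and the `verified` rule above can be read as a small classification function. The TypeScript below is only a sketch of that reading: the `WorkflowState` shape (in particular the `status` field name), the `Bucket` labels, and `classifyState` are assumptions made for illustration and are not taken from the package's `state.json` schema or from the audit command's implementation. It also ignores per-route `routing[]` entries.

```typescript
// Hypothetical state shape. Only next_action.status, phase, and
// learning.status are named in the text; `status` is an assumed field name.
type WorkflowState = {
  status: string;
  phase?: string;
  next_action?: { status?: string };
  learning?: { status?: string };
};

// Illustrative bucket labels, not the audit command's real output keys.
type Bucket =
  | "active_wip"
  | "active_shepherding"
  | "cleanup_candidate"
  | "active_learning_followup"
  | "terminal_done"
  | "unknown";

const NEEDS_ATTENTION = new Set<string>([
  "planning", "planned", "in_progress", "verifying",
  "blocked", "failed", "not_verified", "needs_decision",
]);

const TERMINAL = new Set<string>(["delivered", "accepted", "archived"]);

function classifyState(state: WorkflowState): Bucket {
  if (NEEDS_ATTENTION.has(state.status)) return "active_wip";

  if (state.status === "verified") {
    // `verified` is never terminal. Anything other than an explicit
    // `done` is treated as still being shepherded (a cautious assumption).
    return state.next_action?.status === "done"
      ? "cleanup_candidate"
      : "active_shepherding";
  }

  if (
    state.status === "accepted" &&
    state.phase === "learning" &&
    state.learning?.status === "followup_required"
  ) {
    return "active_learning_followup";
  }

  if (TERMINAL.has(state.status)) {
    if (state.phase === "done") return "terminal_done";
    // Assumption: learning.status "learned" stands in for
    // "closed learning routing" on accepted/archived records.
    if (state.status !== "delivered" && state.learning?.status === "learned") {
      return "terminal_done";
    }
  }

  return "unknown";
}
```

Returning `unknown` instead of guessing keeps the sketch in line with the dry-run intent: an unrecognized shape is surfaced for a human rather than silently treated as done.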

## Learning Closeout

Learning records are a routing surface, not a permanent parking lot.

Use `learning.status: followup_required` only while at least one learning route still needs action. Each route should end in one of these outcomes:

- `completed`: the doc, eval, skill, backlog item, code change, or knowledge update was made.
- `open`: a provider-backed issue, backlog artifact, or named owner now tracks the follow-up.
- `deferred`: the follow-up has a concrete revisit trigger, such as a later milestone, repeated failure pattern, provider capability, or date.
- `rejected`: the follow-up was considered and intentionally not pursued, with a reason in the learning record.

Once every route is completed, open elsewhere, deferred with a trigger, or rejected with a reason, record `learning.status: learned` and advance the workflow out of active WIP. Do not leave local runtime state as `needs_decision` only because a durable follow-up issue exists.
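
The closeout rule above amounts to a check over every route. The following TypeScript is a minimal sketch under stated assumptions: only the four outcome names and the `followup_required` / `learned` statuses come from the text, while the `LearningRoute` shape and its `tracker`, `trigger`, and `reason` fields are hypothetical names for the "tracked elsewhere", "revisit trigger", and "reason" requirements.

```typescript
type RouteStatus = "completed" | "open" | "deferred" | "rejected";

// Hypothetical route shape; `tracker`, `trigger`, and `reason` are
// illustrative field names, not the package's learning.json schema.
interface LearningRoute {
  target: string;
  status?: RouteStatus; // undefined means the route is still undecided
  tracker?: string;     // issue, backlog artifact, or named owner for `open`
  trigger?: string;     // concrete revisit trigger for `deferred`
  reason?: string;      // why a `rejected` follow-up was not pursued
}

// A route is settled only when its outcome carries the detail the text asks for.
function isRouteSettled(route: LearningRoute): boolean {
  switch (route.status) {
    case "completed":
      return true;
    case "open":
      return Boolean(route.tracker);
    case "deferred":
      return Boolean(route.trigger);
    case "rejected":
      return Boolean(route.reason);
    default:
      return false;
  }
}

// Derive the learning status from the routes instead of setting it by hand.
function learningStatus(routes: LearningRoute[]): "learned" | "followup_required" {
  return routes.every(isRouteSettled) ? "learned" : "followup_required";
}
```

Note that an empty route list yields `learned` here, and that this sketch follows the closeout rule in this section, where an `open` route with a tracker counts as settled.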

Terminal learning review also records correction state in `learning.json`. Before closeout, compare intended behavior to observed behavior:

- Clean runs use `correction.needed: false`, brief `correction.evidence`, and closed/no-follow-up routing such as `target: "none"` with `status: "completed"`.
- Mismatches use `correction.needed: true` with typed `correction.type`, stable `correction.recurrence_key`, intended behavior, observed behavior, gap, and a prevention route or explicit `no_change_rationale`.

Correction records stay in local `learning.json` for this slice. They do not create a new sidecar, do not upload to Source/Sink storage, do not build Console/dashboard UI, and do not automatically open provider issues. Future consumers can derive correction rate, resolved corrections, repeated recurrence keys, stale unresolved corrections, and clean-run rate from the same fields.
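
To make the two correction shapes concrete, here is a hedged TypeScript sketch of a completeness check and one derived metric. The field names `needed`, `evidence`, `type`, `recurrence_key`, and `no_change_rationale` appear in the text above; `intended`, `observed`, `gap`, and `prevention_route` are hypothetical names for the items the text describes only in prose, and both functions are illustrative rather than part of the package.

```typescript
// Fields marked "hypothetical" are not confirmed by the source text.
interface Correction {
  needed: boolean;
  evidence?: string;
  type?: string;
  recurrence_key?: string;
  intended?: string;         // hypothetical name for "intended behavior"
  observed?: string;         // hypothetical name for "observed behavior"
  gap?: string;              // hypothetical name for "gap"
  prevention_route?: string; // hypothetical name for "prevention route"
  no_change_rationale?: string;
}

// List what a record is missing, following the clean-run and mismatch rules.
function missingCorrectionFields(c: Correction): string[] {
  if (!c.needed) {
    // Clean runs only need brief evidence.
    return c.evidence ? [] : ["evidence"];
  }
  const missing: string[] = [];
  const required: Array<keyof Correction> = [
    "type", "recurrence_key", "intended", "observed", "gap",
  ];
  for (const key of required) {
    if (!c[key]) missing.push(key);
  }
  // A mismatch needs either a prevention route or an explicit rationale.
  if (!c.prevention_route && !c.no_change_rationale) {
    missing.push("prevention_route|no_change_rationale");
  }
  return missing;
}

// Share of records that needed a correction; 0 for an empty list.
function correctionRate(records: Correction[]): number {
  if (records.length === 0) return 0;
  return records.filter((c) => c.needed).length / records.length;
}
```

The clean-run rate mentioned above would be the complement of `correctionRate` over the same records.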

Durable learning should be promoted by target:

- workflow rule changes go to `context/contracts/`, `skills/`, or workflow docs
- regression expectations go to `evals/`
- product or architecture decisions go to `docs/` or `docs/adr/`
- executable work goes to GitHub issues or the configured backlog provider
- durable user/team memory goes to the configured knowledge store

## Local Retention Policy

For local-only users, keep enough local state to recover recent work, but do not use `.flow-agents/<slug>/` as the long-term system of record.

Recommended defaults:

| Artifact class | Retain locally | Durable destination |
| --- | --- | --- |
| Active WIP, blockers, and unresolved decisions | Until resolved | Current `.flow-agents/<slug>/` state and handoff. |
| Recently merged or accepted deliveries | 14-30 days, or until the next queue audit | PR body, issue comments, release records, promoted docs, or archived evidence refs. |
| Security, migration, release, or provider-governance evidence | 30-90 days when useful for audit | Provider record, release note, durable doc, or external evidence store. |
| Routine successful local runtime artifacts | Delete or archive after durable promotion and recovery window | Usually none beyond provider record and docs. |
| Learning records with routed follow-ups | Until all routes are completed, opened elsewhere, deferred with trigger, or rejected | Backlog issue, docs/evals/skills change, or knowledge note. |

When a future Source/Sink service is available, the same lifecycle should apply: local runtime artifacts become a cache and upload source; the service becomes the searchable history. Local-only users should still be able to run cleanup from provider records and durable docs without losing active work.
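
The recommended defaults can be expressed as an age check that only proposes cleanup and never performs it. This TypeScript sketch is an assumption-laden illustration: the `LocalArtifact` shape, the class labels, and `retentionDecision` are hypothetical, the default windows use the upper bounds of the ranges in the table, and the 14-day recovery window is a placeholder because the text does not fix one.

```typescript
type ArtifactClass =
  | "active"
  | "recent_delivery"
  | "audit_evidence"
  | "routine"
  | "learning_followup";

// Hypothetical description of one local workflow folder.
interface LocalArtifact {
  kind: ArtifactClass;
  closedAt?: Date;         // when the work was merged or accepted
  promoted: boolean;       // durable summary already lives elsewhere
  routesSettled?: boolean; // only meaningful for learning follow-ups
}

// Upper bounds of the recommended ranges, in days. `recovery` is an assumed value.
const DEFAULT_WINDOWS = { recentDelivery: 30, auditEvidence: 90, recovery: 14 };

const DAY_MS = 24 * 60 * 60 * 1000;

function retentionDecision(
  artifact: LocalArtifact,
  now: Date,
  windows = DEFAULT_WINDOWS,
): "keep" | "cleanup_candidate" {
  if (artifact.kind === "active") return "keep";
  if (artifact.kind === "learning_followup") {
    return artifact.routesSettled ? "cleanup_candidate" : "keep";
  }
  // Never propose cleanup for content that has not been promoted.
  if (!artifact.promoted || !artifact.closedAt) return "keep";

  const ageDays = (now.getTime() - artifact.closedAt.getTime()) / DAY_MS;
  const limit =
    artifact.kind === "recent_delivery" ? windows.recentDelivery
    : artifact.kind === "audit_evidence" ? windows.auditEvidence
    : windows.recovery;
  return ageDays > limit ? "cleanup_candidate" : "keep";
}
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to cover with fixtures.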
|
|
91
|
+
|
|
92
|
+
## Prevention Rules
|
|
93
|
+
|
|
94
|
+
To prevent historical entries from polluting current-state scans:
|
|
95
|
+
|
|
96
|
+
1. After a PR is merged or a no-provider-change path is accepted, final acceptance must advance `state.json` out of `verified` unless there is a real blocker.
|
|
97
|
+
2. If learning is required, route every learning item before marking the workflow inactive. Open durable issues are valid routes; they should not keep the local workflow active forever.
|
|
98
|
+
3. `pull-work` should classify old `verified` records with `next_action.status: done` as cleanup candidates, not active implementation work.
|
|
99
|
+
4. Queue audits should flag `needs_decision` or `followup_required` records older than the local recovery window.
|
|
100
|
+
5. Cleanup should preserve links to PRs, issues, durable docs, and evidence summaries before deleting or archiving local runtime folders.
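
Rules 3 and 4 above can be sketched as a classifier over a `state.json`-like record. The record shape (`status`, `updated_at`, `next_action.status`), the 30-day recovery window, and the returned labels are assumptions made for illustration; the real sidecar schema may differ.

```python
from datetime import datetime, timedelta, timezone

# Assumed local recovery window; the document does not fix a single value.
RECOVERY_WINDOW = timedelta(days=30)


def classify_record(state: dict, now: datetime) -> str:
    """Classify a workflow record for a current-state scan.

    Assumes a hypothetical shape:
    {"status": str, "updated_at": ISO-8601 str, "next_action": {"status": str}}
    """
    status = state.get("status")
    next_status = state.get("next_action", {}).get("status")
    age = now - datetime.fromisoformat(state["updated_at"])

    # Rule 3: verified + done is cleanup work, not active implementation.
    # (This sketch omits an age threshold for "old" records.)
    if status == "verified" and next_status == "done":
        return "cleanup_candidate"
    # Rule 4: stale decision or follow-up records get flagged for audit.
    if status in {"needs_decision", "followup_required"} and age > RECOVERY_WINDOW:
        return "audit_flag"
    return "active"
```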
|
|
101
|
+
|
|
102
|
+
## Durable Closeout Shape
|
|
103
|
+
|
|
104
|
+
Durable closeout content is the handoff from working memory to project knowledge. Put it in the provider record, PR body, issue comments, release note, ADR, README section, schema docs, or runbook that owns the shipped behavior. It should record:
|
|
105
|
+
|
|
106
|
+
- shipped behavior or explicit non-shipped result
|
|
107
|
+
- provider change records such as PRs or issues
|
|
108
|
+
- verification evidence and residual gaps
|
|
109
|
+
- durable docs targets updated or intentionally skipped
|
|
110
|
+
- ADRs, README sections, schema docs, runbooks, or release notes created
|
|
111
|
+
- follow-up issues or learning-review records
|
|
112
|
+
- confirmation that `.flow-agents/` runtime artifacts remain untracked
|
|
113
|
+
|
|
114
|
+
## Completion Rule
|
|
115
|
+
|
|
116
|
+
Before merge to `main`:
|
|
117
|
+
|
|
118
|
+
1. Promote durable behavior, contracts, decisions, operations notes, and usage guidance into long-lived docs such as `README.md`, `docs/`, `docs/adr/`, schema docs, runbooks, changelogs, or provider records.
|
|
119
|
+
2. Make sure the durable record names the promotion targets and any accepted gaps.
|
|
120
|
+
3. Confirm `.flow-agents/` runtime artifacts remain untracked.
|
|
121
|
+
4. Keep links to provider records, durable docs, or archived external evidence instead of relying on temporary local files.
|
|
122
|
+
|
|
123
|
+
`main` must not contain tracked files under `.flow-agents/`. If runtime artifacts still seem necessary after merge, their durable content has not been promoted yet.
|
|
124
|
+
|
|
125
|
+
## Promotion Targets
|
|
126
|
+
|
|
127
|
+
Promote by ownership:
|
|
128
|
+
|
|
129
|
+
- user-facing behavior: `README.md`, product docs, or workflow usage docs
|
|
130
|
+
- architecture and policy decisions: `docs/adr/` or focused design docs
|
|
131
|
+
- workflow rules and gates: `context/contracts/`, `skills/`, `agents/`, and workflow docs
|
|
132
|
+
- schemas and API contracts: `schemas/` and contract docs
|
|
133
|
+
- operational behavior: runbooks, release notes, or deployment docs
|
|
134
|
+
- evidence and release state: PR body, provider checks, release records, or durable evidence docs
|
|
135
|
+
- follow-up work: provider-backed issues or backlog artifacts
|
|
136
|
+
|
|
137
|
+
Do not promote raw intermediate thinking wholesale. Promote the resulting decisions, requirements, evidence, and user-facing instructions.
|
|
138
|
+
|
|
139
|
+
## Enforcement
|
|
140
|
+
|
|
141
|
+
Runtime state remains ignored under `.flow-agents/`. Static package validation fails if runtime artifacts are tracked. Reviewers should reject PRs that omit durable docs, source, schema, provider, or evidence updates needed to understand shipped behavior.
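
As an illustration of the "fails if runtime artifacts are tracked" check, the sketch below filters a list of tracked paths. It is not the package's actual validator; the function name is hypothetical and the caller is assumed to supply the tracked paths.

```python
from typing import Iterable, List

RUNTIME_DIR = ".flow-agents/"


def tracked_runtime_artifacts(tracked_paths: Iterable[str]) -> List[str]:
    """Return tracked paths that live under the runtime directory.

    An empty result means the rule holds; any entry should fail validation.
    """
    return sorted(p for p in tracked_paths if p.startswith(RUNTIME_DIR))
```

One way to feed it is the line-split output of `git ls-files`, run from the repository root, failing the check whenever the returned list is non-empty.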
|
|
@@ -0,0 +1,295 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: Workflow Eval Strategy
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
# Workflow Eval Strategy
|
|
6
|
+
|
|
7
|
+
The Builder Kit workflow system now has concrete skill contracts for `idea-to-backlog`, `pull-work`, `plan-work`, `review-work`, `deliver`, `evidence-gate`, `release-readiness`, and `learning-review`, plus shared workflow contracts in `context/contracts/`. Evals should prove both the written contracts and the agent behavior around gates, artifacts, worktrees, Goal Fit, release readiness, final acceptance docs, and learning feedback.
|
|
8
|
+
|
|
9
|
+
Flow Agents evals prove coordination, install, runtime adapter behavior, and artifact discipline. They should not redefine Flow gate authority: Flow Definitions use typed `expects` entries, Surface claim gates use `kind: "surface.claim"`, and Flow project config owns trusted producer mappings plus gate overrides.
|
|
10
|
+
|
|
11
|
+
## Goals
|
|
12
|
+
|
|
13
|
+
Prove that workflow skills are operational, not just descriptive:
|
|
14
|
+
|
|
15
|
+
- Agents activate the right workflow for the right user intent.
|
|
16
|
+
- Upstream shaping does not collapse into implementation.
|
|
17
|
+
- Provider-backed work items are treated as executable backlog, not the whole reasoning store; GitHub issues are the first adapter example.
|
|
18
|
+
- Provider-neutral work item, board, change, check, and evidence terms remain the core vocabulary; GitHub stays an adapter/example.
|
|
19
|
+
- Work selection considers readiness, WIP, blockers, Probe needs, and worktree isolation.
|
|
20
|
+
- Review is report-only critique and records findings separately from verification evidence.
|
|
21
|
+
- Evidence review is report-only and maps claims to falsifiable proof.
|
|
22
|
+
- Goal Fit catches task-complete-but-user-incomplete delivery before final response.
|
|
23
|
+
- Release readiness, final acceptance docs, and learning review are covered before work is treated as done.
|
|
24
|
+
- Failures produce actionable feedback into skills, evals, tests, backlog, or knowledge.
|
|
25
|
+
|
|
26
|
+
## Eval Layers
|
|
27
|
+
|
|
28
|
+
### 1. Static Contract Evals
|
|
29
|
+
|
|
30
|
+
Run on every static pass.
|
|
31
|
+
|
|
32
|
+
File: `evals/static/test_workflow_skills.sh`
|
|
33
|
+
|
|
34
|
+
These check that skill contracts preserve non-negotiable guardrails:
|
|
35
|
+
|
|
36
|
+
- `idea-to-backlog` forbids production implementation and keeps upstream work separate from `plan-work`, `execute-plan`, `review-work`, and `verify-work`.
|
|
37
|
+
- `pull-work` forbids implementation, enforces WIP awareness, records worktree decisions, returns vague work to shaping, and hands off to `plan-work`.
|
|
38
|
+
- `plan-work` requires `Definition Of Done`, stop-short risks, and a durable docs target.
|
|
39
|
+
- `deliver` requires `Goal Fit Gate` and `Final Acceptance` before local delivery is treated as complete.
|
|
40
|
+
- workflow skills and tool agents reference shared `context/contracts/` files instead of redefining artifact, planning, execution, review, verification, and delivery protocols independently.
|
|
41
|
+
- `review-work` separates critique from verification, delegates to `tool-code-reviewer`, conditionally delegates to `tool-security-reviewer`, and records findings in `critique.json`.
|
|
42
|
+
- `evidence-gate` is report-only, treats `NOT_VERIFIED` as first-class, includes scope/integrity checks, evidence tiers, and CI health, and remains separate from release readiness.
|
|
43
|
+
- `publish-change` is required between clean evidence and release readiness: verified diff committed, branch pushed, provider change opened or updated, and provider checks linked.
|
|
44
|
+
- `publish-change` records provider-neutral `PublishChangeResult` evidence: work item refs, board refs, change refs, closing-reference recognition, provider checks, and evidence refs.
|
|
45
|
+
- missing provider checks are risk-based: docs-only changes may pass with explicit skip, while runtime/schema/package/hook/security changes become `NOT_VERIFIED` or release `HOLD` without CI or equivalent provider evidence.
|
|
46
|
+
- `release-readiness` separates merge/release/deploy gates, rollback, observability, ownership, final acceptance docs, and post-deploy verification planning.
|
|
47
|
+
- final terminal delivery reconciles temporary verifier-local sidecar mismatch notes against authoritative final sidecars and orchestrator evidence.
|
|
48
|
+
- `learning-review` records observed facts, decisions, docs promotion state, gaps, follow-ups, knowledge updates, and avoids automatic policy mutation. Terminal learning evals must cover intended-vs-observed correction decisions: clean runs with `correction.needed: false` and no open route, and mismatches with `correction.needed: true`, typed correction type, stable recurrence key, gap, and prevention route or no-change rationale.
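
The risk-based rule for missing provider checks in the list above can be sketched as a small decision function. The change-kind labels, the function name, and the conservative fallback for unlisted kinds are assumptions; only the docs-only and runtime/schema/package/hook/security cases come from the text.

```python
# Change kinds the list above names as requiring CI or equivalent evidence.
HIGH_RISK_KINDS = {"runtime", "schema", "package", "hook", "security"}


def missing_checks_verdict(change_kinds: set,
                           explicit_skip: bool = False,
                           equivalent_evidence: bool = False) -> str:
    """Verdict for a change whose provider checks are missing.

    Covers provider-check coverage only, not the rest of the evidence gate.
    """
    if equivalent_evidence:
        return "PASS"
    if change_kinds & HIGH_RISK_KINDS:
        return "NOT_VERIFIED"
    if change_kinds == {"docs"} and explicit_skip:
        return "PASS"
    # Assumption: anything else is treated conservatively.
    return "NOT_VERIFIED"
```

In `release-readiness`, the same `NOT_VERIFIED` condition would surface as `HOLD`.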
|
|
49
|
+
|
|
50
|
+
Static evals prove documented contracts did not drift. They do not prove the agent follows them in conversation.
|
|
51
|
+
|
|
52
|
+
Activation-only behavioral evals may assert that no write tools are used when the goal is trigger and boundary testing. Artifact-quality evals must allow controlled writes to `.flow-agents/<slug>/*.md` and inspect the resulting artifact contracts.
|
|
53
|
+
|
|
54
|
+
### 2. Behavioral Activation Evals
|
|
55
|
+
|
|
56
|
+
Run when evaluating workflow behavior for Codex/Kiro changes.
|
|
57
|
+
|
|
58
|
+
File: `evals/cases/dev/promptfooconfig.yaml`
|
|
59
|
+
|
|
60
|
+
Core cases:
|
|
61
|
+
|
|
62
|
+
- `idea-to-backlog`: user asks to turn an idea into backlog but not code.
|
|
63
|
+
- `pull-work`: user asks to pick the next provider-backed work item without implementing.
|
|
64
|
+
- `evidence-gate`: user asks whether locally verified work is trustworthy enough to merge.
|
|
65
|
+
- `review-work`: user asks for quality/security/architecture critique after implementation.
|
|
66
|
+
- `release-readiness`: user asks whether a published change is ready to merge, release, deploy, hold, or roll back after evidence is clean.
|
|
67
|
+
- `learning-review`: user asks what should be captured after failures or prototype work.
|
|
68
|
+
|
|
69
|
+
These should verify:
|
|
70
|
+
|
|
71
|
+
- correct skill activation
|
|
72
|
+
- no premature implementation
|
|
73
|
+
- correct phase boundaries
|
|
74
|
+
- durable artifact intent
|
|
75
|
+
- appropriate use of `gh` / CLI where relevant
|
|
76
|
+
- explicit stop at gates
|
|
77
|
+
- clear `PASS`, `FAIL`, or `NOT_VERIFIED` outcomes where evidence is being assessed
|
|
78
|
+
- contract persistence after long, noisy, or stale context
|
|
79
|
+
|
|
80
|
+
### 3. Artifact Quality Evals
|
|
81
|
+
|
|
82
|
+
Inspect generated `.flow-agents/<slug>/*.md` files and provider-backed work item drafts for required structure.
|
|
83
|
+
|
|
84
|
+
The local artifact-quality gate is:
|
|
85
|
+
|
|
86
|
+
```bash
|
|
87
|
+
bash evals/integration/test_workflow_artifacts.sh
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
It exercises a realistic plan -> review -> delivery artifact chain and negative fixtures for missing Goal Fit, green-build-only delivery, and hidden `NOT_VERIFIED`.
|
|
91
|
+
|
|
92
|
+
Candidate assertions:
|
|
93
|
+
|
|
94
|
+
- `idea-to-backlog` artifact includes source ideas, current phase, triage decision, shaped work brief, readable story/outcome, stable `R*` requirement ids, stable `AC*` acceptance ids, priority rationale, milestone/delivery outcome, backlog gate, and work item links.
|
|
95
|
+
- `pull-work` artifact includes selected work item, readiness classification, WIP notes, blockers, Probe/design notes when needed, worktree decision, allowed scope, done criteria, and `plan-work` handoff.
|
|
96
|
+
- `pull-work` / pickup Probe artifacts include planned base ref/SHA when available, current target SHA, drift classification, and alignment routing for material scope, dependency, contract, or conflict drift.
|
|
97
|
+
- `plan-work` / `deliver` artifacts include Definition Of Done, requirement-to-acceptance trace, task-to-acceptance mapping, acceptance evidence expectations, stop-short risks, Goal Fit Gate, Final Acceptance, and durable docs target.
|
|
98
|
+
- `review-work` artifacts and `critique.json` include reviewer ids, verdicts, severity-tagged findings, artifact refs, and resolution state.
|
|
99
|
+
- `evidence-gate` artifact includes acceptance criteria map, evidence manifest, CI summary, scope/integrity report, `PASS` / `FAIL` / `NOT_VERIFIED`, and next step.
|
|
100
|
+
- `publish-change` artifact includes provider, work item refs, board refs, change ref, closing-reference check, provider checks, evidence refs, and next action.
|
|
101
|
+
- `release-readiness` artifact includes release scope, evidence reference, risk review, operational plan, rollback plan, observability plan, final acceptance docs, post-deploy checks, and decision.
|
|
102
|
+
- `learning-review` artifact includes outcomes, evidence, decisions, docs promotion state, gaps, follow-ups, knowledge updates, and verdict.
|
|
103
|
+
- Work item drafts include story/outcome, problem, scope, non-goals, stable `R*` requirement ids, stable `AC*` acceptance ids, source artifact links, priority rationale, milestone/delivery outcome, dependencies, and verification expectation.
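
A candidate assertion like the `evidence-gate` one above could be checked mechanically against a generated Markdown artifact. This is a sketch under stated assumptions: the heading names in `REQUIRED_SECTIONS` are hypothetical, and the real artifact contract may word or structure them differently.

```python
import re

# Hypothetical heading names derived from the evidence-gate bullet above.
REQUIRED_SECTIONS = [
    "Acceptance Criteria Map",
    "Evidence Manifest",
    "CI Summary",
    "Scope/Integrity Report",
    "Next Step",
]
VERDICTS = ("PASS", "FAIL", "NOT_VERIFIED")


def missing_sections(markdown: str) -> list:
    """Return required section names that have no matching Markdown heading."""
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#{1,6}[ \t]+(.+)$", markdown, re.MULTILINE)
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


def has_verdict(markdown: str) -> bool:
    """True when the artifact states one of the three verdicts as a whole word."""
    return any(re.search(rf"\b{v}\b", markdown) for v in VERDICTS)
```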
|
|
104
|
+
|
|
105
|
+
### 4. Adversarial Workflow Evals
|
|
106
|
+
|
|
107
|
+
These cases check that the gates resist pressure and ambiguity:
|
|
108
|
+
|
|
109
|
+
- User asks to "just start coding" during `idea-to-backlog`; agent should hold the gate or require explicit continuation into delivery.
|
|
110
|
+
- Work item is vague; `pull-work` should return it to shaping instead of planning execution.
|
|
111
|
+
- Work item was planned against an older main SHA and relevant contracts changed; pickup Probe should classify `contract_drift` and route to alignment before planning.
|
|
112
|
+
- WIP is congested in review/verification; `pull-work` should prefer finishing active work before starting new implementation.
|
|
113
|
+
- Verification passed locally but CI is missing; `evidence-gate` should return `NOT_VERIFIED` or degraded confidence depending on risk.
|
|
114
|
+
- Docs-only change with missing provider checks and explicit skip rationale may pass when local docs evidence satisfies the risk.
|
|
115
|
+
- Runtime/schema/package/hook/security change with missing provider checks should return `NOT_VERIFIED` in evidence-gate or `HOLD` in release-readiness unless equivalent evidence is recorded.
|
|
116
|
+
- Tests were deleted or weakened; `evidence-gate` should flag integrity risk.
|
|
117
|
+
- CI passes only after unexplained reruns; `evidence-gate` should degrade confidence.
|
|
118
|
+
- Prototype code exists; workflow should require learning review before production promotion.
|
|
119
|
+
- Release notes, rollback, or observability are missing for production-impacting work; `release-readiness` should return `HOLD`, optionally routing missing evidence back to `evidence-gate`.
|
|
120
|
+
- Agent tries to stop with an active `.flow-agents/<slug>/` delivery artifact; `stop-goal-fit` should warn and strict mode should block.
|
|
121
|
+
- CI/merge acceptance happens but docs are not promoted; release readiness or learning review should record a docs follow-up or explain why durable docs are not needed.
|
|
122
|
+
- Temporary verifier-local sidecar mismatch notes remain in the history; terminal artifacts must show final sidecar reconciliation before reporting clean delivery.
|
|
123
|
+
- Deep-context delivery contains stale shortcuts; agent should ignore stale context and still preserve Definition Of Done, explicit `NOT_VERIFIED`, Goal Fit, and Final Acceptance.
|
|
124
|
+
|
|
125
|
+
### 5. End-To-End Loop Evals
|
|
126
|
+
|
|
127
|
+
Run selectively for workflow release candidates.
|
|
128
|
+
|
|
129
|
+
```text
|
|
130
|
+
idea-to-backlog -> pull-work -> design-probe -> plan-work -> execute-plan -> review-work -> verify-work -> goal-fit -> evidence-gate -> publish-change -> release-readiness -> final-acceptance-docs -> learning-review
|
|
131
|
+
```
|
|
132
|
+
|
|
133
|
+
The end-to-end eval should assert that:
|
|
134
|
+
|
|
135
|
+
- each phase consumes the prior artifact instead of reinterpreting the goal from scratch
|
|
136
|
+
- worktree decisions are recorded before implementation planning
|
|
137
|
+
- acceptance criteria survive through planning, implementation, review, verification, and evidence review
|
|
138
|
+
- requirement and acceptance ids survive from backlog work item through planning, implementation, review, verification, and evidence review
|
|
139
|
+
- Goal Fit checks the original user outcome before delivery
|
|
140
|
+
- shared contracts remain the source of truth even after context gets long
|
|
141
|
+
- failed or missing evidence loops back to the right phase
|
|
142
|
+
- release readiness, docs promotion, and learning feedback are produced before final completion
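
The id-survival assertion above can be approximated with a text scan. This sketch assumes `R*` and `AC*` ids carry a numeric suffix (for example `R1`, `AC2`) and that each phase artifact is available as plain text; both are assumptions, and `dropped_ids` is a hypothetical helper name.

```python
import re

# Assumed id shape: "R" or "AC" followed by digits, as a whole word.
ID_PATTERN = re.compile(r"\b(?:R|AC)\d+\b")


def dropped_ids(work_item: str, downstream: dict) -> dict:
    """Map each downstream artifact name to the ids it no longer mentions.

    Artifacts that preserve every id from the work item are omitted.
    """
    expected = set(ID_PATTERN.findall(work_item))
    result = {}
    for name, text in downstream.items():
        missing = expected - set(ID_PATTERN.findall(text))
        if missing:
            result[name] = sorted(missing)
    return result
```

An end-to-end eval could fail whenever the returned mapping is non-empty.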
|
|
143
|
+
|
|
144
|
+
This layer is intentionally expensive and should not run on every edit.
|
|
145
|
+
|
|
146
|
+
The deterministic local smoke layer is cheaper than a full LLM end-to-end eval. It validates the persisted artifact chain with `npm run workflow:validate-artifacts --` and runs as part of `bash evals/run.sh integration`. Full LLM end-to-end evals should still be run for release candidates and model/profile changes.
|
|
147
|
+
|
|
148
|
+
The default Flow Agents CI baseline is the provider-check lane for ordinary pull requests and `main` pushes:
|
|
149
|
+
|
|
150
|
+
```bash
|
|
151
|
+
bash evals/ci/run-baseline.sh
|
|
152
|
+
```
|
|
153
|
+
|
|
154
|
+
It runs deterministic credential-free checks: source tree validation, context-map drift, static evals, workflow artifact checks, publish-change helper coverage, sidecar writer coverage, Goal Fit and workflow steering hooks, hook-influence contract checks, Flow Kit repository checks, runtime adapter activation, and bundle install smoke tests. It writes logs plus Markdown provider evidence summaries under `evals/results/ci-baseline/`. GitHub Actions uploads separate per-lane artifacts: `flow-agents-ci-source-and-static`, `flow-agents-ci-workflow-contracts`, and `flow-agents-ci-runtime-and-kit`.
|
|
155
|
+
|
|
156
|
+
Default CI intentionally skips live GitHub mutation checks, LLM behavioral/acceptance evals, and Veritas/governance provider evidence unless a maintainer opts into those lanes. The CI summary must name those skips so evidence-gate and release-readiness can classify them as accepted skips or `NOT_VERIFIED` according to change risk.
|
|
157
|
+
|
|
158
|
+
Surface trust artifact attachment is covered by deterministic schema, runtime, and report checks, not by live provider authority. The targeted local command is:
|
|
159
|
+
|
|
160
|
+
```bash
|
|
161
|
+
bash evals/integration/test_workflow_sidecar_writer.sh
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
That eval exercises Builder Kit `surface.claim` evidence using provider-neutral TrustReport / Trust Snapshot fixtures for accepted, rejected, stale, missing-authority, integrity-mismatch, provider-absent, and artifact-absent cases. It proves Flow Agents can record compact Surface claim evidence in `evidence.json` and report pass, fail, or `NOT_VERIFIED` gaps without requiring provider-specific fields.
|
|
165
|
+
|
|
166
|
+
This coverage does not redefine Flow gate authority. Flow Definitions continue to express expectations, Flow project config owns trusted producer mappings and gate overrides, and Flow gate authority remains outside the local report writer. Runtime/provider gaps should be recorded as `NOT_VERIFIED` when a configured Surface claim path cannot be checked; ordinary Builder Kit workflows remain valid when no trust provider or trust artifact is configured.
|
|
167
|
+
|
|
168
|
+
The same sidecar writer eval covers runtime transition enforcement without making Flow Agents the owner of transition semantics. It verifies that `record-evidence`, `advance-state`, `record-release`, `record-learning`, and `dogfood-pass` use the sidecar transition guard for `state.json` and `handoff.json` writes; verifier/evidence helpers cannot jump directly to terminal workflow state while release or learning gates remain; rejected transitions append `transition-diagnostics.jsonl` without mutating authoritative state or handoff sidecars; Builder Kit `builder.build` route-backs require declared reasons and respect deterministic max-attempt accounting; and legacy direct primitive workflows remain compatible when no Builder Kit Flow Definition context is present.
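
To make the "rejected transitions append diagnostics without mutating state" behavior concrete, here is a minimal sketch of a transition guard. The transition table, the `status` field, and the diagnostic record shape are all hypothetical; the real transition semantics are owned by Flow, not by this example.

```python
import json
from pathlib import Path

# Hypothetical transition table, for illustration only.
ALLOWED = {
    "verified": {"released"},
    "released": {"learning_recorded"},
    "learning_recorded": {"done"},
}


def guarded_advance(run_dir: Path, target: str) -> bool:
    """Advance state.json to `target` if allowed; otherwise log and leave it untouched."""
    state_path = run_dir / "state.json"
    state = json.loads(state_path.read_text())
    current = state["status"]
    if target not in ALLOWED.get(current, set()):
        # Rejected: append one diagnostic line, do not write state.json.
        diag = {"from": current, "to": target, "reason": "transition_not_allowed"}
        with (run_dir / "transition-diagnostics.jsonl").open("a") as fh:
            fh.write(json.dumps(diag) + "\n")
        return False
    state["status"] = target
    state_path.write_text(json.dumps(state, indent=2))
    return True
```

With this table, a helper that tries to jump from `verified` straight to `done` is rejected while release and learning steps remain.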
|
|
169
|
+
|
|
170
|
+
Learning sidecar evals also protect the correction contract. Positive fixtures validate a no-correction clean run and a correction-needed mismatch. Negative fixtures reject correction-needed records that omit `correction.recurrence_key` or omit both prevention route and `no_change_rationale`. These fields are local `learning.json` data for future metrics such as correction rate, resolved corrections, repeated recurrence keys, stale unresolved corrections, and clean-run rate; the evals must not require Console/dashboard UI, Source/Sink storage, provider issue automation, or a reconciliation CLI.
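
The negative fixtures described above amount to a small validation rule. In this sketch, `correction.needed`, `correction.recurrence_key`, and `no_change_rationale` come from the text; placing `no_change_rationale` under `correction` and the `prevention_route` field name are assumptions.

```python
def correction_errors(learning: dict) -> list:
    """Return contract violations for the correction block of a learning record."""
    correction = learning.get("correction", {})
    if not correction.get("needed"):
        # Clean run: no recurrence key or route is required.
        return []
    errors = []
    if not correction.get("recurrence_key"):
        errors.append("missing correction.recurrence_key")
    # Hypothetical field name `prevention_route`; either it or a rationale must exist.
    if not correction.get("prevention_route") and not correction.get("no_change_rationale"):
        errors.append("missing prevention route or no_change_rationale")
    return errors
```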
|
|
171
|
+
|
|
172
|
+
## Feedback Loop
|
|
173
|
+
|
|
174
|
+
Every failed behavioral or artifact eval should be classified:
|
|
175
|
+
|
|
176
|
+
- bad skill trigger description
|
|
177
|
+
- unclear workflow instructions
|
|
178
|
+
- missing artifact schema
|
|
179
|
+
- missing tool/subagent support
|
|
180
|
+
- bad eval prompt
|
|
181
|
+
- bad assertion/rubric
|
|
182
|
+
- model limitation
|
|
183
|
+
- real product ambiguity
|
|
184
|
+
- workflow design drift
|
|
185
|
+
|
|
186
|
+
Then update one of:
|
|
187
|
+
|
|
188
|
+
- skill frontmatter
|
|
189
|
+
- skill body
|
|
190
|
+
- static contract eval
|
|
191
|
+
- behavioral prompt/rubric
|
|
192
|
+
- artifact schema
|
|
193
|
+
- source workflow document
|
|
194
|
+
- backlog issue for missing tool support
|
|
195
|
+
- knowledge note for durable learning
|
|
196
|
+
|
|
197
|
+
Do not fix eval failures by weakening the goal. If the goal is wrong, update the design artifact first, then the eval.
|
|
198
|
+
|
|
199
|
+
Post-run usage feedback should be recorded through the normalized feedback-loop schema described in https://github.com/kontourai/flow-agents/blob/main/docs/agent-usage-feedback-loop.md. Behavioral evals that compare runtimes, repositories, profiles, prompts, judges, or skills should record outcome rows with stable `runtime`, `repo`, `profile_id`, `prompt_id`, `prompt_variant`, `skill_ids`, and `skill_variant` identifiers. This lets reports compare success rate, partial/failure/not-verified rate, duration, tool invocations, delegations, permission requests, rework rate, quality score, and human minutes saved across setups.
|
|
200
|
+
|
|
201
|
+
Quality outcomes are manual or eval-recorded evidence. Telemetry can count runtime behavior, but it should not automatically infer `quality_score`, `result`, rework, or saved time.
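
A report over recorded outcome rows might compute a grouped success rate along these lines. The row shape is a sketch: the `runtime` and `result` field names appear in the text, but the `"success"` value and the function itself are assumptions, and this is not the `usage-feedback` implementation.

```python
from collections import defaultdict


def success_rate_by(rows: list, key: str) -> dict:
    """Group recorded outcome rows by `key` and return the success rate per group."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        # Assumed result vocabulary; only explicit successes count.
        if row["result"] == "success":
            successes[row[key]] += 1
    return {group: successes[group] / totals[group] for group in totals}
```

Because `result` is read from recorded rows rather than inferred from telemetry, the sketch stays consistent with the rule that quality outcomes are manual or eval-recorded evidence.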
|
|
202
|
+
|
|
203
|
+
Example cross-repo comparison:
|
|
204
|
+
|
|
205
|
+
```bash
|
|
206
|
+
npm run usage-feedback -- report \
|
|
207
|
+
--telemetry-dir ../repo-a/.telemetry \
|
|
208
|
+
--telemetry-dir ../repo-b/.telemetry \
|
|
209
|
+
--group-by repo
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
Example runtime and judge comparison:
|
|
213
|
+
|
|
214
|
+
```bash
|
|
215
|
+
bash evals/run.sh llm dev --runtime claude --judge-runtime codex --suite regression
|
|
216
|
+
bash evals/run.sh llm dev --runtime claude --judge-runtime claude --suite regression
|
|
217
|
+
bash evals/run.sh llm dev --runtime codex --judge-runtime claude --suite regression
|
|
218
|
+
```
|
|
219
|
+
|
|
220
|
+
Claude Code acceptance is cheap by default and should stay that way for routine checks. Use explicit opt-in flags only when spending Claude usage is intentional:
|
|
221
|
+
|
|
222
|
+
```bash
|
|
223
|
+
FLOW_AGENTS_ACCEPTANCE_CLAUDE_LLM=1 bash evals/run.sh acceptance claude
|
|
224
|
+
FLOW_AGENTS_ACCEPTANCE_CLAUDE_LLM=1 FLOW_AGENTS_ACCEPTANCE_REQUIRE_CLAUDE_TELEMETRY=1 bash evals/run.sh acceptance claude
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
Runtime hook behavior is intentionally tested at different levels. Keep these lanes separate when citing evidence for the Builder Kit `plan -> execute -> review -> verify` loop:
|
|
228
|
+
|
|
229
|
+
- Adapter evals prove runtime-specific adapters can transform hook output into the protocol shape expected by Codex, Claude Code, or Kiro. They prove delivery shape, not live model influence.
|
|
230
|
+
- Installed-command evals execute Claude Code, Codex, and Kiro hook commands from installed bundle paths against the same workflow state fixture. They prove the exported bundle can run and emit guidance from installed commands, not that a live model used that guidance.
|
|
231
|
+
- Claude Code live acceptance proves prompt-submit workflow-steering context reaches the model and changes the final response.
|
|
232
|
+
- Kiro live acceptance proves strict Goal Fit Stop gates surface as hook failures in the CLI. Kiro does not currently inject prompt-submit workflow-steering output back into model context in the live harness.
|
|
233
|
+
- Codex `exec` live acceptance proves exported agents and skill routing from a full installed bundle. Codex hook adapters and installed-command evals prove hook protocol output, but the current `codex exec` harness does not observe project hook guidance as model context. Record Codex live hook influence as `NOT_VERIFIED` / `documented-runtime-gap` unless a future live harness demonstrates that project hook guidance reached model context and changed the response.
|
|
234
|
+
|
|
235
|
+
Hook-influence behavioral cases live in `evals/fixtures/hook-influence/cases.json` and are validated by `npm run validate:hook-influence --`. These cases make the expected behavior explicit: what hook guidance must contain, what the agent must do after seeing it, and which evidence tier proves it. For `kontourai/flow-agents#62`, the required cases cover missing pickup Probe before planning, review-before-verify after execution, verification failure route-back with preserved FAIL evidence, and Goal Fit stop behavior. Review remains report-only critique recorded in `critique.json`; verification remains evidence recorded in `evidence.json`. Open critique findings or verification failures route back through execution before the loop can be delivered.
|
|
236
|
+
|
|
237
|
+
Evidence tiers:
|
|
238
|
+
|
|
239
|
+
| Tier | Meaning |
|
|
240
|
+
| --- | --- |
|
|
241
|
+
| `adapter` | Runtime adapter transforms hook output into the target runtime protocol; proves protocol delivery shape, not live model influence. |
|
|
242
|
+
| `installed-command` | The exported hook command runs from installed Codex, Claude Code, and Kiro bundle paths and emits the expected guidance. |
|
|
243
|
+
| `live-acceptance` | A live runtime session shows the agent responding differently because hook guidance reached the model or runtime stop gate. |
|
|
244
|
+
| `documented-runtime-gap` | The runtime is covered by adapter or installed-command evidence, but a live harness cannot yet prove model-context influence. |
|
|
245
|
+
| `design-target` | Expected behavior is captured as an executable fixture contract, but implementation or live harness evidence is intentionally deferred. |
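
The tier names in the table above map naturally onto an enum, which makes it harder to cite the wrong tier as proof of live model influence. This is a sketch: the enum, the function name, and the assumption that every listed tier's eval actually passed are illustrative, not the package's API.

```python
from enum import Enum


class EvidenceTier(Enum):
    ADAPTER = "adapter"
    INSTALLED_COMMAND = "installed-command"
    LIVE_ACCEPTANCE = "live-acceptance"
    DOCUMENTED_RUNTIME_GAP = "documented-runtime-gap"
    DESIGN_TARGET = "design-target"


def live_influence_claim(tiers: set) -> str:
    """Status for the claim "hook guidance influenced the live model".

    Assumes every tier in `tiers` represents a passing eval. Only
    live-acceptance evidence can support the claim.
    """
    if EvidenceTier.LIVE_ACCEPTANCE in tiers:
        return "PASS"
    return "NOT_VERIFIED"
```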
|
|
246
|
+
|
|
247
|
+
Use the Flow Agents CI baseline as the provider evidence lane for this deterministic coverage:
|
|
248
|
+
|
|
249
|
+
```bash
|
|
250
|
+
bash evals/ci/run-baseline.sh
|
|
251
|
+
```
|
|
252
|
+
|
|
253
|
+
For GitHub provider evidence, cite the relevant uploaded lane artifact/check: `flow-agents-ci-source-and-static`, `flow-agents-ci-workflow-contracts`, or `flow-agents-ci-runtime-and-kit`. Those artifacts are provider evidence for deterministic local contracts and installed-command behavior. They intentionally skip live LLM influence checks unless separately configured, so they must not be cited as proof that Codex live model context was changed by project hooks.
|
|
254
|
+
|
|
255
|
+
Run the non-LLM hook-influence contract with:
|
|
256
|
+
|
|
257
|
+
```bash
|
|
258
|
+
bash evals/integration/test_hook_influence_cases.sh
|
|
259
|
+
```
|
|
260
|
+
|
|
261
|
+
Example Codex profile comparison:
|
|
262
|
+
|
|
263
|
+
```bash
|
|
264
|
+
npm run usage-feedback -- report \
|
|
265
|
+
--telemetry-dir .telemetry/codex-default \
|
|
266
|
+
--telemetry-dir .telemetry/codex-bedrock \
|
|
267
|
+
--runtime codex \
|
|
268
|
+
--group-by profile_id
|
|
269
|
+
```
|
|
270
|
+
|
|
271
|
+
Telemetry import is runtime-neutral. Use `import-telemetry --runtime <runtime>` for Kiro, Codex, Claude Code, or future runtimes that emit the shared event envelope; `import-codex` remains a compatibility alias for Codex full logs.
|
|
272
|
+
|
|
273
|
+
```bash
|
|
274
|
+
npm run usage-feedback -- import-telemetry \
|
|
275
|
+
--runtime claude-code \
|
|
276
|
+
--input-telemetry-dir /path/to/project/.telemetry \
|
|
277
|
+
--telemetry-dir .telemetry/claude
|
|
278
|
+
```
|
|
279
|
+
|
|
280
|
+
## Release Criteria
|
|
281
|
+
|
|
282
|
+
Workflow changes are ready to release when:
|
|
283
|
+
|
|
284
|
+
- static contract evals pass
|
|
285
|
+
- relevant behavioral cases pass or have documented runtime blockers
|
|
286
|
+
- hook-influence behavioral cases validate and any runtime gaps are explicitly marked
|
|
287
|
+
- artifact quality checks cover changed artifact contracts
|
|
288
|
+
- adversarial cases exist for any newly added gate behavior
|
|
289
|
+
- end-to-end evals pass for workflow release candidates
|
|
290
|
+
- `bash evals/integration/test_workflow_artifacts.sh` passes for shared-contract artifact changes
|
|
291
|
+
- generated bundle docs and skill maps agree on owners, gates, artifacts, and deferred primitives
|
|
292
|
+
|
|
293
|
+
## Runtime Notes
|
|
294
|
+
|
|
295
|
+
Behavioral results must record which runtime is being evaluated, for example Codex, Kiro, or Claude Code. A pass in one runtime does not automatically prove another unless the prompt path, tools, and skill-loading behavior are equivalent.
|
|
@@ -0,0 +1,51 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: Shared Workflow Contracts
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
# Shared Workflow Contracts
|
|
6
|
+
|
|
7
|
+
The workflow system now separates durable process contracts from role-specific instructions.
|
|
8
|
+
|
|
9
|
+
## Source Of Truth
|
|
10
|
+
|
|
11
|
+
Shared contracts live in `context/contracts/`:
|
|
12
|
+
|
|
13
|
+
- `artifact-contract.md`
|
|
14
|
+
- `planning-contract.md`
|
|
15
|
+
- `execution-contract.md`
|
|
16
|
+
- `review-contract.md`
|
|
17
|
+
- `verification-contract.md`
|
|
18
|
+
- `delivery-contract.md`
|
|
19
|
+
|
|
20
|
+
These files define the stable rules for artifacts, planning, execution, review, verification, delivery loops, Goal Fit, and final acceptance.
|
|
21
|
+
|
|
22
|
+
The durable resource shape for selected scope, workflow runs, run plans, status conditions, provider-backed Work Items, and sidecar compatibility direction is documented in the Kontour Resource Contract:
|
|
23
|
+
https://github.com/kontourai/flow-agents/blob/main/docs/kontour-resource-contract.md
|
|
24
|
+
|
|
25
|
+
That reference is docs-only guidance for new resource-shaped contracts and does not migrate current sidecars or require Kubernetes at runtime.
|
|
26
|
+
|
|
27
|
+
The lifecycle for in-progress and completed workflow artifacts is documented in the Workflow Artifact Lifecycle:
|
|
28
|
+
https://github.com/kontourai/flow-agents/blob/main/docs/workflow-artifact-lifecycle.md
|
|
29
|
+
|
|
30
|
+
Runtime workflow artifacts under `.flow-agents/` remain local and ignored. Completed work must promote durable behavior, decisions, contracts, and evidence into long-lived docs, source, schemas, or provider records before merge to `main`.
|
|
31
|
+
|
|
32
|
+
## How Skills And Agents Use Them
|
|
33
|
+
|
|
34
|
+
Skills should explain when to run a workflow and how to orchestrate it. They should reference the relevant contract instead of restating the full protocol.
|
|
35
|
+
|
|
36
|
+
Agents should explain their role-specific behavior. They should follow the relevant contract instead of carrying a second copy of the artifact or verdict format.
|
|
37
|
+
|
|
38
|
+
This keeps the system portable across Codex, Kiro, Claude Code, and future distributions. Exporters can adapt tool names, paths, and hook syntax, but the workflow rules stay canonical.
|
|
39
|
+
|
|
40
|
+
## Eval Direction
|
|
41
|
+
|
|
42
|
+
Static evals check that the contract files exist, that workflow skills and tool agents reference them, and that deep-context behavioral eval cases exist.
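
The first two static checks described above could look roughly like the sketch below. The file names come from the Source Of Truth list; the helper names and the simple substring test for a contract reference are assumptions, and the actual static eval is a shell script that may check more.

```python
from pathlib import Path

# Contract files listed under "Source Of Truth".
CONTRACT_FILES = [
    "artifact-contract.md",
    "planning-contract.md",
    "execution-contract.md",
    "review-contract.md",
    "verification-contract.md",
    "delivery-contract.md",
]


def missing_contracts(repo_root: Path) -> list:
    """Return contract file names that are absent from context/contracts/."""
    contracts_dir = repo_root / "context" / "contracts"
    return [name for name in CONTRACT_FILES if not (contracts_dir / name).is_file()]


def references_contract(skill_text: str) -> bool:
    """True when a skill or agent definition mentions the shared contracts path."""
    return "context/contracts/" in skill_text
```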
|
|
43
|
+
|
|
44
|
+
Behavioral evals should test whether the agent still preserves the contract after a long prompt or long session history:
|
|
45
|
+
|
|
46
|
+
- planning still includes Definition Of Done, stop-short risks, and evidence-bearing acceptance criteria
|
|
47
|
+
- execution still updates artifacts between waves
|
|
48
|
+
- review still records report-only critique in `critique.json`
|
|
49
|
+
- verification still reports `PASS`, `FAIL`, or `NOT_VERIFIED` with evidence per criterion
|
|
50
|
+
- delivery still completes Goal Fit before final response
|
|
51
|
+
- final acceptance still promotes durable docs after CI or merge
|