@bolloon/bolloon-agent 0.1.0 → 0.1.1

Files changed (431)

package/src/bollharness/docs/launch-article-en.md DELETED

@@ -1,276 +0,0 @@
-# Anthropic Keeps Saying "Harness." We've Been Running One for Six Months.
----
-## 00
-Yesterday Anthropic launched Managed Agents.
-Their technical docs keep repeating one word: **Harness**.
-> Every Agent request should run in a governed environment.
-This isn't a new idea to us. We've been running a Harness on our own project for six months.
-Today we're open-sourcing it.
----
-## 01 What is a Harness, anyway?
-The word "Harness" is suddenly everywhere, but many people haven't figured out what it actually means.
-Simply put: **AI is great at doing work, but terrible at managing itself. A Harness is the external system that manages it.**
-Analogy: You hired a brilliant intern. Writes code fast, understands complex problems, but—
-- You tell them not to touch the production database. They sometimes forget.
-- You tell them to run tests before submitting. They say "done" but didn't actually run them.
-- You ask them to fix one bug. They refactor three other files you didn't ask about.
-- You ask them to review someone's code. While "reviewing," they edit the code.
-What does this intern need? Not a longer handbook. **Process, checks, mechanical enforcement.**
-That's a Harness.
-Anthropic's Managed Agents is a **cloud-hosted Harness** — they manage sandboxing, state, retries.
-Our bollharness is a **local development Harness** — we manage code quality, review flow, completion verification.
-Different layers, same core insight: **AI reliability can't be guaranteed by AI alone.**
----
-## 02 You've been here before
-You ask Claude Code to build a feature.
-It writes fast. Architecture looks solid. You go get coffee.
-You come back—
-**"All tests pass."** You run them. Three fail.
-**"Done."** You check. Two TODOs unhandled.
-**It fixes one bug.** Refactors three files you didn't ask it to touch.
-**It reviews code.** While "reviewing," it edits the code it was supposed to review.
-You tell it not to do something. It listens 80% of the time. The other 20%, it doesn't disobey on purpose—it genuinely forgets.
-You spend more time **supervising AI** than you **saved in development time**.
-This isn't a capability problem. Claude Code is remarkably capable.
-This is a **governance problem**.
----
-## 03 One number that changes everything
-We ran a production project (boll, an agent collaboration protocol) for 6 months. We collected a lot of data.
-One number changed everything:
-<!-- [Diagram ①: Six-Layer Governance Stack] -->
-> **CLAUDE.md instruction compliance: ~20%**
->
-> **Hook enforcement: 100%**
-The rules you write in CLAUDE.md are consistently followed only about 20% of the time.
-Not because the AI doesn't want to comply. Because context windows are finite, attention drifts, and long conversations compress early instructions. This is a **structural constraint** of LLMs, not a prompting skill issue.
-But hooks execute at 100%. Because hooks aren't requests. They're **physical constraints**.
-Traffic lights aren't suggestions. They're mechanical devices.
----
-## 04 Six layers: how we make AI develop autonomously
-You might ask: **"So how do you actually use AI for development? You don't have to supervise?"**
-Not quite. Humans still do three things: **set direction, correct course, accept PRs**.
-Everything in between — writing code, running tests, doing reviews, writing docs, splitting tasks, detecting drift — is fully automatic.
-Because six layers have your back.
----
-### Layer 1: Specialization — 16 Skills = 16 "Roles"
-Not one AI doing everything.
-Architect (`arch`), bug triage (`bug-triage`), failure pattern extraction (`crystal-learn`), state machine control (`lead`), handoff coordination (`harness-dev-handoff`)…
-Each Skill has clear input/output contracts. No boundary crossing. Like a proper engineering team — not a generalist, but a coordinated organization.
----
-### Layer 2: Mechanical Governance — Hooks = The Invisible Hand
-<!-- [Diagram ②: Hook Lifecycle] -->
-16 hooks across 7 lifecycle stages. Not post-hoc review — **intercepting at the moment of action**:
-| What you worry about | How hooks handle it |
-|---------------------|-------------------|
-| AI sneaks a deploy to production | PreToolUse blocks scp/rsync, exit 2 hard stop |
-| AI edits a file without knowing its rules | PostToolUse auto-injects that file's domain context |
-| AI edits the same file back and forth | PostToolUse detects the loop, raises a warning |
-| AI wants to quit but has uncommitted code | Stop hook checks transcript, blocks premature exit |
-| AI claims "tests pass" with no evidence | Mechanical gate checks progress.json — can't fake it |
-Not "reminding you not to." **Physically preventing it.**
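The first table row can be sketched as a pure decision function. This is an illustration only, not code from the package: `decidePreToolUse`, the `HookInput` shape, the `"Bash"` tool name, and the deny-list are assumptions. The one convention taken from the table is that exit code 2 means a hard stop.

```typescript
// Hypothetical input shape: a tool name plus the shell command it wants to run.
interface HookInput {
  toolName: string;
  command: string;
}

// Assumed deny-list: the table names scp and rsync as blocked deploy commands.
const DEPLOY_COMMANDS = ["scp", "rsync"];

// Returns the exit code a PreToolUse hook script would exit with:
// 2 = hard stop (the action is blocked), 0 = allow.
function decidePreToolUse(input: HookInput): 0 | 2 {
  if (input.toolName !== "Bash") return 0;
  // Split on shell separators so "cd dist && scp ..." is also caught.
  const segments = input.command.split(/&&|\|\||;|\|/);
  for (const segment of segments) {
    const firstWord = segment.trim().split(/\s+/)[0] ?? "";
    if (DEPLOY_COMMANDS.indexOf(firstWord) !== -1) return 2;
  }
  return 0;
}
```

Because the decision is made by a script rather than by the model, it does not depend on the model remembering an instruction. A real hook would also need to handle quoting, aliases, and wrapper scripts, which this sketch ignores.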
----
-### Layer 3: 8-Gate Quality Flow — You Can't Review Your Own Work
-<!-- [Diagram ③: 8-Gate State Machine] -->
-Every significant change must pass through 8 gates. 4 of them are **review gates** (Gate 2/4/6/8).
-How review gates work: automatically spawn an **independent review Agent**.
-- Independent context (doesn't share the main Agent's conversation history)
-- Read-only tools (physically can't modify code — more on this below)
-- Fresh review (not anchored by prior work)
-Why can't AI review its own work? Because when AI asks itself "did I do a good job?", the answer is always "yes."
-Ask a different AI, and the answer gets honest.
----
-### Layer 4: Self-Learning — Same Mistake Never Twice
-`crystal-learn` automatically extracts invariants from failures.
-Example: AI changes a backend API but forgets to update frontend calls → extracts the rule: **"When changing a contract, grep all consumers."**
-This rule doesn't sit in a document waiting to be remembered. It gets **mechanically injected into execution-layer Skills**, automatically active next time.
-Fail once, immunize forever.
----
-### Layer 5: Context Engineering — AI Doesn't Get Lost
-The most common AI mistake in long conversations: forgetting what it's doing.
-Three countermeasures:
-- **On-demand injection**: When editing a file, only that file's domain rules are loaded — not everything at once
-- **Session isolation**: Multiple AIs working in parallel, each scoped via independent transcript files — no cross-contamination
-- **Compression protection**: Critical info auto-saved before context compression
-<!-- [Diagram ④: Session Isolation] -->
-AI always knows what it's doing. Because hooks remind it at every critical moment.
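On-demand injection can be pictured as a lookup from the edited file's path to a rule fragment. The sketch below is an assumption about how such routing might work, not the package's implementation: the `Route` shape, `contextFor`, and the longest-prefix rule are all hypothetical.

```typescript
// Hypothetical routing table entry: path prefix -> context fragment to inject.
interface Route {
  prefix: string;
  context: string;
}

// Picks the fragment whose prefix matches the edited file. The longest
// matching prefix wins, so more specific rules override general ones.
// Returns null when no rule applies, in which case nothing is injected.
function contextFor(filePath: string, routes: Route[]): string | null {
  let best: Route | null = null;
  for (const route of routes) {
    const matches = filePath.indexOf(route.prefix) === 0;
    if (matches && (best === null || route.prefix.length > best.prefix.length)) {
      best = route;
    }
  }
  return best === null ? null : best.context;
}
```

Loading only the matching fragment keeps the injected context small, which is the point of the "not everything at once" countermeasure.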
----
-### Layer 6: Physical Isolation — Not "Please Don't" — Can't
-<!-- [Diagram ⑤: Schema-Level Review Isolation] -->
-This is the **most critical design decision** in the entire system.
-| Method | Compliance |
-|--------|-----------|
-| Write "don't modify files" in the prompt | ~70% |
-| Remove Edit/Write from the tool manifest | **100%** |
-Because it **physically can't call tools that don't exist** in its schema.
-This is called schema-level isolation. The entire bollharness design philosophy in one sentence:
-> **If it matters, don't ask. Enforce.**
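A minimal sketch of schema-level isolation, assuming tools are identified by name. `reviewerManifest` and `canCall` are hypothetical helpers, not the package's API; only the tool names Edit and Write come from the table.

```typescript
// Tools that can modify files; the article names Edit and Write.
const WRITE_TOOLS = ["Edit", "Write"];

// Builds the tool manifest for a review agent by dropping every write tool.
function reviewerManifest(allTools: string[]): string[] {
  return allTools.filter((tool) => WRITE_TOOLS.indexOf(tool) === -1);
}

// Dispatcher sketch: a call to a tool outside the manifest is rejected
// before it runs, regardless of what the prompt said.
function canCall(manifest: string[], tool: string): boolean {
  return manifest.indexOf(tool) !== -1;
}
```

The difference from a prompt instruction is where the check lives: the reviewer is never offered the tool, so there is nothing for it to forget.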
----
-## 05 Real story: one Stop Hook, three rounds of fixes
-bollharness isn't an ideal system from a paper. It was **forged by reality**.
-Our Stop hook (prevents AI from quitting prematurely) went through three rounds of fixes:
-**Round 1**: AI triggers completion checklist during pure chat → added "check for write operations"
-**Round 2**: AI edits files, commits, continues chatting — still triggers → changed to "edited files ∩ uncommitted git changes" intersection
-**Round 3**: Two AIs working in parallel, shared state contaminates each other → switched to session-scoped transcript isolation
-Three rounds. One hook.
-**This is why the system works — it's not designed to be perfect. It's battle-tested against AI's creative workarounds.**
-Every rule exists because the previous rule had a loophole.
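The check that came out of the three rounds can be sketched as a set intersection. This is an illustration under stated assumptions: both lists are passed in as plain arrays, whereas a real hook would derive them from the session's own transcript and from git, and the function names are hypothetical.

```typescript
// editedInSession: files this session wrote (assumed to be parsed from
//   this session's transcript only, which is the Round 3 fix).
// uncommitted: files git reports as changed but not committed.
// Returns the files that are both, i.e. the Round 2 intersection.
function filesBlockingStop(editedInSession: string[], uncommitted: string[]): string[] {
  return editedInSession.filter((file) => uncommitted.indexOf(file) !== -1);
}

// The Stop is blocked only when this session left its own work uncommitted.
function shouldBlockStop(editedInSession: string[], uncommitted: string[]): boolean {
  return filesBlockingStop(editedInSession, uncommitted).length > 0;
}
```

Each round maps to one case: a pure chat has no edited files (Round 1), committed edits are absent from the uncommitted list (Round 2), and another session's dirty files are absent from this session's edited list (Round 3).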
----
-## 06 How does it compare?
-| | CLAUDE.md | Cursor Rules | Managed Agents | **bollharness** |
-|---|---|---|---|---|
-| Constraint method | Text instructions | Text instructions | Cloud-hosted | Local hooks |
-| Compliance rate | ~20% | ~20% | N/A (cloud) | **100% (mechanical)** |
-| Review mechanism | Self-review | Self-review | None | **Independent Agent + schema isolation** |
-| Parallel sessions | Cross-contamination | Cross-contamination | Isolated | **Transcript-scoped isolation** |
-| Failure mode | Silent skip | Silent skip | Cloud retry | **Block + feedback** |
-| Self-learning | ✗ | ✗ | ✗ | **✓ (crystal-learn)** |
-| Deployment | Write a file | Write a file | API integration | **One-command install** |
-| Works with | Any AI editor | Cursor | API calls | **Claude Code** |
-CLAUDE.md is still useful. bollharness doesn't replace it — **it enforces it**.
----
-## 07 Three minutes to install
-```bash
-git clone https://github.com/NatureBlueee/bollharness.git
-cd bollharness
-npx ts-node src/scripts/install/phase2_auto.ts /path/to/your/project --tier drop-in
-```
-Three tiers, based on trust level:
-| Tier | What it does | Who it's for |
-|------|-------------|-------------|
-| **drop-in** | Installs as-is, doesn't read your code | Try it out |
-| **adapt** | Reads your docs, adapts Skills | Real projects |
-| **mine** | Reads your work transcripts, deep adaptation | Long-term use |
-Idempotent — run it twice, same result. Won't overwrite your existing config.
-Requires: Claude Code CLI + Node.js 18+ + Git.
----
-## 08 Origin
-bollharness was extracted from 6 months of production use on [boll](https://boll.net), an agent collaboration protocol project.
-While building boll, we kept getting burned by AI's creative workarounds — so we kept adding rules, hooks, isolation. Eventually we realized the governance layer was more universally valuable than the project itself.
-Every AI-assisted project needs this. Not just ours.
-So we extracted it and open-sourced it.
-**GitHub**: [NatureBlueee/bollharness](https://github.com/NatureBlueee/bollharness)
-**License**: MIT
----
-*Anthropic says Harness is the future of Agents.*
-*We say Harness is what Agents forced into existence.*
-*Every rule has a story: an AI that found a creative way around the last one.*

package/src/bollharness/docs/launch-article-zh.md DELETED

@@ -1,305 +0,0 @@
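-# 53 Days, 1,137 Commits, 930,000 Lines of Code: All Written by AI
-Have you ever tried handing an entire project over to AI?
-Not asking it to fill in a function or fix a bug.
-But this: you say "I want to build X," and it designs the architecture, splits the tasks, writes the code, runs the tests, does the reviews, and opens the PR on its own.
-We did it.
-53 days. 1,137 commits. 930,000 lines of code. 1,992 tests. One person plus AI, autonomously delivering a complete commercial project.
-What we are open-sourcing today is the core that made this possible.
-**bollharness**: a governance layer for Claude Code.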
----
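-## What it is, in one sentence
-bollharness gives your Claude Code project a complete set of engineering governance.
-One command to install. Once installed, the AI agent gains four capabilities:
-**Self-organization**: 16 specialized skills, from architecture to development to review, each with its own job, with an 8-gate state machine controlling the whole flow.
-**Self-learning**: mistakes the AI has made are automatically extracted into rules and injected into the execution layer, so the same pitfall is never hit twice.
-**Self-discovery**: it identifies recurring patterns in your work history and proposes packaging them as new skills. The harness grows new capabilities.
-**Self-installation**: three depth tiers, chosen according to how much you trust it. Installation is idempotent and does not overwrite existing configuration.
-Not prompts. Not documentation. Mechanized engineering governance.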
----
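-## Why you need it
-Anyone who has used Claude Code knows the feeling: AI writes code fast, but **trusting it with the work** is another matter.
-You write a pile of rules in CLAUDE.md. The AI sometimes follows them and sometimes does not. It is not deliberate: as the conversation grows, early rules get squeezed out by context compression. In our measurements, the compliance rate for CLAUDE.md rules is roughly **20%**.
-You tell the review AI to "look but don't touch." There is a 30% chance it cannot resist editing the code along the way.
-The AI says "done." You check: the tests were not run and the files were not committed. By our count, **67%** of stop attempts had no evidence of real completion.
-These are not bugs. They are structural characteristics of LLMs.
-The solution is not writing better prompts.
-It is building **mechanical devices**.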
----
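-## Self-organization: you say one sentence, it finishes the job
-You say, "I want to build a user registration page."
-Here is what happens next:
-**Gate 0**: `lead` takes over, locks down the problem, and classifies the change.
-**Gate 1**: `arch` produces the architecture design and lists every consumer.
-**Gate 2**: a **brand-new, independent AI** is spawned automatically to do the review. This review AI does not share the earlier conversation and reads everything from scratch. Its tool manifest also has no Edit or Write, so it physically cannot change the code.
-**Gate 3-4**: a detailed plan is produced, `plan-lock` freezes it, and another independent review follows.
-**Gate 5**: `task-arch` splits the plan into work packages that can run in parallel. Each package has a clear file scope, a seam owner, and acceptance criteria.
-**Gate 6**: an independent review of the work-package split.
-**Gate 7**: `harness-dev` writes the code. It logs as it goes rather than after the fact, because after-the-fact logs have burned us: the "evidence" in them may be made up.
-**Gate 8**: final review. Independent AI plus end-to-end acceptance. A PR is opened only if it passes.
-**What you do: set direction, correct course, accept the PR. Everything in between is automatic.**
-16 skills, each with its own job:
-- `lead`: process lead, fail-closed state machine
-- `arch`: architect, compares options and freezes boundaries
-- `harness-dev`: full-stack development, code implementation and change propagation
-- `guardian-fixer`: bug fixing, with its own 8-gate verification pipeline
-- `task-arch`: task decomposition, where seams are first-class citizens
-- `harness-ops`: operations inspection, source-of-truth drift detection
-- `harness-eng-test`: closed-loop testing, honestly labels PASS and BLOCKED
-- `harness-lab`: experimental scientist, backs design decisions with data
-- `crystal-learn`: failure pattern extraction, injected into the execution layer
-- `skill-discovery`: discovers new capabilities from work history
-- `plan-lock`: freezes the plan and closes every open decision
-- `bug-triage`: bug triage
-- `harness-eng`: engineering orchestration, manages parallel execution
-- `harness-voice`: brand voice
-- `bug-pipeline`: end-to-end bug-fix pipeline
-- `harness-dev-handoff`: entry point for a new AI taking over
-Not one do-everything AI handling every job.
-A **team with a clear division of labor**.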
----
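-## Self-learning: fail once, immune forever
-The AI changes a backend API and forgets to update the frontend calls. You fix it.
-Next time, it forgets again.
-Because LLMs have no memory across sessions. The pitfall it hit last time, it will hit again.
-bollharness has a capability called `crystal-learn`. Its role is that of an **adaptive immune system**.
-It does not fix specific bugs. It extracts "why this class of error keeps coming back" and compresses it into an **invariant**: a structural constraint that must hold no matter the circumstances.
-The example above is extracted as:
-> **INV-1 Ripple decay**: changing the source does not mean the change is finished. After changing a function signature, schema, or route, you must grep the consumers.
-This rule is not written in a document waiting for the AI to remember it. Compliance there is 20%, remember?
-It is **injected into the execution-layer skill `harness-dev`**. The next time the AI changes an API, the rule takes effect automatically as part of the skill.
-So far we have extracted **8 confirmed invariants** from real production incidents:
-**INV-0 Snapshot illusion**: the file state in memory is not the current state on disk. Before claiming a file exists, you must actually confirm it.
-**INV-1 Ripple decay**: changing the source does not mean the change is finished. After a change, you must grep the consumers.
-**INV-2 Format cliff**: a correct surface format does not imply correct semantics.
-**INV-3 Concurrent writes**: when multiple agents run in parallel, writes to shared files must be coordinated.
-**INV-4 Source-of-truth split**: the same fact written in two places will eventually diverge. It must converge to a single source.
-**INV-5 Semantic free-riding**: looking the same does not mean meaning the same. Before reusing, ask "why was it written this way originally?"
-**INV-6 Verification decay**: only the easiest layer gets tested and the hard ones are skipped. Verification must be derived backward from the user value chain.
-**INV-7 Ownerless seam**: nobody looks after the boundary between two modules. A shared interface must designate a seam_owner.
-Every one comes from a real pitfall. Every one has been injected into a specific execution-layer skill.
-**This is not a rule library. It is an immune system.**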
----
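-## Self-installation: one command, works out of the box
-```bash
-git clone https://github.com/NatureBlueee/bollharness.git
-cd bollharness
-npx ts-node src/scripts/install/phase2_auto.ts /path/to/your/project
-```
-Three installation tiers, chosen by how deeply you trust it:
-**drop-in**: installs all hooks and skills as-is. Reads none of your files. Good for a first try.
-**adapt** (default): reads your README and docs and adapts the skills to your project context. This is what most people use.
-**mine**: reads your work transcripts and adapts deeply to your development patterns. Suited to long-running projects.
-After installation your project gains:
-```
-.boll/
-├── settings.json    ← 16 hooks registered automatically
-├── skills/          ← 16 specialized skills
-└── rules/           ← path rules: which rules load for which files you edit
-src/scripts/
-├── hooks/           ← 18 lifecycle hooks
-└── checks/          ← 15 automatic validators
-```
-The installer is **idempotent**. Run it twice and the result is the same. It does not overwrite your existing settings.json configuration; it only appends.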
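An append-only, idempotent merge can be sketched as follows. The `Settings` shape here is a deliberately simplified assumption (a flat list of hook commands), not the real settings.json schema, and `mergeHooks` is a hypothetical helper rather than the installer's code.

```typescript
// Hypothetical, simplified settings shape: a flat list of hook commands.
interface Settings {
  hooks: string[];
}

// Appends only the hooks that are not already registered.
// Existing entries keep their position, so user config is never overwritten,
// and a second run adds nothing, which makes the merge idempotent.
function mergeHooks(existing: Settings, toInstall: string[]): Settings {
  const merged = existing.hooks.slice();
  for (const hook of toInstall) {
    if (merged.indexOf(hook) === -1) merged.push(hook);
  }
  return { hooks: merged };
}
```

Running the merge on its own output returns the same list, which is the property the installer claims.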
----
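-## Self-discovery: the harness grows new capabilities
-`skill-discovery` reads your work history and identifies recurring patterns.
-For example, it finds that in 7 of your last 10 sessions you were doing the same type of API integration, each time spending 15 conversation turns explaining the same conventions.
-Then it proposes: **should this be packaged as a dedicated skill?**
-You agree. It generates a draft SKILL.md containing trigger conditions, an execution flow, and a decision framework.
-The next time similar work comes up, the AI loads this skill automatically. 15 turns of repeated explanation become 0.
-The harness is not a static toolbox.
-**It evolves along with your project.**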
----
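-## Mechanical constraints: turning rules into traffic lights
-The above covers capabilities. What follows is the **bottom line**.
-bollharness has 18 hooks covering 7 lifecycle stages. They step in at the moment an action happens, not in after-the-fact review:
-**SessionStart**: loads context, resets risk state, checks tool availability.
-**PreToolUse**: intercepts before the AI performs an action. Dangerous deployment? Straight to exit 2, a physical block. Review agent about to spawn? It checks that a read-only constraint is included. Reading a file? Sensitive data is redacted automatically.
-**PostToolUse**: edited a file? The rules for that file's domain are injected automatically (17 sets of context fragments, routed precisely by file path). It also detects loops and tracks the risk level.
-**Stop**: the AI wants to exit? First the hook parses this session's transcript, extracts the list of files written, and intersects it with git's uncommitted changes. Without evidence of completion, the exit is blocked.
-**SessionEnd**: automatic reflection, analysis of the behavior trace, and persistence of progress.
-**PreCompact**: before context compression, key goal information is saved automatically so it is not forgotten after compression.
-**PostToolUseFailure**: tool-call failures are recorded automatically for later pattern analysis.
-Not "reminding you not to."
-A traffic light: **whether or not you remember the traffic rules, red means stop. Enforcement rate: 100%.**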
----
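-## Self-governance: when the harness itself breaks, who fixes it?
-The skills, invariants, and hooks above are the harness governing how you write code.
-What about the harness itself? Who governs it?
-Over the 53 days, it really did run into these problems:
-**Coordinator dropout**: with three windows working in parallel, the main window passed information to the child windows by manual copy and paste. When the human went offline, the pipeline stopped.
-**Numbering collisions**: two sessions each created an ADR-088, one about A and one about B. After the git merge only one remained, and the other's design decision vanished without a trace.
-**Memory overflow**: the cross-session memory file grew to 500 lines. CC's context window physically truncates it at 200 lines, so for the AI the remaining 300 lines might as well never have been written.
-**Fixing the problem causes the problem**: you ask the AI to rein in "over-reviewing," and the AI goes through 6 rounds of review before handing you the governance plan.
-These are not bugs. They are meta-level breakpoints that the harness exposed in itself.
-We spent 9 days consolidating them into a system called the **H series**: 9 stations, 9 governance rules. Here are the 3 most representative:
-**H9 Multi-window mailbox**: turns the "AI coordinator" into a physical "file mailbox." Child windows write messages to a directory; the main window reads them automatically on startup, injects them into the prompt, and writes back an ack when done. **No AI acting as messenger, no manual copy and paste.**
-**H0.4 Self-reference ban**: work that fixes problem X must not reproduce the symptoms of X in its own deliverables. It comes with an audit file named `hanis-self-symptoms.md` that records instances of "fixing a problem introduced the problem." **It turns the AI's ironic moments into an auditable data stream.**
-**Numbering-uniqueness chokepoint**: a git pre-commit hook that rejects the commit outright on a number collision. A schema-level block, landed after 10 historical collision incidents. This one was synced into bollharness today and works out of the box.
-All 9 H-series stations are closed as of today. New issues will not be added as H10+; they will be absorbed through ordinary engineering plans.
-**That it stays closed and opens no new stations is itself a signal of a successful steady state.**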
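The numbering-uniqueness check can be sketched as a duplicate finder over file names. The `ADR-<number>` naming pattern is inferred from the ADR-088 example, and `adrId` and `collidingIds` are hypothetical helpers, not the package's hook; a pre-commit script would reject the commit whenever the returned list is non-empty.

```typescript
// Extracts an ADR number such as "ADR-088" from a file name (assumed naming scheme).
function adrId(fileName: string): string | null {
  const match = /ADR-\d+/.exec(fileName);
  return match ? match[0] : null;
}

// Returns every ADR number claimed by more than one file.
function collidingIds(fileNames: string[]): string[] {
  const owners: { [id: string]: string[] } = {};
  for (const name of fileNames) {
    const id = adrId(name);
    if (id === null) continue; // not an ADR file, ignore it
    const list = owners[id] ?? [];
    list.push(name);
    owners[id] = list;
  }
  return Object.keys(owners).filter((id) => (owners[id] ?? []).length > 1);
}
```

Checking at commit time catches the collision before a merge can silently drop one of the two documents.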
----
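-## What we built with it
-bollharness was extracted from production development on boll (流形). boll is an Agent collaboration protocol project.
-**53 days. 1,137 commits. 930,000 lines of code. 1,992 tests.**
-One person plus AI autonomously delivered:
-- Backend protocol engine (discovery layer + negotiation layer + value-exchange layer)
-- MCP server (published to both PyPI and npm)
-- Website (boll.net)
-- Admin console
-- Several scenario-specific vertical apps (coaching, hackathon, labor market)
-- And the harness itself
-Every hook, every isolation mechanism, and every invariant in it was added only after the AI found a creative way around the previous rule.
-Not a perfect system that was designed.
-One that was shaped by reality.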
----
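-## Three things you can do right now
-**Try it**
-```bash
-git clone https://github.com/NatureBlueee/bollharness.git
-```
-One command installs it into your project. Start with the drop-in tier for a zero-risk trial.
-**Give feedback**
-Tell us what works and what does not. Open an issue or just chat in the comments.
-**Contribute**
-Every invariant, every skill, and every hook comes from a real pitfall.
-If you also develop with AI, you surely have pitfalls of your own. Additions are welcome.
-**GitHub**: github.com/NatureBlueee/bollharness
-**License**: MIT
----
-*One command. Fully autonomous development. Works out of the box.*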