compound-agent 1.7.6 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +45 -1
- package/README.md +70 -47
- package/bin/ca +32 -0
- package/package.json +19 -78
- package/scripts/postinstall.cjs +221 -0
- package/dist/cli.d.ts +0 -1
- package/dist/cli.js +0 -13158
- package/dist/cli.js.map +0 -1
- package/dist/index.d.ts +0 -3730
- package/dist/index.js +0 -3240
- package/dist/index.js.map +0 -1
- package/docs/research/AgenticAiCodebaseGuide.md +0 -1206
- package/docs/research/BuildingACCompilerAnthropic.md +0 -116
- package/docs/research/HarnessEngineeringOpenAi.md +0 -220
- package/docs/research/code-review/systematic-review-methodology.md +0 -409
- package/docs/research/index.md +0 -76
- package/docs/research/learning-systems/knowledge-compounding-for-agents.md +0 -695
- package/docs/research/property-testing/property-based-testing-and-invariants.md +0 -742
- package/docs/research/scenario-testing/advanced-and-emerging.md +0 -470
- package/docs/research/scenario-testing/core-foundations.md +0 -507
- package/docs/research/scenario-testing/domain-specific-and-human-factors.md +0 -474
- package/docs/research/security/auth-patterns.md +0 -138
- package/docs/research/security/data-exposure.md +0 -185
- package/docs/research/security/dependency-security.md +0 -91
- package/docs/research/security/injection-patterns.md +0 -249
- package/docs/research/security/overview.md +0 -81
- package/docs/research/security/secrets-checklist.md +0 -92
- package/docs/research/security/secure-coding-failure.md +0 -297
- package/docs/research/software_architecture/01-science-of-decomposition.md +0 -615
- package/docs/research/software_architecture/02-architecture-under-uncertainty.md +0 -649
- package/docs/research/software_architecture/03-emergent-behavior-in-composed-systems.md +0 -644
- package/docs/research/spec_design/decision_theory_specifications_and_multi_criteria_tradeoffs.md +0 -0
- package/docs/research/spec_design/design_by_contract.md +0 -251
- package/docs/research/spec_design/domain_driven_design_strategic_modeling.md +0 -183
- package/docs/research/spec_design/formal_specification_methods.md +0 -161
- package/docs/research/spec_design/logic_and_proof_theory_under_the_curry_howard_correspondence.md +0 -250
- package/docs/research/spec_design/natural_language_formal_semantics_abuguity_in_specifications.md +0 -259
- package/docs/research/spec_design/requirements_engineering.md +0 -234
- package/docs/research/spec_design/systems_engineering_specifications_emergent_behavior_interface_contracts.md +0 -149
- package/docs/research/spec_design/what_is_this_about.md +0 -305
- package/docs/research/tdd/test-driven-development-methodology.md +0 -547
- package/docs/research/test-optimization-strategies.md +0 -401
- package/scripts/postinstall.mjs +0 -102
@@ -1,116 +0,0 @@

# Building a C compiler with a team of parallel Claudes

*Author: Nicholas Carlini – Safeguards team, Anthropic*

*Published: Feb 05, 2026*

We tasked Opus 4.6, working as an agent team, with building a C compiler, and then (mostly) walked away. Here's what it taught us about the future of autonomous software development.

---

I've been experimenting with a new approach to supervising language models that we're calling "agent teams."

With agent teams, multiple Claude instances work in parallel on a shared codebase without active human intervention. This approach dramatically expands the scope of what's achievable with LLM agents.

To stress-test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.

The compiler is an interesting artifact on its own, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human oversight, how to structure work so multiple agents can make progress in parallel, and where this approach hits its ceiling.

## Enabling long-running Claudes

Existing agent scaffolds like Claude Code require an operator to be online and available to work jointly. If you ask for a solution to a long and complex problem, the model may solve part of it, but eventually it will stop and wait for continued input—a question, a status update, or a request for clarification.

To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop (if you've seen the Ralph loop, this should look familiar). When it finishes one task, it immediately picks up the next. (Run this in a container, not on your actual machine.)

```bash
#!/bin/bash

while true; do
  COMMIT=$(git rev-parse --short=6 HEAD)
  LOGFILE="agent_logs/agent_${COMMIT}.log"

  claude --dangerously-skip-permissions \
    -p "$(cat AGENT_PROMPT.md)" \
    --model claude-opus-X-Y &> "$LOGFILE"
done
```

In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it's working on, figuring out what to work on next, and to keep going until it's perfect. (On this last point, Claude has no choice: the loop runs forever—although in one instance, I did see Claude run `pkill -9 bash` by accident, thus killing itself and ending the loop. Whoops!)

## Running Claude in parallel

Running multiple instances in parallel can address two weaknesses of a single-agent harness:

- One Claude Code session can only do one thing at a time. Especially as the scope of a project expands, debugging multiple issues in parallel is far more efficient.
- Running multiple Claude agents allows for specialization. While a few agents are tasked with solving the actual problem at hand, other specialized agents can be invoked to (for example) maintain documentation, keep an eye on code quality, or solve specialized sub-tasks.

My implementation of parallel Claude is bare-bones. A new bare git repo is created, and for each agent, a Docker container is spun up with the repo mounted at `/upstream`. Each agent clones a local copy to `/workspace`, and when it's done, pushes from its own container back to upstream.

To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm:

1. Claude takes a "lock" on a task by writing a text file to `current_tasks/` (e.g., one agent might lock `current_tasks/parse_if_statement.txt`, while another locks `current_tasks/codegen_function_definition.txt`). If two agents try to claim the same task, git's synchronization forces the second agent to pick a different one.
2. Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to resolve them.
3. The infinite agent-generation loop spawns a new Claude Code session in a fresh container, and the cycle repeats.
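The locking step above can be sketched as a small shell function. The `current_tasks/` path mirrors the convention in the text; the `AGENT_ID` variable and the branch setup are assumptions for illustration, and the real synchronization point is that a rejected `git push` forces the losing agent to drop its claim.

```shell
# Sketch of the git-based task lock described above. current_tasks/ matches
# the convention in the text; AGENT_ID and the branch layout are assumptions.
claim_task() {
  local task="$1"
  local lock="current_tasks/${task}.txt"
  git pull --rebase --quiet
  [ -e "$lock" ] && return 1            # another agent already holds it
  mkdir -p current_tasks
  echo "claimed by agent ${AGENT_ID}" > "$lock"
  git add "$lock"
  git commit --quiet -m "lock: ${task}"
  # The push is the synchronization point: if another agent pushed the same
  # lock first, this push is rejected and we drop our local claim.
  if ! git push --quiet 2>/dev/null; then
    git reset --hard --quiet HEAD~1
    git pull --rebase --quiet
    return 1
  fi
}

release_task() {
  git rm --quiet "current_tasks/${1}.txt"
  git commit --quiet -m "unlock: ${1}"
  git push --quiet
}
```

Because the lock files live in the repository itself, the claim history is auditable in `git log`, which is how the article's project lets you "watch it take out locks on various tasks."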

This is a very early research prototype. I haven't yet implemented any other method for communication between agents, nor do I enforce any process for managing high-level goals. I don't use an orchestration agent.

Instead, I leave it up to each Claude agent to decide how to act. In most cases, Claude picks up the "next most obvious" problem. When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks. In the git repository of the project, you can read through the history and watch it take out locks on various tasks.

## Lessons from programming with Claude agent teams

The scaffolding runs Claude in a loop, but that loop is only useful if Claude can tell how to make progress. Most of my effort went into designing the environment around Claude—the tests, the environment, the feedback—so that it could orient itself without me. These are the approaches I've found most helpful when orchestrating multiple Claude instances.

### Write extremely high-quality tests

Claude will work autonomously to solve whatever problem I give it. So it's important that the task verifier is nearly perfect; otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline with stricter enforcement, which allowed Claude to better test its work so that new commits couldn't break existing code.

### Put yourself in Claude's shoes

I had to constantly remind myself that I was writing this test harness for Claude and not for myself, which meant rethinking many of my assumptions about how tests should communicate results.

For example, each agent is dropped into a fresh container with no context and will spend significant time orienting itself, especially on large projects. Before we even reach the tests, to help Claude help itself, I included instructions to maintain extensive READMEs and progress files that should be updated frequently with the current status.

I also kept in mind the fact that language models have inherent limitations which, in this case, needed to be designed around. These include:

- Context window pollution: The test harness should not print thousands of useless bytes. At most, it should print a few lines of output and log all important information to a file so Claude can find it when needed. Logfiles should be easy to process automatically: if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it. It helps to pre-compute aggregate summary statistics so Claude doesn't have to recompute them.
- Time blindness: Claude can't tell time and, left alone, will happily spend hours running tests instead of making progress. The harness prints incremental progress infrequently (to avoid polluting context) and includes a default `--fast` option that runs a 1% or 10% random sample. This subsample is deterministic per agent but random across VMs, so the agents collectively still cover all files while each agent can reliably identify regressions.
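A deterministic per-agent subsample in the spirit of that `--fast` mode can be sketched with a stable hash. The `tests/` layout and `AGENT_ID` variable here are assumptions, not the project's actual harness:

```shell
# Sketch of a deterministic per-agent test subsample. cksum gives a cheap,
# stable hash of "agent id + test name", so one agent always sees the same
# slice on every run, while different agents hash to different slices.
sample_tests() {
  local percent="$1" t h
  for t in tests/*.c; do
    h=$(printf '%s/%s' "${AGENT_ID}" "$t" | cksum | cut -d' ' -f1)
    if [ $((h % 100)) -lt "$percent" ]; then
      echo "$t"
    fi
  done
}
```

Seeding the hash with the agent's identity rather than the clock is what makes regressions attributable: a test that passed on the last run of this agent and fails now was broken by this agent's change.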

### Make parallelism easy

When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open-source project (e.g., SQLite, Redis, libjpeg, MQuickJS, Lua) to compile.

But when the agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other's changes. Having 16 agents running didn't help, because each was stuck solving the same task.

The fix was to use GCC as a known-good compiler oracle to compare against. I wrote a new test harness that compiled a random majority of the kernel's files with GCC, and only the remaining files with Claude's C compiler. If the kernel worked, then the problem wasn't in Claude's subset of the files; if it broke, the harness could narrow things down further by re-compiling some of those files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude's compiler could eventually compile all files. (After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)
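The narrowing step can be sketched as a bisection over the suspect files. Here `build_ok` is a stand-in for the real harness (compile the listed files with the agents' compiler, everything else with GCC, then boot-test the result), and the single-culprit assumption is a simplification; as noted above, pairwise failures required delta debugging.

```shell
# Sketch of the oracle-narrowing loop described above. build_ok is a stand-in
# for the real harness (compile the listed files with the agents' compiler,
# the rest with GCC, then boot-test the kernel). Assumes a single culprit.
find_culprit() {
  local files=("$@")
  while [ "${#files[@]}" -gt 1 ]; do
    local half=$(( ${#files[@]} / 2 ))
    local left=("${files[@]:0:half}")
    if ! build_ok "${left[@]}"; then
      files=("${left[@]}")            # failure reproduces in the left half
    else
      files=("${files[@]:half}")      # culprit must be in the right half
    fi
  done
  echo "${files[0]}"
}
```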

### Multiple agent roles

Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and made a third responsible for outputting efficient compiled code. I asked another agent to critique the design of the project from the perspective of a Rust developer and make structural changes to improve overall code quality, and another to work on documentation.

## Stress testing the limits of agent teams

This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future.

I've been using the C compiler project as a benchmark across the entire Claude 4 model series. As I did with prior projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes), I did not go into any detail on how to do so.

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold: it could produce a functional compiler that passed large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

### Evaluation

Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, at a total cost just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself—let alone with an entire team.

This was a clean-room implementation (Claude did not have internet access at any point during development); it depends only on the Rust standard library. The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, Postgres, and Redis, and has a 99% pass rate on most compiler test suites, including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

The compiler, however, is not without limitations. These include:

- It lacks the 16-bit x86 code generator that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
- It does not have its own assembler and linker; these are the very last pieces Claude started automating, and they are still somewhat buggy. The demo video was produced with GCC's assembler and linker.
- The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.
- The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
- The Rust code quality is reasonable, but nowhere near what an expert Rust programmer would produce.
- The resulting compiler has nearly reached the limits of Opus's abilities. I tried (hard!) to fix several of the above limitations but wasn't fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement the 16-bit x86 code generator needed to boot into real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60 KB, far exceeding the 32 KB code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase. (This is only the case for x86; for ARM or RISC-V, Claude's compiler can compile everything by itself.)

The source code for the compiler is available. Download it, read through the code, and try it on your favorite C projects. I've consistently found the best way to understand what language models can do is to push them to their limits, and then study where they start to break down. Over the coming days, I'll continue having Claude push new changes if you want to follow along with its continued attempts at addressing these limitations.

## Looking forward

Each generation of language models opens up new ways of working with them. Early models were useful for tab-completion in IDEs. Before long, models could complete a function body from its docstring. The launch of Claude Code brought agents into the mainstream and enabled developers to pair-program with Claude. But each of these products operates under the assumption that a user defines a task, an LLM runs for a few seconds or minutes and returns an answer, and then the user provides a follow-up.

Agent teams show the possibility of implementing entire, complex projects autonomously. This allows us, as users of these tools, to become more ambitious with our goals.

We are still early, and fully autonomous development comes with real risks. When a human sits with Claude during development, they can ensure consistent quality and catch errors in real time. With autonomous systems, it is easy to see tests pass and assume the job is done, when this is rarely the case. I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they've never personally verified is a real concern.

So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I've had recently, but I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. I expect the positive applications to outweigh the negative, but we're entering a new world that will require new strategies to navigate safely.

@@ -1,220 +0,0 @@

# Harness engineering: leveraging Codex in an agent-first world

*Enforcing architecture and taste*

**February 11, 2026**
Engineering
By Ryan Lopopolo, Member of the Technical Staff

---

Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with *0 lines of manually-written code*.

The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What's different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.

## Humans steer. Agents execute.

We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team's primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.

This post is about what we learned by building a brand new product with a team of agents—what broke, what compounded, and how to maximize our one truly scarce resource: human time and attention.

## We started with an empty git repository

The first commit to an empty repository landed in late August 2025.

The initial scaffold—repository structure, CI configuration, formatting rules, package manager setup, and application framework—was generated by Codex CLI using GPT‑5, guided by a small set of existing templates. Even the initial `AGENTS.md` file that directs agents how to work in the repository was itself written by Codex.

There was no pre-existing human-written code to anchor the system. From the beginning, the repository was shaped by the agent.

Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged by a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and, surprisingly, throughput has increased as the team has grown to seven engineers. Importantly, this wasn't output for output's sake: the product has been used by hundreds of users internally, including daily internal power users.

Throughout the development process, humans never directly contributed any code. This became a core philosophy for the team: *no manually-written code*.

## Redefining the role of the engineer

The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and leverage.

Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals. The primary job of our engineering team became enabling the agents to do useful work.

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc.), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never "try harder." Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: "What capability is missing, and how do we make it both legible and enforceable for the agent?"

Humans interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request. To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent feedback, and iterate in a loop until all agent reviewers are satisfied (effectively a [Ralph Wiggum Loop](https://ghuntley.com/loop/)). Codex uses our standard development tools directly (`gh`, local scripts, and repository-embedded skills) to gather context without humans copying and pasting into the CLI.

Humans may review pull requests, but aren't required to. Over time, we've pushed almost all review effort toward being handled agent-to-agent.

## Increasing application legibility

As code throughput increased, our bottleneck became human QA capacity. Because the fixed constraint has been human time and attention, we've worked to add more capabilities to the agent by making things like the application UI, logs, and app metrics directly legible to Codex.

For example, we made the app bootable per git worktree, so Codex can launch and drive one instance per change. We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enables Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.

We did the same for observability tooling. Logs, metrics, and traces are exposed to Codex via a local observability stack that's ephemeral for any given worktree. Codex works on a fully isolated version of the app—including its logs and metrics—which gets torn down once the task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like "ensure service startup completes in under 800ms" or "no span in these four critical user journeys exceeds two seconds" become tractable.

We regularly see single Codex runs work on one task for upwards of six hours (often while the humans are sleeping).

## We made repository knowledge the system of record

Context management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: *give Codex a map, not a 1,000-page instruction manual.*

We tried the "one big [`AGENTS.md`](https://agents.md/)" approach. It failed in predictable ways:

- Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs—so the agent either misses key constraints or starts optimizing for the wrong ones.
- Too much guidance becomes *non-guidance*. When everything is "important," nothing is. Agents end up pattern-matching locally instead of navigating intentionally.
- It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can't tell what's still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
- It's hard to verify. A single blob doesn't lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.

So instead of treating `AGENTS.md` as the encyclopedia, we treat it as *the table of contents*.

The repository's knowledge base lives in a structured `docs/` directory treated as the system of record. A short `AGENTS.md` (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth elsewhere.

```text
AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
```

Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. Architecture documentation provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.

Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in execution plans with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.

This enables progressive disclosure: agents start with a small, stable entry point and are taught where to look next, rather than being overwhelmed up front.

We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring "doc-gardening" agent scans for stale or obsolete documentation that does not reflect real code behavior and opens fix-up pull requests.
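One of those mechanical checks can be sketched as a cross-link lint. The real linters are Codex-generated; this grep-based version, and the `docs/` layout it scans, are only an illustration of the idea:

```shell
# Sketch of a knowledge-base cross-link check in the spirit described above:
# fail CI if any relative markdown link in docs/ targets a missing file.
check_doc_links() {
  local status=0 file match link target
  while IFS=: read -r file match; do
    link="${match#"]("}"                 # strip the leading "]("
    link="${link%)}"                     # and the trailing ")"
    target="$(dirname "$file")/${link}"  # resolve relative to the source file
    if [ ! -e "$target" ]; then
      echo "ERROR broken link: $file -> $link"
      status=1
    fi
  done < <(grep -RoE '\]\([^)#:]+\.md\)' docs)
  return $status
}
```

Emitting `ERROR` plus the reason on one line follows the same greppable-log convention the compiler article recommends, so an agent can find and fix every broken link in a single pass.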

## Agent legibility is the goal

As the codebase evolved, Codex's framework for design decisions needed to evolve, too.

Because the repository is entirely agent-generated, it's optimized first for Codex's legibility. In the same way teams aim to improve the navigability of their code for new engineering hires, our human engineers' goal was making it possible for an agent to reason about the full business domain directly from the repository itself.

From the agent's point of view, anything it can't access in-context while running effectively doesn't exist. Knowledge that lives in Google Docs, chat threads, or people's heads is not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.

We learned that we needed to push more and more context into the repo over time. That Slack discussion that aligned the team on an architectural pattern? If it isn't discoverable to the agent, it's illegible, in the same way it would be unknown to a new hire joining three months later.
|
|
117
|
-
|
|
118
|
-
Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.
|
|
119
|
-
|
|
120
|
-
This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, api stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper: it’s tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.
Pulling more of the system into a form the agent can inspect, validate, and modify directly increases leverage—not just for Codex, but for other agents (e.g. Aardvark) that are working on the codebase as well.
Enforcing architecture and taste
Documentation alone doesn’t keep a fully agent-generated codebase coherent. By enforcing invariants rather than micromanaging implementations, we let agents ship fast without undermining the foundation. For example, we require Codex to parse data shapes at the boundary, but we are not prescriptive about how that happens (the model seems to favor Zod, but we didn’t mandate any particular library).
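The invariant can be sketched without committing to any library; the event shape below is hypothetical, and the point is only that untrusted input is parsed once, at the edge, so nothing inward ever builds on a guessed shape:

```typescript
// Hypothetical boundary type: only a validated, typed value flows inward.
type UserEvent = { userId: string; kind: "created" | "deleted" };

// Parse unknown input at the boundary, throwing on any malformed payload.
function parseUserEvent(payload: unknown): UserEvent {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("UserEvent must be an object");
  }
  const p = payload as Record<string, unknown>;
  if (typeof p.userId !== "string") throw new Error("userId must be a string");
  if (p.kind !== "created" && p.kind !== "deleted") {
    throw new Error('kind must be "created" or "deleted"');
  }
  return { userId: p.userId, kind: p.kind };
}
```

A schema library like Zod collapses this boilerplate into a declarative schema plus a `parse` call, which is presumably why the model gravitates toward it.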
Agents are most effective in environments with strict boundaries and predictable structure, so we built the application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions and a limited set of permissible edges. These constraints are enforced mechanically via custom linters (Codex-generated, of course!) and structural tests.
This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allow speed without decay or architectural drift.
In practice, we enforce these rules with custom linters and structural tests, plus a small set of “taste invariants.” For example, we statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements. Because the lints are custom, we write their error messages to inject remediation instructions into the agent’s context.
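A structural check of this kind can combine both ideas: a mechanically validated dependency graph, with error text written as next-step instructions for the agent. The layer names, allowed edges, and message wording here are invented for illustration:

```typescript
// Hypothetical layer graph: each layer lists the layers it may import from.
const allowedEdges: Record<string, readonly string[]> = {
  ui: ["domain"],
  domain: ["data"],
  data: [],
};

// Returns null when the edge is legal; otherwise an error message phrased as
// remediation instructions, so a failing lint run steers the agent's next step.
function checkDependencyEdge(fromLayer: string, toLayer: string): string | null {
  if (allowedEdges[fromLayer]?.includes(toLayer)) return null;
  const allowed = allowedEdges[fromLayer]?.join(", ") || "nothing";
  return (
    `Illegal dependency: ${fromLayer} -> ${toLayer}. ` +
    `Layer "${fromLayer}" may only import from: ${allowed}. ` +
    `Move the shared code into a lower layer, or expose it through an allowed edge.`
  );
}
```

A structural test then walks the real import graph and asserts that `checkDependencyEdge` returns null for every edge it finds.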
In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.
At the same time, we’re explicit about where constraints matter and where they do not. This resembles leading a large engineering platform organization: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams—or agents—significant freedom in how solutions are expressed.
The resulting code does not always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.
Human taste is fed back into the system continuously. Review comments, refactoring pull requests, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule into code.
Throughput changes the merge philosophy
As Codex’s throughput increased, many conventional engineering norms became counterproductive.
The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.
This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.
What “agent-generated” actually means
When we say the codebase is generated by Codex agents, we mean everything in the codebase.
Agents produce:
Product code and tests
CI configuration and release tooling
Internal developer tools
Documentation and design history
Evaluation harnesses
Review comments and responses
Scripts that manage the repository itself
Production dashboard definition files
Humans always remain in the loop, but at a different layer of abstraction than before. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.
Agents use our standard development tools directly. They pull review feedback, respond inline, push updates, and often squash and merge their own pull requests.
Increasing levels of autonomy
As more of the development loop was encoded directly into the system—testing, validation, review, feedback handling, and recovery—the repository recently crossed a meaningful threshold: Codex can now drive a new feature end-to-end.
Given a single prompt, the agent can now:
Validate the current state of the codebase
Reproduce a reported bug
Record a video demonstrating the failure
Implement a fix
Validate the fix by driving the application
Record a second video demonstrating the resolution
Open a pull request
Respond to agent and human feedback
Detect and remediate build failures
Escalate to a human only when judgment is required
Merge the change
This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment—at least, not yet.
Entropy and garbage collection
Full agent autonomy also introduces novel problems. Codex replicates patterns that already exist in the repository—even uneven or suboptimal ones. Over time, this inevitably leads to drift.
Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up “AI slop.” Unsurprisingly, that didn’t scale.
Instead, we started encoding what we call “golden principles” directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. For example: (1) we prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and (2) we don’t probe data “YOLO-style”—we validate boundaries or rely on typed SDKs so the agent can’t accidentally build on guessed shapes. On a regular cadence, a set of background Codex tasks scans for deviations, updates quality grades, and opens targeted refactoring pull requests. Most of these can be reviewed in under a minute and automerged.
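One way such a background scan could grade a file is shown below; the specific rules, the 500-line threshold, and the letter-grade scale are stand-ins for illustration, not the actual golden principles:

```typescript
interface FileReport {
  path: string;
  violations: string[];
  grade: "A" | "B" | "C";
}

// Check one file's source against a few mechanical stand-in rules and grade it;
// a real cleanup task would turn low-graded files into refactoring pull requests.
function gradeFile(path: string, source: string): FileReport {
  const violations: string[] = [];
  if (/console\.log\(/.test(source)) {
    violations.push("use the structured logger instead of console.log");
  }
  if (/:\s*any\b/.test(source)) {
    violations.push("avoid `any`; parse unknown data at the boundary");
  }
  if (source.split("\n").length > 500) {
    violations.push("file exceeds the 500-line limit; split by responsibility");
  }
  const grade = violations.length === 0 ? "A" : violations.length === 1 ? "B" : "C";
  return { path, violations, grade };
}
```

Because each violation message names the remediation, the report itself becomes usable agent context for the follow-up refactoring task.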
This functions like garbage collection. Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. This also lets us catch and resolve bad patterns daily, rather than letting them spread through the codebase for days or weeks.
What we’re still learning
This strategy has so far worked well up through internal launch and adoption at OpenAI. Building a real product for real users helped anchor our investments in reality and guide us towards long-term maintainability.
What we don’t yet know is how architectural coherence evolves over years in a fully agent-generated system. We’re still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also don’t know how this system will evolve as models continue to become more capable over time.
What’s become clear: building software still demands discipline, but the discipline shows up in the scaffolding more than in the code. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.
Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.
As agents like Codex take on larger portions of the software lifecycle, these questions will matter even more. We hope that sharing some early lessons helps you reason about where to invest your effort so you can just build things.