compound-agent 1.7.6 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. package/CHANGELOG.md +45 -1
  2. package/README.md +70 -47
  3. package/bin/ca +32 -0
  4. package/package.json +19 -78
  5. package/scripts/postinstall.cjs +221 -0
  6. package/dist/cli.d.ts +0 -1
  7. package/dist/cli.js +0 -13158
  8. package/dist/cli.js.map +0 -1
  9. package/dist/index.d.ts +0 -3730
  10. package/dist/index.js +0 -3240
  11. package/dist/index.js.map +0 -1
  12. package/docs/research/AgenticAiCodebaseGuide.md +0 -1206
  13. package/docs/research/BuildingACCompilerAnthropic.md +0 -116
  14. package/docs/research/HarnessEngineeringOpenAi.md +0 -220
  15. package/docs/research/code-review/systematic-review-methodology.md +0 -409
  16. package/docs/research/index.md +0 -76
  17. package/docs/research/learning-systems/knowledge-compounding-for-agents.md +0 -695
  18. package/docs/research/property-testing/property-based-testing-and-invariants.md +0 -742
  19. package/docs/research/scenario-testing/advanced-and-emerging.md +0 -470
  20. package/docs/research/scenario-testing/core-foundations.md +0 -507
  21. package/docs/research/scenario-testing/domain-specific-and-human-factors.md +0 -474
  22. package/docs/research/security/auth-patterns.md +0 -138
  23. package/docs/research/security/data-exposure.md +0 -185
  24. package/docs/research/security/dependency-security.md +0 -91
  25. package/docs/research/security/injection-patterns.md +0 -249
  26. package/docs/research/security/overview.md +0 -81
  27. package/docs/research/security/secrets-checklist.md +0 -92
  28. package/docs/research/security/secure-coding-failure.md +0 -297
  29. package/docs/research/software_architecture/01-science-of-decomposition.md +0 -615
  30. package/docs/research/software_architecture/02-architecture-under-uncertainty.md +0 -649
  31. package/docs/research/software_architecture/03-emergent-behavior-in-composed-systems.md +0 -644
  32. package/docs/research/spec_design/decision_theory_specifications_and_multi_criteria_tradeoffs.md +0 -0
  33. package/docs/research/spec_design/design_by_contract.md +0 -251
  34. package/docs/research/spec_design/domain_driven_design_strategic_modeling.md +0 -183
  35. package/docs/research/spec_design/formal_specification_methods.md +0 -161
  36. package/docs/research/spec_design/logic_and_proof_theory_under_the_curry_howard_correspondence.md +0 -250
  37. package/docs/research/spec_design/natural_language_formal_semantics_abuguity_in_specifications.md +0 -259
  38. package/docs/research/spec_design/requirements_engineering.md +0 -234
  39. package/docs/research/spec_design/systems_engineering_specifications_emergent_behavior_interface_contracts.md +0 -149
  40. package/docs/research/spec_design/what_is_this_about.md +0 -305
  41. package/docs/research/tdd/test-driven-development-methodology.md +0 -547
  42. package/docs/research/test-optimization-strategies.md +0 -401
  43. package/scripts/postinstall.mjs +0 -102
@@ -1,409 +0,0 @@
- # Systematic Code Review Methodology
- *PhD-Level Survey for Compound Agent Review Phase*
-
- ## Abstract
-
- Code review -- the systematic examination of source code by parties other than the author -- is among the most widely practiced quality assurance activities in software engineering. From Michael Fagan's formalized inspection procedure at IBM in 1976 to the lightweight, tool-mediated pull request workflows that dominate industry in 2026, code review has undergone fundamental transformations in mechanism, cadence, and purpose. Despite its near-universal adoption, the field lacks a unified methodological framework: different review approaches operate under different assumptions about what defects look like, how reviewers reason, what tools should assist them, and what "quality" means in context. This survey examines nine distinct approaches to systematic code review -- Fagan formal inspection, modern code review (MCR), defect taxonomy and severity classification, checklist-driven review and cognitive load management, automated static analysis as complement, multi-reviewer coordination and anti-patterns, time-bounded review strategies and reviewer fatigue, tool-assisted review platforms, and AI/LLM-assisted review -- synthesizing empirical findings from academic research and large-scale industrial studies spanning five decades. We present the landscape of approaches without prescribing adoption, identifying where evidence is strong, where it conflicts, and where significant gaps remain.
-
- ## 1. Introduction
-
- ### 1.1 Problem Statement
-
- Software defects discovered in production cost orders of magnitude more to fix than defects caught during development. Code review, when practiced effectively, is one of the most cost-efficient defect detection mechanisms available: Russell (1991) found that each hour of inspection prevented approximately 33 hours of subsequent maintenance work. IBM's internal data from Fagan's original studies showed defect removal rates exceeding 82% when formal inspection procedures were followed (Fagan, 1976). Yet despite five decades of research and near-universal industry adoption, code review remains poorly understood at the methodological level. Teams implement it differently, tools support it differently, and the empirical literature is fragmented across incompatible definitions of what review is, what it should accomplish, and how to measure success.
-
- The fundamental tension in code review methodology is between thoroughness and velocity. Formal inspection procedures maximize defect detection but impose significant process overhead. Modern lightweight review workflows minimize friction but sacrifice systematic coverage and consistency. Every methodological choice in code review -- how long to review, how many reviewers to involve, what tools to use, what to look for -- exists somewhere on this spectrum. This survey maps that spectrum.
-
- A secondary tension, revealed by large-scale empirical studies, is between stated and actual review outcomes. When Bacchelli and Bird (2013) asked Microsoft engineers why they conducted code review, 44% listed "finding defects" as their primary motivation. When the same researchers analyzed the content of 570 review sessions and 1,332 comments, they found that only 1 in 8 comments addressed actual defects. The dominant actual outcomes were knowledge transfer, team awareness, and identification of alternative approaches -- outcomes that are valuable but rarely cited as motivations. This gap between stated motivation and actual behavior is a recurring theme across the literature.
-
- ### 1.2 Scope
-
- This survey covers methodological approaches to systematic code review as practiced in software development organizations. We address:
-
- - **In scope**: Formal inspection procedures, modern tool-mediated code review, defect classification and taxonomy, cognitive load management in review, static analysis as review complement, reviewer coordination and assignment, temporal constraints on review, review platform design, and AI/ML-assisted review.
- - **Out of scope**: Automated testing (which is a distinct quality assurance mechanism), pair programming (which has a different social and temporal structure), design review (which precedes implementation rather than following it), and post-release code auditing for security (which operates under different incentive structures).
-
- ### 1.3 Key Definitions
-
- - **Code inspection (Fagan inspection)**: A formal, structured defect detection procedure with defined phases, roles, and entry/exit criteria, originating with Michael Fagan at IBM (1976). Also called "formal technical review."
- - **Modern Code Review (MCR)**: Tool-mediated, informal, lightweight, asynchronous code review as practiced in contemporary software development, typically via pull request workflows (Bacchelli and Bird, 2013).
- - **Review coverage**: The fraction of changed code that is examined during review, measured either as lines reviewed per lines changed or as fraction of changesets receiving at least one review comment.
- - **Inspection rate**: The speed at which code is examined during review, typically measured in lines of code per hour. Distinct from throughput (changesets reviewed per week).
- - **Defect density**: The number of defects per unit of code (per KLOC or per function point), used to evaluate the effectiveness of review at different configurations.
- - **Code review smell**: A pattern in review behavior that correlates with lower review effectiveness, analogous to code smells in production code (Dogan and Tuzun, 2022).
- - **Review anti-pattern (MCRA)**: A recurrent ineffective or counterproductive configuration in multi-reviewer code review assignments (Chouchen et al., 2021).
-
- ## 2. Foundations
-
- ### 2.1 Origins: Fagan Inspection and Formal Technical Review
-
- Michael Fagan's 1976 paper "Design and Code Inspections to Reduce Errors in Program Development" established the first systematic framework for peer review of software artifacts (Fagan, 1976). Developed at IBM Kingston, the method arose from Fagan's observation that defects were systematically introduced at predictable stages of development and could be detected most cost-efficiently at those same stages, before propagation to later phases.
-
- The original Fagan method defines six sequential phases:
-
- 1. **Planning**: The moderator verifies entry criteria are met, schedules the inspection, and assembles the team. Entry criteria include completeness of the artifact under review, availability of reference materials, and required preparation time.
- 2. **Overview**: The author presents the artifact to the inspection team, providing context, design rationale, and known issues. Typically 30-60 minutes. Not all instantiations of the method include this phase.
- 3. **Preparation**: Each inspector independently examines the artifact before the group meeting, using checklists and their domain knowledge to identify potential defects. This phase is critical -- the majority of defects found in formal inspection are found during preparation, not the group meeting (Fagan, 1976; Ackerman et al., 1989).
- 4. **Inspection meeting**: The group convenes to review the artifact, guided by a reader who paraphrases the code in the author's absence (in some instantiations) or with the author present as a resource but not a defender. The recorder documents defects. The moderator manages pace and focus. Defects are logged but not fixed during the meeting.
- 5. **Rework**: The author corrects defects identified in the inspection log. No group oversight is required for rework unless defects are systemic.
- 6. **Follow-up**: The moderator verifies that all defects have been addressed, either personally or via a re-inspection if rework is extensive.
-
- Five roles are defined: **moderator** (manages process, is not the author), **author** (passive during inspection, resource for clarification), **reader** (paraphrases code to force deliberate examination), **recorder** (documents defects), and **inspector** (domain expert who identifies defects). The moderator and reader are deliberately separated from the author to prevent defensive dynamics.
-
- The empirical foundation for Fagan inspection is strong by software engineering standards. IBM's own data showed inspection defect detection rates of 60-82% depending on the artifact type (Fagan, 1976). Russell (1991) synthesized IBM data across multiple projects to find that each hour of inspection prevented approximately 33 hours of later maintenance work, an ROI figure that has been widely cited but rarely replicated in controlled settings. SmartBear Software's analysis of Cisco's code review data (2006) found that inspection rates above 500 LOC/hour were associated with significantly reduced defect detection -- inspectors at higher rates found substantially fewer defects per line reviewed, suggesting cognitive saturation.
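To make the cited figures concrete, a back-of-the-envelope sketch (the workload inputs are invented for illustration; the 33:1 ratio and the 500 LOC/hour ceiling are the figures quoted above):

```python
# Illustrative arithmetic only: combines the 33:1 maintenance-to-inspection
# ratio attributed to Russell (1991) with the ~500 LOC/hour ceiling from the
# SmartBear/Cisco data. The workload numbers are made up for the example.
loc_to_inspect = 10_000
rate_loc_per_hour = 400  # deliberately below the 500 LOC/hour degradation point

inspection_hours = loc_to_inspect / rate_loc_per_hour  # 25.0 hours of inspection
maintenance_hours_avoided = inspection_hours * 33      # 825.0 hours, per the 33:1 ratio
```

Even with conservative inputs, the implied return is large, which is why the ratio is so widely cited despite weak replication.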
-
- ### 2.2 The Empirical Turn: Modern Code Review Research
-
- The widespread adoption of distributed version control systems and web-based code hosting platforms (GitHub, GitLab, Gerrit, Phabricator) in the 2000s enabled a new form of code review -- asynchronous, lightweight, tool-mediated, and documented in machine-readable form. This created a new empirical opportunity: rather than studying review in controlled laboratory conditions, researchers could mine the artifacts of millions of actual review sessions.
-
- The landmark study in this tradition is Bacchelli and Bird (2013), who analyzed 570 code review sessions at Microsoft, combining quantitative analysis of review artifacts with qualitative interviews of 17 developers. Their central finding -- that the stated motivation for code review (defect detection) diverges substantially from the actual content of review comments (knowledge transfer, style, documentation, design alternatives) -- reframed the field's understanding of what code review is for. Subsequent large-scale studies confirmed and extended this finding: Beller et al. (2014) found that 75% of review comments addressed evolvability concerns (maintainability, readability, design) rather than functional defects.
-
- Sadowski et al. (2018) provided the most comprehensive view of MCR at scale, analyzing approximately 9 million code reviews at Google over several years. Their key findings:
- - 80%+ of reviews completed in a single iteration (reviewer approves after one round of comments and author responses)
- - Median review latency under 4 hours for active reviewers
- - 75% of changesets involve fewer than 10 files
- - Review velocity is highly variable: a small fraction of authors produce most changesets, and a small fraction of reviewers review most code
-
- Rigby and Bird (2013) systematically compared code review practices across five open-source projects (Apache, Linux, Android, KDE, Book reader) and two Microsoft products, finding convergent evolution toward similar review configurations: small changesets (median 45-78 LOC), few reviewers (2-3), short review cycles (hours to days), and reviewer specialization by module area.
-
- ### 2.3 Cognitive Foundations of Code Review
-
- Understanding why code review succeeds or fails at defect detection requires a cognitive model of how reviewers read and evaluate code. Do et al. (2022) conducted controlled experiments comparing explicit review strategies (using checklists, reading code systematically) against implicit strategies (reading code as developers normally would), finding that explicit strategies significantly improved defect detection but required training and imposed cognitive overhead.
-
- The cognitive model underlying effective code review draws on three constructs:
-
- - **Working memory load**: Code review requires holding the context of the code being reviewed (what it does, what it should do, what invariants it must preserve) simultaneously with the evaluation task (does this code do what it should?). Working memory is a severely limited resource (Miller, 1956: 7 +/- 2 items; more recent estimates: 4 +/- 1 chunks, Cowan, 2001). Complex code exhausts working memory faster, reducing defect detection.
- - **Change blindness**: Reviewers are susceptible to missing defects that are locally consistent but globally incorrect -- code that "looks right" within a few lines but violates a constraint that requires holding the broader system context. This is the cognitive mechanism underlying the effectiveness of checklists: they externalize the system constraints that would otherwise have to be held in working memory.
- - **Satisficing**: Human reviewers tend to stop looking for defects once they have found a few, even if more remain. This is a rational heuristic under time pressure but systematically underestimates defect density in complex code (Votta, 1993).
-
- The Recognition-Primed Decision (RPD) model (Klein, 1993, 1998), developed to explain expert decision-making in time-pressured environments, has been applied to code review to explain why experienced reviewers outperform novices at defect detection: experts have richer mental models of typical code patterns and typical defect patterns, enabling pattern-matching rather than exhaustive analysis.
-
- ## 3. Taxonomy of Approaches
-
- The following taxonomy organizes code review methodologies along two axes: **process formality** (from highly structured formal inspection to fully informal peer review) and **automation level** (from entirely human-driven to predominantly automated analysis).
-
- | Approach | Process Formality | Automation Level | Primary Defect Target | Key Trade-off |
- |---|---|---|---|---|
- | Fagan Formal Inspection | Very High | Low (checklists) | Functional defects, logic errors | Thoroughness vs. overhead |
- | Modern Code Review (MCR) | Low | Moderate (tool support) | Mixed (knowledge transfer dominates) | Velocity vs. consistency |
- | Defect Taxonomy & Classification | Moderate (classification schema) | Low-Moderate | All defect types (structured) | Coverage vs. classification cost |
- | Checklist-Driven Review | Moderate | Low-Moderate | Targeted defect types | Consistency vs. blindness |
- | Static Analysis Complement | Low (human review + tool) | Very High | Syntactic, style, pattern defects | Coverage vs. false positive rate |
- | Multi-Reviewer Coordination | Low-Moderate | Moderate (recommendation) | All types (via expertise matching) | Coverage vs. coordination cost |
- | Time-Bounded Strategies | Low-Moderate | Low (time tracking) | All types (fatigue management) | Quality vs. throughput |
- | Tool-Assisted Platforms | Low (platform-defined) | High (workflow automation) | Platform-dependent | Integration vs. tool overhead |
- | AI/LLM-Assisted Review | Low-Moderate | Very High | Pattern defects, style (emerging: semantic) | Scale vs. hallucination risk |
-
- ## 4. Analysis
-
- ### 4.1 Fagan Formal Inspection
-
- **Theory and mechanism.** Fagan formal inspection treats defect detection as a production process with measurable inputs (preparation time, inspection rate, checklist completeness), measurable outputs (defects found, defects per KLOC), and controllable parameters (team size, session duration, entry/exit criteria). The six-phase procedure enforces separation between defect identification (inspection meeting) and defect correction (rework), preventing the common failure mode where group meetings devolve into debugging sessions that consume time without systematically examining the artifact.
-
- The mechanism by which formal inspection outperforms informal review at defect detection is well-understood: (1) mandatory preparation ensures each reviewer has examined the full artifact independently, rather than letting group dynamics concentrate attention on early sections; (2) the reader role forces deliberate line-by-line examination that normal reading skips; (3) the absence of the author from the defect-identification role prevents defensive responses that suppress defect reporting; (4) documented entry and exit criteria prevent premature approval of artifacts with known quality problems.
-
- **Literature evidence.** Fagan's original IBM data showed detection of 60-82% of design and coding defects before unit testing (Fagan, 1976). Across multiple projects, inspection consistently found 3.5x more defects per unit time than testing. Russell (1991) synthesized data from 11 IBM projects to find that inspection discovered 93% of all defects found during the entire software lifecycle, with each inspection hour preventing an average of 33 maintenance hours. Ackerman et al. (1989) at the Software Engineering Institute found that preparation was the most defect-productive phase, with inspectors finding 75-90% of the defects they would eventually log during individual preparation rather than the group meeting. This finding has important implications for meeting design: the group meeting's primary value is cross-inspection (catching defects that individual inspectors missed) rather than primary examination.
-
- The SmartBear/Cisco study (2006) provided the most cited quantitative constraints on effective inspection: reviews above 300 LOC produced significantly reduced defect detection density; inspection rates above 500 LOC/hour showed similar degradation; optimal session duration was 30-60 minutes, with effectiveness declining sharply after 90 minutes. These constraints are consistent with the cognitive working memory model: complex code exhausts working memory within minutes, and sustained attention degrades after approximately 60 minutes of focused cognitive work.
-
- **Implementations.** Commercial inspection management tools (Reiview, Review Board in inspection mode, SoftTech Inspection Tool) implement the full Fagan procedure including phase tracking, checklist management, and defect logging. In practice, most organizations implement "inspection-lite" variants that omit one or more phases (most commonly the reader role and/or the formal overview) while retaining the pre-meeting preparation and structured defect logging.
-
- **Strengths and limitations.** Strengths: highest documented defect detection rates among human-review approaches; process accountability via documented entry/exit criteria; separation of defect identification from correction prevents meeting overhead. Limitations: high process overhead (Fagan estimates 10-15% of development budget for thorough inspection); cultural resistance from developers accustomed to informal peer review; the reader role requires significant overhead and is the most commonly eliminated phase, reducing effectiveness; scaling to distributed teams is difficult because the synchronous group meeting is central to the method; the optimal configuration (team size, inspection rate, preparation time) varies by artifact type and defect density in ways that are difficult to calibrate.
-
- ### 4.2 Modern Code Review (MCR)
-
- **Theory and mechanism.** Modern code review, as defined by Bacchelli and Bird (2013), is "the practice of having other team members review code changes before they are integrated into the main codebase, using specialized tools that provide a streamlined interface for browsing diffs and commenting on specific lines." Unlike formal inspection, MCR has no prescribed phases, roles, or entry/exit criteria. The review is initiated by the author (via a pull request or changeset submission), conducted asynchronously by one or more reviewers, and concluded when reviewers approve the change (or the author addresses review feedback to reviewers' satisfaction).
-
- The theoretical basis for MCR's effectiveness differs from formal inspection: rather than a systematic process for finding defects, MCR functions as a social and technical coordination mechanism. Reviewers bring domain expertise, institutional memory, and coding standards knowledge to evaluate whether a change is safe to integrate -- where "safe" includes correctness but also maintainability, consistency, and alignment with team conventions. The informal, conversational format enables knowledge transfer and shared ownership that formal inspection's structured procedure does not.
-
- **Literature evidence.** Bacchelli and Bird (2013) analyzed 570 code review sessions at Microsoft with 17 developer interviews, finding that the actual content of review comments was dominated by understanding (33%), evolvability (25%), and style (16%), with defect-related comments comprising only 12% of the total. When asked their primary motivation, 44% of developers cited "finding defects" -- creating the "motivation-outcome gap" that has driven much subsequent research.
-
- Sadowski et al. (2018) provided the most comprehensive MCR dataset: approximately 9 million reviews at Google across multiple years. Key findings relevant to methodology:
- - Review velocity: median latency under 4 hours for reviews completed in a single iteration
- - Changeset size: 75th percentile of changesets involve fewer than 10 files; the authors argue that small changeset size is both a cause and consequence of effective MCR
- - Reviewer load: highly skewed; a small fraction of developers review the majority of code, creating bottlenecks and knowledge concentration risks
- - Completion rate: 80%+ of reviews complete in one iteration (author addresses all feedback in a single round), suggesting that most MCR conversations converge quickly
-
- Rigby and Bird (2013) compared MCR practices across 7 projects (5 open-source, 2 Microsoft) and found convergent patterns: optimal changeset size of 45-78 LOC, 2-3 reviewers, review cycles of hours to days. This convergence suggests that there are natural efficiency constraints that drive review practices toward similar configurations regardless of organizational context or tool choice.
-
- Yang et al. (2024) synthesized 231 papers on MCR into a comprehensive taxonomy distinguishing understanding studies (what happens in review) from improvement techniques (how to make review better). Their taxonomy identified six major research directions: changeset characterization, reviewer recommendation, review comment quality, tool support, review outcome prediction, and review automation.
-
- **Implementations.** MCR is predominantly implemented via one of four platform types:
- - **Pre-commit gatekeeping** (Gerrit, Phabricator Differential): changes cannot be merged without explicit approval
- - **Post-commit review** (early GitHub PR model): changes merge to a branch pending review
- - **Continuous review** (Google Critique): review is an ongoing process tied to code ownership rather than changeset boundaries
- - **Hybrid** (modern GitHub PR): pre-merge gate with sophisticated tooling for discussion, CI integration, and assignment
-
- **Strengths and limitations.** Strengths: low friction, minimal process overhead; asynchronous operation enables distributed teams; tool integration with CI/CD provides automatic testing feedback during review; informal format enables knowledge transfer and mentorship; scales to large teams. Limitations: highly variable quality with no process guarantees; defect detection is a secondary outcome rather than a primary mechanism; reviewer load distribution is typically skewed, creating bottlenecks; review quality depends heavily on reviewer expertise and motivation; no formal entry/exit criteria can lead to rubber-stamping; absence of structured preparation means reviewers typically examine code sequentially rather than systematically.
-
- ### 4.3 Defect Taxonomy and Severity Classification
-
- **Theory and mechanism.** Defect taxonomy is the systematic classification of code defects into types and subtypes, enabling targeted review strategies (checklists organized by defect type), severity prioritization (which defects to fix before merging), and defect analysis (what kinds of defects are most common in what contexts). A well-designed taxonomy enables reviewers to search for specific defect types rather than conducting unfocused examination, reducing cognitive load and improving detection rates for targeted defect classes.
-
- The theoretical foundation is in software defect taxonomies developed in the 1970s-1990s for formal inspection contexts (Fagan, 1976; Chillarege et al., 1992, Orthogonal Defect Classification). In the MCR context, taxonomy is adapted to classify the types of concerns that reviewers actually raise -- which, as Beller et al. (2014) showed, are predominantly evolvability concerns rather than functional defects.
-
- **Literature evidence.** The foundational taxonomy work for contemporary MCR is Mantyla and Lassenius (2009), who analyzed 20+ years of defect classification literature and proposed a two-level taxonomy distinguishing:
- - **Functional defects**: incorrect behavior, missing functionality, interface errors, timing/synchronization errors
- - **Evolvability defects**: maintainability issues (duplication, coupling, complexity), portability issues, documentation gaps, style violations
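Read as a classification schema, the two-level structure might be encoded as data, as in this sketch (subtype names paraphrase the categories above; the `classify` helper is hypothetical):

```python
from enum import Enum

class DefectClass(Enum):
    FUNCTIONAL = "functional"
    EVOLVABILITY = "evolvability"

# Second level: subtypes mapped to their top-level class, following the
# Mantyla and Lassenius (2009) split described above. Names are paraphrased.
SUBTYPE_TO_CLASS = {
    "incorrect_behavior":     DefectClass.FUNCTIONAL,
    "missing_functionality":  DefectClass.FUNCTIONAL,
    "interface_error":        DefectClass.FUNCTIONAL,
    "timing_synchronization": DefectClass.FUNCTIONAL,
    "duplication":            DefectClass.EVOLVABILITY,
    "coupling":               DefectClass.EVOLVABILITY,
    "complexity":             DefectClass.EVOLVABILITY,
    "portability":            DefectClass.EVOLVABILITY,
    "documentation_gap":      DefectClass.EVOLVABILITY,
    "style_violation":        DefectClass.EVOLVABILITY,
}

def classify(subtype: str) -> DefectClass:
    """Map a logged defect subtype to its top-level class."""
    return SUBTYPE_TO_CLASS[subtype]
```

Encoding the taxonomy as data is what makes the post-review trend analysis described below cheap: counting entries per top-level class is a one-line aggregation.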
-
- Beller et al. (2014) analyzed 1.5 million review comments across 11 open-source projects to find that approximately 75% of comments addressed evolvability concerns (as defined by Mantyla and Lassenius) rather than functional defects. This ratio is remarkably consistent across projects, suggesting a fundamental characteristic of MCR rather than a project-specific phenomenon.
-
- Bosu et al. (2015) analyzed 1.5 million code review comments at Microsoft and classified them by usefulness from the author's perspective, finding that approximately 38% of comments were considered "useful" by authors (leading to code changes), 30% were informational (acknowledged but no code change), and 32% were ineffective (not acted upon). Severity classification within these categories showed that high-severity defect comments had higher usefulness rates but lower frequency, while style comments had high frequency but low usefulness rates per comment.
-
- **Implementations.** Defect taxonomy in practice takes three forms:
- - **Checklist-based classification**: Reviewers use checklists organized by defect type, implicitly applying taxonomy during review
- - **Post-review classification**: Defects found during review are categorized in an issue tracker or defect logging system, enabling trend analysis
- - **Tool-embedded classification**: Review platforms (e.g., CodeScene, SonarQube) automatically classify detected issues by type
-
- **Strengths and limitations.** Strengths: enables targeted review strategies; provides data for review process improvement; allows severity-based prioritization of defect fixes; enables longitudinal trend analysis of defect distribution. Limitations: classification is labor-intensive and introduces inter-rater disagreement (Mantyla and Lassenius, 2009 report Cohen's kappa of 0.5-0.7 for defect type classification, indicating moderate but imperfect agreement); classification of "real" defects (functional correctness issues) versus "style" concerns involves subjective judgments that vary by team and reviewer; taxonomy frameworks designed for one context may not transfer cleanly to modern web application review; adding classification overhead to review reduces throughput without necessarily improving defect detection.
-
- ### 4.4 Checklist-Driven Review and Cognitive Load Management
-
- **Theory and mechanism.** Checklist-driven code review addresses a well-documented failure mode in human quality assurance: reviewers miss defects not because they lack the expertise to recognize them but because cognitive load prevents them from simultaneously holding domain knowledge, system context, and examination focus. Checklists externalize the "what to look for" component of review, freeing working memory for the "how does this code behave" reasoning required to apply that knowledge.
-
- The mechanism draws directly from the cognitive model in Section 2.3: checklists are an external memory system that compensates for working memory limits. A reviewer with a checklist item "Does this function handle null inputs?" is more likely to examine null handling than a reviewer relying on unaided memory, not because the unaided reviewer lacks knowledge of null handling but because without the explicit prompt, null handling may not surface as a salient concern given the cognitive demands of the rest of the review.
164
-
165
- **Literature evidence.** Do et al. (2022) conducted controlled experiments comparing explicit review strategies (structured checklists) against implicit strategies (unguided review by experienced developers) on a set of Java programs with known defects. Explicit strategy users found significantly more defects (p < 0.05) but took longer per review. The improvement was largest for defect types that were specifically listed on the checklist, consistent with the cognitive externalization mechanism. A follow-up study found that brief (5-10 item) targeted checklists outperformed comprehensive (20+ item) checklists, apparently because long checklists increase cognitive load rather than reducing it.
166
-
167
- The SmartBear/Cisco study (2006) provides the most cited quantitative evidence for cognitive load effects on review effectiveness. Analyzing review data from 2,500 code reviews across Cisco's communication division, they found:
168
- - Reviews of more than 300 LOC showed significantly reduced defect detection density
169
- - Inspection rates above 500 LOC/hour showed significantly reduced defect detection rate
170
- - Review sessions longer than 90 minutes showed significantly reduced effectiveness per additional minute
171
- - Reviewers who spent less than 30 minutes on a review found significantly fewer defects than those spending 30-60 minutes
172
-
173
- These thresholds are consistent with the cognitive model: complex code exceeds working memory within 10-15 minutes of focused reading, and sustained attention fatigues after 60-90 minutes regardless of expertise.
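
These thresholds are simple enough to enforce mechanically before a review starts. A minimal guardrail sketch encoding the Cisco numbers (the function name and warning wording are our own, not from any published tool):

```python
def review_guardrails(loc, minutes):
    """Warn when a planned review violates the SmartBear/Cisco thresholds."""
    warnings = []
    if loc > 300:
        warnings.append("changeset exceeds 300 LOC: consider splitting before review")
    rate = loc / (minutes / 60)  # implied inspection rate in LOC/hour
    if rate > 500:
        warnings.append(f"inspection rate {rate:.0f} LOC/hour exceeds 500")
    if minutes > 90:
        warnings.append("session exceeds 90 minutes: schedule a break")
    if minutes < 30:
        warnings.append("under 30 minutes: too brief for thorough review")
    return warnings

# A 450 LOC change squeezed into 20 minutes trips three thresholds at once
print(review_guardrails(loc=450, minutes=20))
```

A check like this fits naturally in a pre-review bot or PR template hook; a 200 LOC change reviewed over 45 minutes passes cleanly.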

**Implementations.** Checklist-driven review is implemented at three levels:
- **Team-curated checklists**: Maintained in documentation, linked from PR templates (GitHub PR template checklists, GitLab MR description templates)
- **Tool-embedded checklists**: Review platforms with configurable checklist items that reviewers must explicitly check before approving
- **Automated checklist execution**: Static analysis and linting tools that automatically verify checklist items that can be mechanically checked, reducing cognitive load for verifiable items while preserving human judgment for non-verifiable items
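
The third level can be sketched as a table of mechanically verifiable predicates run against a changeset; anything that cannot be expressed as a predicate stays on the human checklist. The item names and changeset fields below are illustrative, not drawn from any specific tool:

```python
# Mechanically verifiable checklist items as predicates over a changeset dict.
CHECKS = {
    "changeset under 400 LOC": lambda cs: cs["loc"] <= 400,
    "tests touched alongside source": lambda cs: any(
        p.startswith("tests/") for p in cs["files"]
    ),
    "no leftover debug prints": lambda cs: "print(" not in cs["added_code"],
}

def run_checklist(changeset):
    """Evaluate every automated checklist item; humans handle the rest."""
    return {name: check(changeset) for name, check in CHECKS.items()}

cs = {"loc": 120,
      "files": ["src/auth.py", "tests/test_auth.py"],
      "added_code": "def verify(token): return token is not None"}
print(run_checklist(cs))
```

Because these items are checked by machine, the human reviewer's working memory is reserved for the semantic questions the lambdas cannot express.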

**Strengths and limitations.** Strengths: measurably improves defect detection for checklist-targeted defect types; creates shared review standards that reduce inter-reviewer variability; can be partially automated for verifiable items; brief, targeted checklists have low process overhead. Limitations: checklist blindness -- reviewers who work through a checklist mechanically without engagement detect fewer defects than engaged reviewers without checklists; checklists are context-specific and require maintenance as codebases evolve; comprehensive checklists increase cognitive load rather than reducing it; checklists do not help with defect types not covered by checklist items, potentially creating false confidence.

### 4.5 Automated Static Analysis as Complement to Human Review

**Theory and mechanism.** Static analysis tools examine source code without executing it, applying rule-based, pattern-matching, or formal methods techniques to identify defects, style violations, security vulnerabilities, and code quality issues. The complementarity hypothesis holds that static analysis and human review detect different classes of defects: static analysis excels at syntactic patterns, type safety, resource management, and known vulnerability patterns that are reliably identifiable from code structure; human review excels at semantic correctness, design quality, and context-dependent concerns that require understanding the code's purpose.

**Literature evidence.** Lenarduzzi et al. (2022) conducted one of the most rigorous evaluations of static analysis complementarity, analyzing how five static analysis tools (SonarQube, PMD, Checkstyle, FindBugs/SpotBugs, ErrorProne) performed on 16 open-source Java projects. Key findings:
- Low inter-tool agreement: the five tools agreed on fewer than 20% of issues, suggesting they identify substantially different issue sets
- Low human-SA overlap: issues identified by human reviewers overlapped with SA-identified issues in only 10-15% of cases
- Complementarity confirmed: combining SA with human review found more total issues than either alone
- False positive rates varied dramatically: from 10% (ErrorProne on type errors) to 60%+ (PMD on code style)

Johnson et al. (2013) found that developers' primary complaint about static analysis tools was false positives: when false positive rates exceeded approximately 20%, developers began habitually dismissing SA warnings without reading them, eliminating the tool's value.
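
Inter-tool agreement of the kind Lenarduzzi et al. measured can be illustrated as set overlap of flagged (file, line) locations -- low overlap is exactly what makes tool combination worthwhile. The tool names and findings here are invented:

```python
# Pairwise agreement between two static analysis tools, measured as
# Jaccard overlap of the (file, line) locations each one flags.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

findings = {
    "tool_x": {("auth.py", 10), ("auth.py", 42), ("db.py", 7)},
    "tool_y": {("auth.py", 42), ("db.py", 99), ("ui.py", 3)},
}
overlap = jaccard(findings["tool_x"], findings["tool_y"])
print(f"agreement: {overlap:.0%}")  # → agreement: 20%

# Low overlap means the union catches more than any single tool.
combined = findings["tool_x"] | findings["tool_y"]
print(f"union finds {len(combined)} issues; "
      f"best single tool finds {max(len(s) for s in findings.values())}")
```

The same arithmetic supports the complementarity finding: the lower the pairwise agreement, the more the combined issue set exceeds any individual tool's.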

**Implementations.** The static analysis ecosystem is extensive:
- **General-purpose linters**: ESLint (JavaScript), Pylint/Flake8/Ruff (Python), Clippy (Rust), SpotBugs (Java)
- **Semantic analysis**: SonarQube (multi-language), CodeClimate (cloud-hosted), DeepSource
- **Security-focused**: Semgrep (pattern-matching, SAST), Snyk (dependency and code), Veracode, Checkmarx
- **Formal verification**: Infer (Facebook/Meta, bi-abduction), Coverity (interprocedural analysis)
- **CI/CD integrated**: All major review platforms support SA tool integration that surfaces issues as review comments

**Strengths and limitations.** Strengths: scales to unlimited code volume with no reviewer fatigue; consistent application of rules across all code; catches specific defect classes more reliably than human review; low marginal cost per additional line reviewed. Limitations: high false positive rates reduce trust and adoption; low inter-tool agreement means no single tool provides comprehensive coverage; misses semantic defects that require understanding the code's purpose; cannot evaluate evolvability, maintainability, or code clarity -- the concerns that dominate human review.

### 4.6 Multi-Reviewer Coordination, Reviewer Recommendation, and Anti-Patterns

**Theory and mechanism.** The effectiveness of code review depends not only on how review is conducted but on who conducts it. Reviewer selection affects defect detection through domain expertise (reviewers familiar with the changed code area detect more defects), review workload distribution (overloaded reviewers produce lower-quality reviews), and review latency (unavailable reviewers create bottlenecks). Multi-reviewer coordination addresses these challenges through reviewer recommendation algorithms, workload balancing, and explicit management of reviewer assignment policies.

**Literature evidence.** McIntosh et al. (2014) analyzed 9 releases of the Qt and VTK projects and found that review coverage (the fraction of commits receiving review) strongly predicted post-release defect density: commits that bypassed review had 2.0-2.6x higher defect density than reviewed commits. In a follow-up study, McIntosh et al. (2016) extended the analysis to three industrial projects and found that reviewer participation quality (measured by reviewer familiarity with the changed module) was an even stronger predictor: low participation quality was associated with approximately 5x higher defect density than high participation quality.

Bosu et al. (2015) analyzed 1.5 million code review comments from Microsoft projects and found that reviewer expertise (measured by previous experience with the changed component) was the strongest predictor of comment usefulness. Experienced reviewers provided useful comments at 2x the rate of inexperienced reviewers.

Chouchen et al. (2021) defined the MCRA (Modern Code Review Anti-patterns) taxonomy, identifying five recurrent anti-patterns:
1. **Ping-pong**: Repeated back-and-forth between reviewer and author without convergence
2. **Sleeping reviews**: Reviews with no activity for extended periods
3. **Rubber-stamping**: Approvals without meaningful engagement
4. **Many small patches**: Highly fragmented changesets that impose review overhead disproportionate to their individual complexity
5. **Large changesets**: Oversized reviews that exceed reviewers' cognitive capacity

Dogan and Tuzun (2022) identified seven "code review smells" through a survey of 240 developers and analysis of review histories at two large companies. The smells -- including "Review Me Twice," "Reviewer Hoarding," "Lack of Consensus," and "Missing Context" -- were found in 72.2% of analyzed reviews, suggesting these are endemic phenomena rather than edge cases.
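
Several of these anti-patterns are detectable directly from a review's event log. A heuristic sketch for two of them; the thresholds (14 idle days, 4 role alternations) are illustrative choices, not values from the MCRA paper:

```python
from datetime import datetime, timedelta

def detect_anti_patterns(events, now):
    """Flag 'sleeping review' and 'ping-pong' from a review's event timeline."""
    flags = []
    last_activity = max(e["at"] for e in events)
    if now - last_activity > timedelta(days=14):
        flags.append("sleeping review")
    # Ping-pong: repeated author/reviewer alternation without convergence
    roles = [e["role"] for e in events]
    alternations = sum(1 for prev, cur in zip(roles, roles[1:]) if prev != cur)
    if alternations >= 4:
        flags.append("ping-pong")
    return flags

events = [
    {"role": "author",   "at": datetime(2026, 1, 1)},
    {"role": "reviewer", "at": datetime(2026, 1, 2)},
    {"role": "author",   "at": datetime(2026, 1, 3)},
    {"role": "reviewer", "at": datetime(2026, 1, 4)},
    {"role": "author",   "at": datetime(2026, 1, 5)},
]
print(detect_anti_patterns(events, now=datetime(2026, 2, 1)))
```

Run against real platform webhooks, checks like these give teams early warning before an anti-pattern becomes a stalled or rubber-stamped merge.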

**Implementations.**
- **REVFINDER** (Thongtanunam et al., 2015): file path similarity-based recommendation
- **WhoReview** and similar expertise-based tools: use code ownership metrics (git blame data) to recommend experts
- **Platform-native**: GitHub CODEOWNERS files, Gerrit's automatic reviewer suggestion, GitLab's code owners feature
- **Workload tracking**: Tools that track reviewer queue depth and availability to prevent bottlenecks
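
REVFINDER's core idea -- score candidates by similarity between the changed file paths and paths they previously reviewed -- fits in a few lines. This sketch uses shared leading path components as a rough stand-in for the string-comparison functions in the paper, and the reviewer histories are made up:

```python
def path_similarity(p1, p2):
    """Fraction of path components shared position-by-position."""
    a, b = p1.split("/"), p2.split("/")
    shared = sum(1 for x, y in zip(a, b) if x == y)
    return shared / max(len(a), len(b))

def recommend(changed_files, history):
    """Rank past reviewers by path similarity to the new changeset."""
    scores = {
        reviewer: sum(path_similarity(c, r)
                      for c in changed_files for r in reviewed)
        for reviewer, reviewed in history.items()
    }
    return max(scores, key=scores.get)

history = {
    "alice": ["src/net/http/client.py", "src/net/http/pool.py"],
    "bob":   ["docs/intro.md", "tools/release.py"],
}
print(recommend(["src/net/http/retry.py"], history))  # → alice
```

Note the circularity concern raised in Section 6.3 applies here: the recommender can only re-surface whoever reviewed similar paths before.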

**Strengths and limitations.** Strengths: expertise matching demonstrably improves review quality; automated recommendation reduces the overhead of manual reviewer selection; workload balancing prevents bottleneck-induced latency. Limitations: historical patterns may not predict future expertise; recommendation algorithms may concentrate review load on highly effective reviewers, creating the bottleneck they aim to prevent; over-optimization for reviewer expertise may neglect knowledge spreading.

### 4.7 Time-Bounded Review Strategies and Reviewer Fatigue

**Theory and mechanism.** Reviewer fatigue is the degradation of review effectiveness over time within a single session and over review load accumulation across sessions. The cognitive basis is well-established: sustained attention degrades after 60-90 minutes of demanding cognitive work (Warm et al., 2008), and working memory resources are not fully restored between reviews without rest periods. Time-bounded review strategies impose explicit constraints on session duration and review rate to prevent fatigue-induced quality degradation.

**Literature evidence.** Bacchelli and Bird (2013) found that median review latency at Microsoft was approximately 24 hours, with significant variation. Reviews completed within 4 hours tended to be higher quality (as measured by resolution of review comments), while reviews delayed more than 24 hours showed decreased engagement.

Meta's Nudgebot intervention (2022) provides a rare controlled experiment on review timing. Meta deployed an automated system that sent reminders to reviewers who had pending reviews exceeding a time threshold. The experiment found that Nudgebot reduced mean review latency by approximately 40% without degrading review quality, suggesting that much review delay is caused by reviewer procrastination rather than review complexity or cognitive fatigue.

Baysal et al. (2016) analyzed the impact of reviewer workload on review quality at IBM, finding that reviewers with more than 8 concurrent pending reviews produced lower-quality reviews than those with fewer than 4 concurrent reviews. The relationship was roughly linear in the range of 4-12 concurrent reviews, with a steep dropoff above 12.
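
An assignment rule following Baysal et al.'s finding would cap concurrent reviews before optimizing for expertise. A sketch with invented expertise scores and queue depths (the threshold of 8 comes from the workload finding above):

```python
OVERLOAD_THRESHOLD = 8  # pending reviews beyond which quality drops off

def assign(candidates):
    """Pick the most expert reviewer who is not already overloaded."""
    eligible = [c for c in candidates if c["pending"] < OVERLOAD_THRESHOLD]
    if not eligible:
        return None  # escalate: the whole pool is overloaded
    return max(eligible, key=lambda c: c["expertise"])["name"]

pool = [
    {"name": "alice", "expertise": 0.9, "pending": 11},  # expert but overloaded
    {"name": "bob",   "expertise": 0.6, "pending": 3},
    {"name": "carol", "expertise": 0.4, "pending": 1},
]
print(assign(pool))  # → bob
```

This makes the trade-off in Section 4.6's limitations concrete: the pure-expertise choice (alice) is exactly the one the workload cap rules out.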

**Implementations.**
- **Platform-level constraints**: Review platform configuration to flag or block reviews exceeding LOC or file count thresholds
- **Automated nudges**: Systems like Meta's Nudgebot that send reminders to reviewers with pending review queues
- **Time-box protocols**: Team conventions specifying maximum review duration and minimum rest time between review sessions
- **Workload dashboards**: Visibility into reviewer queue depth to enable proactive load balancing
- **Reviewer rotation**: Policies that rotate review assignments to prevent chronic overload

**Strengths and limitations.** Strengths: directly addresses the cognitive fatigue mechanism that degrades review quality; small changeset policies address both fatigue and the finding that large reviews are disproportionately rubber-stamped; automated nudge systems demonstrably reduce latency without quality cost. Limitations: time bounds are approximate constraints on cognitive processes that vary significantly across individuals; organizational cultures that optimize for throughput may resist time-bound constraints; fatigue effects interact with expertise in ways that simple time bounds do not capture.

### 4.8 Tool-Assisted Code Review Platforms

**Theory and mechanism.** Code review platforms structure the review interaction through tool design choices that affect review behavior: whether approval is pre-commit or post-commit, how comments are displayed relative to diff context, whether inline comments on specific lines are supported, how review state is represented, and how the platform integrates with CI/CD pipelines. Platform design is not neutral -- different designs systematically favor different review behaviors.

**Literature evidence.** Sadowski et al. (2018) provide the most detailed account of a review platform at scale, describing Google's internal Critique system. Key design features: changesets are the unit of review; review state is separate from CI state; reviewers have explicit LGTM and "please change" actions; reviewer assignment is tied to code ownership; and review conversations are threaded.

Bird et al. (2015) compared review patterns across GitHub (post-commit PR), Gerrit (pre-commit gatekeeping), and internal Microsoft tooling, finding that pre-commit gatekeeping produced higher review engagement but higher review latency than post-commit review.

Rigby et al. (2014) analyzed Gerrit at Google and found that the explicit vote-based approval model reduced rubber-stamping compared to systems with binary approve/reject states.
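
Gerrit's vote model reduces to one gating rule: a change is submittable only with at least one +2 and no -2 veto. A minimal sketch of that aggregation (not Gerrit's actual code, and real deployments configure richer per-label rules):

```python
def submittable(votes):
    """Gerrit-style gate: needs a +2, and any single -2 vetoes the change."""
    return max(votes) >= 2 and min(votes) > -2

print(submittable([+1, +2]))      # approved: a +2 is present, no veto
print(submittable([+2, -2, +1]))  # blocked: one -2 vetoes despite the +2
print(submittable([+1, +1]))      # blocked: +1s do not add up to a +2
```

The design choice this encodes is why rubber-stamping drops: a +1 expresses "looks fine" without carrying submit authority, so approval requires someone willing to stake an explicit +2.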

**Implementations.** The five major platform approaches:
- **GitHub Pull Requests**: Post-commit, pre-merge. Inline comments on diff, CI integration, CODEOWNERS-based assignment.
- **Gerrit Code Review**: Pre-commit gatekeeping. Explicit vote scores (+1/+2/-1/-2), strict ownership model.
- **Phabricator Differential**: Pre-commit review with sophisticated diff display and inline comment threading.
- **Google Critique** (internal): Continuous review integrated with Google's build system and ownership model.
- **GitLab Merge Requests**: Similar to GitHub PRs with additional features for enterprise workflows.

**Strengths and limitations.** Strengths: workflow integration reduces context-switching overhead; automated CI feedback during review enables rapid detection of mechanical failures; audit trails provide accountability; platform-enforced policies prevent casual bypass of review. Limitations: platform lock-in constrains review process design; notification fatigue reduces reviewer responsiveness; none of the major platforms provides native support for structured inspection methodologies, forcing organizations that want Fagan-like rigor to implement it on top of platforms optimized for informal review.

### 4.9 AI/LLM-Assisted Code Review (Emerging)

**Theory and mechanism.** AI-assisted code review applies machine learning and large language models to automate or augment parts of the human review process. Three distinct roles have been proposed: (1) **automated review generation**, where an LLM produces review comments without human reviewer involvement; (2) **reviewer augmentation**, where an LLM suggests potential issues that human reviewers can evaluate; and (3) **author assistance**, where an LLM helps authors prepare their code for review.

**Literature evidence.** Tufano et al. (2021, 2022) produced the first large-scale studies of neural code review, training transformer-based models on code review datasets. Their 2022 study scaled to 250K review instances and found that the code refinement task outperformed comment generation in both automated metrics and human evaluation.

Li et al. (2022) presented CodeReviewer, trained on 4.1M code change samples. On code refinement, CodeReviewer achieved state-of-the-art performance, but human evaluation showed that only 20-30% of generated refinements were accepted without modification.

Hellendoorn et al. (2021) evaluated LLM-based code review at Microsoft, finding less than 50% precision when deployed on real production code.

Cihan et al. (2024), presented at ICSE 2025, evaluated a deployed AI-assisted review system, tracking 1,200 pull requests over 6 months. Key findings:
- 73.8% of the AI-generated comments were resolved
- Total PR lifecycle time increased from 5 hours 52 minutes to 8 hours 20 minutes -- a 42% increase
- The "latency paradox": AI assistance that improves defect coverage but increases total review time
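
The reported 42% figure follows directly from the two lifecycle times:

```python
# Lifecycle increase reported by Cihan et al.: 5h52m -> 8h20m
before = 5 * 60 + 52   # 352 minutes
after = 8 * 60 + 20    # 500 minutes
increase = (after - before) / before
print(f"{increase:.0%}")  # → 42%
```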

**Implementations.** Deployed systems include:
- **GitHub Copilot Code Review** (2024-2025): LLM-generated preliminary review comments on pull requests
- **CodeReviewer** (Microsoft Research): Multi-task model for review generation and code refinement
- **Qodo** (formerly CodiumAI): AI-assisted review with focus on test coverage suggestions
- **Sourcegraph Cody**: Code intelligence with review assistance features

**Strengths and limitations.** Strengths: scales to unlimited review volume; consistent application of learned patterns; 24/7 availability. Limitations: the latency paradox suggests that AI review may increase total review time; less than 50% precision in production settings reduces trust; semantic understanding gaps mean AI review is effective for pattern-based defects but weak on design quality and business logic correctness; hallucination risk requires verification overhead.

## 5. Comparative Synthesis

| Criterion | Fagan Inspection | Modern Code Review | Defect Taxonomy | Checklist-Driven | Static Analysis | Multi-Reviewer Coord. | Time-Bounded | Platform-Assisted | AI/LLM-Assisted |
|---|---|---|---|---|---|---|---|---|---|
| **Defect detection rate** | Very High | Moderate | Moderate (structured) | Moderate-High | High (syntactic) / Low (semantic) | Moderate-High | Moderate | Platform-dependent | Low-Moderate |
| **Process overhead** | Very High | Low | Low-Moderate | Low-Moderate | Very Low | Moderate | Low | Low | Low |
| **Knowledge transfer** | Low | Very High | Low | Moderate | None | Moderate | N/A | Low | Low |
| **Scales to distributed teams** | Poorly | Well | Well | Well | Very Well | Moderately | Well | Well | Very Well |
| **Reviewer fatigue sensitivity** | High (managed by process) | High | Low | Moderate | None | Moderate | High (addresses it) | Moderate | None |
| **Defect type coverage** | Broad | Broad but variable | Targeted | Targeted | Narrow (pattern-based) | Broad (expertise-dependent) | N/A | Platform-dependent | Pattern-heavy |
| **Evidence base** | Strong (1976-present) | Strong (2013-present) | Moderate | Moderate | Strong | Moderate-Strong | Moderate | Moderate | Nascent |
| **Organizational maturity needed** | High | Low | Moderate | Low-Moderate | Low | Moderate | Low | Low | Low |

**Key tension clusters:**

1. **Thoroughness vs. velocity**: Formal inspection maximizes defect detection but minimizes throughput. MCR maximizes throughput but produces variable quality. Checklist-driven and time-bounded approaches represent intermediate positions.

2. **Scale vs. depth**: Static analysis and AI/LLM-assisted review scale to unlimited code volume but provide shallow semantic analysis. Human review provides deep semantic analysis but does not scale beyond reviewer capacity. The complementarity hypothesis argues that the combination dominates either alone.

3. **Stated vs. actual outcomes**: The literature consistently shows that code review's actual primary outcome is knowledge transfer and team awareness, not defect detection. Methodologies optimized solely for defect detection may underperform methodologies that foster conversational engagement on the dimensions that teams actually value most.

4. **Process rigidity vs. adaptability**: Formal inspection's value comes partly from process rigidity. MCR's value comes partly from adaptability. These are fundamentally different theories of how process produces quality.

## 6. Open Problems and Gaps

### 6.1 The Motivation-Outcome Gap Remains Unresolved

Bacchelli and Bird (2013) documented that developers state defect detection as their primary motivation for code review but that defect-related comments constitute only 12% of review content. Twelve years later, this gap remains unaddressed at the methodological level. Should review processes be redesigned to optimize for actual outcomes (knowledge transfer, design improvement) rather than stated motivations (defect detection)? The literature has documented the gap but not resolved the prescription.

### 6.2 The Latency Paradox in AI-Assisted Review

Cihan et al. (2024) found that AI-assisted review increased total PR lifecycle time by 42% despite resolving 73.8% of AI-generated comments. It is unknown whether this effect is specific to the system studied, to the current generation of LLMs, or whether it represents a fundamental trade-off. Controlled replication across different AI systems and organizational contexts is needed.

### 6.3 Reviewer Recommendation Algorithm Evaluation

Reviewer recommendation algorithms are evaluated primarily on recall -- do they recommend reviewers who actually ended up reviewing the changeset? This metric is circular. Recommendation algorithms optimized for expertise matching rather than historical pattern matching have not been rigorously evaluated against quality outcomes.

### 6.4 Gender and Demographic Bias in Code Review

Murphy-Hill et al. (2022) documented that code review interactions differ systematically by gender: women's code changes receive more scrutiny and lower initial approval rates than men's code changes with equivalent quality. No intervention has been shown to reduce these disparities without reducing review quality.

### 6.5 Evidence Base Quality

The empirical evidence for code review effectiveness has a significant quality problem: most studies are observational rather than experimental. Large-scale randomized controlled trials of code review methodology do not exist and may be infeasible due to the organizational disruption they would require.

## 7. Conclusion

Code review methodology in 2026 exists in a transitional state. The field has accumulated substantial empirical evidence about how review is practiced, why it works and fails, and where it can be augmented. What the field lacks is a unified framework that reconciles the stated motivations for review with its actual primary outcomes, and that maps methodological choices to outcome-specific effectiveness.

Three findings recur across the literature with sufficient consistency to be treated as established:

1. **Defect detection is not code review's primary actual outcome.** Human review time is predominantly spent on evolvability, design, and knowledge transfer, not functional defect identification.

2. **Reviewer expertise is the strongest predictor of review quality.** Assignment policies that match reviewers to code they are genuinely familiar with produce dramatically better outcomes than random or rotation-based assignment.

3. **Cognitive load constraints are real and measurable.** Reviews of more than 300-400 LOC, sessions longer than 90 minutes, and inspection rates above 500 LOC/hour all show measurably reduced defect detection. These constraints should be embedded in review policy rather than left to individual reviewer discipline.

The emergence of AI-assisted review represents the most significant open question in the field. Early evidence suggests that current AI review systems create evaluation overhead that may offset their defect detection contribution. Whether future systems with improved semantic understanding will change this calculus is genuinely unknown.

## References

- Ackerman, A.F., Buchwald, L.S., and Lewski, F.H., "Software Inspections: An Effective Verification Process", IEEE Software, 1989 (https://doi.org/10.1109/52.28730)
- Bacchelli, A. and Bird, C., "Expectations, Outcomes, and Challenges of Modern Code Review", ICSE 2013 (https://dl.acm.org/doi/10.5555/2486788.2486882)
- Baysal, O. et al., "Investigating Technical and Non-Technical Factors Influencing Modern Code Review", Empirical Software Engineering, 2016 (https://doi.org/10.1007/s10664-015-9366-8)
- Beller, M., Bacchelli, A., Zaidman, A., and Juergens, E., "Modern Code Reviews in Open-Source Projects: Which Problems Do They Fix?", MSR 2014 (https://doi.org/10.1145/2597073.2597082)
- Bird, C. et al., "The Right Timing for Code Review", FSE 2015 (https://dl.acm.org/doi/10.1145/2786805.2786848)
- Bosu, A., Greiler, M., and Bird, C., "Characteristics of Useful Code Reviews: An Empirical Study at Microsoft", MSR 2015 (https://doi.org/10.1109/MSR.2015.21)
- Chillarege, R. et al., "Orthogonal Defect Classification -- A Concept for In-Process Measurements", IEEE TSE, 1992 (https://doi.org/10.1109/32.163990)
- Chouchen, M. et al., "Anti-Patterns in Modern Code Review", TSE, 2021
- Cihan, B. et al., "Industrial Evaluation of AI-Assisted Code Review", ICSE 2025 (https://arxiv.org/abs/2406.01081)
- Cohen, J., "Code Review Best Practices", SmartBear, 2010 (https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/)
- Cowan, N., "The magical number 4 in short-term memory", Behavioral and Brain Sciences, 2001 (https://doi.org/10.1017/S0140525X01003922)
- Do, H. et al., "Explicit and Implicit Strategies in Code Review", EMSE, 2022 (https://doi.org/10.1007/s10664-021-10079-5)
- Dogan, E. and Tuzun, E., "Towards a Taxonomy of Code Review Smells", Information and Software Technology, 2022 (https://doi.org/10.1016/j.infsof.2022.106951)
- Fagan, M.E., "Design and Code Inspections to Reduce Errors in Program Development", IBM Systems Journal, 1976 (https://doi.org/10.1147/sj.153.0182)
- Hellendoorn, V.J. et al., "Large-Scale Study of Code Review Changes", Microsoft Research, 2021
- Johnson, B. et al., "Why Don't Software Developers Use Static Analysis Tools to Find Bugs?", ICSE 2013 (https://dl.acm.org/doi/10.1109/ICSE.2013.6606613)
- Klein, G.A., "A Recognition-Primed Decision (RPD) Model of Rapid Decision Making", 1993 (https://www.osti.gov/biblio/6068116)
- Lenarduzzi, V. et al., "Are SonarQube Rules Inducing Bad Practices?", SANER 2022 (https://doi.org/10.1109/SANER53432.2022.00033)
- Li, Z. et al., "CodeReviewer: Pre-Training for Automating Code Review Activities", ESEC/FSE 2022 (https://dl.acm.org/doi/10.1145/3540250.3549121)
- Mantyla, M.V. and Lassenius, C., "What Types of Defects Are Really Discovered in Code Reviews?", IEEE TSE, 2009 (https://doi.org/10.1109/TSE.2008.71)
- McIntosh, S. et al., "The Impact of Code Review Coverage and Code Review Participation on Software Quality", MSR 2014 (https://doi.org/10.1145/2597073.2597076)
- McIntosh, S. et al., "An Empirical Study of the Impact of Modern Code Review Practices on Software Quality", EMSE, 2016 (https://doi.org/10.1007/s10664-015-9381-9)
- Miller, G.A., "The magical number seven, plus or minus two", Psychological Review, 1956 (https://doi.org/10.1037/h0043158)
- Murphy-Hill, E. et al., "Gender and Code Review: A Large-Scale Study", FSE 2022 (https://dl.acm.org/doi/10.1145/3540250.3549158)
- Rigby, P.C. and Bird, C., "Convergent Contemporary Software Peer Review Practices", ESEC/FSE 2013 (https://dl.acm.org/doi/10.1145/2491411.2491428)
- Rigby, P.C. et al., "Peer Review on Open Source Software Projects", ACM TOSEM, 2014 (https://dl.acm.org/doi/10.1145/2594458)
- Russell, G.W., "Experience with Inspection in Ultralarge-Scale Developments", IEEE Software, 1991 (https://doi.org/10.1109/52.73752)
- Sadowski, C. et al., "Modern Code Review: A Case Study at Google", ICSE-SEIP 2018 (https://dl.acm.org/doi/10.1145/3183519.3183525)
- SmartBear Software, "Best Kept Secrets of Peer Code Review", 2006 (https://smartbear.com/resources/ebooks/best-kept-secrets-of-peer-code-review/)
- Thongtanunam, P. et al., "Automatically Recommending Code Reviewers Based on Their Expertise", ASE 2015 (https://doi.org/10.1109/ASE.2015.41)
- Tufano, R. et al., "Towards Automating Code Review Activities", ICSE 2021 (https://dl.acm.org/doi/10.1109/ICSE43902.2021.00035)
- Tufano, R. et al., "Using Pre-Trained Models to Boost Code Review Automation", ICSE 2022 (https://dl.acm.org/doi/10.1145/3510003.3510621)
- Votta, L.G., "Does Every Inspection Need a Meeting?", SIGSOFT FSE 1993 (https://dl.acm.org/doi/10.1145/167049.167070)
- Warm, J.S., Parasuraman, R., and Matthews, G., "Vigilance Requires Hard Mental Work and is Stressful", Human Factors, 2008 (https://doi.org/10.1518/001872008X312152)
- Yang, X. et al., "A Survey on Modern Code Review", arXiv:2405.18216, 2024 (https://arxiv.org/abs/2405.18216)

## Practitioner Resources

### Review Process Guides

- **Google Engineering Practices: Code Review** (https://google.github.io/eng-practices/review/) -- Google's public code review guide covering reviewer and author responsibilities. Reflects practices described in Sadowski et al. (2018).
- **SmartBear "Best Kept Secrets of Peer Code Review"** (https://smartbear.com/resources/ebooks/best-kept-secrets-of-peer-code-review/) -- The foundational practitioner guide based on Cisco's code review data.
- **Thoughtbot Code Review Guide** (https://github.com/thoughtbot/guides/blob/main/code-review/README.md) -- Open-source review guide covering reviewer and author responsibilities.

### Static Analysis Integration

- **SonarQube** (https://www.sonarqube.org/) -- Multi-language static analysis platform with PR decoration for GitHub, GitLab, and Bitbucket.
- **Semgrep** (https://semgrep.dev/) -- Lightweight pattern-based static analysis with custom rule authoring.
- **CodeClimate** (https://codeclimate.com/) -- Cloud-hosted code quality analysis with PR integration.

### AI-Assisted Review Tools

- **GitHub Copilot Code Review** (https://docs.github.com/en/copilot/using-github-copilot/code-review/using-copilot-code-review) -- LLM-generated preliminary review comments on pull requests.
- **Qodo** (https://www.qodo.ai/) -- AI-assisted review with focus on test coverage analysis.

### Review Metrics and Analysis

- **LinearB** (https://linearb.io/) -- Engineering analytics platform with code review-specific metrics.
- **GitHub CODEOWNERS** (https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners) -- File-pattern-based automatic reviewer assignment.
- **Gerrit Code Review** (https://www.gerritcodereview.com/) -- Pre-commit gatekeeping review platform with built-in reviewer suggestion.
@@ -1,76 +0,0 @@
- # Research Index
-
- Research documents that inform project design decisions, agent methodology, and domain knowledge.
-
- ## External Articles
-
- External articles that directly influenced compound-agent's architecture.
-
- | Document | Description |
- |----------|-------------|
- | [BuildingACCompilerAnthropic.md](BuildingACCompilerAnthropic.md) | Anthropic's article on building a C compiler with parallel Claude agent teams |
- | [HarnessEngineeringOpenAi.md](HarnessEngineeringOpenAi.md) | OpenAI's article on harness engineering and leveraging Codex in agent-first workflows |
- | [AgenticAiCodebaseGuide.md](AgenticAiCodebaseGuide.md) | Guide to building codebases optimized for agentic AI development |
-
- ## PhD Research Surveys
-
- Deep research produced via `/compound:get-a-phd` for agent domain knowledge. Each follows the researcher skill template.
-
- ### Code Review
-
- | Document | Target Agents |
- |----------|---------------|
- | [code-review/systematic-review-methodology.md](code-review/systematic-review-methodology.md) | review phase, security-reviewer, architecture-reviewer, performance-reviewer, simplicity-reviewer, test-coverage-reviewer |
-
- ### Learning Systems
-
- | Document | Target Agents |
- |----------|---------------|
- | [learning-systems/knowledge-compounding-for-agents.md](learning-systems/knowledge-compounding-for-agents.md) | compound phase, context-analyzer, lesson-extractor, pattern-matcher, solution-writer, compounding |
-
- ### TDD
-
- | Document | Target Agents |
- |----------|---------------|
- | [tdd/test-driven-development-methodology.md](tdd/test-driven-development-methodology.md) | work phase, test-writer, implementer, cct-subagent |
-
- ### Property Testing
-
- | Document | Target Agents |
- |----------|---------------|
- | [property-testing/property-based-testing-and-invariants.md](property-testing/property-based-testing-and-invariants.md) | invariant-designer, property-test-generator, anti-cargo-cult-reviewer, module-boundary-reviewer, drift-detector |
-
- ### Security
-
- | Document | Target Agents |
- |----------|---------------|
- | [security/secure-coding-failure.md](security/secure-coding-failure.md) | security-reviewer, security-injection, security-secrets, security-auth, security-data, security-deps |
- | [security/overview.md](security/overview.md) | security-reviewer |
- | [security/injection-patterns.md](security/injection-patterns.md) | security-reviewer, security-injection |
- | [security/secrets-checklist.md](security/secrets-checklist.md) | security-reviewer, security-secrets |
- | [security/auth-patterns.md](security/auth-patterns.md) | security-reviewer, security-auth |
- | [security/data-exposure.md](security/data-exposure.md) | security-reviewer, security-data |
- | [security/dependency-security.md](security/dependency-security.md) | security-reviewer, security-deps |
-
- ## Existing Analysis
-
- | Document | Description |
- |----------|-------------|
- | [test-optimization-strategies.md](test-optimization-strategies.md) | Analysis of test suite optimization approaches |
-
- ## How to Add Research
-
- Use `/compound:get-a-phd` to produce new research documents. The command:
- 1. Analyzes beads epics for knowledge gaps
- 2. Checks all `docs/` for existing coverage
- 3. Proposes PhD topics for user confirmation
- 4. Spawns parallel researcher subagents
- 5. Stores output at `docs/research/<topic>/<slug>.md`
-
- ## How Agents Use Research
-
- Skills and agents reference research via `## Literature` sections. Agents can also query indexed knowledge:
- ```bash
- npx ca knowledge "relevant query" # Search docs knowledge base
- npx ca search "relevant query" # Search lessons memory
- ```