@tangle-network/agent-eval 0.68.0 → 0.69.0

package/CHANGELOG.md CHANGED
@@ -4,6 +4,19 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-
 
  ---
 
+ ## [0.69.0] — 2026-05-30 — strong generic baseline roles (engineer / researcher / generalist)
+
+ The structured profile (0.68.0) had a hollow top zone — `baselineProfile` took an arbitrary `role` string. Products are file-producing, tool-using agents living in a sandbox, but nothing gave them a strong operator foundation. This adds three generically-useful, verification-first baseline roles distilled from agent-runtime's `coderProfile` doctrine.
+
+ ### Added (`profile.*`)
+
+ - **`engineerRole`** — a senior principal / 10x-IC sandbox operator: produce the real artifact then verify it; smallest correct change; **run the checks and fix the root cause — never weaken a test or hide an error**; inspect external-boundary outcomes; "done" = produced AND verified.
+ - **`researcherRole`** — read the real sources, cite every material claim, mark inference vs. verified, never fabricate a source/quote/number.
+ - **`generalistRole`** — strong default: do over describe, ground claims, verify before done, ask only on genuinely user-owned choices.
+ - `BASELINE_ROLES` (keyed `engineer|researcher|generalist`) + `baselineProfileFromRole(role, overrides?)` — pick a foundation, override the environment to describe THIS product's sandbox, then layer domain via `prodProfile`.
+
+ **Layering discipline:** these are domain-AGNOSTIC and verification-first. Domain strength (legal M&A persona, tax-calc rigor) stays in the **product repo** and composes on top via `domain[]`; it is lifted into the substrate only once ≥2 products genuinely reuse it. 3 new tests assert the roles are distinct, verification-first, and carry no product-domain words. Full suite (1642) green.
+
  ## [0.68.0] — 2026-05-30 — structured AgentProfile (the self-improvement surface stops being an opaque blob)
 
  The optimizable surface was an opaque string addendum, so the loop could only mutate (and the dashboard only diff) an unstructured blob — you couldn't see *what kind* of improvement a candidate made. This adds a **sectioned `AgentProfile`** primitive (mirrored on Harvey LAB's system-prompt structure) so the surface has named, separately-addressable zones the loop targets one at a time.
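The layering described in the 0.69.0 entry (pick a baseline role, override the environment, then layer domain via `prodProfile`) could look roughly like the sketch below. It uses local stand-ins instead of the real package so that it runs on its own. The `SketchProfile` shape, the placeholder role strings, the environment text, and the string-typed `domain` entries are assumptions for illustration; only the names `BASELINE_ROLES`, `baselineProfileFromRole`, and `prodProfile`, and the bodies of the two functions, follow what this diff shows.

```typescript
// Stand-ins for the `profile` namespace of @tangle-network/agent-eval.
// Shapes are simplified assumptions; the real AgentProfile has more fields.
type BaselineRoleKey = "engineer" | "researcher" | "generalist";

interface SketchProfile {
  role: string;
  environment?: string;
  domain: string[]; // assumption: real domain sections are structured objects
}

// Placeholder text; the shipped role strings are much longer.
const BASELINE_ROLES: Record<BaselineRoleKey, string> = {
  engineer: "<engineer role text>",
  researcher: "<researcher role text>",
  generalist: "<generalist role text>",
};

// Mirrors the shipped one-liner: look the role up, then spread the overrides.
function baselineProfileFromRole(
  role: BaselineRoleKey,
  args: { environment?: string } = {},
): SketchProfile {
  return { role: BASELINE_ROLES[role], ...args, domain: [] };
}

// Mirrors the shipped prodProfile: only `domain` grows.
function prodProfile(baseline: SketchProfile, shipped: string[]): SketchProfile {
  return { ...baseline, domain: [...baseline.domain, ...shipped] };
}

// 1. Generic foundation plus this product's sandbox description (hypothetical text).
const baseline = baselineProfileFromRole("researcher", {
  environment: "Read-only document store mounted at /workspace/docs.",
});

// 2. Domain guidance stays in the product repo and is layered on top.
const shipped = prodProfile(baseline, ["<product-specific domain section>"]);

console.log(shipped.role === baseline.role); // the domain layer leaves the role untouched
console.log(shipped.domain.length);
```

In a real project the stand-ins would presumably be replaced by `import { profile } from "@tangle-network/agent-eval"`, since the declaration file in this diff exports the module under the `profile` namespace.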
package/dist/index.d.ts CHANGED
@@ -5821,6 +5821,45 @@ declare function defaultRenderStudentPrompt<TInput>(args: {
   * Throws on failure so the cell is recorded as failed, not silently zeroed. */
  declare function defaultParseStudentLabel<TProduced>(rawContent: string, scenarioId: string): TProduced;
 
+ /**
+  * @experimental
+  *
+  * Strong, generically-useful baseline ROLES — the top zone of an `AgentProfile`
+  * before any domain layer. A product composes one of these with its own
+  * environment description (its sandbox) and its domain guidance, which lives in
+  * the product repo (not here): a profile is `<baseline role> + <environment> +
+  * <domain sections>`. Domain strength is NOT generalized into the substrate —
+  * it is lifted here only once ≥2 products genuinely reuse it.
+  *
+  * Three roles cover the common shapes:
+  *   - `engineerRole` — builds + verifies real artifacts in a sandbox. Distilled
+  *     from agent-runtime's `coderProfile` doctrine (minimal
+  *     correct change, run the checks, fix the root cause,
+  *     never weaken a test or hide an error).
+  *   - `researcherRole` — gathers from real sources and grounds every claim
+  *     (cite or mark inferred; never fabricate).
+  *   - `generalistRole` — a strong default: helpful, grounded, verify-before-assert.
+  *
+  * All three are verification-first and domain-agnostic. They describe HOW a
+  * capable IC operates, not WHAT domain it works in.
+  */
+ /** Senior-IC engineer operating in a sandbox — produces real artifacts and
+  * verifies them before declaring done. The shared operator foundation for any
+  * file-producing, tool-using product agent. */
+ declare const engineerRole: string;
+ /** Researcher who grounds every claim in real sources — gathers, cites, and
+  * distinguishes verified fact from inference. */
+ declare const researcherRole: string;
+ /** Strong general-purpose default — helpful, grounded, verifies before asserting. */
+ declare const generalistRole: string;
+ /** The named baseline roles, for selection by key (e.g. from a product config). */
+ declare const BASELINE_ROLES: {
+     readonly engineer: string;
+     readonly researcher: string;
+     readonly generalist: string;
+ };
+ type BaselineRoleKey = keyof typeof BASELINE_ROLES;
+
  /**
   * @experimental — surface may evolve as production agents wire it in.
   *
@@ -5843,6 +5882,7 @@ declare function defaultParseStudentLabel<TProduced>(rawContent: string, scenari
   * varied. Consume this module by its own path (`@tangle-network/agent-eval`
   * exposes it under the `profile` namespace) to avoid the name clash.
   */
+
  /** A named, addressable region of the system prompt. `evolvable` marks whether
   * the self-improvement loop is allowed to patch its body; fixed scaffolding
   * (e.g. a compliance preamble) sets `evolvable: false`. */
@@ -5892,6 +5932,18 @@ declare function baselineProfile(args: {
      toolConventions?: string;
      skills?: ProfileSkill[];
  }): AgentProfile;
+ /**
+  * Baseline from one of the strong generic roles (`'engineer' | 'researcher' |
+  * 'generalist'`) — the common case: pick a role foundation, optionally override
+  * the environment to describe THIS product's sandbox, then layer domain via
+  * `prodProfile`. Domain guidance stays in the product repo; this only supplies
+  * the generically-useful role + stock scaffolding.
+  */
+ declare function baselineProfileFromRole(role: BaselineRoleKey, args?: {
+     environment?: string;
+     toolConventions?: string;
+     skills?: ProfileSkill[];
+ }): AgentProfile;
  /** The production profile: the baseline scaffolding plus the domain sections
   * shipped after self-improvement. Differs from the baseline ONLY in `domain`
   * (and any skills the caller layered into the baseline) — the role,
@@ -5918,15 +5970,21 @@ declare function sectionHash(section: AgentProfileSection): string;
 
  type index_AgentProfile = AgentProfile;
  type index_AgentProfileSection = AgentProfileSection;
+ declare const index_BASELINE_ROLES: typeof BASELINE_ROLES;
+ type index_BaselineRoleKey = BaselineRoleKey;
  type index_ProfileSkill = ProfileSkill;
  declare const index_applyDomainPatch: typeof applyDomainPatch;
  declare const index_baselineProfile: typeof baselineProfile;
+ declare const index_baselineProfileFromRole: typeof baselineProfileFromRole;
+ declare const index_engineerRole: typeof engineerRole;
+ declare const index_generalistRole: typeof generalistRole;
  declare const index_prodProfile: typeof prodProfile;
  declare const index_profileToSurface: typeof profileToSurface;
  declare const index_renderProfile: typeof renderProfile;
+ declare const index_researcherRole: typeof researcherRole;
  declare const index_sectionHash: typeof sectionHash;
  declare namespace index {
- export { type index_AgentProfile as AgentProfile, type index_AgentProfileSection as AgentProfileSection, type index_ProfileSkill as ProfileSkill, index_applyDomainPatch as applyDomainPatch, index_baselineProfile as baselineProfile, index_prodProfile as prodProfile, index_profileToSurface as profileToSurface, index_renderProfile as renderProfile, index_sectionHash as sectionHash };
+ export { type index_AgentProfile as AgentProfile, type index_AgentProfileSection as AgentProfileSection, index_BASELINE_ROLES as BASELINE_ROLES, type index_BaselineRoleKey as BaselineRoleKey, type index_ProfileSkill as ProfileSkill, index_applyDomainPatch as applyDomainPatch, index_baselineProfile as baselineProfile, index_baselineProfileFromRole as baselineProfileFromRole, index_engineerRole as engineerRole, index_generalistRole as generalistRole, index_prodProfile as prodProfile, index_profileToSurface as profileToSurface, index_renderProfile as renderProfile, index_researcherRole as researcherRole, index_sectionHash as sectionHash };
  }
 
  export { ANALYST_SEVERITIES, type ActiveLearningOptions, type AdapterRun, AgentDriver, type AgentDriverConfig, AgentEvalError, AgentProfile$1 as AgentProfile, type AgreementResult, type AlignmentOp, Analyst, AnalystContext, AnalystCost, AnalystFinding, AnalystSeverity, AnalyzeTracesInput, AnalyzeTracesOptions, AnalyzeTracesResult, type AntiSlopConfig, type AntiSlopIssue, type AntiSlopReport, Artifact$1 as Artifact, type Artifact as ArtifactCheckArtifact, type ArtifactEventLike, type ArtifactValidator, type AssertCrossFamilyOptions, type AssertSingleBackendOptions, type AutoPrClient, AxGepaSteeringOptimizer, type AxSteeringOptimizerConfig, type BackendDescriptor, BaselineReport, BehaviorAssertion, BenchmarkReport, BenchmarkRunner, BenchmarkRunnerConfig, type BisectOptions, type BisectResult, type BisectStep, BudgetBreachError, BudgetGuard, BudgetLedgerEntry, BudgetSpec, type BuildAgreementJudgeOptions, CallExpectation, type CanaryAlert, type CanaryKind, type CanaryLeak, type CanaryOptions, type CanaryReport, type CanarySeverity, type CandidateScenario, type CausalAttributionReport, type CellVerdict, ChatRequest, CheckResult, CollectedArtifacts, type CommandRunner, type CompareLabels, CompletionCriterion, type CompletionRequirement, type CompletionVerdict, type ConceptComplexity, type ConceptFinding, type ConceptSpec, type ConceptWeightStrategy, type ContinuityCheck, type ContinuityCheckResult, type ContinuityReport, type ContinuitySnapshotPair, type ContractMetric, type ContractReport, ConvergenceTracker, type CorrectnessChecker, type CostEntry, type CostSummary, CostTracker, type CounterfactualContext, type CounterfactualMutation, type CounterfactualResult, type CounterfactualRunner, CreateChatClientOpts, type CreateDefaultReviewerOptions, type CreateSandboxPoolOpts, type CreateTraceAnalystKindOpts, CrossFamilyError, type CrossTraceDiff, type CrossTraceDiffOptions, D1ExperimentStore, type D1ExperimentStoreOptions, type D1Like, type D1PreparedStatementLike, 
DEFAULT_AGENT_SLOS, DEFAULT_COMPLEXITY_WEIGHTS, DEFAULT_FINDERS, DEFAULT_HARNESS_OBJECTIVES, DEFAULT_MUTATION_PRIMITIVES, DEFAULT_MUTATORS, DEFAULT_PR_REVIEW_SCORE_WEIGHTS, DEFAULT_RUN_SCORE_WEIGHTS, DEFAULT_SEVERITY_WEIGHTS, DEFAULT_TRACE_ANALYST_KINDS, Dataset, DatasetScenario, type DecideNextUserTurnOpts, type DeployFamily, type DeployGateLayerInput, type DeployRunResult, type DeployRunner, type DiffPolicy, type DiffScorecardOptions, type DirEntry, type Direction, type DiscoverPersonasOptions, type DiscoveredPersona, DriverResult, DriverState, DualAgentBench, type DualAgentBenchConfig, type DualAgentReport, type DualAgentRound, type DualAgentScenario, type DualAgentScenarioResult, ERROR_COUNT_PATTERNS, type ErrorCountPattern, type EvolutionRound, type ExecutorConfig, type Expectation, type Experiment, type Run as ExperimentRun, type ExperimentStore, ExperimentTracker, type ExportedRewardModel, type ExtractOptions, type ExtractResult, FAILURE_MODE_KIND_SPEC, FINDING_SUBJECT_GRAMMAR_PROMPT, FINDING_SUBJECT_KINDS, type FactorContribution, type FactorialCell, FeedbackLabel, FeedbackTrajectory, FeedbackTrajectoryStore, type FieldAgreementSpec, type FileChange, FileSystemExperimentStore, type FileSystemExperimentStoreOptions, type FindingSubject, type FindingSubjectKind, FindingSubjectStringSchema, type FindingsDiff, FindingsStore, type FlowAction, type FlowLayerEnv, type FlowLayerFactoryInput, type FlowRunner, type FlowRunnerStepResult, type FlowSpec, type FlowStep, type GhCliClientOptions, type GoldScenario, type GoldSplit, type GoldenSeverity, type GoldenSpec, type HarnessAdapter, HarnessConfig, type HarnessExperimentConfig, type HarnessExperimentResult, type HarnessIntervention, type HarnessRunRequest, type HarnessRunResult, type HarnessScenario, type HarnessSelection, type HarnessVariant, type HarnessVariantReport, HoldoutAuditor, type HostedJudgeConfig, type HostedJudgeDimension, type HostedJudgeRequest, type HostedJudgeResponse, type HostedRunCriticConfig, type 
HostedRunScoreRequest, type HostedRunScoreResponse, type HttpGithubClientOptions, type HypothesisManifest, type HypothesisResult, IMPROVEMENT_KIND_SPEC, INTENT_MATCH_JUDGE_VERSION, type ImageData, InMemoryExperimentStore, InMemoryWorkspaceInspector, type InferenceScorer, type InspectorContext, type IntentMatchInput, type IntentMatchOptions, type IntentMatchResult, type InteractionContribution, type JudgeAdapterOpts, type JudgeFamily, type JudgeFleetOptions, JudgeFn, JudgeInput, type JudgeReplayResult, type JudgeRetryOutcome, type JudgeRetryPolicy, JudgeRunner, KIND_EXPECTED_SUBJECTS, KNOWLEDGE_GAP_KIND_SPEC, KNOWLEDGE_POISONING_KIND_SPEC, type KeywordConceptSpec, type KeywordCoverageFinding, type KeywordCoverageOptions, type KeywordCoverageResult, type LangfuseEnvelope, type LangfuseGeneration, type LangfuseScore, Layer, LayerResult, type LiveProofArtifact, type LiveProofConfig, type LiveProofContext, type LiveProofResult, LlmClientOptions, type LlmCorrectnessCheckerOpts, LlmSpan, LockedJsonlAppender, MODEL_PRICING, type MatchResult, type MatcherResult, type MeasurementPolicy, type MergeOptions, MetricsCollector, type MuffledFinder, type MuffledFinding, MultiLayerVerifier, type MultiToolchainLayerConfig, type Mutator, Mutex, type Objective, type Oracle, type OracleObservation, type OracleReport, type OracleResult, type OrthogonalityInput, type OrthogonalityResult, OtelExportConfig, OtelExporter, type OtelPipelineHandle, type OtelPipelineOptions, PairwiseSteeringOptimizer, type ParaphraseRobustnessScenarioInput, type ParaphraseRobustnessScenarioResult, type ParetoResult, type ParseStudentLabel, type PersistedFinding, PersonaConfig, type Playbook, type PlaybookEntry, type PoolSlot, type PrReviewAuditCase, type PrReviewBenchmarkSummary, type PrReviewComment, type PrReviewMatchedFinding, type PrReviewOutcome, type PrReviewReferenceFinding, type PrReviewScore, type PrReviewScoreWeights, type PrReviewSeverity, type PrReviewSource, type ProducedProposal, type 
ProducedState, ProductClient, ProductClientConfig, type PromptHandle, PromptRegistry, type ProposalEventLike, type ProposeAutomatedPullRequestInput, type ProposeAutomatedPullRequestResult, RAW_FINDING_SCHEMA_PROMPT, type RawAnalystFinding, RawAnalystFindingSchema, type RecordRunsOptions, type ReferenceMatchResult, type ReferenceReplayAdapter, type ReferenceReplayAdapterFn, type ReferenceReplayAdapterLike, type ReferenceReplayAggregate, type ReferenceReplayCandidate, type ReferenceReplayCase, type ReferenceReplayCaseRun, type ReferenceReplayExecutionScenario, type ReferenceReplayItem, type ReferenceReplayMatch, type ReferenceReplayMatchStrategy, type ReferenceReplayMatcher, type ReferenceReplayPromotionDecision, type ReferenceReplayPromotionPolicy, type ReferenceReplayRun, type ReferenceReplayRunContext, type ReferenceReplayRunOptions, type ReferenceReplayRunStore, type ReferenceReplayScenario, type ReferenceReplayScenarioScore, type ReferenceReplayScore, type ReferenceReplayScoreOptions, type ReferenceReplaySplit, type ReferenceReplaySplitComparison, type ReferenceReplaySteeringRowsOptions, type ReflectionContext, type ReflectionProposal, ReleaseConfidenceScorecard, ReleaseConfidenceThresholds, type RenderStudentPrompt, type RepoRef, type RequirementCheck, type ReviewerMemoryEntry, type ReviewerOutput, type ReviewerPromptInput, type ReviewerSoftFailDefaults, type ReviewerVerificationSummary, type RobustnessResult, Run$1 as Run, type RunCommandInput, type RunCommandResult, type RunConfig, RunCritic, type RunCriticAdapterOpts, type RunCriticOptions, type RunDiff, type RunDistillationOptions, type RunDistillationResult, RunFilter, RunRecord, type RunScore, type RunScoreWeights, type RunTrace, type RuntimeEventLike, SEMANTIC_CONCEPT_JUDGE_VERSION, SKILL_USAGE_ANALYST, SandboxDriver, SandboxHarnessResult, type SandboxJudgeKind, type SandboxJudgeResult, type SandboxJudgeSpec, type SandboxPool, type SatisfiedBy, type ScanOptions, Scenario, type ScenarioCost, ScenarioFile, 
ScenarioRegistry, ScenarioResult, type Scorecard, type ScorecardCell, type ScorecardCellDiff, type ScorecardDiff, type ScorecardEntry, type ScorecardLogLine, type ScoredTarget, type SelfPlayOptions, type SelfPlayProposer, type SelfPlayScorer, type SemanticConceptJudgeAdapterOpts, type SemanticConceptJudgeInput, type SemanticConceptJudgeOptions, type SemanticConceptJudgeResult, type SeriesConvergenceOptions, type SeriesConvergenceResult, Severity, type SignedManifest, type SignedManifestAlgo, type SingleBackendDivergence, SingleBackendError, type SingleBackendField, type SingleBackendReport, SkillUsageAnalyst, type SkillUsageRecord, type SkillUsageReport, type SkillUsageScanConfig, type Slo, type SloCheckResult, type SloComparator, type SloReport, type SloSeverity, type SlopCategory, type SlotFactory, Span, type SplitGoldOptions, type SteeringBundle, type SteeringDelta, type SteeringOptimizationResult, type SteeringOptimizationRow, type SteeringOptimizationSelector, type SteeringOptimizerBackend, type SteeringOptimizerConfig, type SteeringRolePrompt, type StepAttribution, type SynthesisReason, type SynthesisTarget, type TaskGold, TestResult, type ThresholdContract, TokenCounter, type TokenSpec, type ToolCallEventLike, TraceAnalysisStore, type TraceAnalystAdapterOpts, type TraceAnalystGolden, type TraceAnalystKindSpec, TraceEmitter, TraceEvent, TraceStore, type TraceToolGroupName, type TracedAnalystOptions, type TracedJudgeOptions, Trajectory, TrajectoryStep, type TrialTrace, TurnMetrics, UNIVERSAL_FINDERS, type ValidationContext, type ValidationIssue, type ValidationResult, type VerifierAdapterOpts, VerifyContext, VerifyOptions, type VisualDiffOptions, type VisualDiffResult, type ViteDeployRunnerInput, type WorkflowTopology, type WorkspaceAssertion, type WorkspaceAssertionResult, type WorkspaceInspector, type WorkspaceSnapshot, type WranglerDeployRunnerInput, adversarialJudge, aggregatePrReviewScore, aggregateRunScore, analyzeAntiSlop, analyzeSeries, 
appendScorecard, assertCrossFamily, assertSingleBackend, attributeCounterfactuals, bisect, buildAgreementJudge, buildDriverSystemPrompt, buildReflectionPrompt, buildReviewerPrompt, buildSkillUsageReport, buildTraceToolsForGroup, byteLengthRange, canaryLeakView, canonicalize, causalAttribution, checkBehavioralCanary, checkCanaries, checkSlos, clamp01, codeExecutionJudge, coherenceJudge, collectionPreserved, commentsForSource, commitBisect, compareReferenceReplay, compilerJudge, composeValidators, containsAll, createAntiSlopJudge, createCustomJudge, createDefaultReviewer, createDomainExpertJudge, createIntentMatchJudge, createJudgeAdapter, createLlmCorrectnessChecker, createRunCriticAdapter, createSandboxPool, createSemanticConceptJudge, createSemanticConceptJudgeAdapter, createTraceAnalystAdapter, createTraceAnalystKind, createVerifierAdapter, crossTraceDiff, crowdingDistance, decideNextUserTurn, decideReferenceReplayPromotion, decideReferenceReplayRunPromotion, defaultIsMaterial, defaultJudges, defaultParseStudentLabel, defaultReferenceReplayMatcher, defaultRenderStudentPrompt, deployGateLayer, diffFindings, diffScorecard, discoverPersonas, distillPlaybook, dominates, emitSkillUsageFindings, estimateCost, estimateTokens, evaluateContract, evaluateHypothesis, evaluateOracles, executeScenario, expectAgent, exportRewardModel, extractAssetUrls, extractErrorCount, extractProducedState, fieldAgreement, fileContains, fileExists, findAutoMatchNoExpectation, findConstructorCwdDropped, findFallbackToPass, findLiteralTruePass, findSkipCountsAsPass, flowLayer, formatBenchmarkReport, formatDriverReport, formatFindings, formatScorecardDiff, ghCliClient, precision as goldenPrecision, hashContent, hashJson, htmlContainsElement, httpGithubClient, inMemoryReferenceReplayStore, isModelPriced, isOtelConfigured, jsonHasKeys, jsonShape, jsonlReferenceReplayStore, judgeFamily, keyPreserved, liftSeverity, linterJudge, loadGoldScenarios, loadScorecard, loadScorerFromGrader, 
localCommandRunner, lowercaseMutator, matchGoldens, mergeLayerResults, mergeSteeringBundle, multiToolchainLayer, notBlocked, paraphraseRobustness, paraphraseRobustnessScenarios, paretoFrontier, paretoFrontierWithCrowding, parseCorrectnessResponse, parseFindingSubject, parseGoldJsonl, parseRawFinding, parseReflectionResponse, passOrthogonality, pixelDeltaRatio, politenessPrefixMutator, printDriverSummary, index as profile, promptBisect, proposeAutomatedPullRequest, proposeSynthesisTargets, recordRuns, recordRunsToScorecard, referenceReplayRunsToSteeringRows, referenceReplayScenarioToRunScore, regexMatch, regexMatches, renderFindingSubject, renderMarkdownReport, renderPlaybookMarkdown, renderPriorFindings, renderSteeringText, replayScorerOverCorpus, replayTraceThroughJudge, resetLockedAppendersForTesting, resolveModelPricing, rowCount, rowWhere, runAssertions, runBehavioralCanaries, runCanaries, runCounterfactual, runDistillation, runE2EWorkflow, runExpectations, runHarnessExperiment, runIntentMatchJudge, runJudgeFleet, runKeywordCoverageJudge, runKeywordCoverageJudgeUrl, runLiveProof, runReferenceReplay, runSelfPlay, runSemanticConceptJudge, scalarScore, scanForMuffledGates, scoreContinuity, scorePrReviewComments, scorePrReviewSource, scoreReferenceReplay, securityJudge, selectHarnessVariant, sentenceReorderMutator, signManifest, splitGold, statusAdvanced, summarizeHarnessResults, summarizePrReviewBenchmark, testJudge, textInSnapshot, toLangfuseEnvelope, toPrometheusText, traceJudge, traceJudgeEnsemble, tracedAnalyzeTraces, typoMutator, urlContains, verifyCompletion, verifyManifest, visualDiff, viteDeployRunner, weightedRecall, whitespaceCollapseMutator, withJudgeRetry, withOtelPipeline, wranglerDeployRunner };
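The doc comment on `BASELINE_ROLES` above mentions selection by key, for example from a product config. One way to narrow an untrusted config string to `BaselineRoleKey` is sketched below. The map is a local stand-in with placeholder values, and `isBaselineRoleKey` and `resolveRole` are hypothetical helpers that are not part of the package. Falling back to `generalist` is a choice made for this sketch, loosely motivated by the changelog calling it the strong default.

```typescript
// Stand-in for the exported map; placeholder values, the real strings are longer.
const BASELINE_ROLES = {
  engineer: "<engineer role text>",
  researcher: "<researcher role text>",
  generalist: "<generalist role text>",
} as const;

// Same derivation as the declaration file: the union of the map's keys.
type BaselineRoleKey = keyof typeof BASELINE_ROLES;

// Hypothetical helper: type guard for config input. The own-property check
// avoids accepting inherited keys such as "toString".
function isBaselineRoleKey(value: string): value is BaselineRoleKey {
  return Object.prototype.hasOwnProperty.call(BASELINE_ROLES, value);
}

// Hypothetical helper: unknown or missing keys fall back to the generalist role.
function resolveRole(configured: string | undefined): BaselineRoleKey {
  return configured !== undefined && isBaselineRoleKey(configured)
    ? configured
    : "generalist";
}

console.log(resolveRole("engineer"));
console.log(resolveRole("toString"));
console.log(resolveRole(undefined));
```

Because `BaselineRoleKey` is derived with `keyof typeof`, a role added to the map later would widen the union without changes to helpers like these.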
package/dist/index.js CHANGED
@@ -10719,13 +10719,57 @@ function replacerSortKeys() {
  // src/profile/index.ts
  var profile_exports = {};
  __export(profile_exports, {
+   BASELINE_ROLES: () => BASELINE_ROLES,
    applyDomainPatch: () => applyDomainPatch,
    baselineProfile: () => baselineProfile,
+   baselineProfileFromRole: () => baselineProfileFromRole,
+   engineerRole: () => engineerRole,
+   generalistRole: () => generalistRole,
    prodProfile: () => prodProfile,
    profileToSurface: () => profileToSurface,
    renderProfile: () => renderProfile,
+   researcherRole: () => researcherRole,
    sectionHash: () => sectionHash
  });
+
+ // src/profile/baselines.ts
+ var engineerRole = [
+   "You are a senior principal engineer \u2014 a 10x individual contributor \u2014 operating inside an isolated sandbox workspace with real tools (shell, filesystem, editors, test runners).",
+   "You do not behave like a chatbot that describes work. You DO the work: you produce the actual artifact in the workspace, then you verify it.",
+   "",
+   "How you operate:",
+   "- Deliver the smallest correct change that fully satisfies the goal. Bias to the real artifact (the file, the patch, the document), never a description of it.",
+   "- Before declaring done, run the available checks (tests, typecheck, validators, a re-read of what you produced). If a check fails, fix the ROOT CAUSE \u2014 never weaken the check, never hide the error, never fake success.",
+   "- External-boundary calls (shell, network, filesystem) can fail. Inspect the outcome before relying on it; surface failures loudly rather than proceeding on a bad value.",
+   '- State outcomes faithfully: what you verified, what you skipped, what is still failing. "Done" means produced AND verified.'
+ ].join("\n");
+ var researcherRole = [
+   "You are a principal research analyst operating inside a workspace with tools to read sources, search, and record findings.",
+   "Your output is only as good as its grounding. You gather from the real sources in front of you and you ground every material claim.",
+   "",
+   "How you operate:",
+   "- Read the actual sources before concluding. Do not answer from memory when a source is available to check.",
+   "- Cite the source for every material claim; explicitly mark anything you infer rather than verify. Never fabricate a source, a quote, a number, or a citation.",
+   "- Distinguish what the sources establish from what you are extrapolating, and say which is which.",
+   "- When the sources are insufficient or contradictory, say so plainly rather than papering over the gap."
+ ].join("\n");
+ var generalistRole = [
+   "You are a capable, senior generalist assistant operating inside a workspace with real tools.",
+   "You are direct and grounded: you verify before you assert, and you produce real output rather than describing it.",
+   "",
+   "How you operate:",
+   "- Prefer doing over describing \u2014 when the workspace lets you produce the artifact, produce it.",
+   "- Ground claims in what you can check; when you cannot check something, say so instead of guessing confidently.",
+   "- Verify your output before declaring done, and report what you verified vs. what remains uncertain.",
+   "- Ask the user only when a choice is genuinely theirs and you cannot resolve it from the task, the workspace, or sensible defaults."
+ ].join("\n");
+ var BASELINE_ROLES = {
+   engineer: engineerRole,
+   researcher: researcherRole,
+   generalist: generalistRole
+ };
+
+ // src/profile/index.ts
  function renderSkill(skill) {
    const lines = [`### ${skill.name}`, skill.description];
    if (skill.triggers.length > 0) {
@@ -10785,6 +10829,9 @@ function baselineProfile(args) {
      domain: []
    };
  }
+ function baselineProfileFromRole(role, args = {}) {
+   return baselineProfile({ role: BASELINE_ROLES[role], ...args });
+ }
  function prodProfile(baseline, shipped) {
    return { ...baseline, domain: [...baseline.domain, ...shipped] };
  }
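One detail of the added `baselineProfileFromRole` body is the spread order: `role` is set first and `...args` is spread after it. The declared `args` type has no `role` field, so typed callers cannot override it, but an untyped caller could. The sketch below illustrates that with local stand-ins. The simplified `baselineProfile` (which just copies `role` through) and the placeholder role strings are assumptions, since the full shipped body is not shown in this hunk.

```typescript
// Stand-in map with placeholder values.
const BASELINE_ROLES = {
  engineer: "<engineer role text>",
  researcher: "<researcher role text>",
  generalist: "<generalist role text>",
} as const;
type BaselineRoleKey = keyof typeof BASELINE_ROLES;

// Assumed, simplified baselineProfile: copies `role` and `environment` through.
function baselineProfile(args: { role: string; environment?: string }) {
  return { role: args.role, environment: args.environment, domain: [] as string[] };
}

// Same body as the added function: `role` first, overrides spread afterwards.
function baselineProfileFromRole(
  role: BaselineRoleKey,
  args: { environment?: string } = {},
) {
  return baselineProfile({ role: BASELINE_ROLES[role], ...args });
}

// Typed call: only the declared override fields are accepted.
const typed = baselineProfileFromRole("engineer", { environment: "CI sandbox" });

// Untyped input (for example parsed JSON) can carry a stray `role`, and because
// the spread comes last, that value replaces the baseline role.
const untypedArgs = JSON.parse('{"role":"custom role text"}');
const overridden = baselineProfileFromRole("engineer", untypedArgs);

console.log(typed.role === BASELINE_ROLES.engineer);
console.log(overridden.role);
```

Callers that build the overrides from external config may therefore want to pick the allowed fields explicitly before passing them in.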