npm - agentme - Versions diffs - 0.8.2 → 0.10.0 - Mend

agentme 0.8.2 → 0.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/.filedist-package.yml CHANGED

@@ -1,5 +1,6 @@
 sets:
-  - package: xdrs-core@0.27.0
+  - package: xdrs-core@0.28.0
+  # - package: git:https://github.com/flaviostutz/xdrs-core.git@main
     selector:
       files:
         - .xdrs/_core/**

package/.xdrs/agentme/edrs/application/018-ai-agent-development-standards.md ADDED

@@ -0,0 +1,156 @@
+---
+name: agentme-edr-policy-018-ai-agent-development-standards
+description: Defines the standard toolchain, framework, evaluation approach, and workflow patterns for building AI agents with Python and LangGraph. Use when scaffolding, reviewing, or extending AI agent projects.
+apply-to: AI agent projects built with Python
+valid-from: 2026-05-26
+---
+# agentme-edr-policy-018: AI agent development standards
+## Context and Problem Statement
+AI agent projects vary widely in how they choose frameworks, manage context, evaluate outputs, and structure workflows. Without a shared baseline, projects accumulate incompatible patterns for LLM provider abstraction, flow design, and dataset-driven testing.
+Which tools, frameworks, and design patterns should AI agent projects follow to ensure reproducibility, testability, and maintainability?
+## Decision Outcome
+**Use Python with LangGraph for flow orchestration and MLflow for experiment tracking and local evaluation.**
+### Details
+#### 01-language-and-framework
+All agent projects MUST be implemented in Python, following [agentme-edr-014](014-python-project-tooling.md) for project structure, tooling, and Makefile conventions.
+Agent flows MUST be built with **LangGraph**. Use LangGraph `StateGraph` to model each distinct workflow as an explicit directed graph with typed state.
+#### 02-llm-provider-compatibility
+Agent code MUST be compatible with both **OpenAI** and **Azure OpenAI** providers without code changes. Achieve this by:
+- Using the `langchain-openai` package which supports both providers through environment variables.
+- Selecting the provider by setting `OPENAI_API_TYPE=azure` (Azure OpenAI) or omitting it (OpenAI).
+- Never hardcoding provider-specific URLs, deployment names, or API versions in code; inject them through environment variables or a configuration object.
+Minimum required environment variable surface:
+| Variable | Purpose |
+|---|---|
+| `OPENAI_API_KEY` | API key (both providers) |
+| `OPENAI_API_BASE` / `AZURE_OPENAI_ENDPOINT` | Endpoint (Azure only) |
+| `OPENAI_API_VERSION` | API version (Azure only) |
+| `AZURE_OPENAI_DEPLOYMENT` | Deployment/model name (Azure only) |
+| `OPENAI_MODEL` | Model name (OpenAI only) |
+#### 03-observability-and-experiment-tracking
+Use **MLflow** for all agent observability and evaluation:
+- Wrap each agent run with `mlflow.start_run()` to capture traces, parameters, and metrics locally.
+- Enable LangChain auto-tracing via `mlflow.langchain.autolog()` at entry point startup.
+- Log run parameters (model name, temperature, prompt version) and output metrics (accuracy, latency, token counts) using `mlflow.log_param` / `mlflow.log_metric`.
+- Run a local MLflow tracking server with `mlflow ui` to inspect runs during development. Do not require a remote MLflow server for local development.
+#### 04-dataset-driven-accuracy-measurement
+Every agent pipeline MUST have a companion evaluation dataset and an MLflow experiment that measures accuracy against it. Datasets and evals are organized per-workflow following rule `07-workflow-structure` and rule `08-workflow-evals`.
+- Store evaluation datasets under `evals/<workflow>/` (sibling of `lib/` and `examples/`), following [agentme-edr-019](019-ml-dataset-structure.md) for structure and format. For MLflow input/output pairs, use the JSONL format described in `agentme-edr-019.04-complex-structured-datasets-must-use-jsonl`.
+- Write evaluation scripts under `evals/<workflow>/` that load the dataset, run each input through the live agent (against real LLMs, not mocks), compare outputs to expected values, and log per-sample and aggregate metrics to an MLflow experiment.
+- Add a `make eval` Makefile target in the module root Makefile (the same Makefile that sits alongside `lib/` and `examples/`) that delegates to all per-workflow eval targets.
+- Evaluation MUST run against real LLM providers, not recorded responses, to capture model drift. MLflow tracking MUST work locally without a remote server.
+#### 05-flow-documentation
+Each agent flow MUST be documented as a **Mermaid graph** in the project `README.md`. The diagram must match the LangGraph `StateGraph` definition:
+- Use `graph TD` or `graph LR` direction.
+- Label each node with its Python function name.
+- Label conditional edges with the condition expression.
+- Update the diagram whenever the graph topology changes.
+Example minimal diagram block:
+```mermaid
+graph TD
+    A[fetch_context] --> B[draft_response]
+    B --> C{verify}
+    C -->|pass| D[output]
+    C -->|fail| B
+```
+#### 06-verification-steps
+Agent flows MUST include at least one explicit verification node before producing final output:
+- Model the verification step as a dedicated LangGraph node (e.g. `verify_output`).
+- The node checks the draft output against defined acceptance criteria (schema validation, factual consistency check, rubric scoring, or LLM-as-judge call).
+- On failure, the verification node MUST route back to the relevant generation node, not silently pass through.
+- Log verification results (pass/fail, score, reason) as MLflow metrics on the current run.
+#### 07-workflow-structure
+Agent logic MUST be organized as named workflows. Each workflow is an independent LangGraph `StateGraph` with a defined start node and end node, connecting agents, states, routes, and decision nodes.
+For each workflow named `<workflow>`, create:
+```text
+lib/
+  workflows/
+    <workflow>/
+      graph.py        # StateGraph definition; entry point for the workflow
+      agents.py       # LangChain agent definitions used by this workflow
+      states.py       # Typed state dataclasses / TypedDicts
+      routes.py       # Conditional edge functions
+```
+- `graph.py` MUST define and compile the `StateGraph` and expose a `graph` object that callers invoke.
+- Additional modules (tools, prompts, schemas) MAY be added inside `lib/workflows/<workflow>/` when they are specific to that workflow. Shared utilities belong in `lib/<module>/`.
+- Each workflow MUST be documented with a Mermaid diagram in the project `README.md` following rule `05-flow-documentation`.
+#### 08-workflow-evals
+For each workflow `<workflow>` there MUST be a corresponding eval directory:
+```text
+evals/
+  <workflow>/
+    Makefile                   # eval targets for this workflow
+    dataset_<slice>/           # one folder per eval slice (see agentme-edr-019)
+    eval_<slice>.py            # evaluation script for each slice
+```
+The `evals/<workflow>/Makefile` MUST define:
+| Target | Behaviour |
+|---|---|
+| `eval` | Runs all eval slices for the workflow |
+| `eval-<slice>` | Runs one named slice (e.g. `eval-simple`, `eval-complex`) |
+Each `eval_<slice>.py` script MUST:
+- Load the dataset from `evals/<workflow>/dataset_<slice>/` following [agentme-edr-019](019-ml-dataset-structure.md).
+- Run every input through the live workflow against real LLMs.
+- Log per-sample and aggregate metrics to an MLflow experiment that runs locally.
+The module root Makefile `make eval` target MUST delegate to `eval` in every `evals/<workflow>/Makefile`.
+#### 09-local-sandbox
+When a workflow node or tool requires a **local sandbox** — an isolated environment where the agent can read files, glob-search directories, and execute shell commands — use the **[deepagents](https://github.com/deepagents/deepagents) framework** to provide that sandbox.
+**When to apply this rule**
+Use deepagents whenever ANY of the following is true for a workflow or tool:
+- The agent needs to execute shell commands or scripts in a controlled environment.
+- The agent needs to list, read, or search files across multiple directories at runtime.
+- The agent operates on user-supplied or generated file trees that must not escape a sandboxed boundary.
+**Integration requirements**
+- Initialize the sandbox at the start of the workflow run and shut it down in the same `try/finally` block.
+- Pass the sandbox handle into the LangGraph workflow state so all nodes share the same sandbox instance.
+- If the host-side code needs to pass files into the sandbox (e.g. generated config or input data), create a temporary directory with `tempfile.mkdtemp()`, write the files there, and mount it into the sandbox. Clean it up in the `finally` block.
+- Replace hand-rolled `read_file`, `search_files`, and `grep_file` tool implementations with the equivalent tools provided by deepagents.
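
Note on rule `06-verification-steps` in the file above: a failed verification must route back to generation rather than pass through. The routing decision can be sketched as a plain function of the kind usually passed to LangGraph as a conditional edge. The state keys (`verified`, `attempts`), the node names `output`, `draft_response`, and `report_failure`, and the retry cap are all hypothetical; the policy does not define them.

```python
from typing import TypedDict


class DraftState(TypedDict):
    draft: str
    verified: bool
    attempts: int


MAX_ATTEMPTS = 3  # assumed cap; the policy does not specify one


def verify_output(state: DraftState) -> DraftState:
    """Verification node: a deliberately trivial acceptance check (non-empty draft)."""
    ok = bool(state["draft"].strip())
    return {**state, "verified": ok, "attempts": state["attempts"] + 1}


def route_after_verify(state: DraftState) -> str:
    """Conditional edge: return the name of the next node to run."""
    if state["verified"]:
        return "output"
    if state["attempts"] >= MAX_ATTEMPTS:
        # Assumption: surface the failure explicitly instead of looping forever.
        return "report_failure"
    return "draft_response"
```

A failed check is never mapped to `output`, so the graph cannot pass an unverified draft through silently.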

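Note on rule `02-llm-provider-compatibility` in the file above: the provider is selected purely through environment variables. A minimal stdlib-only sketch of that resolution step follows. `ProviderConfig` and `resolve_provider` are hypothetical names, not part of the package or of `langchain-openai`, and preferring `AZURE_OPENAI_ENDPOINT` over `OPENAI_API_BASE` when both are set is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    provider: str                      # "azure" or "openai"
    api_key: str
    model: str                         # deployment name on Azure, model name on OpenAI
    endpoint: Optional[str] = None     # Azure only
    api_version: Optional[str] = None  # Azure only


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Build a provider config from the variables listed in the policy table."""
    env = os.environ if env is None else env
    api_key = env["OPENAI_API_KEY"]
    if env.get("OPENAI_API_TYPE") == "azure":
        # Assumption: AZURE_OPENAI_ENDPOINT wins when both endpoint variables are set.
        endpoint = env.get("AZURE_OPENAI_ENDPOINT") or env.get("OPENAI_API_BASE")
        if not endpoint:
            raise ValueError("Azure selected but no endpoint variable is set")
        return ProviderConfig(
            provider="azure",
            api_key=api_key,
            model=env["AZURE_OPENAI_DEPLOYMENT"],
            endpoint=endpoint,
            api_version=env["OPENAI_API_VERSION"],
        )
    return ProviderConfig(provider="openai", api_key=api_key, model=env["OPENAI_MODEL"])
```

Passing the mapping in explicitly keeps the function testable without mutating the process environment.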
package/.xdrs/agentme/edrs/application/019-ml-dataset-structure.md ADDED

@@ -0,0 +1,96 @@
+---
+name: agentme-edr-policy-019-ml-dataset-structure
+description: Defines the standard folder layout and file conventions for ML datasets used in AI/ML projects. Use when creating, organizing, or consuming datasets for machine learning tasks such as image labeling, document extraction, tabular data, LLM evaluation, and Q&A sets.
+apply-to: ML and AI projects that produce or consume datasets
+valid-from: 2026-05-27
+---
+# agentme-edr-policy-019: ML dataset structure
+## Context and Problem Statement
+ML projects accumulate datasets of different shapes: file-paired annotations, tabular CSVs, and structured JSONL records. Without a shared layout convention, tooling and agents cannot reliably discover schema files, consume data programmatically, or understand what a dataset contains.
+How should ML datasets be organized on disk so they are self-describing, easy to consume, and consistent across dataset types?
+## Decision Outcome
+**A standard root layout with mandatory README.md and dataset.schema.json, plus type-specific conventions for data files**
+Every dataset MUST live in its own named folder and include a README and a JSON Schema file. Data files are organized according to three dataset types, each with its own placement rule.
+### Details
+#### 01-root-structure-is-mandatory
+Every dataset MUST follow this root layout:
+```
+/[name-of-dataset]/
+    README.md
+    dataset.schema.json
+    data/              (present when dataset files are referenced by other data, or for file+annotation pairs)
+    ...                (additional files depending on dataset type)
+```
+- `README.md` MUST explain what the dataset is about, the procedures used to create it, remarks on data quality, and instructions on how to consume it with examples.
+- `dataset.schema.json` MUST be a valid [JSON Schema](https://json-schema.org/) document describing the structure of the dataset's primary data.
+- The dataset folder name MUST be lowercase, using underscores as separators (e.g. `my_dataset`).
+#### 02-file-annotation-pairs-must-use-data-folder
+Datasets where each item is a file paired with structured JSON output (e.g. image labeling, document data extraction, medical records with known features) MUST store all files inside the `data/` subfolder. Each data file MUST have a sibling JSON annotation file named with the same filename suffixed with `.json`.
+```
+/[name-of-dataset]/
+    data/
+        image1.jpg
+        image1.jpg.json
+        docu.pdf
+        docu.pdf.json
+        case-123.json
+        case-123.json.json
+    dataset.schema.json    (defines the schema for the .json annotation files)
+    README.md
+```
+Placing the annotation file next to its source file (same name + `.json`) keeps them adjacent even in large directories, making it easy to iterate pairs programmatically.
+Subdirectories inside `data/` are allowed when the number of files warrants grouping, but the `.json` sibling convention MUST be preserved at each level.
+#### 03-tabular-datasets-must-use-csv-files-at-root
+Datasets composed of column-oriented tabular data MUST place CSV files at the root of the dataset folder. All tabular files MUST conform to the schema defined in `dataset.schema.json`, which MUST describe columns as named attributes with their types.
+```
+/[name-of-dataset]/
+    samples-special.csv
+    samples-simple.csv
+    dataset.schema.json    (column definitions with types for all tabular files)
+    README.md
+```
+Multiple CSV files are allowed when they represent different slices or splits of the same schema (e.g. train/test splits, subsets by source). All files in the same dataset MUST share the same column schema.
+#### 04-complex-structured-datasets-must-use-jsonl
+Datasets with complex or heterogeneous per-record structures (e.g. LLM workflow evaluation sets, Q&A pairs, input → expected_output pairs) MUST use JSONL files (one JSON object per line) placed at the root of the dataset folder. Each line MUST conform to the schema defined in `dataset.schema.json`.
+```
+/[name-of-dataset]/
+    simple-cases-test.jsonl
+    edge-cases-test.jsonl
+    dataset.schema.json    (schema defining the structure of each line in the JSONL files)
+    README.md
+```
+Multiple JSONL files are allowed when they represent different splits or categories (e.g. easy vs. edge cases). All files in the same dataset MUST conform to the same line schema.
+#### 05-referenced-files-must-live-in-data-folder
+When any dataset type (tabular, JSONL, or annotation-pair) contains references to external files as part of the data (e.g. a JSONL record that includes a file path), those referenced files MUST be stored inside the `data/` subfolder of the dataset. Paths inside data records MUST be relative to the dataset root.
+## References
+- [JSON Schema specification](https://json-schema.org/)
+- [JSONL format](https://jsonlines.org/)
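
Note on rule `04-complex-structured-datasets-must-use-jsonl` in the file above: each line of a JSONL file is one JSON object that must conform to the dataset schema. The sketch below reads such a file with the standard library only. The `required_keys` argument is a hypothetical stand-in for `dataset.schema.json`; full JSON Schema validation would need a third-party validator, which is not shown here.

```python
import json
from pathlib import Path
from typing import FrozenSet, Iterator


def read_jsonl(path: Path, required_keys: FrozenSet[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, rejecting records that miss required keys."""
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"{path.name}:{lineno}: expected a JSON object")
            missing = required_keys - set(record)
            if missing:
                raise ValueError(f"{path.name}:{lineno}: missing keys {sorted(missing)}")
            yield record
```

Reporting the file name and line number makes a bad record easy to locate in a large evaluation set.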

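Note on rule `02-file-annotation-pairs-must-use-data-folder` in the file above: every data file has a sibling annotation named with the same filename plus `.json`. A minimal sketch of iterating those pairs follows; `iter_pairs` is a hypothetical helper, and silently skipping a `.json` file that has no matching source file is an assumption rather than something the policy prescribes.

```python
from pathlib import Path
from typing import Iterator, Tuple

SUFFIX = ".json"


def iter_pairs(data_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (source_file, annotation_file) pairs found anywhere under data_dir."""
    for annotation in sorted(data_dir.rglob("*" + SUFFIX)):
        source_name = annotation.name[: -len(SUFFIX)]
        if not source_name:
            continue  # a file literally named ".json" has no source
        source = annotation.with_name(source_name)
        # "case-123.json" is itself a source, not an annotation: its candidate
        # source "case-123" does not exist, so it is skipped here and picked up
        # when "case-123.json.json" is visited.
        if source.is_file():
            yield source, annotation
```

Because the lookup is per directory, the sibling convention holds at every level of `data/`, including subdirectories.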
package/.xdrs/agentme/edrs/application/020-ai-agent-xdrs-knowledge-layer.md ADDED

@@ -0,0 +1,99 @@
+---
+name: agentme-edr-policy-020-ai-agent-xdrs-knowledge-layer
+description: Defines how to integrate XDRS as the runtime knowledge source of truth for AI agents — covering document placement, AGENTS.md setup, file tools, and local sandbox configuration. Apply only when the project explicitly uses XDRS to govern agent behavior.
+apply-to: AI agent projects that use XDRS as the source of truth for policies and skills
+valid-from: 2026-05-27
+---
+# agentme-edr-policy-020: AI agent XDRS knowledge layer
+## Context and Problem Statement
+AI agents need access to project-specific policies and skills at runtime to produce consistent, governed outputs. XDRS provides a file-system-based structure for capturing these decisions, but there is no standard pattern for embedding XDRS documents in agent libraries, wiring the agent to consult them, or sandboxing file access securely.
+How should an AI agent project integrate XDRS as its runtime source of truth for policies and skills?
+## Decision Outcome
+**Embed XDRS documents in `lib/data/.xdrs/`, instruct the agent to consult them via `AGENTS.md`, equip the agent with sandboxed file tools, and use the deepagents framework when a local sandbox is required.**
+This policy MUST only be applied when the project explicitly chooses XDRS as its knowledge governance layer. It is not required by [agentme-edr-018](018-ai-agent-development-standards.md) in general.
+### Details
+#### 01-xdrs-knowledge-layer
+XDRS documents are the source of truth for all policies and skills that the agent must follow during its tasks. The agent MUST consult XDRS before acting, not rely on general knowledge alone.
+**Placing XDRS documents in the library**
+- XDRS Policy and Skill documents MUST be placed at `lib/data/.xdrs/`, using the standard XDRS scope/type/subject folder structure (following `_core-adr-policy-001`).
+- They MUST be embedded in the package data manifest (e.g. `pyproject.toml` `[tool.hatch.build] include` or equivalent) so they are available at runtime.
+- When exposed through a deepagents sandbox, they MUST be mounted at `/.xdrs/` inside the sandbox (see rule `03-local-sandbox`).
+**AGENTS.md — mandatory XDRS consultation**
+Place an `AGENTS.md` file at the root of the deepagents sandbox (i.e. alongside `/.xdrs/`). This file instructs the agent to always consult XDRS before acting. Its content MUST follow the xdrs-core AGENTS.md template:
+```markdown
+# AGENTS.md
+**Purpose:** This file is intentionally brief. All decisions and working instructions are captured as Policies or Skills in the XDRS structure.
+## Policy Consultation in XDRS Is Mandatory For Every Request
+Before answering **any** request you MUST:
+1. Read the XDRS root index at `/.xdrs/index.md` to identify relevant Policies and Skills.
+2. Read the relevant Policy and Skill files.
+3. Base your actions on those Policies and Skills.
+This rule has NO exceptions. Do not answer from general knowledge alone when a Policy may exist on the topic.
+```
+The agent system prompt MUST reference `AGENTS.md` so the agent loads it at startup. Example:
+```
+Read /AGENTS.md and follow all instructions in it before proceeding.
+```
+#### 02-agent-file-tools
+Every agent that uses the XDRS knowledge layer MUST use the file tools provided by the deepagents framework. Do not implement hand-rolled alternatives — see [agentme-edr-policy-018-ai-agent-development-standards.[09-local-sandbox]](018-ai-agent-development-standards.md) for the full sandbox and tool requirements.
+These tools operate over two sandboxed roots (configured in rule `03-local-sandbox`):
+| Root | Content | Source |
+|---|---|---|
+| `data_root` | Static files shipped with the library (`lib/data/`) | Resolved via `importlib.resources` at workflow startup |
+| `temp_root` | Dynamic files generated for the current workflow run | Temporary directory created by `tempfile.mkdtemp()` at workflow startup |
+`temp_root` MUST be created at workflow startup and cleaned up in the same `try/finally` block. Pass it explicitly into the workflow; do not read it from a global variable.
+#### 03-local-sandbox
+Follow [agentme-edr-policy-018-ai-agent-development-standards.[09-local-sandbox]](018-ai-agent-development-standards.md) for the general deepagents sandbox setup. When XDRS is in use, add the following mounts to the sandbox configuration:
+| Source | Content | Deepagents sandbox path |
+|---|---|---|
+| `lib/data/.xdrs/` | XDRS Policy and Skill documents | `/.xdrs/` (read-only) |
+| Generated at startup | `AGENTS.md` instructing the agent to consult XDRS | `/AGENTS.md` (read-only) |
+XDRS documents MUST always be mounted at `/.xdrs/`. `AGENTS.md` MUST always be placed at the sandbox root (`/AGENTS.md`).
+Example XDRS mount additions:
+```python
+from importlib.resources import files
+from pathlib import Path
+data_root = str(files("myagent").joinpath("data"))
+agents_md = Path(temp_root) / "AGENTS.md"
+agents_md.write_text(_AGENTS_MD)  # content from xdrs-core AGENTS.md template; see rule 01-xdrs-knowledge-layer
+# Add these mounts alongside the base mounts from agentme-edr-018 rule 09-local-sandbox:
+xdrs_mounts = [
+    {"src": f"{data_root}/.xdrs", "dst": "/.xdrs",    "readonly": True},
+    {"src": str(agents_md),       "dst": "/AGENTS.md", "readonly": True},
+]
+```

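Note on rules `02-agent-file-tools` and `03-local-sandbox` in the file above: `temp_root` must be created at workflow startup, passed in explicitly, and cleaned up in the same `try/finally` block. The sketch below shows only that lifecycle with the standard library. `run_with_temp_root` and the `workflow` callable are hypothetical, the `AGENTS_MD` text is a placeholder rather than the xdrs-core template, and the sandbox mount step is omitted because the diff does not show the deepagents API.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable

AGENTS_MD = "Read /.xdrs/index.md before acting.\n"  # placeholder content only


def run_with_temp_root(workflow: Callable[[Path], str]) -> str:
    """Create temp_root, hand it to the workflow, and always remove it afterwards."""
    temp_root = Path(tempfile.mkdtemp(prefix="agent-run-"))
    try:
        # Files the sandbox would mount are written before the workflow starts.
        (temp_root / "AGENTS.md").write_text(AGENTS_MD, encoding="utf-8")
        return workflow(temp_root)  # passed explicitly, never read from a global
    finally:
        shutil.rmtree(temp_root, ignore_errors=True)
```

The `finally` clause runs on both normal return and exception, so a failing workflow does not leak its temporary directory.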
package/.xdrs/agentme/edrs/devops/008-common-targets.md CHANGED

@@ -103,6 +103,7 @@ Targets are organized into five lifecycle groups. Projects must use these names
 | `test-unit` | Run unit tests only, including coverage report generation and coverage threshold enforcement. |
 | `test-integration` | *(Optional)* Run integration and end-to-end tests only. Projects without integration tests may omit this target. |
 | `test-smoke` | *(Optional)* Run a fast, minimal subset of tests to verify the software is basically functional. Useful as a post-deploy health check. |
+| `eval` | *(Optional)* Run **all evaluations** for the module. Used alongside `test` to measure the accuracy and performance of statistical systems such as ML models, AI agents, or noisy systems. Typically runs against a live or near-live system (similar to an integration test) and produces a performance analysis report (e.g., F1 score, Accuracy, Precision, Recall). Must not be included in `test` or `all` — evals are opt-in because they require live dependencies and may be slow or costly to run. Individual evaluations must follow the prefix convention: `eval-<qualifier>` (e.g., `eval-simple`, `eval-complex`). |
 ##### Release group

package/.xdrs/agentme/edrs/index.md CHANGED

@@ -29,6 +29,9 @@ Language and framework-specific tooling and project structure.
 - [agentme-edr-010](application/010-golang-project-tooling.md) - **Go project tooling and structure** - Scaffold Go CLIs and libraries with the standard layout *(includes skill: [003-create-golang-project](application/skills/003-create-golang-project/SKILL.md))*
 - [agentme-edr-014](application/014-python-project-tooling.md) - **Python project tooling and structure** - Scaffold Python packages and CLIs with the standard layout *(includes skill: [005-create-python-project](application/skills/005-create-python-project/SKILL.md))*
 - [agentme-edr-015](application/015-cli-tool-standards.md) - **CLI tool standards** - Define command UX and behavior for CLI tools
+- [agentme-edr-018](application/018-ai-agent-development-standards.md) - **AI agent development standards** - Standard toolchain, framework, evaluation, and workflow patterns for AI agent projects built with Python and LangGraph
+- [agentme-edr-019](application/019-ml-dataset-structure.md) - **ML dataset structure** - Standard folder layout and file conventions for ML datasets
+- [agentme-edr-020](application/020-ai-agent-xdrs-knowledge-layer.md) - **AI agent XDRS knowledge layer** - How to integrate XDRS as the runtime source of truth for policies and skills in AI agents (apply only when the project explicitly uses XDRS)
 - [004-select-relevant-xdrs](application/skills/004-select-relevant-xdrs/SKILL.md) - **Select relevant XDRs**
 ## Devops

package/.xdrs/agentme/edrs/principles/004-unit-test-requirements.md CHANGED

@@ -68,7 +68,27 @@ Builds that miss the threshold must not be merged.
 ---
-#### 04-should-extract-shared-setup
+#### 04-must-place-test-files-alongside-source
+Test files must live next to the source file they test, in the same directory, following the convention of the language/framework (e.g. `file.test.ts`, `file_test.go`, `file.spec.js`).
+```
+src/mymodule/group1/file1.ts        ← source
+src/mymodule/group1/file1.test.ts   ← test (same directory)
+```
+**Exception — separate test folder:** When the framework makes co-location impractical (e.g. Python's common `tests/` convention), or when the community strongly favors a separate folder, a dedicated test root (e.g. `tests/`) is allowed. In that case the test folder **must mirror** the source folder structure exactly:
+```
+src/mymodule/group1/file1.py          ← source
+tests/mymodule/group1/file1_test.py   ← test (mirrored path)
+```
+Do not flatten or reorganize paths when using a separate test folder.
+---
+#### 05-should-extract-shared-setup
 When setup logic is repeated across two or more test files, centralize it (`src/test-utils/`, `internal/testutil/`, `tests/conftest.py`).
@@ -81,7 +101,7 @@ export function makeOrder(overrides: Partial<Order> = {}): Order {
 ---
-#### 05-should-avoid-mocks
+#### 06-should-avoid-mocks
 Use the lowest-cost alternative that exercises real behavior:

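Note on the separate-test-folder exception in rule `04-must-place-test-files-alongside-source` above: the test path must mirror the source path exactly. One way to express that mapping is sketched below. `mirrored_test_path` is a hypothetical helper, and the `_test` filename suffix follows the Python example in the diff rather than any fixed rule.

```python
from pathlib import PurePosixPath


def mirrored_test_path(source: str, src_root: str = "src", test_root: str = "tests") -> PurePosixPath:
    """Map a source path under src_root to the mirrored test path under test_root."""
    rel = PurePosixPath(source).relative_to(src_root)  # raises ValueError if outside src_root
    return PurePosixPath(test_root) / rel.with_name(f"{rel.stem}_test{rel.suffix}")


print(mirrored_test_path("src/mymodule/group1/file1.py"))
# → tests/mymodule/group1/file1_test.py
```

Only the root and the filename change; every intermediate directory is preserved, which is what "do not flatten or reorganize paths" requires.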
package/.xdrs/agentme/edrs/principles/007-project-quality-standards.md CHANGED

@@ -161,3 +161,31 @@ all:
 	$(MAKE) -C basic-usage run
 	$(MAKE) -C advanced-usage run
 ```
+---
+#### 07-statistical-models-must-have-eval-targets
+Projects that contain statistical models (e.g., ML models, LLM-based evaluators, classifiers, ranking systems, or any component whose output quality is measured probabilistically) must define measurable performance thresholds and verify them automatically.
+**Requirements:**
+- A `make eval` target must exist and execute all performance evaluations
+- Each evaluation must have a **documented minimum performance threshold** (e.g., accuracy ≥ 0.85, F1 ≥ 0.80, BLEU ≥ 0.70)
+- Thresholds must be declared explicitly in the project (e.g., in a config file, `Makefile` variable, or documented in `README.md`)
+- `make eval` must **exit with a non-zero status** (fail) if:
+  - The evaluation cannot be executed (missing data, environment errors, model load failures)
+  - Any metric falls below its defined minimum threshold
+- CI/CD must invoke `make eval` before releasing any version that changes model weights, prompts, or evaluation logic
+**Threshold declaration example (Makefile):**
+```makefile
+EVAL_MIN_ACCURACY := 0.85
+EVAL_MIN_F1       := 0.80
+eval:
+	python eval.py \
+	  --min-accuracy $(EVAL_MIN_ACCURACY) \
+	  --min-f1 $(EVAL_MIN_F1) \
+	  || (echo "Evaluation failed: metrics below threshold"; exit 1)
+```

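Note on rule `07-statistical-models-must-have-eval-targets` above: the Makefile example passes `--min-accuracy` and `--min-f1` to an `eval.py` that the diff does not show. A minimal sketch of the threshold gate such a script might contain follows. `check_thresholds` and `main` are hypothetical names, and the metrics are passed in as a dict because how they are computed is outside the diff.

```python
import argparse
import sys
from typing import Dict, List, Optional


def check_thresholds(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def main(argv: Optional[List[str]], metrics: Dict[str, float]) -> int:
    """Parse the Makefile's flags and return the process exit status."""
    parser = argparse.ArgumentParser(description="Gate evaluation metrics on thresholds")
    parser.add_argument("--min-accuracy", type=float, required=True)
    parser.add_argument("--min-f1", type=float, required=True)
    args = parser.parse_args(argv)
    failures = check_thresholds(
        metrics, {"accuracy": args.min_accuracy, "f1": args.min_f1}
    )
    for message in failures:
        print(f"eval failed: {message}", file=sys.stderr)
    return 1 if failures else 0  # non-zero status makes `make eval` fail
```

A real script would compute `metrics` from the evaluation dataset and end with `sys.exit(main(None, metrics))`. Treating a missing metric as a failure covers the requirement that an evaluation which cannot be executed must also fail.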
package/package.json CHANGED

@@ -1,9 +1,9 @@
 {
   "name": "agentme",
-  "version": "0.8.2",
+  "version": "0.10.0",
   "description": "",
   "dependencies": {
-    "filedist": "^0.33.0"
+    "filedist": "^0.34.1"
   },
   "bin": "bin/filedist.js",
   "files": [
@@ -22,6 +22,6 @@
     "url": "https://github.com/flaviostutz/agentme.git"
   },
   "devDependencies": {
-    "xdrs-core": "^0.27.0"
+    "xdrs-core": "^0.28.0"
   }
 }