PyPI - decodingtrust-agent-sdk - Versions diffs - 0.2.9__tar.gz → 0.2.10__tar.gz - Mend

decodingtrust-agent-sdk: 0.2.9 (tar.gz) → 0.2.10 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (401)

{decodingtrust_agent_sdk-0.2.9/decodingtrust_agent_sdk.egg-info → decodingtrust_agent_sdk-0.2.10}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: decodingtrust-agent-sdk
-Version: 0.2.9
+Version: 0.2.10
 Summary: DecodingTrust Agent Platform (DTap) — A controllable and interactive red-teaming platform for AI agents
 Author-email: DTap Team <zhaorun@uchicago.edu>
 License:                                  Apache License
@@ -245,6 +245,8 @@ Requires-Dist: rich>=13.0.0
 Requires-Dist: pandas>=2.0.0
 Requires-Dist: psutil>=5.9.0
 Requires-Dist: huggingface_hub>=0.20.0
+Requires-Dist: playwright>=1.53
+Requires-Dist: pillow>=10.0.0
 Provides-Extra: openai
 Requires-Dist: openai>=2.6.1; extra == "openai"
 Requires-Dist: openai-agents>=0.8.4; extra == "openai"
@@ -260,6 +262,10 @@ Requires-Dist: google-generativeai>=0.3.0; extra == "google"
 Requires-Dist: google-genai>=1.0.0; extra == "google"
 Requires-Dist: google-api-core>=2.28.0; extra == "google"
 Requires-Dist: google-api-python-client>=2.100.0; extra == "google"
+Requires-Dist: google-cloud-monitoring>=2.20.0; extra == "google"
+Requires-Dist: google-cloud-trace>=1.13.0; extra == "google"
+Requires-Dist: opentelemetry-exporter-gcp-trace>=1.7.0; extra == "google"
+Requires-Dist: opentelemetry-exporter-gcp-monitoring>=1.7.0a0; extra == "google"
 Provides-Extra: strands
 Requires-Dist: strands-agents>=1.40.0; extra == "strands"
 Provides-Extra: langchain
@@ -269,6 +275,7 @@ Requires-Dist: langchain-openai>=0.2.0; extra == "langchain"
 Requires-Dist: langchain-anthropic>=0.2.0; extra == "langchain"
 Provides-Extra: pocketflow
 Requires-Dist: pocketflow==0.0.3; extra == "pocketflow"
+Provides-Extra: browser
 Provides-Extra: all
 Requires-Dist: decodingtrust-agent-sdk[claude,google,langchain,openai,pocketflow,strands]; extra == "all"
 Provides-Extra: dev
@@ -340,14 +347,57 @@ We have publicly released the full evaluation results, including the complete re
 ## Installation
+### Option A — from PyPI (recommended for users)
+```bash
+pip install decodingtrust-agent-sdk            # core (includes the browser domain deps)
+# …plus the backend(s) you actually use (see "Agent backends" below):
+pip install "decodingtrust-agent-sdk[openai]"   # OpenAI Agents SDK
+pip install "decodingtrust-agent-sdk[google]"   # Google ADK / Gemini
+```
+This installs the `dtap` CLI. Use it instead of `python eval/evaluation.py`, and select
+benchmark tasks with `--domain`:
+```bash
+dtap eval --domain crm --task-type benign --agent-type openaisdk --model gpt-5.4 --max-parallel 4
+```
+On first run, the per-task dataset is auto-downloaded from HuggingFace — **only for the
+domain(s) you request**. Set `HF_TOKEN` to avoid unauthenticated rate-limiting (HTTP 429):
+```bash
+export HF_TOKEN=hf_...
+```
+### Option B — from source (for development)
 ```bash
 git clone https://github.com/AI-secure/DecodingTrust-Agent.git
 cd DecodingTrust-Agent
-pip install -r requirements.txt
-pip install -e .
+pip install -e ".[openai]"     # or [all] for every backend
+# (here `python eval/evaluation.py --task-list benchmark/...` also works)
 ```
-Set the API key for your backbone model (only the providers you actually use are required):
+### Agent backends (optional extras)
+Install only the framework you evaluate with:
+| Extra | Backend (`--agent-type`) |
+|---|---|
+| `openai` | `openaisdk` |
+| `claude` | `claudesdk` |
+| `google` | `googleadk` |
+| `langchain` | `langchain` |
+| `strands` | `strands` |
+| `pocketflow` | `pocketflow` |
+| `all` | every backend above |
+(The `browser` domain needs no extra — its Playwright deps are part of the core install.)
+### Model keys & Docker
+Set the API key for your backbone model (only the providers you use):
 ```bash
 export OPENAI_API_KEY=sk-...
@@ -357,6 +407,10 @@ export GOOGLE_API_KEY=...
 Docker is required: each task spins up isolated MCP servers and Docker-based environments through `TaskExecutor`.
+> **Browser domain note:** browser tasks send full-page screenshots (large image-token
+> input). With vision models on a metered tier, start at `--max-parallel 2` to avoid
+> provider token-rate limits (HTTP 429), then raise it if your quota allows.
 ---
 ## Quick Start
@@ -367,7 +421,7 @@ A single benign CRM task with the OpenAI Agents SDK backbone:
 python eval/evaluation.py \
   --task-list benchmark/crm/benign.jsonl \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 4
 ```
@@ -405,7 +459,7 @@ Run every benign + direct + indirect task in a domain by pointing `--task-list`
 python eval/evaluation.py \
   --task-list benchmark/finance \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 8
 ```
@@ -415,7 +469,7 @@ python eval/evaluation.py \
 ```bash
 # Benign utility only
-python eval/evaluation.py --task-list benchmark/crm/benign.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list benchmark/crm/benign.jsonl --agent-type openaisdk --model gpt-5.4
 # Direct prompt injection only
 python eval/evaluation.py --task-list benchmark/crm/direct.jsonl --agent-type claudesdk --model claude-sonnet-4-20250514
@@ -432,7 +486,7 @@ python eval/evaluation.py \
   --task-type malicious \
   --threat-model indirect \
   --risk-category data-exfiltration \
-  --agent-type openaisdk --model gpt-4o
+  --agent-type openaisdk --model gpt-5.4
 ```
 ### The entire benchmark
@@ -443,7 +497,7 @@ Point `--task-list` at the top-level [`benchmark/`](benchmark/) directory to run
 python eval/evaluation.py \
   --task-list benchmark \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 16 \
   --skip-existing
 ```
@@ -478,7 +532,7 @@ Any JSONL file with the schema below is a valid `--task-list`. Pick a subset of
 Run it like any built-in task list:
 ```bash
-python eval/evaluation.py --task-list my_subset.jsonl --agent-type openaisdk --model gpt-4o --max-parallel 4
+python eval/evaluation.py --task-list my_subset.jsonl --agent-type openaisdk --model gpt-5.4 --max-parallel 4
 ```
 A few practical patterns:
@@ -486,11 +540,11 @@ A few practical patterns:
 ```bash
 # Curate from an existing file
 grep '"risk_category": "data-exfiltration"' benchmark/crm/indirect.jsonl > my_crm_exfil.jsonl
-python eval/evaluation.py --task-list my_crm_exfil.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list my_crm_exfil.jsonl --agent-type openaisdk --model gpt-5.4
 # Try just one task end-to-end
 echo '{"domain": "crm", "type": "benign", "task_id": "1"}' > one_task.jsonl
-python eval/evaluation.py --task-list one_task.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list one_task.jsonl --agent-type openaisdk --model gpt-5.4
 ```
 For per-task internals (`config.yaml`, `judge.py`, `setup.sh`), see [docs/quickstart.md](docs/quickstart.md).
@@ -508,7 +562,7 @@ For per-task internals (`config.yaml`, `judge.py`, `setup.sh`), see [docs/quicks
 | `--risk-category` | `None` | e.g. `data-exfiltration` |
 | `--max-parallel` | `5` | Concurrent tasks (environments are reused across tasks) |
 | `--agent-type` | `openaisdk` | `openaisdk`, `claudesdk`, `googleadk`, `langchain`, `pocketflow`, `openclaw` |
-| `--model` | `gpt-4o` | Backbone model identifier |
+| `--model` | `gpt-5.4` | Backbone model identifier |
 | `--temperature` | `None` | Sampling temperature (model default if unset) |
 | `--port-range` | `None` | Dynamic MCP port range, e.g. `"10000-12000"` |
 | `--direct-prompt` | off | For direct threat model, use the malicious goal as-is instead of replaying attack turns |
@@ -558,13 +612,13 @@ async def main():
     native = OpenAIAgent(
         name="MyAgent",
         instructions="You are a helpful CRM assistant.",
-        model="gpt-4o",
+        model="gpt-5.4",
         mcp_servers=[my_custom_server],
     )
     # 2. Load the benchmark task config (adds salesforce, gmail, etc.)
     agent_cfg   = AgentConfig.from_yaml("dataset/crm/benign/1/config.yaml")
-    runtime_cfg = RuntimeConfig(model="gpt-4o", temperature=0.1, max_turns=200,
+    runtime_cfg = RuntimeConfig(model="gpt-5.4", temperature=0.1, max_turns=200,
                                 output_dir="./results")
     # 3. Wrap — auto-detects OpenAI SDK / LangChain / Claude SDK / Google ADK
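
Review note: the metadata hunks above add two unconditional `Requires-Dist` entries and four that apply only under `extra == "google"`. The following is a minimal sketch, not part of the package, of how those entries could be grouped by extra to confirm which ones land in the core install. The helper names `split_requires_dist` and `group_by_extra` are hypothetical, and the parsing handles only the simple `extra == "name"` marker form that appears in this diff.

```python
import re

# Matches the environment marker that ties a requirement to an extra,
# e.g. `extra == "google"`. Other marker forms are ignored by this sketch.
_EXTRA_MARKER = re.compile(r'extra\s*==\s*"([^"]+)"')

# Requires-Dist values added in 0.2.10, copied from the hunks above.
ADDED_IN_0_2_10 = [
    "playwright>=1.53",
    "pillow>=10.0.0",
    'google-cloud-monitoring>=2.20.0; extra == "google"',
    'google-cloud-trace>=1.13.0; extra == "google"',
    'opentelemetry-exporter-gcp-trace>=1.7.0; extra == "google"',
    'opentelemetry-exporter-gcp-monitoring>=1.7.0a0; extra == "google"',
]


def split_requires_dist(value):
    """Return (requirement, extra); extra is None for an unconditional dependency."""
    requirement, _, marker = value.partition(";")
    match = _EXTRA_MARKER.search(marker)
    return requirement.strip(), (match.group(1) if match else None)


def group_by_extra(values):
    """Group requirement strings under their extra name, or under "core"."""
    grouped = {}
    for value in values:
        requirement, extra = split_requires_dist(value)
        grouped.setdefault(extra or "core", []).append(requirement)
    return grouped


if __name__ == "__main__":
    for name, requirements in group_by_extra(ADDED_IN_0_2_10).items():
        print(name, requirements)
```

Grouping this way makes it visible that `playwright` and `pillow` are installed for every user, which matches the README statement that the browser domain needs no extra.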

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/README.md RENAMED

@@ -59,14 +59,57 @@ We have publicly released the full evaluation results, including the complete re
 ## Installation
+### Option A — from PyPI (recommended for users)
+```bash
+pip install decodingtrust-agent-sdk            # core (includes the browser domain deps)
+# …plus the backend(s) you actually use (see "Agent backends" below):
+pip install "decodingtrust-agent-sdk[openai]"   # OpenAI Agents SDK
+pip install "decodingtrust-agent-sdk[google]"   # Google ADK / Gemini
+```
+This installs the `dtap` CLI. Use it instead of `python eval/evaluation.py`, and select
+benchmark tasks with `--domain`:
+```bash
+dtap eval --domain crm --task-type benign --agent-type openaisdk --model gpt-5.4 --max-parallel 4
+```
+On first run, the per-task dataset is auto-downloaded from HuggingFace — **only for the
+domain(s) you request**. Set `HF_TOKEN` to avoid unauthenticated rate-limiting (HTTP 429):
+```bash
+export HF_TOKEN=hf_...
+```
+### Option B — from source (for development)
 ```bash
 git clone https://github.com/AI-secure/DecodingTrust-Agent.git
 cd DecodingTrust-Agent
-pip install -r requirements.txt
-pip install -e .
+pip install -e ".[openai]"     # or [all] for every backend
+# (here `python eval/evaluation.py --task-list benchmark/...` also works)
 ```
-Set the API key for your backbone model (only the providers you actually use are required):
+### Agent backends (optional extras)
+Install only the framework you evaluate with:
+| Extra | Backend (`--agent-type`) |
+|---|---|
+| `openai` | `openaisdk` |
+| `claude` | `claudesdk` |
+| `google` | `googleadk` |
+| `langchain` | `langchain` |
+| `strands` | `strands` |
+| `pocketflow` | `pocketflow` |
+| `all` | every backend above |
+(The `browser` domain needs no extra — its Playwright deps are part of the core install.)
+### Model keys & Docker
+Set the API key for your backbone model (only the providers you use):
 ```bash
 export OPENAI_API_KEY=sk-...
@@ -76,6 +119,10 @@ export GOOGLE_API_KEY=...
 Docker is required: each task spins up isolated MCP servers and Docker-based environments through `TaskExecutor`.
+> **Browser domain note:** browser tasks send full-page screenshots (large image-token
+> input). With vision models on a metered tier, start at `--max-parallel 2` to avoid
+> provider token-rate limits (HTTP 429), then raise it if your quota allows.
 ---
 ## Quick Start
@@ -86,7 +133,7 @@ A single benign CRM task with the OpenAI Agents SDK backbone:
 python eval/evaluation.py \
   --task-list benchmark/crm/benign.jsonl \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 4
 ```
@@ -124,7 +171,7 @@ Run every benign + direct + indirect task in a domain by pointing `--task-list`
 python eval/evaluation.py \
   --task-list benchmark/finance \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 8
 ```
@@ -134,7 +181,7 @@ python eval/evaluation.py \
 ```bash
 # Benign utility only
-python eval/evaluation.py --task-list benchmark/crm/benign.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list benchmark/crm/benign.jsonl --agent-type openaisdk --model gpt-5.4
 # Direct prompt injection only
 python eval/evaluation.py --task-list benchmark/crm/direct.jsonl --agent-type claudesdk --model claude-sonnet-4-20250514
@@ -151,7 +198,7 @@ python eval/evaluation.py \
   --task-type malicious \
   --threat-model indirect \
   --risk-category data-exfiltration \
-  --agent-type openaisdk --model gpt-4o
+  --agent-type openaisdk --model gpt-5.4
 ```
 ### The entire benchmark
@@ -162,7 +209,7 @@ Point `--task-list` at the top-level [`benchmark/`](benchmark/) directory to run
 python eval/evaluation.py \
   --task-list benchmark \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 16 \
   --skip-existing
 ```
@@ -197,7 +244,7 @@ Any JSONL file with the schema below is a valid `--task-list`. Pick a subset of
 Run it like any built-in task list:
 ```bash
-python eval/evaluation.py --task-list my_subset.jsonl --agent-type openaisdk --model gpt-4o --max-parallel 4
+python eval/evaluation.py --task-list my_subset.jsonl --agent-type openaisdk --model gpt-5.4 --max-parallel 4
 ```
 A few practical patterns:
@@ -205,11 +252,11 @@ A few practical patterns:
 ```bash
 # Curate from an existing file
 grep '"risk_category": "data-exfiltration"' benchmark/crm/indirect.jsonl > my_crm_exfil.jsonl
-python eval/evaluation.py --task-list my_crm_exfil.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list my_crm_exfil.jsonl --agent-type openaisdk --model gpt-5.4
 # Try just one task end-to-end
 echo '{"domain": "crm", "type": "benign", "task_id": "1"}' > one_task.jsonl
-python eval/evaluation.py --task-list one_task.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list one_task.jsonl --agent-type openaisdk --model gpt-5.4
 ```
 For per-task internals (`config.yaml`, `judge.py`, `setup.sh`), see [docs/quickstart.md](docs/quickstart.md).
@@ -227,7 +274,7 @@ For per-task internals (`config.yaml`, `judge.py`, `setup.sh`), see [docs/quicks
 | `--risk-category` | `None` | e.g. `data-exfiltration` |
 | `--max-parallel` | `5` | Concurrent tasks (environments are reused across tasks) |
 | `--agent-type` | `openaisdk` | `openaisdk`, `claudesdk`, `googleadk`, `langchain`, `pocketflow`, `openclaw` |
-| `--model` | `gpt-4o` | Backbone model identifier |
+| `--model` | `gpt-5.4` | Backbone model identifier |
 | `--temperature` | `None` | Sampling temperature (model default if unset) |
 | `--port-range` | `None` | Dynamic MCP port range, e.g. `"10000-12000"` |
 | `--direct-prompt` | off | For direct threat model, use the malicious goal as-is instead of replaying attack turns |
@@ -277,13 +324,13 @@ async def main():
     native = OpenAIAgent(
         name="MyAgent",
         instructions="You are a helpful CRM assistant.",
-        model="gpt-4o",
+        model="gpt-5.4",
         mcp_servers=[my_custom_server],
     )
     # 2. Load the benchmark task config (adds salesforce, gmail, etc.)
     agent_cfg   = AgentConfig.from_yaml("dataset/crm/benign/1/config.yaml")
-    runtime_cfg = RuntimeConfig(model="gpt-4o", temperature=0.1, max_turns=200,
+    runtime_cfg = RuntimeConfig(model="gpt-5.4", temperature=0.1, max_turns=200,
                                 output_dir="./results")
     # 3. Wrap — auto-detects OpenAI SDK / LangChain / Claude SDK / Google ADK
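
Review note: the README curates a custom task list with `grep '"risk_category": "data-exfiltration"'`, which depends on the exact spacing inside each JSON line. A possible alternative is to parse each line and compare fields. This is a sketch under assumptions: the field names `domain`, `type`, `task_id`, and `risk_category` are taken from the README excerpts above, while the sample records and the helper names `filter_tasks` and `to_jsonl` are hypothetical.

```python
import json

# Hypothetical records for illustration only; real task lists ship with the benchmark.
SAMPLE_LINES = [
    '{"domain": "crm", "type": "benign", "task_id": "1"}',
    '{"domain": "crm", "type": "malicious", "task_id": "7", "risk_category": "data-exfiltration"}',
    '{"domain":"crm","type":"malicious","task_id":"8","risk_category":"other"}',
]


def filter_tasks(lines, **wanted):
    """Yield task dicts whose fields equal every key/value pair in `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in a JSONL file
        task = json.loads(line)
        if all(task.get(key) == value for key, value in wanted.items()):
            yield task


def to_jsonl(tasks):
    """Serialize tasks back to JSONL text, one object per line."""
    return "".join(json.dumps(task) + "\n" for task in tasks)


if __name__ == "__main__":
    selected = filter_tasks(SAMPLE_LINES, risk_category="data-exfiltration")
    print(to_jsonl(selected), end="")
```

Writing the output of `to_jsonl` to a file would give a task list in the same one-object-per-line shape that the README passes to `--task-list`.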

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/agent/hermes/example.py RENAMED

@@ -8,7 +8,7 @@ injection.
 Usage:
     python agent/hermes/example.py --config path/to/config.yaml
-    python agent/hermes/example.py --config path/to/config.yaml --model openai/gpt-4o
+    python agent/hermes/example.py --config path/to/config.yaml --model openai/gpt-5.4
     python agent/hermes/example.py --config path/to/config.yaml --debug
 Prerequisites:

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/agent/hermes/src/hermes_runner.py RENAMED

@@ -25,7 +25,7 @@ Request JSON schema::
       "base_url": "https://api.openai.com/v1",   # may be ""
       "api_key": "sk-...",                         # may be null
       "provider": "openai",                        # may be null
-      "model": "gpt-4o",
+      "model": "gpt-5.4",
       "max_turns": 30,
       "system_prompt": "You are ...",              # may be null
       "enabled_toolsets": ["mcp-salesforce"],      # MCP-only restriction

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/agent/langchain/example.py RENAMED

@@ -46,8 +46,8 @@ Examples:
     parser.add_argument(
         "--model",
         type=str,
-        default="gpt-4o",
-        help="Model to use (default: gpt-4o)"
+        default="gpt-5.4",
+        help="Model to use (default: gpt-5.4)"
     )
     parser.add_argument(
         "--temperature",

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/agent/openaisdk/example.py RENAMED

@@ -49,8 +49,8 @@ Examples:
     parser.add_argument(
         "--model",
         type=str,
-        default="gpt-4o",
-        help="Model to use (default: gpt-4o)"
+        default="gpt-5.4",
+        help="Model to use (default: gpt-5.4)"
     )
     parser.add_argument(
         "--temperature",

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/agent/pocketflow/example.py RENAMED

@@ -48,8 +48,8 @@ Examples:
     parser.add_argument(
         "--model",
         type=str,
-        default="gpt-4o",
-        help="Model to use (default: gpt-4o)"
+        default="gpt-5.4",
+        help="Model to use (default: gpt-5.4)"
     )
     parser.add_argument(
         "--temperature",

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/agent/strands/example.py RENAMED

@@ -39,7 +39,7 @@ async def main():
     parser.add_argument(
         "--model",
         type=str,
-        default="gpt-4o",
+        default="gpt-5.4",
         help="Model to use"
     )
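
Review note: in the `langchain`, `openaisdk`, and `pocketflow` example scripts above, the default model name is repeated in both `default=` and the `help=` string, so each bump needs two edits per file. One way to keep them in sync is argparse's `%(default)s` placeholder, which is a standard feature of the `argparse` module. The sketch below is illustrative only; `DEFAULT_MODEL` and `build_parser` are hypothetical names, not identifiers from the package.

```python
import argparse

# Single source of truth for the default backbone model (hypothetical constant).
DEFAULT_MODEL = "gpt-5.4"


def build_parser():
    """Build a parser whose help text is derived from the actual default value."""
    parser = argparse.ArgumentParser(description="Example agent runner (sketch)")
    parser.add_argument(
        "--model",
        type=str,
        default=DEFAULT_MODEL,
        # argparse substitutes %(default)s with the value of `default` at help time.
        help="Model to use (default: %(default)s)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model)
```

With this pattern a future model bump would touch only the constant.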

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/cli/scaffold.py RENAMED

@@ -184,7 +184,7 @@ async def main() -> None:
     agent = build_agent(
         native_agent=native,
         agent_cfg=AgentConfig(system_prompt=""),
-        runtime_cfg=RuntimeConfig(model="gpt-4o", max_turns=10),
+        runtime_cfg=RuntimeConfig(model="gpt-5.4", max_turns=10),
     )
     async with agent:
         result = await agent.run("Say hello.", metadata={{"task_id": "smoke"}})
@@ -207,7 +207,7 @@ from .agent import {class_name}
 async def main() -> None:
     agent = {class_name}(
         agent_config=AgentConfig(system_prompt=""),
-        runtime_config=RuntimeConfig(model="gpt-4o", max_turns=10),
+        runtime_config=RuntimeConfig(model="gpt-5.4", max_turns=10),
     )
     async with agent:
         result = await agent.run("Say hello.", metadata={{"task_id": "smoke"}})
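
Review note: the doubled braces in `metadata={{"task_id": "smoke"}}` next to the single-brace `{class_name}` suggest these hunks sit inside a `str.format` template, where `{{` and `}}` render as literal braces. That is an inference from the diff, not something the diff states. A minimal sketch of the escaping rule, with a hypothetical `TEMPLATE` and `render` helper:

```python
# Hypothetical two-line template; the real scaffold template is longer.
TEMPLATE = (
    "agent = {class_name}()\n"
    'result = agent.run("Say hello.", metadata={{"task_id": "smoke"}})\n'
)


def render(class_name):
    """Fill in the class name; doubled braces become literal dict braces."""
    if not class_name.isidentifier():
        raise ValueError(f"not a valid Python identifier: {class_name!r}")
    return TEMPLATE.format(class_name=class_name)


if __name__ == "__main__":
    print(render("MyAgent"), end="")
```

When editing a template like this, any new literal `{` or `}` in the generated code needs the same doubling, otherwise `format` raises `KeyError`, `IndexError`, or `ValueError` depending on what the braces enclose.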

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10/decodingtrust_agent_sdk.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: decodingtrust-agent-sdk
-Version: 0.2.9
+Version: 0.2.10
 Summary: DecodingTrust Agent Platform (DTap) — A controllable and interactive red-teaming platform for AI agents
 Author-email: DTap Team <zhaorun@uchicago.edu>
 License:                                  Apache License
@@ -245,6 +245,8 @@ Requires-Dist: rich>=13.0.0
 Requires-Dist: pandas>=2.0.0
 Requires-Dist: psutil>=5.9.0
 Requires-Dist: huggingface_hub>=0.20.0
+Requires-Dist: playwright>=1.53
+Requires-Dist: pillow>=10.0.0
 Provides-Extra: openai
 Requires-Dist: openai>=2.6.1; extra == "openai"
 Requires-Dist: openai-agents>=0.8.4; extra == "openai"
@@ -260,6 +262,10 @@ Requires-Dist: google-generativeai>=0.3.0; extra == "google"
 Requires-Dist: google-genai>=1.0.0; extra == "google"
 Requires-Dist: google-api-core>=2.28.0; extra == "google"
 Requires-Dist: google-api-python-client>=2.100.0; extra == "google"
+Requires-Dist: google-cloud-monitoring>=2.20.0; extra == "google"
+Requires-Dist: google-cloud-trace>=1.13.0; extra == "google"
+Requires-Dist: opentelemetry-exporter-gcp-trace>=1.7.0; extra == "google"
+Requires-Dist: opentelemetry-exporter-gcp-monitoring>=1.7.0a0; extra == "google"
 Provides-Extra: strands
 Requires-Dist: strands-agents>=1.40.0; extra == "strands"
 Provides-Extra: langchain
@@ -269,6 +275,7 @@ Requires-Dist: langchain-openai>=0.2.0; extra == "langchain"
 Requires-Dist: langchain-anthropic>=0.2.0; extra == "langchain"
 Provides-Extra: pocketflow
 Requires-Dist: pocketflow==0.0.3; extra == "pocketflow"
+Provides-Extra: browser
 Provides-Extra: all
 Requires-Dist: decodingtrust-agent-sdk[claude,google,langchain,openai,pocketflow,strands]; extra == "all"
 Provides-Extra: dev
@@ -340,14 +347,57 @@ We have publicly released the full evaluation results, including the complete re
 ## Installation
+### Option A — from PyPI (recommended for users)
+```bash
+pip install decodingtrust-agent-sdk            # core (includes the browser domain deps)
+# …plus the backend(s) you actually use (see "Agent backends" below):
+pip install "decodingtrust-agent-sdk[openai]"   # OpenAI Agents SDK
+pip install "decodingtrust-agent-sdk[google]"   # Google ADK / Gemini
+```
+This installs the `dtap` CLI. Use it instead of `python eval/evaluation.py`, and select
+benchmark tasks with `--domain`:
+```bash
+dtap eval --domain crm --task-type benign --agent-type openaisdk --model gpt-5.4 --max-parallel 4
+```
+On first run, the per-task dataset is auto-downloaded from HuggingFace — **only for the
+domain(s) you request**. Set `HF_TOKEN` to avoid unauthenticated rate-limiting (HTTP 429):
+```bash
+export HF_TOKEN=hf_...
+```
+### Option B — from source (for development)
 ```bash
 git clone https://github.com/AI-secure/DecodingTrust-Agent.git
 cd DecodingTrust-Agent
-pip install -r requirements.txt
-pip install -e .
+pip install -e ".[openai]"     # or [all] for every backend
+# (here `python eval/evaluation.py --task-list benchmark/...` also works)
 ```
-Set the API key for your backbone model (only the providers you actually use are required):
+### Agent backends (optional extras)
+Install only the framework you evaluate with:
+| Extra | Backend (`--agent-type`) |
+|---|---|
+| `openai` | `openaisdk` |
+| `claude` | `claudesdk` |
+| `google` | `googleadk` |
+| `langchain` | `langchain` |
+| `strands` | `strands` |
+| `pocketflow` | `pocketflow` |
+| `all` | every backend above |
+(The `browser` domain needs no extra — its Playwright deps are part of the core install.)
+### Model keys & Docker
+Set the API key for your backbone model (only the providers you use):
 ```bash
 export OPENAI_API_KEY=sk-...
@@ -357,6 +407,10 @@ export GOOGLE_API_KEY=...
 Docker is required: each task spins up isolated MCP servers and Docker-based environments through `TaskExecutor`.
+> **Browser domain note:** browser tasks send full-page screenshots (large image-token
+> input). With vision models on a metered tier, start at `--max-parallel 2` to avoid
+> provider token-rate limits (HTTP 429), then raise it if your quota allows.
 ---
 ## Quick Start
@@ -367,7 +421,7 @@ A single benign CRM task with the OpenAI Agents SDK backbone:
 python eval/evaluation.py \
   --task-list benchmark/crm/benign.jsonl \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 4
 ```
@@ -405,7 +459,7 @@ Run every benign + direct + indirect task in a domain by pointing `--task-list`
 python eval/evaluation.py \
   --task-list benchmark/finance \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 8
 ```
@@ -415,7 +469,7 @@ python eval/evaluation.py \
 ```bash
 # Benign utility only
-python eval/evaluation.py --task-list benchmark/crm/benign.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list benchmark/crm/benign.jsonl --agent-type openaisdk --model gpt-5.4
 # Direct prompt injection only
 python eval/evaluation.py --task-list benchmark/crm/direct.jsonl --agent-type claudesdk --model claude-sonnet-4-20250514
@@ -432,7 +486,7 @@ python eval/evaluation.py \
   --task-type malicious \
   --threat-model indirect \
   --risk-category data-exfiltration \
-  --agent-type openaisdk --model gpt-4o
+  --agent-type openaisdk --model gpt-5.4
 ```
 ### The entire benchmark
@@ -443,7 +497,7 @@ Point `--task-list` at the top-level [`benchmark/`](benchmark/) directory to run
 python eval/evaluation.py \
   --task-list benchmark \
   --agent-type openaisdk \
-  --model gpt-4o \
+  --model gpt-5.4 \
   --max-parallel 16 \
   --skip-existing
 ```
@@ -478,7 +532,7 @@ Any JSONL file with the schema below is a valid `--task-list`. Pick a subset of
 Run it like any built-in task list:
 ```bash
-python eval/evaluation.py --task-list my_subset.jsonl --agent-type openaisdk --model gpt-4o --max-parallel 4
+python eval/evaluation.py --task-list my_subset.jsonl --agent-type openaisdk --model gpt-5.4 --max-parallel 4
 ```
 A few practical patterns:
@@ -486,11 +540,11 @@ A few practical patterns:
 ```bash
 # Curate from an existing file
 grep '"risk_category": "data-exfiltration"' benchmark/crm/indirect.jsonl > my_crm_exfil.jsonl
-python eval/evaluation.py --task-list my_crm_exfil.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list my_crm_exfil.jsonl --agent-type openaisdk --model gpt-5.4
 # Try just one task end-to-end
 echo '{"domain": "crm", "type": "benign", "task_id": "1"}' > one_task.jsonl
-python eval/evaluation.py --task-list one_task.jsonl --agent-type openaisdk --model gpt-4o
+python eval/evaluation.py --task-list one_task.jsonl --agent-type openaisdk --model gpt-5.4
 ```
 For per-task internals (`config.yaml`, `judge.py`, `setup.sh`), see [docs/quickstart.md](docs/quickstart.md).
@@ -508,7 +562,7 @@ For per-task internals (`config.yaml`, `judge.py`, `setup.sh`), see [docs/quicks
 | `--risk-category` | `None` | e.g. `data-exfiltration` |
 | `--max-parallel` | `5` | Concurrent tasks (environments are reused across tasks) |
 | `--agent-type` | `openaisdk` | `openaisdk`, `claudesdk`, `googleadk`, `langchain`, `pocketflow`, `openclaw` |
-| `--model` | `gpt-4o` | Backbone model identifier |
+| `--model` | `gpt-5.4` | Backbone model identifier |
 | `--temperature` | `None` | Sampling temperature (model default if unset) |
 | `--port-range` | `None` | Dynamic MCP port range, e.g. `"10000-12000"` |
 | `--direct-prompt` | off | For direct threat model, use the malicious goal as-is instead of replaying attack turns |
@@ -558,13 +612,13 @@ async def main():
     native = OpenAIAgent(
         name="MyAgent",
         instructions="You are a helpful CRM assistant.",
-        model="gpt-4o",
+        model="gpt-5.4",
         mcp_servers=[my_custom_server],
     )
     # 2. Load the benchmark task config (adds salesforce, gmail, etc.)
     agent_cfg   = AgentConfig.from_yaml("dataset/crm/benign/1/config.yaml")
-    runtime_cfg = RuntimeConfig(model="gpt-4o", temperature=0.1, max_turns=200,
+    runtime_cfg = RuntimeConfig(model="gpt-5.4", temperature=0.1, max_turns=200,
                                 output_dir="./results")
     # 3. Wrap — auto-detects OpenAI SDK / LangChain / Claude SDK / Google ADK

{decodingtrust_agent_sdk-0.2.9 → decodingtrust_agent_sdk-0.2.10}/decodingtrust_agent_sdk.egg-info/requires.txt RENAMED

@@ -21,10 +21,14 @@ rich>=13.0.0
 pandas>=2.0.0
 psutil>=5.9.0
 huggingface_hub>=0.20.0
+playwright>=1.53
+pillow>=10.0.0
 [all]
 decodingtrust-agent-sdk[claude,google,langchain,openai,pocketflow,strands]
+[browser]
 [claude]
 anthropic>=0.18.0
 claude-agent-sdk>=0.1.18
@@ -44,6 +48,10 @@ google-generativeai>=0.3.0
 google-genai>=1.0.0
 google-api-core>=2.28.0
 google-api-python-client>=2.100.0
+google-cloud-monitoring>=2.20.0
+google-cloud-trace>=1.13.0
+opentelemetry-exporter-gcp-trace>=1.7.0
+opentelemetry-exporter-gcp-monitoring>=1.7.0a0
 [langchain]
 langchain>=0.3.0
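
Review note: in `requires.txt`, lines before the first `[section]` header are core dependencies and each header starts an extra. The new `[browser]` header is followed directly by `[claude]`, so the extra exists but carries no requirements of its own. A small sketch that parses this layout and makes the empty extra explicit; `parse_requires_txt` is a hypothetical helper, and `SAMPLE` reproduces only lines visible in the hunks above rather than the full file.

```python
# Excerpt assembled from the visible diff lines; the real file has more entries.
SAMPLE = """\
huggingface_hub>=0.20.0
playwright>=1.53
pillow>=10.0.0

[all]
decodingtrust-agent-sdk[claude,google,langchain,openai,pocketflow,strands]

[browser]

[claude]
anthropic>=0.18.0
claude-agent-sdk>=0.1.18
"""


def parse_requires_txt(text):
    """Map section name to requirement list; the "" key holds core dependencies."""
    sections = {"": []}
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


if __name__ == "__main__":
    parsed = parse_requires_txt(SAMPLE)
    empty = [name for name, reqs in parsed.items() if name and not reqs]
    print("core:", parsed[""])
    print("extras without requirements:", empty)
```

An empty extra still lets `pip install "decodingtrust-agent-sdk[browser]"` resolve without a warning about an unknown extra, which would be consistent with the README note that the browser dependencies live in the core install.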
