@intentsolutionsio/jeremy-vertex-engine 2.1.0

@@ -0,0 +1,69 @@
1
+ # PRD: Vertex Engine Inspector
2
+
3
+ **Version:** 2.1.0
4
+ **Author:** Jeremy Longshore <jeremy@intentsolutions.io>
5
+ **Status:** Active
6
+ **Marketplace:** [tonsofskills.com](https://tonsofskills.com) by [Intent Solutions](https://intentsolutions.io)
7
+ **Portfolio:** [jeremylongshore.com](https://jeremylongshore.com)
8
+
9
+ ---
10
+
11
+ ## Problem Statement
12
+
13
+ Vertex AI Agent Engine deployments involve seven interconnected configuration surfaces (runtime, Code Execution Sandbox, Memory Bank, A2A protocol, security, performance, monitoring) that are validated manually and inconsistently. Teams deploy agents without knowing their production readiness score, leading to security gaps (unhardened IAM, missing VPC-SC), reliability issues (no alerting, stale memory), and protocol non-compliance (broken A2A endpoints). Without a systematic inspection, problems surface only after production incidents.
14
+
15
+ ## Target Users
16
+
17
+ | User | Context | Primary Need |
18
+ |------|---------|-------------|
19
+ | Platform Engineer | Preparing a new Agent Engine deployment for production launch | Comprehensive readiness score with prioritized fix list before go-live |
20
+ | Security Auditor | Reviewing IAM, VPC-SC, and encryption posture after configuration changes | Targeted security inspection confirming least-privilege and perimeter integrity |
21
+ | SRE / On-Call Engineer | Investigating elevated error rates or latency on a deployed agent | Performance metrics retrieval with root-cause correlation (scaling, tokens, errors) |
22
+ | DevOps Lead | Establishing baseline quality gates for agent deployments | Repeatable inspection producing consistent scores across all team deployments |
23
+
24
+ ## Success Criteria
25
+
26
+ 1. Inspect all seven categories (runtime, sandbox, memory, A2A, security, performance, monitoring) in a single invocation
27
+ 2. Generate a weighted production-readiness score (0-100%) with per-category breakdowns
28
+ 3. Produce actionable recommendations with estimated score improvement per remediation item
29
+ 4. Complete a full inspection within 5 minutes for a standard deployment
30
+ 5. Security category catches 100% of missing VPC-SC, overprivileged IAM, and unencrypted configurations
31
+ 6. A2A compliance matrix clearly shows pass/fail for each protocol endpoint
32
+
33
+ ## Functional Requirements
34
+
35
+ 1. Retrieve agent metadata via the Python SDK (`vertexai.Client().agent_engines.get()`) and parse runtime configuration
36
+ 2. Validate Code Execution Sandbox settings: TTL within the 7-14 day range (Agent Engine rejects values above 14 days), sandbox type (`SECURE_ISOLATED`), scoped IAM
37
+ 3. Check Memory Bank: enabled status, retention policy (min 100 memories), Firestore encryption, indexing, auto-cleanup
38
+ 4. Test A2A protocol endpoints: `/.well-known/agent-card`, `POST /v1/tasks:send`, `GET /v1/tasks/<id>`
39
+ 5. Audit security posture: IAM least-privilege, VPC-SC perimeter, Model Armor, encryption, no hardcoded credentials
40
+ 6. Query Cloud Monitoring for 24-hour metrics: error rate, latency percentiles (p50/p95/p99), token usage, cost
41
+ 7. Assess observability: dashboards, alerting policies, structured logging, OpenTelemetry, Cloud Error Reporting
42
+ 8. Calculate weighted scores and generate a prioritized recommendation list
43
+
44
+ ## Non-Functional Requirements
45
+
46
+ - Read-only inspection: skill must not modify the inspected deployment or its configuration
47
+ - All Agent Engine operations use the Python SDK, never gcloud CLI (no gcloud surface exists for Agent Engine)
48
+ - Inspection must work from outside VPC-SC perimeters when access levels are properly configured
49
+ - Output format is YAML for machine-parseability and human readability
50
+ - Scoring weights must reflect production impact: security and reliability weighted higher than monitoring
51
+ - Inspection must handle partial failures gracefully (skip unavailable categories, still produce report)
52
+ - Results must be deterministic: same deployment state always produces the same score
53
+
54
+ ## Dependencies
55
+
56
+ - `google-cloud-aiplatform[agent_engines]>=1.120.0` Python SDK
57
+ - `gcloud` CLI authenticated for IAM and monitoring queries
58
+ - IAM roles: `roles/aiplatform.user` and `roles/monitoring.viewer` on target project
59
+ - Cloud Monitoring API enabled
60
+ - `curl` for A2A endpoint testing
61
+
62
+ ## Out of Scope
63
+
64
+ - Modifying or remediating Agent Engine configurations (inspection only, never writes)
65
+ - Deploying new agents or updating existing deployments (handled by adk-deployment-specialist)
66
+ - Infrastructure provisioning with Terraform (handled by adk-infra-expert)
67
+ - Cost optimization recommendations beyond basic model selection guidance
68
+ - Load testing or performance benchmarking (inspection uses existing metrics only)
69
+ - Cross-project inspection (each invocation targets a single project/agent pair)
@@ -0,0 +1,96 @@
1
+ # Error Handling Reference
2
+
3
+ ## SDK / Authentication Errors
4
+
5
+ | Error | Cause | Solution |
6
+ |-------|-------|----------|
7
+ | `ModuleNotFoundError: No module named 'vertexai'` | Vertex AI SDK not installed | `pip install "google-cloud-aiplatform[agent_engines]>=1.120.0"` (quoted so the shell does not treat `>=` as a redirect) |
8
+ | `google.auth.exceptions.DefaultCredentialsError` | No application default credentials configured | Run `gcloud auth application-default login` or set `GOOGLE_APPLICATION_CREDENTIALS` |
9
+ | `google.api_core.exceptions.PermissionDenied (403)` | Service account or user lacks required IAM roles | Grant `roles/aiplatform.user` and `roles/monitoring.viewer` on the project |
10
+ | `google.api_core.exceptions.NotFound (404)` | Agent engine ID or resource name is incorrect | Verify the full resource name: `projects/PROJECT_ID/locations/LOCATION/reasoningEngines/ENGINE_ID`. List engines with `client.agent_engines.list()` |
11
+ | `google.api_core.exceptions.InvalidArgument (400)` | Malformed request — wrong location, invalid config | Confirm the location matches where the engine was deployed (e.g., `us-central1`) |
12
+
13
+ ## Agent Engine Retrieval Errors
14
+
15
+ | Error | Cause | Solution |
16
+ |-------|-------|----------|
17
+ | `client.agent_engines.get()` returns `None` | Engine was deleted or ID is stale | Re-list engines with `client.agent_engines.list()` to find the current resource name |
18
+ | `UNAVAILABLE` / `DEADLINE_EXCEEDED` | Transient network or service issue | Retry with exponential backoff; check [Vertex AI status](https://status.cloud.google.com/) |
19
+ | `RESOURCE_EXHAUSTED` | Quota limit hit on Agent Engine API | Check quotas in Cloud Console under IAM & Admin > Quotas; request an increase if needed |
20
+
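The "retry with exponential backoff" advice for transient errors could look like the sketch below. `call_with_backoff` and its parameters are hypothetical names, and the defaults are arbitrary. It assumes the caller passes in the exception types to retry; if the SDK surfaces the `google.api_core.exceptions` classes listed in the first table, those would likely be `ServiceUnavailable` and `DeadlineExceeded`.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exception types with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads out retries from concurrent callers.
            sleep(random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")
```

A call such as `call_with_backoff(lambda: client.agent_engines.get(name=resource_name), retry_on=(ServiceUnavailable, DeadlineExceeded))` would wrap the retrieval, where `resource_name` is a placeholder for the full resource name. Non-transient errors such as `NotFound` are deliberately not retried.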
21
+ ## Common gcloud CLI Misconceptions
22
+
23
+ **There is no `gcloud` CLI for Agent Engine.** The following commands do NOT exist and will fail:
24
+ - `gcloud ai agents describe` / `gcloud ai agents list`
25
+ - `gcloud ai reasoning-engines list`
26
+ - `gcloud alpha ai agent-engines list`
27
+ - `gcloud ai agents update`
28
+
29
+ All Agent Engine operations must use the Python SDK:
30
+
31
+ ```python
32
+ import vertexai
33
+
34
+ client = vertexai.Client(project="PROJECT_ID", location="LOCATION")
35
+
36
+ # List all agent engines
37
+ for engine in client.agent_engines.list():
38
+     print(engine.name, engine.display_name)
39
+
40
+ # Get a specific agent engine
41
+ engine = client.agent_engines.get(
42
+     name="projects/PROJECT_ID/locations/LOCATION/reasoningEngines/ENGINE_ID"
43
+ )
44
+
45
+ # Create / deploy a new agent engine
46
+ from google.adk.agents import Agent
47
+ agent = Agent(name="my-agent", model="gemini-2.5-flash")
48
+ engine = client.agent_engines.create(agent=agent, config={"display_name": "my-agent"})
49
+ ```
50
+
51
+ ## A2A Protocol Errors
52
+
53
+ | Error | Cause | Solution |
54
+ |-------|-------|----------|
55
+ | AgentCard endpoint returns 404 | Agent not configured for A2A protocol or wrong endpoint URL | Verify A2A is enabled in the agent config; check that `/.well-known/agent-card` path is correct |
56
+ | Task API returns 401/403 | Missing or invalid auth token in request header | Include `Authorization: Bearer $(gcloud auth print-access-token)` header |
57
+ | Task API returns 500 | Agent crashed while processing the task | Check Cloud Logging for agent error logs; inspect the agent's error handler |
58
+ | Status API timeout | Long-running task with no status update mechanism | Implement streaming, or poll with a reasonable timeout (30-60s) |
59
+
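For the status API timeout row, a polling loop with a bounded deadline might be sketched as follows. `poll_task` and `fetch_state` are hypothetical: `fetch_state` stands in for whatever call reads the task's state from `GET /v1/tasks/<id>`, and the terminal state names are an assumption that should be checked against the A2A version the agent implements.

```python
import time
from typing import Callable

# Assumed terminal state names; verify against the agent's A2A version.
TERMINAL_STATES = {"completed", "failed", "canceled"}


def poll_task(
    fetch_state: Callable[[], str],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the task reaches a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Stop if the next poll would land after the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"task still '{state}' after {timeout_s}s")
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline unaffected by wall-clock adjustments, and injecting `clock` and `sleep` makes the loop testable without real waiting.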
60
+ ## Monitoring and Observability Errors
61
+
62
+ | Error | Cause | Solution |
63
+ |-------|-------|----------|
64
+ | Cloud Monitoring returns no data | Monitoring API not enabled or no recent agent traffic | Run `gcloud services enable monitoring.googleapis.com`; generate test traffic first |
65
+ | Metrics query returns `INVALID_ARGUMENT` | Incorrect metric type string or filter syntax | Verify metric type with [Metrics Explorer](https://console.cloud.google.com/monitoring/metrics-explorer) |
66
+ | Log queries return empty results | Wrong resource type filter or time range | Use `resource.type="aiplatform.googleapis.com/ReasoningEngine"` and verify timestamps are UTC |
67
+ | Cloud Trace shows no spans | OpenTelemetry not configured in the agent | Add Cloud Trace exporter to the agent's OpenTelemetry setup |
68
+
69
+ ## Security Posture Check Errors
70
+
71
+ | Error | Cause | Solution |
72
+ |-------|-------|----------|
73
+ | IAM policy query fails | User lacks `resourcemanager.projects.getIamPolicy` permission | Grant `roles/iam.securityReviewer` for read-only IAM inspection |
74
+ | VPC-SC perimeter query fails | No organization-level access | VPC-SC queries require org-level permissions; ask an org admin or skip this check |
75
+ | Model Armor status unknown | Feature not available in the agent's region | Model Armor availability varies by region; check [region support](https://cloud.google.com/vertex-ai/docs/general/locations) |
76
+ | Secret scan false positive | Environment variable names match secret patterns | Add false positive patterns to the exclusion list in the security checker config |
77
+
78
+ ## Code Execution Sandbox Errors
79
+
80
+ | Error | Cause | Solution |
81
+ |-------|-------|----------|
82
+ | State TTL rejected (> 14 days) | Agent Engine enforces max 14-day TTL | Set `state_ttl_days` between 1 and 14 |
83
+ | Sandbox timeout | Code execution exceeded the configured timeout | Increase timeout or optimize the executed code; check for infinite loops |
84
+ | Sandbox OOM (out of memory) | Code execution consumed too much memory | Reduce data size processed in sandbox; increase memory limits if available |
85
+ | Sandbox network error | Code tried to make external network calls from isolated sandbox | Sandbox is network-isolated by design; move network calls outside the sandbox |
86
+
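The TTL rule in the first row can be checked before anything is sent to Agent Engine. The sketch below is an interpretation, not the inspector's actual logic: it assumes the 1-14 day bound from the table is the hard limit and treats the 7-14 day range validated in the PRD as a softer target. `check_state_ttl` and the `ok`/`warn`/`fail` labels are hypothetical.

```python
MAX_TTL_DAYS = 14             # hard upper bound from the table above
RECOMMENDED_MIN_TTL_DAYS = 7  # lower end of the PRD's 7-14 day range (assumed soft)


def check_state_ttl(ttl_days: int) -> str:
    """Classify a sandbox state TTL as 'ok', 'warn', or 'fail'."""
    if not 1 <= ttl_days <= MAX_TTL_DAYS:
        return "fail"  # outside what Agent Engine accepts
    if ttl_days < RECOMMENDED_MIN_TTL_DAYS:
        return "warn"  # accepted, but below the inspected target range
    return "ok"
```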
87
+ ## Memory Bank Errors
88
+
89
+ | Error | Cause | Solution |
90
+ |-------|-------|----------|
91
+ | Memory Bank query slow (> 500ms) | Indexing not enabled or index stale | Enable indexing in Memory Bank config; rebuild index if needed |
92
+ | Memory quota exceeded | Too many memories stored without cleanup | Enable auto-cleanup or increase the `max_memories` limit |
93
+ | Firestore permission denied | Agent service account lacks Firestore access | Grant `roles/datastore.user` to the agent's service account |
94
+
95
+ ---
96
+ *[Tons of Skills](https://tonsofskills.com) by [Intent Solutions](https://intentsolutions.io) | [jeremylongshore.com](https://jeremylongshore.com)*
@@ -0,0 +1,50 @@
1
+ # Example Inspection Report
2
+
3
+
4
+
5
+ ```yaml
6
+ Agent ID: gcp-deployer-agent
7
+ Deployment Status: RUNNING
8
+ Inspection Date: 2025-12-09
9
+
10
+ Runtime Configuration:
11
+   Model: gemini-2.5-flash
12
+   Code Execution: ✅ Enabled (TTL: 14 days)
13
+   Memory Bank: ✅ Enabled (retention: 90 days)
14
+   VPC: ✅ Configured (private-vpc-prod)
15
+
16
+ A2A Protocol Compliance:
17
+   AgentCard: ✅ Valid
17
+   Task API: ✅ Functional
18
+   Status API: ✅ Functional
19
+   Protocol Version: 1.0
21
+
22
+ Security Posture:
23
+   IAM: ✅ Least privilege (score: 95%)
23
+   VPC-SC: ✅ Enabled
24
+   Model Armor: ✅ Enabled
25
+   Encryption: ✅ At-rest & in-transit
26
+   Overall: 🟢 SECURE (92%)
28
+
29
+ Performance Metrics (24h):
30
+   Request Count: 12,450
30
+   Error Rate: 2.3% 🟢
31
+   Latency (p95): 1,850ms 🟢
32
+   Token Usage: 450K tokens
33
+   Cost Estimate: $12.50/day
35
+
36
+ Production Readiness:
37
+   Security: 92% (28/30 points)
37
+   Performance: 88% (22/25 points)
38
+   Monitoring: 95% (19/20 points)
39
+   Compliance: 80% (12/15 points)
40
+   Reliability: 70% (7/10 points)
42
+
43
+ Overall Score: 87% 🟢 PRODUCTION READY
44
+
45
+ Recommendations:
46
+   1. Enable multi-region deployment (reliability +10%)
46
+   2. Configure automated backups (compliance +5%)
47
+   3. Add circuit breaker pattern (reliability +5%)
48
+   4. Optimize memory bank indexing (performance +3%)
50
+ ```
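The "Production Readiness" block can be read as a points-based weighted score. The sketch below recomputes it under the assumption that each category's maximum points act as its weight and that percentages are rounded to the nearest whole number. The `readiness` function is hypothetical, and the point values are copied from the sample above.

```python
# (earned, maximum) points per category, copied from the sample report.
POINTS = {
    "Security": (28, 30),
    "Performance": (22, 25),
    "Monitoring": (19, 20),
    "Compliance": (12, 15),
    "Reliability": (7, 10),
}


def readiness(points):
    """Return (per-category percentages, overall percentage)."""
    per_category = {
        name: round(100 * earned / maximum)
        for name, (earned, maximum) in points.items()
    }
    total_earned = sum(earned for earned, _ in points.values())
    total_max = sum(maximum for _, maximum in points.values())
    return per_category, round(100 * total_earned / total_max)


if __name__ == "__main__":
    per_category, overall = readiness(POINTS)
    print(per_category, overall)
```

With the whole-number points shown, this gives 93% for Security and 88% overall, one point above the 92% and 87% in the sample. The sample presumably derives its percentages from unrounded category scores, which this sketch does not have access to.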