@intentsolutionsio/jeremy-vertex-engine 2.1.0
- package/.claude-plugin/plugin.json +20 -0
- package/LICENSE +21 -0
- package/README.md +782 -0
- package/agents/vertex-engine-inspector.md +446 -0
- package/package.json +41 -0
- package/skills/vertex-engine-inspector/SKILL.md +84 -0
- package/skills/vertex-engine-inspector/references/ARD.md +74 -0
- package/skills/vertex-engine-inspector/references/PRD.md +69 -0
- package/skills/vertex-engine-inspector/references/errors.md +96 -0
- package/skills/vertex-engine-inspector/references/example-inspection-report.md +50 -0
- package/skills/vertex-engine-inspector/references/examples.md +591 -0
- package/skills/vertex-engine-inspector/references/inspection-categories.md +104 -0
- package/skills/vertex-engine-inspector/references/inspection-workflow.md +52 -0
- package/skills/vertex-engine-inspector/scripts/check-security.py +254 -0
- package/skills/vertex-engine-inspector/scripts/inspect-agent.sh +194 -0

@@ -0,0 +1,69 @@

# PRD: Vertex Engine Inspector

**Version:** 2.1.0
**Author:** Jeremy Longshore <jeremy@intentsolutions.io>
**Status:** Active
**Marketplace:** [tonsofskills.com](https://tonsofskills.com) by [Intent Solutions](https://intentsolutions.io)
**Portfolio:** [jeremylongshore.com](https://jeremylongshore.com)

---

## Problem Statement

Vertex AI Agent Engine deployments involve seven interconnected configuration surfaces (runtime, Code Execution Sandbox, Memory Bank, A2A protocol, security, performance, monitoring) that are validated manually and inconsistently. Teams deploy agents without knowing their production readiness score, leading to security gaps (unhardened IAM, missing VPC-SC), reliability issues (no alerting, stale memory), and protocol non-compliance (broken A2A endpoints). Without a systematic inspection, problems surface only after production incidents.

## Target Users

| User | Context | Primary Need |
|------|---------|-------------|
| Platform Engineer | Preparing a new Agent Engine deployment for production launch | Comprehensive readiness score with prioritized fix list before go-live |
| Security Auditor | Reviewing IAM, VPC-SC, and encryption posture after configuration changes | Targeted security inspection confirming least-privilege and perimeter integrity |
| SRE / On-Call Engineer | Investigating elevated error rates or latency on a deployed agent | Performance metrics retrieval with root-cause correlation (scaling, tokens, errors) |
| DevOps Lead | Establishing baseline quality gates for agent deployments | Repeatable inspection producing consistent scores across all team deployments |

## Success Criteria

1. Inspect all seven categories (runtime, sandbox, memory, A2A, security, performance, monitoring) in a single invocation
2. Generate a weighted production-readiness score (0-100%) with per-category breakdowns
3. Produce actionable recommendations with estimated score improvement per remediation item
4. Complete a full inspection within 5 minutes for a standard deployment
5. Security category catches 100% of missing VPC-SC, overprivileged IAM, and unencrypted configurations
6. A2A compliance matrix clearly shows pass/fail for each protocol endpoint

## Functional Requirements

1. Retrieve agent metadata via the Python SDK (`vertexai.Client().agent_engines.get()`) and parse runtime configuration
2. Validate Code Execution Sandbox settings: TTL range (7-14 days), sandbox type (`SECURE_ISOLATED`), scoped IAM
3. Check Memory Bank: enabled status, retention policy (min 100 memories), Firestore encryption, indexing, auto-cleanup
4. Test A2A protocol endpoints: `/.well-known/agent-card`, `POST /v1/tasks:send`, `GET /v1/tasks/<id>`
5. Audit security posture: IAM least-privilege, VPC-SC perimeter, Model Armor, encryption, no hardcoded credentials
6. Query Cloud Monitoring for 24-hour metrics: error rate, latency percentiles (p50/p95/p99), token usage, cost
7. Assess observability: dashboards, alerting policies, structured logging, OpenTelemetry, Cloud Error Reporting
8. Calculate weighted scores and generate a prioritized recommendation list
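
Requirement 2 can be pictured as a pure function over an already-retrieved configuration. The following is a minimal sketch rather than the skill's actual implementation: the dictionary keys `state_ttl_days` and `sandbox_type` and the function name `check_sandbox` are assumptions chosen for illustration. Only the 7-14 day range and the `SECURE_ISOLATED` value come from the requirement itself.

```python
def check_sandbox(config: dict) -> list[str]:
    """Return a list of findings; an empty list means the sandbox checks pass."""
    findings = []

    # Hypothetical key: TTL of sandbox state, in days.
    ttl = config.get("state_ttl_days")
    if ttl is None:
        findings.append("state TTL is not set")
    elif not 7 <= ttl <= 14:
        findings.append(f"state TTL {ttl} is outside the 7-14 day range")

    # Hypothetical key: sandbox isolation mode.
    if config.get("sandbox_type") != "SECURE_ISOLATED":
        findings.append("sandbox type is not SECURE_ISOLATED")

    return findings


print(check_sandbox({"state_ttl_days": 14, "sandbox_type": "SECURE_ISOLATED"}))
print(check_sandbox({"state_ttl_days": 30, "sandbox_type": "DEFAULT"}))
```

Returning findings instead of raising keeps the check read-only and lets a caller fold the result into a per-category score.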

## Non-Functional Requirements

- Read-only inspection: skill must not modify the inspected deployment or its configuration
- All Agent Engine operations use the Python SDK, never gcloud CLI (no gcloud surface exists for Agent Engine)
- Inspection must work from outside VPC-SC perimeters when access levels are properly configured
- Output format is YAML for machine-parseability and human readability
- Scoring weights must reflect production impact: security and reliability weighted higher than monitoring
- Inspection must handle partial failures gracefully (skip unavailable categories, still produce report)
- Results must be deterministic: same deployment state always produces the same score
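
The partial-failure and determinism requirements can be illustrated together. This is a sketch under stated assumptions, not the package's scoring code: the weights in `WEIGHTS` are hypothetical (the PRD only says security and reliability outweigh monitoring), and renormalizing over the categories that ran is one possible policy for skipped categories.

```python
from typing import Callable

# Hypothetical weights; the PRD does not publish the real values.
WEIGHTS = {"security": 30, "reliability": 25, "monitoring": 15}


def score_deployment(checks: dict[str, Callable[[], float]]) -> dict:
    """Run each category check, skip any that raise, and return a report.

    Each check returns a fraction between 0.0 and 1.0, and every key in
    `checks` is expected to appear in WEIGHTS.
    """
    scores, skipped = {}, []
    # Sorted iteration keeps the report order stable across runs.
    for name in sorted(checks):
        try:
            scores[name] = checks[name]()
        except Exception as exc:  # a category that could not be inspected
            skipped.append(f"{name}: {exc}")

    # Renormalize over the categories that actually ran.
    total_weight = sum(WEIGHTS[name] for name in scores)
    weighted = sum(WEIGHTS[name] * value for name, value in scores.items())
    overall = round(100 * weighted / total_weight, 1) if total_weight else None
    return {"overall": overall, "categories": scores, "skipped": skipped}


def failing_check() -> float:
    raise PermissionError("monitoring API not reachable")


report = score_deployment({
    "security": lambda: 0.9,
    "reliability": lambda: 0.7,
    "monitoring": failing_check,
})
print(report)
```

Listing skipped categories in the report makes it visible that the overall figure covers only part of the deployment.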

## Dependencies

- `google-cloud-aiplatform[agent_engines]>=1.120.0` Python SDK
- `gcloud` CLI authenticated for IAM and monitoring queries
- IAM roles: `roles/aiplatform.user` and `roles/monitoring.viewer` on target project
- Cloud Monitoring API enabled
- `curl` for A2A endpoint testing
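
A local preflight can confirm that the tools in this list are present before an inspection starts. The sketch below uses only the standard library. The function name `preflight` is an assumption, and the check covers installation only: it does not verify the SDK version, authentication, IAM roles, or enabled APIs.

```python
import importlib.util
import shutil


def preflight() -> dict[str, bool]:
    """Report which local dependencies are available, without calling any API."""
    return {
        # The SDK is imported as `vertexai`; find_spec only checks that it is installed.
        "vertexai": importlib.util.find_spec("vertexai") is not None,
        "gcloud": shutil.which("gcloud") is not None,
        "curl": shutil.which("curl") is not None,
    }


missing = [name for name, ok in preflight().items() if not ok]
print("missing:", missing or "none")
```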

## Out of Scope

- Modifying or remediating Agent Engine configurations (inspection only, never writes)
- Deploying new agents or updating existing deployments (handled by adk-deployment-specialist)
- Infrastructure provisioning with Terraform (handled by adk-infra-expert)
- Cost optimization recommendations beyond basic model selection guidance
- Load testing or performance benchmarking (inspection uses existing metrics only)
- Cross-project inspection (each invocation targets a single project/agent pair)

@@ -0,0 +1,96 @@

# Error Handling Reference

## SDK / Authentication Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `ModuleNotFoundError: No module named 'vertexai'` | Vertex AI SDK not installed | `pip install "google-cloud-aiplatform[agent_engines]>=1.120.0"` (quoted so the shell does not treat `>=` as a redirect) |
| `google.auth.exceptions.DefaultCredentialsError` | No application default credentials configured | Run `gcloud auth application-default login` or set `GOOGLE_APPLICATION_CREDENTIALS` |
| `google.api_core.exceptions.PermissionDenied (403)` | Service account or user lacks required IAM roles | Grant `roles/aiplatform.user` and `roles/monitoring.viewer` on the project |
| `google.api_core.exceptions.NotFound (404)` | Agent engine ID or resource name is incorrect | Verify the full resource name: `projects/PROJECT/locations/LOCATION/reasoningEngines/ID`. List engines with `client.agent_engines.list()` |
| `google.api_core.exceptions.InvalidArgument (400)` | Malformed request: wrong location or invalid config | Confirm the location matches where the engine was deployed (e.g., `us-central1`) |
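
Many `NotFound` and `InvalidArgument` errors can be caught before any API call by checking the resource name locally. The following is a small sketch of that idea. The helper name `parse_resource_name` is hypothetical, and the pattern encodes only the format quoted in the `NotFound` row, not any further server-side rules.

```python
import re

# Pattern for the resource name format given in the NotFound row.
RESOURCE_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/reasoningEngines/(?P<engine_id>[^/]+)$"
)


def parse_resource_name(name: str) -> dict[str, str]:
    """Split a full resource name into its parts, or raise ValueError."""
    match = RESOURCE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a full reasoningEngines resource name: {name!r}")
    return match.groupdict()


print(parse_resource_name("projects/my-project/locations/us-central1/reasoningEngines/123"))
```

The parsed `location` can then be compared with the location passed to the client, which is the mismatch the `InvalidArgument` row describes.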

## Agent Engine Retrieval Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `client.agent_engines.get()` returns `None` | Engine was deleted or ID is stale | Re-list engines with `client.agent_engines.list()` to find the current resource name |
| `UNAVAILABLE` / `DEADLINE_EXCEEDED` | Transient network or service issue | Retry with exponential backoff; check [Vertex AI status](https://status.cloud.google.com/) |
| `RESOURCE_EXHAUSTED` | Quota limit hit on Agent Engine API | Check quotas in Cloud Console under IAM & Admin > Quotas; request an increase if needed |
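
The "retry with exponential backoff" advice for `UNAVAILABLE` and `DEADLINE_EXCEEDED` might look like the sketch below. It is a generic helper, not code from this package. The built-in `ConnectionError` and `TimeoutError` stand in for the transient errors; in real use the `retryable` tuple would presumably hold classes such as `google.api_core.exceptions.ServiceUnavailable` and `DeadlineExceeded`, which is an assumption about what the SDK raises.

```python
import random
import time


def retry_with_backoff(call, retryable=(ConnectionError, TimeoutError),
                       attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `call()` until it succeeds, sleeping longer after each retryable failure."""
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay, with jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))


# Demonstration with a stand-in call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(retry_with_backoff(flaky, sleep=lambda seconds: None))
```

The jitter spreads out retries from concurrent inspections so they do not all hit the API at the same moment.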

## Common gcloud CLI Misconceptions

**There is no `gcloud` CLI for Agent Engine.** The following commands do NOT exist and will fail:
- `gcloud ai agents describe` / `gcloud ai agents list`
- `gcloud ai reasoning-engines list`
- `gcloud alpha ai agent-engines list`
- `gcloud ai agents update`

All Agent Engine operations must use the Python SDK:

```python
import vertexai
from google.adk.agents import Agent

client = vertexai.Client(project="PROJECT_ID", location="LOCATION")

# List all agent engines
for engine in client.agent_engines.list():
    print(engine.name, engine.display_name)

# Get a specific agent engine
engine = client.agent_engines.get(
    name="projects/PROJECT_ID/locations/LOCATION/reasoningEngines/ENGINE_ID"
)

# Create / deploy a new agent engine
# ADK agent names must be valid Python identifiers, so use an underscore here.
agent = Agent(name="my_agent", model="gemini-2.5-flash")
engine = client.agent_engines.create(agent=agent, config={"display_name": "my-agent"})
```

## A2A Protocol Errors

| Error | Cause | Solution |
|-------|-------|----------|
| AgentCard endpoint returns 404 | Agent not configured for A2A protocol or wrong endpoint URL | Verify A2A is enabled in the agent config; check that `/.well-known/agent-card` path is correct |
| Task API returns 401/403 | Missing or invalid auth token in request header | Include `Authorization: Bearer $(gcloud auth print-access-token)` header |
| Task API returns 500 | Agent crashed while processing the task | Check Cloud Logging for agent error logs; inspect the agent's error handler |
| Status API timeout | Long-running task with no status update mechanism | Implement streaming or polling with reasonable timeout (30-60s) |
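
The polling option in the last row can be sketched as a loop with a hard deadline. This is illustrative only: `get_status` is a hypothetical callable that would wrap the status request for one task, and the terminal state names are assumptions rather than values taken from this document.

```python
import time


def poll_until_done(get_status, timeout=60.0, interval=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `get_status()` until it reports a terminal state or the timeout passes."""
    terminal = {"completed", "failed", "canceled"}  # assumed state names
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        # Stop before sleeping if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"task still {status!r} after {timeout}s")
        sleep(interval)


# Demonstration with canned statuses instead of a real endpoint.
states = iter(["submitted", "working", "completed"])
print(poll_until_done(lambda: next(states), sleep=lambda seconds: None))
```

Injecting `clock` and `sleep` keeps the loop testable without waiting in real time.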

## Monitoring and Observability Errors

| Error | Cause | Solution |
|-------|-------|----------|
| Cloud Monitoring returns no data | Monitoring API not enabled or no recent agent traffic | Run `gcloud services enable monitoring.googleapis.com`; generate test traffic first |
| Metrics query returns `INVALID_ARGUMENT` | Incorrect metric type string or filter syntax | Verify metric type with [Metrics Explorer](https://console.cloud.google.com/monitoring/metrics-explorer) |
| Log queries return empty results | Wrong resource type filter or time range | Use `resource.type="aiplatform.googleapis.com/ReasoningEngine"` and verify timestamps are UTC |
| Cloud Trace shows no spans | OpenTelemetry not configured in the agent | Add Cloud Trace exporter to the agent's OpenTelemetry setup |

## Security Posture Check Errors

| Error | Cause | Solution |
|-------|-------|----------|
| IAM policy query fails | User lacks `resourcemanager.projects.getIamPolicy` permission | Grant `roles/iam.securityReviewer` for read-only IAM inspection |
| VPC-SC perimeter query fails | No organization-level access | VPC-SC queries require org-level permissions; ask an org admin or skip this check |
| Model Armor status unknown | Feature not available in the agent's region | Model Armor availability varies by region; check [region support](https://cloud.google.com/vertex-ai/docs/general/locations) |
| Secret scan false positive | Environment variable names match secret patterns | Add false positive patterns to the exclusion list in the security checker config |
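
To show how an exclusion list suppresses false positives, here is a minimal sketch of a pattern-based secret scan. It does not reproduce the package's `check-security.py`: the patterns, the `EXCLUSIONS` list, and the function name `find_secrets` are all assumptions made for illustration.

```python
import re

# Illustrative pattern only; a real scanner would use a broader, tuned set.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*['\"]?([^\s'\"]+)"),
]
# Hypothetical exclusion list: values that are references, not literal secrets.
EXCLUSIONS = [
    re.compile(r"^\$\{?\w+\}?$"),   # ${VAR} or $VAR
    re.compile(r"^os\.environ"),    # read from the environment at runtime
]


def find_secrets(text: str) -> list[str]:
    """Return lines that assign a literal value to a secret-like name."""
    hits = []
    for line in text.splitlines():
        for pattern in SECRET_PATTERNS:
            match = pattern.search(line)
            if match and not any(ex.search(match.group(2)) for ex in EXCLUSIONS):
                hits.append(line.strip())
                break
    return hits


sample = """
API_KEY = "sk-live-1234567890"
password: ${DB_PASSWORD}
token = os.environ["TOKEN"]
"""
print(find_secrets(sample))
```

Matching exclusions against the assigned value rather than the variable name keeps a real hardcoded secret from being hidden just because its name is on an allow list.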

## Code Execution Sandbox Errors

| Error | Cause | Solution |
|-------|-------|----------|
| State TTL rejected (> 14 days) | Agent Engine enforces max 14-day TTL | Set `state_ttl_days` between 1 and 14 |
| Sandbox timeout | Code execution exceeded the configured timeout | Increase timeout or optimize the executed code; check for infinite loops |
| Sandbox OOM (out of memory) | Code execution consumed too much memory | Reduce data size processed in sandbox; increase memory limits if available |
| Sandbox network error | Code tried to make external network calls from isolated sandbox | Sandbox is network-isolated by design; move network calls outside the sandbox |

## Memory Bank Errors

| Error | Cause | Solution |
|-------|-------|----------|
| Memory Bank query slow (> 500ms) | Indexing not enabled or index stale | Enable indexing in Memory Bank config; rebuild index if needed |
| Memory quota exceeded | Too many memories stored without cleanup | Enable auto-cleanup or increase `max_memories` limit |
| Firestore permission denied | Agent service account lacks Firestore access | Grant `roles/datastore.user` to the agent's service account |

---
*[Tons of Skills](https://tonsofskills.com) by [Intent Solutions](https://intentsolutions.io) | [jeremylongshore.com](https://jeremylongshore.com)*

@@ -0,0 +1,50 @@

# Example Inspection Report

## Example Inspection Report

```yaml
Agent ID: gcp-deployer-agent
Deployment Status: RUNNING
Inspection Date: 2025-12-09

Runtime Configuration:
  Model: gemini-2.5-flash
  Code Execution: ✅ Enabled (TTL: 14 days)
  Memory Bank: ✅ Enabled (retention: 90 days)
  VPC: ✅ Configured (private-vpc-prod)

A2A Protocol Compliance:
  AgentCard: ✅ Valid
  Task API: ✅ Functional
  Status API: ✅ Functional
  Protocol Version: 1.0

Security Posture:
  IAM: ✅ Least privilege (score: 95%)
  VPC-SC: ✅ Enabled
  Model Armor: ✅ Enabled
  Encryption: ✅ At-rest & in-transit
  Overall: 🟢 SECURE (92%)

Performance Metrics (24h):
  Request Count: 12,450
  Error Rate: 2.3% 🟢
  Latency (p95): 1,850ms 🟢
  Token Usage: 450K tokens
  Cost Estimate: $12.50/day

Production Readiness:
  Security: 92% (28/30 points)
  Performance: 88% (22/25 points)
  Monitoring: 95% (19/20 points)
  Compliance: 80% (12/15 points)
  Reliability: 70% (7/10 points)

Overall Score: 87% 🟢 PRODUCTION READY

Recommendations:
  1. Enable multi-region deployment (reliability +10%)
  2. Configure automated backups (compliance +5%)
  3. Add circuit breaker pattern (reliability +5%)
  4. Optimize memory bank indexing (performance +3%)
```
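
The Production Readiness figures in this report can be cross-checked by summing the points per category. The snippet below is a sketch that assumes the overall score is a plain sum of category points, which the report does not state. Under that assumption the points add up to 88 of 100, one point above the 87% shown, so the report's figure may reflect a rounding or weighting step that the example does not spell out.

```python
# Points copied from the Production Readiness section of the example report.
points = {
    "Security": (28, 30),
    "Performance": (22, 25),
    "Monitoring": (19, 20),
    "Compliance": (12, 15),
    "Reliability": (7, 10),
}

earned = sum(got for got, _ in points.values())
available = sum(cap for _, cap in points.values())
print(f"{earned}/{available} points = {100 * earned / available:.0f}%")

# Categories ordered by points still missing, largest gap first.
gaps = sorted(points, key=lambda name: points[name][1] - points[name][0], reverse=True)
print(gaps)
```

Ordering categories by missing points is one simple way to arrive at a prioritized recommendation list like the one in the report.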