Validator Expert
Current State
!gcloud config get-value project 2>/dev/null || echo 'no active project'
!gcloud auth list --filter=status:ACTIVE --format="value(account)" 2>/dev/null || echo 'not authenticated'
Overview
Validate production readiness of Vertex AI Agent Engine deployments by executing weighted checks across five categories: security (30 points), monitoring (20 points), performance (25 points), compliance (15 points), and best practices (10 points). This skill produces a 0-100% composite score with pass/fail per check and prioritized remediation recommendations.
Prerequisites
- gcloud CLI authenticated with roles/aiplatform.viewer, roles/iam.securityReviewer, and roles/monitoring.viewer
- Access to the target Google Cloud project and Vertex AI Agent Engine deployment
- Cloud Monitoring API and Cloud Logging API enabled in the project
- Knowledge of the deployment's expected SLOs (latency targets, error rate thresholds)
- Read-only access to IAM policies, VPC-SC configurations, and service account bindings
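
The role prerequisite can be checked before a validation run. The following is a minimal sketch, assuming the project IAM policy has already been exported as JSON (for example with gcloud projects get-iam-policy PROJECT_ID --format=json) and that roles are bound directly to the member rather than inherited through a group. The function name missing_roles and the sample member are hypothetical.

```python
# Required viewer roles, taken from the prerequisites list above.
REQUIRED_ROLES = {
    "roles/aiplatform.viewer",
    "roles/iam.securityReviewer",
    "roles/monitoring.viewer",
}


def missing_roles(policy: dict, member: str) -> set:
    """Return the required roles that are not bound directly to `member`.

    Assumption: `policy` has the shape of an exported IAM policy, i.e. a
    "bindings" list whose entries carry "role" and "members" keys.
    Group membership and inherited bindings are NOT resolved here.
    """
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return REQUIRED_ROLES - granted


# Hypothetical policy used only for illustration.
sample_policy = {
    "bindings": [
        {"role": "roles/aiplatform.viewer",
         "members": ["user:auditor@example.com"]},
        {"role": "roles/iam.securityReviewer",
         "members": ["user:auditor@example.com", "user:other@example.com"]},
        {"role": "roles/monitoring.viewer",
         "members": ["user:other@example.com"]},
    ]
}

print(sorted(missing_roles(sample_policy, "user:auditor@example.com")))
```

An empty result means all three roles are bound directly; a non-empty result lists what to request from the project admin.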
Instructions
- Retrieve the deployment configuration using the Python SDK (vertexai.Client().agent_engines.get(name)) or REST API (GET https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT}/locations/{LOCATION}/reasoningEngines/{ID}) and parse model, scaling, and feature settings
- Run the security validation suite (see security checklist):
- Check if Agent Identity is enabled (recommended over service accounts for 2025+ deployments)
- If using service accounts, verify IAM roles follow least-privilege (roles/aiplatform.expressUser, not roles/aiplatform.admin)
- Confirm VPC Service Controls perimeter is active and correctly scoped
- Check encryption at rest (CMEK or Google-managed) and in transit (TLS 1.3)
- Scan configuration files and environment variables for hardcoded secrets
- Validate Model Armor is enabled with roles/modelarmor.user granted
- Check Memory Bank IAM Conditions for multi-tenant agents
- Run the monitoring validation suite:
- Verify Cloud Monitoring dashboards exist with required panels (request count, error rate, latency)
- Confirm alerting policies cover error rate spikes, latency SLO breaches, and cost thresholds
- Check token usage tracking is enabled with per-model granularity
- Validate structured logging with severity levels and correlation IDs
- Confirm latency SLOs are defined with p95 and p99 targets
- Run the performance validation suite:
- Verify auto-scaling is configured with appropriate min/max instance counts
- Check resource limits (CPU, memory) match expected workload profile
- Confirm caching strategy is implemented for repeated prompts or embeddings
- Validate Code Execution Sandbox TTL is set between 7 and 14 days
- Check Memory Bank retention policy (min 100 memories, auto-cleanup enabled)
- Run the compliance validation suite:
- Confirm audit logging is enabled for all admin and data access operations
- Verify data residency meets regional requirements
- Check privacy policies and data retention schedules
- Validate backup and disaster recovery configuration
- Calculate weighted scores per category and compute the overall production readiness percentage
- Generate a prioritized recommendation list sorted by score impact per remediation effort
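
The weighted scoring step can be sketched as follows. The category weights come from the overview. The rule that every check within a category counts equally is an assumption made for illustration (the checklists may weight individual checks differently), and the Check type and function names are hypothetical.

```python
from dataclasses import dataclass

# Category weights from the overview; they sum to 100 points.
WEIGHTS = {
    "security": 30,
    "monitoring": 20,
    "performance": 25,
    "compliance": 15,
    "best_practices": 10,
}


@dataclass
class Check:
    category: str
    name: str
    passed: bool


def category_scores(checks):
    """Points earned per category.

    Assumption: checks inside a category are weighted equally, so a
    category earns weight * (passed checks / total checks).
    """
    scores = {}
    for category, weight in WEIGHTS.items():
        in_category = [c for c in checks if c.category == category]
        if not in_category:
            # No evidence collected for this category: award no points.
            scores[category] = 0.0
            continue
        passed = sum(1 for c in in_category if c.passed)
        scores[category] = weight * passed / len(in_category)
    return scores


def overall_score(checks):
    """Overall readiness as a 0-100 percentage (weights sum to 100)."""
    return sum(category_scores(checks).values())


# Hypothetical results for illustration only.
results = [
    Check("security", "least-privilege IAM", True),
    Check("security", "VPC-SC perimeter", True),
    Check("security", "no hardcoded secrets", True),
    Check("security", "Model Armor enabled", False),
    Check("monitoring", "dashboards", True),
    Check("monitoring", "alerting policies", True),
    Check("performance", "auto-scaling", True),
    Check("performance", "caching strategy", False),
    Check("compliance", "audit logging", True),
    Check("best_practices", "structured config", False),
]

print(category_scores(results))
print(overall_score(results))
```

Treating a category with no checks as zero points is a conservative choice: missing evidence lowers the score instead of silently inflating it.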
Output
- Production readiness score: 0-100% with status (READY >= 85%, NEEDS WORK >= 70% and < 85%, NOT READY < 70%)
- Per-category breakdown: security (x/30), monitoring (x/20), performance (x/25), compliance (x/15), best practices (x/10)
- Pass/fail table for each individual check with evidence notes
- Prioritized remediation plan: action items ranked by score improvement per effort
- Comparison to previous validation run (if available) showing score delta
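
The status thresholds, remediation ranking, and score delta listed above can be sketched as follows. This is an illustration under stated assumptions: fractional scores between 84% and 85% are treated as NEEDS WORK, effort is expressed in hours, and the field names (action, points, effort_hours) are hypothetical.

```python
def readiness_status(score: float) -> str:
    """Map a 0-100 score to the status labels used in the output."""
    if score >= 85:
        return "READY"
    if score >= 70:
        return "NEEDS WORK"
    return "NOT READY"


def rank_remediations(items):
    """Order action items by score improvement per unit of effort, best first.

    Assumption: each item is a dict with "points" (score gained if fixed)
    and a positive "effort_hours" estimate.
    """
    return sorted(
        items,
        key=lambda item: item["points"] / item["effort_hours"],
        reverse=True,
    )


def score_delta(current: float, previous=None):
    """Change since the previous run, or None when no previous run exists."""
    return None if previous is None else current - previous


# Hypothetical remediation items for illustration only.
plan = rank_remediations([
    {"action": "Configure CMEK", "points": 6, "effort_hours": 8},
    {"action": "Enable Model Armor", "points": 5, "effort_hours": 2},
    {"action": "Add latency alerting policy", "points": 4, "effort_hours": 1},
])

for item in plan:
    print(item["action"], item["points"] / item["effort_hours"])
print(readiness_status(82.5), score_delta(82.5, 70.0))
```

Ranking by points per hour puts cheap, high-impact fixes first, so a deployment just under the READY threshold can reach it with the least work.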
Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| Insufficient IAM permissions | Viewer roles not granted on target project | Request roles/aiplatform.viewer and roles/iam.securityReviewer from project admin |
| Agent deployment not found | Incorrect agent ID or deployment deleted | Verify agent ID with vertexai.Client().agent_engines.list() or REST GET .../reasoningEngines; confirm deployment region |
| Monitoring API returns no data | API not enabled or agent has zero traffic | Enable Monitoring API; generate synthetic traffic to populate baseline metrics |
| VPC-SC configuration inaccessible | Organization policy restricts VPC-SC reads | Request roles/accesscontextmanager.policyReader at organization level |
| Compliance check inconclusive | Audit logs not enabled or retention too short | Enable Data Access audit logs; set log retention to minimum 365 days |
Examples
Scenario 1: Pre-Launch Validation -- Validate a new ADK agent before production launch. Run all five validation categories. Target score: 85%+ overall, with security score at 28/30 minimum. Generate remediation plan for any failing checks.
Scenario 2: Post-Incident Security Audit -- After a permission escalation incident, re-validate security posture. Focus on IAM least-privilege, service account bindings, and VPC-SC perimeter integrity. Compare scores against the last passing validation.
Scenario 3: Quarterly Compliance Review -- Execute compliance and monitoring validation suites for SOC 2 audit preparation. Verify audit logging coverage, data residency compliance, and backup/DR configuration. Export results as evidence artifacts.
Resources
Validation checklists (read the relevant one during each validation step):
- Security checklist — IAM, VPC-SC, encryption, Model Armor (30% weight)
- Monitoring checklist — dashboards, alerts, SLOs, logging (20% weight)
- Performance & compliance checklist — auto-scaling, caching, audit logs, DR (40% weight)
Official Google Cloud documentation: