Agent Skills: Cloud Solution Architect

ID: microsoft/agent-skills/cloud-solution-architect

Install this agent skill locally:

pnpm dlx add-skill https://github.com/microsoft/skills/tree/HEAD/.github/skills/cloud-solution-architect

.github/skills/cloud-solution-architect/SKILL.md

Skill Metadata

Name
cloud-solution-architect

Cloud Solution Architect

Overview

Design well-architected, production-grade cloud systems following Azure Architecture Center best practices. This skill provides:

  • 10 design principles for Azure applications
  • 6 architecture styles with selection guidance
  • 44 cloud design patterns mapped to WAF pillars
  • Technology choice frameworks for compute, storage, data, messaging
  • Performance antipatterns to avoid
  • Architecture review workflow for systematic design validation

Ten Design Principles for Azure Applications

| # | Principle | Key Tactics |
|---|-----------|-------------|
| 1 | Design for self-healing | Retry with backoff, circuit breaker, bulkhead isolation, health endpoint monitoring, graceful degradation |
| 2 | Make all things redundant | Eliminate single points of failure, use availability zones, deploy multi-region, replicate data |
| 3 | Minimize coordination | Decouple services, use async messaging, embrace eventual consistency, use domain events |
| 4 | Design to scale out | Horizontal scaling, autoscaling rules, stateless services, avoid session stickiness, partition workloads |
| 5 | Partition around limits | Data partitioning (shard/hash/range), respect compute & network limits, use CDNs for static content |
| 6 | Design for operations | Structured logging, distributed tracing, metrics & dashboards, runbook automation, infrastructure as code |
| 7 | Use managed services | Prefer PaaS over IaaS, reduce operational burden, leverage built-in HA/DR/scaling |
| 8 | Use an identity service | Microsoft Entra ID, managed identity, RBAC, avoid storing credentials, zero-trust principles |
| 9 | Design for evolution | Loose coupling, versioned APIs, backward compatibility, async messaging for integration, feature flags |
| 10 | Build for business needs | Define SLAs/SLOs, establish RTO/RPO targets, domain-driven design, cost modeling, composite SLAs |


Architecture Styles

| Style | Description | When to Use | Key Services |
|-------|-------------|-------------|--------------|
| N-tier | Horizontal layers (presentation, business, data) | Traditional enterprise apps, lift-and-shift | App Service, SQL Database, VNets |
| Web-Queue-Worker | Web frontend → message queue → backend worker | Moderate-complexity apps with long-running tasks | App Service, Service Bus, Functions |
| Microservices | Small autonomous services, bounded contexts, independent deploy | Complex domains, independent team scaling | AKS, Container Apps, API Management |
| Event-driven | Pub/sub model, event producers/consumers | Real-time processing, IoT, reactive systems | Event Hubs, Event Grid, Functions |
| Big data | Batch + stream processing pipeline | Analytics, ML pipelines, large-scale data | Synapse, Data Factory, Databricks |
| Big compute | HPC, parallel processing | Simulations, modeling, rendering, genomics | Batch, CycleCloud, HPC VMs |

Selection Criteria

  • Domain complexity → Microservices (high), N-tier (low-medium)
  • Team autonomy → Microservices (independent teams), N-tier (single team)
  • Data volume → Big data (TB+), others (GB)
  • Latency requirements → Event-driven (real-time), Web-Queue-Worker (tolerant)
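
The criteria above can be read as a rough first-pass filter. The sketch below is one possible encoding, not an official decision procedure: the `Workload` fields, the `suggest_styles` name, and the 1 TB cutoff used for "TB+" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of a workload; field names are illustrative."""
    domain_complexity: str   # "low", "medium", or "high"
    independent_teams: bool  # True if several teams deploy independently
    data_volume_tb: float    # approximate data volume in terabytes
    real_time: bool          # True if results are needed in (near) real time


def suggest_styles(w: Workload) -> list[str]:
    """Return candidate architecture styles, one check per criterion."""
    candidates: list[str] = []
    # Domain complexity and team autonomy point to Microservices or N-tier.
    if w.domain_complexity == "high" or w.independent_teams:
        candidates.append("Microservices")
    else:
        candidates.append("N-tier")
    # Data volume: "TB+" is interpreted here as at least 1 TB (an assumption).
    if w.data_volume_tb >= 1:
        candidates.append("Big data")
    # Latency: real-time needs favor Event-driven; tolerant workloads fit
    # Web-Queue-Worker.
    candidates.append("Event-driven" if w.real_time else "Web-Queue-Worker")
    return candidates
```

The result is a shortlist to discuss, since real systems often combine styles and weigh factors this sketch ignores.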

Cloud Design Patterns

44 patterns organized by primary concern. WAF pillar mapping: R=Reliability, S=Security, CO=Cost Optimization, OE=Operational Excellence, PE=Performance Efficiency.

Messaging & Communication

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Asynchronous Request-Reply | Decouple request/response with polling or callbacks | R, PE |
| Claim Check | Split large messages; store payload separately, pass reference | R, PE |
| Choreography | Services coordinate via events without central orchestrator | R, OE |
| Competing Consumers | Multiple consumers process messages from shared queue concurrently | R, PE |
| Messaging Bridge | Connect incompatible messaging systems | R, OE |
| Pipes and Filters | Decompose complex processing into reusable filter stages | R, OE |
| Priority Queue | Prioritize requests so higher-priority work is processed first | R, PE |
| Publisher/Subscriber | Decouple senders from receivers via topics/subscriptions | R, PE |
| Queue-Based Load Leveling | Buffer requests with a queue to smooth intermittent loads | R, PE |
| Sequential Convoy | Process related messages in order while allowing parallel groups | R, PE |
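
As an illustration of the Claim Check row, here is a minimal in-memory sketch. `PayloadStore`, `send`, `receive`, and the 1024-byte threshold are hypothetical names and values chosen for the example; in practice the store would be something like blob storage and the queue a real message broker.

```python
import uuid
from collections import deque


class PayloadStore:
    """In-memory stand-in for external payload storage (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = str(uuid.uuid4())  # the "claim check" handed to the receiver
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def send(store: PayloadStore, queue: deque, payload: bytes, threshold: int = 1024):
    """Enqueue small payloads inline; store large ones and enqueue a reference."""
    if len(payload) > threshold:
        queue.append({"claim_check": store.put(payload)})
    else:
        queue.append({"body": payload})


def receive(store: PayloadStore, queue: deque) -> bytes:
    """Dequeue one message and resolve the claim check if present."""
    message = queue.popleft()
    if "claim_check" in message:
        return store.get(message["claim_check"])
    return message["body"]
```

The message on the queue stays small regardless of payload size, which is the point of the pattern; cleanup of stored payloads after consumption is left out of this sketch.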

Reliability & Resilience

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Bulkhead | Isolate resources per workload to prevent cascading failure | R |
| Circuit Breaker | Stop calling a failing service; fail fast to protect resources | R |
| Compensating Transaction | Undo previously committed steps when a later step fails | R |
| Health Endpoint Monitoring | Expose health checks for load balancers and orchestrators | R, OE |
| Leader Election | Coordinate distributed instances by electing a leader | R |
| Retry | Handle transient faults by retrying with exponential backoff | R |
| Saga | Manage data consistency across microservices with compensating transactions | R |
| Scheduler Agent Supervisor | Coordinate distributed actions with retry and failure handling | R |
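
To make the Circuit Breaker row concrete, the following is a simplified, single-threaded sketch. The class name, the default threshold of 3 failures, and the 30-second reset timeout are assumptions for illustration; production code would typically rely on a resilience library and add locking.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock      # injectable so tests can control time
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_orders)`, where `fetch_orders` is whatever function performs the remote call.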

Data Management

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Cache-Aside | Load data on demand into cache from data store | PE |
| CQRS | Separate read and write models for independent scaling | PE, R |
| Event Sourcing | Store state as append-only sequence of domain events | R, OE |
| Index Table | Create indexes over frequently queried fields in data stores | PE |
| Materialized View | Pre-compute views over data for efficient queries | PE |
| Sharding | Distribute data across partitions for scale and performance | PE, R |
| Static Content Hosting | Serve static content from cloud storage/CDN directly | PE, CO |
| Valet Key | Grant clients limited direct access to storage resources | S, PE |
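
The Cache-Aside row can be sketched as follows. This is an in-process illustration only: `CacheAside`, `load_from_store`, and the 60-second TTL are hypothetical, and a multi-instance deployment would normally use a distributed cache instead of a local dictionary.

```python
import time


class CacheAside:
    """Read-through-on-miss cache with a per-entry time-to-live."""

    def __init__(self, load_from_store, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_store  # callable that reads the backing store
        self._ttl = ttl_seconds
        self._clock = clock           # injectable so tests can control time
        self._cache = {}              # key -> (expires_at, value)

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]           # cache hit, still fresh
        value = self._load(key)       # miss or expired: go to the data store
        self._cache[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call after writing to the data store so the next read reloads."""
        self._cache.pop(key, None)
```

The application, not the cache, owns the load and invalidate steps, which is what distinguishes cache-aside from a transparent read-through cache.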

Design & Structure

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Ambassador | Offload cross-cutting concerns to a helper sidecar proxy | OE |
| Anti-Corruption Layer | Translate between new and legacy system models | OE, R |
| Backends for Frontends | Create separate backends per frontend type (mobile, web, etc.) | OE, PE |
| Compute Resource Consolidation | Combine multiple workloads into fewer compute instances | CO |
| External Configuration Store | Externalize configuration from deployment packages | OE |
| Sidecar | Deploy helper components alongside the main service | OE |
| Strangler Fig | Incrementally migrate legacy systems by replacing pieces | OE, R |

Security & Access

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Federated Identity | Delegate authentication to an external identity provider | S |
| Gatekeeper | Protect services using a dedicated broker that validates requests | S |
| Quarantine | Isolate and validate external assets before allowing use | S |
| Rate Limiting | Control consumption rate of resources by consumers | R, S |
| Throttling | Control resource consumption to sustain SLAs under load | R, PE |
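
One common way to implement the Rate Limiting row is a token bucket. The source does not prescribe an algorithm, so the sketch below is an assumption: `TokenBucket`, `rate_per_sec`, and `capacity` are illustrative names, and the class is not thread-safe.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock            # injectable so tests can control time
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A service would typically keep one bucket per consumer or tenant and return an HTTP 429 response when `allow()` returns False.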

Deployment & Scaling

| Pattern | Summary | Pillars |
|---------|---------|---------|
| Deployment Stamps | Deploy multiple independent copies of application components | R, PE |
| Edge Workload Configuration | Configure workloads differently across diverse edge devices | OE |
| Gateway Aggregation | Aggregate multiple backend calls into a single client request | PE |
| Gateway Offloading | Offload shared functionality (SSL, auth) to a gateway | OE, S |
| Gateway Routing | Route requests to multiple backends using a single endpoint | OE |
| Geode | Deploy backends to multiple regions for active-active serving | R, PE |

See Design Patterns Reference for detailed implementation guidance.


Technology Choices

Decision Framework

For each technology area, evaluate: requirements → constraints → tradeoffs → select.

| Area | Key Options | Selection Criteria |
|------|-------------|-------------------|
| Compute | App Service, Functions, Container Apps, AKS, VMs, Batch | Hosting model, scaling, cost, team skills |
| Storage | Blob Storage, Data Lake, Files, Disks, Managed Lustre | Access patterns, throughput, cost tier |
| Data stores | SQL Database, Cosmos DB, PostgreSQL, Redis, Table Storage | Consistency model, query patterns, scale |
| Messaging | Service Bus, Event Hubs, Event Grid, Queue Storage | Ordering, throughput, pub/sub vs queue |
| Networking | Front Door, Application Gateway, Load Balancer, Traffic Manager | Global vs regional, L4 vs L7, WAF |
| AI services | Azure OpenAI, AI Search, AI Foundry, Document Intelligence | Model needs, data grounding, orchestration |
| Containers | Container Apps, AKS, Container Instances | Operational control vs simplicity |

See Technology Choices Reference for detailed decision trees.


Best Practices

| Practice | Key Guidance |
|----------|-------------|
| API design | RESTful conventions, resource-oriented URIs, HATEOAS, versioning via URL path or header |
| API implementation | Async operations, pagination, idempotent PUT/DELETE, content negotiation, ETag caching |
| Autoscaling | Scale on metrics (CPU, queue depth, custom), cool-down periods, predictive scaling, scale-in protection |
| Background jobs | Use queues or scheduled triggers, idempotent processing, poison message handling, graceful shutdown |
| Caching | Cache-aside pattern, TTL policies, cache invalidation strategies, distributed cache for multi-instance |
| CDN | Static asset offloading, cache-busting with versioned URLs, geo-distribution, HTTPS enforcement |
| Data partitioning | Horizontal (sharding), vertical, functional partitioning; partition key selection for even distribution |
| Partitioning strategies | Hash-based, range-based, directory-based; rebalancing approach, cross-partition query avoidance |
| Host name preservation | Preserve original host header through proxies/gateways for cookies, redirects, auth flows |
| Message encoding | Schema evolution (Avro/Protobuf), backward/forward compatibility, schema registry |
| Monitoring & diagnostics | Structured logging, distributed tracing (W3C Trace Context), metrics, alerts, dashboards |
| Transient fault handling | Retry with exponential backoff + jitter, circuit breaker, idempotency keys, timeout budgets |
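
The "retry with exponential backoff + jitter" guidance in the transient fault handling row might look like the sketch below. It uses the "full jitter" variant (a random delay between zero and the exponential cap), which is one of several jitter strategies; the function name, defaults, and the choice of retryable exception types are assumptions.

```python
import random
import time


def retry(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
          retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, rng=random.random):
    """Call func(), retrying transient failures with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries from many clients across [0, cap).
            sleep(rng() * cap)
```

Only exceptions listed in `retry_on` are retried, so non-transient errors fail immediately. The wrapped operation should be idempotent, since it may run more than once.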

See Best Practices Reference for implementation details.


Performance Antipatterns

Avoid these common patterns that degrade performance under load:

| Antipattern | Problem | Fix |
|-------------|---------|-----|
| Busy Database | Offloading too much processing to the database | Move logic to application tier, use caching |
| Busy Front End | Resource-intensive work on frontend request threads | Offload to background workers/queues |
| Chatty I/O | Many small I/O requests instead of fewer large ones | Batch requests, use bulk APIs, buffer writes |
| Extraneous Fetching | Retrieving more data than needed | Project only required fields, paginate, filter server-side |
| Improper Instantiation | Recreating expensive objects per request | Use singletons, connection pooling, HttpClientFactory |
| Monolithic Persistence | Single data store for all data types | Polyglot persistence: right store for each workload |
| No Caching | Repeatedly fetching unchanged data | Cache-aside pattern, CDN, output caching, Redis |
| Noisy Neighbor | One tenant consuming all shared resources | Bulkhead isolation, per-tenant quotas, throttling |
| Retry Storm | Aggressive retries overwhelming a recovering service | Exponential backoff + jitter, circuit breaker, retry budgets |
| Synchronous I/O | Blocking threads on I/O operations | Async/await, non-blocking I/O, reactive streams |


Mission-Critical Design

For workloads targeting an SLO of 99.99% or higher, address these design areas:

| Design Area | Key Considerations |
|-------------|-------------------|
| Application platform | Multi-region active-active, availability zones, Container Apps or AKS with zone redundancy |
| Application design | Stateless services, idempotent operations, graceful degradation, bulkhead isolation |
| Networking | Azure Front Door (global LB), DDoS Protection, private endpoints, redundant connectivity |
| Data platform | Multi-region Cosmos DB, zone-redundant SQL, async replication, conflict resolution |
| Deployment & testing | Blue-green deployments, canary releases, chaos engineering, automated rollback |
| Health modeling | Composite health scores, dependency health tracking, automated remediation, SLI dashboards |
| Security | Zero-trust, managed identity everywhere, key rotation, WAF policies, threat modeling |
| Operational procedures | Automated runbooks, incident response playbooks, game days, postmortems |
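
To put a 99.99% target in perspective, the arithmetic below converts an SLO into a yearly downtime budget and combines component availabilities. The function names are hypothetical, a 365-day year is assumed, and the formulas assume component failures are independent, which real deployments only approximate.

```python
def downtime_minutes_per_year(slo_percent: float) -> float:
    """Allowed downtime per 365-day year for a given availability target."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return (1 - slo_percent / 100) * minutes_per_year


def serial_availability(*components: float) -> float:
    """Availability when every component is required (values as fractions).

    Assuming independent failures, the availabilities multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N independent redundant copies where any one suffices."""
    return 1 - (1 - single) ** copies
```

Under these assumptions a 99.99% SLO leaves roughly 52.6 minutes of downtime per year, and chaining two required dependencies always yields a composite figure lower than either one alone, which is why redundancy across zones and regions features so heavily in the table above.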

See Mission-Critical Reference for detailed guidance.


Well-Architected Framework (WAF) Pillars

Every architecture decision should be evaluated against all five pillars:

| Pillar | Focus | Key Questions |
|--------|-------|---------------|
| Reliability | Resiliency, availability, disaster recovery | What is the RTO/RPO? How does it handle failures? Is there redundancy? |
| Security | Threat protection, identity, data protection | Is identity managed? Is data encrypted? Are there network controls? |
| Cost Optimization | Cost management, efficiency, right-sizing | Is compute right-sized? Are there reserved instances? Is there waste? |
| Operational Excellence | Monitoring, deployment, automation | Is deployment automated? Is there observability? Are there runbooks? |
| Performance Efficiency | Scaling, load testing, performance targets | Can it scale horizontally? Are there performance baselines? Is caching used? |

WAF Tradeoff Matrix

| Optimizing for... | May impact... |
|-------------------|---------------|
| Reliability (redundancy) | Cost (more resources) |
| Security (isolation) | Performance (added latency) |
| Cost (consolidation) | Reliability (shared failure domains) |
| Performance (caching) | Cost (cache infrastructure), Reliability (stale data) |


Architecture Review Workflow

When reviewing or designing a system, follow this structured approach:

Step 1: Identify Requirements

Functional: What must the system do?
Non-functional:
  - Availability target (e.g., 99.9%, 99.99%)
  - Latency requirements (p50, p95, p99)
  - Throughput (requests/sec, messages/sec)
  - Data residency and compliance
  - Recovery targets (RTO, RPO)
  - Cost constraints

Step 2: Select Architecture Style

Match requirements to an architecture style using the architecture styles table and the selection criteria above.

Step 3: Choose Technology Stack

Use the technology choices decision framework. Prefer managed services (PaaS) over IaaS.

Step 4: Apply Design Patterns

Select relevant patterns from the 44 cloud design patterns based on identified concerns.

Step 5: Address Cross-Cutting Concerns

  • Identity & access — Microsoft Entra ID, managed identity, RBAC
  • Monitoring — Application Insights, Azure Monitor, Log Analytics
  • Security — Network segmentation, encryption at rest/in transit, Key Vault
  • CI/CD — GitHub Actions, Azure DevOps Pipelines, infrastructure as code

Step 6: Validate Against WAF Pillars

Review each pillar systematically. Document tradeoffs explicitly.

Step 7: Document Decisions

Use Architecture Decision Records (ADRs):

# ADR-NNN: [Decision Title]

## Status: [Proposed | Accepted | Deprecated]

## Context
[What is the issue we're addressing?]

## Decision
[What did we decide and why?]

## Consequences
[What are the positive and negative impacts?]


Source

Content derived from the Azure Architecture Center — Microsoft's official guidance for cloud solution architecture on Azure. Covers design principles, architecture styles, cloud design patterns, technology choices, best practices, performance antipatterns, mission-critical design, and the Well-Architected Framework.