Cloud Solution Architect
Overview
Design well-architected, production-grade cloud systems following Azure Architecture Center best practices. This skill provides:
- 10 design principles for Azure applications
- 6 architecture styles with selection guidance
- 44 cloud design patterns mapped to WAF pillars
- Technology choice frameworks for compute, storage, data, messaging
- Performance antipatterns to avoid
- Architecture review workflow for systematic design validation
Ten Design Principles for Azure Applications
| # | Principle | Key Tactics |
|---|-----------|-------------|
| 1 | Design for self-healing | Retry with backoff, circuit breaker, bulkhead isolation, health endpoint monitoring, graceful degradation |
| 2 | Make all things redundant | Eliminate single points of failure, use availability zones, deploy multi-region, replicate data |
| 3 | Minimize coordination | Decouple services, use async messaging, embrace eventual consistency, use domain events |
| 4 | Design to scale out | Horizontal scaling, autoscaling rules, stateless services, avoid session stickiness, partition workloads |
| 5 | Partition around limits | Data partitioning (shard/hash/range), respect compute & network limits, use CDNs for static content |
| 6 | Design for operations | Structured logging, distributed tracing, metrics & dashboards, runbook automation, infrastructure as code |
| 7 | Use managed services | Prefer PaaS over IaaS, reduce operational burden, leverage built-in HA/DR/scaling |
| 8 | Use an identity service | Microsoft Entra ID, managed identity, RBAC, avoid storing credentials, zero-trust principles |
| 9 | Design for evolution | Loose coupling, versioned APIs, backward compatibility, async messaging for integration, feature flags |
| 10 | Build for business needs | Define SLAs/SLOs, establish RTO/RPO targets, domain-driven design, cost modeling, composite SLAs |
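Principle 10 mentions composite SLAs. As a rough sketch of the arithmetic, assuming component failures are independent: components that must all be up multiply their availabilities, while redundant copies fail only when every copy fails. The function names and the availability figures below are illustrative assumptions, not published SLA values.

```python
from math import prod


def serial_availability(availabilities):
    """Composite availability when every component must be up (independence assumed)."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Composite availability when any one of the redundant copies is enough."""
    return 1 - prod(1 - a for a in availabilities)


# Illustrative figures only, not published SLA values.
web, database = 0.9995, 0.9999
single_region = serial_availability([web, database])
two_regions = redundant_availability([single_region, single_region])

print(f"single region: {single_region:.6f}")
print(f"two regions:   {two_regions:.8f}")
```

The serial result is always lower than the weakest component, which is why each added dependency should be weighed against the availability target.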
Architecture Styles
| Style | Description | When to Use | Key Services |
|-------|-------------|-------------|--------------|
| N-tier | Horizontal layers (presentation, business, data) | Traditional enterprise apps, lift-and-shift | App Service, SQL Database, VNets |
| Web-Queue-Worker | Web frontend → message queue → backend worker | Moderate-complexity apps with long-running tasks | App Service, Service Bus, Functions |
| Microservices | Small autonomous services, bounded contexts, independent deploy | Complex domains, independent team scaling | AKS, Container Apps, API Management |
| Event-driven | Pub/sub model, event producers/consumers | Real-time processing, IoT, reactive systems | Event Hubs, Event Grid, Functions |
| Big data | Batch + stream processing pipeline | Analytics, ML pipelines, large-scale data | Synapse, Data Factory, Databricks |
| Big compute | HPC, parallel processing | Simulations, modeling, rendering, genomics | Batch, CycleCloud, HPC VMs |
Selection Criteria
- Domain complexity → Microservices (high), N-tier (low-medium)
- Team autonomy → Microservices (independent teams), N-tier (single team)
- Data volume → Big data (TB+), others (GB)
- Latency requirements → Event-driven (real-time), Web-Queue-Worker (tolerant)
Cloud Design Patterns
44 patterns organized by primary concern. WAF pillar mapping: R=Reliability, S=Security, CO=Cost Optimization, OE=Operational Excellence, PE=Performance Efficiency.
Messaging & Communication
| Pattern | Summary | Pillars |
|---------|---------|---------|
| Asynchronous Request-Reply | Decouple request/response with polling or callbacks | R, PE |
| Claim Check | Split large messages; store payload separately, pass reference | R, PE |
| Choreography | Services coordinate via events without central orchestrator | R, OE |
| Competing Consumers | Multiple consumers process messages from shared queue concurrently | R, PE |
| Messaging Bridge | Connect incompatible messaging systems | R, OE |
| Pipes and Filters | Decompose complex processing into reusable filter stages | R, OE |
| Priority Queue | Prioritize requests so higher-priority work is processed first | R, PE |
| Publisher/Subscriber | Decouple senders from receivers via topics/subscriptions | R, PE |
| Queue-Based Load Leveling | Buffer requests with a queue to smooth intermittent loads | R, PE |
| Sequential Convoy | Process related messages in order while allowing parallel groups | R, PE |
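To make the Claim Check row concrete, here is a minimal in-memory sketch. `BlobStore`, `send`, `receive`, and the 1024-byte threshold are hypothetical names and values chosen for illustration; a real system would use an external store and a message broker rather than a dictionary and a list.

```python
import uuid


class BlobStore:
    """Hypothetical in-memory stand-in for an external payload store."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        key = str(uuid.uuid4())
        self._blobs[key] = payload
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def send(queue: list, store: BlobStore, payload: bytes, threshold: int = 1024) -> None:
    """Enqueue small payloads inline; replace large ones with a claim-check reference."""
    if len(payload) > threshold:
        queue.append({"claim_check": store.put(payload)})
    else:
        queue.append({"body": payload})


def receive(queue: list, store: BlobStore) -> bytes:
    """Dequeue the oldest message and redeem its claim check if it carries one."""
    message = queue.pop(0)
    if "claim_check" in message:
        return store.get(message["claim_check"])
    return message["body"]
```

The queue only ever carries the small reference, so message size limits and broker throughput are decoupled from payload size.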
Reliability & Resilience
| Pattern | Summary | Pillars |
|---------|---------|---------|
| Bulkhead | Isolate resources per workload to prevent cascading failure | R |
| Circuit Breaker | Stop calling a failing service; fail fast to protect resources | R |
| Compensating Transaction | Undo previously committed steps when a later step fails | R |
| Health Endpoint Monitoring | Expose health checks for load balancers and orchestrators | R, OE |
| Leader Election | Coordinate distributed instances by electing a leader | R |
| Retry | Handle transient faults by retrying with exponential backoff | R |
| Saga | Manage data consistency across microservices with compensating transactions | R |
| Scheduler Agent Supervisor | Coordinate distributed actions with retry and failure handling | R |
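The Circuit Breaker row can be illustrated with a small sketch. This is a simplified version under stated assumptions: the class name, the thresholds, and the injectable `clock` are hypothetical, the breaker is not thread-safe, and the half-open state simply lets the next call through once the timeout has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the target."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so this one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function performs the remote call, and treat `CircuitOpenError` as a signal to degrade gracefully.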
Data Management
| Pattern | Summary | Pillars |
|---------|---------|---------|
| Cache-Aside | Load data on demand into cache from data store | PE |
| CQRS | Separate read and write models for independent scaling | PE, R |
| Event Sourcing | Store state as append-only sequence of domain events | R, OE |
| Index Table | Create indexes over frequently queried fields in data stores | PE |
| Materialized View | Pre-compute views over data for efficient queries | PE |
| Sharding | Distribute data across partitions for scale and performance | PE, R |
| Static Content Hosting | Serve static content from cloud storage/CDN directly | PE, CO |
| Valet Key | Grant clients limited direct access to storage resources | S, PE |
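As an illustration of the Cache-Aside row, the sketch below checks the cache first, loads from the data store on a miss, and stores the result with a TTL. `CacheAside`, `load_from_store`, and the injectable `clock` are hypothetical names; a multi-instance deployment would replace the dictionary with a distributed cache.

```python
import time


class CacheAside:
    def __init__(self, load_from_store, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_store      # callable that reads the backing data store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}                # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]               # cache hit that has not expired
        value = self._load(key)           # miss or expired: go to the data store
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Drop an entry, for example after the underlying record is updated."""
        self._entries.pop(key, None)
```

Calling `invalidate` on writes keeps stale reads bounded by the write path rather than by the TTL alone.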
Design & Structure
| Pattern | Summary | Pillars |
|---------|---------|---------|
| Ambassador | Offload cross-cutting concerns to a helper sidecar proxy | OE |
| Anti-Corruption Layer | Translate between new and legacy system models | OE, R |
| Backends for Frontends | Create separate backends per frontend type (mobile, web, etc.) | OE, PE |
| Compute Resource Consolidation | Combine multiple workloads into fewer compute instances | CO |
| External Configuration Store | Externalize configuration from deployment packages | OE |
| Sidecar | Deploy helper components alongside the main service | OE |
| Strangler Fig | Incrementally migrate legacy systems by replacing pieces | OE, R |
Security & Access
| Pattern | Summary | Pillars |
|---------|---------|---------|
| Federated Identity | Delegate authentication to an external identity provider | S |
| Gatekeeper | Protect services using a dedicated broker that validates requests | S |
| Quarantine | Isolate and validate external assets before allowing use | S |
| Rate Limiting | Control consumption rate of resources by consumers | R, S |
| Throttling | Control resource consumption to sustain SLAs under load | R, PE |
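One common way to implement the Rate Limiting row is a token bucket; the pattern itself does not prescribe this algorithm, so treat the sketch below as one option among several. The class name, parameters, and injectable `clock` are assumptions made for illustration, and the code is not thread-safe.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second       # tokens added per second
        self.capacity = capacity          # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request fits the budget."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A per-tenant dictionary of buckets would give each consumer its own budget, which also helps with the per-tenant quota idea behind throttling.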
Deployment & Scaling
| Pattern | Summary | Pillars |
|---------|---------|---------|
| Deployment Stamps | Deploy multiple independent copies of application components | R, PE |
| Edge Workload Configuration | Configure workloads differently across diverse edge devices | OE |
| Gateway Aggregation | Aggregate multiple backend calls into a single client request | PE |
| Gateway Offloading | Offload shared functionality (SSL, auth) to a gateway | OE, S |
| Gateway Routing | Route requests to multiple backends using a single endpoint | OE |
| Geode | Deploy backends to multiple regions for active-active serving | R, PE |
See Design Patterns Reference for detailed implementation guidance.
Technology Choices
Decision Framework
For each technology area, evaluate: requirements → constraints → tradeoffs → select.
| Area | Key Options | Selection Criteria |
|------|-------------|-------------------|
| Compute | App Service, Functions, Container Apps, AKS, VMs, Batch | Hosting model, scaling, cost, team skills |
| Storage | Blob Storage, Data Lake, Files, Disks, Managed Lustre | Access patterns, throughput, cost tier |
| Data stores | SQL Database, Cosmos DB, PostgreSQL, Redis, Table Storage | Consistency model, query patterns, scale |
| Messaging | Service Bus, Event Hubs, Event Grid, Queue Storage | Ordering, throughput, pub/sub vs queue |
| Networking | Front Door, Application Gateway, Load Balancer, Traffic Manager | Global vs regional, L4 vs L7, WAF |
| AI services | Azure OpenAI, AI Search, AI Foundry, Document Intelligence | Model needs, data grounding, orchestration |
| Containers | Container Apps, AKS, Container Instances | Operational control vs simplicity |
See Technology Choices Reference for detailed decision trees.
Best Practices
| Practice | Key Guidance |
|----------|-------------|
| API design | RESTful conventions, resource-oriented URIs, HATEOAS, versioning via URL path or header |
| API implementation | Async operations, pagination, idempotent PUT/DELETE, content negotiation, ETag caching |
| Autoscaling | Scale on metrics (CPU, queue depth, custom), cool-down periods, predictive scaling, scale-in protection |
| Background jobs | Use queues or scheduled triggers, idempotent processing, poison message handling, graceful shutdown |
| Caching | Cache-aside pattern, TTL policies, cache invalidation strategies, distributed cache for multi-instance |
| CDN | Static asset offloading, cache-busting with versioned URLs, geo-distribution, HTTPS enforcement |
| Data partitioning | Horizontal (sharding), vertical, functional partitioning; partition key selection for even distribution |
| Partitioning strategies | Hash-based, range-based, directory-based; rebalancing approach, cross-partition query avoidance |
| Host name preservation | Preserve original host header through proxies/gateways for cookies, redirects, auth flows |
| Message encoding | Schema evolution (Avro/Protobuf), backward/forward compatibility, schema registry |
| Monitoring & diagnostics | Structured logging, distributed tracing (W3C Trace Context), metrics, alerts, dashboards |
| Transient fault handling | Retry with exponential backoff + jitter, circuit breaker, idempotency keys, timeout budgets |
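The transient fault handling row calls for retry with exponential backoff plus jitter. A minimal sketch follows, assuming the "full jitter" variant (a random delay between zero and the exponential ceiling). The function name, default values, and the choice of `ConnectionError` and `TimeoutError` as the retryable errors are illustrative assumptions; only operations that are safe to repeat should be wrapped this way.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call operation(), retrying transient errors with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            # The ceiling doubles each attempt: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads clients out so they do not retry in lockstep.
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the delay schedule can be checked without waiting; production callers would leave them at their defaults.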
See Best Practices Reference for implementation details.
Performance Antipatterns
Avoid these common patterns that degrade performance under load:
| Antipattern | Problem | Fix |
|-------------|---------|-----|
| Busy Database | Offloading too much processing to the database | Move logic to application tier, use caching |
| Busy Front End | Resource-intensive work on frontend request threads | Offload to background workers/queues |
| Chatty I/O | Many small I/O requests instead of fewer large ones | Batch requests, use bulk APIs, buffer writes |
| Extraneous Fetching | Retrieving more data than needed | Project only required fields, paginate, filter server-side |
| Improper Instantiation | Recreating expensive objects per request | Use singletons, connection pooling, HttpClientFactory |
| Monolithic Persistence | Single data store for all data types | Polyglot persistence: right store for each workload |
| No Caching | Repeatedly fetching unchanged data | Cache-aside pattern, CDN, output caching, Redis |
| Noisy Neighbor | One tenant consuming all shared resources | Bulkhead isolation, per-tenant quotas, throttling |
| Retry Storm | Aggressive retries overwhelming a recovering service | Exponential backoff + jitter, circuit breaker, retry budgets |
| Synchronous I/O | Blocking threads on I/O operations | Async/await, non-blocking I/O, reactive streams |
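The Chatty I/O fix (batch requests, use bulk APIs) can be sketched with a small helper that groups records so each round trip carries many items. `batched` and `write_in_batches` are hypothetical helpers, and `write_bulk` stands in for whatever bulk API the target store exposes.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def write_in_batches(records, write_bulk, batch_size=100):
    """Send records through a bulk call per batch instead of one call per record."""
    round_trips = 0
    for batch in batched(records, batch_size):
        write_bulk(batch)  # hypothetical bulk API supplied by the caller
        round_trips += 1
    return round_trips
```

With a batch size of 100, writing 1,000 records takes 10 calls rather than 1,000; the right batch size depends on payload limits and latency of the target service.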
Mission-Critical Design
For workloads targeting 99.99%+ SLO, address these design areas:
| Design Area | Key Considerations |
|-------------|-------------------|
| Application platform | Multi-region active-active, availability zones, Container Apps or AKS with zone redundancy |
| Application design | Stateless services, idempotent operations, graceful degradation, bulkhead isolation |
| Networking | Azure Front Door (global LB), DDoS Protection, private endpoints, redundant connectivity |
| Data platform | Multi-region Cosmos DB, zone-redundant SQL, async replication, conflict resolution |
| Deployment & testing | Blue-green deployments, canary releases, chaos engineering, automated rollback |
| Health modeling | Composite health scores, dependency health tracking, automated remediation, SLI dashboards |
| Security | Zero-trust, managed identity everywhere, key rotation, WAF policies, threat modeling |
| Operational procedures | Automated runbooks, incident response playbooks, game days, postmortems |
See Mission-Critical Reference for detailed guidance.
Well-Architected Framework (WAF) Pillars
Every architecture decision should be evaluated against all five pillars:
| Pillar | Focus | Key Questions |
|--------|-------|---------------|
| Reliability | Resiliency, availability, disaster recovery | What is the RTO/RPO? How does it handle failures? Is there redundancy? |
| Security | Threat protection, identity, data protection | Is identity managed? Is data encrypted? Are there network controls? |
| Cost Optimization | Cost management, efficiency, right-sizing | Is compute right-sized? Are there reserved instances? Is there waste? |
| Operational Excellence | Monitoring, deployment, automation | Is deployment automated? Is there observability? Are there runbooks? |
| Performance Efficiency | Scaling, load testing, performance targets | Can it scale horizontally? Are there performance baselines? Is caching used? |
WAF Tradeoff Matrix
| Optimizing for... | May impact... |
|-------------------|---------------|
| Reliability (redundancy) | Cost (more resources) |
| Security (isolation) | Performance (added latency) |
| Cost (consolidation) | Reliability (shared failure domains) |
| Performance (caching) | Cost (cache infrastructure), Reliability (stale data) |
Architecture Review Workflow
When reviewing or designing a system, follow this structured approach:
Step 1: Identify Requirements
Functional: What must the system do?
Non-functional:
- Availability target (e.g., 99.9%, 99.99%)
- Latency requirements (p50, p95, p99)
- Throughput (requests/sec, messages/sec)
- Data residency and compliance
- Recovery targets (RTO, RPO)
- Cost constraints
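When setting the availability target, it helps to translate the percentage into an allowed downtime budget. The sketch below does that arithmetic; the function name and the 30-day period are assumptions chosen for illustration.

```python
def downtime_budget_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted over the period at the given availability."""
    return (1 - availability_percent / 100) * period_days * 24 * 60


for target in (99.9, 99.95, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.1f} minutes per 30 days")
```

Moving from 99.9% to 99.99% shrinks the 30-day budget from roughly 43 minutes to roughly 4, which usually rules out manual recovery procedures.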
Step 2: Select Architecture Style
Match the requirements to an architecture style using the Architecture Styles table and the Selection Criteria list above.
Step 3: Choose Technology Stack
Use the technology choices decision framework. Prefer managed services (PaaS) over IaaS.
Step 4: Apply Design Patterns
Select relevant patterns from the 44 cloud design patterns based on identified concerns.
Step 5: Address Cross-Cutting Concerns
- Identity & access — Microsoft Entra ID, managed identity, RBAC
- Monitoring — Application Insights, Azure Monitor, Log Analytics
- Security — Network segmentation, encryption at rest/in transit, Key Vault
- CI/CD — GitHub Actions, Azure DevOps Pipelines, infrastructure as code
Step 6: Validate Against WAF Pillars
Review each pillar systematically. Document tradeoffs explicitly.
Step 7: Document Decisions
Use Architecture Decision Records (ADRs):
# ADR-NNN: [Decision Title]
## Status: [Proposed | Accepted | Deprecated]
## Context
[What is the issue we're addressing?]
## Decision
[What did we decide and why?]
## Consequences
[What are the positive and negative impacts?]
References
- Design Patterns Reference — Detailed pattern implementations
- Technology Choices Reference — Decision trees for Azure services
- Best Practices Reference — Implementation guidance
- Mission-Critical Reference — High-availability design
Source
Content derived from the Azure Architecture Center — Microsoft's official guidance for cloud solution architecture on Azure. Covers design principles, architecture styles, cloud design patterns, technology choices, best practices, performance antipatterns, mission-critical design, and the Well-Architected Framework.