Senior Cloud Architect
Expert-level cloud architecture and infrastructure design.
Core Competencies
- Multi-cloud architecture
- AWS, GCP, Azure platforms
- Cloud-native design patterns
- Cost optimization
- Security and compliance
- Migration strategies
- Disaster recovery
- Infrastructure automation
Cloud Platform Comparison
| Service | AWS | GCP | Azure |
|---------|-----|-----|-------|
| Compute | EC2, ECS, EKS | GCE, GKE | VMs, AKS |
| Serverless | Lambda | Cloud Functions | Azure Functions |
| Storage | S3 | Cloud Storage | Blob Storage |
| Database | RDS, DynamoDB | Cloud SQL, Spanner | SQL DB, CosmosDB |
| ML | SageMaker | Vertex AI | Azure ML |
| CDN | CloudFront | Cloud CDN | Azure CDN |
AWS Architecture
Well-Architected Framework
Pillars:
- Operational Excellence
  - Infrastructure as Code
  - Monitoring and observability
  - Incident response
  - Continuous improvement
- Security
  - Identity and access management
  - Data protection
  - Infrastructure protection
  - Incident response
- Reliability
  - Fault tolerance
  - Disaster recovery
  - Change management
  - Failure testing
- Performance Efficiency
  - Right-sizing resources
  - Monitoring performance
  - Trade-off decisions
  - Keeping current
- Cost Optimization
  - Cost awareness
  - Right-sizing
  - Reserved capacity
  - Efficient resources
- Sustainability
  - Region selection
  - Efficient algorithms
  - Hardware utilization
  - Data management
Reference Architecture
┌─────────────────────────────────────────────────────────────┐
│                       Route 53 (DNS)                        │
└─────────────────────────────┬───────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────┐
│                      CloudFront (CDN)                       │
│               WAF (Web Application Firewall)                │
└─────────────────────────────┬───────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────┐
│                  Application Load Balancer                  │
└──────────┬───────────────────────────────────┬──────────────┘
           │                                   │
┌──────────▼──────────┐             ┌──────────▼──────────┐
│   ECS/EKS Cluster   │             │   ECS/EKS Cluster   │
│       (AZ-a)        │             │       (AZ-b)        │
└──────────┬──────────┘             └──────────┬──────────┘
           │                                   │
┌──────────▼───────────────────────────────────▼──────────┐
│                   ElastiCache (Redis)                   │
└─────────────────────────────┬───────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────┐
│                      RDS Multi-AZ                       │
│                   (Primary + Standby)                   │
└─────────────────────────────────────────────────────────┘
Terraform AWS Module
# VPC with public, private, and database subnets across three AZs.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${var.project}-${var.environment}"
  cidr = var.vpc_cidr

  azs             = ["${var.region}a", "${var.region}b", "${var.region}c"]
  private_subnets = var.private_subnets
  public_subnets  = var.public_subnets
  # Assumed variable: the module only creates the database subnet group
  # referenced by the RDS module when database subnets are supplied.
  database_subnets = var.database_subnets

  enable_nat_gateway   = true
  single_nat_gateway   = var.environment != "production" # one shared NAT gateway outside production
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = local.common_tags
}

# EKS cluster with one managed node group in the private subnets.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "${var.project}-${var.environment}"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  eks_managed_node_groups = {
    main = {
      instance_types = var.node_instance_types
      min_size       = var.node_min_size
      max_size       = var.node_max_size
      desired_size   = var.node_desired_size
    }
  }

  tags = local.common_tags
}

# PostgreSQL on RDS; Multi-AZ and longer backups only in production.
module "rds" {
  source  = "terraform-aws-modules/rds/aws"
  version = "~> 6.0"

  identifier = "${var.project}-${var.environment}"

  engine               = "postgres"
  engine_version       = "15"
  family               = "postgres15"
  major_engine_version = "15"
  instance_class       = var.db_instance_class

  allocated_storage     = var.db_allocated_storage
  max_allocated_storage = var.db_max_allocated_storage

  db_name  = var.db_name
  username = var.db_username
  port     = 5432

  multi_az             = var.environment == "production"
  db_subnet_group_name = module.vpc.database_subnet_group
  # module.security_group is not shown here and is assumed to be defined elsewhere.
  vpc_security_group_ids = [module.security_group.security_group_id]

  backup_retention_period = var.environment == "production" ? 30 : 7
  skip_final_snapshot     = var.environment != "production"

  tags = local.common_tags
}
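The modules above take `var.vpc_cidr`, `var.private_subnets`, and `var.public_subnets` as inputs without showing how the subnet ranges are chosen. The following is a minimal sketch of one way to derive them, assuming the VPC range is split into equal /20 blocks with one private and one public subnet per availability zone. The helper name `plan_subnets` and the /20 size are illustrative assumptions, not part of the original configuration.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr: str, az_count: int = 3, new_prefix: int = 20) -> dict:
    """Split a VPC CIDR into equally sized private and public subnets.

    Hypothetical helper: the first `az_count` blocks become private subnets
    and the next `az_count` blocks become public subnets.
    """
    network = ipaddress.ip_network(vpc_cidr)
    # Take only the blocks we need so that large ranges are not fully expanded.
    blocks = list(islice(network.subnets(new_prefix=new_prefix), 2 * az_count))
    if len(blocks) < 2 * az_count:
        raise ValueError("VPC CIDR is too small for the requested layout")
    return {
        "private_subnets": [str(block) for block in blocks[:az_count]],
        "public_subnets": [str(block) for block in blocks[az_count:]],
    }


if __name__ == "__main__":
    # Example input; the 10.0.0.0/16 range is an arbitrary choice.
    for name, cidrs in plan_subnets("10.0.0.0/16").items():
        print(name, cidrs)
```

The resulting lists could be pasted into a tfvars file; Terraform's built-in `cidrsubnet` function is an alternative when the calculation should live in the configuration itself.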
Cost Optimization
Reserved vs On-Demand vs Spot
| Type | Discount | Commitment | Use Case |
|------|----------|------------|----------|
| On-Demand | 0% | None | Variable workloads |
| Reserved | 30-72% | 1-3 years | Steady-state |
| Savings Plans | 30-72% | 1-3 years | Flexible compute |
| Spot | 60-90% | None | Fault-tolerant |
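A commitment only pays off when the instance runs often enough, because committed capacity is billed for every hour while on-demand capacity is billed only for hours used. The sketch below works through that arithmetic. The function names, the 730-hour month, and the example rate are assumptions for illustration, not AWS pricing data.

```python
HOURS_PER_MONTH = 730.0  # assumed average month length in hours


def monthly_cost(on_demand_hourly: float, discount: float,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost of one instance at a fractional discount off on-demand."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_hourly * (1.0 - discount) * hours


def break_even_utilization(discount: float) -> float:
    """Fraction of hours an instance must run for a commitment to break even.

    Committed cost:  rate * (1 - discount) * all_hours
    On-demand cost:  rate * utilization * all_hours
    The two are equal when utilization == 1 - discount.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


if __name__ == "__main__":
    rate = 0.10  # hypothetical on-demand price in USD per hour
    print(f"on-demand: ${monthly_cost(rate, 0.0):.2f}/mo")
    print(f"30% commitment: ${monthly_cost(rate, 0.30):.2f}/mo")
    print(f"break-even utilization: {break_even_utilization(0.30):.0%}")
```

Under these assumptions, a 30% discount is worthwhile only for instances that run more than about 70% of the time, which is why the table pairs Reserved capacity with steady-state workloads.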
Cost Optimization Strategies
Right-sizing:
from datetime import datetime, timedelta, timezone

import boto3


def analyze_utilization(instance_id: str, days: int = 14) -> str:
    """Recommend a right-sizing action from an instance's CPU utilization.

    Memory is not considered: EC2 publishes no memory metric unless the
    CloudWatch agent is installed.
    """
    cloudwatch = boto3.client('cloudwatch')
    end_time = datetime.now(timezone.utc)
    metrics = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=end_time - timedelta(days=days),
        EndTime=end_time,
        Period=3600,  # one datapoint per hour
        Statistics=['Average', 'Maximum'],
    )
    datapoints = metrics['Datapoints']
    if not datapoints:
        # Stopped or newly launched instances may report no data.
        return 'insufficient_data'
    avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
    max_cpu = max(p['Maximum'] for p in datapoints)
    if avg_cpu < 10 and max_cpu < 30:
        return 'downsize'
    elif avg_cpu > 80:
        return 'upsize'
    else:
        return 'optimal'
Cost Allocation Tags:
required_tags:
  - Environment: production|staging|development
  - Project: project-name
  - Owner: team-name
  - CostCenter: cost-center-id
automation:
  - Untagged resources alert after 24 hours
  - Auto-terminate development resources after 7 days
  - Weekly cost reports by tag
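The tagging policy above can be enforced with a small check that runs against each resource's tags. This is a minimal sketch that works on plain dictionaries; how the tags are fetched (for example through an AWS API) is left out. The name `tag_violations` is hypothetical, and the rule that Project, Owner, and CostCenter only need a non-empty value is an assumption, since the policy lists placeholders for them.

```python
import re

# Required tag keys and the pattern each value must fully match.
# Only the Environment values come from the policy; the others are assumed
# to accept any non-empty string.
REQUIRED_TAGS = {
    "Environment": re.compile(r"production|staging|development"),
    "Project": re.compile(r".+"),
    "Owner": re.compile(r".+"),
    "CostCenter": re.compile(r".+"),
}


def tag_violations(tags: dict) -> list:
    """Return a list of problems found in one resource's tag dictionary."""
    problems = []
    for key, pattern in REQUIRED_TAGS.items():
        value = tags.get(key)
        if value is None:
            problems.append(f"missing tag: {key}")
        elif not pattern.fullmatch(value):
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical resource with a misspelled environment and no cost center.
    sample = {"Environment": "prod", "Project": "checkout", "Owner": "payments"}
    for problem in tag_violations(sample):
        print(problem)
```

An empty result means the resource is compliant; a non-empty one could feed the 24-hour untagged-resource alert.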
Cost Dashboard
┌─────────────────────────────────────────────────────────────┐
│                    Monthly Cost Summary                     │
├─────────────────────────────────────────────────────────────┤
│  Total: $45,231                 vs Last Month: +5%          │
│                                                             │
│  By Service:                    By Environment:             │
│  ├── EC2: $18,500 (41%)         ├── Production: $38,000     │
│  ├── RDS: $12,000 (27%)         ├── Staging: $4,500         │
│  ├── S3: $3,200 (7%)            └── Development: $2,731     │
│  ├── Lambda: $1,800 (4%)                                    │
│  └── Other: $9,731 (21%)        Savings Opportunity: $8,200 │
│                                                             │
│  Recommendations:                                           │
│  • Convert 12 instances to Reserved (save $4,200/mo)        │
│  • Delete 5 unused EBS volumes (save $180/mo)               │
│  • Resize 8 over-provisioned instances (save $1,800/mo)     │
└─────────────────────────────────────────────────────────────┘
Disaster Recovery
DR Strategies
| Strategy | RTO | RPO | Cost |
|----------|-----|-----|------|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | Minutes | Minutes | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site Active | Seconds | Near-zero | $$$$ |
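Choosing a strategy from this table amounts to finding the cheapest row whose RTO and RPO are at least as good as the targets. The sketch below encodes the table as coarse tiers and performs that lookup. The tier ordering and the function name `cheapest_strategy` are assumptions for illustration; real targets are usually stated in concrete minutes or hours.

```python
from typing import Optional

# Coarse time tiers, fastest first. "Near-zero" is treated as its own tier
# that is stricter than "Seconds" (an assumption about the table's wording).
TIERS = ["Near-zero", "Seconds", "Minutes", "Hours"]

# (strategy, RTO tier, RPO tier), ordered cheapest first as in the table.
STRATEGIES = [
    ("Backup & Restore", "Hours", "Hours"),
    ("Pilot Light", "Minutes", "Minutes"),
    ("Warm Standby", "Minutes", "Seconds"),
    ("Multi-Site Active", "Seconds", "Near-zero"),
]


def cheapest_strategy(required_rto: str, required_rpo: str) -> Optional[str]:
    """Return the cheapest strategy that meets both targets, or None."""
    rto_limit = TIERS.index(required_rto)
    rpo_limit = TIERS.index(required_rpo)
    for name, rto, rpo in STRATEGIES:
        if TIERS.index(rto) <= rto_limit and TIERS.index(rpo) <= rpo_limit:
            return name
    return None


if __name__ == "__main__":
    print(cheapest_strategy("Minutes", "Seconds"))
```

A result of `None` signals that no row in the table satisfies the targets, for example when a near-zero RTO is required.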
Multi-Region Architecture
┌────────────────────────────────────────────────────────────┐
│                    Global Load Balancer                    │
│                   (Route 53 / Cloud DNS)                   │
└──────────────┬───────────────────────────────┬─────────────┘
               │                               │
┌──────────────▼─────────────┐  ┌──────────────▼─────────────┐
│       Primary Region       │  │      Secondary Region      │
│        (us-east-1)         │  │        (us-west-2)         │
│                            │  │                            │
│  ┌──────────────────────┐  │  │  ┌──────────────────────┐  │
│  │  Application Layer   │  │  │  │  Application Layer   │  │
│  │       (Active)       │  │  │  │   (Standby/Active)   │  │
│  └──────────┬───────────┘  │  │  └──────────┬───────────┘  │
│             │              │  │             │              │
│  ┌──────────▼───────────┐  │  │  ┌──────────▼───────────┐  │
│  │       Database       │──┼──┼──│       Database       │  │
│  │      (Primary)       │  │  │  │    (Read Replica)    │  │
│  └──────────────────────┘  │  │  └──────────────────────┘  │
└────────────────────────────┘  └────────────────────────────┘
                              │
                              │ Cross-Region Replication
                              ▼
                   ┌──────────────────────┐
                   │      S3 Backup       │
                   │    (Multi-Region)    │
                   └──────────────────────┘
Backup Strategy
backup_policy:
  database:
    frequency: continuous
    retention: 35 days
    cross_region: true
    encryption: aws/rds
  application_data:
    frequency: daily
    retention: 90 days
    versioning: enabled
    lifecycle:
      - transition_to_ia: 30 days
      - transition_to_glacier: 90 days
      - expiration: 365 days
  configuration:
    frequency: on_change
    retention: unlimited
    storage: git + s3
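The lifecycle rules for application data define age thresholds at which an object changes storage tier. A small sketch of how those thresholds map an object's age to a stage, assuming each transition applies from the listed day onward; the stage labels and the function name `lifecycle_stage` are illustrative, not S3 storage-class identifiers.

```python
# Day thresholds copied from the lifecycle rules above, oldest stage first.
# The stage labels are illustrative names chosen for this sketch.
LIFECYCLE_STAGES = [
    (365, "expired"),
    (90, "glacier"),
    (30, "infrequent_access"),
]


def lifecycle_stage(age_days: int) -> str:
    """Return the stage an object of the given age falls into."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for threshold, stage in LIFECYCLE_STAGES:
        if age_days >= threshold:
            return stage
    return "standard"


if __name__ == "__main__":
    for age in (0, 30, 90, 365):
        print(age, lifecycle_stage(age))
```

Checking the boundaries this way makes it easy to confirm that a proposed policy never expires data before its intended retention period.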
Security Architecture
Network Security
┌─────────────────────────────────────────────────────────────┐
│                             VPC                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     Public Subnet                     │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  │  │
│  │  │   NAT GW    │  │     ALB     │  │    Bastion    │  │  │
│  │  └─────────────┘  └─────────────┘  └───────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────▼───────────────────────────┐  │
│  │                    Private Subnet                     │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  │  │
│  │  │  App Tier   │  │  App Tier   │  │   App Tier    │  │  │
│  │  └─────────────┘  └─────────────┘  └───────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────▼───────────────────────────┐  │
│  │                      Data Subnet                      │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  │  │
│  │  │     RDS     │  │    Redis    │  │ Elasticsearch │  │  │
│  │  └─────────────┘  └─────────────┘  └───────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
IAM Best Practices
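{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LeastPrivilegeExample",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/uploads/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/Team": "engineering"
        },
        "IpAddress": {
          "aws:VpcSourceIp": ["10.0.0.0/8"]
        }
      }
    }
  ]
}
The address condition uses aws:VpcSourceIp, which matches private addresses for requests that arrive through a VPC endpoint. aws:SourceIp compares against the caller's public address, so a 10.0.0.0/8 range would never match it.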
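A least-privilege policy like the one above names specific actions and a specific resource prefix. The sketch below is one way to flag statements that drift from that practice by using bare wildcards. It is a simplified check, not a replacement for AWS's own policy analysis tools, and the function name `overly_broad_statements` is hypothetical.

```python
def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def overly_broad_statements(policy: dict) -> list:
    """Return the Sid (or index label) of Allow statements using wildcards."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may be a single object
        statements = [statements]
    flagged = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        # Flag "*" and service-wide actions such as "s3:*".
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action or wildcard_resource:
            flagged.append(stmt.get("Sid", f"statement[{index}]"))
    return flagged


if __name__ == "__main__":
    # Hypothetical policy that grants every S3 action on every resource.
    risky = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
    print(overly_broad_statements(risky))
```

The check only looks at Action and Resource; statements that use NotAction or NotResource would need separate handling.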
Reference Materials
- references/aws_patterns.md - AWS architecture patterns
- references/gcp_patterns.md - GCP architecture patterns
- references/multi_cloud.md - Multi-cloud strategies
- references/cost_optimization.md - Cost optimization guide
Scripts
# Infrastructure cost analyzer
python scripts/cost_analyzer.py --account production --period monthly
# DR validation
python scripts/dr_test.py --region us-west-2 --type failover
# Security audit
python scripts/security_audit.py --framework cis --output report.html
# Resource inventory
python scripts/inventory.py --accounts all --format csv