Distributed Storage Skill

Distributed Storage

Purpose

Design distributed storage systems for data workloads. Select the right storage backend (HDFS, S3, ADLS, GCS, MinIO), configure durability and replication, optimize file formats and compression, and implement lifecycle policies.

Agent Protocol

Trigger

Exact user phrases: "HDFS", "S3", "ADLS", "GCS", "MinIO", "distributed storage", "object storage", "file system", "data lake storage", "storage tier", "lifecycle policy", "replication factor", "erasure coding", "storage cost".

Input Context

Storage backend preference (S3, ADLS, GCS, HDFS, MinIO)
Data volume (current TB, annual growth)
Access patterns (frequent, infrequent, archive)
Durability and availability requirements
Budget and cost constraints
Compliance requirements (data residency, encryption)
Network bandwidth and latency constraints

Output Artifact

Distributed storage architecture with backend selection, tiering strategy, lifecycle policy, and cost model.

Response Format

# Storage backend
# Tier configuration
# Lifecycle rules
# Cost projection

Completion Criteria

[ ] Storage backend selected with rationale
[ ] Storage tiers defined (hot/warm/cold/archive)
[ ] Lifecycle policies configured for data movement
[ ] Data durability and replication strategy documented
[ ] Encryption (at rest, in transit) configured
[ ] Cost projection for current and 3-year growth
[ ] Backup and disaster recovery plan defined

Workflow

Step 1: Backend Selection

Object Store Comparison

| Feature | AWS S3 | ADLS Gen2 | GCS | MinIO | |---|---|---|---|---| | Consistency | Read-after-write | Strong | Strong | Strong | | Auth | IAM roles, bucket policies | RBAC, SAS tokens | IAM, service accounts | JWT, OIDC, LDAP | | Encryption | SSE-S3/KMS/CSE | SSE-AES/KMS/CMK | Google-managed/CMEK/CSE | KMS, auto-encryption | | Lifecycle | Transition, expiry, versioning | Tiering, soft-delete | Nearline/Coldline/Archive | Bucket lifecycle | | Max object size | 5 TB | 4.75 TB | 5 TB | 100 TB (configurable) | | S3 Compatible | Native | Yes (via gateway) | Yes (XML API) | Native | | Cost (per TB/mo) | ~$23 (standard) | ~$20 (hot) | ~$20 (standard) | Hardware + ops | | Durability | 99.999999999% (11 9s) | 99.999999999% (11 9s) | 99.999999999% (11 9s) | Configurable | | Availability SLA | 99.99% | 99.9% | 99.95% | Depends on setup |

Backend Selection Decision Tree

Cloud provider?
├── AWS → S3 (best integration with AWS ecosystem)
├── Azure → ADLS Gen2 (best with Azure AD, Active Directory)
├── GCP → GCS (strong consistency, best for Google ecosystem)
├── Multi-cloud → MinIO (S3-compatible layer across clouds)
└── On-premise / air-gapped → MinIO or HDFS

Primary workload?
├── Data lake with compute engines → Object store (S3/ADLS/GCS)
├── Legacy Hadoop/Spark without cloud → HDFS
├── Private cloud, edge, or air-gapped → MinIO
└── High-performance computing → Parallel file system (Lustre, GPFS)

Step 2: Storage Tiering

Tier Characteristics

| Tier | Storage Class | Latency | Cost/TB/mo | Use Case | |---|---|---|---|---| | Hot | S3 Standard, ADLS Hot | Millisecond | ~$23 | Active data, daily pipelines, ML training | | Warm | S3 Infrequent Access | Millisecond | ~$12.50 | Monthly queries, staging data | | Cold | S3 Glacier Instant | Millisecond (instant) | ~$4 | Quarterly access, compliance retention | | Archive | S3 Glacier Deep Archive | 12 hours | ~$1 | Annual access, legal holds |

Tiering Strategy

Hot: last 30-90 days of data. Warm: 90 days to 1 year. Cold: 1-3 years. Archive: 3-7+ years (retention-based). Intelligent tiering (S3) auto-moves objects between hot and warm based on access patterns.

Step 3: Lifecycle Policies

{
  "Rules": [
    {
      "Id": "TierAndExpire",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 2555 }
    },
    {
      "Id": "ExpireTempFiles",
      "Status": "Enabled",
      "Filter": { "Prefix": "tmp/" },
      "Expiration": { "Days": 7 }
    }
  ]
}

Step 4: Durability and Replication

Object Store Durability

S3/ADLS/GCS provide 11 9s of durability via erasure coding across multiple devices/facilities. Replication is handled by the cloud provider. For additional protection: cross-region replication (CRR) for disaster recovery, same-region replication (SRR) for compliance.

HDFS Replication

Default replication factor: 3 (1 primary, 2 replicas on different racks). Storage overhead: 3x. Erasure coding reduces overhead: RS-6-3 (1.33x overhead), RS-10-4 (1.5x overhead). EC recommended for cold data on HDFS (> 100TB).

Backup and DR

Cross-region backup for critical data. Versioning enabled on all buckets. Replication Rules: replicate Tier 1 data to DR region synchronously or asynchronously. Test restore quarterly.

Step 5: Compression and Encoding

| Codec | Speed | Compression Ratio | Use Case | |---|---|---|---| | ZSTD (level 3) | Very fast | Good (2-4x) | Default for Parquet/ORC | | GZIP (level 6) | Moderate | Best (3-6x) | Archive, cold data | | Snappy | Fastest | Fair (1.5-3x) | Streaming, hot data | | LZ4 | Fastest | Fair (1.5-2x) | Logs, real-time | | BZIP2 | Slow | Best (4-8x) | Archive only |

Compress data at the file format level (Parquet/ORC) rather than at the transport level. Avoid double compression. Test: compress a representative dataset with each codec, measure speed and ratio.

Step 6: Data Layout

Partitioning by File Structure

Organize storage by: source/system, data domain, date partition, file. Example: s3://data-lake/raw/salesforce/orders/year=2026/month=05/day=01/load_id=abc123/orders_001.parquet. This layout enables: partition pruning, easy lifecycle management, clear ownership boundaries.

File Size Targets

Small files cause performance problems for query engines. Target file size: 128MB-1GB. Use compaction jobs to merge small files. Monitor average file size per dataset.

Step 7: HDFS Architecture

NameNode and DataNode

Active NameNode manages filesystem metadata (namespace, blocks, locations). Standby NameNode for HA via QJM (Quorum Journal Manager). DataNodes store block data and report to NameNode. Block size: 128MB or 256MB default. Rack awareness for replica placement.

HDFS Key Config

# hdfs-site.xml
dfs.replication: 3
dfs.block.size: 268435456  # 256MB
dfs.namenode.handler.count: 100
dfs.datanode.handler.count: 50
dfs.namenode.gc.time.threshold: 60s
dfs.namenode.checkpoint.period: 3600  # 1 hour
dfs.permissions.enabled: true

Step 8: Security

Encryption at Rest

Server-side encryption: SSE-S3 (AES-256, S3-managed keys), SSE-KMS (AWS KMS-managed keys), SSE-C (customer-provided keys). Client-side encryption: encrypt before upload, decrypt after download. For compliance: SSE-KMS with audit logging.

Encryption in Transit

TLS 1.2+ for all S3/ADLS/GCS API calls. Enforce HTTPS-only bucket policies. HDFS: enable SSL for RPC and data transfer.

{
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::data-lake/*",
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}

Step 9: Cost Optimization

Strategies

Use intelligent tiering for automatic cost savings on variable-access data. Request (S3) Reduced Redundancy for non-critical data (lower durability = lower cost). Reserved capacity for predictable usage. Monitor and alert on cost anomalies.

Cost Monitoring

Track: storage cost per bucket/container, data transfer costs (egress), request costs (PUT/GET/LIST), lifecycle transition costs. Allocate costs to teams/domains via tag-based cost allocation.

Decision Trees

Storage Backend

Cloud adoption strategy?
├── Single cloud → Native object store (S3/ADLS/GCS)
├── Multi-cloud → MinIO (abstraction layer)
├── On-premise
│   ├── Hadoop ecosystem → HDFS
│   └── Kubernetes-native → MinIO (S3-compatible)
└── Edge / IoT → MinIO (lightweight, S3 API)

Lifecycle Policy

Data access pattern?
├── Accessed frequently → Hot tier (delete when no longer needed)
├── Accessed occasionally → Intelligent tiering or manual tiering
├── Compliance retention → Write once, tier to cold → delete after retention
└── Temporary data → Hot tier → delete in 7-30 days

MinIO Deployment Architecture

Multi-Node Setup

MinIO runs as a distributed system across multiple nodes. Minimum 4 nodes for erasure coding protection. Each node: 4+ drives, SSD/NVMe preferred for performance. Deployed via: Docker Compose (dev/tiny), Kubernetes Operator (production), bare metal (HPC).

# MinIO Kubernetes Operator
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: data-lake-storage
spec:
  image: ${MINIO_IMAGE_REF:?Set MINIO_IMAGE_REF to quay.io/minio/minio@sha256:<reviewed-digest>}
  credsSecret:
    name: minio-creds
  pools:
    - servers: 4
      volumesPerServer: 4
      volumeClaimTemplate:
        spec:
          storageClassName: ssd
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 4Ti
  mountPath: /export
  requestAutoCert: true
  buckets:
    - name: raw
    - name: staging
    - name: analytics

Erasure Coding

MinIO uses Reed-Solomon erasure coding. Default parity: N/2 drives (max protection). Configurable: standard (EC:4) for 8-drive setup, reduced (EC:2) for performance. Read tolerance: lose up to N/2 drives. Write tolerance: lose up to N/2 drives. Storage overhead: 2x for N/2 parity, 1.33x for N/4 parity.

# Erasure coding parity by deployment size
# 4 nodes × 4 drives = 16 drives
#   EC:8 → tolerate 8 drive failures, 2x overhead
#   EC:4 → tolerate 4 drive failures, 1.33x overhead
#   EC:2 → tolerate 2 drive failures, 1.14x overhead

# Set via environment variable
MINIO_STORAGE_CLASS_STANDARD: "EC:4"
MINIO_STORAGE_CLASS_RRS: "EC:2"     # Reduced redundancy for temp data

Performance Tuning

# Drive-level
# Use XFS filesystem with noatime,nodiratime mount options
# Separate write-intensive (metadata, WAL) from read-intensive drives

# Network-level
# MinIO uses S3-compatible HTTP/REST
# Enable HTTP/2 for multiplexing
# Use load balancer (nginx, HAProxy) for multi-node distribution

# Kernel tuning
# net.core.rmem_max: 134217728    # 128MB receive buffer
# net.core.wmem_max: 134217728    # 128MB send buffer
# net.ipv4.tcp_congestion_control: bbr  # BBR for better throughput
# vm.dirty_ratio: 30              # Page cache dirty ratio
# vm.dirty_background_ratio: 10

Step 10: Data Transfer and Migration

Transfer Acceleration

# AWS S3 Transfer Acceleration
# Uses CloudFront edge locations, TCP optimization
# Enable: s3.put_bucket_accelerate_configuration
# Best for: cross-region uploads, large objects > 1GB
# Cost: premium per GB transferred
# Test: s3-accelerate-speedtest.s3-accelerate.amazonaws.com

# Azure Data Box / Import-Export
# Physical transfer: Data Box (80TB), Data Box Disk (8TB)
# For: > 10TB initial loads, slow networks, air-gapped
# Timeline: order → receive → load → return → ingest (2-4 weeks)

# GCS Storage Transfer Service
# Online: from S3, HTTP, or GS URLs
# Schedule: one-time or recurring
# Best for: ongoing sync from S3 to GCS

Migration Strategy

migration_phases:
  phase_1_assessment:
    - inventory current storage (total TB, object count, avg size)
    - identify access patterns (hot vs cold, frequency)
    - document retention and compliance requirements
    - estimate egress costs and transfer time

  phase_2_pilot:
    - migrate 1-2 TB of non-critical data first
    - validate performance, cost, and access patterns
    - document issues and adjust strategy

  phase_3_bulk:
    - parallel multi-threaded copy (aws s3 sync, rclone, distcp)
    - validate data integrity checksums post-transfer
    - monitor for failures (rate limits, timeouts)

  phase_4_cutover:
    - final sync of delta changes
    - update application configs to point to new storage
    - redirect DNS / endpoint to new location
    - decommission old storage after 30-day validation window

Step 11: Compliance Features

Object Lock / WORM

# S3 Object Lock (compliance mode)
# Prevents object deletion or overwrite for retention period
# Governance mode: users with special permissions can override
# Compliance mode: no one (including root) can override
# Legal hold: prevents deletion regardless of retention

object_lock_examples:
  - name: "Financial records (SOX)"
    mode: COMPLIANCE
    retention_days: 2555  # 7 years
    bucket: "data-lake/finance/"

  - name: "Logs (internal policy)"
    mode: GOVERNANCE
    retention_days: 365  # 1 year
    bucket: "data-lake/logs/"

# ADLS Gen2: immutability policy
# Set at container level
# Time-based retention or legal hold

Data Residency

# Data residency requirements
# GDPR: data must stay in EU region
# Financial services: data within country borders
# Healthcare: HIPAA requires US-only storage (for US patients)

# Implementation:
# - Per-region buckets with IAM restrictions
# - S3 bucket policies denying cross-region replication for sensitive data
# - MinIO multi-tenant deployment per region
# - Regular audit to verify no data leaves allowed regions

Step 12: Cost Modeling

Cost Components

cost_components:
  storage:
    - per-GB-month by tier (hot, warm, cold, archive)
    - minimum storage duration charges (Glazer: 90 days min)
    - early deletion fees (cold/archive tiers)

  operations:
    - PUT/COPY/POST/LIST requests (per 1000)
    - GET/SELECT requests (per 1000)
    - lifecycle transition requests (per 1000)

  data_transfer:
    - upload (usually free)
    - download / egress (per GB, tiered pricing)
    - cross-region replication (per GB)
    - transfer acceleration (premium per GB)

  additional:
    - encryption (KMS: per key, per API call)
    - monitoring (CloudWatch, metrics per custom metric)
    - backup / replication (CRR storage in destination region)

Estimation Tool

# Monthly storage cost estimate
cost_estimate:
  hot_data:
    volume: 50 TB
    unit_cost: $23/TB/mo
    subtotal: $1,150/mo

  warm_data:
    volume: 150 TB
    unit_cost: $12.50/TB/mo
    subtotal: $1,875/mo

  cold_data:
    volume: 300 TB
    unit_cost: $4/TB/mo
    subtotal: $1,200/mo

  archive_data:
    volume: 500 TB
    unit_cost: $1/TB/mo
    subtotal: $500/mo

  operations:
    requests: $200/mo  # Based on 10M PUT + 100M GET

  data_transfer:
    egress: $300/mo    # Based on 10TB egress

  monitoring_backup:
    cross_region_replication: $500/mo  # 50TB replicated
    monitoring: $100/mo

  total_monthly: $5,825/mo
  total_annual: $69,900/yr

Step 13: Monitoring and Observability

Storage Metrics

# Object store monitoring (CloudWatch, Azure Monitor, GCS Ops)
metrics:
  storage:
    - BucketSizeBytes (by storage tier)
    - NumberOfObjects
    - AverageObjectSize

  requests:
    - AllRequests (count)
    - GetRequests / PutRequests / ListRequests
    - 4xxErrors / 5xxErrors
    - FirstByteLatency / TotalRequestLatency

  throughput:
    - BytesDownloaded / BytesUploaded
    - GetBandwidth / PutBandwidth (MB/s)

  cost:
    - StorageCost (by bucket/tag)
    - RequestCost
    - DataTransferCost
    - TotalCost

Alerting Thresholds

alerts:
  - name: "Storage growth anomaly"
    metric: BucketSizeBytes
    threshold: "> 20% week-over-week increase"
    severity: warning

  - name: "Error rate spike"
    metric: 5xxErrors
    threshold: "> 1% of total requests"
    severity: critical

  - name: "Latency degradation"
    metric: FirstByteLatency
    threshold: "p99 > 500ms"
    severity: warning

  - name: "Budget threshold"
    metric: TotalCost
    threshold: "> 80% of monthly budget"
    severity: warning

  - name: "Lifecycle failure"
    metric: LifecycleTransitions
    threshold: "transitions failed > 10 per hour"
    severity: warning

Decision Trees (continued)

Compression Codec Selection

Workload type?
├── Hot data, frequent queries → ZSTD (level 3) — fast decompression
├── Cold data, archive → GZIP (level 6) — best compression ratio
├── Streaming, low-latency → Snappy or LZ4 — minimal CPU overhead
├── Columnar formats (Parquet/ORC)
│   ├── Default → ZSTD (best balance)
│   └── Legacy compatibility → Snappy
└── JSON/CSV files
    ├── Compressed → GZIP (most tools support it)
    └── Splittable → BZIP2 or LZ4

Replication Strategy

Compliance requirement?
├── Disaster recovery (region outage)
│   ├── RTO < 1 hour → Synchronous CRR
│   └── RTO < 4 hours → Asynchronous CRR
├── Data sovereignty (must stay in region)
│   └── Same-region replication (SRR) + backup to another AZ
├── GDPR right to erasure
│   ├── Replicate selectively (no unnecessary copies)
│   └── Document replication topology for audit
└── No compliance requirement
    └── No replication — rely on cloud provider durability

Rules

Separate storage from compute for elasticity
Use lifecycle policies to automatically tier data
Compress all data at rest — storage is not free
Encrypt data at rest and in transit
Monitor storage growth and set budget alerts
Document retention policies for compliance requirements
Prefer object storage over HDFS for new deployments
Use versioning for data protection against accidental deletion
Target 128MB-1GB file sizes for query engine performance
Monitor and alert on storage cost anomalies by team/project
Test disaster recovery procedures annually
Allocate storage costs to teams for cost accountability
Run MinIO with at least 4 nodes and N/2 erasure coding parity for production
Benchmark storage performance before committing to architecture
Use Object Lock (WORM) for compliance-bound data
Plan data migration with pilot phase before bulk transfer
Model total cost of ownership including operations, transfer, and egress
Alert on storage growth anomalies and budget thresholds

Agent Skills: Distributed Storage

Install this agent skill to your local

Skill Files