Cluster Stacks
Standardize infrastructure foundations across Kubernetes clusters. Capture infrastructure components as versioned stacks and replicate them to new clusters for consistent platform experiences.
What is a Cluster Stack?
A cluster stack is a versioned collection of infrastructure components that provide the foundation for application workloads. Think of it as an infrastructure template that includes ingress controllers, certificate managers, monitoring agents, logging, security tools, and other cluster-wide services.
What gets included:
- Ingress controllers (NGINX, Traefik, AWS ALB)
- Certificate managers (cert-manager for Let's Encrypt)
- Monitoring agents (Prometheus, Datadog, New Relic)
- Logging infrastructure (Fluentd, Loki)
- Security tools (Falco, OPA/Gatekeeper)
- Service meshes (Istio, Linkerd)
- Custom operators and CRDs
Cluster stacks enable:
- Standardized infrastructure across teams
- Faster cluster provisioning (minutes instead of hours)
- Infrastructure compliance and governance
- Version-controlled platform evolution
Business Value
Standardization:
- Every cluster starts with the same foundation
- Eliminate "works in one cluster but not another" problems
- Enforce organizational security and monitoring baselines
Speed:
- New clusters production-ready in about 15 minutes (not hours or days)
- No manual installation of infrastructure components
- Replicate battle-tested configurations instantly
Governance:
- Centralized control over platform components
- Version infrastructure independently from applications
- Audit trail for infrastructure changes
Efficiency:
- Platform teams maintain golden configurations
- Application teams inherit standardized infrastructure
- Reduce infrastructure drift across clusters
How Cluster Stacks Work
Golden Cluster                         New Cluster
      ↓                                     ↓
[Infrastructure Components]           [Empty Cluster]
      ↓                                     ↓
cod cluster stack capture             cod cluster stack deploy
      ↓                                     ↓
Cluster Stack (v1.0)                  Replicated Infrastructure
      ↓                                     ↓
[Versioned in Codiac]                 [Same components installed]
Workflow:
- Configure infrastructure in a "golden" cluster
- Capture infrastructure as cluster stack
- Version and tag the stack
- Deploy stack to new clusters
- Update stack as infrastructure evolves
Infrastructure Enterprise (infrx)
Cluster stacks are managed through a dedicated Infrastructure Enterprise (named infrx by convention). This separates infrastructure-level assets from application assets.
Structure:
my-company (application enterprise)
└── Environments
├── prod
│ └── Clusters
│ ├── prod-us-east
│ └── prod-eu-west
infrx (infrastructure enterprise)
└── Infrastructure Assets
├── nginx-ingress
├── cert-manager
├── prometheus
└── fluentd
Why separate?
- Different lifecycle (infrastructure updates less frequent)
- Different ownership (platform team vs app teams)
- Different permissions (admin vs developer access)
- Clear separation of concerns
Capturing Cluster Stacks
Create a reusable infrastructure template from an existing cluster.
Basic Capture
cod cluster stack capture my-golden-cluster
What happens:
- Codiac scans cluster for infrastructure components
- Identifies assets in the infrastructure enterprise (infrx)
- Captures component versions and configurations
- Creates versioned cluster stack definition
- Stores stack in enterprise for reuse
Captured components:
- All assets deployed to the infrx enterprise
- Kubernetes operators (installed via Helm or manifests)
- Custom Resource Definitions (CRDs)
- Cluster-wide configurations
- Infrastructure-specific secrets (references, not values)
Capture with Options
cod cluster stack capture my-golden-cluster \
--name production-stack-v2 \
--description "Production infrastructure with Istio + Prometheus" \
--tag stable,prod-ready \
--include-monitoring \
--include-logging
Options:
- --name: Stack name (defaults to cluster name)
- --description: Documentation for the stack
- --tag: Tags for organization
- --include-*: Include specific component categories
- --exclude-*: Exclude certain components
Selective Capture
Capture only ingress infrastructure:
cod cluster stack capture my-cluster \
--name ingress-stack \
--include ingress,certificates
Capture monitoring stack:
cod cluster stack capture my-cluster \
--name monitoring-stack \
--include monitoring,alerting,dashboards
Use case: Create specialized stacks for different cluster types (dev clusters don't need full monitoring, production clusters require everything).
Viewing Cluster Stacks
List All Stacks
cod cluster stack list
Example output:
NAME VERSION CREATED COMPONENTS TAGS
production-stack-v2 2.1.0 2026-01-20 12 assets stable, prod-ready
dev-stack 1.5.0 2026-01-15 6 assets dev-clusters
monitoring-stack 3.0.0 2026-01-10 8 assets monitoring
View Stack Details
cod cluster stack view production-stack-v2
Output:
Cluster Stack: production-stack-v2
Version: 2.1.0
Created: 2026-01-20 14:30:00
Creator: platform-team@company.com
Description: Production infrastructure with Istio + Prometheus
Tags: stable, prod-ready
Components (12):
nginx-ingress v1.9.5
cert-manager v1.13.0
prometheus v2.48.0
grafana v10.2.0
istio-base v1.20.0
istio-istiod v1.20.0
fluentd v1.16.0
datadog-agent v7.50.0
falco v0.36.0
external-dns v0.14.0
cluster-autoscaler v1.28.0
metrics-server v0.6.4
Configuration Scopes:
- Environment: prod
- Infrastructure Enterprise: infrx
Compare Stacks
cod cluster stack diff production-stack-v2 production-stack-v1
Shows differences:
Components Added:
+ falco v0.36.0 (security monitoring)
+ external-dns v0.14.0 (automatic DNS management)
Components Updated:
✓ prometheus: v2.45.0 → v2.48.0
✓ istio-istiod: v1.19.0 → v1.20.0
Components Removed:
- kube-state-metrics v2.10.0 (replaced by Prometheus ServiceMonitor)
Configuration Changes:
nginx-ingress:
replica-count: 2 → 3
resource-limits.cpu: 200m → 500m
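The component portion of a diff like this boils down to comparing two name-to-version mappings. A minimal sketch of that comparison, assuming each stack snapshot is a plain `{component: version}` dict (the real data model is Codiac-internal):

```python
def diff_stacks(old, new):
    """Compare two stack snapshots given as {component: version} dicts."""
    added = {c: v for c, v in new.items() if c not in old}
    removed = {c: v for c, v in old.items() if c not in new}
    updated = {c: (old[c], new[c])
               for c in old.keys() & new.keys() if old[c] != new[c]}
    return added, updated, removed

v1 = {"prometheus": "2.45.0", "istio-istiod": "1.19.0",
      "kube-state-metrics": "2.10.0"}
v2 = {"prometheus": "2.48.0", "istio-istiod": "1.20.0", "falco": "0.36.0"}

added, updated, removed = diff_stacks(v1, v2)
print(added)    # falco shows up as a new component
print(updated)  # prometheus and istio-istiod version bumps
print(removed)  # kube-state-metrics was dropped
```

Configuration changes (like the nginx-ingress replica count above) would be diffed the same way over nested config values.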
Deploying Cluster Stacks
Apply a cluster stack to new or existing clusters.
Deploy to New Cluster
# Create cluster
cod cluster create prod-us-west \
--provider aws \
--region us-west-2 \
--environment prod
# Deploy infrastructure stack
cod cluster stack deploy production-stack-v2 \
--cluster prod-us-west
Timeline:
- Cluster creation: 8-12 minutes (automated)
- Stack deployment: 3-5 minutes
- Total: ~15 minutes to production-ready cluster
Deploy to Existing Cluster
cod cluster stack deploy monitoring-stack \
--cluster existing-cluster
Use case: Add monitoring infrastructure to a cluster that previously lacked it.
Deploy with Overrides
cod cluster stack deploy production-stack-v2 \
--cluster prod-eu-west \
--override ingress.replicas=5 \
--override monitoring.retention=90d
When to override:
- Different cluster sizes (more replicas for larger clusters)
- Regional variations (data retention policies)
- Environment-specific settings (dev vs prod resource limits)
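Dotted-path overrides like `ingress.replicas=5` can be thought of as targeted patches on a nested configuration. A minimal sketch of that merge, assuming the stack config is a plain nested dict (illustrative only, not Codiac's actual override mechanism):

```python
def apply_override(config, path, value):
    """Set a dotted-path override (e.g. 'ingress.replicas') on a nested dict."""
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        # Walk down, creating intermediate sections as needed.
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

config = {"ingress": {"replicas": 3}, "monitoring": {"retention": "30d"}}
apply_override(config, "ingress.replicas", 5)
apply_override(config, "monitoring.retention", "90d")
print(config)  # {'ingress': {'replicas': 5}, 'monitoring': {'retention': '90d'}}
```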
Updating Cluster Stacks
Infrastructure evolves. Update existing stacks and deploy changes.
Create New Stack Version
# Make changes to golden cluster infrastructure
cod asset deploy --enterprise infrx --cabinet prod-infra --asset prometheus --update 2.49.0
# Capture updated stack
cod cluster stack capture my-golden-cluster \
--name production-stack-v2 \
--version 2.2.0 \
--tag latest
Result: New version 2.2.0 of production-stack-v2 with updated Prometheus.
Rolling Updates
Update all clusters with new stack version:
# List clusters using old stack
cod cluster list --stack production-stack-v2:2.1.0
# Update clusters one by one
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-us-east
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-us-west
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-eu-west
Automated rollout:
# Update all clusters in environment
cod cluster stack deploy production-stack-v2:2.2.0 \
--environment prod \
--rolling-update
Safety features:
- One cluster at a time
- Health checks between updates
- Automatic rollback on failure
- Configurable delay between clusters
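The safety behavior above can be sketched as a simple orchestration loop: update one cluster, health-check it, roll back and stop on the first failure. Everything here (the `deploy`, `healthy`, and `rollback` callables) is hypothetical stand-in logic; the real rollout runs inside Codiac:

```python
def rolling_update(clusters, deploy, healthy, rollback):
    """Update clusters one at a time; stop and roll back on the first failure."""
    done = []
    for cluster in clusters:
        deploy(cluster)
        if not healthy(cluster):
            rollback(cluster)     # automatic rollback on failed health check
            return done, cluster  # clusters updated so far, failed cluster
        done.append(cluster)
    return done, None

# Simulate a rollout where the second cluster fails its health check.
log = []
updated, failed = rolling_update(
    ["prod-us-east", "prod-us-west", "prod-eu-west"],
    deploy=lambda c: log.append(("deploy", c)),
    healthy=lambda c: c != "prod-us-west",
    rollback=lambda c: log.append(("rollback", c)),
)
print(updated, failed)  # ['prod-us-east'] prod-us-west
```

A configurable delay between clusters would just be a sleep between loop iterations.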
Common Cluster Stack Patterns
Pattern 1: Production Stack
Full infrastructure for production workloads:
production-stack (12 components)
├── Ingress
│ ├── nginx-ingress (multi-replica, HA)
│ └── cert-manager (Let's Encrypt, automated)
├── Monitoring
│ ├── prometheus (90-day retention)
│ ├── grafana (dashboards)
│ └── alertmanager (PagerDuty integration)
├── Logging
│ ├── fluentd (aggregate logs)
│ └── loki (log storage)
├── Security
│ ├── falco (runtime security)
│ └── OPA gatekeeper (policy enforcement)
└── Platform
├── external-dns (automatic DNS)
├── cluster-autoscaler (node scaling)
└── metrics-server (resource metrics)
Use for:
- Production clusters
- Mission-critical workloads
- Compliance-required environments
Pattern 2: Development Stack
Lightweight infrastructure for dev/test:
dev-stack (6 components)
├── Ingress
│ ├── nginx-ingress (single replica)
│ └── cert-manager (staging Let's Encrypt)
├── Monitoring
│ └── prometheus (7-day retention)
└── Platform
└── metrics-server
Use for:
- Development clusters
- Testing environments
- Short-lived clusters
Pattern 3: Monitoring-Only Stack
Add observability to existing clusters:
monitoring-stack (8 components)
├── prometheus
├── grafana
├── alertmanager
├── node-exporter
├── kube-state-metrics
├── loki
├── promtail
└── tempo (distributed tracing)
Use for:
- Clusters without monitoring
- Adding observability to acquired infrastructure
- Compliance requirements (add monitoring to legacy clusters)
Pattern 4: Security-Hardened Stack
Enhanced security for sensitive workloads:
security-stack (10 components)
├── Ingress (with ModSecurity WAF)
├── Falco (runtime threat detection)
├── OPA Gatekeeper (admission control)
├── Trivy Operator (vulnerability scanning)
├── cert-manager (with private CA)
├── External Secrets Operator
├── Network Policies (enforced)
├── Pod Security Standards (restricted)
├── Audit Logging (enhanced)
└── Encryption at Rest (enabled)
Use for:
- Healthcare (HIPAA compliance)
- Financial services (PCI-DSS)
- Government (FedRAMP, IL5)
Cluster Stack Versioning
Track infrastructure evolution over time.
Semantic Versioning
Cluster stacks use semantic versioning: MAJOR.MINOR.PATCH
Version bumps:
- MAJOR (1.0.0 → 2.0.0): Breaking changes (component removals, incompatible configs)
- MINOR (1.0.0 → 1.1.0): New components added, non-breaking updates
- PATCH (1.0.0 → 1.0.1): Bug fixes, security patches
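As an illustration of these rules, the appropriate bump between two stack snapshots could be suggested mechanically. This is a simplified sketch (it treats any in-place version update as a patch, whereas a real decision would also weigh config compatibility), not Codiac's actual versioning logic:

```python
def classify_bump(old, new):
    """Suggest a semver bump from two {component: version} snapshots."""
    if any(c not in new for c in old):
        return "major"  # component removals are breaking
    if any(c not in old for c in new):
        return "minor"  # new components, non-breaking
    if any(old[c] != new[c] for c in old):
        return "patch"  # in-place version updates (e.g. security patches)
    return "none"

base = {"prometheus": "2.48.0", "fluentd": "1.16.0"}
print(classify_bump(base, {"prometheus": "2.48.1", "fluentd": "1.16.0"}))  # patch
print(classify_bump(base, {**base, "falco": "0.36.0"}))                    # minor
print(classify_bump(base, {"prometheus": "2.48.0", "loki": "2.9.0"}))      # major
```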
Example:
production-stack:1.0.0 (Initial release)
production-stack:1.1.0 (Added Falco for security)
production-stack:1.1.1 (Updated Prometheus to patch CVE)
production-stack:2.0.0 (Removed Fluentd, migrated to Loki)
Version Tags
Tag stack versions for organization:
cod cluster stack tag production-stack:2.1.0 --add stable,recommended
cod cluster stack tag production-stack:2.0.0 --add deprecated
Common tags:
- stable - Production-ready, thoroughly tested
- beta - Testing in pre-prod
- deprecated - Old version, plan migration
- security-hardened - Enhanced security components
- minimal - Lightweight, cost-optimized
Viewing Version History
cod cluster stack versions production-stack
Output:
VERSION RELEASED COMPONENTS TAGS CLUSTERS USING
2.2.0 2026-01-23 12 assets latest 0
2.1.0 2026-01-20 12 assets stable 15
2.0.0 2025-12-15 11 assets deprecated 3
1.5.0 2025-11-01 10 assets - 0
Best Practices
1. Start with a Golden Cluster
Build infrastructure in one cluster first:
- Provision new cluster
- Install and configure all infrastructure components
- Test thoroughly (deploy sample applications)
- Document component choices and configurations
- Capture as cluster stack
Benefits:
- Iterate on infrastructure without affecting multiple clusters
- Test changes in isolation
- Create battle-tested configurations
2. Version Infrastructure Independently
Don't couple infrastructure and application versions:
- Application enterprise: my-company
- Infrastructure enterprise: infrx
Separate lifecycles:
- Infrastructure updated quarterly (stable, slow-moving)
- Applications updated daily/weekly (fast-moving)
3. Use Semantic Versioning
Be intentional about version bumps:
# Patch: Security update to cert-manager
cod cluster stack capture --version 2.1.1 --tag security-patch
# Minor: Added Falco for runtime security
cod cluster stack capture --version 2.2.0 --tag feature-addition
# Major: Removed Fluentd, replaced with Loki (breaking change)
cod cluster stack capture --version 3.0.0 --tag breaking-change
4. Test Stack Updates in Non-Prod
Never deploy new stack versions directly to production:
# Deploy to staging first
cod cluster stack deploy production-stack:2.2.0 --cluster staging-us-east
# Run smoke tests
# Monitor for 24-48 hours
# Deploy to production after validation
cod cluster stack deploy production-stack:2.2.0 --cluster prod-us-east
5. Document Stack Contents
Add descriptions to stacks:
cod cluster stack capture my-cluster \
--name production-stack-v2 \
--description "Production infrastructure with: NGINX Ingress (HA), cert-manager (Let's Encrypt), Prometheus + Grafana (90d retention), Falco (security), Fluentd (logging to S3), Istio 1.20 (service mesh), cluster-autoscaler, external-dns" \
--tag stable
Create README for stacks:
# Production Stack v2.1.0
## Components
- **Ingress**: NGINX Ingress Controller (HA, 3 replicas)
- **Certificates**: cert-manager with Let's Encrypt prod issuer
- **Monitoring**: Prometheus + Grafana (90-day retention)
- **Security**: Falco runtime security monitoring
- **Logging**: Fluentd → S3 (7-year retention for compliance)
## Requirements
- Kubernetes 1.26+
- AWS EKS or equivalent
- 5+ nodes (for HA components)
- 100GB+ storage for Prometheus
## Configuration Overrides
- `ingress.replicas`: Default 3, increase for large clusters
- `prometheus.retention`: Default 90d, adjust for compliance needs
6. Automate Stack Deployment
Infrastructure as Code for cluster stacks:
# terraform/main.tf
resource "codiac_cluster" "prod_us_west" {
name = "prod-us-west"
provider = "aws"
region = "us-west-2"
stack = "production-stack-v2:2.1.0"
}
CI/CD for stack updates:
# .github/workflows/update-infrastructure.yml
name: Update Infrastructure
on:
workflow_dispatch:
inputs:
stack_version:
description: 'Stack version to deploy'
required: true
jobs:
update_stack:
runs-on: ubuntu-latest
steps:
- name: Deploy Stack to Staging
run: |
cod cluster stack deploy production-stack-v2:${{ inputs.stack_version }} \
--cluster staging-us-east
- name: Run Tests
run: ./scripts/test-infrastructure.sh
- name: Deploy to Production
if: success()
run: |
cod cluster stack deploy production-stack-v2:${{ inputs.stack_version }} \
--environment prod \
--rolling-update \
--health-check-interval 5m
Integration with Multi-Cloud
Cluster stacks work across cloud providers with provider-specific adaptations.
Cloud-Agnostic Stacks
# Capture stack from AWS cluster
cod cluster stack capture aws-prod-cluster --name universal-stack
# Deploy to Azure cluster
cod cluster stack deploy universal-stack --cluster azure-prod-cluster
What Codiac adapts automatically:
- Load balancer types (AWS ALB → Azure Load Balancer)
- Storage classes (AWS EBS → Azure Disk)
- Cloud-specific annotations
- IAM/RBAC configurations
What remains consistent:
- Component versions
- Configurations
- Resource limits
- Monitoring setup
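A toy sketch of the kind of provider translation described above, using an explicit equivalence table (the `gp3`/`managed-csi` class names are illustrative examples; the actual adaptation is performed by Codiac, not user code):

```python
# Illustrative provider-equivalence table (not an exhaustive or official list).
ADAPTATIONS = {
    ("aws", "azure"): {
        # AWS EBS storage class names mapped to Azure Disk equivalents.
        "storageClassName": {"gp3": "managed-csi"},
    },
}

def adapt_storage_class(name, source, target):
    """Translate a storage class name between providers, if a mapping exists."""
    table = ADAPTATIONS.get((source, target), {}).get("storageClassName", {})
    return table.get(name, name)  # fall back to the original name unchanged

print(adapt_storage_class("gp3", "aws", "azure"))     # managed-csi
print(adapt_storage_class("custom", "aws", "azure"))  # custom (unchanged)
```

Load balancer types and cloud-specific annotations would be translated the same way, with per-provider tables.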
Provider-Specific Stacks
Create cloud-specific variations:
# AWS-optimized stack
cod cluster stack capture aws-cluster \
--name aws-production-stack \
--tag cloud-aws
# Azure-optimized stack
cod cluster stack capture azure-cluster \
--name azure-production-stack \
--tag cloud-azure
Use cases:
- Cloud-native integrations (AWS CloudWatch, Azure Monitor)
- Provider-specific security tools
- Optimized for cloud provider's strengths
Troubleshooting
Problem: Stack deployment fails
Error: Failed to deploy component nginx-ingress: ImagePullBackOff
Cause: Component image not available in target cluster's region/registry.
Solution:
# Check image registries
cod cluster view my-cluster | grep registry
# Configure image pull secret
cod imageRegistry pullSecret set --cluster my-cluster
# Retry deployment
cod cluster stack deploy production-stack-v2 --cluster my-cluster --retry
Problem: Stack components conflict with existing resources
Error: Conflict: Resource "ingress-nginx" already exists
Cause: Cluster already has infrastructure components installed.
Solution:
# Audit existing infrastructure
kubectl get all -A
# Selective stack deployment (skip conflicting components)
cod cluster stack deploy production-stack-v2 \
--cluster my-cluster \
--skip ingress-nginx
# Or: Uninstall conflicting component first
helm uninstall ingress-nginx -n ingress
cod cluster stack deploy production-stack-v2 --cluster my-cluster
Problem: Stack version drift
Issue: Some clusters running stack v2.0.0, others on v2.1.0.
Detection:
cod cluster list --show-stack-version
Output:
CLUSTER STACK VERSION
prod-us-east production-stack-v2 2.1.0
prod-us-west production-stack-v2 2.1.0
prod-eu-west production-stack-v2 2.0.0 ⚠️ OUTDATED
dev-us-east dev-stack 1.5.0
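Drift like this is straightforward to flag mechanically once you have each cluster's stack and version. A sketch of the detection logic, with cluster data hardcoded for illustration:

```python
def find_outdated(clusters, latest):
    """Return clusters whose stack version trails the latest for that stack."""
    return [
        (name, stack, version)
        for name, stack, version in clusters
        if stack in latest and version != latest[stack]
    ]

clusters = [
    ("prod-us-east", "production-stack-v2", "2.1.0"),
    ("prod-us-west", "production-stack-v2", "2.1.0"),
    ("prod-eu-west", "production-stack-v2", "2.0.0"),
    ("dev-us-east", "dev-stack", "1.5.0"),
]
latest = {"production-stack-v2": "2.1.0", "dev-stack": "1.5.0"}

print(find_outdated(clusters, latest))
# [('prod-eu-west', 'production-stack-v2', '2.0.0')]
```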
Solution:
# Update outdated cluster
cod cluster stack deploy production-stack-v2:2.1.0 \
--cluster prod-eu-west
FAQ
Q: Can I create a stack without a golden cluster?
A: Not recommended, but yes. You can manually define a stack using asset references. However, capturing from a working cluster ensures tested configurations.
Q: What happens to application workloads during stack updates?
A: Application workloads continue running. Infrastructure components are updated with rolling deployments. Brief traffic interruptions are possible during ingress controller updates (typically under 30 seconds).
Q: Can I use cluster stacks with existing Helm charts?
A: Yes. Cluster stacks can include Helm charts as assets. Codiac manages Helm releases as part of the stack.
Q: How do I rollback a stack deployment?
A: Deploy the previous stack version:
cod cluster stack deploy production-stack-v2:2.0.0 --cluster my-cluster
Q: Can I share stacks between enterprises?
A: Yes (Enterprise tier). Export stacks to kit libraries for cross-enterprise sharing:
cod kit create --from-stack production-stack-v2 --library @mycompany/infra
Q: Do stacks include custom Kubernetes resources (CRDs)?
A: Yes. Stacks capture operators, CRDs, and custom resources deployed in the infrastructure enterprise.
Related Documentation
- Infrastructure Enterprise (infrx)
- Kits & Component Marketplace
- Cluster Lifecycle Management
- Cluster Hopping
- Glossary: Cluster Stack
Need help with cluster stacks? Contact Support or visit codiac.io to schedule hands-on stack creation guidance.