Cluster Stacks

Standardize infrastructure foundations across Kubernetes clusters. Capture infrastructure components as versioned stacks and replicate them to new clusters for consistent platform experiences.

What is a Cluster Stack?

A cluster stack is a versioned collection of infrastructure components that provide the foundation for application workloads. Think of it as an infrastructure template that includes ingress controllers, certificate managers, monitoring agents, logging, security tools, and other cluster-wide services.

What gets included:

  • Ingress controllers (NGINX, Traefik, AWS ALB)
  • Certificate managers (cert-manager for Let's Encrypt)
  • Monitoring agents (Prometheus, Datadog, New Relic)
  • Logging infrastructure (Fluentd, Loki)
  • Security tools (Falco, OPA/Gatekeeper)
  • Service meshes (Istio, Linkerd)
  • Custom operators and CRDs

Cluster stacks enable:

  • Standardized infrastructure across teams
  • Faster cluster provisioning (minutes instead of hours)
  • Infrastructure compliance and governance
  • Version-controlled platform evolution

Business Value

Standardization:

  • Every cluster starts with the same foundation
  • Eliminate "works in one cluster but not another" problems
  • Enforce organizational security and monitoring baselines

Speed:

  • New clusters production-ready in about 15 minutes (not hours or days)
  • No manual installation of infrastructure components
  • Replicate battle-tested configurations instantly

Governance:

  • Centralized control over platform components
  • Version infrastructure independently from applications
  • Audit trail for infrastructure changes

Efficiency:

  • Platform teams maintain golden configurations
  • Application teams inherit standardized infrastructure
  • Reduce infrastructure drift across clusters

How Cluster Stacks Work

Golden Cluster                        New Cluster
      ↓                                    ↓
[Infrastructure Components]          [Empty Cluster]
      ↓                                    ↓
 cod cluster stack capture        cod cluster stack deploy
      ↓                                    ↓
  Cluster Stack (v1.0)           Replicated Infrastructure
      ↓                                    ↓
 [Versioned in Codiac]          [Same components installed]

Workflow:

  1. Configure infrastructure in a "golden" cluster
  2. Capture infrastructure as cluster stack
  3. Version and tag the stack
  4. Deploy stack to new clusters
  5. Update stack as infrastructure evolves
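
Condensed into commands (each covered in detail below), the workflow looks roughly like this; stack names and versions are illustrative:

# 1-2. Capture infrastructure from the golden cluster
cod cluster stack capture my-golden-cluster --name production-stack

# 3. Tag the captured version
cod cluster stack tag production-stack:1.0.0 --add stable

# 4. Deploy the stack to a new cluster
cod cluster stack deploy production-stack --cluster prod-us-west

# 5. Re-capture with a bumped version as infrastructure evolves
cod cluster stack capture my-golden-cluster --name production-stack --version 1.1.0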

Infrastructure Enterprise (infrx)

Cluster stacks are managed through a dedicated Infrastructure Enterprise (named infrx by convention). This separates infrastructure-level assets from application assets.

Structure:

my-company (application enterprise)
└── Environments
    └── prod
        └── Clusters
            ├── prod-us-east
            └── prod-eu-west

infrx (infrastructure enterprise)
└── Infrastructure Assets
    ├── nginx-ingress
    ├── cert-manager
    ├── prometheus
    └── fluentd

Why separate?

  • Different lifecycle (infrastructure updates less frequent)
  • Different ownership (platform team vs app teams)
  • Different permissions (admin vs developer access)
  • Clear separation of concerns

Capturing Cluster Stacks

Create a reusable infrastructure template from an existing cluster.

Basic Capture

cod cluster stack capture my-golden-cluster

What happens:

  1. Codiac scans cluster for infrastructure components
  2. Identifies assets in infrastructure enterprise (infrx)
  3. Captures component versions and configurations
  4. Creates versioned cluster stack definition
  5. Stores stack in enterprise for reuse

Captured components:

  • All assets deployed to infrx enterprise
  • Kubernetes operators (installed via Helm or manifests)
  • Custom Resource Definitions (CRDs)
  • Cluster-wide configurations
  • Infrastructure-specific secrets (references, not values)
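
To confirm exactly what a capture picked up, view the stack afterwards (full output shown later on this page). The stack name defaults to the cluster name:

cod cluster stack view my-golden-cluster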

Capture with Options

cod cluster stack capture my-golden-cluster \
  --name production-stack-v2 \
  --description "Production infrastructure with Istio + Prometheus" \
  --tag stable,prod-ready \
  --include-monitoring \
  --include-logging

Options:

  • --name: Stack name (defaults to cluster name)
  • --description: Documentation for the stack
  • --tag: Tags for organization
  • --include-*: Specific component categories
  • --exclude-*: Exclude certain components

Selective Capture

Capture only ingress infrastructure:

cod cluster stack capture my-cluster \
  --name ingress-stack \
  --include ingress,certificates

Capture monitoring stack:

cod cluster stack capture my-cluster \
  --name monitoring-stack \
  --include monitoring,alerting,dashboards

Use case: Create specialized stacks for different cluster types (dev clusters don't need full monitoring; production clusters require everything).


Viewing Cluster Stacks

List All Stacks

cod cluster stack list

Example output:

NAME                  VERSION   CREATED      COMPONENTS   TAGS
production-stack-v2   2.1.0     2026-01-20   12 assets    stable, prod-ready
dev-stack             1.5.0     2026-01-15   6 assets     dev-clusters
monitoring-stack      3.0.0     2026-01-10   8 assets     monitoring

View Stack Details

cod cluster stack view production-stack-v2

Output:

Cluster Stack: production-stack-v2
Version: 2.1.0
Created: 2026-01-20 14:30:00
Creator: platform-team@company.com
Description: Production infrastructure with Istio + Prometheus
Tags: stable, prod-ready

Components (12):
  nginx-ingress        v1.9.5
  cert-manager         v1.13.0
  prometheus           v2.48.0
  grafana              v10.2.0
  istio-base           v1.20.0
  istio-istiod         v1.20.0
  fluentd              v1.16.0
  datadog-agent        v7.50.0
  falco                v0.36.0
  external-dns         v0.14.0
  cluster-autoscaler   v1.28.0
  metrics-server       v0.6.4

Configuration Scopes:
  - Environment: prod
  - Infrastructure Enterprise: infrx

Compare Stacks

cod cluster stack diff production-stack-v2 production-stack-v1

Shows differences:

Components Added:
+ falco v0.36.0 (security monitoring)
+ external-dns v0.14.0 (automatic DNS management)

Components Updated:
✓ prometheus: v2.45.0 → v2.48.0
✓ istio-istiod: v1.19.0 → v1.20.0

Components Removed:
- kube-state-metrics v2.10.0 (replaced by Prometheus ServiceMonitor)

Configuration Changes:
  nginx-ingress:
    replica-count: 2 → 3
    resource-limits.cpu: 200m → 500m

Deploying Cluster Stacks

Apply a cluster stack to new or existing clusters.

Deploy to New Cluster

# Create cluster
cod cluster create prod-us-west \
  --provider aws \
  --region us-west-2 \
  --environment prod

# Deploy infrastructure stack
cod cluster stack deploy production-stack-v2 \
  --cluster prod-us-west

Timeline:

  • Cluster creation: 8-12 minutes (automated)
  • Stack deployment: 3-5 minutes
  • Total: ~15 minutes to production-ready cluster

Deploy to Existing Cluster

cod cluster stack deploy monitoring-stack \
  --cluster existing-cluster

Use case: Add monitoring infrastructure to a cluster that previously lacked it.

Deploy with Overrides

cod cluster stack deploy production-stack-v2 \
  --cluster prod-eu-west \
  --override ingress.replicas=5 \
  --override monitoring.retention=90d

When to override:

  • Different cluster sizes (more replicas for larger clusters)
  • Regional variations (data retention policies)
  • Environment-specific settings (dev vs prod resource limits)
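
As a sketch of how this plays out across a fleet (cluster names and override values are illustrative, and consistent with the examples above):

# Larger US clusters: more ingress replicas than the stack default
for cluster in prod-us-east prod-us-west; do
  cod cluster stack deploy production-stack-v2 \
    --cluster "$cluster" \
    --override ingress.replicas=5
done

# EU cluster: regional retention policy, default ingress sizing
cod cluster stack deploy production-stack-v2 \
  --cluster prod-eu-west \
  --override monitoring.retention=90d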

Updating Cluster Stacks

Infrastructure evolves. Update existing stacks and deploy changes.

Create New Stack Version

# Make changes to golden cluster infrastructure
cod asset deploy --enterprise infrx --cabinet prod-infra --asset prometheus --update 2.49.0

# Capture updated stack
cod cluster stack capture my-golden-cluster \
  --name production-stack-v2 \
  --version 2.2.0 \
  --tag latest

Result: New version 2.2.0 of production-stack-v2 with updated Prometheus.

Rolling Updates

Update all clusters with new stack version:

# List clusters using old stack
cod cluster list --stack production-stack-v2:2.1.0

# Update clusters one by one
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-us-east
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-us-west
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-eu-west

Automated rollout:

# Update all clusters in environment
cod cluster stack deploy production-stack-v2:2.2.0 \
  --environment prod \
  --rolling-update

Safety features:

  • One cluster at a time
  • Health checks between updates
  • Automatic rollback on failure
  • Configurable delay between clusters
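
If you prefer explicit control over the rollout, the per-cluster deploy can be scripted by hand. A minimal sketch, assuming kubeconfig contexts named after the clusters and using pod readiness in the ingress namespace as the health check:

for cluster in prod-us-east prod-us-west prod-eu-west; do
  cod cluster stack deploy production-stack-v2:2.2.0 --cluster "$cluster"

  # Health check: wait for ingress pods to become Ready before continuing
  kubectl --context "$cluster" wait pods --all \
    --namespace ingress-nginx \
    --for=condition=Ready --timeout=5m || exit 1

  sleep 300  # delay between clusters
done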

Common Cluster Stack Patterns

Pattern 1: Production Stack

Full infrastructure for production workloads:

production-stack (12 components)
├── Ingress
│   ├── nginx-ingress (multi-replica, HA)
│   └── cert-manager (Let's Encrypt, automated)
├── Monitoring
│   ├── prometheus (90-day retention)
│   ├── grafana (dashboards)
│   └── alertmanager (PagerDuty integration)
├── Logging
│   ├── fluentd (aggregate logs)
│   └── loki (log storage)
├── Security
│   ├── falco (runtime security)
│   └── OPA gatekeeper (policy enforcement)
└── Platform
    ├── external-dns (automatic DNS)
    ├── cluster-autoscaler (node scaling)
    └── metrics-server (resource metrics)

Use for:

  • Production clusters
  • Mission-critical workloads
  • Compliance-required environments

Pattern 2: Development Stack

Lightweight infrastructure for dev/test:

dev-stack (6 components)
├── Ingress
│   ├── nginx-ingress (single replica)
│   └── cert-manager (staging Let's Encrypt)
├── Monitoring
│   └── prometheus (7-day retention)
└── Platform
    └── metrics-server

Use for:

  • Development clusters
  • Testing environments
  • Short-lived clusters

Pattern 3: Monitoring-Only Stack

Add observability to existing clusters:

monitoring-stack (8 components)
├── prometheus
├── grafana
├── alertmanager
├── node-exporter
├── kube-state-metrics
├── loki
├── promtail
└── tempo (distributed tracing)

Use for:

  • Clusters without monitoring
  • Adding observability to acquired infrastructure
  • Compliance requirements (add monitoring to legacy clusters)

Pattern 4: Security-Hardened Stack

Enhanced security for sensitive workloads:

security-stack (10 components)
├── Ingress (with ModSecurity WAF)
├── Falco (runtime threat detection)
├── OPA Gatekeeper (admission control)
├── Trivy Operator (vulnerability scanning)
├── cert-manager (with private CA)
├── External Secrets Operator
├── Network Policies (enforced)
├── Pod Security Standards (restricted)
├── Audit Logging (enhanced)
└── Encryption at Rest (enabled)

Use for:

  • Healthcare (HIPAA compliance)
  • Financial services (PCI-DSS)
  • Government (FedRAMP, IL5)

Cluster Stack Versioning

Track infrastructure evolution over time.

Semantic Versioning

Cluster stacks use semantic versioning: MAJOR.MINOR.PATCH

Version bumps:

  • MAJOR (1.0.0 → 2.0.0): Breaking changes (component removals, incompatible configs)
  • MINOR (1.0.0 → 1.1.0): New components added, non-breaking updates
  • PATCH (1.0.0 → 1.0.1): Bug fixes, security patches

Example:

production-stack:1.0.0  (Initial release)
production-stack:1.1.0  (Added Falco for security)
production-stack:1.1.1  (Updated Prometheus to patch CVE)
production-stack:2.0.0  (Removed Fluentd, migrated to Loki)

Version Tags

Tag stack versions for organization:

cod cluster stack tag production-stack:2.1.0 --add stable,recommended
cod cluster stack tag production-stack:2.0.0 --add deprecated

Common tags:

  • stable - Production-ready, thoroughly tested
  • beta - Testing in pre-prod
  • deprecated - Old version, plan migration
  • security-hardened - Enhanced security components
  • minimal - Lightweight, cost-optimized

Viewing Version History

cod cluster stack versions production-stack

Output:

VERSION   RELEASED     COMPONENTS   TAGS         CLUSTERS USING
2.2.0     2026-01-23   12 assets    latest       0
2.1.0     2026-01-20   12 assets    stable       15
2.0.0     2025-12-15   11 assets    deprecated   3
1.5.0     2025-11-01   10 assets    -            0

Best Practices

1. Start with a Golden Cluster

Build infrastructure in one cluster first:

  1. Provision new cluster
  2. Install and configure all infrastructure components
  3. Test thoroughly (deploy sample applications)
  4. Document component choices and configurations
  5. Capture as cluster stack
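
An illustrative pass through these steps, using commands shown elsewhere on this page (names are examples):

# 1. Provision the golden cluster
cod cluster create golden-cluster --provider aws --region us-east-1 --environment prod

# 2-4. Install, configure, test, and document infrastructure assets
#      in the infrx enterprise (iterate until stable)

# 5. Capture the result as a versioned stack
cod cluster stack capture golden-cluster \
  --name production-stack \
  --description "Baseline production infrastructure" \
  --tag stable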

Benefits:

  • Iterate on infrastructure without affecting multiple clusters
  • Test changes in isolation
  • Create battle-tested configurations

2. Version Infrastructure Independently

Don't couple infrastructure and application versions:

  • Application enterprises: my-company
  • Infrastructure enterprise: infrx

Separate lifecycles:

  • Infrastructure updated quarterly (stable, slow-moving)
  • Applications updated daily/weekly (fast-moving)

3. Use Semantic Versioning

Be intentional about version bumps:

# Patch: Security update to cert-manager
cod cluster stack capture my-golden-cluster --version 2.1.1 --tag security-patch

# Minor: Added Falco for runtime security
cod cluster stack capture my-golden-cluster --version 2.2.0 --tag feature-addition

# Major: Removed Fluentd, replaced with Loki (breaking change)
cod cluster stack capture my-golden-cluster --version 3.0.0 --tag breaking-change

4. Test Stack Updates in Non-Prod

Never deploy new stack versions directly to production:

# Deploy to staging first
cod cluster stack deploy production-stack:2.2.0 --cluster staging-us-east

# Run smoke tests
# Monitor for 24-48 hours

# Deploy to production after validation
cod cluster stack deploy production-stack:2.2.0 --cluster prod-us-east

5. Document Stack Contents

Add descriptions to stacks:

cod cluster stack capture my-cluster \
  --name production-stack-v2 \
  --description "Production infrastructure with: NGINX Ingress (HA), cert-manager (Let's Encrypt), Prometheus + Grafana (90d retention), Falco (security), Fluentd (logging to S3), Istio 1.20 (service mesh), cluster-autoscaler, external-dns" \
  --tag stable

Create README for stacks:

# Production Stack v2.1.0

## Components
- **Ingress**: NGINX Ingress Controller (HA, 3 replicas)
- **Certificates**: cert-manager with Let's Encrypt prod issuer
- **Monitoring**: Prometheus + Grafana (90-day retention)
- **Security**: Falco runtime security monitoring
- **Logging**: Fluentd → S3 (7-year retention for compliance)

## Requirements
- Kubernetes 1.26+
- AWS EKS or equivalent
- 5+ nodes (for HA components)
- 100GB+ storage for Prometheus

## Configuration Overrides
- `ingress.replicas`: Default 3, increase for large clusters
- `prometheus.retention`: Default 90d, adjust for compliance needs

6. Automate Stack Deployment

Infrastructure as Code for cluster stacks:

# terraform/main.tf
resource "codiac_cluster" "prod_us_west" {
  name     = "prod-us-west"
  provider = "aws"
  region   = "us-west-2"

  stack = "production-stack-v2:2.1.0"
}

CI/CD for stack updates:

# .github/workflows/update-infrastructure.yml
name: Update Infrastructure

on:
  workflow_dispatch:
    inputs:
      stack_version:
        description: 'Stack version to deploy'
        required: true

jobs:
  update_stack:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy Stack to Staging
        run: |
          cod cluster stack deploy production-stack-v2:${{ inputs.stack_version }} \
            --cluster staging-us-east

      - name: Run Tests
        run: ./scripts/test-infrastructure.sh

      - name: Deploy to Production
        if: success()
        run: |
          cod cluster stack deploy production-stack-v2:${{ inputs.stack_version }} \
            --environment prod \
            --rolling-update \
            --health-check-interval 5m

Integration with Multi-Cloud

Cluster stacks work across cloud providers with provider-specific adaptations.

Cloud-Agnostic Stacks

# Capture stack from AWS cluster
cod cluster stack capture aws-prod-cluster --name universal-stack

# Deploy to Azure cluster
cod cluster stack deploy universal-stack --cluster azure-prod-cluster

What Codiac adapts automatically:

  • Load balancer types (AWS ALB → Azure Load Balancer)
  • Storage classes (AWS EBS → Azure Disk)
  • Cloud-specific annotations
  • IAM/RBAC configurations

What remains consistent:

  • Component versions
  • Configurations
  • Resource limits
  • Monitoring setup
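
After a cross-cloud deploy, a quick spot-check of these adaptations is to inspect the adapted resources directly (namespace names here are illustrative):

# Storage classes should reference the target cloud's provisioner
kubectl get storageclass

# The ingress Service should expose the target cloud's load balancer
kubectl get svc -n ingress-nginx -o wide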

Provider-Specific Stacks

Create cloud-specific variations:

# AWS-optimized stack
cod cluster stack capture aws-cluster \
  --name aws-production-stack \
  --tag cloud-aws

# Azure-optimized stack
cod cluster stack capture azure-cluster \
  --name azure-production-stack \
  --tag cloud-azure

Use cases:

  • Cloud-native integrations (AWS CloudWatch, Azure Monitor)
  • Provider-specific security tools
  • Optimized for cloud provider's strengths

Troubleshooting

Problem: Stack deployment fails

Error: Failed to deploy component nginx-ingress: ImagePullBackOff

Cause: Component image not available in the target cluster's region or registry.

Solution:

# Check image registries
cod cluster view my-cluster | grep registry

# Configure image pull secret
cod imageRegistry pullSecret set --cluster my-cluster

# Retry deployment
cod cluster stack deploy production-stack-v2 --cluster my-cluster --retry

Problem: Stack components conflict with existing resources

Error: Conflict: Resource "ingress-nginx" already exists

Cause: Cluster already has infrastructure components installed.

Solution:

# Audit existing infrastructure
kubectl get all -A

# Selective stack deployment (skip conflicting components)
cod cluster stack deploy production-stack-v2 \
  --cluster my-cluster \
  --skip ingress-nginx

# Or: Uninstall conflicting component first
helm uninstall ingress-nginx -n ingress
cod cluster stack deploy production-stack-v2 --cluster my-cluster

Problem: Stack version drift

Issue: Some clusters are running stack v2.0.0 while others are on v2.1.0.

Detection:

cod cluster list --show-stack-version

Output:

CLUSTER        STACK                 VERSION
prod-us-east   production-stack-v2   2.1.0
prod-us-west   production-stack-v2   2.1.0
prod-eu-west   production-stack-v2   2.0.0   ⚠️ OUTDATED
dev-us-east    dev-stack             1.5.0

Solution:

# Update outdated cluster
cod cluster stack deploy production-stack-v2:2.1.0 \
  --cluster prod-eu-west

FAQ

Q: Can I create a stack without a golden cluster?

A: Not recommended, but yes. You can manually define a stack using asset references. However, capturing from a working cluster ensures tested configurations.

Q: What happens to application workloads during stack updates?

A: Application workloads continue running. Infrastructure components are updated with rolling deployments. Brief traffic interruptions are possible during ingress controller updates (typically under 30 seconds).

Q: Can I use cluster stacks with existing Helm charts?

A: Yes. Cluster stacks can include Helm charts as assets. Codiac manages Helm releases as part of the stack.

Q: How do I rollback a stack deployment?

A: Deploy the previous stack version:

cod cluster stack deploy production-stack-v2:2.0.0 --cluster my-cluster

Q: Can I share stacks between enterprises?

A: Yes (Enterprise tier). Export stacks to kit libraries for cross-enterprise sharing:

cod kit create --from-stack production-stack-v2 --library @mycompany/infra

Q: Do stacks include custom Kubernetes resources (CRDs)?

A: Yes. Stacks capture operators, CRDs, and custom resources deployed in the infrastructure enterprise.
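
To see which CRDs a cluster currently defines (and which a capture would therefore pick up), list them directly:

kubectl get crds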



Need help with cluster stacks? Contact Support or visit codiac.io to schedule hands-on stack creation guidance.