Cluster Stacks
Standardize infrastructure foundations across Kubernetes clusters. Capture infrastructure components as versioned stacks and replicate them to new clusters for consistent platform experiences.
What is a Cluster Stack?
A cluster stack is a versioned collection of infrastructure components that provide the foundation for application workloads. Think of it as an infrastructure template that includes ingress controllers, certificate managers, monitoring agents, logging, security tools, and other cluster-wide services.
What gets included:
- Ingress controllers (NGINX, Traefik, AWS ALB)
- Certificate managers (cert-manager for Let's Encrypt)
- Monitoring agents (Prometheus, Datadog, New Relic)
- Logging infrastructure (Fluentd, Loki)
- Security tools (Falco, OPA/Gatekeeper)
- Service meshes (Istio, Linkerd)
- Custom operators and CRDs
Cluster stacks enable:
- Standardized infrastructure across teams
- Faster cluster provisioning (minutes instead of hours)
- Infrastructure compliance and governance
- Version-controlled platform evolution
Business Value
Standardization:
- Every cluster starts with the same foundation
- Eliminate "works in one cluster but not another" problems
- Enforce organizational security and monitoring baselines
Speed:
- New clusters production-ready in about 15 minutes (not hours or days)
- No manual installation of infrastructure components
- Replicate battle-tested configurations instantly
Governance:
- Centralized control over platform components
- Version infrastructure independently from applications
- Audit trail for infrastructure changes
Efficiency:
- Platform teams maintain golden configurations
- Application teams inherit standardized infrastructure
- Reduce infrastructure drift across clusters
How Cluster Stacks Work
Golden Cluster                         New Cluster
      ↓                                     ↓
[Infrastructure Components]           [Empty Cluster]
      ↓                                     ↓
cod cluster stack capture             cod cluster stack deploy
      ↓                                     ↓
Cluster Stack (v1.0)                  Replicated Infrastructure
      ↓                                     ↓
[Versioned in Codiac]                 [Same components installed]
Workflow:
- Configure infrastructure in a "golden" cluster
- Capture infrastructure as cluster stack
- Version and tag the stack
- Deploy stack to new clusters
- Update stack as infrastructure evolves
Infrastructure Enterprise (infrx)
Cluster stacks are managed through a dedicated Infrastructure Enterprise (named infrx by convention). This separates infrastructure-level assets from application assets.
Structure:
my-company (application enterprise)
└── Environments
├── prod
│ └── Clusters
│ ├── prod-us-east
│ └── prod-eu-west
infrx (infrastructure enterprise)
└── Infrastructure Assets
├── nginx-ingress
├── cert-manager
├── prometheus
└── fluentd
Why separate?
- Different lifecycle (infrastructure updates less frequent)
- Different ownership (platform team vs app teams)
- Different permissions (admin vs developer access)
- Clear separation of concerns
Capturing Cluster Stacks
Create a reusable infrastructure template from an existing cluster.
Basic Capture
cod cluster stack capture my-golden-cluster
What happens:
- Codiac scans cluster for infrastructure components
- Identifies assets in the infrastructure enterprise (infrx)
- Captures component versions and configurations
- Creates versioned cluster stack definition
- Stores stack in enterprise for reuse
Captured components:
- All assets deployed to the infrx enterprise
- Kubernetes operators (installed via Helm or manifests)
- Custom Resource Definitions (CRDs)
- Cluster-wide configurations
- Infrastructure-specific secrets (references, not values)
Capture with Options
cod cluster stack capture my-golden-cluster \
--name production-stack-v2 \
--description "Production infrastructure with Istio + Prometheus" \
--tag stable,prod-ready \
--include-monitoring \
--include-logging
Options:
- --name: Stack name (defaults to cluster name)
- --description: Documentation for the stack
- --tag: Tags for organization
- --include-*: Include specific component categories
- --exclude-*: Exclude certain components
Selective Capture
Capture only ingress infrastructure:
cod cluster stack capture my-cluster \
--name ingress-stack \
--include ingress,certificates
Capture monitoring stack:
cod cluster stack capture my-cluster \
--name monitoring-stack \
--include monitoring,alerting,dashboards
Use case: Create specialized stacks for different cluster types (dev clusters don't need full monitoring, production clusters require everything).
Viewing Cluster Stacks
List All Stacks
cod cluster stack list
Example output:
NAME VERSION CREATED COMPONENTS TAGS
production-stack-v2 2.1.0 2026-01-20 12 assets stable, prod-ready
dev-stack 1.5.0 2026-01-15 6 assets dev-clusters
monitoring-stack 3.0.0 2026-01-10 8 assets monitoring
View Stack Details
cod cluster stack view production-stack-v2
Output:
Cluster Stack: production-stack-v2
Version: 2.1.0
Created: 2026-01-20 14:30:00
Creator: platform-team@company.com
Description: Production infrastructure with Istio + Prometheus
Tags: stable, prod-ready
Components (12):
nginx-ingress v1.9.5
cert-manager v1.13.0
prometheus v2.48.0
grafana v10.2.0
istio-base v1.20.0
istio-istiod v1.20.0
fluentd v1.16.0
datadog-agent v7.50.0
falco v0.36.0
external-dns v0.14.0
cluster-autoscaler v1.28.0
metrics-server v0.6.4
Configuration Scopes:
- Environment: prod
- Infrastructure Enterprise: infrx
Compare Stacks
cod cluster stack diff production-stack-v2 production-stack-v1
Shows differences:
Components Added:
+ falco v0.36.0 (security monitoring)
+ external-dns v0.14.0 (automatic DNS management)
Components Updated:
✓ prometheus: v2.45.0 → v2.48.0
✓ istio-istiod: v1.19.0 → v1.20.0
Components Removed:
- kube-state-metrics v2.10.0 (replaced by Prometheus ServiceMonitor)
Configuration Changes:
nginx-ingress:
replica-count: 2 → 3
resource-limits.cpu: 200m → 500m
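The component portion of a diff like this boils down to comparing two name-to-version mappings. A minimal sketch of that comparison, assuming each stack snapshot is a plain `{component: version}` dict (the real data model is Codiac-internal):

```python
def diff_stacks(old, new):
    """Compare two stack snapshots given as {component: version} dicts."""
    added = {c: v for c, v in new.items() if c not in old}
    removed = {c: v for c, v in old.items() if c not in new}
    updated = {c: (old[c], new[c])
               for c in old.keys() & new.keys() if old[c] != new[c]}
    return added, updated, removed

v1 = {"prometheus": "2.45.0", "istio-istiod": "1.19.0",
      "kube-state-metrics": "2.10.0"}
v2 = {"prometheus": "2.48.0", "istio-istiod": "1.20.0", "falco": "0.36.0"}

added, updated, removed = diff_stacks(v1, v2)
print(added)    # falco shows up as a new component
print(updated)  # prometheus and istio-istiod version bumps
print(removed)  # kube-state-metrics was dropped
```

Configuration changes (like the nginx-ingress replica count above) would be diffed the same way over nested config values.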
Deploying Cluster Stacks
Apply a cluster stack to new or existing clusters.
Deploy to New Cluster
# Create cluster
cod cluster create prod-us-west \
--provider aws \
--region us-west-2 \
--environment prod
# Deploy infrastructure stack
cod cluster stack deploy production-stack-v2 \
--cluster prod-us-west
Timeline:
- Cluster creation: 8-12 minutes (automated)
- Stack deployment: 3-5 minutes
- Total: ~15 minutes to production-ready cluster
Deploy to Existing Cluster
cod cluster stack deploy monitoring-stack \
--cluster existing-cluster
Use case: Add monitoring infrastructure to a cluster that previously lacked it.
Deploy with Overrides
cod cluster stack deploy production-stack-v2 \
--cluster prod-eu-west \
--override ingress.replicas=5 \
--override monitoring.retention=90d
When to override:
- Different cluster sizes (more replicas for larger clusters)
- Regional variations (data retention policies)
- Environment-specific settings (dev vs prod resource limits)
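Dotted-path overrides like `ingress.replicas=5` can be thought of as targeted patches on a nested configuration. A minimal sketch of that merge, assuming the stack config is a plain nested dict (illustrative only, not Codiac's actual override mechanism):

```python
def apply_override(config, path, value):
    """Set a dotted-path override (e.g. 'ingress.replicas') on a nested dict."""
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        # Walk down, creating intermediate sections as needed.
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

config = {"ingress": {"replicas": 3}, "monitoring": {"retention": "30d"}}
apply_override(config, "ingress.replicas", 5)
apply_override(config, "monitoring.retention", "90d")
print(config)  # {'ingress': {'replicas': 5}, 'monitoring': {'retention': '90d'}}
```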
Updating Cluster Stacks
Infrastructure evolves. Update existing stacks and deploy changes.
Create New Stack Version
# Make changes to golden cluster infrastructure
cod asset deploy --enterprise infrx --cabinet prod-infra --asset prometheus --update 2.49.0
# Capture updated stack
cod cluster stack capture my-golden-cluster \
--name production-stack-v2 \
--version 2.2.0 \
--tag latest
Result: New version 2.2.0 of production-stack-v2 with updated Prometheus.
Rolling Updates
Update all clusters with new stack version:
# List clusters using old stack
cod cluster list --stack production-stack-v2:2.1.0
# Update clusters one by one
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-us-east
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-us-west
cod cluster stack deploy production-stack-v2:2.2.0 --cluster prod-eu-west
Automated rollout:
# Update all clusters in environment
cod cluster stack deploy production-stack-v2:2.2.0 \
--environment prod \
--rolling-update
Safety features:
- One cluster at a time
- Health checks between updates
- Automatic rollback on failure
- Configurable delay between clusters
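The safety behavior above can be sketched as a simple orchestration loop: update one cluster, health-check it, roll back and stop on the first failure. Everything here (the `deploy`, `healthy`, and `rollback` callables) is hypothetical stand-in logic; the real rollout runs inside Codiac:

```python
def rolling_update(clusters, deploy, healthy, rollback):
    """Update clusters one at a time; stop and roll back on the first failure."""
    done = []
    for cluster in clusters:
        deploy(cluster)
        if not healthy(cluster):
            rollback(cluster)     # automatic rollback on failed health check
            return done, cluster  # clusters updated so far, failed cluster
        done.append(cluster)
    return done, None

# Simulate a rollout where the second cluster fails its health check.
log = []
updated, failed = rolling_update(
    ["prod-us-east", "prod-us-west", "prod-eu-west"],
    deploy=lambda c: log.append(("deploy", c)),
    healthy=lambda c: c != "prod-us-west",
    rollback=lambda c: log.append(("rollback", c)),
)
print(updated, failed)  # ['prod-us-east'] prod-us-west
```

A configurable delay between clusters would just be a sleep between loop iterations.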
Common Cluster Stack Patterns
Pattern 1: Production Stack
Full infrastructure for production workloads:
production-stack (12 components)
├── Ingress
│ ├── nginx-ingress (multi-replica, HA)
│ └── cert-manager (Let's Encrypt, automated)
├── Monitoring
│ ├── prometheus (90-day retention)
│ ├── grafana (dashboards)
│ └── alertmanager (PagerDuty integration)
├── Logging
│ ├── fluentd (aggregate logs)
│ └── loki (log storage)
├── Security
│ ├── falco (runtime security)
│ └── OPA gatekeeper (policy enforcement)
└── Platform
├── external-dns (automatic DNS)
├── cluster-autoscaler (node scaling)
└── metrics-server (resource metrics)
Use for:
- Production clusters
- Mission-critical workloads
- Compliance-required environments
Pattern 2: Development Stack
Lightweight infrastructure for dev/test:
dev-stack (6 components)
├── Ingress
│ ├── nginx-ingress (single replica)
│ └── cert-manager (staging Let's Encrypt)
├── Monitoring
│ └── prometheus (7-day retention)
└── Platform
└── metrics-server
Use for:
- Development clusters
- Testing environments
- Short-lived clusters
Pattern 3: Monitoring-Only Stack
Add observability to existing clusters:
monitoring-stack (8 components)
├── prometheus
├── grafana
├── alertmanager
├── node-exporter
├── kube-state-metrics
├── loki
├── promtail
└── tempo (distributed tracing)
Use for:
- Clusters without monitoring
- Adding observability to acquired infrastructure
- Compliance requirements (add monitoring to legacy clusters)
Pattern 4: Security-Hardened Stack
Enhanced security for sensitive workloads:
security-stack (10 components)
├── Ingress (with ModSecurity WAF)
├── Falco (runtime threat detection)
├── OPA Gatekeeper (admission control)
├── Trivy Operator (vulnerability scanning)
├── cert-manager (with private CA)
├── External Secrets Operator
├── Network Policies (enforced)
├── Pod Security Standards (restricted)
├── Audit Logging (enhanced)
└── Encryption at Rest (enabled)
Use for:
- Healthcare (HIPAA compliance)
- Financial services (PCI-DSS)
- Government (FedRAMP, IL5)
Cluster Stack Versioning
Track infrastructure evolution over time.
Semantic Versioning
Cluster stacks use semantic versioning: MAJOR.MINOR.PATCH
Version bumps:
- MAJOR (1.0.0 → 2.0.0): Breaking changes (component removals, incompatible configs)
- MINOR (1.0.0 → 1.1.0): New components added, non-breaking updates
- PATCH (1.0.0 → 1.0.1): Bug fixes, security patches
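As an illustration of these rules, the appropriate bump between two stack snapshots could be suggested mechanically. This is a simplified sketch (it treats any in-place version update as a patch, whereas a real decision would also weigh config compatibility), not Codiac's actual versioning logic:

```python
def classify_bump(old, new):
    """Suggest a semver bump from two {component: version} snapshots."""
    if any(c not in new for c in old):
        return "major"  # component removals are breaking
    if any(c not in old for c in new):
        return "minor"  # new components, non-breaking
    if any(old[c] != new[c] for c in old):
        return "patch"  # in-place version updates (e.g. security patches)
    return "none"

base = {"prometheus": "2.48.0", "fluentd": "1.16.0"}
print(classify_bump(base, {"prometheus": "2.48.1", "fluentd": "1.16.0"}))  # patch
print(classify_bump(base, {**base, "falco": "0.36.0"}))                    # minor
print(classify_bump(base, {"prometheus": "2.48.0", "loki": "2.9.0"}))      # major
```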
Example:
production-stack:1.0.0 (Initial release)
production-stack:1.1.0 (Added Falco for security)
production-stack:1.1.1 (Updated Prometheus to patch CVE)
production-stack:2.0.0 (Removed Fluentd, migrated to Loki)
Version Tags
Tag stack versions for organization:
cod cluster stack tag production-stack:2.1.0 --add stable,recommended
cod cluster stack tag production-stack:2.0.0 --add deprecated
Common tags:
- stable - Production-ready, thoroughly tested
- beta - Testing in pre-prod
- deprecated - Old version, plan migration
- security-hardened - Enhanced security components
- minimal - Lightweight, cost-optimized
Viewing Version History
cod cluster stack versions production-stack
Output:
VERSION RELEASED COMPONENTS TAGS CLUSTERS USING
2.2.0 2026-01-23 12 assets latest 0
2.1.0 2026-01-20 12 assets stable 15
2.0.0 2025-12-15 11 assets deprecated 3
1.5.0 2025-11-01 10 assets - 0
Best Practices
1. Start with a Golden Cluster
Build infrastructure in one cluster first:
- Provision new cluster
- Install and configure all infrastructure components
- Test thoroughly (deploy sample applications)
- Document component choices and configurations
- Capture as cluster stack
Benefits:
- Iterate on infrastructure without affecting multiple clusters
- Test changes in isolation
- Create battle-tested configurations
2. Version Infrastructure Independently
Don't couple infrastructure and application versions:
- Application enterprise: my-company
- Infrastructure enterprise: infrx
Separate lifecycles:
- Infrastructure updated quarterly (stable, slow-moving)
- Applications updated daily/weekly (fast-moving)
3. Use Semantic Versioning
Be intentional about version bumps:
# Patch: Security update to cert-manager
cod cluster stack capture --version 2.1.1 --tag security-patch
# Minor: Added Falco for runtime security
cod cluster stack capture --version 2.2.0 --tag feature-addition
# Major: Removed Fluentd, replaced with Loki (breaking change)
cod cluster stack capture --version 3.0.0 --tag breaking-change
4. Test Stack Updates in Non-Prod
Never deploy new stack versions directly to production:
# Deploy to staging first
cod cluster stack deploy production-stack:2.2.0 --cluster staging-us-east
# Run smoke tests
# Monitor for 24-48 hours
# Deploy to production after validation
cod cluster stack deploy production-stack:2.2.0 --cluster prod-us-east
5. Document Stack Contents
Add descriptions to stacks:
cod cluster stack capture my-cluster \
--name production-stack-v2 \
--description "Production infrastructure with: NGINX Ingress (HA), cert-manager (Let's Encrypt), Prometheus + Grafana (90d retention), Falco (security), Fluentd (logging to S3), Istio 1.20 (service mesh), cluster-autoscaler, external-dns" \
--tag stable
Create README for stacks:
# Production Stack v2.1.0
## Components
- **Ingress**: NGINX Ingress Controller (HA, 3 replicas)
- **Certificates**: cert-manager with Let's Encrypt prod issuer
- **Monitoring**: Prometheus + Grafana (90-day retention)
- **Security**: Falco runtime security monitoring
- **Logging**: Fluentd → S3 (7-year retention for compliance)
## Requirements
- Kubernetes 1.26+
- AWS EKS or equivalent
- 5+ nodes (for HA components)
- 100GB+ storage for Prometheus
## Configuration Overrides
- `ingress.replicas`: Default 3, increase for large clusters
- `prometheus.retention`: Default 90d, adjust for compliance needs
6. Automate Stack Deployment
Infrastructure as Code for cluster stacks:
# terraform/main.tf
resource "codiac_cluster" "prod_us_west" {
name = "prod-us-west"
provider = "aws"
region = "us-west-2"
stack = "production-stack-v2:2.1.0"
}
CI/CD for stack updates:
# .github/workflows/update-infrastructure.yml
name: Update Infrastructure
on:
workflow_dispatch:
inputs:
stack_version:
description: 'Stack version to deploy'
required: true
jobs:
update_stack:
runs-on: ubuntu-latest
steps:
- name: Deploy Stack to Staging
run: |
cod cluster stack deploy production-stack-v2:${{ inputs.stack_version }} \
--cluster staging-us-east
- name: Run Tests
run: ./scripts/test-infrastructure.sh
- name: Deploy to Production
if: success()
run: |
cod cluster stack deploy production-stack-v2:${{ inputs.stack_version }} \
--environment prod \
--rolling-update \
--health-check-interval 5m
Integration with Multi-Cloud
Cluster stacks work across cloud providers with provider-specific adaptations.
Cloud-Agnostic Stacks
# Capture stack from AWS cluster
cod cluster stack capture aws-prod-cluster --name universal-stack
# Deploy to Azure cluster
cod cluster stack deploy universal-stack --cluster azure-prod-cluster
What Codiac adapts automatically:
- Load balancer types (AWS ALB → Azure Load Balancer)
- Storage classes (AWS EBS → Azure Disk)
- Cloud-specific annotations
- IAM/RBAC configurations
What remains consistent:
- Component versions
- Configurations
- Resource limits
- Monitoring setup
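A toy sketch of the kind of provider translation described above, using an explicit equivalence table (the `gp3`/`managed-csi` class names are illustrative examples; the actual adaptation is performed by Codiac, not user code):

```python
# Illustrative provider-equivalence table (not an exhaustive or official list).
ADAPTATIONS = {
    ("aws", "azure"): {
        # AWS EBS storage class names mapped to Azure Disk equivalents.
        "storageClassName": {"gp3": "managed-csi"},
    },
}

def adapt_storage_class(name, source, target):
    """Translate a storage class name between providers, if a mapping exists."""
    table = ADAPTATIONS.get((source, target), {}).get("storageClassName", {})
    return table.get(name, name)  # fall back to the original name unchanged

print(adapt_storage_class("gp3", "aws", "azure"))     # managed-csi
print(adapt_storage_class("custom", "aws", "azure"))  # custom (unchanged)
```

Load balancer types and cloud-specific annotations would be translated the same way, with per-provider tables.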
Provider-Specific Stacks
Create cloud-specific variations:
# AWS-optimized stack
cod cluster stack capture aws-cluster \
--name aws-production-stack \
--tag cloud-aws
# Azure-optimized stack
cod cluster stack capture azure-cluster \
--name azure-production-stack \
--tag cloud-azure
Use cases:
- Cloud-native integrations (AWS CloudWatch, Azure Monitor)
- Provider-specific security tools
- Optimized for cloud provider's strengths
Troubleshooting
Problem: Stack deployment fails
Error: Failed to deploy component nginx-ingress: ImagePullBackOff
Cause: Component image not available in target cluster's region/registry.
Solution:
# Check image registries
cod cluster view my-cluster | grep registry
# Configure image pull secret
cod imageRegistry pullSecret set --cluster my-cluster
# Retry deployment
cod cluster stack deploy production-stack-v2 --cluster my-cluster --retry
Problem: Stack components conflict with existing resources
Error: Conflict: Resource "ingress-nginx" already exists
Cause: Cluster already has infrastructure components installed.
Solution:
# Audit existing infrastructure
kubectl get all -A
# Selective stack deployment (skip conflicting components)
cod cluster stack deploy production-stack-v2 \
--cluster my-cluster \
--skip ingress-nginx
# Or: Uninstall conflicting component first
helm uninstall ingress-nginx -n ingress
cod cluster stack deploy production-stack-v2 --cluster my-cluster
Problem: Stack version drift
Issue: Some clusters running stack v2.0.0, others on v2.1.0.
Detection:
cod cluster list --show-stack-version
Output:
CLUSTER STACK VERSION
prod-us-east production-stack-v2 2.1.0
prod-us-west production-stack-v2 2.1.0
prod-eu-west production-stack-v2 2.0.0 ⚠️ OUTDATED
dev-us-east dev-stack 1.5.0
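Drift like this is straightforward to flag mechanically once you have each cluster's stack and version. A sketch of the detection logic, with cluster data hardcoded for illustration:

```python
def find_outdated(clusters, latest):
    """Return clusters whose stack version trails the latest for that stack."""
    return [
        (name, stack, version)
        for name, stack, version in clusters
        if stack in latest and version != latest[stack]
    ]

clusters = [
    ("prod-us-east", "production-stack-v2", "2.1.0"),
    ("prod-us-west", "production-stack-v2", "2.1.0"),
    ("prod-eu-west", "production-stack-v2", "2.0.0"),
    ("dev-us-east", "dev-stack", "1.5.0"),
]
latest = {"production-stack-v2": "2.1.0", "dev-stack": "1.5.0"}

print(find_outdated(clusters, latest))
# [('prod-eu-west', 'production-stack-v2', '2.0.0')]
```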
Solution:
# Update outdated cluster
cod cluster stack deploy production-stack-v2:2.1.0 \
--cluster prod-eu-west
FAQ
Q: Can I create a stack without a golden cluster?
A: Not recommended, but yes. You can manually define a stack using asset references. However, capturing from a working cluster ensures tested configurations.
Q: What happens to application workloads during stack updates?
A: Application workloads continue running. Infrastructure components are updated with rolling deployments. Brief traffic interruptions are possible during ingress controller updates (typically under 30 seconds).
Q: Can I use cluster stacks with existing Helm charts?
A: Yes. Cluster stacks can include Helm charts as assets. Codiac manages Helm releases as part of the stack.
Q: How do I rollback a stack deployment?
A: Deploy the previous stack version:
cod cluster stack deploy production-stack-v2:2.0.0 --cluster my-cluster
Q: Can I share stacks between enterprises?
A: Yes (Enterprise tier). Export stacks to kit libraries for cross-enterprise sharing:
cod kit create --from-stack production-stack-v2 --library @mycompany/infra
Q: Do stacks include custom Kubernetes resources (CRDs)?
A: Yes. Stacks capture operators, CRDs, and custom resources deployed in the infrastructure enterprise.
Related Documentation
- Infrastructure Enterprise (infrx)
- Kits & Component Marketplace
- Cluster Lifecycle Management
- Cluster Hopping
- Glossary: Cluster Stack
Need help with cluster stacks? Contact Support or visit codiac.io to schedule hands-on stack creation guidance.