Table of Contents
- Overview
- The Architecture
- Architecture Design
- The Implementation Journey
- Technology Stack
- Key Decisions
- Final Directory Structure
- Common Operations
- Lessons Learned
- Troubleshooting Reference
- Next Steps
- References
Overview
This guide documents building a production-ready four-node Talos Linux Kubernetes lab with multi-cluster architecture. The implementation features immutable infrastructure, GitOps workflows, and proper separation of stateful and stateless workloads across two distinct clusters.
What You’ll Learn:
- Setting up Talos Linux on bare metal hardware
- Designing multi-cluster architecture for workload isolation
- Implementing Cilium CNI and Democratic CSI storage
- GitOps deployment with Flux CD
- Converting legacy k3s nodes to Talos (bonus section)
The Architecture
The lab consists of four physical nodes running Talos Linux, organized into two Kubernetes clusters with distinct purposes. This design provides workload isolation, independent scaling, and proper security boundaries.
Key Design Principles:
- Immutable infrastructure with API-driven management
- Separation of stateful and stateless workloads
- High availability for critical database services
- GitOps-driven deployments with version control
- Comprehensive secret management with SOPS encryption
Architecture Design
Four-Node Talos Cluster Design
The environment consists of two distinct Kubernetes clusters:
DB Cluster (stateful workloads)
- Purpose: Databases, persistent storage, stateful apps
- Nodes: 3x control plane nodes (10.0.0.102-104)
- Management: Flux GitOps for continuous delivery
- Storage: Democratic CSI providing iSCSI persistent volumes
- Config: lab/db-cluster/
App Cluster (stateless workloads)
- Purpose: Stateless apps, web services
- Nodes: 1x control plane node (10.0.0.115)
- Networking: Cilium CNI with L2 LoadBalancer support
- Config: lab/app-cluster/
Network Layout
Physical Hardware:
- 4x bare metal nodes running Talos Linux
- Organized into 2 Kubernetes clusters
- Static IP addressing for stability
- Shared network infrastructure (10.0.0.0/24)
DB Cluster (3 nodes):
- talos-0lj-bma: 10.0.0.104 (control plane)
- talos-6qj-6v8: 10.0.0.103 (control plane)
- talos-mf1-tt5: 10.0.0.102 (control plane)
App Cluster (1 node):
- app-cp1: 10.0.0.115 (control plane)
- VIP: 10.0.0.170
Network:
- Gateway: 10.0.0.1
- DNS: 10.0.0.1, 8.8.8.8
- Subnet: 10.0.0.0/24
See network configs: lab/db-cluster/network-patch.yaml
The Implementation Journey
Phase 1: Planning and Hardware Preparation
Before implementation, plan your cluster architecture and prepare hardware.
Hardware Requirements:
- 4x physical machines (bare metal or VMs)
- Each node: 4+ CPU cores, 8GB+ RAM, 50GB+ storage
- Network connectivity on same subnet
- USB drives for Talos installation
Download Talos ISO:
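A minimal sketch of this step, assuming the stock v1.11.3 metal image from the Talos GitHub releases (nodes that need system extensions baked in can use a factory.talos.dev image built from schematic.yaml instead); /dev/sdX is a placeholder for your USB device:

```bash
# Download the Talos metal ISO (the version used in this lab).
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.3/metal-amd64.iso

# Sanity-check the download, then write it to the USB drive.
ls -lh metal-amd64.iso
sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress conv=fsync
```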
Network Planning:
- DB Cluster: 10.0.0.102, 10.0.0.103, 10.0.0.104
- App Cluster: 10.0.0.115
- Gateway: 10.0.0.1
- DNS: 10.0.0.1, 8.8.8.8
Phase 2: DB Cluster Setup (Three-Node HA)
The DB cluster provides high availability for stateful workloads with three control plane nodes.
Step 1: Boot Nodes into Maintenance Mode
For each of the three DB cluster nodes:
- Insert USB with Talos ISO
- Boot from USB
- Node comes up in maintenance mode
- Note the IP address assigned via DHCP
Step 2: Generate Cluster Configs
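A sketch of the generation step, assuming the cluster is named db-cluster, the first control plane node (10.0.0.102) serves as the initial API endpoint, and output lands in a gitignored _out/ directory:

```bash
# Generate Talos secrets and machine configs for the DB cluster.
talosctl gen config db-cluster https://10.0.0.102:6443 --output-dir _out
```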
This creates base configs for control plane and worker nodes.
See the DB cluster configs: lab/db-cluster/
Step 3: Create Network Patches
Each node needs a unique network configuration with a static IP:
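A sketch of one node's patch (the node at 10.0.0.102), assuming the NIC shows up as eth0; the file name is illustrative, and each node gets its own copy with its hostname and address, along the lines of the repo's network-patch.yaml:

```bash
cat > network-patch-10.0.0.102.yaml <<'EOF'
machine:
  network:
    hostname: talos-mf1-tt5
    interfaces:
      - interface: eth0          # adjust to the actual NIC name
        dhcp: false
        addresses:
          - 10.0.0.102/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
    nameservers:
      - 10.0.0.1
      - 8.8.8.8
EOF
```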
See actual network patches: lab/db-cluster/network-patch.yaml
Step 4: Apply Configs and Bootstrap
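A sketch of applying the configs and bootstrapping, assuming the per-node patch files from the previous step and the generated configs in _out/:

```bash
# Apply the control plane config plus each node's network patch.
# The nodes are still in maintenance mode, so --insecure is required.
for node in 10.0.0.102 10.0.0.103 10.0.0.104; do
  talosctl apply-config --insecure \
    --nodes "$node" \
    --file _out/controlplane.yaml \
    --config-patch @network-patch-"$node".yaml
done

# Bootstrap etcd exactly once, against exactly one node.
talosctl bootstrap --talosconfig _out/talosconfig \
  --nodes 10.0.0.102 --endpoints 10.0.0.102

# Fetch a kubeconfig once the API server is up.
talosctl kubeconfig --talosconfig _out/talosconfig \
  --nodes 10.0.0.102 --endpoints 10.0.0.102
```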
Step 5: Install Democratic CSI for Storage
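A sketch of the storage install, assuming the upstream Democratic CSI Helm chart, a democratic-csi namespace, and the values files referenced in this repo (democratic-csi-iscsi.yaml holds the storage credentials and is deliberately gitignored; the nodes also need the iSCSI tools extension from iscsi-extension-patch.yaml):

```bash
helm repo add democratic-csi https://democratic-csi.github.io/charts/
helm repo update

# Release and namespace names are examples.
helm upgrade --install zfs-iscsi democratic-csi/democratic-csi \
  --namespace democratic-csi --create-namespace \
  --values democratic-csi-iscsi.yaml \
  --values democratic-csi-talos-overrides.yaml

# Verify the provisioner is running and a StorageClass exists.
kubectl -n democratic-csi get pods
kubectl get storageclass
```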
The DB cluster is now ready with HA control plane and persistent storage.
Phase 3: App Cluster Setup (Single Control Plane)
The app cluster handles stateless workloads with a single control plane node for simplicity.
Step 1: Generate Talos Configurations
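A sketch of the generation step, assuming the cluster is named app-cluster, the VIP (10.0.0.170) is used as the stable API endpoint, and output goes to app-cluster/_out/:

```bash
talosctl gen config app-cluster https://10.0.0.170:6443 \
  --output-dir app-cluster/_out
```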
This creates the base config files needed for control plane and worker nodes.
Note: The app cluster uses a single control plane since stateless apps can be quickly redeployed if needed. For production stateful workloads, use the HA DB cluster.
See the actual configs: lab/app-cluster/
Step 2: Create Network Configuration Patch
Static IP addressing is essential for cluster nodes. A patch file defines the network config:
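A sketch of controlplane-network-patch.yaml for the app node, assuming the NIC shows up as eth0:

```bash
cat > app-cluster/controlplane-network-patch.yaml <<'EOF'
machine:
  network:
    hostname: app-cp1
    interfaces:
      - interface: eth0          # adjust to the actual NIC name
        dhcp: false
        addresses:
          - 10.0.0.115/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
        vip:
          ip: 10.0.0.170         # shared virtual IP for the Kubernetes API
    nameservers:
      - 10.0.0.1
      - 8.8.8.8
EOF
```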
The VIP (Virtual IP) at 10.0.0.170 provides a stable endpoint for the Kubernetes API server.
Each of the four nodes has its own network patch with a unique hostname and IP:
- app-cluster/controlplane-network-patch.yaml (10.0.0.115)
- db-cluster/network-patch.yaml (DB cluster nodes)
Step 3: Create Cilium CNI Patch
Cilium requires specific configuration to work with Talos:
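A sketch of cilium-patch.yaml: the patch disables the default CNI and the built-in kube-proxy so Cilium can take over both roles:

```bash
cat > app-cluster/cilium-patch.yaml <<'EOF'
cluster:
  network:
    cni:
      name: none       # do not install the default CNI; Cilium is installed separately
  proxy:
    disabled: true     # Cilium's kube-proxy replacement takes over
EOF
```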
This tells Talos not to install a default CNI, allowing Cilium to be installed manually.
Step 4: Apply Configuration
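A sketch of applying the config and bringing the single control plane up, assuming the generated files under app-cluster/_out/ and the two patches created above:

```bash
# Apply the generated config plus both patches to the node in maintenance mode.
talosctl apply-config --insecure \
  --nodes 10.0.0.115 \
  --file app-cluster/_out/controlplane.yaml \
  --config-patch @app-cluster/controlplane-network-patch.yaml \
  --config-patch @app-cluster/cilium-patch.yaml

# After the node reboots onto its static IP, bootstrap etcd and pull a kubeconfig.
talosctl bootstrap --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 --endpoints 10.0.0.115
talosctl kubeconfig --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 --endpoints 10.0.0.115
```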
The --insecure flag is needed for the initial application since the node is in maintenance mode without established trust.
Step 5: Install Cilium CNI
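A sketch of the install using the Cilium CLI, with Helm values along the lines of the Talos documentation for kube-proxy replacement; it assumes KubePrism (the local API endpoint on port 7445) is enabled, which is the default on recent Talos releases:

```bash
cilium install \
  --version 1.16.5 \
  --set ipam.mode=kubernetes \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=localhost \
  --set k8sServicePort=7445 \
  --set cgroup.autoMount.enabled=false \
  --set cgroup.hostRoot=/sys/fs/cgroup \
  --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
  --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}"

# Wait for the agent and operator to report healthy.
cilium status --wait
```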
Cilium not only provides CNI functionality but also replaces kube-proxy for better performance.
Step 6: Merge Kubeconfig
To manage both clusters from a single terminal:
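A sketch of merging both kubeconfigs into one file; the file paths are examples:

```bash
# Back up the current kubeconfig, then flatten both clusters into one file.
cp ~/.kube/config ~/.kube/config.bak
KUBECONFIG=~/.kube/config:app-cluster/_out/kubeconfig \
  kubectl config view --flatten > /tmp/kubeconfig.merged
mv /tmp/kubeconfig.merged ~/.kube/config

kubectl config get-contexts
```

(By default, talosctl kubeconfig merges new contexts into ~/.kube/config, so pulling both clusters' kubeconfigs that way may already leave you with a merged file.)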
Now switching between clusters is as simple as kubectl config use-context.
Phase 4: Repository Reorganization
With two clusters operational, the repository structure needed to reflect this architecture.
Step 1: Rename Directories for Clarity
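A sketch of the rename; the old directory names below are placeholders for whatever the repo used before the reorganization:

```bash
# Old names are placeholders; substitute the actual pre-reorganization paths.
git mv old-talos-cluster db-cluster
git mv old-app-node app-cluster
git commit -m "Rename cluster config directories for clarity"
```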
Step 2: Archive Old Files
Rather than being deleted, old configurations were archived for reference:
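A sketch of the archive step, using the subdirectories that ended up under archive/; since archive/ is gitignored, the moved copies stay local only:

```bash
mkdir -p archive
mv calico mealie deployments archive/
# Stage the removals of the old tracked paths; archive/ itself is gitignored,
# so the moved copies are not re-added.
git add -A
git status
```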
Step 3: Clean GitOps Submodule
The gitops submodule contained unnecessary files that needed to be removed:
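A sketch of the cleanup, assuming the clutter was untracked files inside the submodule; the dry run comes first so nothing is deleted blindly:

```bash
# Inspect the submodule, list untracked clutter (dry run), then remove it.
git -C gitops status
git -C gitops clean -nd
git -C gitops clean -fd
```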
Phase 5: Security Audit
Before committing changes, a thorough security review was conducted.
Files Identified and Removed:
Kubernetes Credentials
```
find . -name "kubeconfig" -type f
find . -name "*kubeconfig*" -type f
```
All kubeconfig files grant cluster admin access and must never be committed.
Talos API Credentials
```
find . -name "talosconfig" -type f
```
Talosconfig files allow full control of the Talos nodes.
Plain-text Passwords
```
grep -r "password:" . | grep -v ".git"
# Found plain-text iSCSI passwords in the Democratic CSI configs (values redacted here)
```
Several YAML files contained iSCSI storage passwords.
Machine Secrets and Tokens Talos machine configs contain bootstrap tokens and certificates that should never be shared.
Large Binary Files
```
find . \( -name "*.iso" -o -name "*.raw*" \) -exec ls -lh {} \;
# Found: metal-amd64.iso (100MB)
```
Updated .gitignore
A comprehensive .gitignore was created to prevent future leaks:
```
# Secrets - NEVER commit these
*.key
*.pem
age.agekey
**/age.agekey
kubeconfig
**/kubeconfig
talosconfig
**/talosconfig
*-secrets.yaml
*.enc.yaml
# SOPS decrypted files
*.dec
*.dec.yaml
*-decrypted*
# Democratic CSI (has passwords)
democratic-csi-iscsi.yaml
# Large binary files
*.iso
*.raw
*.raw.xz
*.img
*.qcow2
# Images and screenshots
*.jpg
*.jpeg
*.png
*.gif
# Archive and backups
archive/
backups/
# Build outputs
_out/
*.log
```
See the actual .gitignore: lab/.gitignore
Phase 6: Documentation and Commit
Create README
A comprehensive README.md was created documenting:
- Architecture overview
- Quick start commands
- How to switch between clusters
- Management procedures
- Troubleshooting guides
Commit Changes
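A sketch of the commit, assuming the default branch is main:

```bash
# Stage only what should be public, re-check for secrets, then commit and push.
git add .gitignore README.md app-cluster/ db-cluster/ docs/
git status
git commit -m "Reorganize into app-cluster/ and db-cluster/; add docs and .gitignore"
git push origin main   # branch name may differ
```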
Technology Stack
| Component | Version | Purpose |
|---|---|---|
| Talos Linux | v1.11.3 | Immutable Kubernetes OS |
| Kubernetes | v1.34.1 | Container orchestration |
| Cilium | v1.16.5 | CNI and L2 LoadBalancer |
| Flux | v2.x | GitOps continuous delivery |
| Democratic CSI | Latest | iSCSI storage provisioner |
| External Secrets | Latest | Azure Key Vault integration |
| SOPS + age | Latest | Secret encryption |
Key Decisions
Why Separate Clusters?
Workload Isolation Databases and stateful applications have different resource requirements and failure modes than stateless apps. Separating them prevents resource contention and limits blast radius.
Resource Management The DB cluster can use local NVMe storage for high-performance persistent volumes, while the app cluster can focus on compute and memory for application workloads.
Security A compromised application workload can’t directly access database credentials or persistent storage in the separate cluster.
Operational Flexibility Each cluster can be upgraded, maintained, or modified independently without affecting the other.
Why Single Control Plane for App Cluster?
Cost Efficiency Running three control plane nodes requires significant resources. For stateless applications, a single control plane with worker redundancy provides adequate availability.
Simplicity Managing a smaller cluster is operationally simpler, especially during the initial setup phase.
Extensibility If true control plane HA becomes necessary, additional control plane nodes can be added later.
Risk Profile The app cluster runs stateless workloads that can be quickly redeployed if the control plane fails, unlike stateful databases.
Why Talos Linux?
Immutability No SSH access, no package manager, no manual modifications. The entire system is declared via machine configs.
API-Driven
All operations happen through a gRPC API (talosctl), not shell scripts or Ansible playbooks.
Security Minimal attack surface with ~80MB OS footprint. No persistent shells or user accounts to compromise.
Kubernetes-Native Purpose-built for Kubernetes. Every component is designed specifically for running container workloads.
Declarative Machine config is versioned YAML. Changes are applied atomically, and rolling back is trivial.
Proven Reliability The three existing Talos nodes in the DB cluster had already proven Talos’s reliability. Converting the fourth node from k3s to Talos completed the homogeneous infrastructure.
See the complete setup: github.com/t12-pybash/lab
Final Directory Structure
```
k8s-lab/
├── .devcontainer/ # VS Code dev container config
├── .git/ # Git repository
├── .gitignore # Comprehensive secrets exclusion
├── README.md # Cluster documentation
│
├── app-cluster/ # App cluster configs
│ ├── _out/ # Generated (gitignored)
│ │ ├── controlplane.yaml
│ │ ├── talosconfig
│ │ └── worker.yaml
│ ├── cilium-patch.yaml
│ └── controlplane-network-patch.yaml
│
├── db-cluster/ # DB cluster configs
│ ├── patches/
│ │ └── wk5.patch
│ ├── allow-scheduling-patch.yaml
│ ├── azure-cluster-secretstore.yaml
│ ├── azure-secretstore.yaml
│ ├── cilium-l2-config.yaml
│ ├── democratic-csi-talos-overrides.yaml
│ ├── etcd-patch.yaml
│ ├── iscsi-extension-patch.yaml
│ ├── linkding-external-secret.yaml
│ ├── network-patch.yaml
│ ├── schematic.yaml
│ └── wireguard-extension.yaml
│
├── gitops/ # GitOps manifests (submodule)
│ └── k8s-lab-cluster/
│ ├── apps/ # Application definitions
│ ├── base/ # Base configurations
│ ├── clusters/ # Cluster-specific configs
│ ├── infrastructure/ # Infrastructure components
│ ├── monitoring/ # Monitoring stack
│ └── workspaces/ # Workspace configs
│
├── docs/ # Documentation
│ └── multi-cluster-setup-guide.md
│
├── archive/ # Old files (gitignored)
│ ├── calico/
│ ├── mealie/
│ ├── deployments/
│ └── talos-prod-cluster/
│
└── backups/ # Config backups (gitignored)
```
Common Operations
Switch Between Clusters
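A sketch, assuming the default context names that talosctl gen config produces (admin@&lt;cluster-name&gt;); use whatever kubectl config get-contexts actually shows:

```bash
kubectl config get-contexts

kubectl config use-context admin@db-cluster
kubectl config use-context admin@app-cluster
kubectl get nodes
```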
Access Talos Nodes
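A sketch, with the talosconfig path as an example; point talosctl at the right cluster, then target nodes by IP:

```bash
export TALOSCONFIG=./db-cluster/talosconfig   # path is an example
talosctl config endpoint 10.0.0.102
talosctl config node 10.0.0.102 10.0.0.103 10.0.0.104

talosctl version
talosctl services
talosctl -n 10.0.0.102 dashboard   # live per-node dashboard
```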
Check Cluster Health
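A sketch of the routine checks, run per cluster with the matching talosconfig and kubeconfig context:

```bash
# Talos-level checks.
talosctl -n 10.0.0.102 health
talosctl -n 10.0.0.102 get members

# Kubernetes-level checks.
kubectl get nodes -o wide
kubectl get pods -A
```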
Update Talos
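A sketch of a rolling upgrade; the schematic ID is a placeholder, and nodes built with system extensions (iSCSI tools, WireGuard) should use the factory.talos.dev installer image that matches their schematic:

```bash
# Upgrade one node at a time.
talosctl upgrade \
  --nodes 10.0.0.102 \
  --image factory.talos.dev/installer/<schematic-id>:v1.11.3

# Confirm the node came back on the new version before moving on.
talosctl -n 10.0.0.102 version
```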
Update Kubernetes
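A sketch of the Kubernetes upgrade, run once per cluster against a control plane node:

```bash
# Talos rolls the cluster's Kubernetes components to the requested version.
talosctl -n 10.0.0.102 upgrade-k8s --to 1.34.1
kubectl get nodes   # kubelet versions should report the new release
```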
Lessons Learned
What Worked
USB ISO Boot Method This proved to be the most reliable way to install Talos on physical hardware. It provides a clean, known-good environment without existing configuration conflicts.
Static IP Addressing Using static IPs (10.0.0.102-104, 10.0.0.115) eliminated problems with dynamic IP changes and made cluster configuration predictable.
Configuration Patches Talos’s patch system allows overriding specific configuration sections without regenerating entire machine configs. This makes network and CNI customization straightforward.
Separate GitOps Repository Using a git submodule for GitOps manifests keeps cluster configuration separate from application deployments while maintaining the link between them.
Early Security Audit Checking for secrets before the first commit prevented credentials from entering the repository history.
What Didn’t Work
Single Control Plane Risk For the app cluster, a single control plane is acceptable since workloads are stateless. If higher availability is needed, add more control plane nodes using the same process as the DB cluster.
Mixing Workload Types Keep stateful workloads on the DB cluster with HA. Stateless apps can run on either cluster, but separation provides better resource isolation.
DHCP IP Allocation Dynamic IPs caused problems when node IPs changed. Static addressing solved this completely.
Best Practices
Always Use Static IPs Cluster nodes should have predictable, unchanging IP addresses. This simplifies configuration and troubleshooting.
Keep Secrets Out of Git Use comprehensive .gitignore patterns from the start. Once secrets enter git history, they’re difficult to remove completely.
Separate Cluster Configurations Distinct directories for each cluster (db-cluster/, app-cluster/) make it clear which configurations apply where.
Archive, Don’t Delete Keep old configurations in an archive/ directory. They’re useful for reference and don’t hurt anything if gitignored.
Document Network Topology Clear documentation of IPs, hostnames, and network layout prevents confusion and mistakes.
Use SOPS for GitOps Secrets Encrypted secrets can be safely stored in git while remaining secure. SOPS with age keys provides a good balance of security and usability.
Test in Maintenance Mode Before bootstrapping, verify all configurations work correctly. It’s easier to fix issues when the cluster isn’t running yet.
Troubleshooting Reference
Node Won’t Bootstrap
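A sketch of the checks to run; remember that bootstrap must run exactly once, against exactly one control plane node:

```bash
# If bootstrap appears stuck, check etcd and the kernel log on that node.
talosctl -n 10.0.0.102 service etcd
talosctl -n 10.0.0.102 logs etcd
talosctl -n 10.0.0.102 dmesg | tail -n 50
```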
Network Connectivity Issues
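A sketch of the network checks, using the app node as the example target:

```bash
# Confirm the static addresses and routes the network patch should have applied.
talosctl -n 10.0.0.115 get addresses
talosctl -n 10.0.0.115 get routes
talosctl -n 10.0.0.115 get links

# On the app cluster, also check the CNI itself.
cilium status
cilium connectivity test
```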
Certificate Errors
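A sketch; most certificate errors trace back to a stale kubeconfig/talosconfig or an endpoint mismatch:

```bash
# Check what talosctl is pointed at, then re-pull the kubeconfig.
talosctl config info
talosctl -n 10.0.0.115 -e 10.0.0.115 kubeconfig --force
```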
Storage Issues
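A sketch of the storage checks; the democratic-csi namespace name is an assumption:

```bash
# Check the provisioner, the storage objects, and the iSCSI extension.
kubectl get storageclass
kubectl get pv,pvc -A
kubectl -n democratic-csi get pods
talosctl -n 10.0.0.102 get extensions   # iscsi-tools should appear here
```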
Next Steps
The multi-cluster foundation is now in place. Future enhancements include:
Add VPS Worker Node Configure a remote VPS to join the app cluster via WireGuard VPN for true workload redundancy.
Deploy Applications Migrate stateless applications to the app cluster and databases to the DB cluster.
Implement Monitoring Deploy Prometheus and Grafana for cluster and application metrics.
Configure Ingress Set up Traefik or Nginx Ingress Controller with TLS certificates.
Backup Strategy Implement automated backups for etcd and persistent volumes.
CI/CD Integration Connect Flux to automatically deploy from git commits.
References
Documentation
Live Implementation
Main Lab Repo: github.com/t12-pybash/lab
Key directories:
- app-cluster/ - Single node app cluster configs
- db-cluster/ - Three-node HA DB cluster configs
- gitops/ - Flux GitOps manifests
- docs/ - Additional documentation
- .gitignore - Comprehensive secrets exclusion
- README.md - Cluster overview and operations
GitOps Repo: github.com/t12-pybash/k8s-lab-cluster
Explore the complete four-node Talos setup - from the initial k3s conversion to a multi-cluster production environment.
Tags: Kubernetes, Talos, Multi-Cluster, DevOps, GitOps, Infrastructure, Cilium, Flux