Table of Contents
- Overview
- The Challenge
- Architecture Design
- The Implementation Journey
- Technology Stack
- Key Decisions
- Final Directory Structure
- Common Operations
- Lessons Learned
- Troubleshooting Reference
- Next Steps
- References
Overview
This guide documents the complete process of transforming a single Kubernetes cluster with a broken worker node into a multi-cluster laboratory environment using Talos Linux. The journey covers troubleshooting, architectural redesign, and the adoption of proper GitOps practices, with security maintained throughout.
The Challenge
The starting point was a Kubernetes cluster with three control plane nodes and one worker node that had become unresponsive. Initial diagnostics revealed:
- Worker node (talos-worker-1) in NotReady state
- Certificate mismatches preventing remote management
- Mixed workload types (stateful and stateless) on the same cluster
- Disorganized repository structure with potential security issues
Rather than simply fixing the broken node, this became an opportunity to implement a better architecture: separate clusters for different workload types.
Architecture Design
Final Architecture
The redesigned environment consists of two distinct Kubernetes clusters:
DB Cluster (formerly talos-cluster)
- Purpose: Stateful workloads, databases, and persistent storage
- Nodes: 3x control plane nodes for high availability
- Management: Flux GitOps for continuous delivery
- Storage: Democratic CSI providing iSCSI persistent volumes
App Cluster (new)
- Purpose: Stateless application workloads
- Nodes: 1x control plane node (10.0.0.115)
- Future: VPS worker node for workload redundancy
- Networking: Cilium CNI with L2 LoadBalancer support
Network Layout
DB Cluster:
- talos-0lj-bma: 10.0.0.104
- talos-6qj-6v8: 10.0.0.103
- talos-mf1-tt5: 10.0.0.102
App Cluster:
- app-cp1: 10.0.0.115
- VIP: 10.0.0.170
- Future: VPS worker node
Network:
- Gateway: 10.0.0.1
- DNS: 10.0.0.1, 8.8.8.8
- Subnet: 10.0.0.0/24
The Implementation Journey
Phase 1: Initial Assessment
The first step was understanding the current state. Running diagnostics on the broken worker node revealed several issues:
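A pass along these lines surfaces both symptoms; the worker IP is a placeholder, and the error text is representative rather than verbatim:

```bash
# Kubernetes view: the worker shows NotReady
kubectl get nodes

# Talos API view: the node rejects the client's certificate
talosctl -n <worker-ip> version
# rpc error: ... x509: certificate signed by unknown authority (representative)
```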
The certificate error indicated the node had been reset or reconfigured at some point, breaking trust with the cluster.
Phase 2: Worker Node Troubleshooting
Several recovery approaches were attempted, each providing valuable lessons:
Attempt 1: Remote Reset via talosctl
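A reset of this shape is the usual first resort; the worker IP is a placeholder:

```bash
# Wipe state and reboot the node without waiting for a graceful shutdown
talosctl reset --nodes <worker-ip> --graceful=false --reboot
# Fails with the same x509 certificate error seen in the diagnostics above
```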
This failed due to certificate mismatches. The node couldn’t be authenticated.
Attempt 2: Direct Disk Wipe from Ubuntu
Physical access was available, so Ubuntu was installed temporarily to wipe the disk:
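A sketch of the attempt, assuming /dev/sda is the Talos system disk:

```bash
# Remove all filesystem and partition signatures from the disk
sudo wipefs -a /dev/sda
# Fails: the disk hosts the running Ubuntu system, so its partitions are busy
```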
This failed because the disk was in use by the running Ubuntu system.
Attempt 3: Kexec Boot into Talos
Attempting to boot directly into Talos from the running Ubuntu system:
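A sketch of the kexec approach; the kernel and initramfs names match the published Talos release assets, and the kernel arguments are the standard metal-platform ones:

```bash
# Load the Talos kernel and initramfs, then jump into them without a firmware reboot
sudo kexec -l vmlinuz-amd64 \
  --initrd=initramfs-amd64.xz \
  --command-line="talos.platform=metal console=tty0"
sudo kexec -e
```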
This resulted in a black screen with no network connectivity - an unrecoverable state.
Successful Solution: USB ISO Boot
The reliable approach was booting from a Talos USB ISO:
- Created bootable USB with Talos installer
- Booted the physical machine from USB
- Node came up in maintenance mode at IP 10.0.0.115
- Applied fresh configuration from this clean state
This method worked because it provided a known-good Talos environment without any existing configuration conflicts.
Phase 3: App Cluster Creation
With the node booted in maintenance mode, the next step was creating a new cluster configuration.
Step 1: Generate Talos Configurations
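Assuming the VIP endpoint from the network layout above and the _out directory seen in the final structure:

```bash
# Produces controlplane.yaml, worker.yaml, and talosconfig
talosctl gen config app-cluster https://10.0.0.170:6443 \
  --output-dir app-cluster/_out
```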
This creates the base configuration files needed for both control plane and worker nodes.
Step 2: Create Network Configuration Patch
Static IP addressing is essential for cluster nodes. A patch file defines the network configuration:
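A sketch of the patch using the addresses from the network layout; the interface name eth0 is an assumption and should match the node's actual NIC:

```bash
cat > app-cluster/controlplane-network-patch.yaml <<'EOF'
machine:
  network:
    hostname: app-cp1
    interfaces:
      - interface: eth0        # assumption: adjust to the actual NIC name
        addresses:
          - 10.0.0.115/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
        vip:
          ip: 10.0.0.170       # shared, stable API server endpoint
    nameservers:
      - 10.0.0.1
      - 8.8.8.8
EOF
```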
The VIP (Virtual IP) at 10.0.0.170 provides a stable endpoint for the Kubernetes API server.
Step 3: Create Cilium CNI Patch
Cilium requires specific configuration to work with Talos:
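A minimal sketch of the patch; disabling the built-in kube-proxy matches the note below about Cilium replacing it:

```bash
cat > app-cluster/cilium-patch.yaml <<'EOF'
cluster:
  network:
    cni:
      name: none     # do not install the default CNI
  proxy:
    disabled: true   # Cilium's kube-proxy replacement takes over
EOF
```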
This tells Talos not to install a default CNI, allowing Cilium to be installed manually.
Step 4: Apply Configuration
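Assuming the file paths from the steps above; the bootstrap call is the standard one-time etcd initialization that follows a first apply:

```bash
# Maintenance mode has no established trust, hence --insecure
talosctl apply-config --insecure \
  --nodes 10.0.0.115 \
  --file app-cluster/_out/controlplane.yaml

# After the node reboots into the new config, bootstrap etcd exactly once
talosctl --talosconfig app-cluster/_out/talosconfig \
  -e 10.0.0.115 -n 10.0.0.115 bootstrap
```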
The --insecure flag is needed for the initial application since the node is in maintenance mode without established trust.
Step 5: Install Cilium CNI
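A Helm-based install along the lines of the Talos documentation's recommended values; the version is pinned to match the stack table below, and localhost:7445 assumes Talos's KubePrism endpoint (enabled by default on recent releases):

```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium \
  --version 1.16.5 \
  --namespace kube-system \
  --set ipam.mode=kubernetes \
  --set kubeProxyReplacement=true \
  --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
  --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
  --set cgroup.autoMount.enabled=false \
  --set cgroup.hostRoot=/sys/fs/cgroup \
  --set k8sServiceHost=localhost \
  --set k8sServicePort=7445
```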
Cilium provides not just CNI functionality but also replaces kube-proxy for better performance.
Step 6: Merge Kubeconfig
To manage both clusters from a single terminal:
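Assuming the talosconfig path from Step 1; talosctl merges into ~/.kube/config by default, and context names follow Talos's admin@<cluster-name> convention:

```bash
talosctl --talosconfig app-cluster/_out/talosconfig \
  -e 10.0.0.115 -n 10.0.0.115 kubeconfig

# Confirm both clusters are reachable from one terminal
kubectl config get-contexts
kubectl config use-context admin@app-cluster
```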
Now switching between clusters is as simple as kubectl config use-context.
Phase 4: Repository Reorganization
With two clusters operational, the repository structure needed to reflect this architecture.
Step 1: Rename Directories for Clarity
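A sketch, assuming the old directory name mentioned in the architecture section:

```bash
git mv talos-cluster db-cluster   # formerly the single-cluster config dir
mkdir -p app-cluster              # new home for the app cluster configs
```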
Step 2: Archive Old Files
Rather than deleting old configurations, they were archived for reference:
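The directory names below come from the final structure later in this guide:

```bash
mkdir -p archive
mv calico mealie deployments talos-prod-cluster archive/
```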
Step 3: Clean GitOps Submodule
The gitops directory contained unnecessary files that should be removed:
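The general pattern; the paths are placeholders since the specific files vary:

```bash
cd gitops
git rm -r <paths-to-remove>   # placeholder: whatever doesn't belong in GitOps
git commit -m "Clean up files that do not belong in the GitOps repo"
cd ..
```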
Phase 5: Security Audit
Before committing changes, a thorough security review was conducted.
Files Identified and Removed:
Kubernetes Credentials
```bash
find . -name "kubeconfig" -type f
find . -name "*kubeconfig*" -type f
```
All kubeconfig files grant cluster admin access and must never be committed.
Talos API Credentials
```bash
find . -name "talosconfig" -type f
```
Talosconfig files allow full control of the Talos nodes.
Plain-text Passwords
```bash
grep -r "password:" . | grep -v ".git"
# Found in Democratic CSI configs: Gl4sh33n*
```
Several YAML files contained iSCSI storage passwords.
Machine Secrets and Tokens Talos machine configs contain bootstrap tokens and certificates that should never be shared.
Large Binary Files
```bash
find . \( -name "*.iso" -o -name "*.raw*" \) -exec ls -lh {} \;
# Found: metal-amd64.iso (100MB)
```
Updated .gitignore
A comprehensive .gitignore was created to prevent future leaks:
# Secrets - NEVER commit these
*.key
*.pem
age.agekey
**/age.agekey
kubeconfig
**/kubeconfig
talosconfig
**/talosconfig
*-secrets.yaml
*.enc.yaml
# SOPS decrypted files
*.dec
*.dec.yaml
*-decrypted*
# Democratic CSI (has passwords)
democratic-csi-iscsi.yaml
# Large binary files
*.iso
*.raw
*.raw.xz
*.img
*.qcow2
# Images and screenshots
*.jpg
*.jpeg
*.png
*.gif
# Archive and backups
archive/
backups/
# Build outputs
_out/
*.log
Phase 6: Documentation and Commit
Create README
A comprehensive README.md was created documenting:
- Architecture overview
- Quick start commands
- How to switch between clusters
- Management procedures
- Troubleshooting guides
Commit Changes
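A sketch of the final commit; the message is illustrative:

```bash
git add -A
git status   # last check: no kubeconfig, talosconfig, or secrets staged
git commit -m "Reorganize into db-cluster/app-cluster, harden .gitignore, add README"
git push
```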
Technology Stack
| Component | Version | Purpose |
|---|---|---|
| Talos Linux | v1.11.3 | Immutable Kubernetes OS |
| Kubernetes | v1.34.1 | Container orchestration |
| Cilium | v1.16.5 | CNI and L2 LoadBalancer |
| Flux | v2.x | GitOps continuous delivery |
| Democratic CSI | Latest | iSCSI storage provisioner |
| External Secrets | Latest | Azure Key Vault integration |
| SOPS + age | Latest | Secret encryption |
Key Decisions
Why Separate Clusters?
Workload Isolation Databases and stateful applications have different resource requirements and failure modes than stateless apps. Separating them prevents resource contention and limits blast radius.
Resource Management The DB cluster can use local NVMe storage for high-performance persistent volumes, while the app cluster can focus on compute and memory for application workloads.
Security A compromised application workload can’t directly access database credentials or persistent storage in the separate cluster.
Operational Flexibility Each cluster can be upgraded, maintained, or modified independently without affecting the other.
Why Single Control Plane for App Cluster?
Cost Efficiency Running three control plane nodes requires significant resources. For stateless applications, a single control plane with worker redundancy provides adequate availability.
Simplicity Managing a smaller cluster is operationally simpler, especially during the initial setup phase.
Extensibility If true control plane HA becomes necessary, additional control plane nodes can be added later.
Risk Profile The app cluster runs stateless workloads that can be quickly redeployed if the control plane fails, unlike stateful databases.
Why Talos Linux?
Immutability No SSH access, no package manager, no manual modifications. The entire system is declared via machine configs.
API-Driven All operations happen through a gRPC API (talosctl), not shell scripts or Ansible playbooks.
Security Minimal attack surface with ~80MB OS footprint. No persistent shells or user accounts to compromise.
Kubernetes-Native Purpose-built for Kubernetes. Every component is designed specifically for running container workloads.
Declarative Machine configuration is versioned YAML. Changes are applied atomically, and rolling back is trivial.
Final Directory Structure
k8s-lab/
├── .devcontainer/ # VS Code dev container config
├── .git/ # Git repository
├── .gitignore # Comprehensive secrets exclusion
├── README.md # Cluster documentation
│
├── app-cluster/ # App cluster configs
│ ├── _out/ # Generated (gitignored)
│ │ ├── controlplane.yaml
│ │ ├── talosconfig
│ │ └── worker.yaml
│ ├── cilium-patch.yaml
│ └── controlplane-network-patch.yaml
│
├── db-cluster/ # DB cluster configs
│ ├── patches/
│ │ └── wk5.patch
│ ├── allow-scheduling-patch.yaml
│ ├── azure-cluster-secretstore.yaml
│ ├── azure-secretstore.yaml
│ ├── cilium-l2-config.yaml
│ ├── democratic-csi-talos-overrides.yaml
│ ├── etcd-patch.yaml
│ ├── iscsi-extension-patch.yaml
│ ├── linkding-external-secret.yaml
│ ├── network-patch.yaml
│ ├── schematic.yaml
│ └── wireguard-extension.yaml
│
├── gitops/ # GitOps manifests (submodule)
│ └── k8s-lab-cluster/
│ ├── apps/ # Application definitions
│ ├── base/ # Base configurations
│ ├── clusters/ # Cluster-specific configs
│ ├── infrastructure/ # Infrastructure components
│ ├── monitoring/ # Monitoring stack
│ └── workspaces/ # Workspace configs
│
├── docs/ # Documentation
│ └── multi-cluster-setup-guide.md
│
├── archive/ # Old files (gitignored)
│ ├── calico/
│ ├── mealie/
│ ├── deployments/
│ └── talos-prod-cluster/
│
└── backups/ # Config backups (gitignored)
Common Operations
Switch Between Clusters
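Assuming the admin@<cluster-name> contexts produced by talosctl kubeconfig:

```bash
kubectl config get-contexts
kubectl config use-context admin@app-cluster     # stateless workloads
kubectl config use-context admin@talos-cluster   # DB cluster (name may differ)
```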
Access Talos Nodes
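The talosconfig paths below are assumptions (these files are gitignored); node IPs come from the network layout:

```bash
# DB cluster: any control plane node
talosctl --talosconfig db-cluster/talosconfig -n 10.0.0.102 dashboard

# App cluster
talosctl --talosconfig app-cluster/_out/talosconfig -n 10.0.0.115 dashboard
```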
Check Cluster Health
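A quick triage sequence; node IPs per the network layout:

```bash
talosctl -n 10.0.0.115 health     # etcd, kubelet, and API server checks
kubectl get nodes -o wide         # readiness and component versions
kubectl get pods -A --field-selector=status.phase!=Running   # stragglers
```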
Update Talos
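One node at a time; the installer image pins the target version (matching the stack table):

```bash
talosctl upgrade --nodes 10.0.0.102 \
  --image ghcr.io/siderolabs/installer:v1.11.3
```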
Update Kubernetes
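Run once against a control plane node; Talos rolls the upgrade across the whole cluster:

```bash
talosctl -n 10.0.0.102 upgrade-k8s --to 1.34.1
```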
Lessons Learned
What Worked
USB ISO Boot Method This proved to be the most reliable way to install Talos on physical hardware. It provides a clean, known-good environment without existing configuration conflicts.
Static IP Addressing Using static IPs (10.0.0.102-104, 10.0.0.115) eliminated problems with dynamic IP changes and made cluster configuration predictable.
Configuration Patches Talos’s patch system allows overriding specific configuration sections without regenerating entire machine configs. This makes network and CNI customization straightforward.
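For example, patches can be supplied at generation time so the base configs never need hand-editing; a sketch using the patch files from Phase 3:

```bash
talosctl gen config app-cluster https://10.0.0.170:6443 \
  --config-patch @controlplane-network-patch.yaml \
  --config-patch @cilium-patch.yaml \
  --output-dir app-cluster/_out
```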
Separate GitOps Repository Using a git submodule for GitOps manifests keeps cluster configuration separate from application deployments while maintaining the link between them.
Early Security Audit Checking for secrets before the first commit prevented credentials from entering the repository history.
What Didn’t Work
Remote Reset with Certificate Issues When certificates don’t match, Talos won’t allow remote operations. Physical access or known-good configs are required.
Disk Operations from Running OS Attempting to wipe the boot disk from the running system fails. The disk is in use and can’t be unmounted.
Kexec Boot Method While theoretically possible, kexec booting into Talos from another OS proved unreliable and resulted in unrecoverable states.
DHCP IP Allocation Dynamic IPs caused problems when node IPs changed. Static addressing solved this completely.
Best Practices
Always Use Static IPs Cluster nodes should have predictable, unchanging IP addresses. This simplifies configuration and troubleshooting.
Keep Secrets Out of Git Use comprehensive .gitignore patterns from the start. Once secrets enter git history, they’re difficult to remove completely.
Separate Cluster Configurations Distinct directories for each cluster (db-cluster/, app-cluster/) make it clear which configurations apply where.
Archive, Don’t Delete Keep old configurations in an archive/ directory. They’re useful for reference and don’t hurt anything if gitignored.
Document Network Topology Clear documentation of IPs, hostnames, and network layout prevents confusion and mistakes.
Use SOPS for GitOps Secrets Encrypted secrets can be safely stored in git while remaining secure. SOPS with age keys provides a good balance of security and usability.
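A minimal sketch of the workflow, assuming an age key pair already exists (the recipient key below is a placeholder):

```bash
# Encrypt with the age public key; commit only the encrypted file
sops --encrypt --age age1examplepublickey... secret.yaml > secret.enc.yaml

# Decrypt locally when needed (requires the private key, never committed)
sops --decrypt secret.enc.yaml
```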
Test in Maintenance Mode Before bootstrapping, verify all configurations work correctly. It’s easier to fix issues when the cluster isn’t running yet.
Troubleshooting Reference
Node Won’t Bootstrap
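Useful first checks, with a placeholder node IP; bootstrap must only ever run once, against a single control plane node:

```bash
talosctl -n <node-ip> dmesg      # boot and kernel messages
talosctl -n <node-ip> services   # etcd often waits in Preparing until bootstrap
talosctl -n <node-ip> bootstrap  # one time, one control plane node only
```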
Network Connectivity Issues
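Talos exposes network state as API resources; placeholder node IP:

```bash
talosctl -n <node-ip> get addresses   # assigned IPs
talosctl -n <node-ip> get routes      # default route via 10.0.0.1 expected
talosctl -n <node-ip> get links       # interface and link state
```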
Certificate Errors
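When the client and node disagree, point the client at known-good credentials first; if trust is genuinely broken, recover via maintenance mode as in Phase 2:

```bash
talosctl config endpoint 10.0.0.170   # VIP or a node IP
talosctl config node 10.0.0.115
talosctl -n 10.0.0.115 version        # should now authenticate
```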
Storage Issues
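Basic triage for Democratic CSI volumes; the grep is a loose filter since the exact namespace isn't fixed here:

```bash
kubectl get storageclass
kubectl get pv,pvc -A
kubectl get pods -A | grep -i csi   # locate the democratic-csi pods
```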
Next Steps
The multi-cluster foundation is now in place. Future enhancements include:
Add VPS Worker Node Configure a remote VPS to join the app cluster via WireGuard VPN for true workload redundancy.
Deploy Applications Migrate stateless applications to the app cluster and databases to the DB cluster.
Implement Monitoring Deploy Prometheus and Grafana for cluster and application metrics.
Configure Ingress Set up Traefik or Nginx Ingress Controller with TLS certificates.
Backup Strategy Implement automated backups for etcd and persistent volumes.
CI/CD Integration Connect Flux to automatically deploy from git commits.
References
Repository Links:
- Main Repo: github.com/t12-pybash/lab
- GitOps Repo: github.com/t12-pybash/k8s-lab-cluster
Tags: Kubernetes, Talos, Multi-Cluster, DevOps, GitOps, Infrastructure, Cilium, Flux