Overview

This guide documents the complete process of transforming a single Kubernetes cluster with a broken worker node into a multi-cluster laboratory environment using Talos Linux. The journey involves troubleshooting, architectural redesign, and implementing proper GitOps practices while maintaining security throughout.

The Challenge

The starting point was a Kubernetes cluster with three control plane nodes and one worker node that had become unresponsive. Initial diagnostics revealed:

  • Worker node (talos-worker-1) in NotReady state
  • Certificate mismatches preventing remote management
  • Mixed workload types (stateful and stateless) on the same cluster
  • Disorganized repository structure with potential security issues

Rather than simply fixing the broken node, this became an opportunity to implement a better architecture: separate clusters for different workload types.

Architecture Design

Final Architecture

The redesigned environment consists of two distinct Kubernetes clusters:

DB Cluster (formerly talos-cluster)

  • Purpose: Stateful workloads, databases, and persistent storage
  • Nodes: 3x control plane nodes for high availability
  • Management: Flux GitOps for continuous delivery
  • Storage: Democratic CSI providing iSCSI persistent volumes

App Cluster (new)

  • Purpose: Stateless application workloads
  • Nodes: 1x control plane node (10.0.0.115)
  • Future: VPS worker node for workload redundancy
  • Networking: Cilium CNI with L2 LoadBalancer support

Network Layout

DB Cluster:
  - talos-0lj-bma: 10.0.0.104
  - talos-6qj-6v8: 10.0.0.103
  - talos-mf1-tt5: 10.0.0.102

App Cluster:
  - app-cp1: 10.0.0.115
  - VIP: 10.0.0.170
  - Future: VPS worker node

Network:
  - Gateway: 10.0.0.1
  - DNS: 10.0.0.1, 8.8.8.8
  - Subnet: 10.0.0.0/24

The Implementation Journey

Phase 1: Initial Assessment

The first step was understanding the current state. Running diagnostics on the broken worker node revealed several issues:

# Check node status
kubectl get nodes
# NAME              STATUS     ROLES    AGE   VERSION
# talos-worker-1    NotReady   <none>   45d   v1.34.1

# Attempt to access via Talos API
talosctl --nodes 10.0.0.111 version
# Error: x509: certificate signed by unknown authority

The certificate error indicated the node had been reset or reconfigured at some point, breaking trust with the cluster.

Phase 2: Worker Node Troubleshooting

Several recovery approaches were attempted, each providing valuable lessons:

Attempt 1: Remote Reset via talosctl

talosctl --nodes 10.0.0.111 reset --graceful=false --reboot

This failed due to certificate mismatches. The node couldn’t be authenticated.

Attempt 2: Direct Disk Wipe from Ubuntu

Physical access was available, so Ubuntu was installed temporarily to wipe the disk:

wipefs -a /dev/nvme0n1
dd if=talos-amd64.raw of=/dev/nvme0n1 bs=4M status=progress
# Error: Device or resource busy

This failed because the disk was in use by the running Ubuntu system.

Attempt 3: Kexec Boot into Talos

Attempting to boot directly into Talos from the running Ubuntu system:

kexec -l /boot/vmlinuz-talos --initrd=/boot/initramfs-talos.img
kexec -e

This resulted in a black screen with no network connectivity - an unrecoverable state.

Successful Solution: USB ISO Boot

The reliable approach was booting from a Talos USB ISO:

  1. Created bootable USB with Talos installer
  2. Booted the physical machine from USB
  3. Node came up in maintenance mode at IP 10.0.0.115
  4. Applied fresh configuration from this clean state

This method worked because it provided a known-good Talos environment without any existing configuration conflicts.
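
For reference, writing the installer ISO to a USB stick can be done with dd; a minimal sketch, assuming the image is metal-amd64.iso and the USB device is /dev/sdX (verify the device path before writing):

# Write the Talos installer ISO to a USB stick (replace /dev/sdX with the real device)
sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress conv=fsync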

Phase 3: App Cluster Creation

With the node booted in maintenance mode, the next step was creating a new cluster configuration.

Step 1: Generate Talos Configurations

# Generate fresh cluster configs
talosctl gen config app-cluster https://10.0.0.115:6443 \
  --output-dir app-cluster/_out \
  --with-docs=false \
  --with-examples=false

This creates the base configuration files needed for both control plane and worker nodes.
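
The generated files land in the output directory; it should contain something like the following (matching the directory layout shown later in this guide):

ls app-cluster/_out
# controlplane.yaml  talosconfig  worker.yaml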

Step 2: Create Network Configuration Patch

Static IP addressing is essential for cluster nodes. A patch file defines the network configuration:

# app-cluster/controlplane-network-patch.yaml
machine:
  network:
    hostname: app-cp1
    interfaces:
      - interface: enp2s0f0
        addresses:
          - 10.0.0.115/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
        vip:
          ip: 10.0.0.170
    nameservers:
      - 10.0.0.1
      - 8.8.8.8

The VIP (Virtual IP) at 10.0.0.170 provides a stable endpoint for the Kubernetes API server.
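
After the cluster is bootstrapped (see Step 4), the VIP can be checked from a workstation; a minimal sketch, assuming curl is available:

# Any HTTP response (even 401 Unauthorized) confirms the API server answers on the VIP
curl -k https://10.0.0.170:6443/version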

Step 3: Create Cilium CNI Patch

Cilium requires specific configuration to work with Talos:

# app-cluster/cilium-patch.yaml
cluster:
  network:
    cni:
      name: none

This tells Talos not to install a default CNI, allowing Cilium to be installed manually.
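
Since the Cilium install in Step 5 enables kubeProxyReplacement, the default kube-proxy can optionally be disabled at the Talos level as well. A sketch of such a patch (not part of the original setup; the file name is illustrative):

# app-cluster/kube-proxy-patch.yaml (optional)
cluster:
  proxy:
    disabled: true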

Step 4: Apply Configuration

# Apply the configuration with all patches
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 \
  apply-config --insecure \
  --file app-cluster/_out/controlplane.yaml \
  --config-patch @app-cluster/controlplane-network-patch.yaml \
  --config-patch @app-cluster/cilium-patch.yaml

# Bootstrap the Kubernetes cluster
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 \
  bootstrap

The --insecure flag is needed for the initial application since the node is in maintenance mode without established trust.
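
Once the node reboots with its final configuration, the talosconfig can be pointed at it so later commands no longer need --nodes or --insecure; for example:

# Set the default endpoint and node in the generated talosconfig
talosctl --talosconfig app-cluster/_out/talosconfig config endpoint 10.0.0.115
talosctl --talosconfig app-cluster/_out/talosconfig config node 10.0.0.115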

Step 5: Install Cilium CNI

# Retrieve kubeconfig
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 \
  kubeconfig app-cluster/app-kubeconfig

# Install Cilium
KUBECONFIG=app-cluster/app-kubeconfig \
cilium install \
  --version 1.16.5 \
  --set ipam.mode=kubernetes \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=10.0.0.115 \
  --set k8sServicePort=6443

# Verify installation
KUBECONFIG=app-cluster/app-kubeconfig cilium status

Cilium provides not only CNI functionality but also a kube-proxy replacement, which improves service routing performance.
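
To confirm the kube-proxy replacement actually took effect, the Cilium CLI can dump the running configuration; a quick check (output format may vary by version):

KUBECONFIG=app-cluster/app-kubeconfig \
cilium config view | grep kube-proxy-replacement
# Expect a value of "true"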

Step 6: Merge Kubeconfig

To manage both clusters from a single terminal:

# Merge kubeconfigs
KUBECONFIG=~/.kube/config:app-cluster/app-kubeconfig \
  kubectl config view --flatten > ~/.kube/config.new
mv ~/.kube/config.new ~/.kube/config

# Verify both contexts exist
kubectl config get-contexts
# * admin@app-cluster
#   admin@talos cluster

Now switching between clusters is as simple as kubectl config use-context.

Phase 4: Repository Reorganization

With two clusters operational, the repository structure needed to reflect this architecture.

Step 1: Rename Directories for Clarity

# Rename for clear purpose identification
mv talos-cluster db-cluster
mv k8s-lab-cluster gitops

# app-cluster already named appropriately

Step 2: Archive Old Files

Rather than deleting old configurations, they were archived for reference:

mkdir -p archive

# Move old deployment files
mv calico/ archive/
mv mealie/ archive/
mv deployments/ archive/
mv talos-prod-cluster/ archive/

# Move loose YAML files
mv *.yaml archive/

Step 3: Clean GitOps Submodule

The gitops directory contained unnecessary files that should be removed:

cd gitops/k8s-lab-cluster

# Remove empty encrypted file
rm -f azure-credentials.enc.yaml      # 0 bytes

# Remove misplaced files
rm -f 1                                # Misnamed file
rm -f kube-prometheus-stack.yaml       # 181KB auto-generated
rm -f grafana-tls-secret.yaml          # Should be in proper directory

# Remove duplicate nested directory
rm -rf k8s-lab-cluster/                # 980KB duplicate

# Result: 4.5MB → 3.4MB (removed 1.1MB)

Phase 5: Security Audit

Before committing changes, a thorough security review was conducted.

Files Identified and Removed:

  1. Kubernetes Credentials

    
    find . -name "kubeconfig" -type f
    find . -name "*kubeconfig*" -type f
    

    All kubeconfig files grant cluster admin access and must never be committed.

  2. Talos API Credentials

    
    find . -name "talosconfig" -type f
    

    Talosconfig files allow full control of the Talos nodes.

  3. Plain-text Passwords

    
    grep -r "password:" . | grep -v ".git"
    # Found in Democratic CSI configs (password value redacted here)
    

    Several YAML files contained iSCSI storage passwords.

  4. Machine Secrets and Tokens Talos machine configs contain bootstrap tokens and certificates that should never be shared.

  5. Large Binary Files

    
    find . \( -name "*.iso" -o -name "*.raw*" \) -exec ls -lh {} \;
    # Found: metal-amd64.iso (100MB)
    
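Beyond locating these files on disk, it is worth confirming none of them are already tracked by git; a quick check along these lines:

git ls-files | grep -Ei 'kubeconfig|talosconfig|\.iso$|\.raw'
# No output means none of these patterns are currently tracked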

Updated .gitignore

A comprehensive .gitignore was created to prevent future leaks:

# Secrets - NEVER commit these
*.key
*.pem
age.agekey
**/age.agekey
kubeconfig
**/kubeconfig
talosconfig
**/talosconfig
*-secrets.yaml
*.enc.yaml

# SOPS decrypted files
*.dec
*.dec.yaml
*-decrypted*

# Democratic CSI (has passwords)
democratic-csi-iscsi.yaml

# Large binary files
*.iso
*.raw
*.raw.xz
*.img
*.qcow2

# Images and screenshots
*.jpg
*.jpeg
*.png
*.gif

# Archive and backups
archive/
backups/

# Build outputs
_out/
*.log
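
Whether a given path is actually excluded can be verified with git check-ignore, which prints the matching rule:

# Shows which .gitignore rule matches each path; no output means the path would be committed
git check-ignore -v app-cluster/_out/talosconfig archive/ backups/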

Phase 6: Documentation and Commit

Create README

A comprehensive README.md was created documenting:

  • Architecture overview
  • Quick start commands
  • How to switch between clusters
  • Management procedures
  • Troubleshooting guides

Commit Changes

# Add gitops as proper submodule
git submodule add git@github.com:t12-pybash/k8s-lab-cluster.git \
  gitops/k8s-lab-cluster

# Commit repository reorganization
git add -A
git commit -m "Reorganize repository into multi-cluster architecture

Restructured the repository to support separate DB and App clusters:
- Renamed talos-cluster → db-cluster for stateful workloads
- Created app-cluster directory for application workloads
- Renamed k8s-lab-cluster → gitops for GitOps manifests
- Removed old deployment files (calico, mealie, standalone yamls)
- Cleaned up gitops directory (removed duplicates and generated files)
- Added comprehensive README with architecture documentation
- Updated .gitignore to exclude secrets and build artifacts"

# Commit gitops submodule cleanup
cd gitops/k8s-lab-cluster
git add -A
git commit -m "Clean up unnecessary files and duplicates"
cd ../..

# Update submodule reference in main repo
git add gitops/k8s-lab-cluster
git commit -m "Update gitops submodule to cleaned version"

# Push all changes
git push origin main

Technology Stack

Component           Version    Purpose
Talos Linux         v1.11.3    Immutable Kubernetes OS
Kubernetes          v1.34.1    Container orchestration
Cilium              v1.16.5    CNI and L2 LoadBalancer
Flux                v2.x       GitOps continuous delivery
Democratic CSI      Latest     iSCSI storage provisioner
External Secrets    Latest     Azure Key Vault integration
SOPS + age          Latest     Secret encryption

Key Decisions

Why Separate Clusters?

Workload Isolation Databases and stateful applications have different resource requirements and failure modes than stateless apps. Separating them prevents resource contention and limits blast radius.

Resource Management The DB cluster can use local NVMe storage for high-performance persistent volumes, while the app cluster can focus on compute and memory for application workloads.

Security A compromised application workload can’t directly access database credentials or persistent storage in the separate cluster.

Operational Flexibility Each cluster can be upgraded, maintained, or modified independently without affecting the other.

Why Single Control Plane for App Cluster?

Cost Efficiency Running three control plane nodes requires significant resources. For stateless applications, a single control plane with worker redundancy provides adequate availability.

Simplicity Managing a smaller cluster is operationally simpler, especially during the initial setup phase.

Extensibility If true control plane HA becomes necessary, additional control plane nodes can be added later.

Risk Profile The app cluster runs stateless workloads that can be quickly redeployed if the control plane fails, unlike stateful databases.

Why Talos Linux?

Immutability No SSH access, no package manager, no manual modifications. The entire system is declared via machine configs.

API-Driven All operations happen through a gRPC API (talosctl), not shell scripts or Ansible playbooks.

Security Minimal attack surface with ~80MB OS footprint. No persistent shells or user accounts to compromise.

Kubernetes-Native Purpose-built for Kubernetes. Every component is designed specifically for running container workloads.

Declarative Machine configuration is versioned YAML. Changes are applied atomically, and rolling back is trivial.

Final Directory Structure

k8s-lab/
├── .devcontainer/              # VS Code dev container config
├── .git/                       # Git repository
├── .gitignore                  # Comprehensive secrets exclusion
├── README.md                   # Cluster documentation
│
├── app-cluster/                # App cluster configs
│   ├── _out/                   # Generated (gitignored)
│   │   ├── controlplane.yaml
│   │   ├── talosconfig
│   │   └── worker.yaml
│   ├── cilium-patch.yaml
│   └── controlplane-network-patch.yaml
│
├── db-cluster/                 # DB cluster configs
│   ├── patches/
│   │   └── wk5.patch
│   ├── allow-scheduling-patch.yaml
│   ├── azure-cluster-secretstore.yaml
│   ├── azure-secretstore.yaml
│   ├── cilium-l2-config.yaml
│   ├── democratic-csi-talos-overrides.yaml
│   ├── etcd-patch.yaml
│   ├── iscsi-extension-patch.yaml
│   ├── linkding-external-secret.yaml
│   ├── network-patch.yaml
│   ├── schematic.yaml
│   └── wireguard-extension.yaml
│
├── gitops/                     # GitOps manifests (submodule)
│   └── k8s-lab-cluster/
│       ├── apps/               # Application definitions
│       ├── base/               # Base configurations
│       ├── clusters/           # Cluster-specific configs
│       ├── infrastructure/     # Infrastructure components
│       ├── monitoring/         # Monitoring stack
│       └── workspaces/         # Workspace configs
│
├── docs/                       # Documentation
│   └── multi-cluster-setup-guide.md
│
├── archive/                    # Old files (gitignored)
│   ├── calico/
│   ├── mealie/
│   ├── deployments/
│   └── talos-prod-cluster/
│
└── backups/                    # Config backups (gitignored)

Common Operations

Switch Between Clusters

# Use DB cluster
kubectl config use-context "admin@talos cluster"

# Use App cluster
kubectl config use-context "admin@app-cluster"

# View all contexts
kubectl config get-contexts

Access Talos Nodes

# DB cluster nodes
talosctl --talosconfig db-cluster/talosconfig-working \
  --nodes 10.0.0.102 version

# App cluster node
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 version

Check Cluster Health

# Kubernetes nodes
kubectl get nodes -o wide

# Talos services
talosctl --nodes <ip> services

# Cilium status
cilium status --wait

Update Talos

talosctl --nodes 10.0.0.115 \
  upgrade --image ghcr.io/siderolabs/installer:v1.11.3

Update Kubernetes

talosctl --nodes 10.0.0.115 \
  upgrade-k8s --to 1.34.1

Lessons Learned

What Worked

USB ISO Boot Method This proved to be the most reliable way to install Talos on physical hardware. It provides a clean, known-good environment without existing configuration conflicts.

Static IP Addressing Using static IPs (10.0.0.102-104, 10.0.0.115) eliminated problems with dynamic IP changes and made cluster configuration predictable.

Configuration Patches Talos’s patch system allows overriding specific configuration sections without regenerating entire machine configs. This makes network and CNI customization straightforward.

Separate GitOps Repository Using a git submodule for GitOps manifests keeps cluster configuration separate from application deployments while maintaining the link between them.

Early Security Audit Checking for secrets before the first commit prevented credentials from entering the repository history.

What Didn’t Work

Remote Reset with Certificate Issues When certificates don’t match, Talos won’t allow remote operations. Physical access or known-good configs are required.

Disk Operations from Running OS Attempting to wipe the boot disk from the running system fails. The disk is in use and can’t be unmounted.

Kexec Boot Method While theoretically possible, kexec booting into Talos from another OS proved unreliable and resulted in unrecoverable states.

DHCP IP Allocation Dynamic IPs caused problems when node IPs changed. Static addressing solved this completely.

Best Practices

Always Use Static IPs Cluster nodes should have predictable, unchanging IP addresses. This simplifies configuration and troubleshooting.

Keep Secrets Out of Git Use comprehensive .gitignore patterns from the start. Once secrets enter git history, they’re difficult to remove completely.

Separate Cluster Configurations Distinct directories for each cluster (db-cluster/, app-cluster/) make it clear which configurations apply where.

Archive, Don’t Delete Keep old configurations in an archive/ directory. They’re useful for reference and don’t hurt anything if gitignored.

Document Network Topology Clear documentation of IPs, hostnames, and network layout prevents confusion and mistakes.

Use SOPS for GitOps Secrets Encrypted secrets can be safely stored in git while remaining secure. SOPS with age keys provides a good balance of security and usability.
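
As a minimal sketch (the age recipient and file names below are placeholders, not values from this repository):

# Encrypt a Kubernetes Secret manifest with an age public key before committing it
sops --encrypt \
  --age age1examplepublickeyplaceholder \
  --encrypted-regex '^(data|stringData)$' \
  my-secret.yaml > my-secret.enc.yaml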

Test in Maintenance Mode Before bootstrapping, verify all configurations work correctly. It’s easier to fix issues when the cluster isn’t running yet.
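
Talos ships a built-in validator that can be run against the generated machine configs before anything is applied; for example:

# Validate a machine config for bare-metal ("metal") installs
talosctl validate --config app-cluster/_out/controlplane.yaml --mode metal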

Troubleshooting Reference

Node Won’t Bootstrap

# Check if Kubernetes is running
talosctl --nodes <ip> service kubelet status

# View kubelet logs
talosctl --nodes <ip> logs kubelet

# Check etcd status
talosctl --nodes <ip> service etcd status

Network Connectivity Issues

# Check Cilium
cilium status
cilium connectivity test

# View Cilium logs
kubectl logs -n kube-system -l k8s-app=cilium

# Restart Cilium if needed
kubectl rollout restart daemonset/cilium -n kube-system

Certificate Errors

# Regenerate kubeconfig with new certs
talosctl --nodes <ip> kubeconfig --force

# Check certificate expiry
talosctl --nodes <ip> get certificates

Storage Issues

# Check Democratic CSI
kubectl get pods -n democratic-csi

# View CSI driver logs
kubectl logs -n democratic-csi -l app=democratic-csi

# List storage classes and PVs
kubectl get sc
kubectl get pv

Next Steps

The multi-cluster foundation is now in place. Future enhancements include:

  1. Add VPS Worker Node Configure a remote VPS to join the app cluster via WireGuard VPN for true workload redundancy.

  2. Deploy Applications Migrate stateless applications to the app cluster and databases to the DB cluster.

  3. Implement Monitoring Deploy Prometheus and Grafana for cluster and application metrics.

  4. Configure Ingress Set up Traefik or Nginx Ingress Controller with TLS certificates.

  5. Backup Strategy Implement automated backups for etcd and persistent volumes.

  6. CI/CD Integration Connect Flux to automatically deploy from git commits.

Tags: Kubernetes, Talos, Multi-Cluster, DevOps, GitOps, Infrastructure, Cilium, Flux