Overview

This guide documents building a production-ready four-node Talos Linux Kubernetes lab with a multi-cluster architecture. The implementation features immutable infrastructure, GitOps workflows, and proper separation of stateful and stateless workloads across two distinct clusters.

What You’ll Learn:

  • Setting up Talos Linux on bare metal hardware
  • Designing multi-cluster architecture for workload isolation
  • Implementing Cilium CNI and Democratic CSI storage
  • GitOps deployment with Flux CD
  • Converting legacy k3s nodes to Talos (bonus section)

The Architecture

The lab consists of four physical nodes running Talos Linux, organized into two Kubernetes clusters with distinct purposes. This design provides workload isolation, independent scaling, and proper security boundaries.

Key Design Principles:

  • Immutable infrastructure with API-driven management
  • Separation of stateful and stateless workloads
  • High availability for critical database services
  • GitOps-driven deployments with version control
  • Comprehensive secret management with SOPS encryption

Architecture Design

Four-Node Talos Cluster Design

The environment consists of two distinct Kubernetes clusters:

DB Cluster (stateful workloads)

  • Purpose: Databases, persistent storage, stateful apps
  • Nodes: 3x control plane nodes (10.0.0.102-104)
  • Management: Flux GitOps for continuous delivery
  • Storage: Democratic CSI providing iSCSI persistent volumes
  • Config: lab/db-cluster/

App Cluster (stateless workloads)

  • Purpose: Stateless apps, web services
  • Nodes: 1x control plane node (10.0.0.115)
  • Networking: Cilium CNI with L2 LoadBalancer support
  • Config: lab/app-cluster/

Network Layout

Physical Hardware:

  • 4x bare metal nodes running Talos Linux
  • Organized into 2 Kubernetes clusters
  • Static IP addressing for stability
  • Shared network infrastructure (10.0.0.0/24)

DB Cluster (3 nodes):
  - talos-0lj-bma: 10.0.0.104 (control plane)
  - talos-6qj-6v8: 10.0.0.103 (control plane)
  - talos-mf1-tt5: 10.0.0.102 (control plane)

App Cluster (1 node):
  - app-cp1: 10.0.0.115 (control plane)
  - VIP: 10.0.0.170

Network:
  - Gateway: 10.0.0.1
  - DNS: 10.0.0.1, 8.8.8.8
  - Subnet: 10.0.0.0/24

See network configs: lab/db-cluster/network-patch.yaml

The Implementation Journey

Phase 1: Planning and Hardware Preparation

Before implementation, plan your cluster architecture and prepare hardware.

Hardware Requirements:

  • 4x physical machines (bare metal or VMs)
  • Each node: 4+ CPU cores, 8GB+ RAM, 50GB+ storage
  • Network connectivity on same subnet
  • USB drives for Talos installation

Download Talos ISO:

# Get latest Talos metal ISO
wget https://github.com/siderolabs/talos/releases/download/v1.11.3/metal-amd64.iso

# Create bootable USB
dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress

Network Planning:

  • DB Cluster: 10.0.0.102, 10.0.0.103, 10.0.0.104
  • App Cluster: 10.0.0.115
  • Gateway: 10.0.0.1
  • DNS: 10.0.0.1, 8.8.8.8

Phase 2: DB Cluster Setup (Three-Node HA)

The DB cluster provides high availability for stateful workloads with three control plane nodes.

Step 1: Boot Nodes into Maintenance Mode

For each of the three DB cluster nodes:

  1. Insert USB with Talos ISO
  2. Boot from USB
  3. Node comes up in maintenance mode
  4. Note the IP address assigned via DHCP (used in the check below)
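
Once a node is up in maintenance mode, it can be queried over the insecure API to confirm connectivity and to identify the disk Talos should be installed to. A minimal check, assuming a DHCP-assigned address of 10.0.0.50 (placeholder):

# Query a maintenance-mode node (no certificates established yet)
talosctl --nodes 10.0.0.50 --insecure get disks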

Step 2: Generate Cluster Configs

# Generate configs for db-cluster
talosctl gen config db-cluster https://10.0.0.102:6443 \
  --output-dir db-cluster/_out \
  --with-docs=false \
  --with-examples=false

This creates base configs for control plane and worker nodes.
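
The output directory should now contain three files: the control plane machine config, the worker machine config, and the talosconfig client credentials.

ls db-cluster/_out/
# controlplane.yaml  talosconfig  worker.yaml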

See the DB cluster configs: lab/db-cluster/

Step 3: Create Network Patches

Each node needs unique network config with static IP:

# db-cluster/network-patch.yaml
machine:
  network:
    hostname: talos-0lj-bma  # Unique per node
    interfaces:
      - interface: eth0
        addresses:
          - 10.0.0.104/24  # Unique per node
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
    nameservers:
      - 10.0.0.1
      - 8.8.8.8

See actual network patches: lab/db-cluster/network-patch.yaml

Step 4: Apply Configs and Bootstrap

# Apply config to first control plane
talosctl --nodes 10.0.0.102 apply-config \
  --insecure \
  --file db-cluster/_out/controlplane.yaml \
  --config-patch @db-cluster/network-patch.yaml

# Bootstrap etcd on first node
talosctl --nodes 10.0.0.102 bootstrap

# Apply configs to remaining control planes
talosctl --nodes 10.0.0.103 apply-config --insecure ...
talosctl --nodes 10.0.0.104 apply-config --insecure ...

Step 5: Install Democratic CSI for Storage

# Get kubeconfig
talosctl --nodes 10.0.0.102 kubeconfig db-kubeconfig

# Install Democratic CSI for iSCSI persistent storage
kubectl apply -f db-cluster/democratic-csi-talos-overrides.yaml

The DB cluster is now ready with HA control plane and persistent storage.
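
Before moving on, it's worth confirming that all three control plane nodes are healthy and that etcd has a full quorum. A quick check, assuming the generated talosconfig has its endpoints configured:

# Overall health check from the first control plane
talosctl --talosconfig db-cluster/_out/talosconfig \
  --nodes 10.0.0.102 health

# etcd membership should list all three control plane nodes
talosctl --talosconfig db-cluster/_out/talosconfig \
  --nodes 10.0.0.102 etcd members

# Kubernetes view of the nodes
kubectl --kubeconfig db-kubeconfig get nodes -o wide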

Phase 3: App Cluster Setup (Single Control Plane)

The app cluster handles stateless workloads with a single control plane node for simplicity.

Step 1: Generate Talos Configurations

# Generate fresh cluster configs
talosctl gen config app-cluster https://10.0.0.115:6443 \
  --output-dir app-cluster/_out \
  --with-docs=false \
  --with-examples=false

This creates the base config files needed for control plane and worker nodes.

Note: The app cluster uses a single control plane since stateless apps can be quickly redeployed if needed. For production stateful workloads, use the HA DB cluster.

See the actual configs: lab/app-cluster/

Step 2: Create Network Configuration Patch

Static IP addressing is essential for cluster nodes. A patch file defines the network config:

# app-cluster/controlplane-network-patch.yaml
machine:
  network:
    hostname: app-cp1
    interfaces:
      - interface: enp2s0f0
        addresses:
          - 10.0.0.115/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
        vip:
          ip: 10.0.0.170
    nameservers:
      - 10.0.0.1
      - 8.8.8.8

The VIP (Virtual IP) at 10.0.0.170 provides a stable endpoint for the Kubernetes API server.

Each of the four nodes has its own network patch with a unique hostname and IP address.

Step 3: Create Cilium CNI Patch

Cilium requires specific configuration to work with Talos:

# app-cluster/cilium-patch.yaml
cluster:
  network:
    cni:
      name: none

This tells Talos not to install a default CNI, allowing Cilium to be installed manually.
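
Since Cilium will be installed with kubeProxyReplacement=true, it is also common to disable kube-proxy in the Talos cluster config. That patch isn't shown here, so treat the following as an optional sketch rather than the repo's actual cilium-patch.yaml:

# Optional sketch: combined CNI + kube-proxy patch (assumption, not the repo's file)
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true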

Step 4: Apply Configuration

# Apply the configuration with all patches
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 \
  apply-config --insecure \
  --file app-cluster/_out/controlplane.yaml \
  --config-patch @app-cluster/controlplane-network-patch.yaml \
  --config-patch @app-cluster/cilium-patch.yaml

# Bootstrap the Kubernetes cluster
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 \
  bootstrap

The --insecure flag is needed for the initial application since the node is in maintenance mode without established trust.
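
Once the node reboots with its certificates in place, later talosctl calls authenticate via the generated talosconfig rather than --insecure. Setting a default endpoint and node avoids repeating the flags:

# One-time client setup for the app cluster
talosctl --talosconfig app-cluster/_out/talosconfig config endpoint 10.0.0.115
talosctl --talosconfig app-cluster/_out/talosconfig config node 10.0.0.115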

Step 5: Install Cilium CNI

# Retrieve kubeconfig
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 \
  kubeconfig app-cluster/app-kubeconfig

# Install Cilium
KUBECONFIG=app-cluster/app-kubeconfig \
cilium install \
  --version 1.16.5 \
  --set ipam.mode=kubernetes \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=10.0.0.115 \
  --set k8sServicePort=6443

# Verify installation
KUBECONFIG=app-cluster/app-kubeconfig cilium status

Cilium provides not only CNI functionality; with kubeProxyReplacement=true it also takes over kube-proxy's role, handling service load balancing in eBPF for better performance.
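
The L2 LoadBalancer support mentioned in the architecture comes from Cilium's load balancer IP pools and L2 announcements (the repo keeps a cilium-l2-config.yaml for this). A minimal sketch with placeholder values, assuming Cilium was installed with --set l2announcements.enabled=true:

# Sketch only: the address range and interface are placeholders
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lab-pool
spec:
  blocks:
    - start: 10.0.0.200
      stop: 10.0.0.220
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: lab-l2
spec:
  loadBalancerIPs: true
  interfaces:
    - enp2s0f0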

Step 6: Merge Kubeconfig

To manage both clusters from a single terminal:

# Merge kubeconfigs
KUBECONFIG=~/.kube/config:app-cluster/app-kubeconfig \
  kubectl config view --flatten > ~/.kube/config.new
mv ~/.kube/config.new ~/.kube/config

# Verify both contexts exist
kubectl config get-contexts
# * admin@app-cluster
#   admin@talos cluster

Now switching between clusters is as simple as kubectl config use-context.

Phase 4: Repository Reorganization

With two clusters operational, the repository structure needed to reflect this architecture.

Step 1: Rename Directories for Clarity

# Rename for clear purpose identification
mv talos-cluster db-cluster
mv k8s-lab-cluster gitops

# app-cluster already named appropriately

Step 2: Archive Old Files

Rather than being deleted, the old configurations were archived for reference:

mkdir -p archive

# Move old deployment files
mv calico/ archive/
mv mealie/ archive/
mv deployments/ archive/
mv talos-prod-cluster/ archive/

# Move loose YAML files
mv *.yaml archive/

Step 3: Clean GitOps Submodule

The gitops directory contained unnecessary files that needed to be removed:

cd gitops/k8s-lab-cluster

# Remove empty encrypted file
rm -f azure-credentials.enc.yaml      # 0 bytes

# Remove misplaced files
rm -f 1                                # Misnamed file
rm -f kube-prometheus-stack.yaml       # 181KB auto-generated
rm -f grafana-tls-secret.yaml          # Should be in proper directory

# Remove duplicate nested directory
rm -rf k8s-lab-cluster/                # 980KB duplicate

# Result: 4.5MB → 3.4MB (removed 1.1MB)

Phase 5: Security Audit

Before committing changes, a thorough security review was conducted.

Files Identified and Removed:

  1. Kubernetes Credentials

    find . -name "kubeconfig" -type f
    find . -name "*kubeconfig*" -type f
    

    All kubeconfig files grant cluster admin access and must never be committed.

  2. Talos API Credentials

    find . -name "talosconfig" -type f
    

    Talosconfig files allow full control of the Talos nodes.

  3. Plain-text Passwords

    grep -r "password:" . | grep -v ".git"
    # Found in Democratic CSI configs: Gl4sh33n*
    

    Several YAML files contained iSCSI storage passwords.

  4. Machine Secrets and Tokens: Talos machine configs contain bootstrap tokens and certificates that should never be shared.

  5. Large Binary Files

    find . -name "*.iso" -o -name "*.raw*" -exec ls -lh {} \;
    # Found: metal-amd64.iso (100MB)
    

Updated .gitignore

A comprehensive .gitignore was created to prevent future leaks:

# Secrets - NEVER commit these
*.key
*.pem
age.agekey
**/age.agekey
kubeconfig
**/kubeconfig
talosconfig
**/talosconfig
*-secrets.yaml
*.enc.yaml

# SOPS decrypted files
*.dec
*.dec.yaml
*-decrypted*

# Democratic CSI (has passwords)
democratic-csi-iscsi.yaml

# Large binary files
*.iso
*.raw
*.raw.xz
*.img
*.qcow2

# Images and screenshots
*.jpg
*.jpeg
*.png
*.gif

# Archive and backups
archive/
backups/

# Build outputs
_out/
*.log

See the actual .gitignore: lab/.gitignore

Phase 6: Documentation and Commit

Create README

A comprehensive README.md was created documenting:

  • Architecture overview
  • Quick start commands
  • How to switch between clusters
  • Management procedures
  • Troubleshooting guides

Commit Changes

# Add gitops as proper submodule
git submodule add git@github.com:t12-pybash/k8s-lab-cluster.git \
  gitops/k8s-lab-cluster

# Commit repository reorganization
git add -A
git commit -m "Reorganize repository into multi-cluster architecture

Restructured the repository to support separate DB and App clusters:
- Renamed talos-cluster → db-cluster for stateful workloads
- Created app-cluster directory for application workloads
- Renamed k8s-lab-cluster → gitops for GitOps manifests
- Removed old deployment files (calico, mealie, standalone yamls)
- Cleaned up gitops directory (removed duplicates and generated files)
- Added comprehensive README with architecture documentation
- Updated .gitignore to exclude secrets and build artifacts"

# Commit gitops submodule cleanup
cd gitops/k8s-lab-cluster
git add -A
git commit -m "Clean up unnecessary files and duplicates"
cd ../..

# Update submodule reference in main repo
git add gitops/k8s-lab-cluster
git commit -m "Update gitops submodule to cleaned version"

# Push all changes
git push origin main

Technology Stack

Component          Version    Purpose
Talos Linux        v1.11.3    Immutable Kubernetes OS
Kubernetes         v1.34.1    Container orchestration
Cilium             v1.16.5    CNI and L2 LoadBalancer
Flux               v2.x       GitOps continuous delivery
Democratic CSI     Latest     iSCSI storage provisioner
External Secrets   Latest     Azure Key Vault integration
SOPS + age         Latest     Secret encryption

Key Decisions

Why Separate Clusters?

Workload Isolation: Databases and stateful applications have different resource requirements and failure modes than stateless apps. Separating them prevents resource contention and limits blast radius.

Resource Management: The DB cluster can use local NVMe storage for high-performance persistent volumes, while the app cluster can focus on compute and memory for application workloads.

Security: A compromised application workload can't directly access database credentials or persistent storage in the separate cluster.

Operational Flexibility: Each cluster can be upgraded, maintained, or modified independently without affecting the other.

Why Single Control Plane for App Cluster?

Cost Efficiency: Running three control plane nodes requires significant resources. For stateless applications, a single control plane with worker redundancy provides adequate availability.

Simplicity: Managing a smaller cluster is operationally simpler, especially during the initial setup phase.

Extensibility: If true control plane HA becomes necessary, additional control plane nodes can be added later.

Risk Profile: The app cluster runs stateless workloads that can be quickly redeployed if the control plane fails, unlike stateful databases.

Why Talos Linux?

Immutability: No SSH access, no package manager, no manual modifications. The entire system is declared via machine configs.

API-Driven: All operations happen through a gRPC API (talosctl), not shell scripts or Ansible playbooks.

Security: Minimal attack surface with an ~80MB OS footprint. No persistent shells or user accounts to compromise.

Kubernetes-Native: Purpose-built for Kubernetes. Every component is designed specifically for running container workloads.

Declarative: Machine config is versioned YAML. Changes are applied atomically, and rolling back is trivial.

Proven Reliability: The three existing Talos nodes in the DB cluster had already proven Talos's reliability. Converting the fourth node from k3s to Talos completed the homogeneous infrastructure.

See the complete setup: github.com/t12-pybash/lab

Final Directory Structure

k8s-lab/
├── .devcontainer/              # VS Code dev container config
├── .git/                       # Git repository
├── .gitignore                  # Comprehensive secrets exclusion
├── README.md                   # Cluster documentation
│
├── app-cluster/                # App cluster configs
│   ├── _out/                   # Generated (gitignored)
│   │   ├── controlplane.yaml
│   │   ├── talosconfig
│   │   └── worker.yaml
│   ├── cilium-patch.yaml
│   └── controlplane-network-patch.yaml
│
├── db-cluster/                 # DB cluster configs
│   ├── patches/
│   │   └── wk5.patch
│   ├── allow-scheduling-patch.yaml
│   ├── azure-cluster-secretstore.yaml
│   ├── azure-secretstore.yaml
│   ├── cilium-l2-config.yaml
│   ├── democratic-csi-talos-overrides.yaml
│   ├── etcd-patch.yaml
│   ├── iscsi-extension-patch.yaml
│   ├── linkding-external-secret.yaml
│   ├── network-patch.yaml
│   ├── schematic.yaml
│   └── wireguard-extension.yaml
│
├── gitops/                     # GitOps manifests (submodule)
│   └── k8s-lab-cluster/
│       ├── apps/               # Application definitions
│       ├── base/               # Base configurations
│       ├── clusters/           # Cluster-specific configs
│       ├── infrastructure/     # Infrastructure components
│       ├── monitoring/         # Monitoring stack
│       └── workspaces/         # Workspace configs
│
├── docs/                       # Documentation
│   └── multi-cluster-setup-guide.md
│
├── archive/                    # Old files (gitignored)
│   ├── calico/
│   ├── mealie/
│   ├── deployments/
│   └── talos-prod-cluster/
│
└── backups/                    # Config backups (gitignored)

Common Operations

Switch Between Clusters

# Use DB cluster
kubectl config use-context "admin@talos cluster"

# Use App cluster
kubectl config use-context "admin@app-cluster"

# View all contexts
kubectl config get-contexts

Access Talos Nodes

# DB cluster nodes
talosctl --talosconfig db-cluster/talosconfig-working \
  --nodes 10.0.0.102 version

# App cluster node
talosctl --talosconfig app-cluster/_out/talosconfig \
  --nodes 10.0.0.115 version

Check Cluster Health

# Kubernetes nodes
kubectl get nodes -o wide

# Talos services
talosctl --nodes <ip> services

# Cilium status
cilium status --wait

Update Talos

talosctl --nodes 10.0.0.115 \
  upgrade --image ghcr.io/siderolabs/installer:v1.11.3

Update Kubernetes

talosctl --nodes 10.0.0.115 \
  upgrade-k8s --to 1.34.1

Lessons Learned

What Worked

USB ISO Boot Method: This proved to be the most reliable way to install Talos on physical hardware. It provides a clean, known-good environment without existing configuration conflicts.

Static IP Addressing: Using static IPs (10.0.0.102-104, 10.0.0.115) eliminated problems with dynamic IP changes and made cluster configuration predictable.

Configuration Patches: Talos's patch system allows overriding specific configuration sections without regenerating entire machine configs. This makes network and CNI customization straightforward.

Separate GitOps Repository: Using a git submodule for GitOps manifests keeps cluster configuration separate from application deployments while maintaining the link between them.

Early Security Audit: Checking for secrets before the first commit prevented credentials from entering the repository history.

What Didn’t Work

Single Control Plane Risk: For the app cluster, a single control plane is an accepted risk since the workloads are stateless. If higher availability is needed, add more control plane nodes using the same process as the DB cluster.

Mixing Workload Types: Keep stateful workloads on the DB cluster with HA. Stateless apps can run on either cluster, but separation provides better resource isolation.

DHCP IP Allocation: Dynamic IPs caused problems when node IPs changed. Static addressing solved this completely.

Best Practices

Always Use Static IPs: Cluster nodes should have predictable, unchanging IP addresses. This simplifies configuration and troubleshooting.

Keep Secrets Out of Git: Use comprehensive .gitignore patterns from the start. Once secrets enter git history, they're difficult to remove completely.
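
If a secret does slip into history, rewriting it is possible but disruptive for anyone who has already cloned the repo. As an illustration (not something that was needed here), git-filter-repo can strip a file from all commits:

# Illustrative only: remove a file from the entire history, then rotate the credential anyway
git filter-repo --invert-paths --path app-cluster/_out/talosconfig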

Separate Cluster Configurations: Distinct directories for each cluster (db-cluster/, app-cluster/) make it clear which configurations apply where.

Archive, Don't Delete: Keep old configurations in an archive/ directory. They're useful for reference and don't hurt anything if gitignored.

Document Network Topology: Clear documentation of IPs, hostnames, and network layout prevents confusion and mistakes.

Use SOPS for GitOps Secrets: Encrypted secrets can be safely stored in git while remaining secure. SOPS with age keys provides a good balance of security and usability.
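
The repo's SOPS rules aren't reproduced here, but a typical .sops.yaml for age-encrypted Kubernetes secrets looks roughly like this (the recipient is a placeholder):

# .sops.yaml (sketch)
creation_rules:
  - path_regex: .*\.enc\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: <your-age-public-key>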

Test in Maintenance Mode: Before bootstrapping, verify all configurations work correctly. It's easier to fix issues when the cluster isn't running yet.
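
talosctl can also lint a machine config before it ever touches a node, which catches schema mistakes early:

# Validate a generated config for bare-metal installs
talosctl validate --config app-cluster/_out/controlplane.yaml --mode metal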

Troubleshooting Reference

Node Won’t Bootstrap

# Check if Kubernetes is running
talosctl --nodes <ip> service kubelet status

# View kubelet logs
talosctl --nodes <ip> logs kubelet

# Check etcd status
talosctl --nodes <ip> service etcd status

Network Connectivity Issues

# Check Cilium
cilium status
cilium connectivity test

# View Cilium logs
kubectl logs -n kube-system -l k8s-app=cilium

# Restart Cilium if needed
kubectl rollout restart daemonset/cilium -n kube-system

Certificate Errors

# Regenerate kubeconfig with new certs
talosctl --nodes <ip> kubeconfig --force

# Check certificate expiry
talosctl --nodes <ip> get certificates

Storage Issues

# Check Democratic CSI
kubectl get pods -n democratic-csi

# View CSI driver logs
kubectl logs -n democratic-csi -l app=democratic-csi

# List storage classes and PVs
kubectl get sc
kubectl get pv

Next Steps

The multi-cluster foundation is now in place. Future enhancements include:

  1. Add VPS Worker Node: Configure a remote VPS to join the app cluster via WireGuard VPN for true workload redundancy (see the sketch after this list).

  2. Deploy Applications: Migrate stateless applications to the app cluster and databases to the DB cluster.

  3. Implement Monitoring: Deploy Prometheus and Grafana for cluster and application metrics.

  4. Configure Ingress: Set up Traefik or Nginx Ingress Controller with TLS certificates.

  5. Backup Strategy: Implement automated backups for etcd and persistent volumes.

  6. CI/CD Integration: Connect Flux to automatically deploy from git commits.
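
Talos supports WireGuard natively in the machine config, so joining a remote VPS would likely be another network patch. The sketch below uses placeholder keys, addresses, and endpoint, not the lab's actual values:

# Sketch of a WireGuard interface for a remote worker (all values are placeholders)
machine:
  network:
    interfaces:
      - interface: wg0
        addresses:
          - 10.10.0.2/24
        wireguard:
          privateKey: <worker-private-key>
          listenPort: 51820
          peers:
            - publicKey: <hub-public-key>
              endpoint: lab.example.com:51820
              persistentKeepaliveInterval: 25s
              allowedIPs:
                - 10.10.0.0/24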

References

Live Implementation

Main Lab Repo: github.com/t12-pybash/lab

GitOps Repo: github.com/t12-pybash/k8s-lab-cluster

Explore the complete four-node Talos setup - from the initial k3s conversion to a multi-cluster production environment.


Tags: Kubernetes, Talos, Multi-Cluster, DevOps, GitOps, Infrastructure, Cilium, Flux