Infrastructure as Code (IaC) Part 3: Load-Balancing, Traffic-Shaping, and Persistent Storage
Part 3 of our Infrastructure as Code series, where we enhance our Kubernetes cluster with production-ready components including MetalLB load balancing, Istio service mesh for traffic management, and persistent storage solutions, all automated through Ansible.

Welcome to the third installment of our Infrastructure as Code series, where we transform our basic Kubernetes cluster into a production-ready platform capable of handling real-world workloads. In the previous articles, we provisioned virtual machines with Terraform and configured a functional Kubernetes cluster using Ansible. While our cluster can run pods and basic services, it lacks critical infrastructure components needed for enterprise applications.
This tutorial bridges that gap by implementing three essential production capabilities: load balancing with MetalLB to expose services externally, persistent storage for stateful applications, and traffic management through Istio service mesh for advanced routing and security. These components represent the foundation of modern cloud-native infrastructure, enabling features like blue-green deployments, A/B testing, automatic failover, and zero-downtime updates.
By the end of this article, you’ll have a decent understanding of how these technologies integrate with Kubernetes and how to automate their deployment using advanced Ansible techniques including group variables, preflight checks, and modular role design. We’ll also introduce concepts like Helm package management and validation playbooks that ensure your infrastructure deployments are both reliable and repeatable.
As always, you can find all the code examples and configuration files from this tutorial in our GitHub repository.
Prerequisites and Current State
This tutorial continues directly from where we left off in Part 2: Configuration Management with Ansible. You should have a fully functional 5-node Kubernetes cluster (1 master, 4 workers) deployed and configured through our Terraform and Ansible automation pipeline.
If you haven’t completed the previous parts of this series, I strongly recommend starting with Part 1: Introduction to Terraform and working through the complete sequence. The infrastructure components we’re adding in this tutorial depend on the specific cluster configuration, networking setup, and SSH key management established in those earlier tutorials.
Required Prerequisites:
- Completed Terraform infrastructure from Part 1 (5 VMs with static IPs)
- Functional Kubernetes cluster from Part 2 (kubeadm-based installation)
- Ansible control environment with kubernetes.core collection installed
- Master node with administrative kubeconfig in /home/ubuntu/.kube/config
- All worker nodes joined to the cluster and in Ready state
You can verify your cluster is ready by running:
kubectl get nodes
You should see all 5 nodes in Ready status. If any nodes show NotReady or if kubectl commands fail, review Part 2 to ensure your cluster deployment completed successfully.
Project Directory
To begin, your project directory should look like this:
infrastructure-as-code/
├── Makefile
├── introduction-to-terraform/
│   ├── main.tf              # Primary resource definitions
│   ├── variables.tf         # Input variable declarations
│   ├── outputs.tf           # Output value definitions
│   ├── locals.tf            # Local value computations
│   └── cloud-init/          # VM initialization templates
│       ├── user-data.tpl    # User and SSH configuration
│       └── network-config.tpl  # Static IP configuration
└── configuration-with-ansible/
├── ansible.cfg # Ansible configuration file with SSH settings and output formatting
├── generate_inventory.sh # Script to parse Terraform output and generate Ansible inventory
├── inventory.ini # Generated inventory file (created by generate_inventory.sh)
├── site.yml # Main Ansible playbook that orchestrates all roles
└── roles/ # Directory containing all Ansible roles
├── common/ # Role for common tasks across all nodes
│ └── tasks/
│ └── main.yml # Disables swap, loads kernel modules, sets sysctl parameters
├── containerd/ # Role for container runtime installation and configuration
│ └── tasks/
│ └── main.yml # Installs containerd and configures systemd cgroup driver
├── kubernetes/ # Role for Kubernetes component installation
│ └── tasks/
│ └── main.yml # Installs kubelet, kubeadm, kubectl with version pinning
├── control-plane/ # Role for Kubernetes master node setup
│ └── tasks/
│ └── main.yml # Runs kubeadm init, sets up kubeconfig, installs Calico CNI
└── worker/ # Role for Kubernetes worker node setup
└── tasks/
└── main.yml # Joins worker nodes to the cluster using kubeadm join
Note: All code files referenced from this point on are located within the configuration-with-ansible/ folder.
Group Variables
Group variables in Ansible provide a powerful mechanism for defining configuration values that apply to multiple hosts or entire environments. Unlike host-specific variables that only affect individual machines, group variables allow you to establish consistent settings across logical groups of infrastructure components.
The group_vars/all.yml file is particularly important because it defines variables that apply to every host in your inventory, regardless of their group membership. This is ideal for cluster-wide configuration like Kubernetes versions, network ranges, and component settings that must remain synchronized across all nodes.
Group variables also promote infrastructure as code best practices by centralizing configuration management. Rather than hardcoding values throughout your playbooks and roles, you define them once in a logical location where they can be easily reviewed, updated, and version-controlled. This approach reduces configuration drift and makes it easier to deploy consistent environments across development, staging, and production.
Let’s define some group variables. Start by creating group_vars/all.yml:
---
# Cluster Configuration
cluster_domain: "local"
kubernetes_version: "1.28.*"
# Component Versions
helm_version: "3.18.0"
istio_version: "1.22.3"
# Network Configuration
metallb_ip_addresses:
- "192.168.122.200-192.168.122.220"
metallb_pool_name: "default-pool"
# Security Configuration
enable_pod_security_policies: true
enable_network_policies: true
Preflight Checks
Preflight checks are validation routines that verify system readiness before attempting complex deployments. They represent a critical best practice in infrastructure automation, catching potential problems early when they’re easier and cheaper to resolve. Rather than discovering resource constraints or configuration conflicts halfway through a multi-hour deployment, preflight checks fail fast with clear error messages.
In Kubernetes environments, preflight checks are particularly valuable because the platform has specific requirements for memory, CPU, disk space, network configuration, and kernel modules. These checks also validate that required network ports are available and that conflicting services aren’t already running.
The benefits of implementing preflight checks include:
- Reduced deployment failures by catching issues before they cause partial deployments
- Faster troubleshooting through specific error messages rather than generic deployment failures
- Improved reliability by ensuring consistent environmental prerequisites across all deployments
- Better user experience by providing actionable feedback when prerequisites aren’t met
Let’s add a new file, roles/preflight/tasks/main.yml:
---
# roles/preflight/tasks/main.yml
- name: Check minimum system requirements
assert:
that:
- ansible_memtotal_mb >= 2048
- ansible_processor_vcpus >= 2
- ansible_architecture == "x86_64"
fail_msg: "Insufficient resources: need 2GB RAM, 2 CPU cores, and x86_64 architecture"
- name: Check disk space
assert:
that:
- ansible_mounts | selectattr('mount', 'equalto', '/') | map(attribute='size_available') | first > 10000000000
fail_msg: "Insufficient disk space: need at least 20GB free space on root partition"
- name: Verify required ports are not in use
wait_for:
port: "{{ item }}"
state: stopped
timeout: 1
host: "{{ ansible_default_ipv4.address }}"
loop:
- 6443 # Kubernetes API
- 10250 # kubelet
- 10259 # kube-scheduler
- 10257 # kube-controller-manager
- 2379 # etcd
- 2380 # etcd
ignore_errors: true
register: port_check
- name: Report port conflicts
ansible.builtin.debug:
msg: "WARNING: Port {{ item.item }} appears to be in use"
when: item.failed is defined and item.failed
loop: "{{ port_check.results }}"
- name: Check container runtime prerequisites
ansible.builtin.command:
cmd: "{{ item }}"
loop:
- "modinfo overlay"
- "modinfo br_netfilter"
register: kernel_modules
failed_when: false
changed_when: false
- name: Verify kernel modules
assert:
that:
- kernel_modules.results[0].rc == 0
- kernel_modules.results[1].rc == 0
fail_msg: "Required kernel modules (overlay, br_netfilter) are not available"
- name: Check SELinux status
ansible.builtin.command:
cmd: getenforce
register: selinux_status
failed_when: false
changed_when: false
- name: Warn about SELinux
ansible.builtin.debug:
msg: "WARNING: SELinux is {{ selinux_status.stdout }}. Consider setting to permissive for Kubernetes"
when: selinux_status.rc == 0 and selinux_status.stdout == "Enforcing"
This basic preflight role validates multiple aspects of system readiness for Kubernetes deployment:
- Resource Validation: The assert tasks check that each node meets minimum hardware requirements (2GB RAM, 2 CPU cores, x86_64 architecture) and has sufficient disk space. These are hard requirements for Kubernetes functionality, and deployment will fail gracefully if they’re not met.
- Port Conflict Detection: The wait_for task checks whether critical Kubernetes ports are already in use. While the task uses ignore_errors: true to prevent immediate failure, it registers results that are evaluated in subsequent tasks to provide warnings about potential conflicts.
- Kernel Module Prerequisites: Kubernetes requires specific kernel modules for container networking (br_netfilter) and overlay filesystem support (overlay). The preflight check verifies these modules are available before attempting container runtime installation.
- Security Context Awareness: The SELinux check provides important feedback about security contexts that can interfere with Kubernetes operations. While not automatically remediated, this information helps administrators make informed decisions about security configuration.
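The same pattern extends to any fact Ansible gathers. As a minimal sketch (ansible_swaptotal_mb is a standard Ansible fact; the message wording is ours), a warning about swap would mirror the SELinux check above, since kubelet refuses to start while swap is active and the common role only disables it later in the run:

- name: Warn if swap is still enabled
  ansible.builtin.debug:
    msg: "WARNING: Swap is enabled ({{ ansible_swaptotal_mb }} MB); the common role will disable it, but kubelet will not start while swap is active"
  when: ansible_swaptotal_mb | int > 0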
Package Management with Helm
Helm serves as the “package manager for Kubernetes,” providing a standardized way to define, install, and manage complex applications on Kubernetes clusters. Think of Helm as the equivalent of apt, yum, or homebrew, but specifically designed for Kubernetes resources and applications.
Traditional Kubernetes deployments require manually managing multiple YAML files for different resources (deployments, services, configmaps, secrets, etc.). Helm packages these resources into “charts” that can be installed, upgraded, and rolled back as a single unit. This is particularly valuable for complex applications like monitoring stacks, databases, or service meshes that involve dozens of interconnected Kubernetes resources.
We’re using Helm in this tutorial because both MetalLB and Istio provide official Helm charts that significantly simplify their installation and configuration. Rather than manually downloading and applying multiple YAML manifests, Helm allows us to install these complex systems with a single command while maintaining the ability to customize their configuration through values files.
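To make the values-file idea concrete, here is a hedged sketch of what a chart installation looks like through Ansible’s kubernetes.core.helm module; the repository, chart, and values shown are illustrative placeholders rather than anything we install in this tutorial:

- name: Install an example chart with custom values (illustrative)
  kubernetes.core.helm:
    name: example-release                  # Helm release name (placeholder)
    chart_ref: example-repo/example-chart  # repository/chart (placeholder)
    release_namespace: example
    create_namespace: true
    wait: true
    values:                                # overrides the chart's default values.yaml
      replicaCount: 2
      service:
        type: ClusterIP

The metallb role later in this article uses exactly this module, just without custom values.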
Let’s add a role for Helm. Start by creating roles/helm/defaults/main.yml:
---
# Which Helm version to install (no leading "v")
helm_version: "3.18.0"
# Where to install the helm binary
helm_install_dir: "/usr/local/bin"
Install Helm via Tasks
Now, create roles/helm/tasks/main.yml:
---
# roles/helm/tasks/main.yml
- name: Check if Helm is already installed
ansible.builtin.stat:
path: "{{ helm_install_dir }}/helm"
register: helm_binary
- name: Download Helm tarball
ansible.builtin.get_url:
url: "https://get.helm.sh/helm-v{{ helm_version }}-linux-amd64.tar.gz"
dest: "/tmp/helm-v{{ helm_version }}.tar.gz"
mode: '0644'
when: not helm_binary.stat.exists
- name: Extract Helm binary from archive
ansible.builtin.unarchive:
src: "/tmp/helm-v{{ helm_version }}.tar.gz"
dest: "/tmp"
remote_src: yes
creates: "/tmp/linux-amd64/helm"
when: not helm_binary.stat.exists
- name: Install Helm executable to {{ helm_install_dir }}
ansible.builtin.copy:
src: "/tmp/linux-amd64/helm"
dest: "{{ helm_install_dir }}/helm"
mode: '0755'
remote_src: yes
when: not helm_binary.stat.exists
This Helm installation role demonstrates several important Ansible patterns for managing binary installations:
- Idempotency: The role first checks whether Helm is already installed using ansible.builtin.stat. This prevents unnecessary downloads and installations on subsequent playbook runs, making the automation more efficient and reliable.
- Download and Extraction: The role downloads the official Helm release tarball and extracts it to a temporary directory. Using remote_src: yes with the unarchive module tells Ansible that the source file is already on the remote host rather than copying it from the control machine.
- Installation: The final task copies the Helm binary to the system PATH location with appropriate executable permissions. This makes Helm available to all users and subsequent automation tasks.
- Conditional Execution: All download and installation tasks use when: not helm_binary.stat.exists to ensure they only run when Helm isn’t already present, demonstrating proper idempotent automation design.
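If you want the role to confirm its own work, you could append something along these lines (a sketch, not part of the role above; helm version --short prints a single version string such as v3.18.0):

- name: Report the installed Helm version
  ansible.builtin.command:
    cmd: "{{ helm_install_dir }}/helm version --short"
  register: helm_installed_version
  changed_when: false

- name: Show the installed Helm version
  ansible.builtin.debug:
    msg: "Helm {{ helm_installed_version.stdout }} is available at {{ helm_install_dir }}/helm"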
Network Load-Balancing with MetalLB
MetalLB is a load balancer implementation specifically designed for bare-metal Kubernetes clusters that don’t have access to cloud provider load balancers like AWS ELB or Google Cloud Load Balancer. In traditional cloud environments, creating a Service with type: LoadBalancer automatically provisions an external load balancer. In bare-metal or homelab environments, these services remain in “Pending” state indefinitely because no load balancer implementation is available.
MetalLB solves this problem by providing a software-based load balancer that can assign external IP addresses to services and announce those IPs to the local network. It operates in two primary modes: Layer 2 (L2) mode uses ARP/NDP announcements to make services accessible, while BGP mode integrates with network routers for more sophisticated routing scenarios.
For our homelab environment, L2 mode is ideal because it requires no special network configuration and works with standard home network equipment. MetalLB will assign IP addresses from a pool we define (192.168.122.200-220 in our configuration) and announce these IPs on the local network, making services accessible from outside the Kubernetes cluster.
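For illustration, this is the kind of Service that sits in Pending forever on bare metal today, but will receive an address from our pool once MetalLB is running (the name, selector, and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: demo-web              # placeholder service name
spec:
  type: LoadBalancer          # MetalLB watches for Services of this type
  selector:
    app: demo-web             # matches the pods you want to expose
  ports:
    - port: 80                # external port on the assigned IP
      targetPort: 8080        # container port behind the Service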
Let’s add a role for MetalLB. Start by creating roles/metallb/defaults/main.yml:
---
# roles/metallb/defaults/main.yml
# Address range(s) for your LoadBalancer services
metallb_ip_addresses:
- "192.168.122.200-192.168.122.220"
# Name of the IPPool
metallb_pool_name: default-pool
Isn’t that Redundant?
You might notice that we’re defining the same MetalLB variables in both group_vars/all.yml and roles/metallb/defaults/main.yml. This apparent redundancy actually represents an Ansible best practice that promotes role portability and maintainable automation.
Role defaults serve as fallback values that ensure a role can function independently, even if group variables aren’t defined. This makes roles more portable between different projects and environments. When you share or reuse a role, the defaults ensure it will work without requiring specific variable definitions in the consuming project.
The precedence order in Ansible means that group variables will override role defaults when both are present. This allows you to define environment-specific values in group variables while maintaining sensible defaults within the role itself. It’s similar to function parameters with default values in programming languages—the defaults provide safety while explicit parameters allow customization.
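To see the precedence in action, imagine the same variable defined at three levels; the group variable overrides the role default, and an extra-vars flag on the command line overrides both (the staging-pool value is just an example):

# roles/metallb/defaults/main.yml  -- role default (lowest precedence)
metallb_pool_name: default-pool

# group_vars/all.yml               -- group variable (overrides the role default)
metallb_pool_name: default-pool

# command line                     -- extra vars (highest precedence of the three)
#   ansible-playbook -i inventory.ini site.yml -e metallb_pool_name=staging-pool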
Define the Tasks
Now create roles/metallb/tasks/main.yml:
---
# roles/metallb/tasks/main.yml
- name: Add MetalLB Helm repository
kubernetes.core.helm_repository:
name: metallb
repo_url: https://metallb.github.io/metallb
state: present
- name: Create metallb-system namespace
kubernetes.core.k8s:
api_version: v1
kind: Namespace
name: metallb-system
state: present
- name: Label metallb-system for privileged Pod Security
kubernetes.core.k8s:
api_version: v1
kind: Namespace
name: metallb-system
merge_type: strategic-merge
definition:
metadata:
labels:
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/warn: privileged
- name: Install MetalLB chart via Helm
kubernetes.core.helm:
name: metallb
chart_ref: metallb/metallb
release_namespace: metallb-system
create_namespace: false
wait: true
state: present
- name: Wait for MetalLB Controller to be ready
ansible.builtin.command:
cmd: kubectl rollout status deployment/metallb-controller -n metallb-system --timeout=300s
changed_when: false
- name: Wait for MetalLB Speaker to be ready
ansible.builtin.command:
cmd: kubectl rollout status daemonset/metallb-speaker -n metallb-system --timeout=300s
changed_when: false
- name: Wait for MetalLB webhook to be ready
kubernetes.core.k8s_info:
api_version: v1
kind: Endpoints
name: metallb-webhook-service
namespace: metallb-system
register: webhook_ep
until: webhook_ep.resources | length > 0 and webhook_ep.resources[0].subsets is defined and webhook_ep.resources[0].subsets | length > 0
retries: 30
delay: 5
changed_when: false
- name: Configure MetalLB IPAddressPool
kubernetes.core.k8s:
state: present
definition:
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: "{{ metallb_pool_name }}"
namespace: metallb-system
spec:
addresses: "{{ metallb_ip_addresses }}"
- name: Configure MetalLB L2Advertisement
kubernetes.core.k8s:
state: present
definition:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
name: default
namespace: metallb-system
spec:
ipAddressPools:
- "{{ metallb_pool_name }}"
This MetalLB deployment role demonstrates several advanced Kubernetes automation patterns:
Helm Repository Management: The role adds the official MetalLB Helm repository, enabling access to regularly updated charts. This is preferable to downloading static YAML files because it ensures you can easily upgrade to newer versions.
Namespace and Security Configuration: The role creates the metallb-system namespace and applies appropriate Pod Security labels. In modern Kubernetes clusters with Pod Security Standards enabled, this labeling is crucial for allowing MetalLB’s privileged operations.
Orchestrated Installation: The Helm installation uses wait: true to block until all MetalLB components are successfully deployed. This prevents race conditions in subsequent tasks that depend on MetalLB being operational.
Readiness Verification: The multiple wait tasks ensure that all MetalLB components (controller, speaker, webhook) are fully ready before proceeding. This is particularly important for the webhook, which must be operational before creating MetalLB custom resources.
Resource Configuration: The final tasks create MetalLB-specific Kubernetes resources (IPAddressPool and L2Advertisement) that configure how MetalLB should behave. These resources tell MetalLB which IP addresses to assign to services and how to announce them on the network.
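Once this role has run, any Service of type LoadBalancer (such as the demo-web example sketched earlier) should receive an address from the pool within a few seconds, which is easy to confirm from the master node:

# EXTERNAL-IP should show an address from 192.168.122.200-220 instead of <pending>
kubectl get svc demo-web

# The pool and advertisement created by the role are visible as custom resources
kubectl get ipaddresspools.metallb.io,l2advertisements.metallb.io -n metallb-system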
Persistent Storage
Persistent storage in Kubernetes addresses one of the fundamental challenges in containerized environments: data persistence. By default, containers are ephemeral—when a pod is destroyed, any data stored within the container’s filesystem is lost forever. This works well for stateless applications but poses significant challenges for databases, file servers, and other applications that need to maintain data across pod restarts and rescheduling.
Kubernetes provides several persistent storage solutions through the concept of Persistent Volumes (PV) and Persistent Volume Claims (PVC). A PV represents a piece of networked storage that has been provisioned by an administrator or dynamically provisioned using Storage Classes. A PVC is a request for storage by a user, similar to how pods consume node resources.
Storage Classes enable dynamic provisioning by defining different “classes” of storage with specific characteristics like performance, backup policies, or replication levels. When a PVC references a Storage Class, Kubernetes automatically creates a suitable Persistent Volume to satisfy the request.
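As a concrete illustration (the name and size are placeholders), a claim like this asks for 5Gi from whatever Storage Class is marked as the cluster default, and dynamic provisioning creates the backing volume without any administrator involvement:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data             # placeholder claim name
spec:
  accessModes:
    - ReadWriteOnce           # mounted read-write by a single node at a time
  resources:
    requests:
      storage: 5Gi
  # storageClassName is omitted, so the cluster's default Storage Class is used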
Benefits of implementing persistent storage include:
- Data durability across pod lifecycle events (restarts, rescheduling, scaling)
- Application state preservation for databases and stateful services
- Development environment consistency by maintaining data between cluster rebuilds
- Production readiness for applications requiring data persistence
Local Path Provisioner
The Local Path Provisioner is a simple yet effective storage solution that creates persistent volumes using local storage on Kubernetes nodes. Rather than requiring complex network-attached storage or cloud provider integrations, it dynamically provisions storage by creating directories on the local filesystem of worker nodes.
This approach is particularly well-suited for development environments, testing clusters, and homelab deployments where high-availability storage isn’t required but basic persistence is needed. When a PVC is created, the Local Path Provisioner automatically creates a directory on a node and mounts it into the requesting pod.
While local path storage doesn’t provide the redundancy or accessibility of networked storage solutions, it offers several advantages for our use case: zero external dependencies, simple configuration, and excellent performance since data access doesn’t involve network overhead. For learning and development purposes, it provides an ideal foundation for understanding Kubernetes storage concepts.
Let’s add a role for persistent storage. Start by creating roles/storage/tasks/main.yml:
---
# roles/storage/tasks/main.yml
- name: Install Local Path Provisioner
kubernetes.core.k8s:
state: present
src: https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
run_once: true
- name: Set local-path as the default StorageClass
  kubernetes.core.k8s:
    state: present
    api_version: storage.k8s.io/v1
    kind: StorageClass
    name: local-path
definition:
metadata:
annotations:
storageclass.kubernetes.io/is-default-class: "true"
run_once: true
This storage role accomplishes two critical tasks with minimal complexity:
- Dynamic Provisioner Installation: The first task installs the Local Path Provisioner by applying its official Kubernetes manifests directly from the Rancher GitHub repository. This approach ensures we’re always using the most current stable version while avoiding the complexity of managing local manifest files.
- Default Storage Class Configuration: The second task configures the local-path Storage Class as the default for the cluster by adding the storageclass.kubernetes.io/is-default-class: "true" annotation. This means that any PVC that doesn’t explicitly specify a Storage Class will automatically use local path provisioning.
The run_once: true directive ensures these operations are only performed once per playbook run, regardless of how many master nodes are in the inventory. This prevents conflicts and unnecessary duplicate operations in multi-master configurations.
Service Mesh with Istio
Before diving into the installation, it’s important to understand what we’re implementing. A service mesh is an infrastructure layer that handles service-to-service communication in a microservices architecture. Rather than requiring each service to implement its own communication, security, and observability logic, the service mesh provides these capabilities through a network of lightweight proxies.
Core Components of Istio
Data Plane: Consists of sidecar proxies (based on Envoy) that are deployed alongside each service instance. These proxies intercept all network traffic to and from the service, providing features like load balancing, circuit breaking, encryption, and telemetry collection.
Control Plane: The central management layer that configures and manages the sidecar proxies. In the Istio versions we’re installing, this is a single component, Istiod, which handles configuration, service discovery, and certificate management, consolidating what used to be separate components:
- Pilot: Service discovery and traffic management
- Citadel: Identity and security management
- Galley: Configuration management and validation
Key Benefits for Our Environment:
- Traffic Management: Intelligent routing, load balancing, and failover without code changes
- Security: Mutual TLS encryption between services and fine-grained access control
- Observability: Distributed tracing, metrics, and logging for all service interactions
- Resilience: Circuit breakers, timeouts, and retry policies to handle failures gracefully
This architecture allows us to add sophisticated networking capabilities to our existing microservices without modifying the application code, making it perfect for our environment.
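All of these capabilities are configured through Kubernetes custom resources rather than application code. As a hedged sketch of the traffic-management side (the service name and subsets are placeholders, and the DestinationRule that defines the subsets is omitted), a weighted VirtualService like this splits traffic 90/10 between two versions of a workload:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: demo-web                # placeholder name
spec:
  hosts:
    - demo-web                  # the Kubernetes Service receiving traffic
  http:
    - route:
        - destination:
            host: demo-web
            subset: v1          # subsets come from a DestinationRule (not shown)
          weight: 90
        - destination:
            host: demo-web
            subset: v2
          weight: 10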
Installing Istio
First, let’s create roles/istio/defaults/main.yml:
---
# roles/istio/defaults/main.yml
# Which version of Istio to install
istio_version: "1.22.3"
# Profile to install (default, minimal, demo, etc)
istio_profile: "default"
# Where to unpack istioctl
istio_install_dir: "/usr/local/istio-{{ istio_version }}"
Now, define the tasks in roles/istio/tasks/main.yml:
---
# roles/istio/tasks/main.yml
- name: Download Istio release archive
ansible.builtin.get_url:
url: "https://github.com/istio/istio/releases/download/{{ istio_version }}/istio-{{ istio_version }}-linux-amd64.tar.gz"
dest: "/tmp/istio-{{ istio_version }}.tar.gz"
mode: '0644'
- name: Unpack istioctl
ansible.builtin.unarchive:
src: "/tmp/istio-{{ istio_version }}.tar.gz"
dest: "/usr/local"
remote_src: yes
creates: "{{ istio_install_dir }}/bin/istioctl"
- name: Ensure istioctl is on PATH (symlink)
ansible.builtin.file:
src: "{{ istio_install_dir }}/bin/istioctl"
dest: /usr/local/bin/istioctl
state: link
- name: Check if Istio is already installed
kubernetes.core.k8s_info:
kind: Deployment
name: istiod
namespace: istio-system
register: istiod_deployment
ignore_errors: true
- name: Create istio-system namespace
kubernetes.core.k8s:
api_version: v1
kind: Namespace
name: istio-system
state: present
when: istiod_deployment.resources | default([]) | length == 0
- name: Install Istio control plane using generated manifest
ansible.builtin.shell:
cmd: >
istioctl manifest generate --set profile={{ istio_profile }} | kubectl apply -f -
when: istiod_deployment.resources | default([]) | length == 0
register: istio_install_result
changed_when: "'created' in istio_install_result.stdout or 'configured' in istio_install_result.stdout"
- name: Wait for Istiod deployment to be ready
ansible.builtin.command:
cmd: kubectl rollout status deployment/istiod -n istio-system --timeout=300s
changed_when: false
- name: Wait for Istio Ingress Gateway deployment to be ready
ansible.builtin.command:
cmd: kubectl rollout status deployment/istio-ingressgateway -n istio-system --timeout=300s
changed_when: false
This Istio installation role demonstrates several sophisticated automation techniques:
- Binary Management: The role downloads and installs the istioctl command-line tool, which is the primary interface for managing Istio installations. The symlink approach ensures the tool is available in the system PATH while maintaining version-specific installation directories.
- Idempotent Installation: Before installing Istio, the role checks whether the istiod deployment already exists in the istio-system namespace. This prevents duplicate installations and allows the playbook to be run safely multiple times.
- Manifest-Based Installation: Rather than using Helm charts, this role uses istioctl manifest generate to create Kubernetes manifests and applies them directly. This approach provides more control over the installation process and better compatibility with automated environments.
- Rollout Status Verification: The final tasks wait for both the main control plane (istiod) and the ingress gateway deployments to reach ready status. This ensures that Istio is fully operational before the playbook completes or proceeds to dependent tasks.
This installation method balances automation with control, providing a reliable way to deploy Istio while maintaining visibility into the installation process.
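One thing this role intentionally leaves out is sidecar injection for your own workloads: Istio only injects the Envoy proxy into pods whose namespace carries the istio-injection=enabled label, so each application namespace still needs something like the following (the namespace name is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: demo-apps               # placeholder application namespace
  labels:
    istio-injection: enabled    # tells Istio to inject sidecars into new pods here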
Validating Your Infrastructure
Infrastructure validation represents a critical but often overlooked aspect of automation pipelines. While deployment automation focuses on provisioning and configuring resources, validation automation ensures that the resulting infrastructure actually works as expected. This is particularly important in complex systems like Kubernetes clusters where components have intricate dependencies and failure modes.
Validation playbooks serve multiple purposes beyond simple success confirmation. They provide rapid feedback during troubleshooting, enable confidence in automated deployments, and create a foundation for monitoring and alerting systems. In production environments, validation checks can be extended to include performance benchmarks, security posture assessments, and compliance verification.
The validation role we’re implementing checks multiple layers of the infrastructure stack:
- Node-level health: Ensuring all Kubernetes nodes are in Ready state and can communicate with the control plane
- System component functionality: Verifying that critical pods in the kube-system namespace are running correctly
- Application component readiness: Confirming that MetalLB, Istio, and storage systems are operational and properly configured
- Service availability: Validating that load balancing and persistent storage are actually functional for workloads
This multi-layered approach provides confidence that the infrastructure can support application workloads reliably.
Create a new role for validation at roles/validation/tasks/main.yml:
---
# roles/validation/tasks/main.yml
- name: Verify cluster nodes are ready
ansible.builtin.command:
cmd: kubectl get nodes --no-headers
register: nodes_status
changed_when: false
failed_when: "'NotReady' in nodes_status.stdout"
- name: Verify critical pods are running
ansible.builtin.command:
cmd: kubectl get pods -n kube-system --field-selector=status.phase!=Running --no-headers
register: failed_pods
changed_when: false
failed_when: failed_pods.stdout_lines | length > 0
- name: Verify Istio installation
kubernetes.core.k8s_info:
api_version: apps/v1
kind: Deployment
name: istiod
namespace: istio-system
register: istio_status
failed_when: >
istio_status.resources | length == 0 or
istio_status.resources[0].status.readyReplicas != istio_status.resources[0].status.replicas
- name: Verify MetalLB is operational
kubernetes.core.k8s_info:
api_version: apps/v1
kind: Deployment
name: metallb-controller
namespace: metallb-system
register: metallb_status
failed_when: >
metallb_status.resources | length == 0 or
metallb_status.resources[0].status.readyReplicas != metallb_status.resources[0].status.replicas
- name: Verify persistent storage is available
  kubernetes.core.k8s_info:
    api_version: storage.k8s.io/v1
    kind: StorageClass
    name: local-path
register: storage_class_status
failed_when: >
storage_class_status.resources | length == 0 or
storage_class_status.resources[0].metadata.get('annotations', {}).get('storageclass.kubernetes.io/is-default-class') != 'true'
- name: Display validation results
ansible.builtin.debug:
msg:
- "Cluster validation completed:"
- "Nodes ready: {{ nodes_status.stdout_lines | length }}"
- "Istio ready: {{ istio_status.resources[0].status.readyReplicas | default(0) }}/{{ istio_status.resources[0].status.replicas | default(0) }}"
- "MetalLB ready: {{ metallb_status.resources[0].status.readyReplicas | default(0) }}/{{ metallb_status.resources[0].status.replicas | default(0) }}"
- "Persistent Storage (StorageClass) ready: {{ 'yes' if storage_class_status.resources | length > 0 else 'no' }}"
This validation role implements health checking across multiple infrastructure layers:
- Cluster Health Verification: The first task ensures all Kubernetes nodes are in Ready state by checking the output of kubectl get nodes. Any node showing NotReady status will cause the playbook to fail with a clear error message, immediately highlighting cluster-level issues.
- System Component Status: The second check looks for any pods in the critical kube-system namespace that aren’t in Running state. Since these pods provide essential cluster services (DNS, CNI, etc.), any failures here indicate serious cluster problems.
- Application Component Health: The Istio and MetalLB checks use the kubernetes.core.k8s_info module to query deployment status directly from the Kubernetes API. These checks verify not just that the deployments exist, but that they have the expected number of ready replicas.
- Storage Availability: The storage validation confirms that the Local Path Provisioner Storage Class is available and properly configured as the default. This ensures that applications requesting persistent storage will be able to obtain it.
- Reporting: The final debug task provides a summary of all validation results, giving administrators a quick overview of infrastructure health. This approach makes it easy to identify which components need attention if validation fails.
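These checks confirm that the components exist, but they stop short of proving that storage actually provisions volumes. As a hedged extension (the resource names, namespace, and image are illustrative), a functional test could create a small PVC plus a pod that mounts it and wait for the claim to bind; the pod is needed because the local-path Storage Class binds volumes only when a consumer is scheduled (WaitForFirstConsumer):

- name: Create a test PVC (functional storage check)
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: validation-test-pvc        # illustrative name
        namespace: default
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Mi

- name: Create a pod that mounts the test PVC
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: validation-test-pod        # illustrative name
        namespace: default
      spec:
        containers:
          - name: sleeper
            image: busybox:1.36          # any small image works here
            command: ["sleep", "600"]
            volumeMounts:
              - name: data
                mountPath: /data
        volumes:
          - name: data
            persistentVolumeClaim:
              claimName: validation-test-pvc

- name: Wait for the test PVC to become Bound
  kubernetes.core.k8s_info:
    api_version: v1
    kind: PersistentVolumeClaim
    name: validation-test-pvc
    namespace: default
  register: test_pvc
  until: test_pvc.resources | length > 0 and test_pvc.resources[0].status.phase == 'Bound'
  retries: 30
  delay: 5

- name: Clean up the test pod and PVC
  kubernetes.core.k8s:
    state: absent
    api_version: v1
    kind: "{{ item.kind }}"
    name: "{{ item.name }}"
    namespace: default
  loop:
    - { kind: Pod, name: validation-test-pod }
    - { kind: PersistentVolumeClaim, name: validation-test-pvc }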
Updating our Playbook
Our enhanced site.yml playbook introduces several advanced Ansible concepts that weren’t present in the simpler version from Part 2. Understanding these concepts is crucial for building maintainable, scalable automation pipelines.
- Tags and Role Organization: Tags provide fine-grained control over which parts of a playbook execute. Instead of running the entire playbook every time, you can target specific components (e.g., --tags networking to only run MetalLB tasks). This is invaluable during development and troubleshooting when you need to iterate on specific components.
- Environment Variables: The environment section sets up the necessary context for Kubernetes operations. The KUBECONFIG variable tells kubectl and other tools where to find cluster credentials, while K8S_AUTH_KUBECONFIG provides the same information to Ansible’s Kubernetes modules.
- Variable Scoping: The vars section demonstrates how to define play-specific variables that override role defaults. This allows you to customize behavior without modifying role files directly.
- Logical Play Organization: The playbook is structured in logical phases (preflight, setup, infrastructure, validation) that mirror real-world deployment workflows. This organization makes it easier to understand dependencies and troubleshoot issues when they occur.
Update site.yml with the new plays and roles, and add tags to the existing ones:
---
- name: Pre-flight checks
hosts: all
gather_facts: true
roles:
- { role: preflight, tags: ['preflight', 'validation'] }
- name: Common setup
hosts: all
become: true
roles:
- { role: common, tags: ['common', 'setup'] }
- { role: containerd, tags: ['containerd', 'container-runtime'] }
- { role: kubernetes, tags: ['kubernetes', 'k8s'] }
- name: Control plane setup
hosts: masters
become: true
roles:
- { role: control-plane, tags: ['control-plane', 'masters'] }
- name: Worker nodes setup
hosts: workers
become: true
roles:
- { role: worker, tags: ['worker', 'nodes'] }
- name: Infrastructure and applications
hosts: masters
become: true
vars:
kubeconfig_path: /home/ubuntu/.kube/config
environment:
K8S_AUTH_KUBECONFIG: "{{ kubeconfig_path }}"
KUBECONFIG: "{{ kubeconfig_path }}"
PATH: "/usr/local/bin:{{ ansible_env.PATH }}"
roles:
- { role: helm, tags: ['helm', 'tools'] }
- { role: metallb, tags: ['metallb', 'networking', 'load-balancer'] }
- { role: storage, tags: ['storage', 'persistent-storage'] }
- { role: istio, tags: ['istio', 'service-mesh'] }
- name: Validation and health checks
hosts: masters
become: true
vars:
kubeconfig_path: /home/ubuntu/.kube/config
environment:
K8S_AUTH_KUBECONFIG: "{{ kubeconfig_path }}"
KUBECONFIG: "{{ kubeconfig_path }}"
PATH: "/usr/local/bin:{{ ansible_env.PATH }}"
roles:
- { role: validation, tags: ['validation', 'health-check'] }
Example Usage
The enhanced playbook with tags and role organization enables flexible execution patterns that are essential for real-world infrastructure management:
Complete Infrastructure Deployment:
# Run the entire playbook (same as before)
ANSIBLE_CONFIG=configuration-with-ansible/ansible.cfg ansible-playbook \
-i configuration-with-ansible/inventory.ini \
configuration-with-ansible/site.yml
Targeted Component Installation:
# Install only networking components (MetalLB)
ansible-playbook -i inventory.ini site.yml --tags networking
# Deploy only the service mesh
ansible-playbook -i inventory.ini site.yml --tags istio
# Run validation checks without changing anything
ansible-playbook -i inventory.ini site.yml --tags validation
Development and Troubleshooting:
# Skip preflight checks during development
ansible-playbook -i inventory.ini site.yml --skip-tags preflight
# Install tools and storage without service mesh
ansible-playbook -i inventory.ini site.yml --tags helm,storage
# Run only infrastructure components (skip basic cluster setup)
ansible-playbook -i inventory.ini site.yml --tags metallb,storage,istio
Multi-Environment Management:
# Target specific environments using different inventory files
ansible-playbook -i production-inventory.ini site.yml --tags validation
ansible-playbook -i staging-inventory.ini site.yml --tags networking,storage
This flexibility allows you to iterate quickly during development, perform targeted updates in production, and troubleshoot specific components without affecting the entire infrastructure.
Conclusion
In this third installment of our Infrastructure as Code series, we’ve transformed a basic Kubernetes cluster into a production-ready platform with essential enterprise capabilities. By implementing MetalLB for load balancing, Local Path Provisioner for persistent storage, and Istio for advanced traffic management, we’ve created a foundation that can support real-world application workloads.
The automation techniques we’ve explored (group variables, preflight checks, Helm integration, and basic validation) represent industry best practices that scale from homelab environments to enterprise deployments. The modular role structure and tag-based execution provide the flexibility needed to manage complex infrastructure while maintaining reliability and repeatability.
Our enhanced Ansible playbook now demonstrates sophisticated infrastructure automation patterns including environment-specific configuration, conditional task execution, and multi-layered validation. These skills are directly applicable to production environments where reliability, auditability, and maintainability are critical requirements.
In Part 4 of this series, we’ll complete our Infrastructure as Code pipeline by implementing GitHub Actions workflows that automatically trigger our Terraform and Ansible automation in response to pull requests and code changes. This will demonstrate how to build a complete CI/CD pipeline for infrastructure that provides the same level of automation and quality control typically reserved for application code.
The combination of version-controlled infrastructure definitions, automated testing and validation, and GitOps-style deployment workflows represents the pinnacle of modern DevOps practices, enabling teams to manage infrastructure with the same rigor and efficiency they apply to software development.
Further Learning Resources
To deepen your understanding of the technologies and concepts covered in this tutorial, here are recommended resources for continued learning:
Ansible Advanced Topics:
- Ansible Best Practices - Official guide to structuring and organizing automation content
- Ansible Galaxy - Community repository of reusable roles and collections
- Ansible Vault - Encrypting sensitive data in playbooks and variable files
MetalLB Load Balancing:
- MetalLB Official Documentation - Comprehensive guide to configuration options and deployment modes
- MetalLB Concepts - Deep dive into L2 vs BGP modes and their trade-offs
- Kubernetes Load Balancer Services - Understanding how LoadBalancer services work
Istio Service Mesh:
- Istio Official Documentation - Complete reference for Istio concepts and configuration
- Istio Traffic Management - Advanced routing, load balancing, and traffic splitting
- Istio Security - Mutual TLS, authorization policies, and security best practices
- Envoy Proxy Documentation - Understanding the underlying proxy technology
Kubernetes Storage:
- Kubernetes Storage Concepts - Comprehensive guide to PVs, PVCs, and Storage Classes
- Container Storage Interface (CSI) - Modern standard for storage driver implementation
- Kubernetes Storage Best Practices - Production considerations for storage design
Infrastructure as Code:
- The Terraform Book - Comprehensive guide to infrastructure provisioning
- Ansible for DevOps - Practical automation patterns and best practices
- Infrastructure as Code Patterns - Design patterns for scalable automation

Aaron Mathis
Systems administrator and software engineer specializing in cloud development, AI/ML, and modern web technologies. Passionate about building scalable solutions and sharing knowledge with the developer community.