DeepThought.sh
Infrastructure

Infrastructure as Code (IaC) Part 3: Load-Balancing, Traffic-Shaping, and Persistent Storage

Part 3 of our Infrastructure as Code series, where we enhance our Kubernetes cluster with production-ready components including MetalLB load balancing, Istio service mesh for traffic management, and persistent storage solutions, all automated through Ansible.

Aaron Mathis

Welcome to the third installment of our Infrastructure as Code series, where we transform our basic Kubernetes cluster into a production-ready platform capable of handling real-world workloads. In the previous articles, we provisioned virtual machines with Terraform and configured a functional Kubernetes cluster using Ansible. While our cluster can run pods and basic services, it lacks critical infrastructure components needed for enterprise applications.

This tutorial bridges that gap by implementing three essential production capabilities: load balancing with MetalLB to expose services externally, persistent storage for stateful applications, and traffic management through Istio service mesh for advanced routing and security. These components represent the foundation of modern cloud-native infrastructure, enabling features like blue-green deployments, A/B testing, automatic failover, and zero-downtime updates.

By the end of this article, you’ll have a decent understanding of how these technologies integrate with Kubernetes and how to automate their deployment using advanced Ansible techniques including group variables, preflight checks, and modular role design. We’ll also introduce concepts like Helm package management and validation playbooks that ensure your infrastructure deployments are both reliable and repeatable.

As always, you can find all the code examples and configuration files from this tutorial in our GitHub repository.


Prerequisites and Current State

This tutorial continues directly from where we left off in Part 2: Configuration Management with Ansible. You should have a fully functional 5-node Kubernetes cluster (1 master, 4 workers) deployed and configured through our Terraform and Ansible automation pipeline.

If you haven’t completed the previous parts of this series, I strongly recommend starting with Part 1: Introduction to Terraform and working through the complete sequence. The infrastructure components we’re adding in this tutorial depend on the specific cluster configuration, networking setup, and SSH key management established in those earlier tutorials.

Required Prerequisites:

  • Completed Terraform infrastructure from Part 1 (5 VMs with static IPs)
  • Functional Kubernetes cluster from Part 2 (kubeadm-based installation)
  • Ansible control environment with kubernetes.core collection installed
  • Master node with administrative kubeconfig in /home/ubuntu/.kube/config
  • All worker nodes joined to the cluster and in Ready state

You can verify your cluster is ready by running:

kubectl get nodes

You should see all 5 nodes in Ready status. If any nodes show NotReady or if kubectl commands fail, review Part 2 to ensure your cluster deployment completed successfully.

Project Directory

To begin, your project directory should look like this:

infrastructure-as-code/
├── Makefile
├── introduction-to-terraform/
|   ├── main.tf           # Primary resource definitions
|   ├── variables.tf      # Input variable declarations
|   ├── outputs.tf        # Output value definitions
|   ├── locals.tf         # Local value computations
|   └── cloud-init/       # VM initialization templates
|       ├── user-data.tpl     # User and SSH configuration
|       └── network-config.tpl # Static IP configuration
└── configuration-with-ansible/
    ├── ansible.cfg                    # Ansible configuration file with SSH settings and output formatting
    ├── generate_inventory.sh          # Script to parse Terraform output and generate Ansible inventory
    ├── inventory.ini                  # Generated inventory file (created by generate_inventory.sh)
    ├── site.yml                       # Main Ansible playbook that orchestrates all roles
    └── roles/                         # Directory containing all Ansible roles
        ├── common/                    # Role for common tasks across all nodes
        |   └── tasks/
        |       └── main.yml           # Disables swap, loads kernel modules, sets sysctl parameters
        ├── containerd/                # Role for container runtime installation and configuration
        |   └── tasks/
        |       └── main.yml           # Installs containerd and configures systemd cgroup driver
        ├── kubernetes/                # Role for Kubernetes component installation
        |   └── tasks/
        |       └── main.yml           # Installs kubelet, kubeadm, kubectl with version pinning
        ├── control-plane/             # Role for Kubernetes master node setup
        |   └── tasks/
        |       └── main.yml           # Runs kubeadm init, sets up kubeconfig, installs Calico CNI
        └── worker/                    # Role for Kubernetes worker node setup
            └── tasks/
                └── main.yml           # Joins worker nodes to the cluster using kubeadm join

Note: All code files referenced from this point on are located within the configuration-with-ansible/ folder.


Group Variables

Group variables in Ansible provide a powerful mechanism for defining configuration values that apply to multiple hosts or entire environments. Unlike host-specific variables that only affect individual machines, group variables allow you to establish consistent settings across logical groups of infrastructure components.

The group_vars/all.yml file is particularly important because it defines variables that apply to every host in your inventory, regardless of their group membership. This is ideal for cluster-wide configuration like Kubernetes versions, network ranges, and component settings that must remain synchronized across all nodes.

Group variables also promote infrastructure as code best practices by centralizing configuration management. Rather than hardcoding values throughout your playbooks and roles, you define them once in a logical location where they can be easily reviewed, updated, and version-controlled. This approach reduces configuration drift and makes it easier to deploy consistent environments across development, staging, and production.

Let’s define some group variables. Start by creating group_vars/all.yml:

---
# Cluster Configuration
cluster_domain: "local"
kubernetes_version: "1.28.*"

# Component Versions
helm_version: "3.18.0"
istio_version: "1.22.3"

# Network Configuration
metallb_ip_addresses: 
  - "192.168.122.200-192.168.122.220"
metallb_pool_name: "default-pool"

# Security Configuration
enable_pod_security_policies: true
enable_network_policies: true

Preflight Checks

Preflight checks are validation routines that verify system readiness before attempting complex deployments. They represent a critical best practice in infrastructure automation, catching potential problems early when they’re easier and cheaper to resolve. Rather than discovering resource constraints or configuration conflicts halfway through a multi-hour deployment, preflight checks fail fast with clear error messages.

In Kubernetes environments, preflight checks are particularly valuable because the platform has specific requirements for memory, CPU, disk space, network configuration, and kernel modules. These checks also validate that required network ports are available and that conflicting services aren’t already running.

The benefits of implementing preflight checks include:

  • Reduced deployment failures by catching issues before they cause partial deployments
  • Faster troubleshooting through specific error messages rather than generic deployment failures
  • Improved reliability by ensuring consistent environmental prerequisites across all deployments
  • Better user experience by providing actionable feedback when prerequisites aren’t met

Let’s add a new file roles/preflight/tasks/main.yml:

---
# roles/preflight/tasks/main.yml

- name: Check minimum system requirements
  assert:
    that:
      - ansible_memtotal_mb >= 2048
      - ansible_processor_vcpus >= 2
      - ansible_architecture == "x86_64"
    fail_msg: "Insufficient resources: need 2GB RAM, 2 CPU cores, and x86_64 architecture"

- name: Check disk space
  assert:
    that:
      - ansible_mounts | selectattr('mount', 'equalto', '/') | map(attribute='size_available') | first > 10000000000
    fail_msg: "Insufficient disk space: need at least 10GB free space on root partition"

- name: Verify required ports are not in use
  wait_for:
    port: "{{ item }}"
    state: stopped
    timeout: 1
    host: "{{ ansible_default_ipv4.address }}"
  loop:
    - 6443  # Kubernetes API
    - 10250 # kubelet
    - 10259 # kube-scheduler
    - 10257 # kube-controller-manager
    - 2379  # etcd
    - 2380  # etcd
  ignore_errors: true
  register: port_check

- name: Report port conflicts
  ansible.builtin.debug:
    msg: "WARNING: Port {{ item.item }} appears to be in use"
  when: item.failed is defined and item.failed
  loop: "{{ port_check.results }}"

- name: Check container runtime prerequisites
  ansible.builtin.command:
    cmd: "{{ item }}"
  loop:
    - "modinfo overlay"
    - "modinfo br_netfilter"
  register: kernel_modules
  failed_when: false
  changed_when: false

- name: Verify kernel modules
  assert:
    that:
      - kernel_modules.results[0].rc == 0
      - kernel_modules.results[1].rc == 0
    fail_msg: "Required kernel modules (overlay, br_netfilter) are not available"

- name: Check SELinux status
  ansible.builtin.command:
    cmd: getenforce
  register: selinux_status
  failed_when: false
  changed_when: false

- name: Warn about SELinux
  ansible.builtin.debug:
    msg: "WARNING: SELinux is {{ selinux_status.stdout }}. Consider setting to permissive for Kubernetes"
  when: selinux_status.rc == 0 and selinux_status.stdout == "Enforcing"

This basic preflight role validates multiple aspects of system readiness for Kubernetes deployment:

  • Resource Validation: The assert tasks check that each node meets minimum hardware requirements (2GB RAM, 2 CPU cores, x86_64 architecture) and has sufficient disk space. These are hard requirements for Kubernetes functionality, and deployment will fail gracefully if they’re not met.
  • Port Conflict Detection: The wait_for task checks whether critical Kubernetes ports are already in use. While the task uses ignore_errors: true to prevent immediate failure, it registers results that are evaluated in subsequent tasks to provide warnings about potential conflicts.
  • Kernel Module Prerequisites: Kubernetes requires specific kernel modules for container networking (br_netfilter) and overlay filesystem support (overlay). The preflight check verifies these modules are available before attempting container runtime installation.
  • Security Context Awareness: The SELinux check provides important feedback about security contexts that can interfere with Kubernetes operations. While not automatically remediated, this information helps administrators make informed decisions about security configuration.
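If you want to sanity-check a node by hand before running the role, the same conditions can be probed with ordinary shell commands. The script below is an illustrative sketch (not part of the role) mirroring the disk and port checks; the bash `/dev/tcp` trick succeeds only when something is listening, which is exactly the condition we want to flag:

```shell
#!/usr/bin/env bash
# Manual spot-checks mirroring the preflight role (illustrative sketch).

# Disk space: at least 10 GB free on the root partition,
# matching the assert threshold in the role
avail=$(df --output=avail -B1 / | tail -1)
if [ "$avail" -ge 10000000000 ]; then
  echo "disk: OK ($avail bytes free)"
else
  echo "disk: insufficient ($avail bytes free)"
fi

# Port check: the /dev/tcp connect fails when nothing is listening,
# which is the state we want before kubeadm claims these ports
for port in 6443 10250 2379; do
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    echo "port $port: IN USE"
  else
    echo "port $port: free"
  fi
done
```

These checks are advisory only; the Ansible role remains the source of truth during deployment.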

Package Management with Helm

Helm serves as the “package manager for Kubernetes,” providing a standardized way to define, install, and manage complex applications on Kubernetes clusters. Think of Helm as the equivalent of apt, yum, or homebrew, but specifically designed for Kubernetes resources and applications.

Traditional Kubernetes deployments require manually managing multiple YAML files for different resources (deployments, services, configmaps, secrets, etc.). Helm packages these resources into “charts” that can be installed, upgraded, and rolled back as a single unit. This is particularly valuable for complex applications like monitoring stacks, databases, or service meshes that involve dozens of interconnected Kubernetes resources.

We’re using Helm in this tutorial because both MetalLB and Istio provide official Helm charts that significantly simplify their installation and configuration. Rather than manually downloading and applying multiple YAML manifests, Helm allows us to install these complex systems with a single command while maintaining the ability to customize their configuration through values files.
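For context, here is roughly what the Helm role below automates, expressed as manual steps (the version is an example; adjust to match your helm_version variable):

```shell
# Manual equivalent of the Helm role below (illustrative; version is an example)
HELM_VERSION=3.18.0
INSTALL_DIR=/usr/local/bin      # matches helm_install_dir in the role
curl -fsSL "https://get.helm.sh/helm-v${HELM_VERSION}-linux-amd64.tar.gz" \
  -o "/tmp/helm-v${HELM_VERSION}.tar.gz"
tar -xzf "/tmp/helm-v${HELM_VERSION}.tar.gz" -C /tmp            # unpacks to /tmp/linux-amd64/
sudo install -m 0755 /tmp/linux-amd64/helm "$INSTALL_DIR/helm"  # copy onto PATH
helm version --short                                            # verify the install
```

The Ansible role adds idempotency on top of these steps, skipping the download entirely when the binary already exists.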

Let’s add a role for helm. Start by creating roles/helm/defaults/main.yml:

---
# Which Helm version to install (no leading "v")
helm_version: "3.18.0"

# Where to install the helm binary
helm_install_dir: "/usr/local/bin"

Install Helm via Tasks

Now, create roles/helm/tasks/main.yml:

---
# roles/helm/tasks/main.yml
- name: Check if Helm is already installed
  ansible.builtin.stat:
    path: "{{ helm_install_dir }}/helm"
  register: helm_binary

- name: Download Helm tarball
  ansible.builtin.get_url:
    url: "https://get.helm.sh/helm-v{{ helm_version }}-linux-amd64.tar.gz"
    dest: "/tmp/helm-v{{ helm_version }}.tar.gz"
    mode: '0644'
  when: not helm_binary.stat.exists

- name: Extract Helm binary from archive
  ansible.builtin.unarchive:
    src: "/tmp/helm-v{{ helm_version }}.tar.gz"
    dest: "/tmp"
    remote_src: yes
    creates: "/tmp/linux-amd64/helm"
  when: not helm_binary.stat.exists

- name: Install Helm executable to {{ helm_install_dir }}
  ansible.builtin.copy:
    src: "/tmp/linux-amd64/helm"
    dest: "{{ helm_install_dir }}/helm"
    mode: '0755'
    remote_src: yes
  when: not helm_binary.stat.exists

This Helm installation role demonstrates several important Ansible patterns for managing binary installations:

  • Idempotency: The role first checks whether Helm is already installed using ansible.builtin.stat. This prevents unnecessary downloads and installations on subsequent playbook runs, making the automation more efficient and reliable.
  • Download and Extraction: The role downloads the official Helm release tarball and extracts it to a temporary directory. Using remote_src: yes with the unarchive module tells Ansible that the source file is already on the remote host rather than copying it from the control machine.
  • Installation: The final task copies the Helm binary to the system PATH location with appropriate executable permissions. This makes Helm available to all users and subsequent automation tasks.
  • Conditional Execution: All download and installation tasks use when: not helm_binary.stat.exists to ensure they only run when Helm isn’t already present, demonstrating proper idempotent automation design.

Network Load-Balancing with MetalLB

MetalLB is a load balancer implementation specifically designed for bare-metal Kubernetes clusters that don’t have access to cloud provider load balancers like AWS ELB or Google Cloud Load Balancer. In traditional cloud environments, creating a Service with type: LoadBalancer automatically provisions an external load balancer. In bare-metal or homelab environments, these services remain in “Pending” state indefinitely because no load balancer implementation is available.

MetalLB solves this problem by providing a software-based load balancer that can assign external IP addresses to services and announce those IPs to the local network. It operates in two primary modes: Layer 2 (L2) mode uses ARP/NDP announcements to make services accessible, while BGP mode integrates with network routers for more sophisticated routing scenarios.

For our homelab environment, L2 mode is ideal because it requires no special network configuration and works with standard home network equipment. MetalLB will assign IP addresses from a pool we define (192.168.122.200-220 in our configuration) and announce these IPs on the local network, making services accessible from outside the Kubernetes cluster.
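Once MetalLB is running, any Service of type: LoadBalancer receives an address from that pool automatically. As a concrete illustration, a hypothetical nginx Service might look like this (the names are examples, not part of this tutorial's roles):

```
# Example only: a Service that MetalLB will assign an external IP to
apiVersion: v1
kind: Service
metadata:
  name: demo-nginx        # hypothetical workload name
spec:
  type: LoadBalancer      # without MetalLB this would stay "Pending"
  selector:
    app: demo-nginx
  ports:
    - port: 80
      targetPort: 80
```

After applying a manifest like this, kubectl get svc demo-nginx should show an EXTERNAL-IP drawn from the 192.168.122.200-220 range.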

Let’s add a role for metallb. Start by creating roles/metallb/defaults/main.yml:

---
# roles/metallb/defaults/main.yml
# Address range(s) for your LoadBalancer services
metallb_ip_addresses: 
  - "192.168.122.200-192.168.122.220"
# Name of the IPPool
metallb_pool_name: default-pool

Isn’t that Redundant?

You might notice that we’re defining the same MetalLB variables in both group_vars/all.yml and roles/metallb/defaults/main.yml. This apparent redundancy actually represents an Ansible best practice that promotes role portability and maintainable automation.

Role defaults serve as fallback values that ensure a role can function independently, even if group variables aren’t defined. This makes roles more portable between different projects and environments. When you share or reuse a role, the defaults ensure it will work without requiring specific variable definitions in the consuming project.

The precedence order in Ansible means that group variables will override role defaults when both are present. This allows you to define environment-specific values in group variables while maintaining sensible defaults within the role itself. It’s similar to function parameters with default values in programming languages—the defaults provide safety while explicit parameters allow customization.
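As a concrete illustration of the precedence rules, suppose the two files disagree (the values here are hypothetical):

```
# roles/metallb/defaults/main.yml — lowest precedence, the safety net
metallb_pool_name: default-pool

# group_vars/all.yml — overrides the role default for this environment
metallb_pool_name: production-pool
```

A play that runs the metallb role against this inventory resolves {{ metallb_pool_name }} to production-pool; delete the group_vars entry and it falls back to default-pool. An ad-hoc -e metallb_pool_name=test-pool on the command line would override both, since extra vars sit at the top of Ansible's precedence chain.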

Define the Task

Now, create roles/metallb/tasks/main.yml:

---
# roles/metallb/tasks/main.yml

- name: Add MetalLB Helm repository
  kubernetes.core.helm_repository:
    name: metallb
    repo_url: https://metallb.github.io/metallb
    state: present

- name: Create metallb-system namespace
  kubernetes.core.k8s:
    api_version: v1
    kind: Namespace
    name: metallb-system
    state: present

- name: Label metallb-system for privileged Pod Security
  kubernetes.core.k8s:
    api_version: v1
    kind: Namespace
    name: metallb-system
    merge_type: strategic-merge
    definition:
      metadata:
        labels:
          pod-security.kubernetes.io/enforce: privileged
          pod-security.kubernetes.io/audit: privileged
          pod-security.kubernetes.io/warn: privileged

- name: Install MetalLB chart via Helm
  kubernetes.core.helm:
    name: metallb
    chart_ref: metallb/metallb
    release_namespace: metallb-system
    create_namespace: false
    wait: true
    state: present

- name: Wait for MetalLB Controller to be ready
  ansible.builtin.command:
    cmd: kubectl rollout status deployment/metallb-controller -n metallb-system --timeout=300s
  changed_when: false

- name: Wait for MetalLB Speaker to be ready
  ansible.builtin.command:
    cmd: kubectl rollout status daemonset/metallb-speaker -n metallb-system --timeout=300s
  changed_when: false

- name: Wait for MetalLB webhook to be ready
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Endpoints
    name: metallb-webhook-service
    namespace: metallb-system
  register: webhook_ep
  until: webhook_ep.resources | length > 0 and webhook_ep.resources[0].subsets is defined and webhook_ep.resources[0].subsets | length > 0
  retries: 30
  delay: 5
  changed_when: false

- name: Configure MetalLB IPAddressPool
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: metallb.io/v1beta1
      kind: IPAddressPool
      metadata:
        name: "{{ metallb_pool_name }}"
        namespace: metallb-system
      spec:
        addresses: "{{ metallb_ip_addresses }}"

- name: Configure MetalLB L2Advertisement
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: metallb.io/v1beta1
      kind: L2Advertisement
      metadata:
        name: default
        namespace: metallb-system
      spec:
        ipAddressPools:
          - "{{ metallb_pool_name }}"

This MetalLB deployment role demonstrates several advanced Kubernetes automation patterns:

Helm Repository Management: The role adds the official MetalLB Helm repository, enabling access to regularly updated charts. This is preferable to downloading static YAML files because it ensures you can easily upgrade to newer versions.

Namespace and Security Configuration: The role creates the metallb-system namespace and applies appropriate Pod Security labels. In modern Kubernetes clusters with Pod Security Standards enabled, this labeling is crucial for allowing MetalLB’s privileged operations.

Orchestrated Installation: The Helm installation uses wait: true to block until all MetalLB components are successfully deployed. This prevents race conditions in subsequent tasks that depend on MetalLB being operational.

Readiness Verification: The multiple wait tasks ensure that all MetalLB components (controller, speaker, webhook) are fully ready before proceeding. This is particularly important for the webhook, which must be operational before creating MetalLB custom resources.

Resource Configuration: The final tasks create MetalLB-specific Kubernetes resources (IPAddressPool and L2Advertisement) that configure how MetalLB should behave. These resources tell MetalLB which IP addresses to assign to services and how to announce them on the network.


Persistent Storage

Persistent storage in Kubernetes addresses one of the fundamental challenges in containerized environments: data persistence. By default, containers are ephemeral—when a pod is destroyed, any data stored within the container’s filesystem is lost forever. This works well for stateless applications but poses significant challenges for databases, file servers, and other applications that need to maintain data across pod restarts and rescheduling.

Kubernetes provides several persistent storage solutions through the concept of Persistent Volumes (PV) and Persistent Volume Claims (PVC). A PV represents a piece of networked storage that has been provisioned by an administrator or dynamically provisioned using Storage Classes. A PVC is a request for storage by a user, similar to how pods consume node resources.

Storage Classes enable dynamic provisioning by defining different “classes” of storage with specific characteristics like performance, backup policies, or replication levels. When a PVC references a Storage Class, Kubernetes automatically creates a suitable Persistent Volume to satisfy the request.

Benefits of implementing persistent storage include:

  • Data durability across pod lifecycle events (restarts, rescheduling, scaling)
  • Application state preservation for databases and stateful services
  • Development environment consistency by maintaining data between cluster rebuilds
  • Production readiness for applications requiring data persistence

Local Path Provisioner

The Local Path Provisioner is a simple yet effective storage solution that creates persistent volumes using local storage on Kubernetes nodes. Rather than requiring complex network-attached storage or cloud provider integrations, it dynamically provisions storage by creating directories on the local filesystem of worker nodes.

This approach is particularly well-suited for development environments, testing clusters, and homelab deployments where high-availability storage isn’t required but basic persistence is needed. When a PVC is created, the Local Path Provisioner automatically creates a directory on a node and mounts it into the requesting pod.

While local path storage doesn’t provide the redundancy or accessibility of networked storage solutions, it offers several advantages for our use case: zero external dependencies, simple configuration, and excellent performance since data access doesn’t involve network overhead. For learning and development purposes, it provides an ideal foundation for understanding Kubernetes storage concepts.

Let’s add a role for persistent storage. Start by creating roles/storage/tasks/main.yml:

---
# roles/storage/tasks/main.yml

- name: Install Local Path Provisioner
  kubernetes.core.k8s:
    state: present
    src: https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
  run_once: true

- name: Set local-path as the default StorageClass
  kubernetes.core.k8s:
    state: present
    kind: StorageClass
    name: local-path
    definition:
      metadata:
        annotations:
          storageclass.kubernetes.io/is-default-class: "true"
  run_once: true

This storage role accomplishes two critical tasks with minimal complexity:

  • Dynamic Provisioner Installation: The first task installs the Local Path Provisioner by applying its official Kubernetes manifests directly from the Rancher GitHub repository. Pinning the manifest to a specific release (v0.0.26) keeps the deployment reproducible while avoiding the complexity of managing local manifest files.
  • Default Storage Class Configuration: The second task configures the local-path Storage Class as the default for the cluster by adding the storageclass.kubernetes.io/is-default-class: "true" annotation. This means that any PVC that doesn’t explicitly specify a Storage Class will automatically use local path provisioning.

The run_once: true directive ensures these operations are only performed once per playbook run, regardless of how many master nodes are in the inventory. This prevents conflicts and unnecessary duplicate operations in multi-master configurations.
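With local-path set as the default, a PVC that names no storageClassName is satisfied automatically. A minimal example (the claim name is illustrative):

```
# Example PVC: no storageClassName, so the default (local-path) is used
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data          # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce        # local-path volumes live on a single node
  resources:
    requests:
      storage: 1Gi
```

Note that the provisioner's StorageClass uses WaitForFirstConsumer binding, so the claim typically remains Pending until a pod actually mounts it; this is expected behavior, not an error.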


Service Mesh with Istio

Before diving into the installation, it’s important to understand what we’re implementing. A service mesh is an infrastructure layer that handles service-to-service communication in a microservices architecture. Rather than requiring each service to implement its own communication, security, and observability logic, the service mesh provides these capabilities through a network of lightweight proxies.

Core Components of Istio

Data Plane: Consists of sidecar proxies (based on Envoy) that are deployed alongside each service instance. These proxies intercept all network traffic to and from the service, providing features like load balancing, circuit breaking, encryption, and telemetry collection.

Control Plane: The central management layer that configures and manages the sidecar proxies. Since Istio 1.5, the formerly separate control plane services have been consolidated into a single component:

  • Istiod: The main control plane binary, which handles configuration distribution, service discovery and traffic management (formerly Pilot), identity and certificate management (formerly Citadel), and configuration validation (formerly Galley)

Key Benefits for Our Environment:

  • Traffic Management: Intelligent routing, load balancing, and failover without code changes
  • Security: Mutual TLS encryption between services and fine-grained access control
  • Observability: Distributed tracing, metrics, and logging for all service interactions
  • Resilience: Circuit breakers, timeouts, and retry policies to handle failures gracefully

This architecture allows us to add sophisticated networking capabilities to our existing microservices without modifying the application code, making it perfect for our environment.
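Sidecars are not injected everywhere by default; Istio only adds a proxy to pods in namespaces that opt in. Once Istio is installed (below), a namespace can be enrolled with a single label (the namespace name here is an example):

```
# Opting a namespace into automatic sidecar injection (example namespace)
apiVersion: v1
kind: Namespace
metadata:
  name: demo-apps                 # hypothetical application namespace
  labels:
    istio-injection: enabled      # istiod injects Envoy sidecars here
```

Pods created in the namespace after labeling receive an Envoy sidecar container automatically; pods that already existed must be restarted to pick one up.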

Installing Istio

First, let’s create roles/istio/defaults/main.yml:

---
# roles/istio/defaults/main.yml

# Which version of Istio to install
istio_version: "1.22.3"

# Profile to install (default, minimal, demo, etc)
istio_profile: "default"

# Where to unpack istioctl
istio_install_dir: "/usr/local/istio-{{ istio_version }}"

Now, define the task in roles/istio/tasks/main.yml:

---
# roles/istio/tasks/main.yml

- name: Download Istio release archive
  ansible.builtin.get_url:
    url: "https://github.com/istio/istio/releases/download/{{ istio_version }}/istio-{{ istio_version }}-linux-amd64.tar.gz"
    dest: "/tmp/istio-{{ istio_version }}.tar.gz"
    mode: '0644'

- name: Unpack istioctl
  ansible.builtin.unarchive:
    src: "/tmp/istio-{{ istio_version }}.tar.gz"
    dest: "/usr/local"
    remote_src: yes
    creates: "{{ istio_install_dir }}/bin/istioctl"

- name: Ensure istioctl is on PATH (symlink)
  ansible.builtin.file:
    src: "{{ istio_install_dir }}/bin/istioctl"
    dest: /usr/local/bin/istioctl
    state: link

- name: Check if Istio is already installed
  kubernetes.core.k8s_info:
    kind: Deployment
    name: istiod
    namespace: istio-system
  register: istiod_deployment
  ignore_errors: true

- name: Create istio-system namespace
  kubernetes.core.k8s:
    api_version: v1
    kind: Namespace
    name: istio-system
    state: present
  when: istiod_deployment.resources | default([]) | length == 0

- name: Install Istio control plane using generated manifest
  ansible.builtin.shell:
    cmd: >
      istioctl manifest generate --set profile={{ istio_profile }} | kubectl apply -f -
  when: istiod_deployment.resources | default([]) | length == 0
  register: istio_install_result
  changed_when: "'created' in istio_install_result.stdout or 'configured' in istio_install_result.stdout"

- name: Wait for Istiod deployment to be ready
  ansible.builtin.command:
    cmd: kubectl rollout status deployment/istiod -n istio-system --timeout=300s
  changed_when: false

- name: Wait for Istio Ingress Gateway deployment to be ready
  ansible.builtin.command:
    cmd: kubectl rollout status deployment/istio-ingressgateway -n istio-system --timeout=300s
  changed_when: false

This Istio installation role demonstrates several sophisticated automation techniques:

  • Binary Management: The role downloads and installs the istioctl command-line tool, which is the primary interface for managing Istio installations. The symlink approach ensures the tool is available in the system PATH while maintaining version-specific installation directories.
  • Idempotent Installation: Before installing Istio, the role checks whether the istiod deployment already exists in the istio-system namespace. This prevents duplicate installations and allows the playbook to be run safely multiple times.
  • Manifest-Based Installation: Rather than using Helm charts, this role uses istioctl manifest generate to create Kubernetes manifests and applies them directly. This approach provides more control over the installation process and better compatibility with automated environments.
  • Rollout Status Verification: The final tasks wait for both the main control plane (istiod) and the ingress gateway deployments to reach ready status. This ensures that Istio is fully operational before the playbook completes or proceeds to dependent tasks.

This installation method balances automation with control, providing a reliable way to deploy Istio while maintaining visibility into the installation process.


Validating Your Infrastructure

Infrastructure validation represents a critical but often overlooked aspect of automation pipelines. While deployment automation focuses on provisioning and configuring resources, validation automation ensures that the resulting infrastructure actually works as expected. This is particularly important in complex systems like Kubernetes clusters where components have intricate dependencies and failure modes.

Validation playbooks serve multiple purposes beyond simple success confirmation. They provide rapid feedback during troubleshooting, enable confidence in automated deployments, and create a foundation for monitoring and alerting systems. In production environments, validation checks can be extended to include performance benchmarks, security posture assessments, and compliance verification.

The validation role we’re implementing checks multiple layers of the infrastructure stack:

  • Node-level health: Ensuring all Kubernetes nodes are in Ready state and can communicate with the control plane
  • System component functionality: Verifying that critical pods in the kube-system namespace are running correctly
  • Application component readiness: Confirming that MetalLB, Istio, and storage systems are operational and properly configured
  • Service availability: Validating that load balancing and persistent storage are actually functional for workloads

This multi-layered approach provides confidence that the infrastructure can support application workloads reliably.

Create a new role for validation at roles/validation/tasks/main.yml:

---
# roles/validation/tasks/main.yml

- name: Verify cluster nodes are ready
  ansible.builtin.command:
    cmd: kubectl get nodes --no-headers
  register: nodes_status
  changed_when: false
  failed_when: "'NotReady' in nodes_status.stdout"

- name: Verify critical pods are running
  ansible.builtin.command:
    cmd: kubectl get pods -n kube-system --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers
  register: failed_pods
  changed_when: false
  failed_when: failed_pods.stdout_lines | length > 0

- name: Verify Istio installation
  kubernetes.core.k8s_info:
    api_version: apps/v1
    kind: Deployment
    name: istiod
    namespace: istio-system
  register: istio_status
  failed_when: >
    istio_status.resources | length == 0 or
    istio_status.resources[0].status.readyReplicas != istio_status.resources[0].status.replicas

- name: Verify MetalLB is operational
  kubernetes.core.k8s_info:
    api_version: apps/v1
    kind: Deployment
    name: metallb-controller
    namespace: metallb-system
  register: metallb_status
  failed_when: >
    metallb_status.resources | length == 0 or
    metallb_status.resources[0].status.readyReplicas != metallb_status.resources[0].status.replicas

- name: Verify persistent storage is available
  kubernetes.core.k8s_info:
    api_version: storage.k8s.io/v1
    kind: StorageClass
    name: local-path
  register: storage_class_status
  failed_when: >
    storage_class_status.resources | length == 0 or
    storage_class_status.resources[0].metadata.get('annotations', {}).get('storageclass.kubernetes.io/is-default-class') != 'true'

- name: Display validation results
  ansible.builtin.debug:
    msg:
      - "Cluster validation completed:"
      - "Nodes ready: {{ nodes_status.stdout_lines | length }}"
      - "Istio ready: {{ istio_status.resources[0].status.readyReplicas | default(0) }}/{{ istio_status.resources[0].status.replicas | default(0) }}"
      - "MetalLB ready: {{ metallb_status.resources[0].status.readyReplicas | default(0) }}/{{ metallb_status.resources[0].status.replicas | default(0) }}"
      - "Persistent Storage (StorageClass) ready: {{ 'yes' if storage_class_status.resources | length > 0 else 'no' }}"

This validation role implements health checking across multiple infrastructure layers:

  • Cluster Health Verification: The first task ensures all Kubernetes nodes are in Ready state by checking the output of kubectl get nodes. Any node showing NotReady status will cause the playbook to fail with a clear error message, immediately highlighting cluster-level issues.
  • System Component Status: The second check looks for any pods in the critical kube-system namespace that aren’t in Running state. Since these pods provide essential cluster services (DNS, CNI, etc.), any failures here indicate serious cluster problems.
  • Application Component Health: The Istio and MetalLB checks use the kubernetes.core.k8s_info module to query deployment status directly from the Kubernetes API. These checks verify not just that the deployments exist, but that they have the expected number of ready replicas.
  • Storage Availability: The storage validation confirms that the Local Path Provisioner storage class is available and properly configured as the default. This ensures that applications requesting persistent storage will be able to obtain it.
  • Reporting: The final debug task provides a summary of all validation results, giving administrators a quick overview of infrastructure health. This approach makes it easy to identify which components need attention if validation fails.
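These checks confirm that components exist and report ready, which stops just short of the "actually functional" bar described earlier. As a hedged sketch (the PVC name and size are made up for illustration), a functional storage smoke test could be appended to the role:

```yaml
# Hypothetical functional smoke test: request a PVC against the default class.
# Note: local-path uses WaitForFirstConsumer binding by default, so the claim
# stays Pending until a pod consumes it; this sketch only verifies that the
# API accepts the claim, not that a volume binds.
- name: Create a throwaway PVC against the default storage class
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: validation-smoke-pvc   # hypothetical name
        namespace: default
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Mi

- name: Clean up the smoke-test PVC
  kubernetes.core.k8s:
    state: absent
    api_version: v1
    kind: PersistentVolumeClaim
    name: validation-smoke-pvc
    namespace: default
```

A fuller end-to-end test would mount the claim from a short-lived pod and request a LoadBalancer service from MetalLB, at the cost of longer validation runs.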

Updating Our Playbook

Our enhanced site.yml playbook introduces several advanced Ansible concepts that weren’t present in the simpler version from Part 2. Understanding these concepts is crucial for building maintainable, scalable automation pipelines.

  • Tags and Role Organization: Tags provide fine-grained control over which parts of a playbook execute. Instead of running the entire playbook every time, you can target specific components (e.g., --tags networking to only run MetalLB tasks). This is invaluable during development and troubleshooting when you need to iterate on specific components.
  • Environment Variables: The environment section sets up the necessary context for Kubernetes operations. The KUBECONFIG variable tells kubectl and other tools where to find cluster credentials, while K8S_AUTH_KUBECONFIG provides the same information to Ansible’s Kubernetes modules.
  • Variable Scoping: The vars section demonstrates how to define play-specific variables that override role defaults. This allows you to customize behavior without modifying role files directly.
  • Logical Play Organization: The playbook is structured in logical phases (preflight, setup, infrastructure, validation) that mirror real-world deployment workflows. This organization makes it easier to understand dependencies and troubleshoot issues when they occur.
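To make the variable-scoping point concrete: the same `kubeconfig_path` could instead live in a group variables file, where it would apply to every play targeting that group without being repeated per play. The file location below follows standard Ansible conventions; adjust it to your repository layout:

```yaml
# filepath: group_vars/masters.yml (hypothetical location)
---
# Applies to all plays and roles that target the 'masters' group.
kubeconfig_path: /home/ubuntu/.kube/config
```

Play-level `vars` sit higher in Ansible's variable precedence than group variables, so an explicit `vars` block in the playbook still wins if both are defined.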

Update site.yml with the new plays, adding tags to the existing roles:

# filepath: site.yml
---
- name: Pre-flight checks
  hosts: all
  gather_facts: true
  roles:
    - { role: preflight, tags: ['preflight', 'validation'] }

- name: Common setup
  hosts: all
  become: true
  roles:
    - { role: common, tags: ['common', 'setup'] }
    - { role: containerd, tags: ['containerd', 'container-runtime'] }
    - { role: kubernetes, tags: ['kubernetes', 'k8s'] }

- name: Control plane setup
  hosts: masters
  become: true
  roles:
    - { role: control-plane, tags: ['control-plane', 'masters'] }

- name: Worker nodes setup
  hosts: workers
  become: true
  roles:
    - { role: worker, tags: ['worker', 'nodes'] }

- name: Infrastructure and applications
  hosts: masters
  become: true
  vars:
    kubeconfig_path: /home/ubuntu/.kube/config
  environment:
    K8S_AUTH_KUBECONFIG: "{{ kubeconfig_path }}"
    KUBECONFIG: "{{ kubeconfig_path }}"
    PATH: "/usr/local/bin:{{ ansible_env.PATH }}"
  roles:
    - { role: helm, tags: ['helm', 'tools'] }
    - { role: metallb, tags: ['metallb', 'networking', 'load-balancer'] }
    - { role: storage, tags: ['storage', 'persistent-storage'] }
    - { role: istio, tags: ['istio', 'service-mesh'] }

- name: Validation and health checks
  hosts: masters
  become: true
  vars:
    kubeconfig_path: /home/ubuntu/.kube/config
  environment:
    K8S_AUTH_KUBECONFIG: "{{ kubeconfig_path }}"
    KUBECONFIG: "{{ kubeconfig_path }}"
    PATH: "/usr/local/bin:{{ ansible_env.PATH }}"
  roles:
    - { role: validation, tags: ['validation', 'health-check'] }

Example Usage

The enhanced playbook with tags and role organization enables flexible execution patterns that are essential for real-world infrastructure management:

Complete Infrastructure Deployment:

# Run the entire playbook (same as before)
ANSIBLE_CONFIG=configuration-with-ansible/ansible.cfg ansible-playbook \
    -i configuration-with-ansible/inventory.ini \
    configuration-with-ansible/site.yml

Targeted Component Installation:

# Install only networking components (MetalLB)
ansible-playbook -i inventory.ini site.yml --tags networking

# Deploy only the service mesh
ansible-playbook -i inventory.ini site.yml --tags istio

# Run validation checks without changing anything
ansible-playbook -i inventory.ini site.yml --tags validation

Development and Troubleshooting:

# Skip preflight checks during development
ansible-playbook -i inventory.ini site.yml --skip-tags preflight

# Install tools and storage without service mesh
ansible-playbook -i inventory.ini site.yml --tags helm,storage

# Run only infrastructure components (skip basic cluster setup)
ansible-playbook -i inventory.ini site.yml --tags metallb,storage,istio

Multi-Environment Management:

# Target specific environments using different inventory files
ansible-playbook -i production-inventory.ini site.yml --tags validation
ansible-playbook -i staging-inventory.ini site.yml --tags networking,storage

This flexibility allows you to iterate quickly during development, perform targeted updates in production, and troubleshoot specific components without affecting the entire infrastructure.


Conclusion

In this third installment of our Infrastructure as Code series, we’ve transformed a basic Kubernetes cluster into a production-ready platform with essential enterprise capabilities. By implementing MetalLB for load balancing, Local Path Provisioner for persistent storage, and Istio for advanced traffic management, we’ve created a foundation that can support real-world application workloads.

The automation techniques we’ve explored (group variables, preflight checks, Helm integration, and basic validation) represent industry best practices that scale from homelab environments to enterprise deployments. The modular role structure and tag-based execution provide the flexibility needed to manage complex infrastructure while maintaining reliability and repeatability.

Our enhanced Ansible playbook now demonstrates sophisticated infrastructure automation patterns including environment-specific configuration, conditional task execution, and multi-layered validation. These skills are directly applicable to production environments where reliability, auditability, and maintainability are critical requirements.

In Part 4 of this series, we’ll complete our Infrastructure as Code pipeline by implementing GitHub Actions workflows that automatically trigger our Terraform and Ansible automation in response to pull requests and code changes. This will demonstrate how to build a complete CI/CD pipeline for infrastructure that provides the same level of automation and quality control typically reserved for application code.

The combination of version-controlled infrastructure definitions, automated testing and validation, and GitOps-style deployment workflows represents the pinnacle of modern DevOps practices, enabling teams to manage infrastructure with the same rigor and efficiency they apply to software development.


Further Learning Resources

To deepen your understanding of the technologies and concepts covered in this tutorial, explore the official documentation for advanced Ansible topics, MetalLB load balancing, the Istio service mesh, Kubernetes storage, and Infrastructure as Code practices.
