Learning Kubernetes with KubeADM - Part 2: Storage, Ingress, Monitoring, Security, and Maintenance
Part 2 of our series on getting started with Kubernetes and kubeadm where we add persistent storage, ingress, monitoring with Prometheus, Loki, and Grafana, RBAC security, backup strategies, and perform cluster upgrades

Welcome back to our series on learning Kubernetes with KubeADM! In Part 1, we successfully built a functional three-node Kubernetes cluster using kubeadm, complete with a control plane node (master-1) and two worker nodes (worker-1, worker-2). We established the fundamental infrastructure with Calico networking and verified basic cluster operations.
Now we’re ready to transform our basic cluster into a production-ready platform. In this tutorial, we’ll add enterprise-grade capabilities including persistent storage, external access through ingress controllers, monitoring with Prometheus, Loki, and Grafana, security policies, and essential maintenance procedures.
By the end of this guide, you’ll have a Kubernetes cluster that mirrors production environments as closely as a homelab can, equipped with the tools and knowledge to deploy, monitor, and maintain real-world applications.
If you haven’t completed Part 1, please do so before proceeding, as this tutorial builds directly upon that foundation. You can find all the code examples in our GitHub repository.
Prerequisites and Current State
Before we begin, let’s verify our cluster is healthy and ready for the next phase. SSH into your master-1 node and run:
# Verify all nodes are ready
kubectl get nodes
# Check system pods are running
kubectl get pods -A
# Verify cluster info
kubectl cluster-info
You should see output similar to this, confirming all three nodes are in a “Ready” state:
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane 12h v1.31.0
worker-1 Ready <none> 12h v1.31.0
worker-2 Ready <none> 12h v1.31.0
Installing Helm: The Kubernetes Package Manager
Before diving into our advanced configurations, we need to install Helm, the de facto standard package manager for Kubernetes. Helm simplifies the deployment and management of complex applications by using templates called charts.
Think of Helm as the “apt” or “yum” for Kubernetes - it allows us to install, upgrade, and manage applications with simple commands rather than manually crafting dozens of YAML files.
On your master-1 node, install Helm:
# Download and install Helm
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update
sudo apt install helm
# Verify installation
helm version
# Add commonly used Helm repositories
helm repo add stable https://charts.helm.sh/stable
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
# Update repository information
helm repo update
With Helm installed, we can now deploy complex applications with single commands, making our setup much more manageable and following industry best practices.
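If you want to preview what a chart will install before running it, Helm can show a chart’s metadata and default configuration values. As a quick, optional check using the ingress-nginx chart we just added:
# Search the repositories we just added
helm search repo ingress-nginx
# Inspect a chart's default configuration values before installing it
helm show values ingress-nginx/ingress-nginx | head -40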
Objective 1: Persistent Storage with Local Path Provisioner
One of the fundamental requirements for any production Kubernetes cluster is persistent storage. While our cluster can run stateless applications perfectly, any application that needs to persist data (databases, file uploads, logs) requires persistent volumes.
In cloud environments, you’d typically use dynamic provisioning with cloud storage classes. In our homelab environment, we’ll use the Local Path Provisioner, which creates persistent volumes using local storage on our nodes. This approach is perfect for learning and development environments.
Understanding Kubernetes Storage Concepts
Before implementing storage, let’s understand the key concepts:
- Persistent Volume (PV): A cluster-wide storage resource
- Persistent Volume Claim (PVC): A request for storage by a pod
- Storage Class: Defines how storage is dynamically provisioned
- Dynamic Provisioning: Automatic creation of PVs when PVCs are requested
Installing Local Path Provisioner
The Local Path Provisioner automatically creates persistent volumes on local storage when applications request them. This eliminates the need to manually create volumes for each application.
# Deploy Local Path Provisioner
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml
# Wait for the provisioner to be ready
kubectl wait --for=condition=ready pod -l app=local-path-provisioner -n local-path-storage --timeout=300s
The kubectl patch command allows you to update specific fields of a Kubernetes resource without recreating it. In this case, we need to modify the metadata of the local-path storage class to add an annotation:
Remember:
- metadata: Contains identifying information about the resource, such as its name, labels, and annotations.
- annotations: Key-value pairs that store additional information for Kubernetes or external tools. While annotations don’t directly change how the resource operates, they can influence how it’s handled.
# Verify the storage class was created
kubectl get storageclass
# Set it as the default storage class
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
By adding the annotation storageclass.kubernetes.io/is-default-class: "true", you’re telling Kubernetes to treat this storage class as the default. This means that any PersistentVolumeClaim created without a specified storage class will automatically use the local-path storage class.
# Verify it's now marked as default
kubectl get storageclass
You should see output like:
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer false 2m
Testing Persistent Storage
Let’s test our storage setup with a simple application that requires persistent data. Create a namespace called storage-test:
# Create a test namespace
kubectl create namespace storage-test
A namespace in Kubernetes is a logical partition within a cluster that provides a way to divide resources between multiple users, teams, or environments. It helps organize and isolate workloads, making it easier to manage access, resource quotas, and policies for different groups or applications.
Now, create a PVC to test dynamic provisioning:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pvc
namespace: storage-test
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
EOF
… and check its status:
kubectl get pvc -n storage-test
Initially, the PVC will be in “Pending” status because Local Path Provisioner uses the “WaitForFirstConsumer” binding mode. This means the PV is only created when a pod actually uses the PVC.
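If you want to confirm this behaviour yourself, you can read the binding mode straight from the storage class (an optional check):
# Print the volume binding mode of the local-path storage class
kubectl get storageclass local-path -o jsonpath='{.volumeBindingMode}{"\n"}'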
Now let’s create a test pod that writes data to persistent storage:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: storage-test-pod
namespace: storage-test
spec:
containers:
- name: test-container
image: busybox
command: ["/bin/sh"]
args: ["-c", "while true; do echo \$(date) >> /data/timestamps.txt; sleep 30; done"]
volumeMounts:
- name: test-volume
mountPath: /data
volumes:
- name: test-volume
persistentVolumeClaim:
claimName: test-pvc
EOF
… and wait for the pod to be ready:
kubectl wait --for=condition=ready pod storage-test-pod -n storage-test --timeout=300s
Now, check that the PVC is now bound and that the persistent volume was created:
# Check that the PVC is now bound
kubectl get pvc -n storage-test
# Check the persistent volume was created
kubectl get pv
Let’s verify that data is being written and persisted:
# Check the data being written
kubectl exec -n storage-test storage-test-pod -- tail -f /data/timestamps.txt
kubectl exec allows you to run commands inside a running container within a pod, making it useful for debugging, inspecting files, or interacting with your application directly from the command line.
Press Ctrl+C to stop tailing, then delete the pod:
kubectl delete pod storage-test-pod -n storage-test
Now create a new pod using the same PVC to verify data persistence:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: storage-test-pod-2
namespace: storage-test
spec:
containers:
- name: test-container
image: busybox
command: ["/bin/sh"]
args: ["-c", "echo 'Previous data:'; cat /data/timestamps.txt; echo 'Adding new entry'; echo \$(date) >> /data/timestamps.txt; sleep 3600"]
volumeMounts:
- name: test-volume
mountPath: /data
volumes:
- name: test-volume
persistentVolumeClaim:
claimName: test-pvc
EOF
Check that the previous data persisted:
kubectl logs -n storage-test storage-test-pod-2
Perfect! You should see the timestamps from the previous pod, confirming that data persisted even after the original pod was deleted.
Deploying a Stateful Application
Now let’s deploy a real stateful application, PostgreSQL, to demonstrate practical storage usage. Start by creating a database namespace:
kubectl create namespace database
Now, deploy PostgreSQL with persistent storage:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
namespace: database
spec:
serviceName: postgres
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:15
env:
- name: POSTGRES_DB
value: testdb
- name: POSTGRES_USER
value: testuser
- name: POSTGRES_PASSWORD
value: testpass123
ports:
- containerPort: 5432
name: postgres
volumeMounts:
- name: postgres-storage
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: postgres-storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
name: postgres-service
namespace: database
spec:
selector:
app: postgres
ports:
- port: 5432
targetPort: 5432
type: ClusterIP
EOF
Wait for PostgreSQL to be ready and then check the StatefulSet and PVC:
# Wait for PostgreSQL to be ready
kubectl wait --for=condition=ready pod -l app=postgres -n database --timeout=300s
# Check the StatefulSet and PVC
kubectl get statefulset -n database
kubectl get pvc -n database
kubectl get pv
Let’s test the database functionality by connecting to PostgreSQL and creating test data:
# Connect to PostgreSQL and create test data
kubectl exec -it postgres-0 -n database -- psql -U testuser -d testdb -c "
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO users (username) VALUES
('alice'),
('bob'),
('charlie');
SELECT * FROM users;
"
This successful test confirms that your Kubernetes cluster can reliably provide persistent storage for stateful workloads. With dynamic provisioning in place, applications like databases and file servers can safely retain data even as pods are restarted, rescheduled, or replaced. This capability is essential for running production-grade services and lays the groundwork for deploying more advanced stateful applications in your cluster.
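To see this persistence in action, you can optionally delete the PostgreSQL pod and confirm the data survives the restart; the StatefulSet recreates postgres-0 and reattaches the same PVC:
# Delete the pod; the StatefulSet recreates it with the same volume
kubectl delete pod postgres-0 -n database
kubectl rollout status statefulset/postgres -n database --timeout=300s
# The rows we inserted earlier should still be there
kubectl exec -it postgres-0 -n database -- psql -U testuser -d testdb -c "SELECT * FROM users;"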
Cleanup
Before moving on to objective 2, let’s clean up the resources created in Objective 1 (Persistent Storage), including the PostgreSQL database, test pods, PVCs, and namespaces. Run the following commands:
# Delete the PostgreSQL StatefulSet and its service
kubectl delete statefulset postgres -n database
kubectl delete service postgres-service -n database
# Delete the database namespace (removes all resources in it)
kubectl delete namespace database
# Delete the test pods and PVC in storage-test namespace
kubectl delete pod storage-test-pod -n storage-test --ignore-not-found
kubectl delete pod storage-test-pod-2 -n storage-test --ignore-not-found
kubectl delete pvc test-pvc -n storage-test
# Delete the storage-test namespace (removes all resources in it)
kubectl delete namespace storage-test
Objective 2: External Access with NGINX Ingress Controller
Currently, our applications are only accessible from within the cluster. In production environments, you need to expose services to external users. While LoadBalancer and NodePort services provide basic external access, Ingress controllers offer advanced features like SSL termination, path-based routing, and virtual hosting.
NGINX Ingress Controller is the most popular choice for Kubernetes ingress, providing production-ready features and excellent performance.
Understanding Ingress Concepts
Before deploying the ingress controller, let’s understand the key concepts:
- Ingress Controller: The component that implements ingress rules (NGINX, Traefik, etc.)
- Ingress Resource: Kubernetes object that defines routing rules
- Service: Kubernetes service that the ingress routes traffic to
- TLS Termination: Handling HTTPS/SSL at the ingress level
Installing NGINX Ingress Controller
We’ll use Helm to install the NGINX Ingress Controller:
# Install NGINX Ingress Controller
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.type=NodePort \
--set controller.service.nodePorts.http=30080 \
--set controller.service.nodePorts.https=30443 \
--set controller.config.use-service-upstream="true"
# Wait for the controller to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=controller -n ingress-nginx --timeout=300s
# Check the ingress controller status
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx
We’re using NodePort instead of LoadBalancer since we don’t have a cloud load balancer in our homelab. The ingress will be accessible on ports 30080 (HTTP) and 30443 (HTTPS) on any node.
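As a quick sanity check (optional), you can hit the controller’s NodePort directly; with no ingress rules defined yet, the controller’s default backend should answer with a 404:
# Get the node IP and curl the ingress controller's HTTP NodePort
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
# Expect "404 Not Found" from the NGINX default backend until we add ingress rules
curl -i http://$NODE_IP:30080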
Deploying Test Applications
Let’s deploy a couple of web applications to demonstrate ingress functionality. Start by creating a web-apps namespace:
kubectl create namespace web-apps
Now, deploy Google’s ‘hello-app’ sample:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: hello-app
namespace: web-apps
spec:
replicas: 2
selector:
matchLabels:
app: hello-app
template:
metadata:
labels:
app: hello-app
spec:
containers:
- name: hello-app
image: gcr.io/google-samples/hello-app:1.0
ports:
- containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
name: hello-service
namespace: web-apps
spec:
selector:
app: hello-app
ports:
- port: 80
targetPort: 8080
type: ClusterIP
EOF
… as well as a second application (Echo Server):
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: echo-app
namespace: web-apps
spec:
replicas: 2
selector:
matchLabels:
app: echo-app
template:
metadata:
labels:
app: echo-app
spec:
containers:
- name: echo-app
image: ealen/echo-server:latest
ports:
- containerPort: 80
env:
- name: PORT
value: "80"
---
apiVersion: v1
kind: Service
metadata:
name: echo-service
namespace: web-apps
spec:
selector:
app: echo-app
ports:
- port: 80
targetPort: 80
type: ClusterIP
EOF
Wait for both deployments to be ready…
kubectl wait --for=condition=available deployment/hello-app -n web-apps --timeout=300s
kubectl wait --for=condition=available deployment/echo-app -n web-apps --timeout=300s
Creating Ingress Resources
Now let’s create ingress resources to expose our applications externally.
Remember: ingress resources are kubernetes objects that define routing rules.
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-apps-ingress
namespace: web-apps
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx
rules:
- http:
paths:
- path: /hello
pathType: Prefix
backend:
service:
name: hello-service
port:
number: 80
- path: /echo
pathType: Prefix
backend:
service:
name: echo-service
port:
number: 80
EOF
Check the ingress status:
# Check the ingress status
kubectl get ingress -n web-apps
kubectl describe ingress web-apps-ingress -n web-apps
You should see output similar to:
Name: web-apps-ingress
Labels: <none>
Namespace: web-apps
Address:
Ingress Class: nginx
Default backend: <default>
Rules:
Host Path Backends
---- ---- --------
*
/hello hello-service:80 (10.244.1.9:8080,10.244.3.11:8080)
/echo echo-service:80 (10.244.1.10:80,10.244.3.12:80)
Annotations: nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Sync 9s nginx-ingress-controller Scheduled for sync
Testing External Access
Let’s test that our applications are accessible through the ingress. First, get the node IP:
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "Testing ingress access on $NODE_IP:30080"
Test the hello application:
curl http://$NODE_IP:30080/hello
Test the echo application:
curl http://$NODE_IP:30080/echo | jq
You should see responses from both applications, confirming that ingress-based routing is working correctly.
Setting Up Host-Based Routing
For a more realistic setup, let’s configure host-based routing. First, we need to set up local DNS resolution:
# Get the node IP:
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
# Add entries to /etc/hosts on your host machine (not in the VM)
echo "You'll need to add these entries to your host machine's /etc/hosts file:"
echo "$NODE_IP hello.local"
echo "$NODE_IP echo.local"
echo "$NODE_IP dashboard.local"
On your host machine (not the VMs), add these entries to /etc/hosts
:
# Add these lines to /etc/hosts (replace with your actual node IP)
192.168.122.37 hello.local
192.168.122.37 echo.local
192.168.122.37 dashboard.local
Now, back on master-1, create host-based ingress rules:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: host-based-ingress
namespace: web-apps
spec:
ingressClassName: nginx
rules:
- host: hello.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: hello-service
port:
number: 80
- host: echo.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: echo-service
port:
number: 80
EOF
This YAML manifest defines a Kubernetes Ingress resource named host-based-ingress in the web-apps namespace. It uses the NGINX ingress controller (ingressClassName: nginx) to route external HTTP traffic to different backend services based on the requested host name:
- Requests to hello.local are forwarded to the hello-service on port 80.
- Requests to echo.local are forwarded to the echo-service on port 80.
Each rule matches all paths (/) using the Prefix path type. This setup allows you to expose multiple services under different hostnames using a single ingress controller.
Test the host-based routing from your host machine’s browser by visiting http://hello.local:30080 and http://echo.local:30080.
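If you’d rather not edit /etc/hosts just to test, you can also exercise the same rules from master-1 by setting the Host header explicitly (an optional check):
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
# The ingress controller routes on the Host header, so we can supply the hostname with curl
curl -H "Host: hello.local" http://$NODE_IP:30080/
curl -H "Host: echo.local" http://$NODE_IP:30080/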
Cleanup
Before moving on to objective 3, let’s clean up the resources created in this section, leaving the NGINX Ingress Controller intact. Run the following commands on master-1:
# Delete the hello-app deployment in the web-apps namespace
kubectl delete deployment hello-app -n web-apps
# Delete the hello-service service in the web-apps namespace
kubectl delete service hello-service -n web-apps
# Delete the echo-app deployment in the web-apps namespace
kubectl delete deployment echo-app -n web-apps
# Delete the echo-service service in the web-apps namespace
kubectl delete service echo-service -n web-apps
# Delete the path-based ingress resource in the web-apps namespace
kubectl delete ingress web-apps-ingress -n web-apps
# Delete the host-based ingress resource in the web-apps namespace
kubectl delete ingress host-based-ingress -n web-apps
# Delete the entire web-apps namespace and all its resources
kubectl delete namespace web-apps
Now, remove the entries that we added to /etc/hosts on your host machine:
# Remove any line with 'hello\.local' in it from /etc/hosts
sudo sed -i '/hello\.local/d' /etc/hosts
# Remove any line with 'dashboard\.local' in it from /etc/hosts
sudo sed -i '/dashboard\.local/d' /etc/hosts
# Remove any line with 'echo\.local' in it from /etc/hosts
sudo sed -i '/echo\.local/d' /etc/hosts
Objective 3: Monitoring with Prometheus, Loki, and Grafana
Monitoring is essential for maintaining healthy Kubernetes clusters. Prometheus has become the standard for Kubernetes monitoring, providing powerful metrics collection, alerting, and querying capabilities. Combined with Loki’s log aggregation and Grafana’s visualization features, we’ll have complete observability into our cluster.
Understanding the Monitoring Stack
Our monitoring stack will include:
- Prometheus: Metrics collection and storage
- Grafana: Metrics visualization and dashboards
- Loki: Log aggregation and querying
- Promtail: Log collection agent for Loki
- Node Exporter: Hardware and OS metrics from each node
- kube-state-metrics: Kubernetes object metrics
- Alertmanager: Alert routing and notification
Installing the Prometheus Stack
We’ll use the Prometheus Community Helm chart, which includes everything we need. First, create a monitoring namespace:
kubectl create namespace monitoring
Next, install the Prometheus stack, including Prometheus, Grafana, Alertmanager, and exporters:
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.service.type=NodePort \
--set prometheus.service.nodePort=30090 \
--set grafana.service.type=NodePort \
--set grafana.service.nodePort=30030 \
--set alertmanager.service.type=NodePort \
--set alertmanager.service.nodePort=30093
Installing Loki for Log Aggregation
Now let’s add Loki to our monitoring stack for centralized log management. We’ll install Loki using its Helm chart:
# Add Grafana Helm repository for Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki with Promtail
helm install loki grafana/loki-stack \
--namespace monitoring \
--set loki.service.type=NodePort \
--set loki.service.nodePort=30031 \
--set promtail.enabled=true \
--set grafana.enabled=false
Wait for all components to be ready (this may take several minutes):
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=prometheus -n monitoring --timeout=600s
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=grafana -n monitoring --timeout=600s
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=loki -n monitoring --timeout=600s
Now check the status of monitoring components:
kubectl get pods -n monitoring
kubectl get svc -n monitoring
Accessing Monitoring Interfaces
Let’s set up access to our monitoring tools. First, get the IP address of the node:
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
and print the URLs for the monitoring tools:
echo "Monitoring URLs:"
echo "Prometheus: http://$NODE_IP:30090"
echo "Grafana: http://$NODE_IP:30030"
echo "Alertmanager: http://$NODE_IP:30093"
echo "Loki: http://$NODE_IP:30031"
In order to access Grafana, you will need to get the admin password:
kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
… and voilà! We have a beautiful dashboard with rich metrics for our Kubernetes cluster:

Creating Ingress for Monitoring Tools
Let’s make our monitoring tools accessible through ingress with proper hostnames:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: monitoring-ingress
namespace: monitoring
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx
rules:
- host: grafana.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-grafana
port:
number: 80
- host: prometheus.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-kube-prometheus-prometheus
port:
number: 9090
- host: alertmanager.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-kube-prometheus-alertmanager
port:
number: 9093
- host: loki.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: loki
port:
number: 3100
EOF
Add these to your host machine’s /etc/hosts file, replacing the IP address with your node’s IP address (the hostnames must match the ingress rules we just created):
192.168.122.37 grafana.homelab.local
192.168.122.37 prometheus.homelab.local
192.168.122.37 alertmanager.homelab.local
192.168.122.37 loki.homelab.local
Configuring Loki Data Source in Grafana
Now we need to configure Loki as a data source in Grafana for log visualization:
# Get Loki service URL for internal cluster communication
LOKI_URL="http://loki.monitoring.svc.cluster.local:3100"
# Create a configmap to add Loki as a datasource
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasource-loki
namespace: monitoring
labels:
grafana_datasource: "1"
data:
loki.yaml: |
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki.monitoring.svc.cluster.local:3100
isDefault: false
editable: true
EOF
# Restart Grafana to pick up the new datasource
kubectl rollout restart deployment prometheus-grafana -n monitoring
kubectl wait --for=condition=available deployment/prometheus-grafana -n monitoring --timeout=300s
Exploring Prometheus Metrics
Access Prometheus at http://prometheus.homelab.local:30080 and explore some basic queries:
Basic PromQL Queries to Try:
- Node CPU Usage:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Memory Usage Percentage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Pod Count by Namespace:
count by (namespace) (kube_pod_info)
- Cluster CPU Requests:
sum(kube_pod_container_resource_requests{resource="cpu"})
Exploring Loki Logs
Access Loki directly at http://loki.homelab.local:30080 to explore log queries, or use Grafana’s Explore feature:
Basic LogQL Queries to Try:
- All logs from a specific namespace:
{namespace="kube-system"}
- Logs containing error messages:
{namespace=~".+"} |= "error"
- Logs from specific pods:
{pod=~"prometheus.*"}
Configuring Grafana Dashboards
Access Grafana at http://grafana.homelab.local:30080 (username: admin, password from the previous command) and import some pre-built dashboards:
- Import Kubernetes Cluster Overview Dashboard: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 7249, click “Load”, select the Prometheus data source, and click “Import”.
- Import Node Exporter Dashboard: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 1860, select the Prometheus data source, and click “Import”.
- Import Kubernetes Pod Overview: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 6417, select the Prometheus data source, and click “Import”.
- Import Loki Logs Dashboard: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 13407, click “Load”, select the Loki data source, and click “Import”.
Creating Custom Alerts
Let’s create a custom alert for high CPU usage. On master-1
, create a Prometheus rule:
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: custom-alerts
namespace: monitoring
labels:
prometheus: kube-prometheus
role: alert-rules
spec:
groups:
- name: custom.rules
rules:
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 2m
labels:
severity: warning
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% for more than 2 minutes on {{ \$labels.instance }}"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 85% for more than 5 minutes on {{ \$labels.instance }}"
EOF
Check that the rule was created. You should see custom-alerts listed in the output of the following command:
kubectl get prometheusrule -n monitoring
Generating Test Load
Let’s create some load to test our monitoring and alerts. First, recreate the web-apps namespace since we cleaned it up earlier:
kubectl create namespace web-apps
Now deploy a CPU stress test:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: cpu-stress
namespace: web-apps
spec:
replicas: 1
selector:
matchLabels:
app: cpu-stress
template:
metadata:
labels:
app: cpu-stress
spec:
containers:
- name: cpu-stress
image: containerstack/cpustress
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 1000m
memory: 256Mi
command: ["/bin/sh"]
args: ["-c", "stress --cpu 2 --timeout 600s"]
EOF
Now, check your monitoring dashboards to see the increased CPU usage and corresponding log entries in Loki! The stress test will run for 10 minutes, giving you time to explore both metrics in Prometheus/Grafana and logs in Loki.
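One way to watch the stress test from Prometheus is a query like the following (an optional example; the metric comes from cAdvisor via the kubelet and is scraped by the kube-prometheus-stack):
# Total CPU usage of the cpu-stress pod's containers, averaged over 2 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="web-apps", pod=~"cpu-stress.*"}[2m]))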
Objective 4: Security with RBAC and Network Policies
Security is paramount in any Kubernetes environment. In this section, we’ll implement Role-Based Access Control (RBAC) and Network Policies to secure our cluster against unauthorized access and limit network traffic between components.
Understanding Kubernetes Security
Key security concepts we’ll implement:
- RBAC (Role-Based Access Control): Controls who can access what resources
- Network Policies: Controls network traffic between pods
- Pod Security Standards: Controls what pods can do
- Service Accounts: Provides identity for pods and services
Setting Up RBAC
Let’s create different user roles for our cluster. Start by creating a namespace for our RBAC examples:
kubectl create namespace rbac-demo
Next, create service accounts for developers and viewers. A service account in Kubernetes is a special type of account used by processes running in pods to interact securely with the Kubernetes API:
# Create a service account for developers
kubectl create serviceaccount developer -n rbac-demo
# Create a service account for viewers
kubectl create serviceaccount viewer -n rbac-demo
Now, create a Role that allows full access to pods, services, ConfigMaps, Secrets, Deployments, and ReplicaSets in the rbac-demo namespace:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: rbac-demo
name: developer-role
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps", "secrets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
EOF
Finally, create a Role that only allows read access:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: rbac-demo
name: viewer-role
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch"]
EOF
Now that we have created roles, we still need to bind them to the service accounts we created:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: developer-binding
namespace: rbac-demo
subjects:
- kind: ServiceAccount
name: developer
namespace: rbac-demo
roleRef:
kind: Role
name: developer-role
apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: viewer-binding
namespace: rbac-demo
subjects:
- kind: ServiceAccount
name: viewer
namespace: rbac-demo
roleRef:
kind: Role
name: viewer-role
apiGroup: rbac.authorization.k8s.io
EOF
Testing RBAC Permissions
Let’s test our RBAC setup by creating pods that use different service accounts.
First, we create a pod using the developer service account:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: developer-pod
namespace: rbac-demo
spec:
serviceAccountName: developer
containers:
- name: kubectl
image: bitnami/kubectl:latest
command: ["/bin/sleep", "3600"]
EOF
Next, create a pod using the viewer service account:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: viewer-pod
namespace: rbac-demo
spec:
serviceAccountName: viewer
containers:
- name: kubectl
image: bitnami/kubectl:latest
command: ["/bin/sleep", "3600"]
EOF
Wait for pods to be ready…
kubectl wait --for=condition=ready pod developer-pod -n rbac-demo --timeout=300s
kubectl wait --for=condition=ready pod viewer-pod -n rbac-demo --timeout=300s
Now let’s test our RBAC permissions, starting with the developer. If we have configured our policies correctly, both of these commands should work:
kubectl exec -n rbac-demo developer-pod -- kubectl get pods -n rbac-demo
kubectl exec -n rbac-demo developer-pod -- kubectl create deployment test-app --image=nginx -n rbac-demo
Moving on, let’s test the viewer role. The read-only get should succeed, but the create should be rejected with a “forbidden” error:
kubectl exec -n rbac-demo viewer-pod -- kubectl get pods -n rbac-demo
kubectl exec -n rbac-demo viewer-pod -- kubectl create deployment viewer-test --image=nginx -n rbac-demo
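You can also query RBAC permissions directly with kubectl auth can-i and service-account impersonation, without exec’ing into pods (an optional check):
# Should print "yes" - the developer role allows creating deployments
kubectl auth can-i create deployments -n rbac-demo --as=system:serviceaccount:rbac-demo:developer
# Should print "no" - the viewer role is read-only
kubectl auth can-i create deployments -n rbac-demo --as=system:serviceaccount:rbac-demo:viewer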
Implementing Network Policies
Network policies control traffic flow between pods. By default, Kubernetes allows all traffic, but we can restrict this for better security. Since we installed Calico as our CNI in Part 1, we have full NetworkPolicy support, which enables us to implement fine-grained traffic controls between pods.
Let’s start by creating a namespace for network policy testing:
kubectl create namespace network-test
Next, let’s deploy a frontend and backend application:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend
namespace: network-test
spec:
replicas: 2
selector:
matchLabels:
app: frontend
role: frontend
template:
metadata:
labels:
app: frontend
role: frontend
spec:
containers:
- name: frontend
image: nginx
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: frontend-service
namespace: network-test
spec:
selector:
app: frontend
ports:
- port: 80
targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: backend
namespace: network-test
spec:
replicas: 2
selector:
matchLabels:
app: backend
role: backend
template:
metadata:
labels:
app: backend
role: backend
spec:
containers:
- name: backend
image: nginx
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: backend-service
namespace: network-test
spec:
selector:
app: backend
ports:
- port: 80
targetPort: 80
EOF
Wait for deployments…
kubectl wait --for=condition=available deployment/frontend -n network-test --timeout=300s
kubectl wait --for=condition=available deployment/backend -n network-test --timeout=300s
Testing Default (Open) Network Access
Before implementing network policies, let’s verify that all pods can communicate freely (default Kubernetes behavior):
First, create a test pod to simulate external access:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: network-test-pod
namespace: network-test
labels:
app: test
spec:
containers:
- name: netshoot
image: nicolaka/netshoot
command: ["/bin/sleep", "3600"]
EOF
kubectl wait --for=condition=ready pod network-test-pod -n network-test --timeout=300s
Test 1: External pod can reach backend (should work with default policy)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service || echo "Connection failed"
Test 2: External pod can reach frontend (should work with default policy)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s frontend-service || echo "Connection failed"
Test 3: Frontend can reach backend (should work with default policy). First, grab the name of one frontend pod:
FRONTEND_POD=$(kubectl get pods -n network-test -l app=frontend -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n network-test $FRONTEND_POD -- timeout 5 curl -s backend-service && \
echo "✓ ALLOWED as expected" || echo "Unexpected failure!"
Adding Network Policies with Calico
Now let’s create network policies to control traffic. Calico’s NetworkPolicy implementation allows us to implement a zero-trust architecture using an implicit-deny strategy. This means that by default, all traffic is blocked unless specifically allowed.
Note: Calico provides excellent NetworkPolicy support, including both Kubernetes NetworkPolicy and Calico’s own enhanced GlobalNetworkPolicy resources. For this tutorial, we’ll use standard Kubernetes NetworkPolicy resources for compatibility.
To accomplish this, we must first define a default policy that denies all traffic:
# Create a default deny-all network policy
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: network-test
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
EOF
Testing Network Isolation (Deny-All Policy)
Let’s verify that our deny-all policy is working:
Test 1: External pod to backend (should FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Test 2: External pod to frontend (should FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s frontend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Test 3: Frontend to backend (should FAIL)
kubectl exec -n network-test $FRONTEND_POD -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Allowing Routes
Now that we have confirmed that our deny-all policy is working, we can begin explicitly adding acceptable network routes.
Let’s start with allowing DNS:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns-any
namespace: network-test
spec:
podSelector: {}
policyTypes: [Egress]
egress:
- ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
EOF
Next, let’s allow frontend pods to communicate with backend pods (ingress to backend):
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-to-backend
namespace: network-test
spec:
podSelector:
matchLabels:
role: backend
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
role: frontend
ports:
- protocol: TCP
port: 80
EOF
Now we allow frontend pods to make outbound connections (egress from frontend)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-egress-backend
namespace: network-test
spec:
podSelector:
matchLabels:
role: frontend
policyTypes:
- Egress
egress:
- to:
- podSelector:
matchLabels:
role: backend
ports:
- protocol: TCP
port: 80
EOF
We also need to allow external access to frontend (ingress to frontend)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-any-to-frontend
namespace: network-test
spec:
podSelector:
matchLabels:
role: frontend
policyTypes:
- Ingress
ingress:
- {}
EOF
Finally, in order to test external-to-frontend traffic, we need to allow the test pod to make outbound connections:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-test-egress-any
namespace: network-test
spec:
podSelector:
matchLabels:
app: test
policyTypes:
- Egress
egress:
- {} # allow all egress
EOF
Testing Frontend-to-Backend Communication
Let’s test communication between the frontend and the backend:
Test 1: External pod to backend (should still FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Test 2: Frontend to backend (should now SUCCEED)
kubectl exec -n network-test $FRONTEND_POD -- timeout 5 curl -s backend-service && \
echo "✓ ALLOWED as expected" || echo "Unexpected failure!"
Testing External Access to Frontend
Test 1: External pod to frontend (should now SUCCEED)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s frontend-service && \
echo "✓ ALLOWED as expected" || echo "Unexpected failure!"
Test 2: External pod to backend (should still FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Verifying Calico NetworkPolicy Implementation
Let’s also verify that Calico is properly enforcing our network policies:
# Check that Calico is running and ready
echo "=== Calico Status ==="
kubectl get pods -n kube-system -l k8s-app=calico-node
# View the applied network policies
echo "=== Applied Network Policies ==="
kubectl get networkpolicies -n network-test -o wide
# Show network policy details
echo "=== Network Policy Details ==="
for policy in $(kubectl get networkpolicies -n network-test -o jsonpath='{.items[*].metadata.name}'); do
echo "--- Policy: $policy ---"
kubectl describe networkpolicy $policy -n network-test | grep -A 10 -B 2 "PodSelector\|Allowing\|Policy Types"
echo
done
Pod Security Standards
Pod Security Standards in Kubernetes are a set of built-in policies that define different levels of security controls for pods running in a cluster. These standards, Privileged, Baseline, and Restricted, help administrators enforce best practices by limiting what pods can do, such as restricting privilege escalation, enforcing non-root containers, and controlling access to host resources.
By applying these standards, you can reduce the risk of security vulnerabilities and ensure workloads adhere to organizational or compliance requirements.
Start by creating a secure-apps namespace:
kubectl create namespace secure-apps
Next, we need to label the namespace to enforce restricted security.
kubectl label namespace secure-apps pod-security.kubernetes.io/enforce=restricted
kubectl label namespace secure-apps pod-security.kubernetes.io/audit=restricted
kubectl label namespace secure-apps pod-security.kubernetes.io/warn=restricted
These commands label the secure-apps namespace to apply Kubernetes Pod Security Standards:
- enforce=restricted: Blocks pods that don’t meet the strictest security requirements.
- audit=restricted: Logs violations of the restricted policy for auditing purposes.
- warn=restricted: Issues warnings when a pod would violate the restricted policy.
This setup helps ensure only secure pods are allowed, while also providing visibility into potential security issues.
Now, try to create a privileged pod (this should fail).
cat <<EOF | kubectl apply -f - || echo "Expected: Pod security policy violation"
apiVersion: v1
kind: Pod
metadata:
name: privileged-pod
namespace: secure-apps
spec:
containers:
- name: privileged
image: nginx
securityContext:
privileged: true
EOF
Now try to create a compliant secure pod:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
namespace: secure-apps
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: secure-app
# The restricted profile requires running as non-root; the unprivileged nginx variant runs as a non-root user and listens on 8080
image: nginxinc/nginx-unprivileged:latest
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: false
ports:
- containerPort: 8080
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
EOF
kubectl wait --for=condition=ready pod secure-pod -n secure-apps --timeout=300s
echo "Secure pod created successfully!"
Objective 5: Maintenance Operations
In this final section, we’ll cover essential maintenance operations that every Kubernetes administrator needs to know: backup strategies, cluster upgrades, and troubleshooting techniques.
Backup Strategies
Regular backups are crucial for disaster recovery. We’ll cover backing up etcd (the cluster’s database) and persistent volume data. A comprehensive backup strategy should include both the cluster state (etcd) and application data (persistent volumes).
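The rest of this section focuses on etcd. For the application-data half in our homelab, one simple approach is to archive the Local Path Provisioner’s data directory on each node; the sketch below assumes the provisioner’s default path of /opt/local-path-provisioner and that the applications can tolerate a crash-consistent copy:
# Run on each node: archive all local-path volumes into a timestamped tarball
sudo mkdir -p /var/backups
sudo tar -czf /var/backups/local-path-$(hostname)-$(date +%Y%m%d).tar.gz -C /opt/local-path-provisioner .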
Setting Up etcd Backup
etcd stores all Kubernetes cluster state, making it the most critical component to backup. This includes all your deployments, services, secrets, and configuration data.
Install etcd-client on master-1:
sudo apt-get update
sudo apt-get install -y etcd-client
Then create a backup script: etcd-backup.sh
cat <<EOF > ~/etcd-backup.sh
#!/bin/bash
set -e
BACKUP_DIR="/var/backups/etcd"
BACKUP_FILE="etcd-backup-\$(date +%Y%m%d-%H%M%S).db"
# Create backup directory
sudo mkdir -p \$BACKUP_DIR
# Create etcd backup
sudo ETCDCTL_API=3 etcdctl snapshot save \$BACKUP_DIR/\$BACKUP_FILE \\
--endpoints=https://127.0.0.1:2379 \\
--cacert=/etc/kubernetes/pki/etcd/ca.crt \\
--cert=/etc/kubernetes/pki/etcd/server.crt \\
--key=/etc/kubernetes/pki/etcd/server.key
# Verify backup
sudo ETCDCTL_API=3 etcdctl snapshot status \$BACKUP_DIR/\$BACKUP_FILE
echo "Backup created: \$BACKUP_DIR/\$BACKUP_FILE"
# Clean up old backups (keep last 7 days)
sudo find \$BACKUP_DIR -name "etcd-backup-*.db" -mtime +7 -delete
EOF
The script automates the process of backing up the etcd database used by Kubernetes:
- It creates a timestamped backup file in the /var/backups/etcd directory, using secure credentials to connect to the etcd server.
- After saving the backup, it verifies the backup file’s status to ensure integrity.
- Finally, it cleans up old backup files, keeping only those from the last 7 days to manage disk space efficiently. This helps maintain regular, secure, and manageable etcd backups for disaster recovery.
Make the script executable and run it:
chmod +x ~/etcd-backup.sh
# Run the backup script
sudo ~/etcd-backup.sh
You should see output similar to:
Snapshot saved at /var/backups/etcd/etcd-backup-20250716-192759.db
3452ebb4, 67954, 1484, 20 MB
Backup created: /var/backups/etcd/etcd-backup-20250716-192759.db
Testing etcd Backup and Restore
Let’s test our etcd backup by simulating a cluster disaster and restoring from backup. Warning: This test will temporarily disrupt your cluster, so ensure you have a current backup first.
# First, create a fresh etcd backup
echo "=== Creating Fresh etcd Backup ==="
sudo ~/etcd-backup.sh
# Create some test resources to verify restore
echo "=== Creating Test Resources ==="
kubectl create namespace backup-test
kubectl create deployment test-app --image=nginx --replicas=2 -n backup-test
kubectl create service clusterip test-service --tcp=80:80 -n backup-test
kubectl create configmap test-config --from-literal=message="Hello from backup test" -n backup-test
# Verify test resources exist
echo "=== Verifying Test Resources ==="
kubectl get all -n backup-test
kubectl get configmap test-config -n backup-test -o yaml
# Get the latest backup file
LATEST_BACKUP=$(sudo ls -1t /var/backups/etcd/etcd-backup-*.db | head -1)
echo "Using backup file: $LATEST_BACKUP"
# Stop etcd to simulate disaster
# (in a kubeadm cluster etcd runs as a static pod, not a systemd service,
# so we stop it by moving its manifest out of the manifests directory)
echo "=== Simulating etcd Disaster ==="
echo "Stopping etcd..."
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/etcd.yaml.bak
sleep 20
# Move current etcd data (simulate corruption/loss)
sudo mv /var/lib/etcd /var/lib/etcd.backup.$(date +%s)
# Restore from backup
echo "=== Restoring from etcd Backup ==="
sudo ETCDCTL_API=3 etcdctl snapshot restore $LATEST_BACKUP \
--data-dir /var/lib/etcd \
--initial-cluster=master-1=https://$(hostname -I | awk '{print $1}'):2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://$(hostname -I | awk '{print $1}'):2380 \
--name=master-1
# Restart etcd by putting its static pod manifest back
# (kubeadm's etcd runs as root, so no ownership change is needed on the restored data)
echo "=== Restarting etcd ==="
sudo mv /etc/kubernetes/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml
# Wait for etcd to be healthy
echo "Waiting for etcd to become healthy..."
sleep 10
# Check etcd health
sudo ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Restart kubelet to reconnect to etcd
sudo systemctl restart kubelet
# Wait for cluster to be responsive
echo "Waiting for cluster to become responsive..."
sleep 30
# Verify cluster is working
echo "=== Verifying Cluster Restoration ==="
kubectl cluster-info
kubectl get nodes
# Verify our test resources were restored
echo "=== Verifying Test Resources Restored ==="
kubectl get all -n backup-test
kubectl get configmap test-config -n backup-test -o jsonpath='{.data.message}'
echo
# Cleanup test resources
echo "=== Cleaning Up Test Resources ==="
kubectl delete namespace backup-test
echo "=== etcd Backup and Restore Test Completed Successfully! ==="
Alternative Safe Test Method
If you prefer a less disruptive test, you can verify backup integrity without actually restoring:
# Create a test backup
echo "=== Creating Test Backup ==="
sudo ~/etcd-backup.sh
# Get the latest backup
LATEST_BACKUP=$(sudo ls -1t /var/backups/etcd/etcd-backup-*.db | head -1)
echo "Testing backup: $LATEST_BACKUP"
# Verify backup integrity
echo "=== Verifying Backup Integrity ==="
sudo ETCDCTL_API=3 etcdctl snapshot status $LATEST_BACKUP --write-out=table
# Test restore to a temporary location (without actually using it)
echo "=== Testing Restore Process ==="
sudo ETCDCTL_API=3 etcdctl snapshot restore $LATEST_BACKUP \
--data-dir /tmp/etcd-restore-test \
--initial-cluster=master-1=https://127.0.0.1:2380 \
--initial-cluster-token=test-token \
--initial-advertise-peer-urls=https://127.0.0.1:2380 \
--name=master-1
# Check if restore created the directory structure
echo "=== Verifying Restore Structure ==="
sudo ls -la /tmp/etcd-restore-test/
# Cleanup test restore
sudo rm -rf /tmp/etcd-restore-test
echo "=== Backup Verification Completed Successfully! ==="
Note: The first test method actually restores etcd and will temporarily disrupt your cluster (about 1-2 minutes). The second method only verifies backup integrity without disruption. Both methods confirm that your backup strategy is working correctly.
Setting Up Automated Backups
Now let’s set up automated backups using cron:
# Set up cron job for daily backups at 2 AM
echo "0 2 * * * /home/ubuntu/etcd-backup >> /var/log/kubernetes-backup.log 2>&1" | sudo crontab -
# Verify crontab job
sudo crontab -l
Cluster Upgrades
Upgrading Kubernetes clusters requires careful planning and execution. We’ll demonstrate upgrading from the current version to a newer patch version.
Checking Current Versions
First, check which versions of kubeadm, kubectl, and kubelet you have installed:
# Check current versions
kubectl version
kubeadm version
kubelet --version
Our current version is 1.31.9-1.1. Now check for available upgrades:
# Check available upgrades
sudo apt update
sudo apt-cache madison kubeadm | head -5
You should see output similar to:
ubuntu@master-1:~$ sudo apt-cache madison kubeadm | head -5
kubeadm | 1.31.11-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.10-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.9-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.8-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.7-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
You will need to find the version to upgrade to (used in the commands below) from this list. In our case, it is 1.31.11-1.1.
Upgrading the Control Plane
First, upgrade kubeadm on the control plane:
sudo apt-mark unhold kubeadm
sudo apt update
sudo apt install -y kubeadm=1.31.11-1.1
sudo apt-mark hold kubeadm
# Check the upgrade plan
sudo kubeadm upgrade plan
# Apply the upgrade (replace with actual available version)
sudo kubeadm upgrade apply v1.31.11 --yes
# Drain the control plane node
kubectl drain master-1 --ignore-daemonsets --delete-emptydir-data
# Upgrade kubelet and kubectl
sudo apt-mark unhold kubelet kubectl
sudo apt update
sudo apt install -y kubelet=1.31.11-1.1 kubectl=1.31.11-1.1
sudo apt-mark hold kubelet kubectl
# Restart kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Uncordon the node
kubectl uncordon master-1
# Verify the control plane upgrade
kubectl get nodes
kubectl version
Upgrading Worker Nodes
For each worker node, perform these steps:
# SSH into worker-1
ssh ubuntu@<worker-1-ip>
# Upgrade kubeadm
sudo apt-mark unhold kubeadm
sudo apt update
sudo apt install -y kubeadm=1.31.11-1.1
sudo apt-mark hold kubeadm
# Upgrade the node configuration
sudo kubeadm upgrade node
# From the control plane, drain the worker node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# Back on worker-1, upgrade kubelet and kubectl
sudo apt-mark unhold kubelet kubectl
sudo apt update
sudo apt install -y kubelet=1.31.11-1.1 kubectl=1.31.11-1.1
sudo apt-mark hold kubelet kubectl
# Restart kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# From control plane, uncordon the worker
kubectl uncordon worker-1
# Repeat for worker-2
Troubleshooting Techniques
Let’s cover common troubleshooting scenarios and techniques:
Creating a Troubleshooting Toolkit
The troubleshooting-toolkit is a temporary pod based on the nicolaka/netshoot container image. This image is a “Swiss-army knife” for network troubleshooting; it comes pre-loaded with dozens of useful tools (ping, dig, curl, tcpdump, mtr, etc.) that are often missing from standard application containers.
The command: ["/bin/sleep", "3600"] instruction keeps the pod running for an hour. The NET_ADMIN capability gives it the necessary permissions to perform advanced network diagnostics.
# Deploy a troubleshooting pod with useful tools
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-toolkit
namespace: default
spec:
containers:
- name: toolkit
image: nicolaka/netshoot
command: ["/bin/sleep", "3600"]
securityContext:
capabilities:
add: ["NET_ADMIN"]
restartPolicy: Never
EOF
kubectl wait --for=condition=ready pod troubleshooting-toolkit --timeout=300s
To use the toolkit, simply execute commands inside the pod with kubectl exec. This allows you to diagnose cluster networking issues from the perspective of a pod running within the cluster.
Some examples of common use cases are:
Testing DNS Resolution:
Check if a service name resolves correctly from within the cluster.
kubectl exec -it troubleshooting-toolkit -- dig my-service.my-namespace.svc.cluster.local
Checking Pod-to-Service Connectivity:
Verify if you can reach a service’s ClusterIP.
kubectl exec -it troubleshooting-toolkit -- curl http://my-service.my-namespace.svc.cluster.local
kubectl exec -it troubleshooting-toolkit -- curl http://<service-cluster-ip>:<port>
Pinging Another Pod’s IP:
Test basic network reachability to another pod.
kubectl exec -it troubleshooting-toolkit -- ping <other-pod-ip>
When you’re finished, you can delete the pod with kubectl delete pod troubleshooting-toolkit.
Common Troubleshooting Commands
Below are some common troubleshooting commands that you should learn as you dive into the world of Kubernetes orchestration.
Check cluster health
kubectl get componentstatuses # Deprecated in recent Kubernetes releases, but still returns basic control plane health
kubectl cluster-info
kubectl get events --sort-by=.metadata.creationTimestamp
Check node health
kubectl describe nodes
kubectl top nodes # Requires metrics-server
Check pod issues
kubectl get pods --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container logs
Check resource usage
kubectl top pods --all-namespaces
kubectl describe resourcequota --all-namespaces
Network troubleshooting
kubectl exec -it troubleshooting-toolkit -- nslookup kubernetes.default
kubectl exec -it troubleshooting-toolkit -- ping <pod-ip>
kubectl exec -it troubleshooting-toolkit -- netstat -tlnp
Check persistent volumes
kubectl get pv
kubectl get pvc --all-namespaces
kubectl describe pv <volume-name>
Setting Up Log Aggregation
For better troubleshooting, let’s set up centralized logging by deploying a log aggregator into our Kubernetes cluster using a Fluentd DaemonSet.
Start by creating a logging namespace:
# Create a simple log aggregation setup using Fluentd
kubectl create namespace logging
Then, create a ConfigMap for the Fluentd configuration:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: logging
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type cri
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
<match kubernetes.**>
@type loki
url "http://loki.monitoring.svc.cluster.local:3100"
flush_interval 10s
</match>
EOF
Note: The Loki service URL is http://loki.monitoring.svc.cluster.local:3100. This uses the full Kubernetes DNS name to reach the loki service in the monitoring namespace from the logging namespace.
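The DaemonSet below references a fluentd service account, and the kubernetes_metadata filter needs read access to pod metadata, so create the account and a minimal set of permissions first (a minimal sketch; your cluster may need broader rules):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging
EOF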
Now deploy the daemonset:
# Deploy Fluentd DaemonSet for log collection
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      # Add a service account for proper permissions
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        # Use a generic image, not the Elasticsearch-specific one
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-amd64
        volumeMounts:
        # Mount the configuration file from the ConfigMap
        - name: config-volume
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      # Define the ConfigMap as a volume source
      - name: config-volume
        configMap:
          name: fluentd-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
EOF
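One thing to watch: the DaemonSet references a fluentd ServiceAccount that we haven't created yet, and the kubernetes_metadata filter in our config needs permission to read pod and namespace metadata from the API server. A minimal sketch of the ServiceAccount and RBAC objects (names here match the serviceAccountName above) would look like this:
# Create the ServiceAccount and RBAC needed by the Fluentd DaemonSet
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging
EOF
Also note that the @type loki output relies on the fluent-plugin-grafana-loki plugin, which may not be bundled in the generic image tag used above; if the Fluentd pods log an unknown plugin error, switch to an image variant that ships the Loki plugin or build your own image with the plugin installed.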
A DaemonSet is a Kubernetes workload that ensures a copy of a pod runs on every node in the cluster. In this case, it deploys a Fluentd pod on each node to act as a log collection agent.
The key to how it works is in the volumes and volumeMounts sections:
- hostPath: This gives the Fluentd pod direct access to directories on the underlying host node.
- Directory Access: It specifically mounts /var/log and /var/lib/docker/containers from the node into the pod. This is where container logs are written.
By doing this, each Fluentd pod can see and collect the logs from all other pods running on the same node. The fluent.conf file mounted from the ConfigMap tells Fluentd where to forward these collected logs, in this case to the Loki service we set up earlier.
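Once the ServiceAccount exists and the DaemonSet is applied, you should see one Fluentd pod on each schedulable node (the control-plane node will be skipped unless you add a toleration for its taint). A quick way to verify the rollout and confirm logs are flowing:
# One Fluentd pod per schedulable node
kubectl get pods -n logging -o wide
# Check the collector's own logs for errors or confirmation that it is shipping to Loki
kubectl logs -n logging -l name=fluentd --tail=20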
Creating Health Checks
It can be very useful to script a tool that gives you a quick snapshot of the overall health of your cluster: nodes, pods, events, and so on. The script below captures resource usage, pod statuses, recent events, and more.
Create cluster-health-check.sh:
cat <<EOF > ~/cluster-health-check.sh
#!/bin/bash
echo "=== Kubernetes Cluster Health Check ==="
echo "Date: \$(date)"
echo
echo "=== Cluster Info ==="
kubectl cluster-info
echo
echo "=== Node Status ==="
kubectl get nodes -o wide
echo
echo "=== Pod Status by Namespace ==="
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
echo
echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp | tail -10
echo
echo "=== Resource Usage ==="
echo "Node resource usage:"
kubectl top nodes 2>/dev/null || echo "Metrics server not available"
echo
echo "=== Persistent Volume Status ==="
kubectl get pv
echo
echo "=== Critical System Pods ==="
kubectl get pods -n kube-system | grep -E "(etcd|apiserver|controller|scheduler)"
echo
echo "=== Network Policy Status ==="
kubectl get networkpolicies --all-namespaces
echo
echo "Health check completed."
EOF
Now make it executable and test it:
chmod +x ~/cluster-health-check.sh
~/cluster-health-check.sh
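If you'd like this report on a schedule rather than on demand, a simple cron entry works well; the path and schedule below are just examples:
# Run the health check every morning at 07:00 and append the output to a log file
(crontab -l 2>/dev/null; echo "0 7 * * * $HOME/cluster-health-check.sh >> $HOME/cluster-health.log 2>&1") | crontab -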
Conclusion and Next Steps
Congratulations! You’ve successfully transformed your basic Kubernetes cluster into a well-oiled, production-ready platform. Let’s review what we’ve accomplished:
What We’ve Built
- Persistent Storage: Implemented dynamic volume provisioning with Local Path Provisioner, enabling stateful applications like databases
- External Access: Deployed NGINX Ingress Controller with both path-based and host-based routing for external application access
- Comprehensive Monitoring: Set up the Prometheus, Loki, and Grafana stack for complete cluster observability with custom alerts and log aggregation
- Security: Implemented RBAC for access control, Network Policies for traffic isolation, and Pod Security Standards for container security
- Maintenance Operations: Established backup procedures, upgrade processes, and troubleshooting toolkits
Current Cluster Capabilities
Your cluster now provides enterprise-grade features:
- High Availability: Multi-node setup with monitoring and alerting
- Security: Role-based access control and network isolation
- Observability: Rich metrics, dashboards, and alerting
- External Access: Production-ready ingress with host-based routing
- Data Persistence: Reliable storage for stateful applications
- Maintainability: Automated backups and systematic upgrade procedures
Best Practices Learned
Throughout this tutorial, you’ve implemented industry best practices:
- Infrastructure as Code: All configurations defined in YAML manifests
- Security by Default: RBAC, Network Policies, and Pod Security Standards
- Observability First: Comprehensive monitoring before issues arise
- Backup Strategy: Regular automated backups of critical data
- Systematic Upgrades: Planned, tested upgrade procedures
Next Steps for Advanced Learning
Your homelab cluster is now ready for advanced exploration:
- Service Mesh: Implement Istio or Linkerd for advanced traffic management
- GitOps: Set up ArgoCD for automated application deployment
- Advanced Storage: Explore CSI drivers and distributed storage solutions
- Multi-Cluster: Set up cluster federation or multi-cluster management
- Advanced Security: Implement OPA Gatekeeper for policy as code
- CI/CD Integration: Connect Jenkins, GitLab, or GitHub Actions
Community and Resources
Continue your Kubernetes journey with these resources:
- Official Documentation: kubernetes.io
- Community Forums: Kubernetes Slack, Stack Overflow, Reddit
- Training Platforms: Cloud provider training, CNCF courses
- Certifications: CKA, CKAD, CKS certifications
- Local Meetups: Join Kubernetes and Cloud Native meetups
Final Thoughts
You’ve built more than just a Kubernetes cluster; you’ve created a scalable learning platform that mirrors real-world production environments. The skills and patterns you’ve learned here are directly applicable to any Kubernetes environment, from small startups to large enterprises.
Your homelab cluster serves as both a learning environment and a testing ground for new technologies and approaches. Continue experimenting, breaking things, and rebuilding them - this hands-on experience is invaluable for mastering Kubernetes.
The foundation you’ve established provides a solid base for exploring advanced Kubernetes concepts and integrating with the broader cloud-native ecosystem. Whether you’re preparing for certification, building production systems, or simply satisfying your curiosity about container orchestration, your cluster is ready for the next phase of your journey.
Remember that Kubernetes is a rapidly evolving ecosystem. Stay curious, keep experimenting, and don’t hesitate to tear down and rebuild your cluster as you learn new concepts. The automation scripts from Part 1 make it easy to start fresh whenever needed.
Happy clustering, and welcome to the exciting world of Kubernetes operations!
Join us for Part 3, where we deploy Google’s Online Boutique microservices demo application, showcasing a realistic multi-service architecture in preparation for service mesh implementation.
