Learning Kubernetes with KubeADM - Part 2: Storage, Ingress, Monitoring, Security, and Maintenance
Part 2 of our series on getting started with Kubernetes and kubeadm where we add persistent storage, ingress, monitoring with Prometheus, Loki, and Grafana, RBAC security, backup strategies, and perform cluster upgrades

Welcome back to our series on learning Kubernetes with KubeADM! In Part 1, we successfully built a functional three-node Kubernetes cluster using kubeadm, complete with a control plane node (master-1) and two worker nodes (worker-1, worker-2). We established the fundamental infrastructure with Calico networking and verified basic cluster operations.
Now we’re ready to transform our basic cluster into a production-ready platform. In this tutorial, we’ll add enterprise-grade capabilities including persistent storage, external access through ingress controllers, monitoring with Prometheus, Loki, and Grafana, security policies, and essential maintenance procedures.
By the end of this guide, you’ll have a Kubernetes cluster that mirrors production environments as closely as a homelab can, equipped with the tools and knowledge to deploy, monitor, and maintain real-world applications.
If you haven’t completed Part 1, please do so before proceeding, as this tutorial builds directly upon that foundation. You can find all the code examples in our GitHub repository.
Prerequisites and Current State
Before we begin, let’s verify our cluster is healthy and ready for the next phase. SSH into your master-1 node and run:
# Verify all nodes are ready
kubectl get nodes
# Check system pods are running
kubectl get pods -A
# Verify cluster info
kubectl cluster-info
You should see output similar to this, confirming all three nodes are in a “Ready” state:
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane 12h v1.31.0
worker-1 Ready <none> 12h v1.31.0
worker-2 Ready <none> 12h v1.31.0
Installing Helm: The Kubernetes Package Manager
Before diving into our advanced configurations, we need to install Helm, the de facto standard package manager for Kubernetes. Helm simplifies the deployment and management of complex applications by using templates called charts.
Think of Helm as the “apt” or “yum” for Kubernetes - it allows us to install, upgrade, and manage applications with simple commands rather than manually crafting dozens of YAML files.
On your master-1 node, install Helm:
# Download and install Helm
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update
sudo apt install helm
# Verify installation
helm version
# Add commonly used Helm repositories
helm repo add stable https://charts.helm.sh/stable
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
# Update repository information
helm repo update
With Helm installed, we can now deploy complex applications with single commands, making our setup much more manageable and following industry best practices.
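If you want to preview what a chart will install before running it, Helm can show a chart’s metadata and default configuration values. As a quick, optional check using the ingress-nginx chart we just added:
# Search the repositories we just added
helm search repo ingress-nginx
# Inspect a chart's default configuration values before installing it
helm show values ingress-nginx/ingress-nginx | head -40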
Objective 1: Persistent Storage with Local Path Provisioner
One of the fundamental requirements for any production Kubernetes cluster is persistent storage. While our cluster can run stateless applications perfectly, any application that needs to persist data (databases, file uploads, logs) requires persistent volumes.
In cloud environments, you’d typically use dynamic provisioning with cloud storage classes. In our homelab environment, we’ll use the Local Path Provisioner, which creates persistent volumes using local storage on our nodes. This approach is perfect for learning and development environments.
Understanding Kubernetes Storage Concepts
Before implementing storage, let’s understand the key concepts:
- Persistent Volume (PV): A cluster-wide storage resource
- Persistent Volume Claim (PVC): A request for storage by a pod
- Storage Class: Defines how storage is dynamically provisioned
- Dynamic Provisioning: Automatic creation of PVs when PVCs are requested
Installing Local Path Provisioner
The Local Path Provisioner automatically creates persistent volumes on local storage when applications request them. This eliminates the need to manually create volumes for each application.
# Deploy Local Path Provisioner
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml
# Wait for the provisioner to be ready
kubectl wait --for=condition=ready pod -l app=local-path-provisioner -n local-path-storage --timeout=300s
The kubectl patch command allows you to update specific fields of a Kubernetes resource without recreating it. In this case, we need to modify the metadata of the local-path storage class to add an annotation:
Remember:
- metadata: Contains identifying information about the resource, such as its name, labels, and annotations.
- annotations: Key-value pairs that store additional information for Kubernetes or external tools. While annotations don’t directly change how the resource operates, they can influence how it’s handled.
# Verify the storage class was created
kubectl get storageclass
# Set it as the default storage class
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
By adding the annotation storageclass.kubernetes.io/is-default-class: "true", you’re telling Kubernetes to treat this storage class as the default. This means that any PersistentVolumeClaim created without a specified storage class will automatically use the local-path storage class.
# Verify it's now marked as default
kubectl get storageclass
You should see output like:
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer false 2m
Testing Persistent Storage
Let’s test our storage setup with a simple application that requires persistent data. Create a namespace called storage-test:
# Create a test namespace
kubectl create namespace storage-test
A namespace in Kubernetes is a logical partition within a cluster that provides a way to divide resources between multiple users, teams, or environments. It helps organize and isolate workloads, making it easier to manage access, resource quotas, and policies for different groups or applications.
Now, create a PVC to test dynamic provisioning:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pvc
namespace: storage-test
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
EOF
… and check its status:
kubectl get pvc -n storage-test
Initially, the PVC will be in “Pending” status because Local Path Provisioner uses the “WaitForFirstConsumer” binding mode. This means the PV is only created when a pod actually uses the PVC.
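If you want to confirm this behaviour yourself, you can read the binding mode straight from the storage class (an optional check):
# Print the volume binding mode of the local-path storage class
kubectl get storageclass local-path -o jsonpath='{.volumeBindingMode}{"\n"}'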
Now let’s create a test pod that writes data to persistent storage:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: storage-test-pod
namespace: storage-test
spec:
containers:
- name: test-container
image: busybox
command: ["/bin/sh"]
args: ["-c", "while true; do echo \$(date) >> /data/timestamps.txt; sleep 30; done"]
volumeMounts:
- name: test-volume
mountPath: /data
volumes:
- name: test-volume
persistentVolumeClaim:
claimName: test-pvc
EOF
… and wait for the pod to be ready:
kubectl wait --for=condition=ready pod storage-test-pod -n storage-test --timeout=300s
Now, check that the PVC is now bound and that the persistent volume was created:
# Check that the PVC is now bound
kubectl get pvc -n storage-test
# Check the persistent volume was created
kubectl get pv
Let’s verify that data is being written and persisted:
# Check the data being written
kubectl exec -n storage-test storage-test-pod -- tail -f /data/timestamps.txt
kubectl exec allows you to run commands inside a running container within a pod, making it useful for debugging, inspecting files, or interacting with your application directly from the command line.
Press Ctrl+C to stop tailing, then delete the pod:
kubectl delete pod storage-test-pod -n storage-test
Now create a new pod using the same PVC to verify data persistence:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: storage-test-pod-2
namespace: storage-test
spec:
containers:
- name: test-container
image: busybox
command: ["/bin/sh"]
args: ["-c", "echo 'Previous data:'; cat /data/timestamps.txt; echo 'Adding new entry'; echo \$(date) >> /data/timestamps.txt; sleep 3600"]
volumeMounts:
- name: test-volume
mountPath: /data
volumes:
- name: test-volume
persistentVolumeClaim:
claimName: test-pvc
EOF
Check that the previous data persisted:
kubectl logs -n storage-test storage-test-pod-2
Perfect! You should see the timestamps from the previous pod, confirming that data persisted even after the original pod was deleted.
Deploying a Stateful Application
Now let’s deploy a real stateful application, PostgreSQL, to demonstrate practical storage usage. Start by creating a database namespace:
kubectl create namespace database
Now, deploy PostgreSQL with persistent storage:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
namespace: database
spec:
serviceName: postgres
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:15
env:
- name: POSTGRES_DB
value: testdb
- name: POSTGRES_USER
value: testuser
- name: POSTGRES_PASSWORD
value: testpass123
ports:
- containerPort: 5432
name: postgres
volumeMounts:
- name: postgres-storage
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: postgres-storage
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
name: postgres-service
namespace: database
spec:
selector:
app: postgres
ports:
- port: 5432
targetPort: 5432
type: ClusterIP
EOF
Wait for PostgreSQL to be ready and then check the StatefulSet and PVC:
# Wait for PostgreSQL to be ready
kubectl wait --for=condition=ready pod -l app=postgres -n database --timeout=300s
# Check the StatefulSet and PVC
kubectl get statefulset -n database
kubectl get pvc -n database
kubectl get pv
Let’s test the database functionality by connecting to PostgreSQL and creating test data:
# Connect to PostgreSQL and create test data
kubectl exec -it postgres-0 -n database -- psql -U testuser -d testdb -c "
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO users (username) VALUES
('alice'),
('bob'),
('charlie');
SELECT * FROM users;
"
This successful test confirms that your Kubernetes cluster can reliably provide persistent storage for stateful workloads. With dynamic provisioning in place, applications like databases and file servers can safely retain data even as pods are restarted, rescheduled, or replaced. This capability is essential for running production-grade services and lays the groundwork for deploying more advanced stateful applications in your cluster.
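To see this persistence in action, you can optionally delete the PostgreSQL pod and confirm the data survives the restart; the StatefulSet recreates postgres-0 and reattaches the same PVC:
# Delete the pod; the StatefulSet recreates it with the same volume
kubectl delete pod postgres-0 -n database
kubectl rollout status statefulset/postgres -n database --timeout=300s
# The rows we inserted earlier should still be there
kubectl exec -it postgres-0 -n database -- psql -U testuser -d testdb -c "SELECT * FROM users;"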
Cleanup
Before moving on to objective 2, let’s clean up the resources created in Objective 1 (Persistent Storage), including the PostgreSQL database, test pods, PVCs, and namespaces. Run the following commands:
# Delete the PostgreSQL StatefulSet and its service
kubectl delete statefulset postgres -n database
kubectl delete service postgres-service -n database
# Delete the database namespace (removes all resources in it)
kubectl delete namespace database
# Delete the test pods and PVC in storage-test namespace
kubectl delete pod storage-test-pod -n storage-test --ignore-not-found
kubectl delete pod storage-test-pod-2 -n storage-test --ignore-not-found
kubectl delete pvc test-pvc -n storage-test
# Delete the storage-test namespace (removes all resources in it)
kubectl delete namespace storage-test
Objective 2: External Access with NGINX Ingress Controller
Currently, our applications are only accessible from within the cluster. In production environments, you need to expose services to external users. While LoadBalancer and NodePort services provide basic external access, Ingress controllers offer advanced features like SSL termination, path-based routing, and virtual hosting.
NGINX Ingress Controller is the most popular choice for Kubernetes ingress, providing production-ready features and excellent performance.
Understanding Ingress Concepts
Before deploying the ingress controller, let’s understand the key concepts:
- Ingress Controller: The component that implements ingress rules (NGINX, Traefik, etc.)
- Ingress Resource: Kubernetes object that defines routing rules
- Service: Kubernetes service that the ingress routes traffic to
- TLS Termination: Handling HTTPS/SSL at the ingress level
Installing NGINX Ingress Controller
We’ll use Helm to install the NGINX Ingress Controller:
# Install NGINX Ingress Controller
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.type=NodePort \
--set controller.service.nodePorts.http=30080 \
--set controller.service.nodePorts.https=30443 \
--set controller.config.use-service-upstream="true"
# Wait for the controller to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/component=controller -n ingress-nginx --timeout=300s
# Check the ingress controller status
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx
We’re using NodePort instead of LoadBalancer since we don’t have a cloud load balancer in our homelab. The ingress will be accessible on ports 30080 (HTTP) and 30443 (HTTPS) on any node.
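As a quick sanity check (optional), you can hit the controller’s NodePort directly; with no ingress rules defined yet, the controller’s default backend should answer with a 404:
# Get the node IP and curl the ingress controller's HTTP NodePort
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
# Expect "404 Not Found" from the NGINX default backend until we add ingress rules
curl -i http://$NODE_IP:30080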
Deploying Test Applications
Let’s deploy a couple of web applications to demonstrate ingress functionality. Start by creating a web-apps namespace:
kubectl create namespace web-apps
Now, deploy Google’s ‘hello-app’ sample:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: hello-app
namespace: web-apps
spec:
replicas: 2
selector:
matchLabels:
app: hello-app
template:
metadata:
labels:
app: hello-app
spec:
containers:
- name: hello-app
image: gcr.io/google-samples/hello-app:1.0
ports:
- containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
name: hello-service
namespace: web-apps
spec:
selector:
app: hello-app
ports:
- port: 80
targetPort: 8080
type: ClusterIP
EOF
… as well as a second application (Echo Server):
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: echo-app
namespace: web-apps
spec:
replicas: 2
selector:
matchLabels:
app: echo-app
template:
metadata:
labels:
app: echo-app
spec:
containers:
- name: echo-app
image: ealen/echo-server:latest
ports:
- containerPort: 80
env:
- name: PORT
value: "80"
---
apiVersion: v1
kind: Service
metadata:
name: echo-service
namespace: web-apps
spec:
selector:
app: echo-app
ports:
- port: 80
targetPort: 80
type: ClusterIP
EOF
Wait for both deployments to be ready…
kubectl wait --for=condition=available deployment/hello-app -n web-apps --timeout=300s
kubectl wait --for=condition=available deployment/echo-app -n web-apps --timeout=300s
Creating Ingress Resources
Now let’s create ingress resources to expose our applications externally.
Remember: ingress resources are kubernetes objects that define routing rules.
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-apps-ingress
namespace: web-apps
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx
rules:
- http:
paths:
- path: /hello
pathType: Prefix
backend:
service:
name: hello-service
port:
number: 80
- path: /echo
pathType: Prefix
backend:
service:
name: echo-service
port:
number: 80
EOF
Check the ingress status:
# Check the ingress status
kubectl get ingress -n web-apps
kubectl describe ingress web-apps-ingress -n web-apps
You should see output similar to:
Name: web-apps-ingress
Labels: <none>
Namespace: web-apps
Address:
Ingress Class: nginx
Default backend: <default>
Rules:
Host Path Backends
---- ---- --------
*
/hello hello-service:80 (10.244.1.9:8080,10.244.3.11:8080)
/echo echo-service:80 (10.244.1.10:80,10.244.3.12:80)
Annotations: nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Sync 9s nginx-ingress-controller Scheduled for sync
Testing External Access
Let’s test that our applications are accessible through the ingress. First, get the node IP:
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "Testing ingress access on $NODE_IP:30080"
Test the hello application:
curl http://$NODE_IP:30080/hello
Test the echo application:
curl http://$NODE_IP:30080/echo | jq
You should see responses from both applications, confirming that ingress-based routing is working correctly.
Setting Up Host-Based Routing
For a more realistic setup, let’s configure host-based routing. First, we need to set up local DNS resolution:
# Get the node IP:
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
# Add entries to /etc/hosts on your host machine (not in the VM)
echo "You'll need to add these entries to your host machine's /etc/hosts file:"
echo "$NODE_IP hello.local"
echo "$NODE_IP echo.local"
echo "$NODE_IP dashboard.local"
On your host machine (not the VMs), add these entries to /etc/hosts
:
# Add these lines to /etc/hosts (replace with your actual node IP)
192.168.122.37 hello.local
192.168.122.37 echo.local
192.168.122.37 dashboard.local
Now, back on master-1, create host-based ingress rules:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: host-based-ingress
namespace: web-apps
spec:
ingressClassName: nginx
rules:
- host: hello.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: hello-service
port:
number: 80
- host: echo.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: echo-service
port:
number: 80
EOF
This YAML manifest defines a Kubernetes Ingress resource named host-based-ingress in the web-apps namespace. It uses the NGINX ingress controller (ingressClassName: nginx) to route external HTTP traffic to different backend services based on the requested host name:
- Requests to hello.local are forwarded to the hello-service on port 80.
- Requests to echo.local are forwarded to the echo-service on port 80.
Each rule matches all paths (/) using the Prefix path type. This setup allows you to expose multiple services under different hostnames using a single ingress controller.
Test the host-based routing from your host machine’s browser by visiting http://hello.local:30080 and http://echo.local:30080.
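If you’d rather not edit /etc/hosts just to test, you can also exercise the same rules from master-1 by setting the Host header explicitly (an optional check):
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
# The ingress controller routes on the Host header, so we can supply the hostname with curl
curl -H "Host: hello.local" http://$NODE_IP:30080/
curl -H "Host: echo.local" http://$NODE_IP:30080/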
Cleanup
Before moving on to objective 3, let’s clean up the resources created in this section, leaving the NGINX Ingress Controller intact. Run the following commands on master-1:
# Delete the hello-app deployment in the web-apps namespace
kubectl delete deployment hello-app -n web-apps
# Delete the hello-service service in the web-apps namespace
kubectl delete service hello-service -n web-apps
# Delete the echo-app deployment in the web-apps namespace
kubectl delete deployment echo-app -n web-apps
# Delete the echo-service service in the web-apps namespace
kubectl delete service echo-service -n web-apps
# Delete the path-based ingress resource in the web-apps namespace
kubectl delete ingress web-apps-ingress -n web-apps
# Delete the host-based ingress resource in the web-apps namespace
kubectl delete ingress host-based-ingress -n web-apps
# Delete the entire web-apps namespace and all its resources
kubectl delete namespace web-apps
Now, remove the entries that we added to /etc/hosts on your host machine:
# Remove any line with 'hello\.local' in it from /etc/hosts
sudo sed -i '/hello\.local/d' /etc/hosts
# Remove any line with 'dashboard\.local' in it from /etc/hosts
sudo sed -i '/dashboard\.local/d' /etc/hosts
# Remove any line with 'echo\.local' in it from /etc/hosts
sudo sed -i '/echo\.local/d' /etc/hosts
Objective 3: Monitoring with Prometheus, Loki, and Grafana
Monitoring is essential for maintaining healthy Kubernetes clusters. Prometheus has become the standard for Kubernetes monitoring, providing powerful metrics collection, alerting, and querying capabilities. Combined with Loki’s log aggregation and Grafana’s visualization features, we’ll have complete observability into our cluster.
Understanding the Monitoring Stack
Our monitoring stack will include:
- Prometheus: Metrics collection and storage
- Grafana: Metrics visualization and dashboards
- Loki: Log aggregation and querying
- Promtail: Log collection agent for Loki
- Node Exporter: Hardware and OS metrics from each node
- kube-state-metrics: Kubernetes object metrics
- Alertmanager: Alert routing and notification
Installing the Prometheus Stack
We’ll use the Prometheus Community Helm chart, which includes everything we need. First, create a monitoring namespace:
kubectl create namespace monitoring
Next, install the Prometheus stack, including Prometheus, Grafana, Alertmanager, and exporters:
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.service.type=NodePort \
--set prometheus.service.nodePort=30090 \
--set grafana.service.type=NodePort \
--set grafana.service.nodePort=30030 \
--set alertmanager.service.type=NodePort \
--set alertmanager.service.nodePort=30093
Installing Loki for Log Aggregation
Now let’s add Loki to our monitoring stack for centralized log management. We’ll install Loki using its Helm chart:
# Add Grafana Helm repository for Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki with Promtail
helm install loki grafana/loki-stack \
--namespace monitoring \
--set loki.service.type=NodePort \
--set loki.service.nodePort=30031 \
--set promtail.enabled=true \
--set grafana.enabled=false
Wait for all components to be ready (this may take several minutes):
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=prometheus -n monitoring --timeout=600s
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=grafana -n monitoring --timeout=600s
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=loki -n monitoring --timeout=600s
Now check the status of monitoring components:
kubectl get pods -n monitoring
kubectl get svc -n monitoring
Accessing Monitoring Interfaces
Let’s set up access to our monitoring tools. First, get the IP address of the node:
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
and print the URLs for the monitoring tools:
echo "Monitoring URLs:"
echo "Prometheus: http://$NODE_IP:30090"
echo "Grafana: http://$NODE_IP:30030"
echo "Alertmanager: http://$NODE_IP:30093"
echo "Loki: http://$NODE_IP:30031"
In order to access Grafana, you will need to get the admin password:
kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
… and voilà! We have a beautiful dashboard with rich metrics for our Kubernetes cluster:

Creating Ingress for Monitoring Tools
Let’s make our monitoring tools accessible through ingress with proper hostnames:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: monitoring-ingress
namespace: monitoring
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx
rules:
- host: grafana.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-grafana
port:
number: 80
- host: prometheus.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-kube-prometheus-prometheus
port:
number: 9090
- host: alertmanager.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-kube-prometheus-alertmanager
port:
number: 9093
- host: loki.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: loki
port:
number: 3100
EOF
Add these to your host machine’s /etc/hosts file, replacing the IP address with your node’s IP address (the hostnames must match the ingress rules we just created):
192.168.122.37 grafana.homelab.local
192.168.122.37 prometheus.homelab.local
192.168.122.37 alertmanager.homelab.local
192.168.122.37 loki.homelab.local
Configuring Loki Data Source in Grafana
Now we need to configure Loki as a data source in Grafana for log visualization:
# Get Loki service URL for internal cluster communication
LOKI_URL="http://loki.monitoring.svc.cluster.local:3100"
# Create a configmap to add Loki as a datasource
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasource-loki
namespace: monitoring
labels:
grafana_datasource: "1"
data:
loki.yaml: |
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki.monitoring.svc.cluster.local:3100
isDefault: false
editable: true
EOF
# Restart Grafana to pick up the new datasource
kubectl rollout restart deployment prometheus-grafana -n monitoring
kubectl wait --for=condition=available deployment/prometheus-grafana -n monitoring --timeout=300s
Exploring Prometheus Metrics
Access Prometheus at http://prometheus.homelab.local:30080 and explore some basic queries:
Basic PromQL Queries to Try:
- Node CPU Usage:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Memory Usage Percentage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Pod Count by Namespace:
count by (namespace) (kube_pod_info)
- Cluster CPU Requests:
sum(kube_pod_container_resource_requests{resource="cpu"})
Exploring Loki Logs
Access Loki directly at http://loki.homelab.local:30080 to explore log queries, or use Grafana’s Explore feature:
Basic LogQL Queries to Try:
- All logs from a specific namespace:
{namespace="kube-system"}
- Logs containing error messages:
{namespace=~".+"} |= "error"
- Logs from specific pods:
{pod=~"prometheus.*"}
Configuring Grafana Dashboards
Access Grafana at http://grafana.homelab.local:30080 (username: admin, password from the previous command) and import some pre-built dashboards:
- Import Kubernetes Cluster Overview Dashboard: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 7249, click “Load”, select the Prometheus data source, and click “Import”.
- Import Node Exporter Dashboard: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 1860, select the Prometheus data source, and click “Import”.
- Import Kubernetes Pod Overview: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 6417, select the Prometheus data source, and click “Import”.
- Import Loki Logs Dashboard: go to “Dashboards”, click ‘New’ → ‘Import’, use dashboard ID 13407, click “Load”, select the Loki data source, and click “Import”.
Creating Custom Alerts
Let’s create a custom alert for high CPU usage. On master-1
, create a Prometheus rule:
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: custom-alerts
namespace: monitoring
labels:
prometheus: kube-prometheus
role: alert-rules
spec:
groups:
- name: custom.rules
rules:
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 2m
labels:
severity: warning
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% for more than 2 minutes on {{ \$labels.instance }}"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 85% for more than 5 minutes on {{ \$labels.instance }}"
EOF
Check that the rule was created. You should see custom-alerts listed in the output of the following command:
kubectl get prometheusrule -n monitoring
Generating Test Load
Let’s create some load to test our monitoring and alerts. First, recreate the web-apps namespace since we cleaned it up earlier:
kubectl create namespace web-apps
Now deploy a CPU stress test:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: cpu-stress
namespace: web-apps
spec:
replicas: 1
selector:
matchLabels:
app: cpu-stress
template:
metadata:
labels:
app: cpu-stress
spec:
containers:
- name: cpu-stress
image: containerstack/cpustress
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 1000m
memory: 256Mi
command: ["/bin/sh"]
args: ["-c", "stress --cpu 2 --timeout 600s"]
EOF
Now, check your monitoring dashboards to see the increased CPU usage and corresponding log entries in Loki! The stress test will run for 10 minutes, giving you time to explore both metrics in Prometheus/Grafana and logs in Loki.
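One way to watch the stress test from Prometheus is a query like the following (an optional example; the metric comes from cAdvisor via the kubelet and is scraped by the kube-prometheus-stack):
# Total CPU usage of the cpu-stress pod's containers, averaged over 2 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="web-apps", pod=~"cpu-stress.*"}[2m]))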
Objective 4: Security with RBAC and Network Policies
Security is paramount in any Kubernetes environment. In this section, we’ll implement Role-Based Access Control (RBAC) and Network Policies to secure our cluster against unauthorized access and limit network traffic between components.
Understanding Kubernetes Security
Key security concepts we’ll implement:
- RBAC (Role-Based Access Control): Controls who can access what resources
- Network Policies: Controls network traffic between pods
- Pod Security Standards: Controls what pods can do
- Service Accounts: Provides identity for pods and services
Setting Up RBAC
Let’s create different user roles for our cluster. Start by creating a namespace for our RBAC examples:
kubectl create namespace rbac-demo
Next, create service accounts for developers and viewers. A service account in Kubernetes is a special type of account used by processes running in pods to interact securely with the Kubernetes API:
# Create a service account for developers
kubectl create serviceaccount developer -n rbac-demo
# Create a service account for viewers
kubectl create serviceaccount viewer -n rbac-demo
Now, create a Role that allows full access to pods, services, ConfigMaps, Secrets, Deployments, and ReplicaSets in the rbac-demo namespace:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: rbac-demo
name: developer-role
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps", "secrets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
EOF
Finally, create a Role that only allows read access:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: rbac-demo
name: viewer-role
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch"]
EOF
Now that we have created roles, we still need to bind them to the service accounts we created:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: developer-binding
namespace: rbac-demo
subjects:
- kind: ServiceAccount
name: developer
namespace: rbac-demo
roleRef:
kind: Role
name: developer-role
apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: viewer-binding
namespace: rbac-demo
subjects:
- kind: ServiceAccount
name: viewer
namespace: rbac-demo
roleRef:
kind: Role
name: viewer-role
apiGroup: rbac.authorization.k8s.io
EOF
Testing RBAC Permissions
Let’s test our RBAC setup by creating pods that use different service accounts.
First, we create a pod using the developer service account:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: developer-pod
namespace: rbac-demo
spec:
serviceAccountName: developer
containers:
- name: kubectl
image: bitnami/kubectl:latest
command: ["/bin/sleep", "3600"]
EOF
Next, create a pod using the viewer service account:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: viewer-pod
namespace: rbac-demo
spec:
serviceAccountName: viewer
containers:
- name: kubectl
image: bitnami/kubectl:latest
command: ["/bin/sleep", "3600"]
EOF
Wait for pods to be ready…
kubectl wait --for=condition=ready pod developer-pod -n rbac-demo --timeout=300s
kubectl wait --for=condition=ready pod viewer-pod -n rbac-demo --timeout=300s
Now let’s test our RBAC permissions, starting with the developer. If we have configured our policies correctly, both of these commands should work:
kubectl exec -n rbac-demo developer-pod -- kubectl get pods -n rbac-demo
kubectl exec -n rbac-demo developer-pod -- kubectl create deployment test-app --image=nginx -n rbac-demo
Moving on, let’s test the viewer role. The read-only get should succeed, but the create should be rejected with a “forbidden” error:
kubectl exec -n rbac-demo viewer-pod -- kubectl get pods -n rbac-demo
kubectl exec -n rbac-demo viewer-pod -- kubectl create deployment viewer-test --image=nginx -n rbac-demo
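You can also query RBAC permissions directly with kubectl auth can-i and service-account impersonation, without exec’ing into pods (an optional check):
# Should print "yes" - the developer role allows creating deployments
kubectl auth can-i create deployments -n rbac-demo --as=system:serviceaccount:rbac-demo:developer
# Should print "no" - the viewer role is read-only
kubectl auth can-i create deployments -n rbac-demo --as=system:serviceaccount:rbac-demo:viewer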
Implementing Network Policies
Network policies control traffic flow between pods. By default, Kubernetes allows all traffic, but we can restrict this for better security. Since we installed Calico as our CNI in Part 1, we have full NetworkPolicy support, which enables us to implement fine-grained traffic controls between pods.
Let’s start by creating a namespace for network policy testing:
kubectl create namespace network-test
Next, let’s deploy a frontend and backend application:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend
namespace: network-test
spec:
replicas: 2
selector:
matchLabels:
app: frontend
role: frontend
template:
metadata:
labels:
app: frontend
role: frontend
spec:
containers:
- name: frontend
image: nginx
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: frontend-service
namespace: network-test
spec:
selector:
app: frontend
ports:
- port: 80
targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: backend
namespace: network-test
spec:
replicas: 2
selector:
matchLabels:
app: backend
role: backend
template:
metadata:
labels:
app: backend
role: backend
spec:
containers:
- name: backend
image: nginx
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: backend-service
namespace: network-test
spec:
selector:
app: backend
ports:
- port: 80
targetPort: 80
EOF
Wait for deployments…
kubectl wait --for=condition=available deployment/frontend -n network-test --timeout=300s
kubectl wait --for=condition=available deployment/backend -n network-test --timeout=300s
Testing Default (Open) Network Access
Before implementing network policies, let’s verify that all pods can communicate freely (default Kubernetes behavior):
First, create a test pod to simulate external access:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: network-test-pod
namespace: network-test
labels:
app: test
spec:
containers:
- name: netshoot
image: nicolaka/netshoot
command: ["/bin/sleep", "3600"]
EOF
kubectl wait --for=condition=ready pod network-test-pod -n network-test --timeout=300s
Test 1: External pod can reach backend (should work with default policy)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service || echo "Connection failed"
Test 2: External pod can reach frontend (should work with default policy)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s frontend-service || echo "Connection failed"
Test 3: Frontend can reach backend (should work with default policy). First, grab the name of one frontend pod:
FRONTEND_POD=$(kubectl get pods -n network-test -l app=frontend -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n network-test $FRONTEND_POD -- timeout 5 curl -s backend-service && \
echo "✓ ALLOWED as expected" || echo "Unexpected failure!"
Adding Network Policies with Calico
Now let’s create network policies to control traffic. Calico’s NetworkPolicy implementation allows us to implement a zero-trust architecture using an implicit-deny strategy. This means that by default, all traffic is blocked unless specifically allowed.
Note: Calico provides excellent NetworkPolicy support, including both Kubernetes NetworkPolicy and Calico’s own enhanced GlobalNetworkPolicy resources. For this tutorial, we’ll use standard Kubernetes NetworkPolicy resources for compatibility.
To accomplish this, we must first define a default policy that denies all traffic:
# Create a default deny-all network policy
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: network-test
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
EOF
Testing Network Isolation (Deny-All Policy)
Let’s verify that our deny-all policy is working:
Test 1: External pod to backend (should FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Test 2: External pod to frontend (should FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s frontend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Test 3: Frontend to backend (should FAIL)
kubectl exec -n network-test $FRONTEND_POD -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Allowing Routes
Now that we have confirmed that our deny-all policy is working, we can begin explicitly adding acceptable network routes.
Let’s start with allowing DNS:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns-any
namespace: network-test
spec:
podSelector: {}
policyTypes: [Egress]
egress:
- ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
EOF
Next, let’s allow frontend pods to communicate with backend pods (ingress to backend):
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-to-backend
namespace: network-test
spec:
podSelector:
matchLabels:
role: backend
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
role: frontend
ports:
- protocol: TCP
port: 80
EOF
Now we allow frontend pods to make outbound connections (egress from frontend)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-egress-backend
namespace: network-test
spec:
podSelector:
matchLabels:
role: frontend
policyTypes:
- Egress
egress:
- to:
- podSelector:
matchLabels:
role: backend
ports:
- protocol: TCP
port: 80
EOF
We also need to allow external access to frontend (ingress to frontend)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-any-to-frontend
namespace: network-test
spec:
podSelector:
matchLabels:
role: frontend
policyTypes:
- Ingress
ingress:
- {}
EOF
Finally, in order to test external-to-frontend traffic, we need to allow the test pod to make outbound connections:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-test-egress-any
namespace: network-test
spec:
podSelector:
matchLabels:
app: test
policyTypes:
- Egress
egress:
- {} # allow all egress
EOF
Testing Frontend-to-Backend Communication
Let’s test communication between the frontend and the backend:
Test 1: External pod to backend (should still FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Test 2: Frontend to backend (should now SUCCEED)
kubectl exec -n network-test $FRONTEND_POD -- timeout 5 curl -s backend-service && \
echo "✓ ALLOWED as expected" || echo "Unexpected failure!"
Testing External Access to Frontend
Test 1: External pod to frontend (should now SUCCEED)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s frontend-service && \
echo "✓ ALLOWED as expected" || echo "Unexpected failure!"
Test 2: External pod to backend (should still FAIL)
kubectl exec -n network-test network-test-pod -- timeout 5 curl -s backend-service && \
echo "Unexpected success!" || echo "✓ BLOCKED as expected"
Verifying Calico NetworkPolicy Implementation
Let’s also verify that Calico is properly enforcing our network policies:
# Check that Calico is running and ready
echo "=== Calico Status ==="
kubectl get pods -n kube-system -l k8s-app=calico-node
# View the applied network policies
echo "=== Applied Network Policies ==="
kubectl get networkpolicies -n network-test -o wide
# Show network policy details
echo "=== Network Policy Details ==="
for policy in $(kubectl get networkpolicies -n network-test -o jsonpath='{.items[*].metadata.name}'); do
echo "--- Policy: $policy ---"
kubectl describe networkpolicy $policy -n network-test | grep -A 10 -B 2 "PodSelector\|Allowing\|Policy Types"
echo
done
Pod Security Standards
Pod Security Standards in Kubernetes are a set of built-in policies that define different levels of security controls for pods running in a cluster. These standards, Privileged, Baseline, and Restricted, help administrators enforce best practices by limiting what pods can do, such as restricting privilege escalation, enforcing non-root containers, and controlling access to host resources.
By applying these standards, you can reduce the risk of security vulnerabilities and ensure workloads adhere to organizational or compliance requirements.
Start by creating a secure-apps namespace:
kubectl create namespace secure-apps
Next, we need to label the namespace to enforce restricted security.
kubectl label namespace secure-apps pod-security.kubernetes.io/enforce=restricted
kubectl label namespace secure-apps pod-security.kubernetes.io/audit=restricted
kubectl label namespace secure-apps pod-security.kubernetes.io/warn=restricted
These commands label the secure-apps namespace to apply Kubernetes Pod Security Standards:
- enforce=restricted: Blocks pods that don’t meet the strictest security requirements.
- audit=restricted: Logs violations of the restricted policy for auditing purposes.
- warn=restricted: Issues warnings when a pod would violate the restricted policy.
This setup helps ensure only secure pods are allowed, while also providing visibility into potential security issues.
Now, try to create a privileged pod (this should fail).
cat <<EOF | kubectl apply -f - || echo "Expected: Pod security policy violation"
apiVersion: v1
kind: Pod
metadata:
name: privileged-pod
namespace: secure-apps
spec:
containers:
- name: privileged
image: nginx
securityContext:
privileged: true
EOF
Now try to create a compliant secure pod:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
namespace: secure-apps
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: secure-app
# The restricted profile requires running as non-root; the unprivileged nginx variant runs as a non-root user and listens on 8080
image: nginxinc/nginx-unprivileged:latest
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: false
ports:
- containerPort: 8080
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
EOF
kubectl wait --for=condition=ready pod secure-pod -n secure-apps --timeout=300s
echo "Secure pod created successfully!"
Objective 5: Maintenance Operations
In this final section, we’ll cover essential maintenance operations that every Kubernetes administrator needs to know: backup strategies, cluster upgrades, and troubleshooting techniques.
Backup Strategies
Regular backups are crucial for disaster recovery. We’ll cover backing up etcd (the cluster’s database) and persistent volume data. A comprehensive backup strategy should include both the cluster state (etcd) and application data (persistent volumes).
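The rest of this section focuses on etcd. For the application-data half in our homelab, one simple approach is to archive the Local Path Provisioner’s data directory on each node; the sketch below assumes the provisioner’s default path of /opt/local-path-provisioner and that the applications can tolerate a crash-consistent copy:
# Run on each node: archive all local-path volumes into a timestamped tarball
sudo mkdir -p /var/backups
sudo tar -czf /var/backups/local-path-$(hostname)-$(date +%Y%m%d).tar.gz -C /opt/local-path-provisioner .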
Setting Up etcd Backup
etcd stores all Kubernetes cluster state, making it the most critical component to backup. This includes all your deployments, services, secrets, and configuration data.
Install etcd-client on master-1:
sudo apt-get update
sudo apt-get install -y etcd-client
Then create a backup script: etcd-backup.sh
cat <<EOF > ~/etcd-backup.sh
#!/bin/bash
set -e
BACKUP_DIR="/var/backups/etcd"
BACKUP_FILE="etcd-backup-\$(date +%Y%m%d-%H%M%S).db"
# Create backup directory
sudo mkdir -p \$BACKUP_DIR
# Create etcd backup
sudo ETCDCTL_API=3 etcdctl snapshot save \$BACKUP_DIR/\$BACKUP_FILE \\
--endpoints=https://127.0.0.1:2379 \\
--cacert=/etc/kubernetes/pki/etcd/ca.crt \\
--cert=/etc/kubernetes/pki/etcd/server.crt \\
--key=/etc/kubernetes/pki/etcd/server.key
# Verify backup
sudo ETCDCTL_API=3 etcdctl snapshot status \$BACKUP_DIR/\$BACKUP_FILE
echo "Backup created: \$BACKUP_DIR/\$BACKUP_FILE"
# Clean up old backups (keep last 7 days)
sudo find \$BACKUP_DIR -name "etcd-backup-*.db" -mtime +7 -delete
EOF
The script automates the process of backing up the etcd database used by Kubernetes:
- It creates a timestamped backup file in the /var/backups/etcd directory, using secure credentials to connect to the etcd server.
- After saving the backup, it verifies the backup file’s status to ensure integrity.
- Finally, it cleans up old backup files, keeping only those from the last 7 days to manage disk space efficiently. This helps maintain regular, secure, and manageable etcd backups for disaster recovery.
Make the script executable and run it:
chmod +x ~/etcd-backup.sh
# Run the backup script
sudo ~/etcd-backup.sh
You should see output similar to:
Snapshot saved at /var/backups/etcd/etcd-backup-20250716-192759.db
3452ebb4, 67954, 1484, 20 MB
Backup created: /var/backups/etcd/etcd-backup-20250716-192759.db
Testing etcd Backup and Restore
Let’s test our etcd backup by simulating a cluster disaster and restoring from backup. Warning: This test will temporarily disrupt your cluster, so ensure you have a current backup first.
# First, create a fresh etcd backup
echo "=== Creating Fresh etcd Backup ==="
sudo ~/etcd-backup.sh
# Create some test resources to verify restore
echo "=== Creating Test Resources ==="
kubectl create namespace backup-test
kubectl create deployment test-app --image=nginx --replicas=2 -n backup-test
kubectl create service clusterip test-service --tcp=80:80 -n backup-test
kubectl create configmap test-config --from-literal=message="Hello from backup test" -n backup-test
# Verify test resources exist
echo "=== Verifying Test Resources ==="
kubectl get all -n backup-test
kubectl get configmap test-config -n backup-test -o yaml
# Get the latest backup file
LATEST_BACKUP=$(sudo ls -1t /var/backups/etcd/etcd-backup-*.db | head -1)
echo "Using backup file: $LATEST_BACKUP"
# Stop etcd to simulate disaster
# (in a kubeadm cluster etcd runs as a static pod, not a systemd service,
# so we stop it by moving its manifest out of the manifests directory)
echo "=== Simulating etcd Disaster ==="
echo "Stopping etcd..."
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/etcd.yaml.bak
sleep 20
# Move current etcd data (simulate corruption/loss)
sudo mv /var/lib/etcd /var/lib/etcd.backup.$(date +%s)
# Restore from backup
echo "=== Restoring from etcd Backup ==="
sudo ETCDCTL_API=3 etcdctl snapshot restore $LATEST_BACKUP \
--data-dir /var/lib/etcd \
--initial-cluster=master-1=https://$(hostname -I | awk '{print $1}'):2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://$(hostname -I | awk '{print $1}'):2380 \
--name=master-1
# Restart etcd by putting its static pod manifest back
# (kubeadm's etcd runs as root, so no ownership change is needed on the restored data)
echo "=== Restarting etcd ==="
sudo mv /etc/kubernetes/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml
# Wait for etcd to be healthy
echo "Waiting for etcd to become healthy..."
sleep 10
# Check etcd health
sudo ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Restart kubelet to reconnect to etcd
sudo systemctl restart kubelet
# Wait for cluster to be responsive
echo "Waiting for cluster to become responsive..."
sleep 30
# Verify cluster is working
echo "=== Verifying Cluster Restoration ==="
kubectl cluster-info
kubectl get nodes
# Verify our test resources were restored
echo "=== Verifying Test Resources Restored ==="
kubectl get all -n backup-test
kubectl get configmap test-config -n backup-test -o jsonpath='{.data.message}'
echo
# Cleanup test resources
echo "=== Cleaning Up Test Resources ==="
kubectl delete namespace backup-test
echo "=== etcd Backup and Restore Test Completed Successfully! ==="
Alternative Safe Test Method
If you prefer a less disruptive test, you can verify backup integrity without actually restoring:
# Create a test backup
echo "=== Creating Test Backup ==="
sudo ~/etcd-backup.sh
# Get the latest backup
LATEST_BACKUP=$(sudo ls -1t /var/backups/etcd/etcd-backup-*.db | head -1)
echo "Testing backup: $LATEST_BACKUP"
# Verify backup integrity
echo "=== Verifying Backup Integrity ==="
sudo ETCDCTL_API=3 etcdctl snapshot status $LATEST_BACKUP --write-out=table
# Test restore to a temporary location (without actually using it)
echo "=== Testing Restore Process ==="
sudo ETCDCTL_API=3 etcdctl snapshot restore $LATEST_BACKUP \
--data-dir /tmp/etcd-restore-test \
--initial-cluster=master-1=https://127.0.0.1:2380 \
--initial-cluster-token=test-token \
--initial-advertise-peer-urls=https://127.0.0.1:2380 \
--name=master-1
# Check if restore created the directory structure
echo "=== Verifying Restore Structure ==="
sudo ls -la /tmp/etcd-restore-test/
# Cleanup test restore
sudo rm -rf /tmp/etcd-restore-test
echo "=== Backup Verification Completed Successfully! ==="
Note: The first test method actually restores etcd and will temporarily disrupt your cluster (about 1-2 minutes). The second method only verifies backup integrity without disruption. Both methods confirm that your backup strategy is working correctly.
Setting Up Automated Backups
Now let’s set up automated backups using cron:
# Set up cron job for daily backups at 2 AM
echo "0 2 * * * /home/ubuntu/etcd-backup >> /var/log/kubernetes-backup.log 2>&1" | sudo crontab -
# Verify crontab job
sudo crontab -l
Cluster Upgrades
Upgrading Kubernetes clusters requires careful planning and execution. We’ll demonstrate upgrading from the current version to a newer patch version.
Checking Current Versions
First, check which versions of kubeadm, kubectl, and kubelet you have installed:
# Check current versions
kubectl version
kubeadm version
kubelet --version
Our current version is 1.31.9-1.1. Now check for available upgrades:
# Check available upgrades
sudo apt update
sudo apt-cache madison kubeadm | head -5
You should see output similar to:
ubuntu@master-1:~$ sudo apt-cache madison kubeadm | head -5
kubeadm | 1.31.11-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.10-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.9-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.8-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
kubeadm | 1.31.7-1.1 | https://pkgs.k8s.io/core:/stable:/v1.31/deb Packages
You will need to find the version to upgrade to (used in the commands below) from this list. In our case, it is 1.31.11-1.1.
Upgrading the Control Plane
First, upgrade kubeadm on the control plane:
sudo apt-mark unhold kubeadm
sudo apt update
sudo apt install -y kubeadm=1.31.11-1.1
sudo apt-mark hold kubeadm
# Check the upgrade plan
sudo kubeadm upgrade plan
# Apply the upgrade (replace with actual available version)
sudo kubeadm upgrade apply v1.31.11 --yes
# Drain the control plane node
kubectl drain master-1 --ignore-daemonsets --delete-emptydir-data
# Upgrade kubelet and kubectl
sudo apt-mark unhold kubelet kubectl
sudo apt update
sudo apt install -y kubelet=1.31.11-1.1 kubectl=1.31.11-1.1
sudo apt-mark hold kubelet kubectl
# Restart kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Uncordon the node
kubectl uncordon master-1
# Verify the control plane upgrade
kubectl get nodes
kubectl version
Upgrading Worker Nodes
For each worker node, perform these steps:
# SSH into worker-1
ssh ubuntu@<worker-1-ip>
# Upgrade kubeadm
sudo apt-mark unhold kubeadm
sudo apt update
sudo apt install -y kubeadm=1.31.11-1.1
sudo apt-mark hold kubeadm
# Upgrade the node configuration
sudo kubeadm upgrade node
# From the control plane, drain the worker node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# Back on worker-1, upgrade kubelet and kubectl
sudo apt-mark unhold kubelet kubectl
sudo apt update
sudo apt install -y kubelet=1.31.11-1.1 kubectl=1.31.11-1.1
sudo apt-mark hold kubelet kubectl
# Restart kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# From control plane, uncordon the worker
kubectl uncordon worker-1
# Repeat for worker-2
Troubleshooting Techniques
Let’s cover common troubleshooting scenarios and techniques:
Creating a Troubleshooting Toolkit
The troubleshooting-toolkit is a temporary pod based on the nicolaka/netshoot container image. This image is a “Swiss-army knife” for network troubleshooting; it comes pre-loaded with dozens of useful tools (ping, dig, curl, tcpdump, mtr, etc.) that are often missing from standard application containers.
The command: ["/bin/sleep", "3600"] instruction keeps the pod running for an hour. The NET_ADMIN capability gives it the necessary permissions to perform advanced network diagnostics.
# Deploy a troubleshooting pod with useful tools
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-toolkit
namespace: default
spec:
containers:
- name: toolkit
image: nicolaka/netshoot
command: ["/bin/sleep", "3600"]
securityContext:
capabilities:
add: ["NET_ADMIN"]
restartPolicy: Never
EOF
kubectl wait --for=condition=ready pod troubleshooting-toolkit --timeout=300s
To use the toolkit, simply execute commands inside the pod with kubectl exec. This allows you to diagnose cluster networking issues from the perspective of a pod running within the cluster.
Some examples of common use cases are:
Testing DNS Resolution:
Check if a service name resolves correctly from within the cluster.
kubectl exec -it troubleshooting-toolkit -- dig my-service.my-namespace.svc.cluster.local
Checking Pod-to-Service Connectivity:
Verify if you can reach a service’s ClusterIP.
kubectl exec -it troubleshooting-toolkit -- curl http://my-service.my-namespace.svc.cluster.local
kubectl exec -it troubleshooting-toolkit -- curl http://<service-cluster-ip>:<port>
Pinging Another Pod’s IP:
Test basic network reachability to another pod.
kubectl exec -it troubleshooting-toolkit -- ping <other-pod-ip>
When you’re finished, you can delete the pod with kubectl delete pod troubleshooting-toolkit.
Common Troubleshooting Commands
Below are some common troubleshooting commands that you should learn as you dive into the world of Kubernetes orchestration.
Check cluster health
kubectl get componentstatuses # Deprecated in recent Kubernetes releases, but still returns basic control plane health
kubectl cluster-info
kubectl get events --sort-by=.metadata.creationTimestamp
Check node health
kubectl describe nodes
kubectl top nodes # Requires metrics-server
Check pod issues
kubectl get pods --all-namespaces
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container logs
Check resource usage
kubectl top pods --all-namespaces
kubectl describe resourcequota --all-namespaces
Network troubleshooting
kubectl exec -it troubleshooting-toolkit -- nslookup kubernetes.default
kubectl exec -it troubleshooting-toolkit -- ping <pod-ip>
kubectl exec -it troubleshooting-toolkit -- netstat -tlnp
Check persistent volumes
kubectl get pv
kubectl get pvc --all-namespaces
kubectl describe pv <volume-name>
Setting Up Log Aggregation
For better troubleshooting, let’s set up centralized logging by deploying a log aggregator into our Kubernetes cluster using a Fluentd DaemonSet.
Start by creating a logging namespace:
# Create a simple log aggregation setup using Fluentd
kubectl create namespace logging
Then, create a ConfigMap for the Fluentd configuration:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: logging
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type cri
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
<match kubernetes.**>
@type loki
url "http://loki.monitoring.svc.cluster.local:3100"
flush_interval 10s
</match>
EOF
Note: The Loki service URL is http://loki.monitoring.svc.cluster.local:3100. This uses the full Kubernetes DNS name to reach the loki service in the monitoring namespace from the logging namespace.
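The DaemonSet below references a fluentd service account, and the kubernetes_metadata filter needs read access to pod metadata, so create the account and a minimal set of permissions first (a minimal sketch; your cluster may need broader rules):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging
EOF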
Now deploy the daemonset:
# Deploy Fluentd DaemonSet for log collection
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      # Add a service account for proper permissions
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        # Use a generic image, not the Elasticsearch-specific one
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-amd64
        volumeMounts:
        # Mount the configuration file from the ConfigMap
        - name: config-volume
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      # Define the ConfigMap as a volume source
      - name: config-volume
        configMap:
          name: fluentd-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
EOF
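One thing to watch: the DaemonSet references a fluentd ServiceAccount that we haven't created yet, and the kubernetes_metadata filter in our config needs permission to read pod and namespace metadata from the API server. A minimal sketch of the ServiceAccount and RBAC objects (names here match the serviceAccountName above) would look like this:
# Create the ServiceAccount and RBAC needed by the Fluentd DaemonSet
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging
EOF
Also note that the @type loki output relies on the fluent-plugin-grafana-loki plugin, which may not be bundled in the generic image tag used above; if the Fluentd pods log an unknown plugin error, switch to an image variant that ships the Loki plugin or build your own image with the plugin installed.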
A DaemonSet is a Kubernetes workload that ensures a copy of a pod runs on every node in the cluster. In this case, it deploys a Fluentd pod on each node to act as a log collection agent.
The key to how it works is in the volumes and volumeMounts sections:
- hostPath: This gives the Fluentd pod direct access to directories on the underlying host node.
- Directory Access: It specifically mounts /var/log and /var/lib/docker/containers from the node into the pod. This is where container logs are written.
By doing this, each Fluentd pod can see and collect the logs from all other pods running on the same node. The fluent.conf file mounted from the ConfigMap tells Fluentd where to forward these collected logs, in this case to the Loki service we set up earlier.
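Once the ServiceAccount exists and the DaemonSet is applied, you should see one Fluentd pod on each schedulable node (the control-plane node will be skipped unless you add a toleration for its taint). A quick way to verify the rollout and confirm logs are flowing:
# One Fluentd pod per schedulable node
kubectl get pods -n logging -o wide
# Check the collector's own logs for errors or confirmation that it is shipping to Loki
kubectl logs -n logging -l name=fluentd --tail=20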
Creating Health Checks
It can be very useful to script a tool that gives you a quick snapshot of the overall health of your cluster: nodes, pods, events, and so on. The script below captures resource usage, pod statuses, recent events, and more.
Create cluster-health-check.sh:
cat <<EOF > ~/cluster-health-check.sh
#!/bin/bash
echo "=== Kubernetes Cluster Health Check ==="
echo "Date: \$(date)"
echo
echo "=== Cluster Info ==="
kubectl cluster-info
echo
echo "=== Node Status ==="
kubectl get nodes -o wide
echo
echo "=== Pod Status by Namespace ==="
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
echo
echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp | tail -10
echo
echo "=== Resource Usage ==="
echo "Node resource usage:"
kubectl top nodes 2>/dev/null || echo "Metrics server not available"
echo
echo "=== Persistent Volume Status ==="
kubectl get pv
echo
echo "=== Critical System Pods ==="
kubectl get pods -n kube-system | grep -E "(etcd|apiserver|controller|scheduler)"
echo
echo "=== Network Policy Status ==="
kubectl get networkpolicies --all-namespaces
echo
echo "Health check completed."
EOF
Now make it executable and test it:
chmod +x ~/cluster-health-check.sh
~/cluster-health-check.sh
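If you'd like this report on a schedule rather than on demand, a simple cron entry works well; the path and schedule below are just examples:
# Run the health check every morning at 07:00 and append the output to a log file
(crontab -l 2>/dev/null; echo "0 7 * * * $HOME/cluster-health-check.sh >> $HOME/cluster-health.log 2>&1") | crontab -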
Conclusion and Next Steps
Congratulations! You’ve successfully transformed your basic Kubernetes cluster into a well-oiled, production-ready platform. Let’s review what we’ve accomplished:
What We’ve Built
- Persistent Storage: Implemented dynamic volume provisioning with Local Path Provisioner, enabling stateful applications like databases
- External Access: Deployed NGINX Ingress Controller with both path-based and host-based routing for external application access
- Comprehensive Monitoring: Set up the Prometheus, Loki, and Grafana stack for complete cluster observability with custom alerts and log aggregation
- Security: Implemented RBAC for access control, Network Policies for traffic isolation, and Pod Security Standards for container security
- Maintenance Operations: Established backup procedures, upgrade processes, and troubleshooting toolkits
Current Cluster Capabilities
Your cluster now provides enterprise-grade features:
- High Availability: Multi-node setup with monitoring and alerting
- Security: Role-based access control and network isolation
- Observability: Rich metrics, dashboards, and alerting
- External Access: Production-ready ingress with host-based routing
- Data Persistence: Reliable storage for stateful applications
- Maintainability: Automated backups and systematic upgrade procedures
Best Practices Learned
Throughout this tutorial, you’ve implemented industry best practices:
- Infrastructure as Code: All configurations defined in YAML manifests
- Security by Default: RBAC, Network Policies, and Pod Security Standards
- Observability First: Comprehensive monitoring before issues arise
- Backup Strategy: Regular automated backups of critical data
- Systematic Upgrades: Planned, tested upgrade procedures
Next Steps for Advanced Learning
Your homelab cluster is now ready for advanced exploration:
- Service Mesh: Implement Istio or Linkerd for advanced traffic management
- GitOps: Set up ArgoCD for automated application deployment
- Advanced Storage: Explore CSI drivers and distributed storage solutions
- Multi-Cluster: Set up cluster federation or multi-cluster management
- Advanced Security: Implement OPA Gatekeeper for policy as code
- CI/CD Integration: Connect Jenkins, GitLab, or GitHub Actions
Community and Resources
Continue your Kubernetes journey with these resources:
- Official Documentation: kubernetes.io
- Community Forums: Kubernetes Slack, Stack Overflow, Reddit
- Training Platforms: Cloud provider training, CNCF courses
- Certifications: CKA, CKAD, CKS certifications
- Local Meetups: Join Kubernetes and Cloud Native meetups
Final Thoughts
You’ve built more than just a Kubernetes cluster; you’ve created a scalable learning platform that mirrors real-world production environments. The skills and patterns you’ve learned here are directly applicable to any Kubernetes environment, from small startups to large enterprises.
Your homelab cluster serves as both a learning environment and a testing ground for new technologies and approaches. Continue experimenting, breaking things, and rebuilding them - this hands-on experience is invaluable for mastering Kubernetes.
The foundation you’ve established provides a solid base for exploring advanced Kubernetes concepts and integrating with the broader cloud-native ecosystem. Whether you’re preparing for certification, building production systems, or simply satisfying your curiosity about container orchestration, your cluster is ready for the next phase of your journey.
Remember that Kubernetes is a rapidly evolving ecosystem. Stay curious, keep experimenting, and don’t hesitate to tear down and rebuild your cluster as you learn new concepts. The automation scripts from Part 1 make it easy to start fresh whenever needed.
Happy clustering, and welcome to the exciting world of Kubernetes operations!
Join us for Part 3, where we deploy Google’s Online Boutique microservices demo application, showcasing a realistic multi-service architecture in preparation for service mesh implementation.
