
Building an Image Classifier – Part 1: Preprocessing and Preparing the CIFAR-10 Dataset

Part 1 of our series on building and deploying a full-stack image classification system with Python, deep learning, and FastAPI.

Aaron Mathis
20 min read

Before you can build a world-class machine learning model, you need clean, structured, and consistent data. This is the first post in a multi-part tutorial series that walks through the entire lifecycle of building and deploying an image classification model, from raw pixels to a fully operational web API and user interface.

We’ll begin by preparing a real-world image dataset (CIFAR-10) for training. Along the way, we’ll detect and remove corrupted and blurry images, standardize image sizes, encode labels, and export the cleaned data in a format ready for model training. By the end of this post, you’ll have a reusable preprocessing pipeline that ensures your input data is high quality — and your models are set up for success.


Getting the Dataset

For this tutorial, we’ll be using the CIFAR-10 dataset, a widely used benchmark for image classification tasks. CIFAR-10 contains 60,000 color images in 10 different categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with each image sized at just 32×32 pixels.

For consistency with many teaching environments (including Kaggle-style challenges), we’ll be working with a modified version of CIFAR-10 where:

  • Training images are placed in a flat directory (train/)
  • Test images are stored separately in a test/ directory
  • Labels are defined in a CSV file (trainLabels.csv)
  • An optional sampleSubmission.csv is provided to format predictions

The dataset can be downloaded directly from Kaggle.

Directory Structure

After unzipping the dataset, your directory should look like this:

cifar-10/
├── train/                 # 50,000 training images, flat layout (1.png, 2.png, ...)
├── test/                  # 10,000 test images, unlabeled
├── trainLabels.csv        # Contains label for each training image
└── sampleSubmission.csv   # Example output format for test predictions

Note: Within the cifar-10.zip file, you will find train.7z and test.7z. These archives need to be extracted as well. During extraction, extract directly into cifar-10 rather than into cifar-10/test or cifar-10/train; the archives already contain their own test/ and train/ directories. If you end up with a nested structure like cifar-10/test/test/1.png, the code will throw exceptions.

Sample: trainLabels.csv

id,label
1,cat
2,dog
3,automobile
...

In this format, we’ll need to load the images from the train folder and match them with their corresponding labels from the CSV. The test set does not contain labels, but we’ll still preprocess those images the same way (minus label encoding) to ensure consistency when running inference later.
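If you want to peek at the manifest before writing any pipeline code, a couple of pandas calls will do. This is an optional sanity check, assuming the dataset sits in cifar-10/:

import pandas as pd

# Inspect the first few rows and the class distribution
df = pd.read_csv("cifar-10/trainLabels.csv")
print(df.head())
print(df['label'].value_counts())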


Planning the Preprocessing Pipeline

Before we train any models, it’s critical to ensure the image data is consistent, clean, and correctly formatted. Poor input data, such as corrupted files, low-quality images, or inconsistent shapes, can significantly degrade your model’s performance or cause it to fail entirely.

Here’s what our preprocessing pipeline will do:

Preprocessing Goals

  • Image Validation: Detect and skip unreadable or corrupted image files
  • Blurriness Filtering: Use variance of Laplacian[1] to identify and remove overly blurry images
  • Padding & Resizing: Ensure all images are square and resized to 32×32 pixels
  • Label Encoding: Convert string-based class labels (e.g., "cat") into numeric values
  • Data Export: Store processed image and label tensors in .pt format for fast loading

[1] The Laplacian operator is a second-order derivative that highlights regions of rapid intensity change, useful for detecting image sharpness or blurriness. You can read more here if you must know.

This approach ensures your training data is:

  • Free from errors or visual noise
  • Standardized in shape and resolution
  • Ready to be batched and fed into a neural network

Alright, enough talking. It’s time to get down to brass tacks and dive in!


Setting Up the Project Environment

To keep your machine clean and your project dependencies isolated, it’s best to use a virtual environment. This ensures that the packages you install for this image classification project don’t interfere with other Python projects or system-wide packages.

This tutorial series assumes that you have Python installed and that it is at least version 3.9.

Create a Project Folder

Start by creating a directory for your project and navigating into it:

mkdir deepthought-image-classifier
cd deepthought-image-classifier

Create a Virtual Environment

Run the following commands to create and activate a standalone virtual environment named .venv:

Bash (Linux/MacOS)

python -m venv .venv
source .venv/bin/activate

PowerShell (Windows)

python -m venv .venv
.\.venv\Scripts\Activate.ps1

You should see your terminal prompt change to something like (.venv). If it doesn’t work in PowerShell, try running this once before rerunning the Activate script:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Install Required Packages

We’ll use these packages for image processing, CSV handling, and tensor operations:

pip install numpy pandas matplotlib pillow opencv-python-headless torch

opencv-python-headless is used instead of the full GUI version to avoid extra dependencies (ideal for headless systems and containers).
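If you want to confirm that everything installed cleanly, a quick import check from the Python REPL works. This optional snippet simply prints each library’s version:

import numpy, pandas, matplotlib, PIL, cv2, torch

print("NumPy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("Matplotlib:", matplotlib.__version__)
print("Pillow:", PIL.__version__)
print("OpenCV:", cv2.__version__)
print("PyTorch:", torch.__version__)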

Save Environment Snapshot

To make your project reproducible, freeze the installed packages into requirements.txt:

pip freeze > requirements.txt

Create New File: preprocessing.py

Create an empty Python file named preprocessing.py, either in your editor / IDE of choice or by running the command below:

Bash (Linux/MacOS)

touch preprocessing.py

PowerShell (Windows)

New-Item -ItemType File -Path .\preprocessing.py

Final Directory Structure

deepthought-image-classifier/
├── .venv/                # Your virtual environment (excluded from Git)
├── cifar-10/             # Dataset will go here (train/, test/, trainLabels.csv, etc.)
├── preprocessing.py      # We will write this script in the next section
└── requirements.txt      # Dependencies snapshot

Loading Labels and Preparing Class Mappings

Before we can start working with image files, we need to understand what each image represents. In our dataset, the train/ folder contains tens of thousands of .png images, but the class labels aren’t embedded in the folder structure. Instead, they’re provided in a separate CSV file named trainLabels.csv.

This CSV acts like a manifest: it tells us which image belongs to which class. Each row consists of two columns: id and label. The id corresponds to the image filename (e.g., 1.png, 2.png), and the label is the human-readable class name (e.g., cat, dog, truck).

Here’s what the first few lines of the CSV look like:

id,label
1,cat
2,dog
3,automobile
4,airplane

We’ll start our preprocessing script by loading this CSV and building two key data structures:

  1. A dictionary that maps each image ID to its class name
  2. A mapping from class names (like "cat") to numeric indices (like 3), because neural networks expect numerical labels, not strings

Let’s implement this in Python.

Load and Map Labels

Open preprocessing.py in your editor and begin with the following code:

import os
import cv2
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image, UnidentifiedImageError

# Define paths
BASE_DIR = Path("cifar-10")
TRAIN_DIR = BASE_DIR / "train"
TEST_DIR = BASE_DIR / "test"
LABELS_CSV = BASE_DIR / "trainLabels.csv"
OUTPUT_TRAIN_FILE = Path("processed","prepared_train_dataset.pt")

To begin working with our labels, we’ll create a helper function that loads the trainLabels.csv file, extracts the unique class names, maps them to integer IDs, and builds a lookup dictionary.

This is a foundational step in many classification pipelines — converting raw labels like "dog", "cat", and "airplane" into numeric class IDs that a model can understand.

We’ll start by loading the CSV using pandas, then prepare the class names for training. Neural networks don’t understand words like "cat"; they expect integers like 0, 1, or 9 to represent class categories. To bridge that gap, we’ll sort the unique class names and assign each a consistent index. Finally, we create a mapping from each image ID to its label, and wrap everything into a reusable function called load_label_mapping().

def load_label_mapping(labels_csv=LABELS_CSV):
    """Loads the label mapping from the CSV file."""
    df = pd.read_csv(labels_csv)

    # Create a label map
    label_map = {row['id']: row['label'] for _, row in df.iterrows()}

    # Get sorted unique class names and assign integer IDs
    class_names = sorted(set(label_map.values()))
    class_to_idx = {label: idx for idx, label in enumerate(class_names)}

    print(f"Loaded {len(label_map)} labels across {len(class_names)} classes.")
    return label_map, class_to_idx, class_names

For example, after this runs, label_map[1] would return "cat" and label_map[2] would return "dog".

If the label_map line looks foreign to you, that’s because it uses a dictionary comprehension rather than a traditional for loop. You can read more about them here.
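For reference, here is the same label_map construction written as a traditional loop; both versions produce identical results:

# Equivalent to the dictionary comprehension above
label_map = {}
for _, row in df.iterrows():
    label_map[row['id']] = row['label']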

With this groundwork laid, we now have everything we need to associate image files with numeric class labels. In the next section, we’ll begin validating and transforming those images, checking for corruption, filtering for quality, and resizing each one into a consistent 32×32 format.


Validating and Filtering Image Files

Machine learning is only as good as the data it’s trained on. If our training set includes corrupted or unreadable image files, or images that are too blurry to be meaningful, we risk confusing the model and degrading its performance. This is especially critical with small datasets like CIFAR-10, where every image counts.

In this section, we’ll implement two quality control measures:

  1. Image validation – We’ll confirm that each image file is readable and not corrupted.
  2. Blurriness detection – We’ll measure the “sharpness” of each image and remove ones that are too blurry to be useful.

Let’s start by adding these checks to our preprocessing script.

Image Validation with Pillow

Sometimes, image files are present on disk but cannot be opened — either because they’re incomplete, corrupted, or improperly encoded. The best way to catch this early is to attempt to open the image using the Pillow library.

Add this function to your preprocessing.py file:


def is_image_valid(path: str) -> bool:
    try:
        with Image.open(path) as img:
            img.verify()  # Verifies the header without loading the full image into memory
        return True
    except (UnidentifiedImageError, IOError):
        return False

This function attempts to open and verify the image. If the file is invalid or unreadable, it returns False, and we’ll skip that image during preprocessing.
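As a quick spot check, you can scan the training directory and count unreadable files before running the full pipeline. A minimal sketch, reusing the TRAIN_DIR path defined earlier:

# Optional spot check: count unreadable files in train/
png_paths = list(TRAIN_DIR.glob("*.png"))
bad_files = [p for p in png_paths if not is_image_valid(str(p))]
print(f"Found {len(bad_files)} unreadable images out of {len(png_paths)}")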

Blurriness Detection with OpenCV

Some images may technically be readable but are so blurry that they contribute noise to the training process. We’ll use a common computer vision trick: the Laplacian variance.

The Laplacian operator highlights edges in an image. If an image has very few edges, it likely means it’s blurry. Measuring the variance of the Laplacian gives us a numeric estimate of how sharp the image is.

Here’s the function to compute that:


def is_blurry(image: np.ndarray, threshold: float = 100.0) -> bool:
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    lap_var = cv2.Laplacian(gray, cv2.CV_64F).var()
    return lap_var < threshold

This function works with OpenCV images (NumPy arrays) and returns True if the image is considered too blurry to use. The threshold value is tunable; you can experiment with values between 50 and 150 depending on your dataset. For CIFAR-10, a value around 100.0 works well in most cases.
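If you want to tune the threshold for your own data, one approach is to sample a handful of images and look at the spread of Laplacian variance scores. A minimal sketch, assuming TRAIN_DIR from earlier:

import random

# Summarize sharpness scores for 100 random training images
paths = random.sample(list(TRAIN_DIR.glob("*.png")), 100)
scores = sorted(
    cv2.Laplacian(cv2.cvtColor(cv2.imread(str(p)), cv2.COLOR_BGR2GRAY), cv2.CV_64F).var()
    for p in paths
)
print(f"min={scores[0]:.1f}  median={scores[50]:.1f}  max={scores[-1]:.1f}")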

Combining Both Checks

Now that we have both is_image_valid() and is_blurry(), we can apply them in sequence to each image. If an image fails either test, we skip it and move on to the next one.

We’ll integrate this logic into the main preprocessing loop soon, but first we need to prepare one more important step: ensuring that each image is correctly sized and shaped for our model.

In the next section, we’ll write a function to pad and resize images to 32×32 pixels, the input shape expected by CIFAR-10 models.


Padding and Resizing Images

When working with convolutional neural networks (CNNs), your input data must have consistent dimensions. CIFAR-10 is designed for images of size 32×32 pixels, so we’ll standardize every image to that shape.

However, real-world image datasets, even curated ones like CIFAR-10, may not always be square or perfectly sized. Instead of directly resizing a rectangular image (which can distort its contents), a better approach is to pad the image to a square shape first, then resize it.

This preserves the aspect ratio of the original image while allowing us to meet the input requirements of our model.

Step-by-Step: How We’ll Process Each Image

Let’s break down what our image transformation function will do:

  1. Check image dimensions: Get the height and width.
  2. Determine which side is smaller: Add black padding to that side to make the image square.
  3. Resize the padded image: Use OpenCV to scale it to exactly 32×32.
  4. Return the transformed image: Now safe for model input.

Padding and Resizing Function

def pad_and_resize(image: np.ndarray, target_size: int = 32) -> np.ndarray:
    h, w, _ = image.shape
    diff = abs(h - w)

    # Compute padding
    pad1, pad2 = diff // 2, diff - diff // 2
    if h > w:
        # Image is taller than wide: pad width
        padded = cv2.copyMakeBorder(image, 0, 0, pad1, pad2, cv2.BORDER_CONSTANT, value=[0, 0, 0])
    else:
        # Image is wider than tall: pad height
        padded = cv2.copyMakeBorder(image, pad1, pad2, 0, 0, cv2.BORDER_CONSTANT, value=[0, 0, 0])

    # Resize to target size
    resized = cv2.resize(padded, (target_size, target_size))
    return resized

Why Padding First is Better than Just Resizing

If you simply resize an image without padding, you risk stretching it unnaturally, making square objects look oval, or distorting features like faces or shapes. This distortion can confuse your model and reduce accuracy.

By padding first, we maintain the integrity of the original content and give the model a more accurate representation of each object.

For example…

Let’s say you have an image with shape (24, 32, 3): 24 pixels tall and 32 wide, so it’s wider than it is tall. Without padding, resizing to (32, 32) would stretch the image vertically. With padding, we simply add 4 pixels of black space to the top and bottom, making it (32, 32, 3) before resizing, with no distortion.
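You can verify this behavior with a synthetic array before running it on real data:

# Sanity check: a 24x32 dummy image should come back as 32x32
dummy = np.zeros((24, 32, 3), dtype=np.uint8)
print(pad_and_resize(dummy).shape)  # -> (32, 32, 3)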

With validation, blurriness detection, and padding in place, we’re now ready to write the main preprocessing loop that loads each image, applies all these checks, and builds a dataset of clean, standardized training images.


Putting It All Together: Preprocessing the CIFAR-10 Dataset

With our helper functions in place for validating images, detecting blurriness, and reshaping them to a consistent format, it’s time to bring everything together. In this section, we’ll walk through how to loop over the entire CIFAR-10 training set and build a clean, standardized dataset ready for model training.

Our goal is to iterate over each image listed in the trainLabels.csv file, apply our quality control checks, and prepare the data for machine learning. Specifically, we’ll load each image, discard any that are unreadable or too blurry, reshape the usable ones to 32×32 pixels, and encode their labels as integers. At the end of the process, we’ll save the entire dataset in a compressed format that can be quickly loaded later during training.

Let’s walk through the logic…

We begin by loading the CSV and generating our label mappings using the functions we wrote earlier. Then we loop through each image ID from the label map. For each image, we first check if the file exists and is readable. If the image passes these initial checks, we load it into memory using OpenCV and convert it from BGR (OpenCV’s default format) to RGB.

At this point, we apply the blurriness filter. This is an important safeguard, especially in datasets that have been collected automatically or manipulated in earlier steps. Blurry images can be ambiguous even to humans, and for models trying to learn fine-grained patterns, they can act as noise.

Once the image passes validation and quality checks, we apply our pad_and_resize function to reshape it to 32×32 pixels. At this point, we also convert its class label from a string (e.g., "dog") to its corresponding integer (e.g., 3) using the dictionary we constructed earlier.

Each clean image and its numeric label are then appended to our training lists. After processing all images, we stack the image tensors with torch.stack, convert the labels to a tensor, and save both to disk with torch.save as a .pt file.

Here is the complete function that encapsulates this process:

def preprocess_train_set():
    print("Loading label map...")
    label_map, class_to_idx, class_names = load_label_mapping()

    X, y = [], []
    total = len(label_map)
    print(f"Found {total} labeled training images.")

    for i, (img_id, label) in enumerate(label_map.items(), start=1):
        img_path = os.path.join(TRAIN_DIR, f"{img_id}.png")

        if not os.path.exists(img_path):
            print(f"[{i}/{total}] Skipping missing: {img_path}")
            continue
        if not is_image_valid(img_path):
            print(f"[{i}/{total}] Skipping corrupted: {img_path}")
            continue

        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Apply the blurriness filter before any further processing
        if is_blurry(img):
            print(f"[{i}/{total}] Skipping blurry: {img_path}")
            continue

        img = pad_and_resize(img)
        img = img.astype(np.float32) / 255.0
        img = np.transpose(img, (2, 0, 1))  # Convert HWC to CHW
        X.append(torch.tensor(img))
        y.append(class_to_idx[label])

        if i % 500 == 0:
            print(f"Processed {i} images...")

    X_tensor = torch.stack(X)
    y_tensor = torch.tensor(y, dtype=torch.long)

    print(f"Saving torch dataset to {OUTPUT_TRAIN_FILE}...")
    torch.save((X_tensor, y_tensor), OUTPUT_TRAIN_FILE)

    print(f"Done. Total usable images: {X_tensor.shape[0]}")
    return X_tensor, y_tensor, class_names

This function gives us a fully preprocessed training dataset: compact, clean, and ready for use. Once saved, this .pt file can be loaded with a single torch.load call in your model training script, bypassing the need to reprocess images every time you experiment.
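The entry point below also runs a test-set counterpart. The test images must be preprocessed the same way, minus label encoding, and minus blur filtering, because we still need a prediction for every test image. Here is a minimal sketch of preprocess_test_set(); the OUTPUT_TEST_FILE path and the ID-based ordering are assumptions chosen to match the directory listing shown later:

OUTPUT_TEST_FILE = Path("processed", "prepared_test_dataset.pt")  # assumed output path

def preprocess_test_set():
    """Sketch: preprocess the unlabeled test images the same way, minus labels."""
    paths = sorted(TEST_DIR.glob("*.png"), key=lambda p: int(p.stem))
    total = len(paths)
    print(f"Found {total} test images.")

    X, ids = [], []
    for i, path in enumerate(paths, start=1):
        if not is_image_valid(str(path)):
            print(f"[{i}/{total}] Skipping corrupted: {path}")
            continue

        img = cv2.imread(str(path))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = pad_and_resize(img)
        img = img.astype(np.float32) / 255.0
        img = np.transpose(img, (2, 0, 1))  # Convert HWC to CHW
        X.append(torch.tensor(img))
        ids.append(int(path.stem))  # Keep image IDs for submission formatting

        if i % 500 == 0:
            print(f"Processed {i} images...")

    X_tensor = torch.stack(X)
    ids_tensor = torch.tensor(ids, dtype=torch.long)
    print(f"Saving torch dataset to {OUTPUT_TEST_FILE}...")
    torch.save((X_tensor, ids_tensor), OUTPUT_TEST_FILE)
    print(f"Done. Total usable test images: {X_tensor.shape[0]}")
    return X_tensor, ids_tensor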

Running the Script

To run both preprocessing functions directly from the command line, we’ll add the following entry point to the bottom of our script:

if __name__ == "__main__":
    preprocess_train_set()
    preprocess_test_set()

Before we run the full preprocessing script, let’s make sure we have a clean directory in place to store our output files. We’ll use a folder called processed/ to hold the serialized .pt datasets we generate, one for the training data and one for the test set.

From the root of your project directory, create the folder using your terminal:

mkdir processed

Now your project directory should look something like this:

deepthought-image-classifier/
├── .venv/                     # Python virtual environment
├── cifar-10/
│   ├── train/
│   ├── test/
│   ├── trainLabels.csv
│   └── sampleSubmission.csv
├── processed/
│   ├── prepared_test_dataset.pt   # This will appear after running the script
│   └── prepared_train_dataset.pt  # This will appear after running the script
├── preprocessing.py
└── requirements.txt
With everything in place, you’re ready to run the full preprocessing pipeline:

python preprocessing.py

After it runs, you should see progress messages printed to the terminal and two new files named prepared_train_dataset.pt and prepared_test_dataset.pt in your processed/ directory. These files will be our starting point in the final phase of this part of the series: visually verifying that our images look correct.


Visual Inspection of Preprocessed Images

With our dataset now cleaned, padded, resized, and saved in a convenient serialized format, the next step is to visually confirm that the preprocessing actually worked.

This step is often overlooked, but it’s crucial. Models are sensitive to input shape and format, and a small mistake in preprocessing, like mixing up color channels or truncating dimensions, can silently derail your training process. So, before we dive into model development, we’ll take a moment to visualize a random sample of the cleaned images to make sure they appear as expected.

Let’s write a short script (or Jupyter cell, if you’re using a notebook) to load our .pt dataset, unpack a few sample images, and render them with their corresponding labels.

This is also a great chance to verify that:

  • The images are correctly sized to 32×32 pixels
  • The color channels look natural (i.e., no weird tints from BGR-to-RGB issues)
  • The label encoding correctly maps back to the original class names

Loading the .pt File and Visualizing a Sample

Here’s a standalone Python snippet that you can use in a new file (visualize.py) or in your notebook after preprocessing:

import torch
import numpy as np
import matplotlib.pyplot as plt

# Load the prepared training dataset (a tuple of tensors saved with torch.save)
X_train, y_train = torch.load("processed/prepared_train_dataset.pt")

# Class names in sorted order, matching the integer encoding from preprocessing
class_names = [
    'airplane', 'automobile', 'bird', 'cat', 'deer',
    'dog', 'frog', 'horse', 'ship', 'truck'
]

# Display 12 random images
plt.figure(figsize=(10, 6))
for i in range(12):
    idx = np.random.randint(0, len(X_train))
    img = X_train[idx].permute(1, 2, 0).numpy()  # CHW -> HWC for matplotlib
    label = int(y_train[idx])
    class_name = class_names[label]

    plt.subplot(3, 4, i + 1)
    plt.imshow(img)
    plt.title(class_name)
    plt.axis('off')

plt.tight_layout()
plt.show()

What to Look For

When the visualization renders:

  • Make sure the images look clear and not distorted
  • Check that each image appears to belong to the correct class label
  • Verify that the aspect ratio was preserved (thanks to padding)
  • Confirm that nothing appears overly dark, washed out, or color-inverted

This visual feedback acts as a sanity check before we invest compute time into training. If anything looks suspicious, now’s the time to go back and revisit the preprocessing steps.

Why do these images look so blurry?

CIFAR-10 images are only 32×32 pixels in size — that’s tiny. When we visualize them on a standard display, they appear pixelated or blurry simply because of how few pixels they contain.


This is not an error or an issue with our preprocessing pipeline. The dataset was designed to test small models and enable fast experimentation. When working with these images, it's more important to ensure that:

  • The image content is present and recognizable, even if blurry
  • The image dimensions are correct (32×32)
  • Padding and resizing worked without distortion
  • The labels still match the image file

Wrapping Up: A Clean Dataset, Ready for Modeling

At this point in our journey, we’ve accomplished something foundational but vital: we’ve taken a raw, unprocessed image dataset and carefully transformed it into something structured, clean, and machine-learning ready.

We started by downloading the CIFAR-10 dataset, which arrived in a rather barebones format, just folders of images and a CSV file mapping image IDs to class labels. From there, we built a preprocessing pipeline that not only read and validated the data, but also addressed common quality issues like unreadable files and excessive blurriness.

To ensure consistency, we padded each image to preserve its aspect ratio and resized everything to 32×32 pixels, a standard input shape for convolutional neural networks. We then encoded the class labels as integers, allowing our models to interpret the categories numerically. Finally, we saved our work as serialized .pt tensor files for both the training and test sets.

To verify that our preprocessing worked as expected, we also wrote a short visualization script that renders a grid of sample images with their decoded class labels. This visual check is essential before launching into model training; after all, your model can only learn what it sees, and what it sees comes directly from the pipeline you’ve written.

Here’s what our current project structure looks like:

deepthought-image-classifier/
├── cifar-10/
│   ├── train/
│   ├── test/
│   ├── trainLabels.csv
│   └── sampleSubmission.csv
├── processed/
│   ├── prepared_train_dataset.pt
│   └── prepared_test_dataset.pt
├── preprocessing.py
├── visualize.py
├── requirements.txt
└── .venv/

Next Steps

In the next installment of this series, we’ll shift our focus from data preparation to model development. You’ll learn how to build and train a simple convolutional neural network (CNN) that can accurately classify CIFAR-10 images into one of ten categories.

We’ll also discuss the logic behind the architecture we choose, explain key training parameters like learning rate and batch size, and visualize the model’s performance across training and validation sets.

Later on in the series, we’ll build toward deploying this model in a real-world setting, designing a FastAPI-based interface that lets end users submit images and receive predictions in real-time.

But for now, take a moment to appreciate what you’ve built: a professional-grade data preprocessing workflow that ensures your input data is clean, validated, normalized, and ready for the challenges ahead.

If you’re following along with your own project, commit your progress, push your code to version control, and get ready — the fun part begins in Part 2: Building a CNN Model from Scratch.

Aaron Mathis

Software engineer specializing in cloud development, AI/ML, and modern web technologies. Passionate about building scalable solutions and sharing knowledge with the developer community.