
Sentiment Analysis on Restaurant Reviews – Part 1: Advanced Text Preprocessing with NLTK

Part 1 of our series on sentiment analysis. Learn advanced text preprocessing techniques including emoji handling, SentiWordNet integration, and multi-level TF-IDF vectorization for restaurant review sentiment classification.

Aaron Mathis
20 min read

Text data is messy, inconsistent, and often filled with nuances that can confuse machine learning models. When building a sentiment analysis system for restaurant reviews, the quality of your preprocessing pipeline directly impacts your model’s ability to understand whether a customer loved their dining experience or left disappointed.

In this tutorial series, we’ll build an advanced sentiment analysis system that goes far beyond basic text cleaning.

This is the first part of a multi-part series where we’ll explore:

  1. Advanced text preprocessing with NLTK (this post)
  2. Sentiment analysis using neural networks
  3. Hyperparameter optimization with Optuna
  4. Ensemble classifiers for maximum F1 score
  5. Bonus: Using DistilBERT for transformer-based sentiment analysis

By the end of this series, you’ll have a production-ready sentiment analysis system that combines traditional NLP techniques with modern deep learning approaches. Let’s start by building a robust preprocessing pipeline that handles the complexities of real-world restaurant review data.


Understanding the Dataset and Problem

For this tutorial series, we’ll be working with restaurant reviews from the Restaurant Reviews dataset on Kaggle. This dataset contains thousands of customer reviews with corresponding ratings, making it perfect for supervised sentiment analysis.

Dataset Overview

The restaurant reviews dataset includes several key fields:

  • Restaurant: The name of the restaurant being reviewed
  • Review: The actual customer review text
  • Rating: Numerical rating (typically 1-5 stars)
  • Additional metadata: Location, cuisine type, and other restaurant attributes
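
If you’d like a quick peek at the raw file before building anything, a short pandas check works well; the path and column names below assume the Kaggle CSV has been saved as data/raw/restaurant_reviews.csv (we set up that folder later in this post):

import pandas as pd

# Inspect the raw Kaggle export (adjust the path and columns to your download)
df = pd.read_csv("data/raw/restaurant_reviews.csv")
print(df[["Restaurant", "Review", "Rating"]].head())
print(df["Rating"].value_counts())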

What makes this dataset particularly interesting for sentiment analysis is the variety of ways customers express their opinions. Reviews might include:

  • Emojis and emoticons (😊, 😞, :))
  • Slang and informal language (“totally awesome”, “meh”)
  • Negations that flip sentiment (“not bad” vs “bad”)
  • Misspellings and typos (“delicous”, “restarant”)
  • Mixed sentiment within a single review

Our Preprocessing Goals

Our advanced preprocessing pipeline will address each of these challenges:

  • Emoji handling: Convert emojis to text descriptions using the emoji library
  • Text normalization: Lowercase conversion, punctuation removal, and standardization
  • Negation preservation: Keep negation words while removing other stopwords
  • Lemmatization: Reduce words to their base forms using POS-aware lemmatization
  • Sentiment scoring: Extract sentiment features using SentiWordNet
  • Multi-level vectorization: Combine word-level and character-level TF-IDF features

This approach ensures that our models can capture both the semantic meaning and stylistic patterns in restaurant reviews.


Setting Up the Development Environment

Before we dive into the preprocessing pipeline, let’s establish a clean development environment with all the necessary libraries for advanced natural language processing. If you’d like, you can find all the code for this tutorial at our GitHub repository, but I highly recommend you follow along and create the files yourself.

Creating the Project Structure

Start by creating a new project directory and setting up a virtual environment:

mkdir sentiment-analysis-restaurant-reviews
cd sentiment-analysis-restaurant-reviews
python -m venv .venv

Activate your virtual environment:

Linux/MacOS:

source .venv/bin/activate

Windows:

.\.venv\Scripts\Activate.ps1

Installing Required Dependencies

Our preprocessing pipeline requires several specialized NLP libraries:

pip install pandas numpy scikit-learn nltk emoji tqdm scipy

Let’s break down what each library provides:

  • pandas: Data manipulation and CSV handling
  • numpy: Numerical operations and array handling
  • scikit-learn: TF-IDF vectorization and machine learning utilities
  • nltk: Natural language processing toolkit with tokenization, POS tagging, and lemmatization
  • emoji: Emoji detection and conversion to text
  • tqdm: Progress bars for long-running operations
  • scipy: Sparse matrix operations for efficient memory usage

Setting Up NLTK Resources

NLTK requires several language models and corpora. We’ll download these programmatically in our preprocessing script, but you can also download them manually:

import nltk

# Download required NLTK data (the same resources our preprocessing class checks for)
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('sentiwordnet')
nltk.download('averaged_perceptron_tagger_eng')

Project Directory Structure

From the root of our project folder, create the following directory structure:

mkdir -p data/raw data/processed src && \
    touch src/preprocessing.py requirements.txt

This produces the following layout:

sentiment-analysis-restaurant-reviews/
├── data/
│   ├── raw/                    # Original dataset
│   └── processed/              # Cleaned and processed data
├── src/
│   └── preprocessing.py        # Our main preprocessing pipeline
└── requirements.txt            # Dependencies

Understanding Text Preprocessing Challenges

Before implementing our preprocessing pipeline, it’s important to understand why restaurant reviews present unique challenges for sentiment analysis. Let’s examine some real examples to illustrate the complexity:

Example 1: Emoji and Mixed Sentiment

"The food was amazing 😍 but the service was terrible 😤. Overall okay experience."

This review contains positive sentiment about food, negative sentiment about service, and emojis that provide additional emotional context. A basic preprocessing approach might lose these nuances.

Example 2: Negation and Context

"This place is not bad at all! Actually, it's quite good. Definitely not overpriced."

The phrases “not bad” and “not overpriced” are actually positive statements, but naive preprocessing might classify them as negative due to the presence of “bad” and “overpriced.”

Example 3: Informal Language and Misspellings

"Absolutley amzing! Best pasta I've evr had. Defintely reccomend!!!"

Despite multiple spelling errors, this is clearly a very positive review. Our preprocessing needs to handle these variations while preserving the underlying sentiment.

These examples highlight why we need a sophisticated preprocessing approach that goes beyond simple text cleaning.


Building the Data Loading and Cleaning Pipeline

Our preprocessing journey begins with loading the raw restaurant review data and performing initial cleaning steps. This foundation ensures that we’re working with valid, consistent data before applying more advanced NLP techniques.

Creating the Preprocessor Class

We’ll build our preprocessing pipeline as a reusable class that encapsulates all the necessary steps:

import os
import re
import nltk
import emoji
import pandas as pd
from typing import Tuple, Dict
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords, sentiwordnet as swn, wordnet
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
from scipy import sparse
from tqdm import tqdm

class SentimentDataPreprocessor:
    def __init__(self, data_path: str, output_dir: str, n_samples_per_class: int = 1000):
        """
        Initialize the SentimentDataPreprocessor.
        
        Args:
            data_path: Path to the raw dataset CSV file
            output_dir: Directory to save processed files
            n_samples_per_class: Number of samples per class for balancing
        """
        self.data_path = data_path
        self.output_dir = output_dir
        self.n_samples_per_class = n_samples_per_class
        
        # Create output directory if it doesn't exist
        os.makedirs(self.output_dir, exist_ok=True)
        
        # Download required NLTK resources
        self._download_nltk_resources()
        
        # Initialize NLP components
        self._initialize_nlp_components()

Downloading NLTK Resources

Our preprocessing pipeline requires several NLTK resources. After first checking to see if the resources already exist (to avoid redownloading them every time the preprocessing pipeline is run), we’ll download them programmatically to ensure they’re available.

Add _download_nltk_resources(self) to our class:


    def _download_nltk_resources(self):
        """Download required NLTK resources."""
        resources = [
            ('punkt_tab', 'tokenizers/punkt_tab'),
            ('stopwords', 'corpora/stopwords'),
            ('wordnet', 'corpora/wordnet'),
            ('sentiwordnet', 'corpora/sentiwordnet'),
            ('averaged_perceptron_tagger_eng', 'taggers/averaged_perceptron_tagger_eng')
        ]
        
        for resource, path in resources:
            try:
                nltk.data.find(path)
                print(f"NLTK resource '{resource}' already present.")
            except LookupError:
                print(f"Downloading NLTK resource '{resource}'...")
                nltk.download(resource)

Resources Explained

  • punkt_tab: Pre-trained sentence boundary data used by NLTK’s Punkt tokenizer to accurately split text into sentences, accounting for abbreviations and punctuation.
  • stopwords: List of common non-informative words (like “the”, “is”, “and”) used to remove noise and focus on meaningful tokens during text preprocessing.
  • wordnet: Large lexical database of English providing synonyms, definitions, and relationships; widely used for lemmatization and semantic analysis.
  • sentiwordnet: Extension of WordNet assigning sentiment polarity scores (positive, negative, objective) to synsets for lexicon-based sentiment analysis.
  • averaged_perceptron_tagger_eng: Statistical part-of-speech tagger model assigning grammatical tags (e.g., noun, verb, adjective) to tokens to support linguistic feature extraction.

Initializing NLP Components

Next, we’ll set up the core NLP components that our preprocessing pipeline will use. Add _initialize_nlp_components(self) to our class:

    def _initialize_nlp_components(self):
        """Initialize NLP components and configurations."""
        # Load English stopwords and preserve negations
        stop_words = set(stopwords.words('english'))
        negations = {"not", "no", "never", "n't", "won't", "can't", "don't"}
        self.stop_words_minus_neg = stop_words - negations
        
        # Initialize lemmatizer
        self.lemmatizer = WordNetLemmatizer()
        
        # POS tag mapping for SentiWordNet
        self.pos_map = {
            'n': wordnet.NOUN, 'v': wordnet.VERB,
            'a': wordnet.ADJ, 'r': wordnet.ADV
        }
        
        # Initialize vectorizers (will be configured later)
        self.word_vectorizer = None
        self.char_vectorizer = None

The key insight here is preserving negation words while removing other stopwords. Words like “not,” “never,” and “don’t” are crucial for sentiment analysis because they can completely flip the meaning of a sentence.
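
As a quick illustration of why this matters (using NLTK’s standard English stopword list), compare what survives filtering with and without the negations kept:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
negations = {"not", "no", "never", "n't", "won't", "can't", "don't"}

tokens = ["the", "food", "was", "not", "good"]
print([t for t in tokens if t not in stop_words])               # ['food', 'good']
print([t for t in tokens if t not in stop_words - negations])   # ['food', 'not', 'good']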

Next, we initialize a wordnet lemmatizer and create a part-of-speech map.

Lemmatization reduces words to their base or dictionary form (lemma), helping unify variations like “running,” “ran,” and “runs” into “run.” This improves consistency and reduces sparsity in the data.

To perform accurate lemmatization, it’s important to provide each word’s part of speech (POS); for example, “book” as a noun versus “book” as a verb. The pos_map dictionary helps map POS tags to WordNet’s expected format, enabling the lemmatizer to select the correct base form for each word.
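
To see why the POS hint matters, here’s a quick check you can run in a Python session (assuming the WordNet data from earlier is downloaded):

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# Without a POS hint the lemmatizer assumes a noun and leaves verb forms alone
print(lemmatizer.lemmatize("running"))                 # running
print(lemmatizer.lemmatize("running", wordnet.VERB))   # run

# The same surface form can map to different lemmas depending on POS
print(lemmatizer.lemmatize("better", wordnet.ADJ))     # good
print(lemmatizer.lemmatize("better", wordnet.ADV))     # well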

Loading and Initial Cleaning

Now let’s implement the data loading and initial cleaning steps by adding the load_data and create_sentiment_labels functions to our class:

    def load_data(self) -> pd.DataFrame:
        """Load the raw dataset from CSV."""
        print(f"Loading data from {self.data_path}...")
        df = pd.read_csv(self.data_path)
        
        # Drop rows with missing reviews
        df = df.dropna(subset=['Review'])
        df = df[df['Review'].str.strip() != ""]
        
        print(f"Loaded {len(df)} valid reviews.")
        return df

The load_data function does exactly what its name implies: it loads the raw dataset from the CSV file specified by self.data_path. It ensures data quality by dropping rows where the Review field is missing or empty (including reviews that contain only whitespace). Finally, it prints the number of valid reviews loaded and returns the cleaned DataFrame for further processing.

    def create_sentiment_labels(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create binary sentiment labels from ratings."""
        # Convert ratings to numeric, handling any non-numeric values
        df['Rating_Numeric'] = pd.to_numeric(df['Rating'], errors='coerce')
        
        # Filter to clear positive (4-5) and negative (1-2) ratings
        # This removes neutral ratings (3) which can be ambiguous
        df = df[(df['Rating_Numeric'] <= 2) | (df['Rating_Numeric'] >= 4)].copy()
        
        # Create binary sentiment labels
        df['Sentiment'] = (df['Rating_Numeric'] >= 4).astype(int)
        
        print(f"Created sentiment labels. Distribution:")
        print(df['Sentiment'].value_counts())
        
        return df

The create_sentiment_labels function converts the Rating column to numeric values, coercing non-numeric entries to NaN so they are dropped by the rating filter. It then filters out ambiguous neutral reviews (3-star ratings) to retain only clear positive (4–5) and negative (1–2) examples. Finally, it assigns binary sentiment labels, 1 for positive and 0 for negative, and prints the label distribution so you can inspect the class split before returning the updated DataFrame.

This approach creates clear positive and negative examples by excluding neutral ratings (3 stars), which often contain mixed sentiment that can confuse the model during training.

Balancing the Dataset

Imbalanced datasets can lead to biased models. To avoid this common pitfall, we’ll create a balanced dataset by sampling an equal amount of both positive and negative reviews. Add the balance_dataset function to our class:

    def balance_dataset(self, df: pd.DataFrame) -> pd.DataFrame:
        """Balance the dataset by sampling equal numbers of each class."""
        positive_reviews = df[df['Sentiment'] == 1].sample(
            n=self.n_samples_per_class, random_state=42
        )
        negative_reviews = df[df['Sentiment'] == 0].sample(
            n=self.n_samples_per_class, random_state=42
        )
        
        # Combine and shuffle
        balanced_df = pd.concat([positive_reviews, negative_reviews])
        balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)
        
        print(f"Balanced dataset created with {len(balanced_df)} total reviews.")
        return balanced_df

Advanced Text Preprocessing Techniques

Now we’ll implement the core text preprocessing functions that handle the complexities of restaurant review language. These techniques go beyond basic cleaning to preserve sentiment-relevant information.

Emoji Handling and Text Normalization

Emojis in restaurant reviews often carry strong sentiment signals. Rather than removing them, we’ll convert them to text descriptions using the emoji library. Then we will normalize the text by lowercasing it and stripping out non-alphabetic characters. Finally, we will tokenize the text, remove stopwords (while keeping negations), and apply POS-aware lemmatization.
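
To see what the emoji step produces on its own, here’s a quick check (the exact name emitted depends on the installed emoji library version):

import emoji

# Replace each emoji with its text name, padded by spaces instead of colons
print(emoji.demojize("The pasta was amazing 😍", delimiters=(" ", " ")))
# roughly: "The pasta was amazing  smiling_face_with_heart-eyes "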

Add the preprocess_text function to our class:

    def preprocess_text(self, text: str) -> list:
        """
        Advanced text preprocessing pipeline.
        
        Args:
            text: Raw review text
            
        Returns:
            List of processed tokens
        """
        # Convert emojis to text descriptions
        text = emoji.demojize(text, delimiters=(" ", " "))
        
        # Convert to lowercase
        text = text.lower()
        
        # Remove non-alphabetic characters but preserve spaces
        text = re.sub(r'[^a-z\s]', '', text)
        
        # Tokenize the text
        tokens = word_tokenize(text)
        
        # Remove stopwords except negations
        tokens = [word for word in tokens if word not in self.stop_words_minus_neg]
        
        # Apply POS-aware lemmatization
        tokens = self._apply_lemmatization(tokens)
        
        return tokens

POS-Aware Lemmatization

Traditional lemmatization without Part-of-Speech (POS) information can be inaccurate. For example, “better” lemmatizes to “good” as an adjective and to “well” as an adverb, but stays “better” under the lemmatizer’s default noun assumption. We’ll use POS tagging to improve lemmatization accuracy.

Add the _get_wordnet_pos and _apply_lemmatization functions to our class:

    def _get_wordnet_pos(self, treebank_tag: str) -> str:
        """Map NLTK POS tags to WordNet POS tags for lemmatization."""
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN  # Default to noun

    def _apply_lemmatization(self, tokens: list) -> list:
        """Apply POS-aware lemmatization to tokens."""
        # Get POS tags for all tokens
        pos_tags = pos_tag(tokens)
        
        # Lemmatize each token with its POS tag
        lemmatized_tokens = []
        for word, tag in pos_tags:
            wordnet_pos = self._get_wordnet_pos(tag)
            lemmatized_word = self.lemmatizer.lemmatize(word, wordnet_pos)
            lemmatized_tokens.append(lemmatized_word)
        
        return lemmatized_tokens

SentiWordNet Integration

SentiWordNet is a lexical resource that extends WordNet by assigning each synset (set of synonyms sharing a meaning) numerical sentiment scores: positivity (pos), negativity (neg), and objectivity (obj). These scores capture nuanced emotional meaning beyond simple word polarity.

In the function get_sentiment_score, we first convert each word’s part-of-speech tag into a format compatible with WordNet using pos_map. We retrieve all possible SentiSynsets for that word and POS.

If no synsets exist (for example, if the word is misspelled or simply not in WordNet), we default to a fully objective score. Otherwise, we calculate the average sentiment scores across all synsets, which helps smooth out ambiguity and polysemy in English words.

    def get_sentiment_score(self, word: str, pos_tag: str) -> dict:
        """Get sentiment scores for a word using SentiWordNet."""
        try:
            wordnet_pos = self.pos_map.get(pos_tag, wordnet.NOUN)
            synsets = list(swn.senti_synsets(word, wordnet_pos))
            
            if not synsets:
                return {'pos': 0.0, 'neg': 0.0, 'obj': 1.0}
            
            # Average scores across all synsets
            pos_score = sum(syn.pos_score() for syn in synsets) / len(synsets)
            neg_score = sum(syn.neg_score() for syn in synsets) / len(synsets)
            obj_score = sum(syn.obj_score() for syn in synsets) / len(synsets)
            
            return {'pos': pos_score, 'neg': neg_score, 'obj': obj_score}
        
        except Exception:
            return {'pos': 0.0, 'neg': 0.0, 'obj': 1.0}
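
If you want some intuition for these scores, you can query SentiWordNet directly. For example, the first adjective sense of “good” is strongly positive (scores shown are approximate):

from nltk.corpus import sentiwordnet as swn, wordnet

# First adjective synset of "good" -- roughly pos=0.75, neg=0.0, obj=0.25
syn = list(swn.senti_synsets("good", wordnet.ADJ))[0]
print(syn.pos_score(), syn.neg_score(), syn.obj_score())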

Next, the calculate_review_sentiment_scores function processes an entire review. It tokenizes the review, tags each word with its POS, and computes per-word sentiment scores using get_sentiment_score. It sums these scores across the review and normalizes by the number of content words (nouns, verbs, adjectives, and adverbs), resulting in an overall positivity, negativity, and objectivity profile for the text.

These aggregated scores can be added as additional numerical features to improve model performance, providing a lexicon-based view of sentiment alongside machine-learned signals.

    def calculate_review_sentiment_scores(self, review: str) -> dict:
        """Calculate aggregated sentiment scores for an entire review."""
        tokens = word_tokenize(review.lower())
        pos_tags = pos_tag(tokens)
        
        sentiment_scores = {'pos': 0.0, 'neg': 0.0, 'obj': 0.0}
        word_count = 0
        
        for word, tag in pos_tags:
            # Map Penn Treebank tag prefixes to WordNet letters; adjectives are
            # tagged 'JJ*', so 'j' must be translated to WordNet's 'a'
            tag_prefix = tag[0].lower()
            if tag_prefix == 'j':
                tag_prefix = 'a'
            if tag_prefix in ['n', 'v', 'a', 'r']:  # Only content words
                scores = self.get_sentiment_score(word, tag_prefix)
                sentiment_scores['pos'] += scores['pos']
                sentiment_scores['neg'] += scores['neg']
                sentiment_scores['obj'] += scores['obj']
                word_count += 1
        
        # Normalize by word count
        if word_count > 0:
            sentiment_scores = {k: v / word_count for k, v in sentiment_scores.items()}
        
        return sentiment_scores

Multi-Level TF-IDF Vectorization

Traditional TF-IDF vectorization operates only at the word level, but restaurant reviews often contain misspellings and informal language that word-level features might miss. We’ll implement a multi-level approach that combines word-level and character-level features.

Word-Level TF-IDF Features

Word-level TF-IDF captures the importance of cleaned, lemmatized words. To leverage this, add the create_word_level_features function to our class:

    def create_word_level_features(self, processed_reviews: list) -> csr_matrix:
        """Create word-level TF-IDF features from processed text."""
        # Join processed tokens back into strings
        text_corpus = [' '.join(tokens) for tokens in processed_reviews]
        
        # Configure word-level TF-IDF vectorizer
        self.word_vectorizer = TfidfVectorizer(
            max_features=5000,
            min_df=2,  # Ignore terms that appear in fewer than 2 documents
            max_df=0.95,  # Ignore terms that appear in more than 95% of documents
            ngram_range=(1, 2)  # Include unigrams and bigrams
        )
        
        # Fit and transform the corpus
        word_features = self.word_vectorizer.fit_transform(text_corpus)
        print(f"Word-level TF-IDF matrix shape: {word_features.shape}")
        
        return word_features

Character-Level N-Gram Features

Character-level n-gram features analyze text at the sub-word level by extracting overlapping sequences of characters (for example, 3-grams: “goo”, “ood”, “od ”, “d f” from “good food”).

This approach is especially effective for handling misspellings (e.g., “goood” vs. “good”), slang (“luv” vs. “love”), creative abbreviations, and user-generated variations often found in reviews and social media. By focusing on small character chunks rather than whole words, the model can learn more diverse patterns and better generalize to noisy, informal text.

These features are typically extracted using a CountVectorizer or TfidfVectorizer configured with analyzer="char" (or "char_wb" to keep n-grams within word boundaries) and an n-gram range such as (2, 5), allowing the model to capture short and mid-length character patterns across the corpus.
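
For a feel of how character n-grams bridge spelling variants, here’s a tiny standalone check (separate from our class) showing that “good” and “goood” share most of their 3-grams:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: the correct and misspelled forms share most character 3-grams
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
vec.fit(["good food", "goood food"])
print(vec.get_feature_names_out())
# both spellings contribute n-grams such as ' go', 'goo', 'ood', 'od '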

Add create_char_level_features to our class:

    def create_char_level_features(self, raw_reviews: list) -> csr_matrix:
        """Create character-level n-gram features from raw text."""
        # Configure character-level TF-IDF vectorizer
        self.char_vectorizer = TfidfVectorizer(
            analyzer='char_wb',  # Character n-grams within word boundaries
            ngram_range=(2, 5),  # Character n-grams of length 2-5
            max_features=2000,
            min_df=2
        )
        
        # Use raw reviews to capture original spelling patterns
        char_features = self.char_vectorizer.fit_transform(raw_reviews)
        print(f"Character-level TF-IDF matrix shape: {char_features.shape}")
        
        return char_features

Combining All Features

Finally, we’ll combine word-level TF-IDF, character-level TF-IDF, and SentiWordNet features into a single feature matrix. Add the create_combined_features function to our class:

    def create_combined_features(self, df: pd.DataFrame) -> Tuple[csr_matrix, pd.DataFrame]:
        """Create combined feature matrix from all preprocessing steps."""
        print("Starting comprehensive feature engineering...")
        
        # Step 1: Preprocess all reviews
        print("Processing text...")
        tqdm.pandas(desc="Processing reviews")
        df['Processed_Tokens'] = df['Review'].progress_apply(self.preprocess_text)
        df['Processed_Text'] = df['Processed_Tokens'].apply(lambda x: ' '.join(x))
        
        # Step 2: Calculate SentiWordNet scores
        print("Calculating sentiment scores...")
        tqdm.pandas(desc="Sentiment scoring")
        sentiment_scores = df['Review'].progress_apply(self.calculate_review_sentiment_scores)
        df['SWN_Positive'] = [scores['pos'] for scores in sentiment_scores]
        df['SWN_Negative'] = [scores['neg'] for scores in sentiment_scores]
        df['SWN_Objective'] = [scores['obj'] for scores in sentiment_scores]
        
        # Step 3: Create word-level features
        print("Creating word-level TF-IDF features...")
        word_features = self.create_word_level_features(df['Processed_Tokens'].tolist())
        
        # Step 4: Create character-level features
        print("Creating character-level TF-IDF features...")
        char_features = self.create_char_level_features(df['Review'].tolist())
        
        # Step 5: Combine all features
        print("Combining all features...")
        sentiment_features = df[['SWN_Positive', 'SWN_Negative', 'SWN_Objective']].values
        
        # Stack all features horizontally
        combined_features = sparse.hstack([
            word_features,
            char_features,
            sentiment_features
        ]).tocsr()
        
        print(f"Combined feature matrix shape: {combined_features.shape}")
        
        return combined_features, df

Putting It All Together: The Complete Pipeline

This pipeline brings together a rich combination of text preprocessing and feature engineering techniques to prepare the reviews for effective sentiment analysis. We start by carefully loading and cleaning the data, ensuring we remove noise while preserving key information like negations. By explicitly mapping ratings to binary sentiment labels, we create a clear learning target and avoid ambiguous examples.

On the feature side, we incorporate traditional word-level TF-IDF representations to capture overall content and context, and we add character-level n-grams to handle misspellings, creative spelling, and slang that are common in informal reviews.

Beyond surface-level features, we integrate lemmatization and part-of-speech tagging to reduce words to their canonical forms and align words to their correct syntactic roles, improving feature consistency. Additionally, by including lexicon-derived sentiment scores from SentiWordNet, we inject a semantic layer that captures nuanced sentiment signals at the word and review levels.

In order to bring all of these steps together in our pipeline, create the execute_preprocessing_pipeline and save_processed_data functions and add them to our class:

    def execute_preprocessing_pipeline(self) -> Tuple[csr_matrix, Dict[str, TfidfVectorizer], pd.DataFrame]:
        """
        Execute the complete preprocessing pipeline.
        
        Returns:
            Tuple containing:
            - Combined feature matrix (sparse)
            - Dictionary of fitted vectorizers
            - Processed DataFrame
        """
        print("=== Starting Sentiment Analysis Preprocessing Pipeline ===")
        
        # Step 1: Load and clean data
        df = self.load_data()
        df = self.create_sentiment_labels(df)
        df = self.balance_dataset(df)
        
        # Keep only necessary columns
        df = df[['Restaurant', 'Review', 'Sentiment']].copy()
        
        # Step 2: Create combined features
        feature_matrix, processed_df = self.create_combined_features(df)
        
        # Step 3: Save processed data and models
        self.save_processed_data(feature_matrix, processed_df)
        
        # Return components for immediate use
        vectorizers = {
            'word': self.word_vectorizer,
            'char': self.char_vectorizer
        }
        
        print("=== Preprocessing Pipeline Complete ===")
        return feature_matrix, vectorizers, processed_df

    def save_processed_data(self, feature_matrix: csr_matrix, df: pd.DataFrame):
        """Save all processed data and fitted models."""
        print("Saving processed data...")
        
        # Save feature matrix
        sparse.save_npz(
            os.path.join(self.output_dir, "restaurant_review_features.npz"), 
            feature_matrix
        )
        
        # Save vectorizers
        with open(os.path.join(self.output_dir, "word_vectorizer.pkl"), "wb") as f:
            pickle.dump(self.word_vectorizer, f)
        
        with open(os.path.join(self.output_dir, "char_vectorizer.pkl"), "wb") as f:
            pickle.dump(self.char_vectorizer, f)
        
        # Save processed DataFrame
        df.to_csv(os.path.join(self.output_dir, "processed_reviews.csv"), index=False)
        
        print("All files saved successfully!")

Using the Preprocessing Pipeline

In later parts of this series, we will initialize the SentimentDataPreprocessor pipeline and use it directly as part of our training workflow, but for now we can run preprocessing.py on its own to make sure it is correctly preprocessing our dataset.

Below our SentimentDataPreprocessor class, add this code block at the end of the script:

# Example usage
if __name__ == "__main__":
    # Initialize the preprocessor
    preprocessor = SentimentDataPreprocessor(
        data_path="data/raw/restaurant_reviews.csv",
        output_dir="data/processed",
        n_samples_per_class=1000
    )
    
    # Execute the complete pipeline
    feature_matrix, vectorizers, processed_df = preprocessor.execute_preprocessing_pipeline()
    
    print(f"Final feature matrix shape: {feature_matrix.shape}")
    print(f"Number of processed reviews: {len(processed_df)}")
    print(f"Sentiment distribution: {processed_df['Sentiment'].value_counts()}")

Now, let’s give it a test. Run preprocessing.py from our project root directory:

python src/preprocessing.py

Understanding the Output and Next Steps

Our preprocessing pipeline produces several important outputs that will be used in subsequent parts of this series:

Feature Matrix Structure

The combined feature matrix contains three types of features:

  1. Word-level TF-IDF features (up to 5,000 dimensions): Capture semantic meaning from cleaned, lemmatized text
  2. Character-level TF-IDF features (up to 2,000 dimensions): Capture spelling patterns and informal language
  3. SentiWordNet features (3 dimensions): Positive, negative, and objective sentiment scores

This multi-level approach ensures that our models can learn from both high-level semantic patterns and low-level stylistic cues.

Processed DataFrame

The processed DataFrame includes:

  • Original review text
  • Processed tokens and cleaned text
  • SentiWordNet sentiment scores
  • Binary sentiment labels
  • Restaurant metadata

Saved Models and Data

Our pipeline saves several files for reuse:

  • restaurant_review_features.npz: Sparse feature matrix
  • word_vectorizer.pkl: Fitted word-level TF-IDF vectorizer
  • char_vectorizer.pkl: Fitted character-level TF-IDF vectorizer
  • processed_reviews.csv: Complete processed dataset

These files will be loaded in Part 2 when we build our neural network models.
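
As a preview, reloading those artifacts in Part 2 might look something like this (paths assume the default data/processed output directory used above):

import pickle
import pandas as pd
from scipy import sparse

# Reload the saved preprocessing artifacts -- a sketch of what Part 2 will do
feature_matrix = sparse.load_npz("data/processed/restaurant_review_features.npz")
reviews_df = pd.read_csv("data/processed/processed_reviews.csv")

with open("data/processed/word_vectorizer.pkl", "rb") as f:
    word_vectorizer = pickle.load(f)
with open("data/processed/char_vectorizer.pkl", "rb") as f:
    char_vectorizer = pickle.load(f)

print(feature_matrix.shape, len(reviews_df))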

Performance Considerations

Our preprocessing pipeline is designed for efficiency:

  • Sparse matrices minimize memory usage for high-dimensional features
  • Progress bars provide feedback during long-running operations
  • Vectorizer persistence allows reuse without retraining
  • Balanced sampling prevents class imbalance issues

What’s Coming Next

In Part 2 of this series, we’ll use these carefully engineered features to build and train neural network models for sentiment classification. We’ll explore:

  • Feed-forward neural networks for baseline performance
  • LSTM networks for capturing sequential patterns in text
  • Attention mechanisms for focusing on important words
  • Model evaluation using precision, recall, and F1-score

The solid preprocessing foundation we’ve built will enable our models to achieve superior performance on restaurant review sentiment analysis.


Further Reading

To deepen your understanding of the concepts covered in this tutorial, explore these resources:

  • Natural Language Processing Fundamentals
  • Text Preprocessing Techniques
  • Sentiment Analysis Research
  • Advanced Preprocessing Techniques
  • Python Libraries and Tools

These resources will provide additional context and help you explore variations on the preprocessing techniques we’ve implemented. In the next part of our series, we’ll leverage these foundations to build powerful neural network models for sentiment classification.

Aaron Mathis

Systems administrator and software engineer specializing in cloud development, AI/ML, and modern web technologies. Passionate about building scalable solutions and sharing knowledge with the developer community.
