Sentiment Analysis on Restaurant Reviews – Part 1: Advanced Text Preprocessing with NLTK
Part 1 of our series on sentiment analysis. Learn advanced text preprocessing techniques including emoji handling, SentiWordNet integration, and multi-level TF-IDF vectorization for restaurant review sentiment classification.

Text data is messy, inconsistent, and often filled with nuances that can confuse machine learning models. When building a sentiment analysis system for restaurant reviews, the quality of your preprocessing pipeline directly impacts your model’s ability to understand whether a customer loved their dining experience or left disappointed.
In this tutorial series, we’ll build an advanced sentiment analysis system that goes far beyond basic text cleaning.
This is the first part of a multi-part series where we’ll explore:
- Advanced text preprocessing with NLTK (this post)
- Sentiment analysis using neural networks
- Hyperparameter optimization with Optuna
- Ensemble classifiers for maximum F1 score
- Bonus: Using DistilBERT for transformer-based sentiment analysis
By the end of this series, you’ll have a production-ready sentiment analysis system that combines traditional NLP techniques with modern deep learning approaches. Let’s start by building a robust preprocessing pipeline that handles the complexities of real-world restaurant review data.
Understanding the Dataset and Problem
For this tutorial series, we’ll be working with restaurant reviews from the Restaurant Reviews dataset on Kaggle. This dataset contains thousands of customer reviews with corresponding ratings, making it perfect for supervised sentiment analysis.
Dataset Overview
The restaurant reviews dataset includes several key fields:
- Restaurant: The name of the restaurant being reviewed
- Review: The actual customer review text
- Rating: Numerical rating (typically 1-5 stars)
- Additional metadata: Location, cuisine type, and other restaurant attributes
What makes this dataset particularly interesting for sentiment analysis is the variety of ways customers express their opinions. Reviews might include:
- Emojis and emoticons (😊, 😞, :))
- Slang and informal language (“totally awesome”, “meh”)
- Negations that flip sentiment (“not bad” vs “bad”)
- Misspellings and typos (“delicous”, “restarant”)
- Mixed sentiment within a single review
Our Preprocessing Goals
Our advanced preprocessing pipeline will address each of these challenges:
| Challenge | Solution |
| --- | --- |
| Emoji handling | Convert emojis to text descriptions using the emoji library |
| Text normalization | Lowercase conversion, punctuation removal, and standardization |
| Negation preservation | Keep negation words while removing other stopwords |
| Lemmatization | Reduce words to their base forms using POS-aware lemmatization |
| Sentiment scoring | Extract sentiment features using SentiWordNet |
| Multi-level vectorization | Combine word-level and character-level TF-IDF features |
This approach ensures that our models can capture both the semantic meaning and stylistic patterns in restaurant reviews.
Setting Up the Development Environment
Before we dive into the preprocessing pipeline, let’s establish a clean development environment with all the necessary libraries for advanced natural language processing. If you’d like, you can find all the code for this tutorial at our GitHub repository, but I highly recommend you follow along and create the files yourself.
Creating the Project Structure
Start by creating a new project directory and setting up a virtual environment:
mkdir sentiment-analysis-restaurant-reviews
cd sentiment-analysis-restaurant-reviews
python -m venv .venv
Activate your virtual environment:
Linux/MacOS:
source .venv/bin/activate
Windows:
.\.venv\Scripts\Activate.ps1
Installing Required Dependencies
Our preprocessing pipeline requires several specialized NLP libraries:
pip install pandas numpy scikit-learn nltk emoji tqdm scipy
Let’s break down what each library provides:
- pandas: Data manipulation and CSV handling
- numpy: Numerical operations and array handling
- scikit-learn: TF-IDF vectorization and machine learning utilities
- nltk: Natural language processing toolkit with tokenization, POS tagging, and lemmatization
- emoji: Emoji detection and conversion to text
- tqdm: Progress bars for long-running operations
- scipy: Sparse matrix operations for efficient memory usage
Setting Up NLTK Resources
NLTK requires several language models and corpora. We’ll download these programmatically in our preprocessing script, but you can also download them manually:
import nltk
# Download required NLTK data (the same resources our preprocessing script checks for)
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('sentiwordnet')
nltk.download('averaged_perceptron_tagger_eng')
Project Directory Structure
From the root of our project folder, create the following directory structure:
mkdir -p data/raw data/processed src && \
touch src/preprocessing.py requirements.txt
sentiment-analysis-restaurant-reviews/
├── data/
│ ├── raw/ # Original dataset
│ └── processed/ # Cleaned and processed data
├── src/
│ └── preprocessing.py # Our main preprocessing pipeline
└── requirements.txt # Dependencies
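If you’d like to fill in requirements.txt right away, a minimal unpinned version that simply mirrors the pip install command above (pin versions later if you need reproducibility) looks like this:
pandas
numpy
scikit-learn
nltk
emoji
tqdm
scipy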
Understanding Text Preprocessing Challenges
Before implementing our preprocessing pipeline, it’s important to understand why restaurant reviews present unique challenges for sentiment analysis. Let’s examine some real examples to illustrate the complexity:
Example 1: Emoji and Mixed Sentiment
"The food was amazing 😍 but the service was terrible 😤. Overall okay experience."
This review contains positive sentiment about food, negative sentiment about service, and emojis that provide additional emotional context. A basic preprocessing approach might lose these nuances.
Example 2: Negation and Context
"This place is not bad at all! Actually, it's quite good. Definitely not overpriced."
The phrases “not bad” and “not overpriced” are actually positive statements, but naive preprocessing might classify them as negative due to the presence of “bad” and “overpriced.”
Example 3: Informal Language and Misspellings
"Absolutley amzing! Best pasta I've evr had. Defintely reccomend!!!"
Despite multiple spelling errors, this is clearly a very positive review. Our preprocessing needs to handle these variations while preserving the underlying sentiment.
These examples highlight why we need a sophisticated preprocessing approach that goes beyond simple text cleaning.
Building the Data Loading and Cleaning Pipeline
Our preprocessing journey begins with loading the raw restaurant review data and performing initial cleaning steps. This foundation ensures that we’re working with valid, consistent data before applying more advanced NLP techniques.
Creating the Preprocessor Class
We’ll build our preprocessing pipeline as a reusable class that encapsulates all the necessary steps:
import os
import re
import nltk
import emoji
import pandas as pd
from typing import Tuple, Dict
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords, sentiwordnet as swn, wordnet
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
from scipy import sparse
from tqdm import tqdm
class SentimentDataPreprocessor:
def __init__(self, data_path: str, output_dir: str, n_samples_per_class: int = 1000):
"""
Initialize the SentimentDataPreprocessor.
Args:
data_path: Path to the raw dataset CSV file
output_dir: Directory to save processed files
n_samples_per_class: Number of samples per class for balancing
"""
self.data_path = data_path
self.output_dir = output_dir
self.n_samples_per_class = n_samples_per_class
# Create output directory if it doesn't exist
os.makedirs(self.output_dir, exist_ok=True)
# Download required NLTK resources
self._download_nltk_resources()
# Initialize NLP components
self._initialize_nlp_components()
Downloading NLTK Resources
Our preprocessing pipeline requires several NLTK resources. After first checking to see if the resources already exist (to avoid redownloading them every time the preprocessing pipeline is run), we’ll download them programmatically to ensure they’re available.
Add _download_nltk_resources(self) to our class:
def _download_nltk_resources(self):
"""Download required NLTK resources."""
resources = [
('punkt_tab', 'tokenizers/punkt_tab'),
('stopwords', 'corpora/stopwords'),
('wordnet', 'corpora/wordnet'),
('sentiwordnet', 'corpora/sentiwordnet'),
('averaged_perceptron_tagger_eng', 'taggers/averaged_perceptron_tagger_eng')
]
for resource, path in resources:
try:
nltk.data.find(path)
print(f"NLTK resource '{resource}' already present.")
except LookupError:
print(f"Downloading NLTK resource '{resource}'...")
nltk.download(resource)
Resources Explained
- punkt_tab: Pre-trained sentence boundary data used by NLTK’s Punkt tokenizer to accurately split text into sentences, accounting for abbreviations and punctuation.
- stopwords: List of common non-informative words (like “the”, “is”, “and”) used to remove noise and focus on meaningful tokens during text preprocessing.
- wordnet: Large lexical database of English providing synonyms, definitions, and relationships; widely used for lemmatization and semantic analysis.
- sentiwordnet: Extension of WordNet assigning sentiment polarity scores (positive, negative, objective) to synsets for lexicon-based sentiment analysis.
- averaged_perceptron_tagger_eng: Statistical part-of-speech tagger model assigning grammatical tags (e.g., noun, verb, adjective) to tokens to support linguistic feature extraction.
Initializing NLP Components
Next, we’ll set up the core NLP components that our preprocessing pipeline will use. Add _initialize_nlp_components(self) to our class:
def _initialize_nlp_components(self):
"""Initialize NLP components and configurations."""
# Load English stopwords and preserve negations
stop_words = set(stopwords.words('english'))
negations = {"not", "no", "never", "n't", "won't", "can't", "don't"}
self.stop_words_minus_neg = stop_words - negations
# Initialize lemmatizer
self.lemmatizer = WordNetLemmatizer()
# POS tag mapping for SentiWordNet
self.pos_map = {
'n': wordnet.NOUN, 'v': wordnet.VERB,
'a': wordnet.ADJ, 'r': wordnet.ADV
}
# Initialize vectorizers (will be configured later)
self.word_vectorizer = None
self.char_vectorizer = None
The key insight here is preserving negation words while removing other stopwords. Words like “not,” “never,” and “don’t” are crucial for sentiment analysis because they can completely flip the meaning of a sentence.
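Here’s a small standalone sketch (assuming the NLTK stopwords corpus is already downloaded) showing what we would lose without this exception:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
negations = {"not", "no", "never", "n't", "won't", "can't", "don't"}
tokens = ["this", "place", "is", "not", "bad"]

# Removing every stopword also drops the negation and flips the apparent sentiment
print([t for t in tokens if t not in stop_words])                # ['place', 'bad']
# Removing stopwords minus negations keeps the crucial "not"
print([t for t in tokens if t not in (stop_words - negations)])  # ['place', 'not', 'bad']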
Next, we initialize a WordNet lemmatizer and create a part-of-speech map.
Lemmatization reduces words to their base or dictionary form (lemma), helping unify variations like “running,” “ran,” and “runs” into “run.” This improves consistency and reduces sparsity in the data.
To perform accurate lemmatization, it’s important to provide each word’s part of speech (POS); for example, “book” as a noun versus “book” as a verb. The pos_map dictionary helps map POS tags to WordNet’s expected format, enabling the lemmatizer to select the correct base form for each word.
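As a quick standalone illustration (assuming the WordNet data is downloaded), the same word lemmatizes differently depending on the POS we pass:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# Without a POS tag, the lemmatizer defaults to treating the word as a noun
print(lemmatizer.lemmatize("running"))                # running
# With the correct verb tag, it reduces the word to its base form
print(lemmatizer.lemmatize("running", wordnet.VERB))  # run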
Loading and Initial Cleaning
Now let’s implement the data loading and initial cleaning steps by adding the load_data and create_sentiment_labels functions to our class:
def load_data(self) -> pd.DataFrame:
"""Load the raw dataset from CSV."""
print(f"Loading data from {self.data_path}...")
df = pd.read_csv(self.data_path)
# Drop rows with missing reviews
df = df.dropna(subset=['Review'])
df = df[df['Review'].str.strip() != ""]
print(f"Loaded {len(df)} valid reviews.")
return df
The load_data function does exactly what its name implies: it loads the raw dataset from the CSV file specified by self.data_path. It ensures data quality by dropping rows where the Review field is missing or empty (even if it only contains whitespace). Finally, it prints the number of valid reviews loaded and returns the cleaned DataFrame for further processing.
def create_sentiment_labels(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create binary sentiment labels from ratings."""
# Convert ratings to numeric, handling any non-numeric values
df['Rating_Numeric'] = pd.to_numeric(df['Rating'], errors='coerce')
# Filter to clear positive (4-5) and negative (1-2) ratings
# This removes neutral ratings (3) which can be ambiguous
df = df[(df['Rating_Numeric'] <= 2) | (df['Rating_Numeric'] >= 4)].copy()
# Create binary sentiment labels
df['Sentiment'] = (df['Rating_Numeric'] >= 4).astype(int)
print(f"Created sentiment labels. Distribution:")
print(df['Sentiment'].value_counts())
return df
The create_sentiment_labels function converts the Rating column to numeric values, coercing any non-numeric entries to NaN. It then filters out ambiguous neutral reviews (3-star ratings) to retain only clear positive (4–5) and negative (1–2) examples. Finally, it assigns binary sentiment labels, 1 for positive and 0 for negative, and prints the label distribution to confirm balance before returning the updated DataFrame.
This approach creates clear positive and negative examples by excluding neutral ratings (3 stars), which often contain mixed sentiment that can confuse the model during training.
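To make the mapping concrete, here is a tiny standalone illustration of the same logic on a handful of ratings (the "N/A" entry stands in for any non-numeric value):
import pandas as pd

ratings = pd.Series([5, 4, 3, 2, 1, "N/A"])
numeric = pd.to_numeric(ratings, errors='coerce')  # "N/A" becomes NaN

# Keep only clear positives (>= 4) and negatives (<= 2); 3s and NaN are dropped
keep = (numeric <= 2) | (numeric >= 4)
labels = (numeric[keep] >= 4).astype(int)
print(labels.tolist())  # [1, 1, 0, 0]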
Balancing the Dataset
Imbalanced datasets can lead to biased models. To avoid this common pitfall, we’ll create a balanced dataset by sampling an equal number of positive and negative reviews. Add the balance_dataset function to our class:
def balance_dataset(self, df: pd.DataFrame) -> pd.DataFrame:
"""Balance the dataset by sampling equal numbers of each class."""
positive_reviews = df[df['Sentiment'] == 1].sample(
n=self.n_samples_per_class, random_state=42
)
negative_reviews = df[df['Sentiment'] == 0].sample(
n=self.n_samples_per_class, random_state=42
)
# Combine and shuffle
balanced_df = pd.concat([positive_reviews, negative_reviews])
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)
print(f"Balanced dataset created with {len(balanced_df)} total reviews.")
return balanced_df
Advanced Text Preprocessing Techniques
Now we’ll implement the core text preprocessing functions that handle the complexities of restaurant review language. These techniques go beyond basic cleaning to preserve sentiment-relevant information.
Emoji Handling and Text Normalization
Emojis in restaurant reviews often carry strong sentiment signals. Rather than removing them, we’ll convert them to text descriptions using the emoji library. Then we will normalize the text by converting it to lowercase and removing punctuation. Finally, we will tokenize the text, remove every stopword except the negations, and apply POS-aware lemmatization.
Add the preprocess_text function to our class:
def preprocess_text(self, text: str) -> list:
"""
Advanced text preprocessing pipeline.
Args:
text: Raw review text
Returns:
List of processed tokens
"""
# Convert emojis to text descriptions
text = emoji.demojize(text, delimiters=(" ", " "))
# Convert to lowercase
text = text.lower()
# Remove non-alphabetic characters but preserve spaces
text = re.sub(r'[^a-z\s]', '', text)
# Tokenize the text
tokens = word_tokenize(text)
# Remove stopwords except negations
tokens = [word for word in tokens if word not in self.stop_words_minus_neg]
# Apply POS-aware lemmatization
tokens = self._apply_lemmatization(tokens)
return tokens
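To see what the emoji conversion step does on its own, here is a quick standalone check (the exact description text can vary slightly between emoji library versions):
import emoji

# Each emoji is replaced by its text name, padded with spaces so it tokenizes as a separate word
print(emoji.demojize("The food was amazing 😍", delimiters=(" ", " ")))
# e.g. "The food was amazing  smiling_face_with_heart-eyes "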
POS-Aware Lemmatization
Traditional lemmatization without Part-of-Speech (POS) information can be inaccurate. For example, the WordNet lemmatizer leaves “better” unchanged when it defaults to treating it as a noun, but reduces it to “good” as an adjective and “well” as an adverb. We’ll use POS tagging to improve lemmatization accuracy.
Add the _get_wordnet_pos and _apply_lemmatization functions to our class:
def _get_wordnet_pos(self, treebank_tag: str) -> str:
"""Map NLTK POS tags to WordNet POS tags for lemmatization."""
if treebank_tag.startswith('J'):
return wordnet.ADJ
elif treebank_tag.startswith('V'):
return wordnet.VERB
elif treebank_tag.startswith('R'):
return wordnet.ADV
return wordnet.NOUN # Default to noun
def _apply_lemmatization(self, tokens: list) -> list:
"""Apply POS-aware lemmatization to tokens."""
# Get POS tags for all tokens
pos_tags = pos_tag(tokens)
# Lemmatize each token with its POS tag
lemmatized_tokens = []
for word, tag in pos_tags:
wordnet_pos = self._get_wordnet_pos(tag)
lemmatized_word = self.lemmatizer.lemmatize(word, wordnet_pos)
lemmatized_tokens.append(lemmatized_word)
return lemmatized_tokens
SentiWordNet Integration
SentiWordNet is a lexical resource that extends WordNet by assigning each synset (set of synonyms sharing a meaning) numerical sentiment scores: positivity (pos), negativity (neg), and objectivity (obj). These scores capture nuanced emotional meaning beyond simple word polarity.
In the get_sentiment_score function, we first convert each word’s part-of-speech tag into a format compatible with WordNet using pos_map, then retrieve all possible SentiSynsets for that word and POS.
If no synsets exist (for example, when the word is rare, misspelled, or simply not covered by WordNet), we default to a fully objective score. Otherwise, we calculate the average sentiment scores across all synsets, which helps smooth out ambiguity and polysemy in English words.
def get_sentiment_score(self, word: str, pos_tag: str) -> dict:
"""Get sentiment scores for a word using SentiWordNet."""
try:
wordnet_pos = self.pos_map.get(pos_tag, wordnet.NOUN)
synsets = list(swn.senti_synsets(word, wordnet_pos))
if not synsets:
return {'pos': 0.0, 'neg': 0.0, 'obj': 1.0}
# Average scores across all synsets
pos_score = sum(syn.pos_score() for syn in synsets) / len(synsets)
neg_score = sum(syn.neg_score() for syn in synsets) / len(synsets)
obj_score = sum(syn.obj_score() for syn in synsets) / len(synsets)
return {'pos': pos_score, 'neg': neg_score, 'obj': obj_score}
except Exception:
return {'pos': 0.0, 'neg': 0.0, 'obj': 1.0}
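If you’re curious about the raw scores this method averages over, you can query SentiWordNet directly for a single word (a standalone sketch; the exact values depend on the SentiWordNet corpus):
from nltk.corpus import sentiwordnet as swn

# Inspect every adjective synset of "delicious" and its positive/negative/objective scores
for syn in swn.senti_synsets("delicious", "a"):
    print(syn, syn.pos_score(), syn.neg_score(), syn.obj_score())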
Next, the calculate_review_sentiment_scores function processes an entire review. It tokenizes the review, tags each word with its POS, and computes per-word sentiment scores using get_sentiment_score. It sums these scores across the review and normalizes by the number of content words (nouns, verbs, adjectives, and adverbs), resulting in an overall positivity, negativity, and objectivity profile for the text.
These aggregated scores can be added as additional numerical features to improve model performance, providing a lexicon-based view of sentiment alongside machine-learned signals.
def calculate_review_sentiment_scores(self, review: str) -> dict:
    """Calculate aggregated sentiment scores for an entire review."""
    tokens = word_tokenize(review.lower())
    pos_tags = pos_tag(tokens)
    sentiment_scores = {'pos': 0.0, 'neg': 0.0, 'obj': 0.0}
    word_count = 0
    for word, tag in pos_tags:
        # NLTK tags adjectives as 'JJ*', so map the 'j' prefix to SentiWordNet's 'a'
        tag_prefix = 'a' if tag[0].lower() == 'j' else tag[0].lower()
        if tag_prefix in ['n', 'v', 'a', 'r']:  # Only content words
            scores = self.get_sentiment_score(word, tag_prefix)
            sentiment_scores['pos'] += scores['pos']
            sentiment_scores['neg'] += scores['neg']
            sentiment_scores['obj'] += scores['obj']
            word_count += 1
    # Normalize by word count
    if word_count > 0:
        sentiment_scores = {k: v / word_count for k, v in sentiment_scores.items()}
    return sentiment_scores
Multi-Level TF-IDF Vectorization
Traditional TF-IDF vectorization operates only at the word level, but restaurant reviews often contain misspellings and informal language that word-level features might miss. We’ll implement a multi-level approach that combines word-level and character-level features.
Word-Level TF-IDF Features
Word-level TF-IDF captures the importance of cleaned, lemmatized words. To leverage this, add the create_word_level_features function to our class:
def create_word_level_features(self, processed_reviews: list) -> csr_matrix:
"""Create word-level TF-IDF features from processed text."""
# Join processed tokens back into strings
text_corpus = [' '.join(tokens) for tokens in processed_reviews]
# Configure word-level TF-IDF vectorizer
self.word_vectorizer = TfidfVectorizer(
max_features=5000,
min_df=2, # Ignore terms that appear in fewer than 2 documents
max_df=0.95, # Ignore terms that appear in more than 95% of documents
ngram_range=(1, 2) # Include unigrams and bigrams
)
# Fit and transform the corpus
word_features = self.word_vectorizer.fit_transform(text_corpus)
print(f"Word-level TF-IDF matrix shape: {word_features.shape}")
return word_features
Character-Level N-Gram Features
Character-level n-gram features analyze text at the sub-word level by extracting overlapping sequences of characters (for example, 3-grams: “goo”, “ood”, “od ”, “d f” from “good food”).
This approach is especially effective for handling misspellings (e.g., “goood” vs. “good”), slang (“luv” vs. “love”), creative abbreviations, and user-generated variations often found in reviews and social media. By focusing on small character chunks rather than whole words, the model can learn more diverse patterns and better generalize to noisy, informal text.
These features are typically extracted using a CountVectorizer or TfidfVectorizer configured with analyzer="char" (or "char_wb", which keeps n-grams within word boundaries) and an n-gram range such as (2, 5), allowing the model to capture short and mid-length character patterns across the corpus.
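As a quick sanity check (a minimal standalone sketch, separate from our pipeline class), you can inspect the character n-grams scikit-learn extracts from two spellings of the same word:
from sklearn.feature_extraction.text import TfidfVectorizer

# 'char_wb' pads each word with spaces and builds n-grams only inside word boundaries
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
vec.fit(["good", "goood"])
print(vec.get_feature_names_out())
# The misspelling still shares most of its 3-grams (' go', 'goo', 'ood', 'od ') with "good"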
Add create_char_level_features to our class:
def create_char_level_features(self, raw_reviews: list) -> csr_matrix:
"""Create character-level n-gram features from raw text."""
# Configure character-level TF-IDF vectorizer
self.char_vectorizer = TfidfVectorizer(
analyzer='char_wb', # Character n-grams within word boundaries
ngram_range=(2, 5), # Character n-grams of length 2-5
max_features=2000,
min_df=2
)
# Use raw reviews to capture original spelling patterns
char_features = self.char_vectorizer.fit_transform(raw_reviews)
print(f"Character-level TF-IDF matrix shape: {char_features.shape}")
return char_features
Combining All Features
Finally, we’ll combine word-level TF-IDF, character-level TF-IDF, and SentiWordNet features into a single feature matrix. Add the create_combined_features function to our class:
def create_combined_features(self, df: pd.DataFrame) -> Tuple[csr_matrix, pd.DataFrame]:
"""Create combined feature matrix from all preprocessing steps."""
print("Starting comprehensive feature engineering...")
# Step 1: Preprocess all reviews
print("Processing text...")
tqdm.pandas(desc="Processing reviews")
df['Processed_Tokens'] = df['Review'].progress_apply(self.preprocess_text)
df['Processed_Text'] = df['Processed_Tokens'].apply(lambda x: ' '.join(x))
# Step 2: Calculate SentiWordNet scores
print("Calculating sentiment scores...")
tqdm.pandas(desc="Sentiment scoring")
sentiment_scores = df['Review'].progress_apply(self.calculate_review_sentiment_scores)
df['SWN_Positive'] = [scores['pos'] for scores in sentiment_scores]
df['SWN_Negative'] = [scores['neg'] for scores in sentiment_scores]
df['SWN_Objective'] = [scores['obj'] for scores in sentiment_scores]
# Step 3: Create word-level features
print("Creating word-level TF-IDF features...")
word_features = self.create_word_level_features(df['Processed_Tokens'].tolist())
# Step 4: Create character-level features
print("Creating character-level TF-IDF features...")
char_features = self.create_char_level_features(df['Review'].tolist())
# Step 5: Combine all features
print("Combining all features...")
sentiment_features = df[['SWN_Positive', 'SWN_Negative', 'SWN_Objective']].values
# Stack all features horizontally
combined_features = sparse.hstack([
word_features,
char_features,
sentiment_features
]).tocsr()
print(f"Combined feature matrix shape: {combined_features.shape}")
return combined_features, df
Putting It All Together: The Complete Pipeline
This pipeline brings together a rich combination of text preprocessing and feature engineering techniques to prepare the reviews for effective sentiment analysis. We start by carefully loading and cleaning the data, ensuring we remove noise while preserving key information like negations. By explicitly mapping ratings to binary sentiment labels, we create a clear learning target and avoid ambiguous examples.
On the feature side, we incorporate traditional word-level TF-IDF representations to capture overall content and context, and we add character-level n-grams to handle misspellings, creative spelling, and slang that are common in informal reviews.
Beyond surface-level features, we integrate lemmatization and part-of-speech tagging to reduce words to their canonical forms and align words to their correct syntactic roles, improving feature consistency. Additionally, by including lexicon-derived sentiment scores from SentiWordNet, we inject a semantic layer that captures nuanced sentiment signals at the word and review levels.
In order to bring all of these steps together in our pipeline, create the execute_preprocessing_pipeline and save_processed_data functions and add them to our class:
def execute_preprocessing_pipeline(self) -> Tuple[csr_matrix, Dict[str, TfidfVectorizer], pd.DataFrame]:
"""
Execute the complete preprocessing pipeline.
Returns:
Tuple containing:
- Combined feature matrix (sparse)
- Dictionary of fitted vectorizers
- Processed DataFrame
"""
print("=== Starting Sentiment Analysis Preprocessing Pipeline ===")
# Step 1: Load and clean data
df = self.load_data()
df = self.create_sentiment_labels(df)
df = self.balance_dataset(df)
# Keep only necessary columns
df = df[['Restaurant', 'Review', 'Sentiment']].copy()
# Step 2: Create combined features
feature_matrix, processed_df = self.create_combined_features(df)
# Step 3: Save processed data and models
self.save_processed_data(feature_matrix, processed_df)
# Return components for immediate use
vectorizers = {
'word': self.word_vectorizer,
'char': self.char_vectorizer
}
print("=== Preprocessing Pipeline Complete ===")
return feature_matrix, vectorizers, processed_df
def save_processed_data(self, feature_matrix: csr_matrix, df: pd.DataFrame):
"""Save all processed data and fitted models."""
print("Saving processed data...")
# Save feature matrix
sparse.save_npz(
os.path.join(self.output_dir, "restaurant_review_features.npz"),
feature_matrix
)
# Save vectorizers
with open(os.path.join(self.output_dir, "word_vectorizer.pkl"), "wb") as f:
pickle.dump(self.word_vectorizer, f)
with open(os.path.join(self.output_dir, "char_vectorizer.pkl"), "wb") as f:
pickle.dump(self.char_vectorizer, f)
# Save processed DataFrame
df.to_csv(os.path.join(self.output_dir, "processed_reviews.csv"), index=False)
print("All files saved successfully!")
Using the Preprocessing Pipeline
In later parts of this series, we will initialize the SentimentDataPreprocessor pipeline and use it directly as part of our training workflow, but for now we can call it directly from preprocessing.py to ensure it is correctly preprocessing our dataset.
Below our SentimentDataPreprocessor class, add this code block at the end of the script:
# Example usage
if __name__ == "__main__":
# Initialize the preprocessor
preprocessor = SentimentDataPreprocessor(
data_path="data/raw/restaurant_reviews.csv",
output_dir="data/processed",
n_samples_per_class=1000
)
# Execute the complete pipeline
feature_matrix, vectorizers, processed_df = preprocessor.execute_preprocessing_pipeline()
print(f"Final feature matrix shape: {feature_matrix.shape}")
print(f"Number of processed reviews: {len(processed_df)}")
print(f"Sentiment distribution: {processed_df['Sentiment'].value_counts()}")
Now, let’s give it a test. Run preprocessing.py from our project root directory:
python src/preprocessing.py
Understanding the Output and Next Steps
Our preprocessing pipeline produces several important outputs that will be used in subsequent parts of this series:
Feature Matrix Structure
The combined feature matrix contains three types of features:
- Word-level TF-IDF features (up to 5,000 dimensions): Capture semantic meaning from cleaned, lemmatized text
- Character-level TF-IDF features (up to 2,000 dimensions): Capture spelling patterns and informal language
- SentiWordNet features (3 dimensions): Positive, negative, and objective sentiment scores
This multi-level approach ensures that our models can learn from both high-level semantic patterns and low-level stylistic cues.
Processed DataFrame
The processed DataFrame includes:
- Original review text
- Processed tokens and cleaned text
- SentiWordNet sentiment scores
- Binary sentiment labels
- Restaurant metadata
Saved Models and Data
Our pipeline saves several files for reuse:
- restaurant_review_features.npz: Sparse feature matrix
- word_vectorizer.pkl: Fitted word-level TF-IDF vectorizer
- char_vectorizer.pkl: Fitted character-level TF-IDF vectorizer
- processed_reviews.csv: Complete processed dataset
These files will be loaded in Part 2 when we build our neural network models.
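If you want to verify the saved artifacts right away, a minimal loading sketch that mirrors save_processed_data (run from the project root) looks like this:
import pickle
import pandas as pd
from scipy import sparse

# Load everything the preprocessing pipeline wrote to data/processed/
features = sparse.load_npz("data/processed/restaurant_review_features.npz")
with open("data/processed/word_vectorizer.pkl", "rb") as f:
    word_vectorizer = pickle.load(f)
with open("data/processed/char_vectorizer.pkl", "rb") as f:
    char_vectorizer = pickle.load(f)
df = pd.read_csv("data/processed/processed_reviews.csv")

print(features.shape, len(df))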
Performance Considerations
Our preprocessing pipeline is designed for efficiency:
- Sparse matrices minimize memory usage for high-dimensional features
- Progress bars provide feedback during long-running operations
- Vectorizer persistence allows reuse without retraining
- Balanced sampling prevents class imbalance issues
What’s Coming Next
In Part 2 of this series, we’ll use these carefully engineered features to build and train neural network models for sentiment classification. We’ll explore:
- Feed-forward neural networks for baseline performance
- LSTM networks for capturing sequential patterns in text
- Attention mechanisms for focusing on important words
- Model evaluation using precision, recall, and F1-score
The solid preprocessing foundation we’ve built will enable our models to achieve superior performance on restaurant review sentiment analysis.
Further Reading
To deepen your understanding of the concepts covered in this tutorial, explore these resources:
Natural Language Processing Fundamentals
- NLTK Documentation - Comprehensive guide to the Natural Language Toolkit
- Speech and Language Processing by Jurafsky & Martin - Essential NLP textbook covering tokenization, lemmatization, and sentiment analysis
Text Preprocessing Techniques
- scikit-learn Text Feature Extraction - Detailed documentation on TF-IDF and text vectorization
- Emoji Analysis in Python - Documentation for the emoji library used in our preprocessing pipeline
Sentiment Analysis Research
- SentiWordNet Paper - Original research on the sentiment lexicon we used
- A Comprehensive Survey on Sentiment Analysis - Academic overview of sentiment analysis approaches and challenges
Advanced Preprocessing Techniques
- The Effect of Negation on Sentiment Analysis - Research on handling negation in sentiment analysis
- Character-level Features for Text Classification - Academic paper on character-level n-gram features
Python Libraries and Tools
- pandas Documentation - Essential for data manipulation and CSV handling
- spaCy Advanced NLP - Alternative to NLTK with more advanced preprocessing capabilities
- Hugging Face Tokenizers - Modern tokenization for transformer models
These resources will provide additional context and help you explore variations on the preprocessing techniques we’ve implemented. In the next part of our series, we’ll leverage these foundations to build powerful neural network models for sentiment classification.

Aaron Mathis
Systems administrator and software engineer specializing in cloud development, AI/ML, and modern web technologies. Passionate about building scalable solutions and sharing knowledge with the developer community.