Normalization is an important preprocessing step in data analysis and machine learning workflows. Normalization and scaling techniques standardize the range and distribution of numerical features, making them suitable for modeling and analysis.
In this blog post, we’ll explore various normalization and scaling techniques using Python ML libraries and visualize their effects on different datasets.
We’ve already discussed normalization and its importance in this post:
Normalization and Scaling
Scaling and normalization both preprocess numerical data to make it suitable for machine learning algorithms. They address the issue of features having different scales, but they achieve this in slightly different ways.
Scaling changes the range of the data values. It transforms the data to a new range, often between 0 and 1 or between -1 and 1. It does not change the shape of the distribution; it only adjusts the scale, making it easier to compare features with different units or magnitudes.
Normalization, on the other hand, aims to achieve a standard normal distribution by compressing or stretching the distances between data points, depending on the original distribution. Unlike scaling, normalization changes the distribution and shape of the data.
Normalization is commonly used in machine learning algorithms like K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) to ensure that the features contribute equally to the analysis.
Normalization and Scaling Techniques
Min-Max Scaling
Min-Max scaling rescales features to a specified range, typically between 0 and 1. It subtracts the minimum value of each feature and then divides by the range (maximum – minimum).
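As a minimal sketch, the same formula can be applied by hand with NumPy (x here is a small hypothetical array; in practice you’d use Scikit-learn’s MinMaxScaler, as in the full code below):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical sample values
x_minmax = (x - x.min()) / (x.max() - x.min())  # rescale to the range [0, 1]
print(x_minmax)  # [0.  0.25 0.5  1. ]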
Z-score Normalization (Standardization)
Z-score normalization, also known as standardization, transforms features to have a mean of 0 and a standard deviation of 1. It subtracts the mean of each feature and then divides by the standard deviation.
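A hand-rolled sketch of the same idea, again on a hypothetical array (NumPy’s std, like Scikit-learn’s StandardScaler, uses the population standard deviation by default, so the two agree):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical sample values
x_zscore = (x - x.mean()) / x.std()  # resulting array has mean 0 and standard deviation 1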
Robust Scaling
Robust scaling scales features to the interquartile range (IQR), making it robust to outliers. It subtracts the median of each feature and then divides by the IQR.
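A minimal NumPy sketch of the same computation, mirroring what Scikit-learn’s RobustScaler does with its default 25th–75th percentile range:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical sample values
q1, q3 = np.percentile(x, [25, 75])  # first and third quartiles
x_robust = (x - np.median(x)) / (q3 - q1)  # center on the median, divide by the IQR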
Box-Cox Transformation
The Box-Cox transformation is a power transformation technique that aims to stabilize the variance and make the data more Gaussian-like. It applies a power transformation to each feature to achieve this.
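One way to sketch this is with SciPy’s stats.boxcox, which also estimates the optimal power parameter lambda by maximum likelihood (the input must be strictly positive):

import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical, strictly positive sample values
x_boxcox, fitted_lambda = stats.boxcox(x)  # transformed data and the estimated lambda

The full code below uses Scikit-learn’s PowerTransformer(method='box-cox') instead, which wraps the same transformation.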
Python Libraries for Normalization and Scaling
Let’s understand and visualize normalization and scaling using the following Python ML libraries:
- Scikit-learn: A widely-used library for machine learning tasks.
- Pandas: A powerful data manipulation library that includes convenient normalization and transformation functions.
- Seaborn: A visualization library that we’ll use to plot the effects of normalization and transformation.
Note: You can run the code below in the Python interpreter or any Python IDE, but you’ll need to install the above Python libraries using pip first. I recommend a Python IDE like PyCharm Community Edition; it is free and beginner-friendly.
We’ll explore the different normalization and scaling techniques using synthetic (random) data generated with NumPy, and use a Kernel Density Estimate (KDE) plot to compare the original data distribution with the distributions produced by each of the techniques above.
KDE is a method for visualizing the distribution of observations in a dataset, analogous to a histogram. It represents the data using a continuous probability density curve in one or more dimensions.
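As a minimal standalone illustration (assuming Seaborn and Matplotlib are installed), a KDE plot of some random samples takes just two calls:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(data=np.random.normal(size=1000))  # smooth density estimate of 1,000 samples
plt.show()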
Code Implementation
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer
# Setting the seed to generate same sequence of random numbers
np.random.seed(100)
def generate_numbers_in_range(input_data, min_val, max_val):
    # Clip the numbers to ensure they are within the desired range [min_val, max_val]
    clipped_data = np.clip(input_data, min_val, max_val)
    return clipped_data
def plot_distribution(input_data, title='Kernel Density Estimate Plot'):
    # Wrap the data in a Pandas DataFrame
    df = pd.DataFrame(input_data, columns=['Original'])

    # Min-Max Scaling
    scaler = MinMaxScaler()
    df['Min-Max'] = scaler.fit_transform(df[['Original']])

    # Z-score Normalization
    scaler = StandardScaler()
    df['Z-score'] = scaler.fit_transform(df[['Original']])

    # Robust Scaling
    scaler = RobustScaler()
    df['Robust'] = scaler.fit_transform(df[['Original']])

    # Box-Cox Transformation (can only be applied to strictly positive data)
    if np.all(input_data > 0):
        transformer = PowerTransformer(method='box-cox')
        df['Box-Cox'] = transformer.fit_transform(df[['Original']])

    # Plot the distributions
    ax = sns.kdeplot(data=df)  # Create the KDE plot and capture the Axes object
    ax.set_xlabel('Value')     # Set the label for the x-axis
    ax.set_ylabel('Density')   # Set the label for the y-axis
    ax.set_title(title)        # Set the title of the plot
    plt.show()                 # Display the plot
# Generate random data values
# Exponential Distribution
exponential_data = np.random.exponential(scale=1.0, size=1000)
data = generate_numbers_in_range(exponential_data, 0.1, 2)
plot_distribution(data, 'Exponential Distribution')
# Uniform Distribution
uniform_data = np.random.uniform(low=0.0, high=2.0, size=1000)
data = generate_numbers_in_range(uniform_data, 0.1, 2)
plot_distribution(data, 'Uniform Distribution')
# Log-normal Distribution
log_normal_data = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)
data = generate_numbers_in_range(log_normal_data, 0.1, 2)
plot_distribution(data, 'Log-normal Distribution')
# Poisson Distribution
poisson_data = np.random.poisson(lam=1.0, size=1000)
data = generate_numbers_in_range(poisson_data, 0.1, 2)
plot_distribution(data, 'Poisson Distribution')
# Normal (Gaussian) Distribution
# Shifting the mean (loc) to 3.0 keeps the values positive for the Box-Cox transformation,
# so we clip to [0.1, 6] rather than [0.1, 2] to preserve the bell shape
normal_data = np.random.normal(loc=3.0, scale=1.0, size=1000)
data = generate_numbers_in_range(normal_data, 0.1, 6)
plot_distribution(data, 'Normal Distribution')
Visualizing Normalization and Scaling Effects
The generated plots visualize the distributions of the original data and the data after applying each normalization technique. We can observe how each method reshapes the distribution and scales the data differently.
Dataset with Exponential Distribution
Dataset with Uniform Distribution
Dataset with Log-normal Distribution
Dataset with Poisson Distribution
Dataset with Normal (Gaussian) Distribution
Normalization and scaling are essential steps in data analysis and machine learning. By using Python ML libraries like Scikit-learn and Pandas, we can easily apply different normalization techniques to our datasets. Visualizing the effects of normalization helps us understand how each method transforms the data and choose the most suitable technique for our specific use case.
In this post, we’ve covered commonly used normalization and scaling techniques and their implementation in Python. Experimenting with different normalization methods and understanding their effects on your datasets is essential for building accurate machine learning models.