Timsort is a hybrid sorting algorithm designed to take advantage of real-world data patterns to improve performance. In this blog post, we’ll explore Timsort in detail, from its history to how it works, and we’ll implement it in C++.
What is Timsort?
Timsort is a hybrid sorting algorithm that combines the concepts of Merge Sort and Insertion Sort. It was developed by Tim Peters in 2002 and was originally designed for the Python programming language. The algorithm is now widely used as the default sorting algorithm in several major programming languages, including Python and Java.
The key idea behind Timsort is to efficiently handle real-world data that is often partially sorted by identifying natural runs (already sorted sequences) in the data. It then uses Insertion Sort to handle small portions of the data and Merge Sort to efficiently merge larger sections.
Important features:
- Optimized for real-world data patterns.
- Takes advantage of natural runs in data.
- Hybrid approach using Insertion Sort for sorting subarrays and Merge Sort for merging them.
Insertion Sort and Merge Sort have been discussed in detail in earlier posts.
Timsort Algorithm
Timsort is designed to be highly efficient in practice by exploiting data that may already have sorted segments. It works by dividing the array into small blocks called runs, which are then sorted using Insertion Sort. Afterward, these sorted runs are merged together using a modified version of Merge Sort.
Steps involved
- Divide the Array into Runs: The array is divided into small blocks of size RUN. The optimal value for RUN is typically 32 or 64, depending on the size of the array.
- Sort Each Run Using Insertion Sort: Since Insertion Sort is efficient for small datasets, each individual run is sorted using Insertion Sort.
- Merge Runs Using Merge Sort: After all runs are sorted, they are merged using a process similar to Merge Sort. The merging process ensures that the entire array becomes sorted.
Pseudocode
This pseudocode of the timsort() method outlines how the array is divided, sorted, and merged:
timsort(arr[], n)
    minRun = calculate_min_run(n)

    # Sort individual runs using Insertion Sort
    for i = 0 to n-1 in steps of minRun do
        insertion_sort(arr, i, min(i + minRun - 1, n-1))

    # Start merging sorted runs
    size = minRun
    while size < n do
        for left = 0 to n-1 in steps of 2*size do
            mid = left + size - 1
            right = min(left + 2*size - 1, n-1)
            if mid < right:
                merge(arr, left, mid, right)
        size = 2*size
Here are a few things that can be observed from the above pseudocode:
- minRun is calculated dynamically based on the array size. Full implementations choose it (typically between 32 and 64) so that the number of runs, n/minRun, is equal to or slightly less than a power of 2. For simplicity, it can also be defined as a constant value between 32 and 64.
- Runs of size minRun are sorted using Insertion Sort.
- Merge Sort is used to merge adjacent runs until the entire array is sorted.
Code Implementation
//
// main.cpp
// Timsort
//
// Created by Himanshu on 08/10/24.
//
#include <iostream>
#include <vector>
#include <algorithm>
#include <cmath>
using namespace std;

// This function sorts arr[left..right] using insertion sort
void insertionSort(vector<int>& arr, int left, int right) {
    for (int i = left + 1; i <= right; i++) {
        int temp = arr[i];
        int j = i - 1;
        while (j >= left && arr[j] > temp) {
            arr[j + 1] = arr[j];
            j--;
        }
        arr[j + 1] = temp;
    }
}

// This function merges two sorted parts of the array
void merge(vector<int>& arr, int left, int mid, int right) {
    int len1 = mid - left + 1, len2 = right - mid;
    vector<int> leftArr(len1), rightArr(len2);
    for (int i = 0; i < len1; i++)
        leftArr[i] = arr[left + i];
    for (int i = 0; i < len2; i++)
        rightArr[i] = arr[mid + 1 + i];

    int i = 0, j = 0, k = left;
    while (i < len1 && j < len2) {
        if (leftArr[i] <= rightArr[j]) {
            arr[k++] = leftArr[i++];
        } else {
            arr[k++] = rightArr[j++];
        }
    }
    while (i < len1)
        arr[k++] = leftArr[i++];
    while (j < len2)
        arr[k++] = rightArr[j++];
}

// Timsort implementation
void timsort(vector<int>& arr, int n) {
    int minRun = 32;

    // Sort individual subarrays of size minRun using insertion sort
    for (int i = 0; i < n; i += minRun) {
        insertionSort(arr, i, min((i + minRun - 1), (n - 1)));
    }

    // Start merging from size minRun
    for (int size = minRun; size < n; size = 2 * size) {
        for (int left = 0; left < n; left += 2 * size) {
            int mid = left + size - 1;
            int right = min((left + 2 * size - 1), (n - 1));
            if (mid < right)
                merge(arr, left, mid, right);
        }
    }
}

int main() {
    vector<int> arr = {5, 21, 7, 23, 19, 12, 16, 15, 18, 9};
    int n = (int) arr.size();

    cout << "Original array:\n";
    for (int x : arr) cout << x << " ";
    cout << endl;

    timsort(arr, n);

    cout << "\nSorted array using Timsort:\n";
    for (int x : arr) cout << x << " ";
    cout << endl;
    return 0;
}
Output
Original array:
5 21 7 23 19 12 16 15 18 9
Sorted array using Timsort:
5 7 9 12 15 16 18 19 21 23
Explanation of the Code
- Insertion Sort is used for sorting small runs of size minRun. minRun refers to the 'minimum run size'; a 'run', as discussed above, is a small, sorted subsection of the array.
- Runs are sorted using Insertion Sort, and then these sorted runs are merged together using Merge Sort.
- The size of a run is typically chosen between 32 and 64 elements. This is because runs should be large enough to be worth merging (using Merge Sort) but small enough for Insertion Sort to work efficiently.
Time complexity: O(n log n), for both the worst and average case
Auxiliary Space: O(n), for the temporary arrays required to merge the sorted runs
Time Complexity Analysis
Providing a formal proof of Timsort’s time complexity can be a bit complex, as Timsort is an adaptive hybrid algorithm that combines multiple strategies. However, I’ll try to offer a clear, structured explanation using mathematical arguments and high-level reasoning.
Timsort consists of two major parts:
- Identifying & Sorting runs: Timsort first scans through the entire array to identify natural runs, which are segments of already sorted elements. This step takes O(n) time.
- Merging runs: Once the runs are identified, they are merged repeatedly until the entire array is sorted. This is similar to how Merge Sort works, where the data is split and then merged back together. This step has a time complexity of O(n log n) in the worst case.
In the worst case, Timsort behaves similarly to a traditional merge sort. This happens when the array is completely unsorted and lacks any pre-existing order or identifiable “runs.”
Handling Small Runs with Insertion Sort
Insertion Sort is applied to small runs, typically those smaller than a threshold called minRun (between 32 and 64 elements). For small runs, Insertion Sort is very efficient, and the cost of sorting such small segments is approximately linear, especially when those segments are already partially sorted.
Merging Runs
The Merge Sort component of Timsort leads to a worst-case complexity of O(n log n), where n is the size of the array and the log n factor comes from the number of merge levels.
To prove the worst-case time complexity of O(n log n), let's analyze the merging process:
- Suppose there are k runs, each of approximately n/k elements, and these runs need to be merged in pairs.
- Merging two runs of size n/k takes O(n/k) time, but since there are k runs, the total cost of merging all runs in each merging phase is O(n). The merging process continues until the entire array is sorted.
- The number of merging levels (phases) depends on how many runs we have and how they are merged.
- In Merge Sort, the splitting is done until we have runs of size 1, which leads to O(log n) levels.
- In Timsort, runs are merged repeatedly in a similar way. However, with k initial runs, the number of merging levels is log(k).
- Since k is proportional to the input size n in the worst case, the number of merge phases (levels) in Timsort is O(log n) in the worst case.
- At each merging phase the total cost of merging is O(n), since every element must be processed to merge the runs. Therefore, the total merging cost across all levels is O(n) × O(log n), i.e. O(n log n).
Thus, the worst-case time complexity for Timsort is O(n logn).
In general, the average time complexity of Timsort can be written as O(n + n log k), where k is the number of runs.
Best Case: O(n)
The best-case scenario occurs when the input is already sorted or has many long runs. Suppose the entire input is already a single, naturally sorted run:
- Run Identification: Timsort only needs to do a single pass through the data to identify that the entire array is sorted.
- Merging: Since only one run is found, there is no need for further merging.
Thus, Timsort only needs O(n) to identify the sorted run and no further merging is required. Therefore, the best-case time complexity is O(n).
Summary
- Worst Case Time Complexity: O(n log n) – This occurs when the input is completely unsorted, requiring full splitting and merging like Merge Sort.
- Best Case Time Complexity: O(n) – This occurs when the input is already sorted or mostly sorted, allowing Timsort to quickly merge runs.
- Average Case Time Complexity: O(n log n) – Typical case for partially sorted or unsorted data.
- Space Complexity: O(n) – Due to the need for temporary arrays during merging.
Timsort is a powerful hybrid sorting algorithm that combines the best of Insertion Sort and Merge Sort to deliver highly efficient performance, especially in real-world scenarios where data is often partially sorted. With a time complexity of O(n log n) in the worst case and O(n) in the best case, it has become the default sorting algorithm for Python, Java, and other programming languages.
Understanding Timsort and its optimizations can help you appreciate why it is used so widely in practice and how it outperforms other basic sorting algorithms in real-world applications.