Merging two large sorted CSV files efficiently, without loading their entire contents into memory, is a common problem in data processing, especially when dealing with big data or systems with limited memory. In this post, we will write a C++ solution that tackles this problem by streaming both files and merging them into a single sorted file. The merge is stable: when the same number is due from both files at the same point, the value from the first file is written before the one from the second.
Problem Statement
Given two large CSV files, file_1.csv and file_2.csv, each containing sorted non-negative integers separated by commas, the task is to merge these files into a new file, result.csv. The merged result must also be sorted, and if the same number appears in both files at the same relative position, the number from file_1.csv should appear first in result.csv.
Constraints
- Memory Limitations: Both files are too large to fit entirely in memory, necessitating an approach that reads and processes data in chunks.
- Data Sorting: The merged data must be sorted, requiring an efficient way to combine two sorted lists.
- Stable Merging: If numbers from both files are equal and in the same relative position, the number from file_1.csv must be written before the number from file_2.csv.
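For example, if file_1.csv contained 1,3,7 and file_2.csv contained 3,4,7, the expected result.csv would be 1,3,3,4,7,7, with each tied value (3 and 7) taken from file_1.csv before its counterpart from file_2.csv.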
Generating Random Data
We can use a Python script to create input data for this problem. To generate CSV files containing sorted, non-negative integers, we can use Python's random module to generate the integers and then write these numbers to files in batches. Here's the Python script for this task.
import random

def generate_sorted_data(filename, num_values, batch_size, seed=None):
    """
    Generates a file with sorted non-negative integers in batches.

    Args:
        filename (str): The name of the file to write the data to.
        num_values (int): Total number of integers to generate.
        batch_size (int): Number of integers to write per batch.
        seed (int, optional): Seed for the random number generator.
    """
    random.seed(seed)
    current = 0
    range_start = 1
    range_end = batch_size * 10
    with open(filename, 'w', newline='') as file:
        # Use a regular file writer instead of csv.writer
        while current < num_values:
            # Determine the size of the current batch
            this_batch_size = min(batch_size, num_values - current)
            # Generate this batch
            batch = sorted(random.randint(range_start, range_end) for _ in range(this_batch_size))
            # Write this batch to the file, adding commas
            file.write(','.join(map(str, batch)))
            current += this_batch_size
            range_start = range_end + 1
            range_end += batch_size * 10
            # Write a trailing comma only if this is not the last number
            if current < num_values:
                file.write(',')

if __name__ == "__main__":
    # Configuration
    num_values = 100000  # total values to generate in each file
    batch_size = 100  # values per batch

    # Generate file 1
    generate_sorted_data('file_1.csv', num_values, batch_size, seed=42)
    # Generate file 2 with a different seed
    generate_sorted_data('file_2.csv', num_values, batch_size, seed=43)
Note: a few points regarding the above script:
- sorted(): This Python built-in takes an iterable and returns a new list containing all of its items in ascending order. By wrapping the inner generator expression in sorted(), we ensure that the integers generated in each batch are sorted before they are stored in the batch variable.
- random.randint(range_start, range_end): This call generates a random integer between range_start and range_end (inclusive). It runs once for each iteration of the generator expression.
- for _ in range(this_batch_size): This loop iterates this_batch_size times, generating this_batch_size random integers in total. The underscore (_) is the conventional Python name for a loop variable whose value is irrelevant.
- range_start = range_end + 1 and range_end += batch_size * 10: after each batch, the sampling range moves up to a fresh, higher interval, so batches never overlap and the file as a whole stays sorted even though only one batch at a time is sorted in memory.
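Before moving on, it can be useful to sanity-check the generated files. Below is a minimal C++ sketch (not part of the original script; the hard-coded file name and the use of long long are illustrative assumptions) that streams a file one comma-separated token at a time and confirms the values are in ascending order. It uses the same getline-with-delimiter idiom as the merge code further down.

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main() {
    // Illustrative check: verify that file_1.csv is sorted.
    ifstream file("file_1.csv");
    if (!file) {
        cerr << "Could not open file_1.csv" << endl;
        return 1;
    }
    string token;
    long long previous = -1;  // values are non-negative, so -1 is a safe floor
    long long count = 0;
    // Read one comma-separated value at a time; only the current token
    // is ever held in memory.
    while (getline(file, token, ',')) {
        long long value = stoll(token);
        if (value < previous) {
            cout << "Not sorted at value #" << count << endl;
            return 1;
        }
        previous = value;
        ++count;
    }
    cout << "OK: " << count << " values in ascending order" << endl;
    return 0;
}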
Approach
The solution uses the merge step of merge sort, the same idea behind external sorting of data that cannot fit into memory. We use a two-pointer technique: read values sequentially from both files, compare the current numbers, and write the smaller one (or, when equal, the one from file_1.csv first) to the result file. This way, we only keep one value from each file in memory at any time.
Code Implementation
//
// main.cpp
// Efficiently Merging Large Sorted Files
//
// Created on 13/05/24.
//

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

void merge_sorted_files(const string& path1, const string& path2, const string& outputPath) {
    ifstream file1(path1);
    ifstream file2(path2);
    ofstream resultFile(outputPath);

    if (!file1 || !file2 || !resultFile) {
        cerr << "Error: could not open input or output files." << endl;
        return;
    }

    string value1, value2;

    // getline(file1, value1, ',') reads data from the file stream directly
    // into the string variable value1; reading stops when the ',' delimiter
    // is encountered. Reading one token at a time keeps only a single value
    // per file in memory. (Reading a whole line first would pull the entire
    // file into memory here, since each file is one long comma-separated line.)
    bool moreData1 = (bool) getline(file1, value1, ',');
    bool moreData2 = (bool) getline(file2, value2, ',');

    while (moreData1 && moreData2) {
        // '<=' keeps the merge stable: on ties, file_1.csv's value goes first.
        if (stoi(value1) <= stoi(value2)) {
            resultFile << value1;
            moreData1 = (bool) getline(file1, value1, ',');
        } else {
            resultFile << value2;
            moreData2 = (bool) getline(file2, value2, ',');
        }
        if (moreData1 || moreData2) {
            resultFile << ',';
        }
    }

    // Push remaining data from whichever file still has values
    while (moreData1) {
        resultFile << value1;
        moreData1 = (bool) getline(file1, value1, ',');
        if (moreData1) {
            resultFile << ',';
        }
    }
    while (moreData2) {
        resultFile << value2;
        moreData2 = (bool) getline(file2, value2, ',');
        if (moreData2) {
            resultFile << ',';
        }
    }

    file1.close();
    file2.close();
    resultFile.close();
}

int main() {
    merge_sorted_files("file_1.csv", "file_2.csv", "result.csv");
    return 0;
}
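To try the program (assuming g++; any C++11-compliant compiler should work), compile and run it, then inspect result.csv:

g++ -std=c++11 -O2 main.cpp -o merge
./merge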
When working with large data, it is crucial to consider how the data is read and how much memory is used. The code above merges two large sorted CSV files into a single sorted output file in a single pass, in time proportional to the total number of values, while holding only one value from each file in memory and preserving stability on ties. This approach is particularly useful in environments where memory is limited and data processing needs to be optimized for performance and resource usage.