Efficiently Merging Large Sorted Files

Merging two large sorted CSV files efficiently without loading the entire contents into memory is a common problem in data processing, especially in scenarios dealing with big data or systems with limited memory. In this post, we will code a C++ solution that tackles this problem by leveraging file streaming and efficient sorting algorithms to merge two large sorted files into a single sorted file. This method ensures a ‘stable sort’ where identical numbers at the same positions in both files will have the numbers from the first file appearing before those from the second.

Problem Statement

Given two large CSV files, file_1.csv and file_2.csv, each containing sorted non-negative integers separated by commas, the task is to merge these files into a new file – result.csv. The merged results must also be sorted, and if the same number from both files appears at the same position, the number from file_1.csv should appear first in result.csv.

Constraints
  • Memory Limitations: Both files are too large to fit entirely in memory, necessitating an approach that reads and processes data in chunks.
  • Data Sorting: The merged data must be sorted, requiring an efficient way to combine two sorted lists.
  • Stable Merging: If numbers from both files are equal and in the same relative position, the number from file_1.csv must be written before the number from file_2.csv.
Generating Random Data

We can use a Python script to create input data for this problem. To generate CSV files containing sorted, non-negative integers we can use Python’s random module to generate the integers and then write these numbers to files in batches. Here’s the Python script for this task.

import random


def generate_sorted_data(filename, num_values, batch_size, seed=None):

    """
    Generates a file with sorted non-negative integers in batches.

    Args:
    filename (str): The name of the file to write the data to.
    num_values (int): Total number of integers to generate.
    batch_size (int): Number of integers to write per batch.
    seed (int, optional): Seed for the random number generator.
    """

    random.seed(seed)
    current = 0
    range_start = 1
    range_end = batch_size*10

    with open(filename, 'w', newline='') as file:
        # Use a regular file writer instead of csv.writer
        while current < num_values:
            # Determine the size of the current batch
            this_batch_size = min(batch_size, num_values - current)

            # Generate this batch
            batch = sorted(random.randint(range_start, range_end) for _ in range(this_batch_size))

            # Write this batch to the file, adding commas
            file.write(','.join(map(str, batch)))

            current += this_batch_size

            range_start = range_end + 1
            range_end += batch_size*10

            # Write a trailing comma only if this is not the last number
            if current < num_values:
                file.write(',')


if __name__ == "__main__":
    # Configuration
    num_values = 100000  # total values to generate in each file
    batch_size = 100  # values per batch

    # Generate file 1
    generate_sorted_data('file_1.csv', num_values, batch_size, seed=42)

    # Generate file 2 with a different seed
    generate_sorted_data('file_2.csv', num_values, batch_size, seed=43)

Note: Some points regarding above script:

  • sorted(): This is a Python built-in function that takes an iterable and returns a new list containing all items from the iterable in ascending order.
  • By wrapping the inner generator expression in sorted(), we ensure that the integers generated in this batch are sorted before they are stored in the batch variable.
  • random.randint(range_start, range_end): This function call generates a random integer between range_start and range_end (inclusive). It is called for each iteration of the loop defined by the generator expression.
  • for _ in range(this_batch_size): This loop iterates this_batch_size times, generating a total of this_batch_size random integers. The underscore (_) is used as a variable name when the variable’s value is irrelevant; it’s a common convention in Python to indicate that the loop variable won’t be used.
  • range_end = batch_size*10: expanding the range for distinct set of numbers
Approach

The solution involves using an external Merge sort algorithm for handling data that cannot fit into memory. We’ll use a two-pointer technique, reading data sequentially from both files, comparing the current numbers, and writing the smaller (or equal, from file_1.csv first) number to the result file. This way, we only keep a small portion of each file in memory at any time.

Code Implementation

//
//  main.cpp
//  Efficiently Merging Large Sorted Files
//
//  Created on 13/05/24.
//

#include <fstream>
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
using namespace std;

void merge_sorted_files(const string& path1, const string& path2, const string& outputPath) {
    ifstream file1(path1);
    ifstream file2(path2);
    ofstream resultFile(outputPath);

    string line1, line2;
    getline(file1, line1);
    getline(file2, line2);

    istringstream stream1(line1);
    istringstream stream2(line2);
    string value1, value2;

    // getline(stream1, value1, ',') read data from the input stream (stream1) 
    // into the string variable value1. 
    // The reading stops when the specified delimiter ',' is encountered.
    bool moreData1 = (bool) getline(stream1, value1, ',');
    bool moreData2 = (bool) getline(stream2, value2, ',');

    while (moreData1 && moreData2) {
        if (stoi(value1) <= stoi(value2)) {
            resultFile << value1;
            moreData1 = (bool) getline(stream1, value1, ',');
        } else {
            resultFile << value2;
            moreData2 = (bool) getline(stream2, value2, ',');
        }

        if (moreData1 || moreData2) {
            resultFile << ',';
        }
    }

    // Push remaining data
    while (moreData1) {
        resultFile << value1;
        moreData1 = (bool) getline(stream1, value1, ',');
        if (moreData1) {
            resultFile << ',';
        }
    }

    while (moreData2) {
        resultFile << value2;
        moreData2 = (bool) getline(stream2, value2, ',');
        if (moreData2) {
            resultFile << ',';
        }
    }

    file1.close();
    file2.close();
    resultFile.close();
}

int main() {
    merge_sorted_files("file_1.csv", "file_2.csv", "result.csv");
    return 0;
}

When working with large data, it’s crucial to consider the efficiency of data handling and memory usage. The provided code efficiently merges two large sorted CSV files into a single sorted output file under constraints of memory efficiency and stability in sorting. This method is particularly useful in environments where memory is limited, and data processing needs to be optimized for performance and resource usage.

Leave a Reply

Your email address will not be published. Required fields are marked *