Merging Datasets in Python using Numpy, Pandas

When working with large datasets, particularly those that cannot fit entirely into memory, efficiently merging and sorting files is a critical challenge. In this post, we’ll code three different methods to merge two large CSV files into a third, ensuring each row is sorted according to specific criteria. We’ll cover approaches using Python’s built-in capabilities, as well as the NumPy and Pandas data-analysis libraries, for both in-memory and chunk-based processing.

Problem Statement

You have two large CSV files (file1.csv, file2.csv), each with 10,000 rows of 100 sorted non-negative integers per row, that need pre-processing. Merge the contents of the two files into a new third file, “result.csv”, such that it has 10,000 rows and each row in the result file has 200 sorted non-negative integers. Also use a “stable sort algorithm”, so that if the same number appears at corresponding positions in both files, the number from file1.csv is written first. Only each row needs to be sorted, not the whole file.
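For example, if a row in file1.csv were 1, 3, 5 and the corresponding row in file2.csv were 2, 3, 4 (rows shortened to three values for illustration), the corresponding row in result.csv would be 1, 2, 3, 3, 4, 5, with the 3 coming from file1.csv written before the 3 from file2.csv.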

Note: We’ll be using Merge Sort (discussed here) as the stable sorting algorithm for this problem.

Approach I (Using Python’s Built-In Libraries with Generators)

The script performs the merge and sort operation row-by-row, adhering to the requirement that only the rows need to be sorted and merged, not the entire file.

We’ve already discussed Python Generators in this post – Generators in Python

import csv

"""


A generator is a special type of function that returns an iterator. Unlike a regular function 
that returns a single value, a generator yields a sequence of values one at a time. 
This makes them memory-efficient for iterating over large data sets without having 
to load the entire data structure in memory at once.

"""


def csv_reader(file):
    """Generator that reads integers from a CSV file one at a time."""

    for row in csv.reader(file):
        yield row


def merge_rows(row1, row2):
    """Merge two sorted rows into one sorted row; on ties, the value from row1 comes first (stable)."""
    merged_row = []
    i, j = 0, 0

    while i < len(row1) and j < len(row2):
        if int(row1[i]) <= int(row2[j]):
            merged_row.append(row1[i])
            i += 1
        else:
            merged_row.append(row2[j])
            j += 1

    # Append remaining elements, if any
    while i < len(row1):
        merged_row.append(row1[i])
        i += 1

    while j < len(row2):
        merged_row.append(row2[j])
        j += 1

    return merged_row


def merge_csv_files(file1, file2, result_file):
    with open(file1, 'r') as f1, open(file2, 'r') as f2, open(result_file, 'w', newline='') as fr:
        reader1, reader2 = csv_reader(f1), csv_reader(f2)
        writer = csv.writer(fr)

        # Merge loop
        for row1, row2 in zip(reader1, reader2):
            merged_row = merge_rows(row1, row2)
            writer.writerow(merged_row)


if __name__ == "__main__":
    merge_csv_files('file1.csv', 'file2.csv', 'result.csv')
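A quick sanity check of merge_rows with two short, made-up rows (values chosen only for illustration) shows the stable behavior: the '3' from the first row is placed before the '3' from the second row.

print(merge_rows(['1', '3', '5'], ['2', '3', '4']))
# Output: ['1', '2', '3', '3', '4', '5']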

Important points:

  • Low Memory Usage: Only one row from each file is loaded into memory at a time.
  • Efficiency: The generator handles data lazily, processing only as needed without excessive overhead.
  • Complexity: Managing merging logic manually can be error-prone.
  • zip is used to iterate over multiple (two or more) iterables (like lists, tuples) in parallel. It creates an iterator that aggregates elements from each iterable into tuples, based on their corresponding positions, and it stops at the shortest iterable; see the short example after this list.
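A quick, hypothetical illustration of how zip pairs rows by position (the row values here are made up):

rows_a = [['1', '3'], ['5', '7']]
rows_b = [['2', '4'], ['6', '8'], ['9', '10']]

# zip pairs items by position and stops at the shorter iterable,
# so the third row of rows_b is never paired
for row1, row2 in zip(rows_a, rows_b):
    print(row1, row2)
# ['1', '3'] ['2', '4']
# ['5', '7'] ['6', '8']

If the two files could have different numbers of rows, itertools.zip_longest would pair the leftover rows with a fill value instead of dropping them.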

Also read: Efficiently Merging Large Sorted Files in C++

Approach II (Simple Approach Using Numpy and Pandas)

This approach uses Pandas to load both datasets entirely into memory and performs the merge and sort using DataFrame capabilities.

import csv
import numpy as np
import pandas as pd


def merge_csv_files(file1, file2, result_file):
    # Read the CSV files
    df1 = pd.read_csv(file1, header=None)
    df2 = pd.read_csv(file2, header=None)

    # Concatenate DataFrames column-wise (side by side), so file1's values come before file2's in each row
    result_df = pd.concat([df1, df2], axis=1)

    # Get the shape of the DataFrame
    rows, cols = result_df.shape
    print(f"Number of rows: {rows}, Number of columns: {cols}")

    # Sort each row of the DataFrame
    # using a stable sorting algorithm
    result_df = result_df.apply(lambda row: np.sort(row, kind='mergesort'), axis=1)

    # Open the result file and write the formatted rows
    with open(result_file, 'w', newline='') as file:
        writer = csv.writer(file)
        for row in result_df.values:
            writer.writerow(row)


if __name__ == "__main__":
    merge_csv_files('file1.csv', 'file2.csv', 'result.csv')

Output

Number of rows: 10000, Number of columns: 200

Important points:

  • Leverages Pandas and NumPy, which are optimized for fast data manipulation.
  • This method reads the entire datasets into two separate DataFrames.
  • Requires loading the entire dataset into memory, which is not feasible for very large datasets.
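If the data fits in memory, the per-row apply above can also be replaced by a single vectorized NumPy sort over the whole array. Below is a minimal sketch of that variation (assuming the same file names as above); np.sort with kind='stable' sorts each row independently with a stable mergesort-based algorithm, and placing file1's columns first keeps its values ahead on ties.

import numpy as np
import pandas as pd


def merge_csv_files_vectorized(file1, file2, result_file):
    # Read both files; file1's columns are placed first so the stable sort favors them on ties
    df1 = pd.read_csv(file1, header=None)
    df2 = pd.read_csv(file2, header=None)
    merged = np.concatenate([df1.to_numpy(), df2.to_numpy()], axis=1)

    # Sort every row at once along axis=1 using a stable algorithm
    sorted_rows = np.sort(merged, axis=1, kind='stable')

    # Write the sorted rows without an index or header
    pd.DataFrame(sorted_rows).to_csv(result_file, header=False, index=False)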

Approach III (Chunk Processing)

This approach uses Pandas’ chunked reading to handle the data in segments. If the datasets are very large, you can process them in chunks; this is particularly useful when they are too large to fit in memory.

import csv
import numpy as np
import pandas as pd


def merge_and_sort_chunks(file1, file2, result_file, chunk_size=100):
    # Define iterators for reading the CSV files in chunks
    # chunk_size specifies the number of rows to include in each chunk
    reader1 = pd.read_csv(file1, header=None, chunksize=chunk_size)
    reader2 = pd.read_csv(file2, header=None, chunksize=chunk_size)

    # Open the result file once in write mode so data from a previous run is not appended to
    with open(result_file, 'w', newline='') as file:
        writer = csv.writer(file)

        # Process each chunk pair
        for chunk1, chunk2 in zip(reader1, reader2):
            # Concatenate column-wise so file1's values come before file2's in each row
            result_chunk = pd.concat([chunk1, chunk2], axis=1)

            # Sort each row using a stable sorting algorithm
            result_chunk = result_chunk.apply(lambda row: np.sort(row, kind='mergesort'), axis=1)

            # Write the sorted rows of this chunk
            for row in result_chunk.values:
                writer.writerow(row)


if __name__ == "__main__":
    merge_and_sort_chunks('file1.csv', 'file2.csv', 'result.csv')

Important points:

  • This method processes the files in chunks, suitable for very large datasets. Each chunk is read separately from each file.
  • Chunks are concatenated, sorted, and written to the output file one at a time, thus never loading the entire dataset into memory.
  • The chunksize parameter controls how much data (number of rows) is processed at once and can be adjusted based on memory availability; a rough sizing estimate follows this list.
  • Sorted chunks are written to the output CSV file one after another, ensuring that the file is constructed incrementally.
  • Could be slower than in-memory operations due to repeated reading and writing.
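As a rough, hypothetical back-of-the-envelope estimate (assuming the 200 merged values per row are read as 8-byte int64 and ignoring Pandas’ per-object overhead), chunk_size translates to memory roughly as follows:

bytes_per_row = 200 * 8  # 200 int64 values per merged row
for chunk_size in (100, 1000, 10000):
    approx_mb = chunk_size * bytes_per_row / (1024 * 1024)
    print(f"chunk_size={chunk_size}: ~{approx_mb:.2f} MB per merged chunk")
# chunk_size=100: ~0.15 MB per merged chunk
# chunk_size=1000: ~1.53 MB per merged chunk
# chunk_size=10000: ~15.26 MB per merged chunk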

Depending on the size of the data and system memory limitations, you can choose between these approaches for optimal performance and efficiency. Generators offer excellent memory management for large files, while Pandas provides powerful tools for complex data manipulations when memory is less of a concern. For large datasets, chunk processing represents a middle ground, balancing memory use and processing speed.
