Combining Multiple CSV Files into a Polars DataFrame

Combining multiple CSV files into a single dataset is a common task in data analysis, especially when the data spans multiple time periods or categories. With Polars, this takes only a few lines of code. In this article, I’ll show you how to combine CSV files that share the same schema using Polars, and briefly note how this compares with Pandas.

What is Polars?

Polars is a high-performance DataFrame library, written in Rust with a Python API, designed to handle large datasets quickly. It supports both eager and lazy execution and makes use of multiple CPU cores, which suits tasks such as combining multiple CSV files, whether you’re working with a few thousand rows or millions.

Why Combine CSV Files with Polars?

When dealing with multiple CSV files that have the same schema, the goal is often to combine them into a single DataFrame for easier analysis. The usual approach, reading each file in a loop and concatenating the results, works but takes more code and can be slow on large inputs. Polars can load and concatenate a list of CSV files in a single call, so the whole job fits in a few lines of code.

Getting Started with Polars

Before you can start combining your CSV files, you’ll need to install Polars. This can be done easily with pip:

pip install polars

Once installed, you’re ready to begin working with your CSV files.

Loading and Combining CSV Files with Polars

Let’s dive into a practical example. Suppose you have a series of CSV files containing marketing cost data for different months, and you want to combine them into a single DataFrame.

Here’s how you can do it with Polars:

import polars as pl

# Show up to 100 characters of each string value so file paths are not truncated
pl.Config.set_fmt_str_lengths(100)

filepaths = [
    'marketing_cost_01.csv',
    'marketing_cost_02.csv',
    'marketing_cost_03.csv',
    'marketing_cost_04.csv',
    'marketing_cost_05.csv',
    'marketing_cost_06.csv',
    'marketing_cost_07.csv',
    'marketing_cost_08.csv',
    'marketing_cost_09.csv',
    'marketing_cost_10.csv',
    'marketing_cost_11.csv',
    'marketing_cost_12.csv'
]

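# scan_csv builds a lazy query over all files; collect() reads and concatenates them
data = pl.scan_csv(filepaths, try_parse_dates=True, include_file_paths='path').collect()
data  # in a notebook, this displays the combined DataFrame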
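
Typing out twelve file names by hand is easy to get wrong. As an alternative, here is a minimal sketch that builds the list from the folder contents using only the standard library. The helper name find_monthly_files and the assumption that the files sit in the current folder are mine, not part of the original example:

```python
from pathlib import Path


def find_monthly_files(folder=".", pattern="marketing_cost_*.csv"):
    """Return the matching CSV paths as strings, sorted by name."""
    # Zero-padded month numbers (01, 02, ...) sort correctly as plain text
    return sorted(str(p) for p in Path(folder).glob(pattern))


# Hypothetical usage: look in the current working directory
filepaths = find_monthly_files()
print(filepaths)
```

The resulting list can be passed to pl.scan_csv as before. Polars also accepts a glob pattern string such as 'marketing_cost_*.csv' directly in scan_csv, which may be enough if you do not need the list for anything else.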

Explanation

Setting Column Display Length: pl.Config.set_fmt_str_lengths(100) sets the maximum number of characters shown for string values to 100, so long file paths are not truncated in the DataFrame output.
Filepaths: The filepaths list contains the paths to all the CSV files you want to combine. Since these files have the same schema, their rows can be stacked into a single DataFrame.
Combining the Files: pl.scan_csv takes the whole list and builds a lazy query (a LazyFrame) over all the files; calling .collect() then reads them and concatenates the rows into a single DataFrame. The try_parse_dates=True option tells Polars to try to detect and parse date columns automatically, while include_file_paths='path' adds a column named path that holds the source file for each row.

Because Polars handles both the reading and the concatenation in a single call, this method is concise and fast, which makes it a good fit for combining multiple CSV files with the same schema.
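
For comparison, here is a rough sketch of how the same task might look in Pandas. pd.read_csv reads one file at a time, so the sketch loops over the files, tags each frame with its source path, and concatenates the results. The helper name combine_csvs_pandas, the sample data, and the date column name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csvs_pandas(paths, date_columns=("date",)):
    """Read each CSV, record its source path, and stack the rows."""
    frames = [
        pd.read_csv(p, parse_dates=list(date_columns)).assign(path=str(p))
        for p in paths
    ]
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample data standing in for the monthly files
samples = {
    "marketing_cost_01.csv": "date,channel,cost\n2024-01-15,search,100.5\n2024-01-20,social,80.0\n",
    "marketing_cost_02.csv": "date,channel,cost\n2024-02-10,search,120.0\n",
}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in samples.items():
        file = Path(tmp) / name
        file.write_text(text)
        paths.append(str(file))

    demo_pd = combine_csvs_pandas(paths)

print(demo_pd.dtypes)
print(demo_pd)
```

Note that in this sketch the date columns have to be named explicitly and the source path added by hand, both of which the Polars call handles through its own options.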

Why Use Polars for Combining CSV Files?

Polars is designed for performance, making it an excellent choice for combining large datasets. With Polars, you can streamline your data workflow, reduce the time spent on data preparation, and focus on more critical analysis tasks.

Conclusion

Combining CSV files with the same schema is a breeze with Polars. With just a few lines of code, you can load and concatenate your files, saving time and effort. While Pandas is a powerful tool, Polars offers a more streamlined and efficient approach, especially when working with large datasets.

Thanks for reading! Be sure to check out the video tutorial on Polars Code Academy!