How to Use DataFramely to Validate Polars DataFrames

Clean data is essential for any AI or analytics workload. If you use Polars for data cleaning and manipulation, you’ll likely benefit from DataFramely, a DataFrame validation library built for Polars. It lets you define validation models with customizable rules, so your pipelines catch bad records before they reach downstream systems. In this how-to guide, we’ll define a model, validate a DataFrame against it, and filter out the rows that fail.

Getting Started with DataFramely

As with any new library, you’ll need to install it before using it. We recommend using pip.

pip install dataframely

Get the Jupyter Notebook

We’ve provided a Jupyter Notebook for you to follow along with and learn DataFramely. You can find the notebook in this repository: https://github.com/CodeCrewCareers/polarscodeacademy

If you’d like, you can clone the entire repository with the following command.

git clone https://github.com/CodeCrewCareers/polarscodeacademy

Building the Model

We have some request form data that we want to validate before loading it into the database. To do this, we’ll import dataframely as dy and create a class called Request that inherits from dy.Schema. Inside the class, we define each column along with its data type and constraints.

import dataframely as dy

class Request(dy.Schema):
    requestId = dy.Integer(primary_key=True)
    requestor = dy.String(nullable=False)
    requestType = dy.String(regex=r'\b(?:Add|Edit|Update|Delete)\b')
    description = dy.String(nullable=True)
    requestDate = dy.Date(nullable=False)
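
One detail worth checking in the requestType rule: the pattern uses word boundaries (\b) rather than start and end anchors. The sketch below uses Python’s built-in re module as a stand-in for the regex engine that Polars uses, so it only illustrates the general difference between a search and an anchored match. The sample strings and the anchored variant are made up for illustration, and whether DataFramely applies the pattern as a search should be confirmed in its documentation.

```python
import re

# Pattern from the Request model: word boundaries, no ^ or $ anchors.
word_boundary = re.compile(r'\b(?:Add|Edit|Update|Delete)\b')
# Hypothetical stricter variant: the whole value must be one of the four words.
anchored = re.compile(r'^(?:Add|Edit|Update|Delete)$')

# Made-up sample values for illustration only.
samples = ['Add', 'Please Add this', 'Addendum', 'delete']

# Map each sample to (matches with word boundaries, matches with anchors).
results = {
    s: (bool(word_boundary.search(s)), bool(anchored.search(s)))
    for s in samples
}

for sample, (loose, strict) in results.items():
    print(f'{sample!r}: word-boundary={loose}, anchored={strict}')
```

If free-text values such as "Please Add this" should be rejected, the anchored form may be the safer choice.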

Validating the DataFrame

Now that we have the model, we need some data to validate against it. We’ll load our data with Polars’ read_csv function and set try_parse_dates=True so that the requestDate column is read as a date rather than a string.

import polars as pl

df = pl.read_csv('../datasets/requests.csv', try_parse_dates=True)
df

From here, we can call the validate method on our Request model.

Request.validate(df)

In this particular case, validation fails with some errors. We have two options here. The first is to fix the offending records before loading the data into the database. The second is to remove the problematic records and load only the valid ones. We’ll take the second approach here!

Filtering Out Failed Records

DataFramely makes it possible to filter out rows that failed validation. To do this, we’ll use the filter method. This method returns two objects. The first is a DataFrame containing the records that passed validation. The second is a FailureInfo object. On the second object, we can call the invalid method to get a DataFrame with the failed records. We’ll access both with the following code.

successes, failures = Request.filter(df)
print(successes)
print(failures.invalid())

Now the output should give us two DataFrames: one with the successful records and one with the failed records.

Conclusion

DataFramely is an incredible new data validation library for Polars. You’ve already seen how easy it is to use for basic validation. But there is so much more that you can do with custom validation rules. Check out the documentation to learn more!