Clean data is essential to any reliable pipeline. If you use Polars for data cleaning and manipulation, you may benefit from DataFramely, a DataFrame validation library built for Polars. With it, you define validation models with customizable rules so that bad records are caught before they reach downstream systems. In this how-to guide, we’ll define a model, validate a DataFrame against it, and filter out the records that fail.
Getting Started with DataFramely
As with any new library, you’ll need to install it before using it. We recommend using pip.
pip install dataframely
Get the Jupyter Notebook
We’ve provided a Jupyter Notebook for you to follow along with and learn DataFramely. You can find the notebook in this repository: https://github.com/CodeCrewCareers/polarscodeacademy
If you’d like, you can clone the entire repository with the following command.
git clone https://github.com/CodeCrewCareers/polarscodeacademy
Building the Model
We have some request form data that we want to validate before loading it into the database. We’ll import dataframely as dy and then create a new class called Request that inherits from dy.Schema. Inside the class, we define the columns and their data types.
import dataframely as dy


class Request(dy.Schema):
    # Unique identifier for each request
    requestId = dy.Integer(primary_key=True)
    requestor = dy.String(nullable=False)
    # Must match the pattern of allowed request types
    requestType = dy.String(regex=r'\b(?:Add|Edit|Update|Delete)\b')
    description = dy.String(nullable=True)
    requestDate = dy.Date(nullable=False)
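To see what the requestType pattern accepts, here is a minimal sketch using Python’s built-in re module. It assumes the pattern is applied as an unanchored search, which is how Polars’ str.contains behaves; DataFramely’s exact matching semantics should be confirmed in its documentation. The anchored variant and the sample values are illustrative and are not part of the original model.

```python
import re

# Pattern copied from the Request model
PATTERN = r'\b(?:Add|Edit|Update|Delete)\b'
# Hypothetical stricter variant: the whole value must be one of the four words
ANCHORED = r'^(?:Add|Edit|Update|Delete)$'

samples = ["Add", "Remove", "add", "Please Add and Delete"]

# Unanchored search: any value that contains an allowed word passes
unanchored = {s: re.search(PATTERN, s) is not None for s in samples}
# Anchored search: only an exact allowed word passes
anchored = {s: re.search(ANCHORED, s) is not None for s in samples}

for s in samples:
    print(f"{s!r}: unanchored={unanchored[s]}, anchored={anchored[s]}")
```

If free-text values such as the last sample should be rejected, the anchored form may be the safer choice.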
Validating the DataFrame
Now that we have the model, we need to check some data against it. We’ll load our data with the Polars read_csv function and set try_parse_dates=True so that date columns are parsed as dates rather than left as strings.
import polars as pl
df = pl.read_csv('../datasets/requests.csv', try_parse_dates=True)
df
From here, we can call the validate method on our Request model.
Request.validate(df)
In this particular case, validation fails with errors. We have two options here. The first is to fix the offending records before loading the data into the database. The second is to remove the problematic records and load only the valid ones. We’ll dive into the second one here!
Filtering Out Failed Records
DataFramely makes it possible to filter out rows that failed validation. To do this, we’ll use the filter method on our model. It returns two objects. The first is a DataFrame containing the records that passed validation. The second is a FailureInfo object, whose invalid method returns a DataFrame with the failed records. We’ll access both with the following code.
successes, failures = Request.filter(df)
print(successes)
print(failures.invalid())
Now the output should give us two DataFrames: one with the successful records and one with the failed records.
Conclusion
DataFramely is a promising new data validation library for Polars. You’ve seen how to define a model, validate a DataFrame against it, and separate valid records from invalid ones. There is much more you can do with custom validation rules. Check out the documentation to learn more!