Read & Write Polars Schemas With This New Python Library

Introduction

A while back I was working on a project that loaded new data into a database. The data came as one of many files that would arrive about once a month. While automating the process in Polars, I ran into a problem with the data types produced by schema inference: it simply didn’t read the file with the data types I needed. This inspired me to write functions that store the correct Polars schema in a JSON file so that it can be reused every time a new file arrives. I placed these functions in a new Python library called polars-extensions.

Managing Polars Schemas

Polars has its own class for schemas, which means that you can access the object and edit it at will. Once you have the schema you want, you can use polars-extensions to save it. Let’s walk through how it’s done!

First you need to install the library. You can do that by running the command below in your terminal.

pip install polars-extensions

With the library installed, let’s open a CSV file.

import polars as pl

data = pl.read_csv('../../Datasets/employees.csv')
data.head(3)

Output

shape: (3, 10)
┌─────────────┬────────────┬───────────┬────────┬───┬───────────────┬────────────┬────────────────────────┬──────────────┐
│ Employee ID ┆ First Name ┆ Last Name ┆ Gender ┆ … ┆ Position      ┆ Salary ($) ┆ Email                  ┆ Phone        │
│ ---         ┆ ---        ┆ ---       ┆ ---    ┆   ┆ ---           ┆ ---        ┆ ---                    ┆ ---          │
│ i64         ┆ str        ┆ str       ┆ str    ┆   ┆ str           ┆ i64        ┆ str                    ┆ str          │
╞═════════════╪════════════╪═══════════╪════════╪═══╪═══════════════╪════════════╪════════════════════════╪══════════════╡
│ 1           ┆ John       ┆ Smith     ┆ Male   ┆ … ┆ Sales Manager ┆ 75000      ┆ john.smith@example.com ┆ 123-456-7890 │
│ 2           ┆ Jane       ┆ Doe       ┆ Female ┆ … ┆ HR Specialist ┆ 60000      ┆ jane.doe@example.com   ┆ 234-567-8901 │
│ 3           ┆ Michael    ┆ Johnson   ┆ Male   ┆ … ┆ IT Manager    ┆ 80000      ┆ michael.j@example.com  ┆ 345-678-9012 │
└─────────────┴────────────┴───────────┴────────┴───┴───────────────┴────────────┴────────────────────────┴──────────────┘

In the output you’ll see our employee dataset displayed. For the most part, it gives us the data types that we want, but we want to change the Salary ($) column to a float because most salaries have a decimal amount. We should also change the Date of Birth column, which is hidden in the truncated display above, to a date type. Let’s fix these!

# Cast Date of Birth to a date and Salary ($) to a float.
data = data.with_columns(
    pl.col("Date of Birth").cast(pl.Date),
    pl.col("Salary ($)").cast(pl.Float64),
)
data.head(3)

Output

shape: (3, 10)
┌────────────┬────────────┬───────────┬────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ Employee   ┆ First Name ┆ Last Name ┆ Gender ┆ … ┆ Position  ┆ Salary    ┆ Email     ┆ Phone     │
│ ID         ┆ ---        ┆ ---       ┆ ---    ┆   ┆ ---       ┆ ($)       ┆ ---       ┆ ---       │
│ ---        ┆ str        ┆ str       ┆ str    ┆   ┆ str       ┆ ---       ┆ str       ┆ str       │
│ i64        ┆            ┆           ┆        ┆   ┆           ┆ f64       ┆           ┆           │
╞════════════╪════════════╪═══════════╪════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 1          ┆ John       ┆ Smith     ┆ Male   ┆ … ┆ Sales     ┆ 75000.0   ┆ john.smit ┆ 123-456-7 │
│            ┆            ┆           ┆        ┆   ┆ Manager   ┆           ┆ h@example ┆ 890       │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ .com      ┆           │
│ 2          ┆ Jane       ┆ Doe       ┆ Female ┆ … ┆ HR Specia ┆ 60000.0   ┆ jane.doe@ ┆ 234-567-8 │
│            ┆            ┆           ┆        ┆   ┆ list      ┆           ┆ example.c ┆ 901       │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ om        ┆           │
│ 3          ┆ Michael    ┆ Johnson   ┆ Male   ┆ … ┆ IT        ┆ 80000.0   ┆ michael.j ┆ 345-678-9 │
│            ┆            ┆           ┆        ┆   ┆ Manager   ┆           ┆ @example. ┆ 012       │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ com       ┆           │
└────────────┴────────────┴───────────┴────────┴───┴───────────┴───────────┴───────────┴───────────┘

Now that we’ve successfully changed our schema, we can save it to a JSON file. This is where we’ll leverage the polars-extensions library.

import polars_extensions as plx

plx.write_schema(data, 'employees_schema.json')

Running this code will generate a file that looks like this:

{
    "Employee ID": "Int64",
    "First Name": "String",
    "Last Name": "String",
    "Gender": "String",
    "Date of Birth": "Date",
    "Department": "String",
    "Position": "String",
    "Salary ($)": "Float64",
    "Email": "String",
    "Phone": "String"
}

Now that we have the schema, we can use it to read the data again and apply the schema to it.

# Load the saved schema and pass it to read_csv instead of inferring the types.
schema = plx.read_schema('employees_schema.json')

data = pl.read_csv('../../Datasets/employees.csv', schema=schema)
data.head(3)

Output

shape: (3, 10)
┌────────────┬────────────┬───────────┬────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ Employee   ┆ First Name ┆ Last Name ┆ Gender ┆ … ┆ Position  ┆ Salary    ┆ Email     ┆ Phone     │
│ ID         ┆ ---        ┆ ---       ┆ ---    ┆   ┆ ---       ┆ ($)       ┆ ---       ┆ ---       │
│ ---        ┆ str        ┆ str       ┆ str    ┆   ┆ str       ┆ ---       ┆ str       ┆ str       │
│ i64        ┆            ┆           ┆        ┆   ┆           ┆ f64       ┆           ┆           │
╞════════════╪════════════╪═══════════╪════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 1          ┆ John       ┆ Smith     ┆ Male   ┆ … ┆ Sales     ┆ 75000.0   ┆ john.smit ┆ 123-456-7 │
│            ┆            ┆           ┆        ┆   ┆ Manager   ┆           ┆ h@example ┆ 890       │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ .com      ┆           │
│ 2          ┆ Jane       ┆ Doe       ┆ Female ┆ … ┆ HR Specia ┆ 60000.0   ┆ jane.doe@ ┆ 234-567-8 │
│            ┆            ┆           ┆        ┆   ┆ list      ┆           ┆ example.c ┆ 901       │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ om        ┆           │
│ 3          ┆ Michael    ┆ Johnson   ┆ Male   ┆ … ┆ IT        ┆ 80000.0   ┆ michael.j ┆ 345-678-9 │
│            ┆            ┆           ┆        ┆   ┆ Manager   ┆           ┆ @example. ┆ 012       │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ com       ┆           │
└────────────┴────────────┴───────────┴────────┴───┴───────────┴───────────┴───────────┴───────────┘

Now when we run the cell, we get our new DataFrame with the correct data types, with no separate casting step. Changing data types in Polars can be a lot of work. One of the benefits of saving the schema to a JSON file is that you can adjust the data types directly in the file!
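
Since the schema file is plain JSON, you can change a type in any text editor, or script the change with the standard json module. The sketch below writes a two-column stand-in for employees_schema.json to a temporary folder and edits one entry. It assumes read_schema accepts any Polars type name written in the same style as the generated file, which I have only shown here for the types that appear in this article.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for employees_schema.json with a subset of the columns.
schema_path = Path(tempfile.mkdtemp()) / "employees_schema.json"
schema_path.write_text(
    json.dumps({"Employee ID": "Int64", "Salary ($)": "Float64"}, indent=4)
)

# Load the mapping, change one type name, and write the file back.
schema = json.loads(schema_path.read_text())
schema["Employee ID"] = "Int32"
schema_path.write_text(json.dumps(schema, indent=4))

# Re-read the file to confirm the change was saved.
updated = json.loads(schema_path.read_text())
```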

Conclusion

polars-extensions provides us with simple but powerful tools for managing schemas. If you found this article insightful, I invite you to share it with a friend. Be sure to check out the other helpful functions in the polars-extensions library!
