Frame SQL is one of the best features available in Polars and is a considerable step up from how data is queried in libraries like Pandas. Those familiar with Pandas will tell you that querying data takes some getting used to. Rather than using SQL, you must use a “pythonic” syntax that is unique to the Pandas library.
Now, polars does have its own “pythonic” way of selecting and filtering data which, like pandas, also takes time to learn. But Polars also supports Robust SQL syntax in a number of contexts.
This feature resonated with me because I’ve spent a lot of my career in SQL Server. And while Polars may not have all of the same functionalities as SQL Server, it has the most essential and is actually very robust.
With that said, lets show you how to query data on the Frame-level with SQL.
Video Walk Through
If you prefer a more visual experience, I invite you to learn about Frame SQL from the video below. Be sure to like and subscribe to our channel for more great Polars content!
Loading the Data
Let’s start by reading our Marketing Costs Dataset into a dataframe.
import polars as pl
data = pl.read_csv('../../Datasets/Marketing Cost Data/marketing_cost_all.csv',try_parse_dates=True)
data.head()
shape: (5, 2)
┌────────────┬───────────────┐
│ Date ┆ MarketingCost │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═══════════════╡
│ 2023-01-01 ┆ 132.32 │
│ 2023-01-02 ┆ 157.28 │
│ 2023-01-03 ┆ 140.41 │
│ 2023-01-04 ┆ 131.73 │
│ 2023-01-05 ┆ 113.55 │
└────────────┴───────────────┘
A Simple Frame SQL Query
Now this data contains daily marketing costs and what we want to do is take a snapshot of this dataset. We can use the sql method off of our dataframe to begin a query. We’ll start with a very basic SQL query.
data.sql('''SELECT * FROM self WHERE Date BETWEEN '2023-02-01' AND '2023-02-28' ''')
shape: (28, 2)
┌────────────┬───────────────┐
│ Date ┆ MarketingCost │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═══════════════╡
│ 2023-02-01 ┆ 166.14 │
│ 2023-02-02 ┆ 118.42 │
│ 2023-02-03 ┆ 135.27 │
│ 2023-02-04 ┆ 52.82 │
│ 2023-02-05 ┆ 142.65 │
│ … ┆ … │
│ 2023-02-24 ┆ 81.33 │
│ 2023-02-25 ┆ 74.2 │
│ 2023-02-26 ┆ 147.97 │
│ 2023-02-27 ┆ 87.99 │
│ 2023-02-28 ┆ 119.95 │
└────────────┴───────────────┘
In the output, we get the daily marketing costs for February.
Aggregation with Frame SQL
Let’s change our query slightly to do some aggregation.
data.sql('''SELECT SUM(MarketingCost) FROM self WHERE Date BETWEEN '2023-02-01' AND '2023-02-28' ''')
shape: (1, 1)
┌───────────────┐
│ MarketingCost │
│ --- │
│ f64 │
╞═══════════════╡
│ 3356.35 │
└───────────────┘
In our results we get the total marketing costs for February.
Complex Queries
Let’s shoot for a slightly more complex query that includes a CTE.
data.sql('''
WITH CTE AS (
SELECT
STRFTIME(Date, '%Y-%m') as Month
,MarketingCost
FROM self
)
SELECT
Month
,SUM(MarketingCost) as CostPerMonth
FROM CTE
GROUP BY
Month
ORDER BY Month
''')
shape: (12, 2)
┌─────────┬──────────────┐
│ Month ┆ CostPerMonth │
│ --- ┆ --- │
│ str ┆ f64 │
╞═════════╪══════════════╡
│ 2023-01 ┆ 4204.71 │
│ 2023-02 ┆ 3356.35 │
│ 2023-03 ┆ 3468.82 │
│ 2023-04 ┆ 3914.2 │
│ 2023-05 ┆ 4109.69 │
│ … ┆ … │
│ 2023-08 ┆ 3794.84 │
│ 2023-09 ┆ 3774.95 │
│ 2023-10 ┆ 4174.32 │
│ 2023-11 ┆ 3206.25 │
│ 2023-12 ┆ 3789.48 │
└─────────┴──────────────┘
In the results, we now get them monthly totals.
Conclusion
So you can see that Polars has some robust SQL capabilities and if you started in a SQL background this may feel very comfortable and familiar.
Now we’ve covered how SQL can be used on a single DataFrame, but you can use SQL in other ways within polars. Global SQL and Expression SQL are additional topics that may be of interest to you. I highly recommend checking out the Polars documentation for more info on these topics.
With that, we’ll end here and catch up with you at the next one!