Read CSV Files Using Polars
In this article, we’re going to walk through how to read CSV files into a Polars DataFrame. To do this, we’ll use a function called read_csv. This function accepts a number of arguments that change how the CSV is opened. We’ll keep our first example simple and use the default arguments.
Default Arguments
First, we’ll import polars as pl. Next, we’ll define a new variable called data and set it equal to pl.read_csv. Within the function, we’ll pass the path to our CSV file; in this demonstration, that’s the employees CSV file. Now that we’ve defined our data variable, we want to display our DataFrame. In Jupyter notebooks, you can do that by simply typing the variable name. To display just the first few rows, we’ll call the head function on the DataFrame.
import polars as pl
data = pl.read_csv('employees.csv')
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
When we run the cell, we get our Polars DataFrame displayed in a nice tabular format. We also get the shape of our DataFrame as a tuple that tells us how many rows and columns there are. Note that it only counts the rows that contain data. The first two non-data rows in the DataFrame contain the column names and the Polars data types. Pretty simple, right?
Has Headers
As a data engineer or data analyst, you’re bound to work with CSV files that both do and don’t have column headers in the first row. By default, Polars assumes that headers exist. However, you can tell Polars that the first row does not contain column headers using the has_header argument.
Let’s do that now by once again opening our employees CSV file. This time, we’ll set the has_header argument to False. Then we’ll display the first few rows of our DataFrame by calling the head function.
data = pl.read_csv('employees.csv', has_header=False)
data.head()
shape: (5, 10)
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | str | str | str | str |
| “Employee ID” | “First Name” | “Last Name” | “Gender” | “Date of Birth” | “Department” | “Position” | “Salary ($)” | “Email” | “Phone” |
| “1” | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | “75000” | “john.smith@exa… | “123-456-7890” |
| “2” | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | “60000” | “jane.doe@examp… | “234-567-8901” |
| “3” | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | “80000” | “michael.j@exam… | “345-678-9012” |
| “4” | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | “55000” | “emily.w@exampl… | “456-789-0123” |
When we run the code, the column headers are now in the first row of data and the column names are now generic. This ultimately isn’t what we want for this dataset, but it illustrates how we can adjust our polars DataFrame for csv files without headers.
New Columns
A closely related argument is new_columns. It allows us to overwrite the column headers by passing a list of column names. If you’re working with files without column headers, this is the perfect way to add them for greater clarity. Let’s demonstrate how this is done!
First, we’ll create our list of headers and store it in a new variable called col_names. Once we have that list, we’ll reuse the same code as before, keeping the has_header argument. We’ll then pass our list to the new_columns argument and once again display the first five rows of the output.
col_names = ['EID','FName','LName','Gender','DOB','Dept','Position','Compensation','Email','Phone']
data = pl.read_csv('employees.csv', has_header=False, new_columns=col_names)
data.head()
shape: (5, 10)
| EID | FName | LName | Gender | DOB | Dept | Position | Compensation | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | str | str | str | str |
| “Employee ID” | “First Name” | “Last Name” | “Gender” | “Date of Birth” | “Department” | “Position” | “Salary ($)” | “Email” | “Phone” |
| “1” | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | “75000” | “john.smith@exa… | “123-456-7890” |
| “2” | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | “60000” | “jane.doe@examp… | “234-567-8901” |
| “3” | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | “80000” | “michael.j@exam… | “345-678-9012” |
| “4” | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | “55000” | “emily.w@exampl… | “456-789-0123” |
Now, instead of column_1, column_2, and so on, we have our new column names. In addition to naming columns for files without headers, you can also use this argument to replace the headers on files that already have them.
Let’s quickly see how this is done. We’ll once again reuse most of the code from before, but this time we’ll drop the has_header argument so that Polars assumes the first row of the file contains column headers. We’ll use the same column list that we defined above and then run the code!
data = pl.read_csv('employees.csv', new_columns=col_names)
data.head()
shape: (5, 10)
| EID | FName | LName | Gender | DOB | Dept | Position | Compensation | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
Once again, the column headers were rewritten and our DataFrame now looks nice and clean.
Separator
Let’s take a look at a few more arguments. Believe it or not, not all CSV files are comma separated. In fact, you can use almost any single-character delimiter you’d like. Some of the most popular separators are semicolons, tabs, and pipes. Let’s get some practice by opening a tab-separated file.
To open this file, we’ll once again call the read_csv function, this time on the employees TSV file. By default, Polars assumes the separator is a comma, so if we ran the code as is, Polars would fail to parse the data correctly. Instead, we’ll pass the tab character, '\t', to the separator argument.
data = pl.read_csv('employees.tsv', separator='\t')
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
When we run the code, we get the DataFrame that we’d expect.
Data Types
Next, let’s review the dtypes argument. By default, Polars tries to pick the best data type for each column. While it does a pretty good job at interpreting the data, it doesn’t always give us the data type we actually want to work with. For example, our Salary column was read in as a 64-bit integer. That works for this dataset because there aren’t any decimal values, but let’s say we actually want to represent this data as a float. To do this, we can use the dtypes argument.
Once again, we’ll call the read_csv function and pass the path to our CSV file. Next, we’ll pass the dtypes argument a dictionary mapping the columns we want to change to their desired data types. In our case, we want our Salary column to be a float, which we can specify with pl.Float64.
data = pl.read_csv('employees.csv', dtypes={'Salary ($)': pl.Float64})
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | f64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000.0 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000.0 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000.0 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000.0 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000.0 | “david.b@exampl… | “567-890-1234” |
When we display the DataFrame, our Salary column is no longer an integer. Instead, it’s displayed as a float value with decimals.
Try Parse Dates
You might have noticed that we have a Date of Birth column, but its data type is currently a string. While that might be fine in some circumstances, there will be times when you’ll want to read date columns in with actual date or datetime data types. To do this, we can use the try_parse_dates argument. It tells Polars to look through the CSV file, identify date and datetime patterns, and read those columns in with a date or datetime data type.
Let’s demonstrate this with our employees dataset. We’ll once again call the read_csv function with the path to our employees CSV file, and this time set try_parse_dates equal to True.
data = pl.read_csv('employees.csv', try_parse_dates=True)
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | date | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | 1985-03-15 | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | 1990-07-20 | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | 1988-11-10 | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | 1992-04-25 | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | 1987-09-08 | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
When we run the code, we get back our DataFrame once more. This time, instead of a string, the Date of Birth column has the correct date data type.
Conclusion
We’ve demonstrated the major capabilities of the read_csv function. That said, there are several more parameters you can use. For more, check out the Polars documentation here: https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html#polars.read_csv
Want to learn more about polars? Check out my course!