Read CSV Files Using Polars
In this article, we’re going to walk through how to read CSV files into a Polars DataFrame. To do this, we’ll use a function called read_csv. This function accepts a number of arguments that change how the CSV is opened. We’ll keep our first example simple and use the default arguments.
Default Arguments
First, we’ll import polars as pl. Next, we’ll define a new variable called data and set it equal to pl.read_csv. Within the function, we’ll pass the path to our CSV file; in this demonstration, that’s the employees CSV file. Now that we’ve defined our data variable, we want to display our DataFrame. In Jupyter notebooks, you can do that by simply typing the variable name. To display just the first few rows, we’ll call the head function on the DataFrame.
import polars as pl
data = pl.read_csv('employees.csv')
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
When we run the cell, we get our Polars DataFrame displayed in a nice tabular format. We also get the shape of our DataFrame as a tuple that tells us how many rows and columns there are. Note that it only counts the rows that contain data. The first two non-data rows in the DataFrame contain the column names and the Polars data types. Pretty simple, right?
Has Headers
As a data engineer or data analyst, you’re bound to work with CSV files that both do and don’t have column headers in the first row. By default, Polars assumes that headers exist. However, you can tell Polars that the first row does not contain column headers using the has_header argument.
Let’s do that now by once again opening our employees CSV file. This time, we’ll set the has_header argument to False. Then we’ll display the first few rows of our DataFrame by calling the head function.
data = pl.read_csv('employees.csv', has_header=False)
data.head()
shape: (5, 10)
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | str | str | str | str |
| “Employee ID” | “First Name” | “Last Name” | “Gender” | “Date of Birth” | “Department” | “Position” | “Salary ($)” | “Email” | “Phone” |
| “1” | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | “75000” | “john.smith@exa… | “123-456-7890” |
| “2” | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | “60000” | “jane.doe@examp… | “234-567-8901” |
| “3” | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | “80000” | “michael.j@exam… | “345-678-9012” |
| “4” | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | “55000” | “emily.w@exampl… | “456-789-0123” |
When we run the code, the column headers are now in the first row of data and the column names are now generic. This ultimately isn’t what we want for this dataset, but it illustrates how we can adjust our polars DataFrame for csv files without headers.
New Columns
A closely related argument is new_columns. It allows us to overwrite the column headers by passing a list of column names. If you’re working with files without column headers, this is the perfect way to add them for greater clarity. Let’s demonstrate how this is done!
First, we’ll create our list of headers and store it in a new variable called col_names. Once we have that list, we’ll reuse the same code as before, keeping the has_header argument. We’ll then pass our list to the new_columns argument and once again display the first five rows of the output.
col_names = ['EID','FName','LName','Gender','DOB','Dept','Position','Compensation','Email','Phone']
data = pl.read_csv('employees.csv', has_header=False, new_columns=col_names)
data.head()
shape: (5, 10)
| EID | FName | LName | Gender | DOB | Dept | Position | Compensation | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | str | str | str | str |
| “Employee ID” | “First Name” | “Last Name” | “Gender” | “Date of Birth” | “Department” | “Position” | “Salary ($)” | “Email” | “Phone” |
| “1” | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | “75000” | “john.smith@exa… | “123-456-7890” |
| “2” | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | “60000” | “jane.doe@examp… | “234-567-8901” |
| “3” | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | “80000” | “michael.j@exam… | “345-678-9012” |
| “4” | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | “55000” | “emily.w@exampl… | “456-789-0123” |
Now, instead of column_1, column_2, and so on, we have our new column names. In addition to naming columns for files without headers, you can also use this argument to replace the headers on files that already have them.
Let’s quickly see how this is done. We’ll once again reuse most of the code from before, but this time we’ll drop the has_header argument so that Polars assumes the first row of the file contains column headers. We’ll use the same column list that we defined above and then run the code!
data = pl.read_csv('employees.csv', new_columns=col_names)
data.head()
shape: (5, 10)
| EID | FName | LName | Gender | DOB | Dept | Position | Compensation | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
Once again, the column headers were rewritten and our DataFrame now looks nice and clean.
Separator
Let’s take a look at a few more arguments. Believe it or not, not all CSV files are comma separated. In fact, you can use almost any single-character delimiter you’d like. Some of the most popular separators are semicolons, tabs, and pipes. Let’s get some practice by opening a tab-separated file.
To open this file, we’ll once again call the read_csv function, this time on the employees TSV file. By default, Polars assumes the separator is a comma, so if we ran the code as is, Polars would fail to parse the data correctly. Instead, we’ll pass the tab character, '\t', to the separator argument.
data = pl.read_csv('employees.tsv', separator='\t')
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
When we run the code, we get the DataFrame that we’d expect.
Data Types
Next, let’s review the dtypes argument. By default, Polars tries to pick the best data type for each column. While it does a pretty good job at interpreting the data, it doesn’t always give us the data type we actually want to work with. For example, our Salary column was read in as a 64-bit integer. That works for this dataset because there aren’t any decimal values, but let’s say we actually want to represent this data as a float. To do this, we can use the dtypes argument.
Once again, we’ll call the read_csv function and pass the path to our CSV file. Next, we’ll pass the dtypes argument a dictionary mapping the columns we want to change to their desired data types. In our case, we want our Salary column to be a float, which we can specify with pl.Float64.
data = pl.read_csv('employees.csv', dtypes={'Salary ($)': pl.Float64})
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | str | str | str | f64 | str | str |
| 1 | “John” | “Smith” | “Male” | “1985-03-15” | “Sales” | “Sales Manager” | 75000.0 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | “1990-07-20” | “HR” | “HR Specialist” | 60000.0 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | “1988-11-10” | “IT” | “IT Manager” | 80000.0 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | “1992-04-25” | “Marketing” | “Marketing Spec… | 55000.0 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | “1987-09-08” | “Finance” | “Accountant” | 65000.0 | “david.b@exampl… | “567-890-1234” |
When we display the DataFrame, our Salary column is no longer an integer. Instead, it’s displayed as a float value with decimals.
Try Parse Dates
You might have noticed that we have a Date of Birth column, but its data type is currently a string. While that might be fine in some circumstances, there will be times when you’ll want to read date columns in with actual date or datetime data types. To do this, we can use the try_parse_dates argument. It tells Polars to look through the CSV file, identify date and datetime patterns, and read those columns in with a date or datetime data type.
Let’s demonstrate this with our employees dataset. We’ll once again call the read_csv function with the path to our employees CSV file, and this time set try_parse_dates equal to True.
data = pl.read_csv('employees.csv', try_parse_dates=True)
data.head()
shape: (5, 10)
| Employee ID | First Name | Last Name | Gender | Date of Birth | Department | Position | Salary ($) | Email | Phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | str | date | str | str | i64 | str | str |
| 1 | “John” | “Smith” | “Male” | 1985-03-15 | “Sales” | “Sales Manager” | 75000 | “john.smith@exa… | “123-456-7890” |
| 2 | “Jane” | “Doe” | “Female” | 1990-07-20 | “HR” | “HR Specialist” | 60000 | “jane.doe@examp… | “234-567-8901” |
| 3 | “Michael” | “Johnson” | “Male” | 1988-11-10 | “IT” | “IT Manager” | 80000 | “michael.j@exam… | “345-678-9012” |
| 4 | “Emily” | “Williams” | “Female” | 1992-04-25 | “Marketing” | “Marketing Spec… | 55000 | “emily.w@exampl… | “456-789-0123” |
| 5 | “David” | “Brown” | “Male” | 1987-09-08 | “Finance” | “Accountant” | 65000 | “david.b@exampl… | “567-890-1234” |
When we run the code, we get back our DataFrame once more. This time, instead of a string, the Date of Birth column has the correct date data type.
Conclusion
We’ve demonstrated the major capabilities of the read_csv function. That said, there are several more parameters you can use. For more, check out the Polars documentation here: https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html#polars.read_csv
Want to learn more about polars? Check out my course!