Why Use Regular Expressions
Regular expressions in polars allow you to search for specific patterns within unstructured text. This capability helps you extract valuable insights from what might initially seem like random, disorganized data. By identifying patterns, you can transform unstructured text into useful information, unlocking hidden value in your data.
Configuring Polars
Before diving into some regular expression use cases, let’s start by importing polars and reconfiguring our DataFrame output. Since we’ll be working with long strings of text, we’ll want to increase the number of characters displayed in string columns so the text isn’t truncated.
import polars as pl
pl.Config.set_fmt_str_lengths(100)
Customer Review Patterns
Now that we have polars imported and our table configuration set, we can start getting hands-on experience. Our first use case is to apply regular expressions to a customer review dataset. Our objective is to find certain patterns relating to sentiment. Is the review overall positive or negative? Let’s write those patterns now.
positive_pattern = r"\b(great|amazing|love|perfect|happy|satisfied|recommended|good)\b"
negative_pattern = r"\b(disappointed|terrible|scam|poor|damaged|unhappy|bad|worst)\b"
delivery_pattern = r"\b(late|didn't arrive|damaged|delivery issue|took too long)\b"
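Before loading any data, it can help to try the patterns on a few hand-written strings. The sketch below uses Python's built-in `re` module purely as a quick check; Polars itself relies on the Rust `regex` crate, so the two engines are not identical (for example, the Rust crate has no look-arounds or backreferences), but simple patterns like these should behave the same in both. The sample sentences and the `find_all` helper are hypothetical.

```python
import re

positive_pattern = r"\b(great|amazing|love|perfect|happy|satisfied|recommended|good)\b"
negative_pattern = r"\b(disappointed|terrible|scam|poor|damaged|unhappy|bad|worst)\b"


def find_all(pattern, text):
    """Return every full match of pattern in text, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Two positive keywords in one sentence.
print(find_all(positive_pattern, "I love this blender, it works great"))  # ['love', 'great']

# The \b word boundaries stop "happy" from matching inside "unhappy".
print(find_all(positive_pattern, "I am unhappy with it"))  # []
print(find_all(negative_pattern, "I am unhappy with it"))  # ['unhappy']

# The patterns are case-sensitive, so a capitalized keyword is missed...
print(find_all(positive_pattern, "Great value"))  # []

# ...unless the inline (?i) flag is added at the start of the pattern.
print(find_all("(?i)" + positive_pattern, "Great value"))  # ['Great']
```

If your reviews mix upper and lower case, prefixing the patterns with `(?i)` is one option to consider; the rest of this post keeps the original case-sensitive patterns.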
Great, with our patterns ready to go, we need some data to test them against.
reviews = pl.read_csv('../../Datasets/customer_reviews.csv')
reviews
In the output we now have our customer reviews dataset. You’ll notice our unstructured text in the ReviewText column, and as you read through some of those records, you might be able to see which reviews are positive or negative. That said, reading through each of these would take a considerable amount of time, especially since we have 300 of them.
The contains Method
Instead, let’s use our patterns to give us some quicker insights. We’ll start with the contains method, which returns a Boolean value indicating whether the pattern was matched.
reviews.select(
    pl.col("ReviewText"),
    pl.col("ReviewText").str.contains(positive_pattern).alias("PositiveSentiment"),
    pl.col("ReviewText").str.contains(negative_pattern).alias("NegativeSentiment"),
)
In the output, we see our two new columns with True and False values. You can see that the first two records appear to be positive and the next three records appear to be negative.
The count_matches Method
Not all reviews are straightforward, though. For example, the last record contains both positive and negative sentiment. When a review matches both patterns, how do you know its overall sentiment?
Polars provides another regex method that can help with this: count_matches makes it easy to quantify the patterns.
reviews.select(
    pl.col("ReviewText"),
    pl.col("ReviewText").str.count_matches(positive_pattern).alias("PositiveCount"),
    pl.col("ReviewText").str.count_matches(negative_pattern).alias("NegativeCount"),
)
In the output we get back two new columns with numeric values that give us the count of both the positive and negative patterns found. Looking back down at that last record, you can see that we have 1 positive pattern and 2 negative patterns. If you only compared those two columns and didn’t read the actual review, you could likely come to the right conclusion that the review is overall negative. Reading the review would confirm this.
Social Media Patterns
Let’s move on to social media analysis! Social media posts are another form of unstructured text that we can use regular expressions on. Probably one of the most widely known use cases is companies using regex patterns to flag content that violates community standards. These patterns are usually very robust and complicated.
Our examples however will be very basic. We simply want to extract all of the hashtags and mentions in our social media posts. Let’s start by declaring our patterns.
hashtag_pattern = r"#\w+"
mention_pattern = r"@\w+"
event_pattern = r"(conference|summit|webinar|fair|retreat)\d{4}"
Now that we have our patterns, let’s import our data.
posts = pl.read_csv('../../Datasets/social_media_posts.csv')
posts
And in the output we have our PostText column with our unstructured text.
The extract Method
We want to extract all of the hashtags and mentions for each post. Let’s start by using the extract method.
posts.select(
    pl.col("*"),
    pl.col("PostText").str.extract(hashtag_pattern, 0).alias("HashTags"),
    pl.col("PostText").str.extract(mention_pattern, 0).alias("Mentions"),
)
In the output, we get back two new columns with the actual text that was matched. You’ll notice that this is quite different from the contains method we used above, which simply told us whether the pattern was there. That said, the extract method only returns the first match. You can see this in the first record, where only the first mention is returned.
The extract_all Method
To get every matched pattern, you’ll need to use the extract_all method.
posts.select(
    pl.col("PostText"),
    pl.col("PostText").str.extract_all(hashtag_pattern).alias("HashTags"),
    pl.col("PostText").str.extract_all(mention_pattern).alias("Mentions"),
)
And this time, we get all of the values that match the pattern.
Conclusion
We’ve now covered several ways to use regular expressions in Polars. By mastering these techniques, you can efficiently process and gather insights from large volumes of textual data. Whether you’re working on customer reviews, survey responses, social media posts, or any other form of unstructured text, these functions will come in handy!
Check out the tutorial video on YouTube: https://youtu.be/ELELO8qcV2w