You might’ve heard that “OK” is one of the most widely recognized words in the world. It’s found in almost every spoken language, from Arabic to Zulu.
In the programming world, we have something similar: Regular Expressions.
Regular Expressions, or regexes as the cool kids say, are powerful tools used to validate, manipulate, and extract data from text. They work by defining a pattern that describes the text you’re trying to find. This pattern is the “OK” of the programming world.
First, let’s take a look at how to construct a common pattern. We can use the title of this blog post as an example:
“Regular Expressions in 10 different languages”
Let’s use a pattern to make sure a number remains in the title. The text you are working against is called the subject, or target. We can write a pattern to work against our subject like this:
\d+
This pattern will match one or more digits. (The \d represents a digit, 0 through 9, and the + is a quantifier that means one or more of whatever comes before it.) Now, since there are more characters in the title, let’s allow them in:
.*\d+.*
So now it reads like this: zero or more characters, followed by one or more digits, followed by zero or more characters. (The . is a wildcard for any single character, and the * quantifier means zero or more.) Each programming language offers a function or method to run this validation. Often it’s named something along the lines of matches, and it returns true if the pattern matches and false if not.
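To make the difference between the two patterns concrete, here is a minimal sketch in Python (picked purely for illustration; the variable names are my own). It assumes Python’s re.fullmatch stands in for the “matches” style of function described above, since it only succeeds when the pattern covers the whole subject, while re.search looks for the pattern anywhere inside it.

```python
import re

title = "Regular Expressions in 10 different languages"

# search() scans the subject for the pattern anywhere inside it.
has_digits = re.search(r"\d+", title) is not None

# fullmatch() only succeeds if the pattern covers the entire subject,
# so the digits-only pattern fails against the full title...
digits_only = re.fullmatch(r"\d+", title) is not None

# ...while the version with wildcards on both sides succeeds.
whole_title = re.fullmatch(r".*\d+.*", title) is not None

print(has_digits, digits_only, whole_title)  # True False True
```

That is why the wildcards are needed: a whole-string check fails on the digits-only pattern because the title contains more than digits.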
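Now let’s pull that number out so we can use it elsewhere in the program. You can do this by defining what’s known as a capturing group. Creating one is easy. Simply surround the part of the pattern you’d like to be able to extract with parentheses like this:
.*(\d+).*
(One caveat: the leading .* is greedy, so it grabs as much as it can and leaves only the final digit, 0, for the group. Writing the pattern as .*?(\d+).* makes the leading wildcard lazy, and the group then captures the whole 10.)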
This pattern is the same for all programming languages that implement Regular Expressions. OK?
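Here is a small Python sketch of pulling the number out with a capturing group. It is an illustration under my own assumptions (the variable names are made up), and it compares a greedy leading .* with a lazy .*? because the two capture different text from the same title.

```python
import re

title = "Regular Expressions in 10 different languages"

# Greedy: the leading .* consumes as much as possible, so the group
# is left with only the last digit.
greedy = re.fullmatch(r".*(\d+).*", title)

# Lazy: the leading .*? consumes as little as possible, so the group
# starts at the first digit and takes the whole number.
lazy = re.fullmatch(r".*?(\d+).*", title)

print(greedy.group(1))  # 0
print(lazy.group(1))    # 10

# group(1) is always a string; convert it before doing arithmetic.
number = int(lazy.group(1))
print(number + 1)  # 11
```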
Now let’s take a look at how each of these languages matches and extracts data using Regular Expressions. For each of these examples we’ll store the blog title in the variable title and use that as our subject.
In Java, the String class has a method called matches. It returns true if the entire string matches the pattern. Java also has a few helper classes, namely java.util.regex.Pattern, which helps to create the patterns, and java.util.regex.Matcher, which is a helper for navigating and keeping state of various matches.
(I recently released a workshop on Regular Expressions in Java and I walk through how to build, learn, and read regexes. It’s a lot of fun. I even sing a little bit.)
Python provides regular expressions through the re module, and lets you avoid double-escaping backslashes by prefixing your string with an r (for raw).
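A minimal sketch of what that looks like, using only the standard re module. The variable names are my own, and the capturing pattern here is just one reasonable choice for this title.

```python
import re

title = "Regular Expressions in 10 different languages"

# Without the r prefix, the backslash has to be doubled.
escaped = "\\d+"
# With the r prefix, the backslash is taken literally.
raw = r"\d+"
print(escaped == raw)  # True

# Compile once, then reuse the pattern object.
pattern = re.compile(r"(\d+)")

# search() finds the first place the pattern matches in the subject.
match = pattern.search(title)
if match:
    print(match.group(1))  # 10
```

Note that search returns None when nothing matches, which is why the result is checked before calling group.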
(Kenneth Love provides a full hands-on Regular Expressions in Python course. You should check it out if you want to see Kenneth flip a table.)
Ruby allows you to create a Regular Expression object by surrounding a pattern with forward slashes. This is known as a regular expression literal.
Ruby also allows some cool shorthand operator overriding. Make sure to check out the =~ operator.
PHP has a function called preg_match. Despite what the name sounds like, it’s not a paternity test result function.
Regular Expression support is not so easy in Swift. I came up with this as a solution after doing some searches. It doesn’t quite roll off the tongue, but the concept is there.
(Maybe I can talk Pasan into doing a workshop for me.)
The System.Text.RegularExpressions namespace is where you’ll find the C# goods. Regex.Matches returns an empty collection if no match is found.
(Psst! We just hired a C# teacher…much more coming soon!)
Groovy, Java’s wacky cousin, adds some nice touches as far as syntax goes. It uses the forward slash literal as well as the =~ operator. The matcher returned here can be used the same as in the Java example above (because it is, in fact, the same!).
Scala, as expected, does things a little differently. Things can be found in the scala.util.matching package. Make sure to read up on the unanchored bits and how the pattern is used as an extractor.
Seriously. All the editors, from Sublime Text to Emacs to Vim, will allow you to provide a Regular Expression to find and replace. So will a bunch of command-line tools like grep and sed. Regexes are embedded everywhere and will really help you add some superpowers to the original functionality.
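The find-and-replace idea carries over to code as well. As a rough sketch in Python (the replacement text and variable names are just examples of mine), re.sub takes a pattern, a replacement, and the subject, and the replacement can refer back to a capturing group with \1:

```python
import re

title = "Regular Expressions in 10 different languages"

# Replace every run of digits with a fixed word.
spelled_out = re.sub(r"\d+", "ten", title)
print(spelled_out)  # Regular Expressions in ten different languages

# \1 in the replacement puts back whatever group 1 captured.
emphasized = re.sub(r"(\d+) different", r"\1 very different", title)
print(emphasized)  # Regular Expressions in 10 very different languages
```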
Regular Expressions are pretty awesome, aren’t they?
I know that we teachers say don’t repeat yourself (DRY) but…come hang out with me in my quick Regular Expressions in Java workshop. No prior knowledge is required. We’ll go over everything from the ground up. And you’ll be able to take the information with you even if Java isn’t your cup of tea.
OMG! You actually left Perl out!
Imagine that I have a website, http://www.abc.com, that is in several languages. I want to create a regular expression that will cover all of the language variations of the homepage.
http://www.abc.com is in English, http://www.abc.com/es is in Spanish, and http://www.abc.com/it is in Italian. What expression can I use to capture all of these versions?
Thank you for your answer.
Hi Lukas! As this is an older blog post, I’d recommend posting your question in the Treehouse Community where Treehouse students and teachers are always happy to help!
To get the language you can use something like
This is untested but should work
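One possible way to approach it, sketched in Python. This is only an illustration and not necessarily what the commenter had in mind: the function name homepage_language is made up, and it assumes language codes are exactly two lowercase letters and that the bare homepage means English ("en").

```python
import re
from typing import Optional

# Optional "/xx" language segment, optional trailing slash.
# The dots are escaped so they only match literal dots.
HOMEPAGE = re.compile(r"http://www\.abc\.com(?:/([a-z]{2}))?/?")


def homepage_language(url: str) -> Optional[str]:
    """Return the language code for a homepage URL, or None if the
    URL is not one of the homepage variations."""
    match = HOMEPAGE.fullmatch(url)
    if match is None:
        return None
    # group(1) is None when there is no language segment; assume English.
    return match.group(1) or "en"


print(homepage_language("http://www.abc.com"))        # en
print(homepage_language("http://www.abc.com/es"))     # es
print(homepage_language("http://www.abc.com/it/"))    # it
print(homepage_language("http://www.abc.com/about"))  # None
```

The (?:...) part groups the slash and the code together without capturing the slash, so group 1 holds only the two letters.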
If you are using Ruby, http://rubular.com/ is a wonderful resource for trying out a regex and checking whether your string matches the pattern.
I’ve always resorted to using http://regexpal.com/ whenever I have trouble getting a regex just right. Regexes are really great and can do some amazing things if you take the time to implement them. Sometimes a regex is the only way to truly solve a problem.