Java Journal: Regular Expressions


So why devote an entire "Java Journal" to regular expressions? One word: Power. In all of the Java extensions and APIs from Sun to date never have you been able to wield so much power with just a few clicks of your keyboard. But as Spiderman's Uncle Ben so prophetically warned Peter Parker, "With great power comes great responsibility." Lest this power gifted to you by Sun become your curse, use it wisely. OK, so maybe the Spiderman reference is a bit much, and maybe regular expressions won't turn you into a superhero--well, at least not overnight.

In last month's Java Journal, I discussed optimizing Java I/O using NIO, the new Java I/O API. However, you might totally overlook one part of the NIO API unless you know about it, and that's its regular expression engine. So what exactly is a regular expression? Well, it's nothing more than a pattern. In our context, it is specifically a pattern--or sequence, if you prefer--of characters. In fact, calling it a character sequence is exactly what Sun did. Within JDK 1.4, Sun added an interface called CharSequence, which both the String and StringBuffer classes now implement. Thus, with the regular expression classes (and others for that matter), we can now deal with the abstraction of a CharSequence and not have to worry about the underlying implementation. This is truly polymorphism at its best. The NIO regular expression engine itself lives in the java.util.regex package and is made up of a paltry two classes, Pattern and Matcher (plus one exception class, PatternSyntaxException).

Pattern

The Pattern class is the container for a compiled regular expression. It has no public constructors, but rather a pair of static factory methods called compile(), both of which return an instance of Pattern. Both flavors of compile() take a string argument that represents the regular expression. In addition, one takes an int that holds a set of flags; several flags can be passed as a single argument by combining them with the bitwise OR operator, which is represented as a vertical bar (|), as in Flag1 | Flag2. (I know it is splitting semantic hairs, but I hate it when people refer to the vertical bar as the Boolean OR operator; it does function as such on boolean operands but not when acting on primitive integral types. This is probably leftover C jargon from the days when all Booleans in C were integers.) The flags that can be passed to the compile() factory method allow for such rules as enabling UNIX line terminators, making the entire pattern case-insensitive, or permitting comments in the pattern. There are several other flags, but they are mostly esoteric and are included largely for compatibility with the behavior of earlier regular expression engines.
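To make the flag discussion concrete, here is a minimal sketch of both flavors of compile(). The class name PatternFlagsSketch and the helper firstMatch() are made up for illustration; Pattern.CASE_INSENSITIVE and Pattern.COMMENTS are two of the flag constants defined on Pattern, and passing 0 means "no flags."

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternFlagsSketch
{
  // Hypothetical helper: returns the first match of the expression in the
  // input, or null if there is none.
  static String firstMatch(String expression, int flags, String input)
  {
    Matcher matcher = Pattern.compile(expression, flags).matcher(input);
    return matcher.find() ? matcher.group() : null;
  }

  public static void main(String[] args)
  {
    String input = "Michael FLOYD";

    // No flags: the lowercase pattern does not match the uppercase text.
    System.out.println(firstMatch("floyd", 0, input));  // null

    // CASE_INSENSITIVE lets "floyd" match "FLOYD".
    System.out.println(
      firstMatch("floyd", Pattern.CASE_INSENSITIVE, input));  // FLOYD

    // Two flags combined with bitwise OR. COMMENTS ignores whitespace
    // and "#" comments inside the pattern.
    int flags = Pattern.CASE_INSENSITIVE | Pattern.COMMENTS;
    System.out.println(
      firstMatch("fl oyd  # surname", flags, input));  // FLOYD
  }
}
```

Because the flags are plain int constants, any combination of them still fits in the single int argument.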

Matcher

The Matcher class, as the name implies, allows us to match instances of Pattern against instances of CharSequence. Like Pattern, Matcher has no public constructors. In fact, we actually use instances of Pattern to create instances of Matcher by invoking the appropriately named Pattern method matcher(). Once you have created your Matcher, you can do all kinds of tricks with it. You can do a find(), which returns a boolean telling you whether or not your CharSequence contains another match of your regular expression. If one is found, you can then make a call to group() to extract the matching sequence.
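As a rough sketch of a few of those tricks: besides find() and group(), Matcher also offers start(), end(), matches(), and replaceAll(), which are not covered in this article but are part of the same class. The class name MatcherSketch and the helper describeFirstMatch() are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherSketch
{
  // Hypothetical helper: describes the first match as "text@start-end",
  // or returns "no match".
  static String describeFirstMatch(String expression, CharSequence input)
  {
    Matcher matcher = Pattern.compile(expression).matcher(input);
    if (matcher.find())
    {
      // start() is the index of the first matched character;
      // end() is the index just past the last one.
      return matcher.group() + "@" + matcher.start() + "-" + matcher.end();
    }
    return "no match";
  }

  public static void main(String[] args)
  {
    System.out.println(describeFirstMatch("fl", "michael floyd"));   // fl@8-10
    System.out.println(describeFirstMatch("xyz", "michael floyd"));  // no match

    Matcher matcher = Pattern.compile("fl").matcher("michael floyd");

    // matches() succeeds only if the entire input matches the pattern.
    System.out.println(matcher.matches());         // false

    // replaceAll() resets the matcher and replaces every match.
    System.out.println(matcher.replaceAll("FL"));  // michael FLoyd
  }
}
```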

Example 1

Here is an example of matching a pattern to a string. In the first couple of examples, I'll stick to lowercase strings to keep things readable:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String inputString = "michael floyd";
    
    // Compile the literal two-letter pattern.
    Pattern myPattern = Pattern.compile("fl");
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    // Only call group() if find() located a match.
    if (myMatcher.find())
    { 
      System.out.println(myMatcher.group());
    }
  }
}

In the above example, we first import the two classes that we will be using from java.util.regex. Then, to keep things simple, we do our work inside main(). We create a string containing my name and a two-letter pattern. We then create an instance of Matcher by calling matcher() and passing it our input string. Remember that the String class now implements CharSequence. We could have just as easily used a StringBuffer, which also implements CharSequence. We then call find() on the matcher, and if we find what we are looking for, we call group() to extract the match and push the results to System.out.

This is pretty straightforward and could have been accomplished in a number of ways by using existing APIs. In order to experience the true power of regular expressions, we must dig deeper. It is also worth noting that we used an if statement based on the return value of find() to guard our call to group(). Without that guard, the call to group() would throw an IllegalStateException if there was no match for the regular expression. You could just put in a try/catch block, but I think the example given reads better without it. The output for this program would be "fl", since the pattern would be found.
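Two of the points above can be checked with a small sketch: a StringBuffer works as input because matcher() accepts any CharSequence, and an unguarded group() call after a failed find() throws IllegalStateException. The class name CharSequenceSketch and both helper methods are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharSequenceSketch
{
  // Hypothetical helper: returns the first match, or null if there is none.
  // The input may be any CharSequence, not just a String.
  static String firstMatch(String expression, CharSequence input)
  {
    Matcher matcher = Pattern.compile(expression).matcher(input);
    return matcher.find() ? matcher.group() : null;
  }

  // Hypothetical helper: returns true if calling group() without checking
  // the result of find() ends in an IllegalStateException.
  static boolean unguardedGroupThrows(String expression, CharSequence input)
  {
    Matcher matcher = Pattern.compile(expression).matcher(input);
    matcher.find();  // result deliberately ignored
    try
    {
      matcher.group();
      return false;
    }
    catch (IllegalStateException e)
    {
      return true;
    }
  }

  public static void main(String[] args)
  {
    StringBuffer buffer = new StringBuffer("michael");
    buffer.append(" floyd");

    System.out.println(firstMatch("fl", buffer));            // fl
    System.out.println(unguardedGroupThrows("zz", buffer));  // true
    System.out.println(unguardedGroupThrows("fl", buffer));  // false
  }
}
```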

Example 2

For this example, we want to find all the consonants followed by a vowel. Since the Java API for regular expressions follows the standard format for regular expressions, you just need to learn how to express a couple of basic patterns.

First, a single character or any range of characters is expressed by enclosing the character(s) in brackets, which forms a character class. You can use a dash to specify all the characters that come between any two specific characters. So [ac] would match either a or c, but [a-z] would match any lowercase letter from a to z. Outside of brackets, you can use the vertical bar (|) to designate the OR condition, so cat|dog would match "cat" or "dog". Inside brackets, double ampersands (&&) designate AND (the intersection of two character classes), and a caret (^) placed immediately after the opening bracket negates the class. So the regular expression we need for our example is [a-z&&[^aeiou]][aeiou].
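Each of these building blocks can be tried on its own. The sketch below uses the static convenience method Pattern.matches(), which reports whether an entire input matches an expression; the class name CharacterClassSketch and the wrapper method are made up for illustration.

```java
import java.util.regex.Pattern;

public class CharacterClassSketch
{
  // Hypothetical wrapper: true if the whole input matches the expression.
  static boolean matches(String expression, String input)
  {
    return Pattern.matches(expression, input);
  }

  public static void main(String[] args)
  {
    // A simple class and a range.
    System.out.println(matches("[ac]", "c"));   // true
    System.out.println(matches("[ac]", "b"));   // false
    System.out.println(matches("[a-z]", "q"));  // true
    System.out.println(matches("[a-z]", "Q"));  // false

    // Alternation with the vertical bar.
    System.out.println(matches("cat|dog", "dog"));  // true
    System.out.println(matches("cat|dog", "cow"));  // false

    // Negation alone also accepts non-letters such as a space...
    System.out.println(matches("[^aeiou]", " "));   // true

    // ...which is why the consonant class intersects it with [a-z].
    System.out.println(matches("[a-z&&[^aeiou]]", "m"));  // true
    System.out.println(matches("[a-z&&[^aeiou]]", "e"));  // false
    System.out.println(matches("[a-z&&[^aeiou]]", " "));  // false
  }
}
```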

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String inputString = "michael floyd";
    String myExpression =
      "[a-z&&[^aeiou]][aeiou]";
    
    Pattern myPattern = Pattern.compile(myExpression);
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    while (myMatcher.find())
    { 
      System.out.println(myMatcher.group());
    }
  }
}

We have made a few key modifications to our previous example here. First, we have changed the expression to literally read "match any lowercase letter that is not a vowel followed by any lowercase letter that is a vowel." Second, we have changed the if statement to a while statement so that all of the occurrences of the pattern, not just the first one, are found and printed. The output of this program would be as follows:

mi
ha
lo

Example 3

For this example, let's assume that we want to extract from a string all instances in which we have either two consonants or two vowels in a row.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String inputString = "michael floyd";
    String myExpression =
      "[a-z&&[^aeiou]][a-z&&[^aeiou]]|[aeiou][aeiou]";
    
    Pattern myPattern = Pattern.compile(myExpression);
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    // Print every pair of consonants or pair of vowels.
    while (myMatcher.find())
    { 
      System.out.println(myMatcher.group());
    }
  }
}

For this example, we have built on the previous example, and although our expression is getting longer, it isn't actually very complicated. In fact, to make it more readable, all we have to do is replace the reusable parts of the expression with named strings, as in the following example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String consonants = "[a-z&&[^aeiou]]";
    String vowel = "[aeiou]";
    String inputString = "michael floyd";
    String myExpression =
      consonants + consonants + "|" + vowel + vowel;
    
    Pattern myPattern = Pattern.compile(myExpression);
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    while (myMatcher.find())
    { 
      System.out.println(myMatcher.group());
    }
  }
}

The output for both versions of this example is the same:

ch
ae
fl
yd

These two examples aren't very different structurally. But what would happen if we decided to include uppercase letters in our pattern and input? Our regular expression would turn into this:

String myExpression =
  "[a-zA-Z&&[^aeiouAEIOU]][a-zA-Z&&[^aeiouAEIOU]]|[aeiouAEIOU][aeiouAEIOU]";

But if we went with our newer version, the same snippet would look like this:

String consonants = "[a-zA-Z&&[^aeiouAEIOU]]";
String vowel = "[aeiouAEIOU]";
    
String myExpression =
  consonants + consonants + "|" + vowel + vowel;


I don't know about you, but I far prefer the latter. There are also ways of expressing repeated patterns like this within the standard regular expression syntax.
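One such way, sketched here as an illustration rather than as the only option, is a quantifier: X{2} means "X exactly twice," so each doubled class can be written once. The class name QuantifierSketch and the helper findAll() are hypothetical; the consonant and vowel strings are the ones defined above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierSketch
{
  // Hypothetical helper: collects every match of the expression, in order.
  static List<String> findAll(String expression, String input)
  {
    List<String> matches = new ArrayList<String>();
    Matcher matcher = Pattern.compile(expression).matcher(input);
    while (matcher.find())
    {
      matches.add(matcher.group());
    }
    return matches;
  }

  public static void main(String[] args)
  {
    String consonant = "[a-zA-Z&&[^aeiouAEIOU]]";
    String vowel = "[aeiouAEIOU]";

    // {2} binds more tightly than |, so this reads as
    // (two consonants) OR (two vowels).
    String myExpression = consonant + "{2}|" + vowel + "{2}";

    System.out.println(findAll(myExpression, "Michael Floyd"));
    // [ch, ae, Fl, yd]
  }
}
```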

Other Regular Expression Engines

Predating the JDK 1.4 release were two regular expression engines put out by Jakarta/Apache: Jakarta ORO, which provides Perl5-compatible regular expressions, and Jakarta regexp, which predates ORO. In addition to sed, awk, grep, and egrep on UNIX, we have to pay homage to Perl, which, whether you love it or hate it, has pushed the boundaries of what can be done with regular expressions.

Declarative Languages

Like many people, I learned my computer skills in the UNIX-filled world of higher education. In those days, I knew sed, awk, grep, and vi better than the cheat codes to my favorite video games. These tools all had one thing in common: regular expressions. Later in my academic career, I took my first database class. Since my school's Computer Science program was an offshoot of the Mathematics Department, the database class was all based on set theory and tuple relational calculus. In the end, we studied SQL, and I was surprised by how much it reminded me of regular expressions.

SQL by itself is a declarative language; by this, I mean that in SQL we make a single declarative statement and then pipe it off to a database to return the desired result. Sound familiar? With regular expressions, we create an expression and then run it through an engine to return the desired result. Early in my career, I found myself using this analogy to teach coworkers who were familiar with regular expressions how to write SQL. Ironically, I recently found myself explaining regular expressions to someone quite familiar with SQL, using the inverse analogy.

Regular expressions are tightly coupled with automata theory and formal language theory. Regular expressions themselves actually describe languages.

Final Thoughts

Regular expression engines have been around for a long time and, after having been part of some Java APIs for quite some time, have finally found a home in the JDK. Here, I have provided the smallest of tastes of the full syntax for regular expressions. For a full treatment, see the API specifications. Regular expressions are powerful, but they can be unwieldy and error-prone. So be sure to write lots of unit tests to make sure your code is behaving as it should. In fact, sometimes regular expressions can be helpful in writing your unit tests.

In the examples provided here, we have just scratched the surface of what can be done with regular expressions. Think about writing a validator for URLs or email addresses. This can be done with a single, although rather ugly, regular expression. In fact, you can probably find one already written for you. Using a regular expression to validate URLs and email addresses instead of writing your own using string manipulation and tokens may just make you feel like a superhero.
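To give a feel for the shape of such a validator, here is a deliberately simplified sketch. The expression is my own rough approximation (some characters, an @, a domain, a dot, and at least two letters); it is not a standards-compliant email pattern and will reject some valid addresses and accept some invalid ones. The class name EmailSketch and the method looksLikeEmail() are hypothetical.

```java
import java.util.regex.Pattern;

public class EmailSketch
{
  // Simplified, assumed pattern for illustration only; real-world
  // validators are considerably uglier.
  private static final Pattern EMAIL =
    Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

  // matches() requires the whole input to fit the pattern, which is
  // what a validator wants; find() would accept a match anywhere.
  static boolean looksLikeEmail(CharSequence input)
  {
    return EMAIL.matcher(input).matches();
  }

  public static void main(String[] args)
  {
    System.out.println(looksLikeEmail("michael@example.com"));  // true
    System.out.println(looksLikeEmail("not-an-email"));         // false
    System.out.println(looksLikeEmail("user@@example.com"));    // false
  }
}
```

Compiling the Pattern once into a static field avoids recompiling the expression on every call, and the handful of calls in main() is exactly the kind of check that belongs in a unit test.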

For more information on NIO, check out Java NIO by Ron Hitchens. If you would like more information on formal languages and automata theory, there is no better book than The Theory of Finite Automata by John Carroll and Darrell Long.

Michael J. Floyd is the Software Engineering Manager for DivXNetworks. He is also a consultant for San Diego State University.
