
Java Journal: Regular Expressions


So why devote an entire "Java Journal" to regular expressions? One word: Power. Of all the Java extensions and APIs Sun has released to date, none has let you wield so much power with just a few keystrokes. But as Spider-Man's Uncle Ben so prophetically warned Peter Parker, "With great power comes great responsibility." Lest this power gifted to you by Sun become your curse, use it wisely. OK, so maybe the Spider-Man reference is a bit much, and maybe regular expressions won't turn you into a superhero--well, at least not overnight.

In last month's Java Journal, I discussed optimizing Java I/O using NIO, the new Java I/O API. However, you might totally overlook one part of the NIO API unless you know about it, and that's its regular expression engine. So what exactly is a regular expression? Well, it's nothing more than a pattern. In our context, it is specifically a pattern--or sequence, if you prefer--of characters. In fact, calling it a character sequence is exactly what Sun did. Within JDK 1.4, Sun added an interface called CharSequence, which both the String and StringBuffer classes now implement. Thus, with the regular expression classes (and others for that matter), we can now deal with the abstraction of a CharSequence and not have to worry about the underlying implementation. This is truly polymorphism at its best. The NIO regular expression engine itself is made up of a paltry two classes, Pattern and Matcher, both of which live in the java.util.regex package.

Pattern

The Pattern class is the container for a regular expression. It has no public constructors, but rather a pair of static factory methods called compile(), both of which return an instance of Pattern. Both flavors of compile() take a string argument that represents the regular expression. In addition, one takes an integer of flags; several flags can be passed as a single argument by combining them with the integer bitwise OR operator, which is represented as a vertical bar (|), as in Flag1 | Flag2. (I know it is splitting semantic hairs, but I hate it when people refer to the vertical bar as the Boolean OR operator; it does function as such but not when acting on primitive integral types. This is probably leftover C jargon from the days when all Booleans in C were integers.) The flags that can be passed to the compile() factory method allow for such rules as enabling UNIX line terminators, making the entire pattern case-insensitive, or permitting comments in the pattern. There are several other flags, but they are mostly esoteric and are included to keep regular expressions compatible with previous engines.
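As a minimal sketch of the flag-taking flavor of compile(), the following combines two of the flags described above, Pattern.CASE_INSENSITIVE and Pattern.COMMENTS, with the bitwise OR operator. The class name PatternFlagsSketch and its helper methods are made up for illustration and are not part of the API.

```java
import java.util.regex.Pattern;

public class PatternFlagsSketch
{
  // Hypothetical helper: compiles a pattern with two flags
  // combined into a single int argument using bitwise OR.
  static Pattern buildPattern()
  {
    // With COMMENTS enabled, whitespace in the pattern is ignored and
    // everything from # to the end of the line is treated as a comment.
    String expression = "fl   # the letters f and l, in that order";
    return Pattern.compile(expression,
      Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
  }

  // Hypothetical helper: true if the pattern occurs anywhere in the input.
  static boolean containsMatch(CharSequence input)
  {
    return buildPattern().matcher(input).find();
  }

  public static void main(String[] args)
  {
    System.out.println(containsMatch("Michael FLOYD")); // true
    System.out.println(containsMatch("michael"));       // false
  }
}
```

Because of CASE_INSENSITIVE, the lowercase pattern still finds the uppercase "FL" in the first input.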

Matcher

The Matcher class, as the name implies, allows us to match instances of Pattern against instances of CharSequence. Like Pattern, Matcher has no public constructors. In fact, we actually use instances of Pattern to create instances of Matcher by invoking the appropriately named Pattern method matcher(). Once you have created your Matcher, you can do all kinds of tricks with it. You can do a find(), which returns a Boolean telling you whether or not there is a match of your regular expression contained in your CharSequence. If it is found, you can then make a call to group() to extract the matching sequence.

Example 1

Here is an example of matching a pattern to a string. In the first couple of examples, I'll stick to lowercase strings to keep things readable:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String inputString = "michael floyd";
    
    Pattern myPattern = Pattern.compile("fl");
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    if (myMatcher.find())
    { 
      System.out.println(myMatcher.group());
    }
  }
}

In the above example, we first import the two classes that we will be using from java.util.regex. Then, to keep things simple, we do our work inside main(). We create a string containing my name and a two-letter pattern. We then create an instance of Matcher by calling matcher() and passing it our input string. Remember that the String class now implements CharSequence. We could have just as easily used a StringBuffer, which also implements CharSequence. We then call find() on the matcher, and if we find what we are looking for, we call group() to extract the match and push the results to System.out.
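To illustrate the point about CharSequence, here is a small sketch in which the same helper accepts either a String or a StringBuffer. The class name CharSequenceSketch and the method firstMatch() are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharSequenceSketch
{
  // Hypothetical helper: returns the first match of the expression,
  // or null if the input contains no match.
  static String firstMatch(String expression, CharSequence input)
  {
    Matcher myMatcher = Pattern.compile(expression).matcher(input);
    return myMatcher.find() ? myMatcher.group() : null;
  }

  public static void main(String[] args)
  {
    // A String and a StringBuffer are both CharSequences,
    // so the same method handles either one.
    StringBuffer buffer = new StringBuffer("michael ");
    buffer.append("floyd");

    System.out.println(firstMatch("fl", "michael floyd")); // fl
    System.out.println(firstMatch("fl", buffer));          // fl
  }
}
```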

This is pretty straightforward and could have been accomplished in a number of ways by using existing APIs. In order to experience the true power of regular expressions, we must dig deeper. It is also worth noting that we used an if statement based on the return value of find() to guard our call to group(). Otherwise, the call to group() would throw an IllegalStateException if there was no match for the regular expression. You could just put in a try catch block, but I think the example given reads better without it. The output for this program would be "fl", since the pattern would be found.
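For comparison, here is roughly what the try catch alternative might look like. This is only a sketch: the class name UnguardedGroupSketch, the method groupOrFallback(), and the "no match" fallback string are all invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnguardedGroupSketch
{
  // Hypothetical helper: calls group() without checking find() first
  // and relies on the exception to detect a failed match.
  static String groupOrFallback(String expression, String input)
  {
    Matcher myMatcher = Pattern.compile(expression).matcher(input);
    myMatcher.find(); // return value deliberately ignored

    try
    {
      return myMatcher.group();
    }
    catch (IllegalStateException e)
    {
      // group() throws this when the last find() did not succeed.
      return "no match";
    }
  }

  public static void main(String[] args)
  {
    System.out.println(groupOrFallback("fl", "michael floyd"));  // fl
    System.out.println(groupOrFallback("xyz", "michael floyd")); // no match
  }
}
```

It works, but the exception handling buries the intent, which is why the if statement reads better.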

Example 2

For this example, we want to find all the consonants followed by a vowel. Since the Java API for regular expressions follows the standard format for regular expressions, you just need to learn how to express a couple of basic patterns.

First, a character class (a set or range of characters, any one of which may match) is expressed by enclosing the characters in brackets. You can use a dash to specify all the characters that come between any two specific characters. So [ac] would match either a or c, but [a-z] would match any lowercase letter between a and z. You can also use the vertical bar (|) to designate the OR condition between alternatives, or double ampersands (&&) inside brackets for AND, meaning the intersection of two character classes. So cat|dog would match "cat" or "dog". The caret (^), placed first inside the brackets, negates a character class. So the regular expression we need for our example is [a-z&&[^aeiou]][aeiou].
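Before putting these pieces together, it may help to try each one on its own. The following sketch runs each building block against a short input and prints every match. The class name CharacterClassSketch, the helper allMatches(), and the sample inputs are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharacterClassSketch
{
  // Hypothetical helper: collects every match of the expression
  // found in the input, in the order the matches occur.
  static List<String> allMatches(String expression, String input)
  {
    List<String> matches = new ArrayList<String>();
    Matcher myMatcher = Pattern.compile(expression).matcher(input);
    while (myMatcher.find())
    {
      matches.add(myMatcher.group());
    }
    return matches;
  }

  public static void main(String[] args)
  {
    // Either a or c
    System.out.println(allMatches("[ac]", "abc"));               // [a, c]
    // Any letter from a through c
    System.out.println(allMatches("[a-c]", "abcd"));             // [a, b, c]
    // One alternative OR the other
    System.out.println(allMatches("cat|dog", "cat and dog"));    // [cat, dog]
    // Negation: anything that is not a vowel
    System.out.println(allMatches("[^aeiou]", "audio"));         // [d]
    // Intersection: a lowercase letter AND not a vowel
    System.out.println(allMatches("[a-z&&[^aeiou]]", "michael")); // [m, c, h, l]
  }
}
```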

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String inputString = "michael floyd";
    String myExpression =
      "[a-z&&[^aeiou]][aeiou]";
    
    Pattern myPattern = Pattern.compile(myExpression);
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    while (myMatcher.find())
    { 
      System.out.println(myMatcher.group());
    }
  }
}

We have made a few key modifications to our previous example here. First, we have changed the expression to literally read "match any letter that is not a vowel followed by any letter that is a vowel." Second, we have changed the if statement to a while statement so that all of the occurrences of the pattern, not just the first one, are found and printed. The output of this program would be as follows:

mi
ha
lo

Example 3

For this example, let's assume that we want to extract from a string all instances in which we have either two consonants or two vowels in a row.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String inputString = "michael floyd";
    String myExpression =
      "[a-z&&[^aeiou]][a-z&&[^aeiou]]|[aeiou][aeiou]";
    
    Pattern myPattern = Pattern.compile(myExpression);
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    while (myMatcher.find())
    {
      System.out.println(myMatcher.group());
    }
  }
}

For this example, we have built on the previous example, and although our expression is getting longer, it isn't actually very complicated. In fact, to make it more readable, all we have to do is replace the reusable parts of the expression with symbolic constants, like in the following example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionMain 
{
  public static void main(String[] args) 
  { 
    String consonants = "[a-z&&[^aeiou]]";
    String vowel = "[aeiou]";
    String inputString = "michael floyd";
    String myExpression =
      consonants + consonants + "|" + vowel + vowel;
    
    Pattern myPattern = Pattern.compile(myExpression);
    
    Matcher myMatcher = myPattern.matcher(inputString);
    
    while (myMatcher.find())
    { 
      System.out.println(myMatcher.group());
    }
  }
}

The output for both versions of this example is the same:

ch
ae
fl
yd

These two examples aren't very different structurally. But what would happen if we decided to include uppercase letters in our pattern and input? Our regular expression would turn into this:

String myExpression =
  "[a-zA-Z&&[^aeiouAEIOU]][a-zA-Z&&[^aeiouAEIOU]]|[aeiouAEIOU][aeiouAEIOU]";

But if we went with our newer version, the same snippet would look like this:

String consonants = "[a-zA-Z&&[^aeiouAEIOU]]";
String vowel = "[aeiouAEIOU]";
    
String myExpression =
  consonants + consonants + "|" + vowel + vowel;


I don't know about you, but I far prefer the latter. There are also ways of expressing repeated patterns like this within the standard regular expression syntax.
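One such way, sketched here, is the {n} quantifier, which repeats the preceding element exactly n times, so each character class only has to be written once. The class name QuantifierSketch, its helper methods, and the mixed-case input are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierSketch
{
  // Hypothetical helper: builds "two consonants OR two vowels"
  // using the {2} quantifier instead of repeating each class.
  static String buildExpression()
  {
    String consonant = "[a-zA-Z&&[^aeiouAEIOU]]";
    String vowel = "[aeiouAEIOU]";
    return consonant + "{2}|" + vowel + "{2}";
  }

  // Hypothetical helper: collects every match found in the input.
  static List<String> allMatches(String input)
  {
    List<String> matches = new ArrayList<String>();
    Matcher myMatcher = Pattern.compile(buildExpression()).matcher(input);
    while (myMatcher.find())
    {
      matches.add(myMatcher.group());
    }
    return matches;
  }

  public static void main(String[] args)
  {
    System.out.println(allMatches("Michael Floyd")); // [ch, ae, Fl, yd]
  }
}
```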

Other Regular Expression Engines

Predating the JDK 1.4 release were two regular expression engines put out by Jakarta/Apache: Jakarta ORO, which provides Perl5-compatible regular expressions, and Jakarta regexp, which predates ORO. In addition to sed, awk, grep, and egrep on UNIX, we have to pay homage to Perl which, whether you love it or hate it, has pushed the boundaries of what can be done with regular expressions.

Declarative Languages

Like many people, I learned my computer skills in the UNIX-filled world of higher education. In those days, I knew sed, awk, grep, and vi better than the cheat codes to my favorite video games. These tools all had one thing in common: regular expressions. Later in my academic career, I took my first database class. Since my school's Computer Science program was an offshoot of the Mathematics Department, the database class was all based on set theory and tuple relational calculus. In the end, we studied SQL, and I was surprised by how much it reminded me of regular expressions. SQL by itself is a declarative language; by this, I mean that in SQL we make a single declarative statement and then pipe it off to a database to return the desired result. Sound familiar? With regular expressions, we create an expression and then run it through an engine to return the desired result. Early in my career, I found myself using this analogy to teach coworkers who were familiar with regular expressions how to write SQL. Ironically, I recently found myself explaining regular expressions to someone quite familiar with SQL, using the inverse analogy. Regular expressions are tightly coupled with automata theory and formal language theory. Regular expressions themselves actually describe languages.

Final Thoughts

Regular expression engines have been around for a long time and, after having been part of some Java APIs for quite some time, have finally found a home in the JDK. Here, I have provided the smallest of tastes of the full syntax for regular expressions. For a full treatment, see the API specifications. Regular expressions are powerful, but they can be unwieldy and error prone. So be sure to write lots of unit tests to make sure your code is behaving as it should. In fact, sometimes regular expressions can be helpful in writing your unit tests.

In the examples provided here, we have just scratched the surface of what can be done with regular expressions. Think about writing a validator for URLs or email addresses. This can be done with a single, although rather ugly, regular expression. In fact, you can probably find one already written for you. Using a regular expression to validate URLs and email addresses instead of writing your own using string manipulation and tokens may just make you feel like a superhero.
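To give a feel for the idea, here is a deliberately simplified sketch of an email check. The expression is my own rough approximation, not a complete or standards-compliant validator, and the class name EmailCheckSketch and method looksLikeEmail() are hypothetical. It uses the Matcher method matches(), which succeeds only if the entire input fits the pattern.

```java
import java.util.regex.Pattern;

public class EmailCheckSketch
{
  // Simplified assumption: some allowed characters, an @ sign,
  // a domain, a dot, and at least two letters. Real addresses
  // can be far more varied than this expression allows.
  static final Pattern EMAIL = Pattern.compile(
    "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

  // Hypothetical helper: true only if the whole input fits the pattern.
  static boolean looksLikeEmail(String input)
  {
    return EMAIL.matcher(input).matches();
  }

  public static void main(String[] args)
  {
    System.out.println(looksLikeEmail("someone@example.com")); // true
    System.out.println(looksLikeEmail("missing@tld"));         // false
    System.out.println(looksLikeEmail("not an email"));        // false
  }
}
```

Even this rough version replaces what would otherwise be a fair amount of string manipulation, and it is exactly the kind of code that deserves the unit tests mentioned above.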

For more information on NIO check out Java NIO by Ron Hitchens. If you would like more information on formal languages and automata theory, there is no better book than The Theory of Finite Automata by John Carroll and Darrell Long.

Michael J. Floyd is the Software Engineering Manager for DivXNetworks. He is also a consultant for San Diego State University.

Michael Floyd

Michael J. Floyd is the Vice President of Engineering for DivX, Inc.
