Nov 01 2006
Software

Regular Expressions in C# or Visual Basic

Find, replace and validate text with just a few lines of code.

Developers frequently need to process text. A developer might need to clean user input by removing or replacing special characters, or to parse text output by a legacy application in order to integrate with an existing system. For decades, UNIX and Perl developers have used a complex but efficient technique for processing text: regular expressions.

A regular expression is a pattern, built from ordinary characters plus dozens of special characters and operators, that can be compared to a string to determine whether the string meets specified format requirements. You also can use regular expressions to extract portions of the text or to replace text. To make decisions based on text, you can create regular expressions that match strings consisting entirely of digits, strings that contain only lowercase letters, or strings that contain hexadecimal input. You can also extract key portions of a block of text, such as the state from a user’s address or the image links from an HTML page. Finally, you can update text using regular expressions to change the format of text or remove invalid characters. This article focuses on the most fundamental use of regular expressions: matching and validating strings.

The most commonly used special characters in regular expressions are: “^” to match the beginning of a string, “$” to match the end of a string, “?” to make the preceding character optional, “.” to match any single character and “*” to match the preceding character zero or more times.

To be able to test regular expressions, create a Visual Basic or C# console app named TestRegExp that accepts two strings as input and determines whether the first string (a regular expression) matches the second string. The following console application, which uses the System.Text.RegularExpressions namespace, performs this check using the static System.Text.RegularExpressions.Regex.IsMatch method and displays the results to the console:

' VB
Imports System
Imports System.Text.RegularExpressions

Namespace TestRegExp
    Class Class1
        <STAThread> _
        Shared Sub Main(ByVal args() As String)
            ' args(0) is the regular expression; args(1) is the input to test.
            If Regex.IsMatch(args(1), args(0)) Then
                Console.WriteLine("Input matches regular expression.")
            Else
                Console.WriteLine("Input DOES NOT match regular expression.")
            End If
        End Sub
    End Class
End Namespace

// C#
using System;
using System.Text.RegularExpressions;

namespace TestRegExp
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            // args[0] is the regular expression; args[1] is the input to test.
            if (Regex.IsMatch(args[1], args[0]))
                Console.WriteLine("Input matches regular expression.");
            else
                Console.WriteLine("Input DOES NOT match regular expression.");
        }
    }
}


Next, run the application to determine whether the regular expression “^\d{5}$” matches the strings “1234” and “12345”. Enclose the pattern in quotation marks, because the Windows command prompt otherwise treats the caret as an escape character and removes it. The regular expression won’t make sense now, but it will by the end of the article. Your output should resemble the following:

C:\>TestRegExp "^\d{5}$" 1234
Input DOES NOT match regular expression.

C:\>TestRegExp "^\d{5}$" 12345
Input matches regular expression.


As this code demonstrates, the Regex.IsMatch method compares a regular expression to a string and returns true if the string matches the regular expression. In this example, “^\d{5}$” means that the string must be exactly five numeric digits. As shown in Figure 1, the caret (“^”) represents the start of the string, “\d” matches a numeric digit, “{5}” requires five of them in sequence, and “$” represents the end of the string.

Figure 1: Analysis of a regular expression

If you remove the first character from the regular expression, you drastically change the meaning of the pattern. The regular expression “\d{5}$” will still match valid five-digit numbers, such as “12345”. However, it also will match the input string “abcd12345” or “drop table customers --12345”. In fact, the modified regular expression will match any input string that ends in five digits. When validating input, always begin regular expressions with a “^” character and end them with a “$” character. This practice ensures that the input exactly matches the specified regular expression and does not merely contain matching text.
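The difference between the anchored and unanchored patterns can be sketched in a few lines of C#. The class and method names below (AnchorDemo, IsFiveDigits, EndsWithFiveDigits) are hypothetical and chosen only for illustration; the only library call used is Regex.IsMatch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Anchored at both ends: the input must consist of exactly five digits.
    public static bool IsFiveDigits(string input)
    {
        return Regex.IsMatch(input, @"^\d{5}$");
    }

    // Anchored only at the end: any input that ends in five digits passes.
    public static bool EndsWithFiveDigits(string input)
    {
        return Regex.IsMatch(input, @"\d{5}$");
    }

    public static void Main()
    {
        string[] inputs = { "12345", "abcd12345", "drop table customers --12345" };
        foreach (string input in inputs)
        {
            Console.WriteLine("{0,-30} anchored: {1,-5} unanchored: {2}",
                input, IsFiveDigits(input), EndsWithFiveDigits(input));
        }
    }
}
```

Only the first input passes the anchored check, while all three pass the unanchored one, which is why the unanchored pattern is unsuitable for validation.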

Table 1 lists other characters you can use to make your regular expression match a specific location in a string.

Table 1: Characters that Match Locations in Strings

Character    Description
^            Specifies that the match must begin at the first character of the string. If you specify RegexOptions.Multiline, the ^ also matches the beginning of any line.
$            Specifies that the match must end at either the last character of the string or the last character before \n at the end of the string. If you specify RegexOptions.Multiline, the $ also matches the end of any line.
\A           Specifies that the match must begin at the first character of the string (and ignores multiple lines).
\Z           Specifies that the match must end at either the last character of the string or the last character before \n at the end of the string (and ignores multiple lines).
\z           Specifies that the match must end at the last character of the string (and ignores multiple lines).
\G           Specifies that the match must occur at the point where the previous match ended. When used with Match.NextMatch, this arrangement ensures that matches are all contiguous.
\b           Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (non-alphanumeric) characters. The match must occur on word boundaries, which are the first or last characters in words separated by any non-alphanumeric characters.
\B           Specifies that the match must not occur on a \b boundary.
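As a rough illustration of how some of these location characters behave in .NET, the following sketch compares “^” with and without RegexOptions.Multiline, “$” with “\z”, and a pattern with and without “\b”. The class name LocationDemo and the sample strings are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LocationDemo
{
    public static void Main()
    {
        string text = "first line\nsecond line";

        // Without RegexOptions.Multiline, ^ matches only at the start of the string.
        Console.WriteLine(Regex.IsMatch(text, @"^second"));                         // False
        // With RegexOptions.Multiline, ^ also matches at the start of each line.
        Console.WriteLine(Regex.IsMatch(text, @"^second", RegexOptions.Multiline)); // True

        // $ tolerates a single trailing newline; \z does not.
        Console.WriteLine(Regex.IsMatch("12345\n", @"^\d{5}$"));  // True
        Console.WriteLine(Regex.IsMatch("12345\n", @"^\d{5}\z")); // False

        // \b restricts the match to whole words.
        Console.WriteLine(Regex.IsMatch("a cat sat", @"\bcat\b"));   // True
        Console.WriteLine(Regex.IsMatch("concatenate", @"\bcat\b")); // False
    }
}
```

If a trailing newline must be rejected during validation, ending the pattern with “\z” instead of “$” is one option.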

Notice that regular expressions are case-sensitive, even in Visual Basic. Often, capitalized characters have the opposite meaning of lowercase characters.

The simplest use of regular expressions is to determine whether a string matches a pattern. For example, the regular expression “abc” matches the strings “abc”, “abcde”, or “yzabc” because each of the strings contains text that matches the pattern. No wildcards are necessary.

You can also use regular expressions to match repeated characters. The “*” symbol matches the preceding character zero or more times. For example, “to*n” matches “ton”, “tooon”, or “tn”. The “+” symbol works similarly, but it must match one or more times. For example, “to+n” matches “ton” or “tooon”, but not “tn”.

To match a specific number of repeated characters, use “{n}” where n is a non-negative integer. For example, “to{3}n” matches “tooon” but not “ton” or “tn”. To match a range of repeated characters, use “{min,max}”. For example, “to{1,3}n” matches “ton” or “tooon” but not “tn” or “toooon”. To specify only a minimum, leave the second number blank. For example, “to{3,}n” requires three or more consecutive “o” characters.

To make a character optional, use the “?” symbol. For example, “to?n” matches “ton” or “tn”, but not “tooon”. To match any single character, use “.”. For example, “to.n” matches “totn” or “tojn” but not “ton” or “tn”.

To match one of several characters, use brackets. For example, “to[ro]n” would match “toon” or “torn”, but not “ton” or “toron”. You can also match a range of characters. For example, “to[o-r]n” matches “toon”, “topn”, “toqn”, or “torn”, but would not match “toan” or “toyn”.
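One way to experiment with these patterns is to run each one against a set of sample words and print the words that match. The sketch below does so; the class name QuantifierDemo and the word list are hypothetical, and each pattern is wrapped in “^” and “$” so that the whole word must match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    public static void Main()
    {
        // Patterns from the text, anchored so that the entire input must match.
        string[] patterns =
        {
            "^to*n$", "^to+n$", "^to{3}n$", "^to{1,3}n$",
            "^to?n$", "^to.n$", "^to[o-r]n$"
        };
        string[] inputs = { "tn", "ton", "toon", "tooon", "toooon", "torn" };

        foreach (string pattern in patterns)
        {
            // Print the pattern, then every sample word it matches.
            Console.Write("{0,-12}", pattern);
            foreach (string input in inputs)
            {
                if (Regex.IsMatch(input, pattern))
                    Console.Write(" " + input);
            }
            Console.WriteLine();
        }
    }
}
```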

Table 2 summarizes the regular expression characters used to match multiple characters or a range of characters.

Table 2: Wildcards and Character Ranges Used in Regular Expressions

Character    Description
*            Matches the preceding character or sub-expression zero or more times. For example, “zo*” matches “z” and “zoo”. The “*” character is equivalent to “{0,}”.
+            Matches the preceding character or sub-expression one or more times. For example, “zo+” matches “zo” and “zoo”, but not “z”. The “+” character is equivalent to “{1,}”.
?            Matches the preceding character or sub-expression zero or one time. For example, “do(es)?” matches the “do” in “do” or “does”. The “?” character is equivalent to “{0,1}”.
{n}          The n is a non-negative integer. Matches exactly n times. For example, “o{2}” does not match the “o” in “Bob” but does match the two “o”s in “food”.
{n,}         The n is a non-negative integer. Matches at least n times. For example, “o{2,}” does not match the “o” in “Bob” and does match all the “o”s in “foooood”. The sequence “o{1,}” is equivalent to “o+”. The sequence “o{0,}” is equivalent to “o*”.
{n,m}        The m and n are non-negative integers, where “n <= m”. Matches at least n and at most m times. For example, “o{1,3}” matches the first three “o”s in “fooooood”. “o{0,1}” is equivalent to “o?”. Note that you cannot put a space between the comma and the numbers.
?            When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string “oooo”, “o+?” matches a single “o”, whereas “o+” matches all “o”s.
.            Matches any single character except “\n”. To match any character including the “\n”, use a pattern such as “[\s\S]”.
x|y          Matches either x or y. For example, “z|food” matches “z” or “food”. “(z|f)ood” matches “zood” or “food”.
[xyz]        A character set. Matches any one of the enclosed characters. For example, “[abc]” matches the “a” in “plain”.
[a-z]        A range of characters. Matches any character in the specified range. For example, “[a-z]” matches any lowercase alphabetic character in the range “a” through “z”.
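The difference between greedy and non-greedy quantifiers is easiest to see by inspecting the text that a pattern actually captures. The following sketch uses Regex.Match and the Value property of the returned Match; the class name GreedyDemo, the helper FirstMatch and the sample HTML string are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Greedy: o+ consumes every available "o".
        Console.WriteLine(FirstMatch("oooo", "o+"));  // oooo
        // Non-greedy: o+? stops after the first "o".
        Console.WriteLine(FirstMatch("oooo", "o+?")); // o

        string html = "<b>bold</b> and <i>italic</i>";
        // Greedy: runs from the first "<" to the last ">".
        Console.WriteLine(FirstMatch(html, "<.+>"));  // <b>bold</b> and <i>italic</i>
        // Non-greedy: stops at the first ">".
        Console.WriteLine(FirstMatch(html, "<.+?>")); // <b>
    }
}
```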

Regular expressions also provide special characters to represent common character ranges. You could use “[0-9]” to match any numeric digit, or you can use “\d”. Similarly, “\D” matches any character that is not a digit. Use “\s” to match any white-space character, and use “\S” to match any non-white-space character. Table 3 summarizes these characters.

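Table 3: Characters Used in Regular Expressions

Character    Description
\d           Matches a digit character. Equivalent to “[0-9]” for ASCII input.
\D           Matches a non-digit character. Equivalent to “[^0-9]” for ASCII input.
\s           Matches any white-space character, including space, tab and form-feed. Equivalent to “[ \f\n\r\t\v]” for ASCII input.
\S           Matches any non-white-space character. Equivalent to “[^ \f\n\r\t\v]” for ASCII input.
\w           Matches any word character, including underscore. Equivalent to “[A-Za-z0-9_]” for ASCII input.
\W           Matches any nonword character. Equivalent to “[^A-Za-z0-9_]” for ASCII input.

By default, the .NET Framework applies these characters to Unicode as well, so “\d” also matches decimal digits from other scripts and “\w” also matches accented letters. Specify RegexOptions.ECMAScript to restrict them to the ASCII sets shown.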
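In the .NET Framework, these shorthand characters are Unicode-aware by default, which matters when validating input that must contain only ASCII digits or letters. The sketch below compares “\d” and “\w” with their explicit ASCII ranges and with the RegexOptions.ECMAScript option; the class name ShorthandDemo and the sample inputs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShorthandDemo
{
    public static void Main()
    {
        // The digits one through five written in Arabic-Indic script (U+0661 to U+0665).
        string arabicIndic = "\u0661\u0662\u0663\u0664\u0665";

        // By default, \d matches any Unicode decimal digit.
        Console.WriteLine(Regex.IsMatch(arabicIndic, @"^\d{5}$"));                          // True
        // An explicit range matches only the ASCII digits 0 through 9.
        Console.WriteLine(Regex.IsMatch(arabicIndic, @"^[0-9]{5}$"));                       // False
        // RegexOptions.ECMAScript restricts \d to the ASCII digits.
        Console.WriteLine(Regex.IsMatch(arabicIndic, @"^\d{5}$", RegexOptions.ECMAScript)); // False
        Console.WriteLine(Regex.IsMatch("12345", @"^\d{5}$", RegexOptions.ECMAScript));     // True

        // Likewise, \w matches accented letters unless ECMAScript behavior is requested.
        Console.WriteLine(Regex.IsMatch("\u00E9", @"^\w$"));                          // True
        Console.WriteLine(Regex.IsMatch("\u00E9", @"^\w$", RegexOptions.ECMAScript)); // False
    }
}
```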

To match a group of characters, surround the characters with parentheses. For example, “foo(loo){1,3}hoo” would match “fooloohoo” and “fooloolooloohoo” but not “foohoo” or “foololohoo”. Similarly, “foo(loo|roo)hoo” would match either “fooloohoo” or “fooroohoo”. You can apply any wildcard or other special character to a group of characters.

You also can name groups to refer to the matched data later. To name a group, use the format “(?<name>pattern)”. For example, the regular expression “foo(?<mid>loo|roo)hoo” would match “fooloohoo”. Later, you could reference the group “mid” to retrieve “loo”. If you used the same regular expression to match “fooroohoo”, “mid” would contain “roo”.
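Retrieving a named group might look like the following sketch, which uses Regex.Match and the Groups collection of the resulting Match object. The class name NamedGroupDemo and the helper GetMid are hypothetical names introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns the text captured by the "mid" group, or null if the input does not match.
    public static string GetMid(string input)
    {
        Match match = Regex.Match(input, @"^foo(?<mid>loo|roo)hoo$");
        return match.Success ? match.Groups["mid"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(GetMid("fooloohoo"));       // loo
        Console.WriteLine(GetMid("fooroohoo"));       // roo
        Console.WriteLine(GetMid("foohoo") == null);  // True
    }
}
```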

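Regular expressions can be used to match complex input patterns, too. The following regular expression, which should be entered as a single line, matches many common e-mail address formats. Because it limits the final domain label to between two and four letters, it rejects addresses in longer top-level domains:

^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$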
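A minimal sketch of how this pattern could be used for validation follows. It stores the pattern as a single verbatim string and passes it to Regex.IsMatch; the class name EmailDemo, the helper IsEmail and the sample addresses are assumptions made for illustration, and the pattern itself is a simplification rather than a complete check of address syntax.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailDemo
{
    // The e-mail pattern from the article, written on one line as a verbatim string.
    private const string EmailPattern =
        @"^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$";

    public static bool IsEmail(string input)
    {
        return Regex.IsMatch(input, EmailPattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsEmail("user@example.com"));    // True
        Console.WriteLine(IsEmail("user@[192.168.0.1]"));  // True
        Console.WriteLine(IsEmail("user@example"));        // False: no dot after the @
        Console.WriteLine(IsEmail("user@example.museum")); // False: final label longer than four letters
    }
}
```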

Tony Northrup is the author of MCTS Self-Paced Training Kit (Exam 70-536): Microsoft .NET Framework 2.0 Application Development Foundation (Microsoft Press 2006). This book was created to help you master the .NET Framework Version 2.0 using either Visual Basic or C#, and to help you become a Microsoft Certified Technology Specialist by passing the 70-536 certification exam.