Er. alokpandey's Blog

Know about Regular expressions

Posted in ASP.NET (C# & VB), C#, HTML and XHTML, J-Query, JavaScript, SEO, VB by Alok Kumar Pandey on July 22, 2010

Wtf is a regular expression?

Regular expressions are strings formatted using a special pattern notation that allow you to describe and parse text. Many programmers (even some good ones) disregard regular expressions as line noise, and it’s a damned shame because they come in handy so often. Once you’ve gotten the hang of them, regular expressions can be used to solve countless real world problems.

Regular expressions work a lot like the filename “globs” in Windows or *nix that allow you to specify a number of files by using the special * or ? characters (oops, did I just use a glob while defining a glob?). But with regular expressions the special characters, or metacharacters, are far more expressive.

Like globs, regular expressions treat most characters as literal text. The regular expression mike, for example, will only match the letters m – i – k – e, in that order. But regular expressions sport an extensive set of metacharacters that make the simple glob look downright primitive.

Meet the metacharacters: ^[](){}.*?\|+$ and sometimes –

I know, they look intimidating, but they’re really nice characters once you get to know them.

The Line Anchors: ‘^’ and ‘$’

The ‘^’ (caret) and ‘$’ (dollar) metacharacters represent the start and end of a line of text, respectively. As I mentioned earlier, the regular expression mike will match the letters m – i – k – e, but it will match anywhere in a line (e.g. it would match “I’m mike”, or even “carmike”). The ‘^’ character is used to anchor the match to the start of the line, so ^mike would only find lines that start with mike. Similarly, the expressionmike$ would only find m – i – k – e at the end of a line (but would still match ‘carmike’).

If we combine the two line anchor characters we can search for lines of text that contain a specific sequence of characters. The expression ^mike$ will only match the word mike on a line by itself – nothing more, nothing less. Similarly the expression ^$ is useful for finding empty lines, where the beginning of the line is promptly followed by the end.

The Character Class: ‘[]‘

Square brackets, called a character class, let you match any one of several characters. Suppose you want to match the word ‘gray’, but also want to find it if it was spelled ‘grey’. A character class will allow you to match either. The regular expressiongr[ea]y is interpreted as “g, followed by r, followed by either an e or an a, followed by y.”

If you use [^ ... ] instead of [ ... ], the class matches any character that isn’tlisted. The leading ^ “negates” the list. Instead of listing all of the characters you want to included in the class, you list the characters you don’t want included. Note that the ^ (caret) character used here has a different meaning when it’s used outside of a character class, where it is used to match the beginning of a line.

The Character Class Metacharacter: ‘-’

Within a character-class, the character-class metacharacter ‘-’ (dash) indicates a range of characters. Instead of [01234567890abcdefABCDEF] we can write [0-9a-fA-F]. How convenient. The dash is a metacharacter only within a character class, elsewhere it simply matches the normal dash character.

But wait, there’s a catch. If a dash is the first character in a character class it is not considered a metacharacter (it can’t possibly represent a range, since a range requires a beginning and an end), and will match a normal dash character. Similarly, the question mark and period are usually regex metacharacters, but not when they’re inside a class (in the class [-0-9.?] the only special character is the dash between the 0 and 9).

Matching Any Character With a Dot: ‘.’

The ‘.’ metacharacter (called a dot or point) is shorthand for a character class that matches any character. It’s very convenient when you want to match any character at a particular position in a string. Once again, the dot metacharacter is not a metacharacter when it’s inside of a character class. Are you beginning to see a pattern? The list of metacharacters is different inside and outside of a character class.

The Alternation Metacharacter: ‘|’

The ‘|’ metacharacter, (pipe) means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. The subexpressions are called alternatives.

For example, Mike and Michael are separate expressions, but Mike|Michael is one expression that matches either.

Parenthesis can be used to limit the scope of the alternatives. I could shorten our previous expression that matched Mike or Michael with creative use of parenthesis. The expression Mi(ke|chael) matches the same thing. I probably would use the first expression in practice, however, as it is more legible and therefore more maintainable.

Matching Optional Items: ‘?’

The ‘?’ metacharacter (question mark) means optional. It is placed after a character that is allowed, but not required, at a certain point in an expression. The question mark attaches only to the immediately preceding character.

If I wanted to match the english or american versions of the word ‘flavor’ I could use the regex flavou?r, which is interpreted as “f, followed by l, followed by a, followed by v, followed by o, followed by an optional u, followed by r.”

The Other Quantifiers: ‘+’ and ‘*’

Like the question mark, the ‘+’ (plus) and ‘*’ (star) metacharacters affect the number of times the preceding character can appear in the expression (with ‘?’ the preceding character could appear 0 or 1 times). The metacharacter ‘+’ matches one or more of the immediately preceding item, while ‘*’ matches any number of the preceding item, including 0.

If I was trying to determine the score of a soccer match by counting the number of times the announcer said the word ‘goal’ in a transcript, I might use the regular expression go+al, which would match ‘goal’, as well as ‘gooooooooooooooooal’ (but not ‘gal’).

The three metacharacters, question mark, plus, and star are called quantifiers because they influence the quantity of the item they’re attached to.

The Interval Quantifier: ‘{}’

The ‘{min, max}’ metasequence allows you to specify the number of times a particular item can match by providing your own minimum and maximum. The regex go{1,5}alwould limit our previous example to matching between one and five o’s. The sequence {0,1} is identical to a question mark.

The Escape Character: ‘\’

The ‘\’ metacharacter (backslash) is used to escape metacharacters that have special meaning so you can match them in patterns. For example, if you would like to match the ‘?’ or ‘\’ characters, you can precede them with a backslash, which removes their meaning: ‘\?’ or ‘\\’.

When used before a non-metacharacter a backslash can have a different meaning depending on the flavor of regular expression you’re using. For perl compatible regular expressions (PCREs) you can check out the perldoc page for perl regular expressions. PCREs are extremely common, this flavor of regexes can be used in PHPRuby, andECMAScript/Javascript, and many other languages.

Using Parenthesis for Matching: ‘()’

Most regular expression tools will allow you to capture a particular subset of an expression with parenthesis. I could match the domain portion of a URL by using an expression like http://(%5B^/]+). Let’s break this expression down into it’s components to see how it works.

The beginning of the expression is fairly straightforward: it matches the sequence “h – t – t – p – : – / – /”. This initial sequence is followed by parenthesis, which are used tocapture the characters that match the subexpression they surround. In this case the subexpression is ‘[^/]+’, which matches any character except for ‘/’ one or more times. For a URL like http://immike.net/blog/Some-blog-post, ‘immike.net’ will be captured by the parenthesis.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: