Regular Expressions
Tuesday, October 7th, 2008
Regular expressions (regexes) are one of those concepts that sound innocuous, turn out to be frighteningly complex when you approach them, but aren’t that big a deal when you actually get to know them.
The idea behind a regex is quite simple: it is a concise sequence of symbols that describes a whole class of strings exactly. For example, a regex could be used to represent “a sequence of characters that begins with the letter C” or “a sequence beginning with b and ending with d, with any number of x’s in between.” Regexes are extremely expressive, and can come in handy at odd times.
Regular expressions are built on simple rules. The following is not a comprehensive list, but should provide an idea of what regular expressions look like:
- Letters and digits represent themselves. So do a large number of punctuation characters. Matching is case-sensitive.
- A dot “.” represents a single instance of any character.
- An asterisk “*” indicates that the preceding character may be repeated zero or more times.
- A plus “+” indicates that the preceding character may be repeated one or more times.
- A caret “^” is an anchor for the start of the line.
- A dollar “$” is an anchor for the end of the line.
- In many Unix tools (GNU grep and vi, for example), “\<” and “\>” are anchors for the start and end of a word respectively.
- Et cetera.
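The rules above are easy to try out. Here is a minimal sketch using Python's `re` module, picked purely for illustration; the sample patterns and strings are made up. One difference to keep in mind: Python spells a word boundary as `\b` rather than using separate start-of-word and end-of-word anchors.

```python
import re

# Each entry: (pattern, text, whether a match is expected somewhere in text)
checks = [
    (r"c.t", "cat", True),               # dot matches any single character
    (r"c.t", "ct", False),               # ...but exactly one is required
    (r"ab*c", "ac", True),               # star: zero or more of the preceding character
    (r"ab+c", "ac", False),              # plus: at least one is required
    (r"ab+c", "abbbc", True),
    (r"^cat", "concat", False),          # caret anchors to the start of the line
    (r"cat$", "concat", True),           # dollar anchors to the end of the line
    (r"\bcat\b", "a cat sat", True),     # word boundaries around a whole word
    (r"\bcat\b", "concatenate", False),  # "cat" buried inside a longer word
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in checks]

for (pattern, text, expected), got in zip(checks, results):
    print(f"{pattern!r:12} on {text!r:14} -> {got}")
```

Each pattern is written as a raw string (`r"..."`) so that backslashes reach the regex engine untouched.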
For example, ^Cof*e+$ would match Coffee, Coeeeeee or Coffffffe but not Coffff, coffee or Cofeen. Regexes can be much more complicated in practice, but the basics are sufficient for many common cases.
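The coffee example can be checked mechanically. This is a small sketch in Python (an assumption of convenience; any regex-aware tool would do) that runs the same pattern over the six words mentioned above and keeps the ones that match.

```python
import re

# The pattern from the text: C, o, zero or more f's, one or more e's, end of line
pattern = re.compile(r"^Cof*e+$")

words = ["Coffee", "Coeeeeee", "Coffffffe", "Coffff", "coffee", "Cofeen"]

# Keep only the words the pattern accepts
matched = [word for word in words if pattern.search(word)]
rejected = [word for word in words if not pattern.search(word)]

print("matched: ", matched)
print("rejected:", rejected)
```

Coffff fails because it has no e at all, coffee because the match is case-sensitive, and Cofeen because the trailing n sits between the e's and the end of the line.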
The most important advantage of understanding regexes is that it opens up the doors to a huge collection of Unix tools, such as grep, sed and awk. Most Unix text editors also support regexes to some degree.
While grep is the best known of these tools (it finds lines that match a given expression), sed, aka ‘the stream editor’, is perhaps the most useful, because it can actually manipulate text. For instance, when I migrated more than a hundred old posts into this blog a couple of weeks ago, I needed to replace a whole bunch of <div> tags with <p> tags. That’s when sed came in useful: it took just ten minutes and a single command to get the job done.
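To give a flavour of both jobs, here is a rough Python sketch of a grep-like filter and a sed-like substitution. It is not the command actually used for the migration, which is not shown here; the sample lines and the pattern are assumptions for illustration. The pattern uses one symbol not listed earlier: “?” makes the preceding character optional, so `/?` covers both opening and closing tags.

```python
import re

# Hypothetical sample input standing in for lines of an old post
lines = [
    "<div>First post</div>",
    "<p>Already a paragraph</p>",
    '<div class="entry">Second post</div>',
]

# grep-like: keep only the lines that contain an opening or closing div tag
div_lines = [line for line in lines if re.search(r"</?div\b", line)]

# sed-like: swap the tag name, preserving the optional slash and any attributes.
# \1 in the replacement re-inserts whatever the parenthesised group matched.
converted = [re.sub(r"<(/?)div\b", r"<\1p", line) for line in lines]

for line in converted:
    print(line)
```

A line-by-line regex like this is fine for simple, regular markup such as the tags in this example; it is not a general HTML parser.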




