Regular Expressions

Tuesday, October 7th, 2008

Regular expressions (regexes) are one of those concepts that sound innocuous, turn out to be frighteningly complex when you approach them, but aren’t that big a deal when you actually get to know them.

The idea behind a regex is quite simple: it is a single, concise series of symbols that describes a whole class of strings exactly. For example, a regex could be used to represent “a sequence of characters that begins with the letter C” or “a sequence beginning with b, ending with d, with any number of x’s in between.” Regexes are extremely expressive, and can come in handy at odd times.
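As a rough illustration, the two descriptions above might be written as the patterns ^C and ^bx*d$ and tried out with grep -E. The patterns and the sample words are my own assumptions for illustration; the symbols themselves are explained in the list that follows.

```shell
# Hypothetical sample words, one per line.
words='Cat
cat
bd
bxxxd
bxyd'

# Lines that begin with the letter C (matching is case-sensitive).
starts_with_c=$(printf '%s\n' "$words" | grep -E '^C')

# Lines that begin with b, end with d, and contain only x's
# (possibly none at all) in between.
b_x_d=$(printf '%s\n' "$words" | grep -E '^bx*d$')

printf '%s\n' "$starts_with_c"
printf '%s\n' "$b_x_d"
```

The first pattern should pick out only Cat, and the second only bd and bxxxd.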

Regular expressions are built on simple rules. The following is not a comprehensive list, but should provide an idea of what regular expressions look like:

  1. Letters and digits represent themselves. So do a large number of punctuation characters. Matching is case-sensitive.
  2. A dot “.” represents a single instance of any character.
  3. An asterisk “*” indicates that the preceding character may be repeated zero or more times.
  4. A plus “+” indicates that the preceding character may be repeated one or more times.
  5. A caret “^” is an anchor for the start of the line.
  6. A dollar “$” is an anchor for the end of the line.
  7. The “\<” and “\>” sequences are anchors for the start and end of a word respectively (supported by GNU grep and sed, among others).
  8. Et cetera.

For example, ^Cof*e+$ would match Coffee, Coeeeeee or Coffffffe but not Coffff, coffee or Cofeen. Regexes can be much more complicated in practice, but the basics are sufficient for many common cases.
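The claims above are easy to check from a shell. The following is only a sketch: the check function is a hypothetical helper written for this illustration, and grep is run with -E (extended regexes) because in POSIX basic regexes a bare “+” is an ordinary character.

```shell
pattern='^Cof*e+$'

# Hypothetical helper: report whether a word, taken as a whole line,
# matches the pattern. grep -q prints nothing and only sets the exit status.
check() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "$1: match"
  else
    echo "$1: no match"
  fi
}

for word in Coffee Coeeeeee Coffffffe Coffff coffee Cofeen; do
  check "$word"
done
```

The first three words should be reported as matches and the last three as non-matches: Coffff has no trailing e, coffee starts with a lowercase c, and Cofeen does not end in e.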

The most important advantage of understanding regexes is that it opens the door to a huge collection of Unix tools, such as grep, sed and awk. Most Unix text editors also support regexes to some degree.

While grep, which finds lines that match a given expression, is the best known of these tools, sed, aka ‘the stream editor’, is perhaps the most useful, because it can actually manipulate text. For instance, when I migrated more than a hundred old posts into this blog a couple of weeks ago, I needed to replace a whole bunch of <div> tags with <p> tags. That’s when sed came in useful: it took just ten minutes and a single command to get the job done.
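The exact command is not reproduced here, but a substitution of this kind might look roughly like the sketch below. It assumes plain <div> and </div> tags with no attributes, and the sample line is made up for illustration.

```shell
# Hypothetical input; the real posts had their own markup.
html='<div>First paragraph.</div><div>Second paragraph.</div>'

# s|old|new|g replaces every occurrence of old on each line.
# Using | as the delimiter avoids having to escape the / in </div>.
converted=$(printf '%s\n' "$html" | sed -e 's|<div>|<p>|g' -e 's|</div>|</p>|g')

printf '%s\n' "$converted"
```

To convert a whole file, the same two expressions could be applied to it, with the output redirected to a new file.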