Regular Expressions in Bash (and Alternatives)

2010-11-03

While cleaning up some old bash code and preparing tools-osx for release, I happened across a very useful bit of information: bash does support regular expressions! Well, at least bash 3.0 and newer do.

I first learned regular expressions in Perl, so I’ve pined for =~ in other scripting languages ever since. With bash, I, like most others, get by most of the time by piping things through grep for matching and sed for replacements, but the bane of my existence has always been capturing groups (capturing parentheses).

For example, let’s say we want to grab just the volume name out of a path like /Volumes/Macintosh HD/Users/Shared/. The following regular expression would be perfect for that:

^/Volumes/([^/]+)

That says match a string that starts with “/Volumes/” followed by one or more characters that are not “/” (capturing the one or more characters that are not “/”). So, if we were to match that against the aforementioned example path, it would capture:

Macintosh HD

Well, now I know that you can do this using bash 3.0+’s built-in regular expression support:

if [[ "/Volumes/Macintosh HD/Users/Shared/" =~ ^/Volumes/([^/]+) ]]; then
	vol="${BASH_REMATCH[1]}"
fi
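
To see exactly what the match leaves behind, here is a small sketch built on the snippet above. The variable names path and whole are mine, chosen just for illustration: element 0 of BASH_REMATCH holds the entire matched text and element 1 holds the first capturing group.

```bash
path="/Volumes/Macintosh HD/Users/Shared/"

if [[ "$path" =~ ^/Volumes/([^/]+) ]]; then
	whole="${BASH_REMATCH[0]}"   # the entire text the pattern matched
	vol="${BASH_REMATCH[1]}"     # the first capturing group only
	echo "matched: $whole"
	echo "volume:  $vol"
else
	echo "no match"
fi
```

With the example path this should report /Volumes/Macintosh HD as the whole match and Macintosh HD as the captured volume.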

Very straightforward for those who are familiar with regular expressions. However, it took me a while to get that to even work. Why? I assumed that I needed to quote the regular expression (in bash, quoting is extremely important). The first tutorial I was going by pulled the regex from command line input and used it from a variable, so that offered little evidence for or against quoting the regular expression, but another that I found clearly was quoting the regular expression. Eventually I read the comments on the latter tutorial and there were some that found the regular expression worked in single quotes and some found that it had to be left unquoted.

For me, on Mac OS X 10.5 Leopard, bash regular expressions have to be left unquoted.
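
One way to sidestep the quoting question is to keep the pattern in a variable and expand it unquoted, much like that first tutorial did. This is only a sketch (the names path, re and quoted are mine), and the second half assumes bash 3.2 or newer, where a quoted right-hand side is compared as a literal string instead of being treated as a regular expression.

```bash
path="/Volumes/Macintosh HD/Users/Shared/"
re='^/Volumes/([^/]+)'   # these quotes only protect the assignment

vol=""
if [[ "$path" =~ $re ]]; then   # $re is expanded unquoted, so it acts as a regex
	vol="${BASH_REMATCH[1]}"
fi
echo "$vol"

# Assumption: bash 3.2+. Quoting the pattern makes it a literal string,
# and the path contains no literal "^" or "(", so this does not match.
if [[ "$path" =~ "$re" ]]; then
	quoted="matched"
else
	quoted="no match"
fi
echo "$quoted"
```

Keeping the pattern in a variable also means special characters such as spaces never have to be escaped inside the [[ ]] itself.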

Note: bash 3.0+’s built-in regular expressions are, like grep -E or egrep, POSIX extended regular expressions, not full Perl-compatible regular expressions, so make sure you understand the differences in syntax.
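
As a quick illustration of one such difference: Perl’s \d shorthand is not part of POSIX extended syntax, so a bracket expression such as [0-9] or the [[:digit:]] class is the portable spelling. The sample string and the names str, re, major and minor below are made up for this sketch.

```bash
str="Mac OS X 10.5 Leopard"
re='([[:digit:]]+)\.([[:digit:]]+)'   # POSIX class in place of Perl's \d

major=""
minor=""
if [[ "$str" =~ $re ]]; then
	major="${BASH_REMATCH[1]}"   # digits before the dot
	minor="${BASH_REMATCH[2]}"   # digits after the dot
fi
echo "major=$major minor=$minor"
```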

So, now comes the big caveat with all of this newfound power and why it’s taken so long for me to discover it: bash 3.0 and newer have only become common in the last few years, so they’re not available everywhere yet. I looked through the Mac OS X source code and found that only Mac OS X 10.5 Leopard and 10.6 Snow Leopard include bash 3.0 or newer. Mac OS X 10.4 Tiger (including 10.4.11) and earlier all had bash 2.05 or earlier. So, you should really only use bash’s built-in regular expression support if you know the environment will have version 3.0 or newer.
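
If a script has to find out at run time which kind of bash it is running under, one possible check is the BASH_VERSINFO array, whose first element is the major version number. This is a detection sketch only, and have_regex is a name I made up. I have not tested how bash 2.x reacts to merely parsing a line that contains =~, so the regex code itself may need to live somewhere an old bash never reads.

```bash
# BASH_VERSINFO[0] holds the major version of the running bash.
if [ -n "${BASH_VERSINFO[0]}" ] && [ "${BASH_VERSINFO[0]}" -ge 3 ]; then
	have_regex=yes
else
	have_regex=no
fi
echo "regex support: $have_regex"
```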

I know, it certainly dashed my hopes a bit too.

In Which We Come to Understand an Alternative

However, all is not lost: there is a rudimentary alternative in read. It’ll never be as powerful as regular expressions, but it can allow simple captures like the example discussed above. Let me just throw you into the deep end and see if I can then explain how to swim.

Again, here’s that bash regular expression code snippet I came up with to parse the volume name out of a path:

if [[ "/Volumes/Macintosh HD/Users/Shared/" =~ ^/Volumes/([^/]+) ]]; then
	vol="${BASH_REMATCH[1]}"
fi

And here’s that same capture using read:

IFS=/ read -r -d '' _ _ vol _ <<< "/Volumes/Macintosh HD/Users/Shared/"

Wow, it’s certainly more compact, but it doesn’t look like it contains much actual functionality, right? Just a couple switches and some underscores.

Let’s step through it, argument by argument:

  1. IFS=/ – Characters found in $IFS are word delimiters, so we’re setting our delimiter to “/”.
  2. read – Well, that’s the read command we’re calling to pull all this off.
  3. -r – Specify “raw” input (no backslash escaping).
  4. -d '' – Use an empty string as the line delimiter instead of a newline. Bash treats that as the NUL character, which never appears in our input, so read consumes the entire input.
  5. _ _ vol _ – This is the confusing part: it’s where we tell read which variable to store each field in. Let’s break it down further:
    1. _ – The first character of our input string is a “/” (and so is our delimiter), so the first field is going to match an empty string (everything between the start and the first “/”, i.e. nothing), so we’ll just dump that in $_ to discard it.
    2. _ – The second match is going to be “Volumes” (everything between the first “/” and the second “/”), but we don’t care about that either, so discard it into $_ as well.
    3. vol – The third match (everything between the second “/” and third “/”) is what we’re actually looking for (the volume name), so we’ll store that in $vol.
    4. _ – The fourth field and everything after it (the rest of the input following the third “/”) is also nothing we care about, so toss it into $_ too.
  6. <<< – This is bash’s “here string” operator. It sends the string that follows to the command as standard input.
  7. "/Volumes/Macintosh HD/Users/Shared/" – This is the string we want to run through read to capture from.

Putting it back together a bit, we’d have something like this:

  1. IFS=/ – Split on the “/”.
  2. read -r -d '' _ _ vol _ – Store the 3rd field in $vol.
  3. <<< "/Volumes/Macintosh HD/Users/Shared/" – From the string “/Volumes/Macintosh HD/Users/Shared/”.

And, just like the regular expression code, we end up with the following match stored in $vol:

Macintosh HD
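
To make the field-by-field behavior easier to see, here is the same call with named variables in place of the first two underscores. The names lead and top are mine, purely for illustration. One thing worth knowing: read reports a non-zero exit status here, because it reaches the end of the input before it ever finds the NUL delimiter that -d '' asks for. The variables are still filled in.

```bash
path="/Volumes/Macintosh HD/Users/Shared/"

# "|| true" swallows the non-zero status read returns at end of input.
IFS=/ read -r -d '' lead top vol _ <<< "$path" || true

echo "lead=[$lead]"   # empty: nothing comes before the first "/"
echo "top=[$top]"     # the text between the first and second "/"
echo "vol=[$vol]"     # the text between the second and third "/"
```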

Okay, you may have caught on that the read example was not actually the exact same capture as the regular expression was. Here’s why: the string doesn’t have to start with “/Volumes/”. We could’ve matched against “/Users/Shared/” and it would’ve captured “Shared”. That’s not going to cut it!

Fortunately, we can just wrap the call to read in a string comparison of the first nine characters of the path (offset 0, length 9) against “/Volumes/”, like so:

path="/Volumes/Macintosh HD/Users/Shared/"
if [ "${path:0:9}" = "/Volumes/" ]; then
	IFS=/ read -r -d '' _ _ vol _ <<< "$path"
fi
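
If this gets used in more than one place, it could be wrapped up in a function. The sketch below is one possible shape, not a tested library routine: volume_of is a hypothetical name, and it uses a case pattern for the prefix check instead of the substring comparison. It also trims the newline that the here string appends, which would otherwise end up in the result when the volume name is the last thing in the path.

```bash
# volume_of is a hypothetical helper name used only for this sketch.
volume_of() {
	local path="$1" vol=""
	case "$path" in
		/Volumes/?*)
			# "|| true" swallows the non-zero status read returns at end of input.
			IFS=/ read -r -d '' _ _ vol _ <<< "$path" || true
			# The here string appends a newline; strip it if it landed in vol.
			vol=${vol%$'\n'}
			;;
	esac
	# Print the volume name, or return non-zero if none was found.
	[ -n "$vol" ] && printf '%s\n' "$vol"
}

volume_of "/Volumes/Macintosh HD/Users/Shared/"
volume_of "/Users/Shared/" || echo "not under /Volumes"
```

The first call should print the volume name, and the second should fall through to the message after the ||.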

Not so scary now, I hope, and far more backwards compatible with older versions of bash.

If you’re looking to capture from a string that can be reasonably split on a delimiter, like we did with the “/”, read is an excellent alternative to regular expressions (esp. when paired with other string comparisons). That said, if you know you can rely on having bash 3.0+, by all means, use the regular expressions!
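
The same trick carries over to other delimiters. As a rough sketch, here is a colon-separated record split in one go. The record is made-up sample data in the style of a passwd entry, and the variable names are mine. Since the input is a single line, the -d '' switch is left out and read simply stops at the newline.

```bash
# Made-up sample record, colon-delimited like a passwd entry.
entry="root:x:0:0:System Administrator:/var/root:/bin/sh"

# The second field is discarded into $_; the rest get named variables.
IFS=: read -r user _ uid gid name home shell <<< "$entry"

echo "user=$user uid=$uid shell=$shell"
echo "name=$name"
```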
