Name:Introduction to Regular Expressions
Category:Regular Expressions
Author:Conor
Tip:

Regular expressions are a powerful tool found throughout Linux. They have largely the same syntax in a variety of languages and programs, so learning them is very handy. The syntax I describe below is common to Perl, sed, and a variety of Linux utilities, though each tool supports a slightly different subset. Sadly, it doesn't work so well for IDL.

The most basic usage is a simple match. To find text, you would use a command like "m/match this string/" (Perl syntax). To execute a find and replace, you typically do something like "s/find this string/replace with this/g". So for example the Unix command:

> sed "s/ /,/g" < filein.txt > fileout.txt

will replace all spaces with commas in the file "filein.txt", and write the output to "fileout.txt".

Regular expressions have a number of useful special characters. The most common are ?, *, ., and +. A question mark means that the character preceding it may or may not be present. An asterisk means to match the character preceding it 0 or more times. A + means to match the preceding character 1 or more times. A period matches any single character except a newline. In sed, ? and + are only special in extended mode, which the -E flag turns on. So for instance you could modify the previous example to do this:

> sed -E "s/ +/,/g" < filein.txt > fileout.txt

Now sed will replace each run of one or more contiguous spaces with a single comma, accomplishing something like:

1234.00 2343.11 4321.33 => 1234.00,2343.11,4321.33
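
If you want to try these special characters without touching any files, here is a minimal sketch in Python, whose re module uses the same Perl-style syntax. The sample strings and variable names are made up for illustration. Note that re.sub replaces every match by default, so it behaves like a substitution with the "g" flag.

```python
import re

# Hypothetical sample line with runs of one or more spaces between values.
line = "1234.00   2343.11 4321.33"

# " +" matches one or more consecutive spaces, so each run becomes one comma.
collapsed = re.sub(r" +", ",", line)

# A bare " " matches a single space, so a run of three spaces gives three commas.
one_for_one = re.sub(r" ", ",", line)

# "?" makes the preceding character optional.
optional = re.findall(r"colou?r", "color colour")

# "*" allows zero or more of the preceding character.
star = re.findall(r"ab*", "a ab abbb")

# "." matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot c\nt")

print(collapsed)    # 1234.00,2343.11,4321.33
print(one_for_one)  # 1234.00,,,2343.11,4321.33
print(optional)     # ['color', 'colour']
print(star)         # ['a', 'ab', 'abbb']
print(dot)          # ['cat', 'cot']
```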

Regular expressions also have a number of useful character classes. These are (mainly) \w, \W, \d, \D, \s, and \S. Use \w to match any word character (a-z, A-Z, 0-9, or underscore). \d matches any numeric character (0-9), and \s matches any whitespace character (spaces, tabs, etc). The uppercase variants match the exact opposite: \W matches any non-word character, \D matches anything that isn't a digit, and \S matches anything that is not whitespace. These all work in Perl; support in other tools varies (GNU sed, for example, understands \w and \s but not \d).
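
A quick way to see what each class picks up is to run it over a short string. This is a small Python sketch with a made-up sample string. Python's re module supports all six classes, which is not true of every tool.

```python
import re

# Hypothetical sample text containing letters, digits, an underscore, and a tab.
text = "run_42 at 10:30\tdone"

# \w+ grabs runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)

# \d+ grabs runs of digits.
digits = re.findall(r"\d+", text)

# \s+ matches runs of whitespace, including the tab.
no_space = re.sub(r"\s+", "_", text)

# \D matches anything that is not a digit; deleting those leaves only digits.
only_digits = re.sub(r"\D", "", text)

print(words)        # ['run_42', 'at', '10', '30', 'done']
print(digits)       # ['42', '10', '30']
print(no_space)     # run_42_at_10:30_done
print(only_digits)  # 421030
```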

Finally, you can make up your own character classes using brackets. For instance [abc] will match any one of the letters a, b, or c. So the code s/[abc]/d/g acting on the phrase "cabs are costly" would return "ddds dre dostly".
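
The bracket substitution can be checked with a short Python sketch (re.sub here stands in for s/.../.../g; the variable names are just for illustration). Every a, b, or c is replaced, including the "a" in "are". The second call combines the class with + so that a whole run of matching letters collapses into one replacement.

```python
import re

phrase = "cabs are costly"

# Each single a, b, or c is replaced by d, wherever it appears.
each_letter = re.sub(r"[abc]", "d", phrase)

# With +, a run of a/b/c characters is replaced by a single d.
each_run = re.sub(r"[abc]+", "d", phrase)

print(each_letter)  # ddds dre dostly
print(each_run)     # ds dre dostly
```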

Taken together, these are all very powerful and can do lots for you. Considering that regular expressions are built into tools all over Linux, learning them will be worth your time.

Oh, also: the "g" after the final slash tells it to do a global replace, that is, to replace all instances rather than just the first one. If you add an "i" it becomes case-insensitive (i.e. s/something/value/gi).
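
To see the difference the flags make, here is a rough Python equivalent. Python has no trailing "g" or "i" letters, so this sketch assumes the closest counterparts: count=1 for a replace without "g", the default for "g", and re.IGNORECASE for "i". The sample text is made up.

```python
import re

text = "something Something something"

# Like s/something/value/ : only the first match is replaced.
first_only = re.sub(r"something", "value", text, count=1)

# Like s/something/value/g : every match is replaced (case-sensitive).
all_matches = re.sub(r"something", "value", text)

# Like s/something/value/gi : every match, ignoring case.
any_case = re.sub(r"something", "value", text, flags=re.IGNORECASE)

print(first_only)   # value Something something
print(all_matches)  # value Something value
print(any_case)     # value value value
```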