From Schmid.wiki

Tutorial

Regular expressions (regexes) are patterns used to find specific strings in text. The syntax described here is defined in the POSIX 1003.2 standard as modern or extended regular expressions.

Basic Syntax

Here is a really boring example of a regex:

regex    matches
hello    hello

Regexes have some symbols that have special meanings, called metacharacters. Among these are . ( ) | ? + *. The dot '.' matches any single character:

regex    matches
h.llo    hello, hallo, h8llo, h@llo, ...

The pipe '|' matches one regex or another:

regex       matches
hello|hi    hello, hi

Regexes can be grouped and nested using the parentheses '( )':

regex        matches
h(e|a)llo    hello, hallo
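The three constructs so far can be tried out with a small sketch. It assumes Python's re module, which is not a POSIX engine but handles these particular patterns the same way; the helper name matches() is made up for this example, and re.fullmatch is used so that a candidate only counts when the whole string fits the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ["hello", "hallo", "h8llo", "hllo", "hi"]

dot_hits = matches(r"h.llo", words)        # '.' stands for exactly one character
pipe_hits = matches(r"hello|hi", words)    # either alternative
group_hits = matches(r"h(e|a)llo", words)  # alternation limited to the group

print(dot_hits)
print(pipe_hits)
print(group_hits)
```

Note that "hllo" is never matched by h.llo, because the dot needs exactly one character to consume.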

If you want to match a string with a character that may or may not occur, you can use the ? metacharacter:

regex    matches
mpe?g    mpg, mpeg

If you want to match strings where a character occurs 0 or more times, you can use the '*' metacharacter. Similarly, the '+' metacharacter matches the preceding character 1 or more times:

regex    matches
ba*h     bh, bah, baah, baaah, ...
ba+h     bah, baah, baaah, ...
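Here is a quick way to check the three quantifiers ?, * and + against a handful of candidates. This is only a sketch using Python's re module (an assumption; any engine with extended syntax should behave alike for these patterns), and the variable names are illustrative.

```python
import re

candidates = ["bh", "bah", "baah", "baaah"]

# '*' allows zero or more 'a', '+' insists on at least one
star_hits = [s for s in candidates if re.fullmatch(r"ba*h", s)]
plus_hits = [s for s in candidates if re.fullmatch(r"ba+h", s)]

# '?' allows the 'e' to be present once or not at all
optional_hits = [s for s in ["mpg", "mpeg", "mpeeg"] if re.fullmatch(r"mpe?g", s)]

print(star_hits)
print(plus_hits)
print(optional_hits)
```

The only difference between the first two results is "bh", which has no 'a' at all.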

You can even specify an exact number or a range of repetitions with curly brackets '{ }'. Inside the brackets you can write a single number, a range, or an open-ended interval. POSIX only defines the forms {n}, {n,} and {n,m}; the form {,n} is accepted by some engines as a shorthand for {0,n}, which is the portable spelling:

regex       matches
ba{3}h      baaah
ba{2,3}h    baah, baaah
ba{,2}h     bh, bah, baah
ba{2,}h     baah, baaah, baaaah, ...
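The interval forms can be compared side by side by counting how many a's each pattern tolerates. The sketch below assumes Python's re module and uses a made-up helper, accepted_counts(); it spells the "at most" case as {0,2}, which should work in any engine.

```python
import re

# "bh", "bah", "baah", ... up to five a's
candidates = ["b" + "a" * n + "h" for n in range(6)]

def accepted_counts(pattern):
    """Return the numbers of 'a' for which the pattern matches (hypothetical helper)."""
    return [s.count("a") for s in candidates if re.fullmatch(pattern, s)]

exact = accepted_counts(r"ba{3}h")      # exactly three
ranged = accepted_counts(r"ba{2,3}h")   # two or three
at_most = accepted_counts(r"ba{0,2}h")  # none, one or two
at_least = accepted_counts(r"ba{2,}h")  # two or more

print(exact, ranged, at_most, at_least)
```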

A set of characters can be specified using the '[ ]' operators. Inside the brackets, you write the characters that may be matched. Conversely, '[^ ]' matches any character not inside the brackets. Furthermore, a range of characters can be specified with syntax like '[a-z]':

regex          matches
h[ea]llo       hello, hallo
b[^abcdef]h    bgh, bhh, b5h, b@h, ...
h[a-c]llo      hallo, hbllo, hcllo
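To see exactly which characters a bracket expression lets through, one can simply try every character in turn. The following is a sketch under the assumption that Python's re module is used; allowed() is a hypothetical helper, and the test alphabet is an arbitrary choice.

```python
import re
import string

# Characters to try in the bracket position: a-z, 0-9 and '@'
alphabet = string.ascii_lowercase + string.digits + "@"

def allowed(pattern, prefix, suffix):
    """Return the characters c for which prefix + c + suffix matches (hypothetical helper)."""
    return "".join(c for c in alphabet if re.fullmatch(pattern, prefix + c + suffix))

set_hits = allowed(r"h[ea]llo", "h", "llo")       # only 'a' and 'e'
range_hits = allowed(r"h[a-c]llo", "h", "llo")    # 'a' through 'c'
negated_hits = allowed(r"b[^abcdef]h", "b", "h")  # everything except 'a' to 'f'

print(set_hits)
print(range_hits)
print(negated_hits)
```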

Metacharacters

These are the most important of the metacharacters:

. ^ $   ( ) |   ? + * { }   [ ] [^ ]

The caret '^' and the dollar sign '$' match the start and the end of the string, respectively. They do not consume any characters; they only require the match to sit at that position.
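The effect of the two anchors is easiest to see when searching inside a longer string. This sketch assumes Python's re module, where re.search looks for a match anywhere in the text unless an anchor pins it down; the variable names are illustrative.

```python
import re

text = "hello world"

anywhere = re.search(r"world", text) is not None    # found somewhere in the string
at_start = re.search(r"^world", text) is not None   # 'world' is not at the start
at_end = re.search(r"world$", text) is not None     # 'world' is at the end
whole = re.search(r"^hello$", text) is not None     # the string is more than just 'hello'

print(anywhere, at_start, at_end, whole)
```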

Using the metacharacters as ordinary characters:

Of course, if you want to match one of the characters . ^ $ ( ) | ? + * { } [ ] as an ordinary character, you need a trick. The trick is to prefix the character with a backslash '\', which is called escaping the character:

regex                 matches
\?+                   ?, ??, ???, ...
www\.h[ea]llo\.org    www.hello.org, www.hallo.org

If you want to match a backslash, you have to escape it as well: '\\'.
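The difference between an escaped and an unescaped dot shows up as soon as the input contains something other than a dot. The sketch below assumes Python's re module; raw strings (r"...") are used so that Python itself leaves the backslashes alone. re.escape is a Python convenience that adds the backslashes for you and is not part of the regex syntax described here.

```python
import re

# Unescaped '.' is a wildcard, so it also accepts 'x' in place of the dots
wildcard = re.fullmatch(r"www.hello.org", "wwwxhelloxorg") is not None

# Escaped '\.' only accepts a literal dot
literal_ok = re.fullmatch(r"www\.hello\.org", "www.hello.org") is not None
literal_bad = re.fullmatch(r"www\.hello\.org", "wwwxhelloxorg") is not None

# A single backslash in the input is matched by the regex \\
backslash = re.fullmatch(r"\\", "\\") is not None

# Let the library escape the parentheses in a literal string
auto_pattern = re.escape("cos(x)")
auto_ok = re.fullmatch(auto_pattern, "cos(x)") is not None

print(wildcard, literal_ok, literal_bad, backslash, auto_ok)
```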

Combinations

The basic syntax can be combined for more sophisticated matching:

regex                            matches
h([ea]llo|i)                     hello, hallo, hi
(cos|sin)\([xy]\)                cos(x), sin(x), cos(y), sin(y)
.+\.(mpe?g|avi|mov|qt|wmv)       movie file names
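As an illustration, the movie file pattern can be used to filter a list of names. This is a sketch assuming Python's re module; the file names are invented for the example.

```python
import re

movie = re.compile(r".+\.(mpe?g|avi|mov|qt|wmv)")

# Hypothetical file names, chosen to exercise the different parts of the pattern
names = ["holiday.mpg", "clip.mpeg", "notes.txt", "talk.mov", ".avi", "archive.tar.wmv"]

movies = [n for n in names if movie.fullmatch(n)]
print(movies)
```

The name ".avi" is rejected because '.+' needs at least one character in front of the escaped dot, while "archive.tar.wmv" is accepted because '.+' is free to swallow the earlier dot.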

Substitution

The power of regexes really shows when doing substitution, which is basically the same as 'search and replace'. Substitution is a core feature of many editors and programming languages.

In general, the syntax for substitution is

s/search/replace/g

's' means 'substitute' and 'g' means 'global', signifying that the substitution should be done for all matches in the string. A simple example:

string         substitution command    result
abracadabra    s/a/u/g                 ubrucudubru
hi             s/i/ello/g              hello

When the search string is a regex, the substitution replaces all the substrings matching the regex:

string          substitution command               result
hi and hello    s/h([ea]llo|i)/good morning/g      good morning and good morning
hi              s/.*/hello/g                       hello
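The same substitutions can be sketched in Python, assuming its re module: re.sub(search, replace, string) replaces every match by default, so it behaves like the 'g' form, and the optional count argument limits the number of replacements.

```python
import re

# Equivalent of s/a/u/g and s/i/ello/g
all_a = re.sub(r"a", "u", "abracadabra")
hello = re.sub(r"i", "ello", "hi")

# A regex as the search string: every matching substring is replaced
greeting = re.sub(r"h([ea]llo|i)", "good morning", "hi and hello")

# Without 'g': replace only the first match
first_a = re.sub(r"a", "u", "abracadabra", count=1)

print(all_a)
print(hello)
print(greeting)
print(first_a)
```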

The replacement is not a regex, which, if you think about it, is quite understandable. However, the replacement string does have some special characters that allow more flexible substitution.

Each part of the search regex enclosed in parentheses '( )' is called a group. In the replacement string, the text that a group actually matched can be inserted with '\n', where n is the number of the group, counted from left to right. So the first group is inserted by writing '\1', and so on. A few examples:

string         substitution command                 result
hello there    s/(hello|hi) there/\1, yourself/g    hello, yourself
hi there       s/(hello|hi) there/\1, yourself/g    hi, yourself
bra            s/(.*)/a\1cada\1/g                   abracadabra
2x             s/([0-9]+)([a-z])/\2 times \1/g      x times 2
9a + 12c       s/([0-9]+)([a-z])/\2 times \1/g      a times 9 + c times 12
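Group references can be tried the same way. The sketch assumes Python's re module, which accepts '\1' and '\2' in the replacement when it is written as a raw string. One deliberate deviation: the third example uses (.+) rather than (.*), because recent Python versions also replace the empty match that '.*' finds at the end of the string, which would produce extra output.

```python
import re

# \1 re-inserts whatever the first group matched
reply = re.sub(r"(hello|hi) there", r"\1, yourself", "hi there")

# Two groups, inserted in swapped order
swapped = re.sub(r"([0-9]+)([a-z])", r"\2 times \1", "9a + 12c")

# The same group may be inserted more than once
spell = re.sub(r"(.+)", r"a\1cada\1", "bra")

print(reply)
print(swapped)
print(spell)
```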

Using Regular Expressions

Below are some programs that use regular expressions:

Each entry gives the program (function; regex type), followed by an example of usage:

vim (editor; vim variant)

    %s/\(hello\|hi\) there/\1, yourself/g

find (file finding; emacs variant)

    $ find /usr/bin -regex ".*\(.+\)/f\1."

grep (text searching; POSIX with the -E option, formerly egrep)

    $ grep -E "[abcdr]{9,}" /usr/share/dict/words

sed (text filtering; POSIX with the -E option, -r in older GNU sed)

    $ echo "hi there" | sed -E "s/(hello|hi) there/\1, yourself/g"

PHP (programming language; PCRE, since the POSIX ereg functions were removed in PHP 7)

    <?php
    print preg_replace("/(hello|hi) there/", "\\1, yourself\n", "hi there");
    ?>

Python (programming language; POSIX with extensions)

    import re
    p = re.compile(r'(hello|hi) there')
    print(p.sub(r'\1, yourself', 'hello there'))

Ruby (programming language; POSIX with extensions)

    a = "hi there"
    print a.sub(/(hello|hi) there/, '\1, yourself')

Perl (programming language; POSIX with extensions)

    $s = "hi there";
    $s =~ s/(hello|hi) there/$1, yourself/;
    print "$s\n";

awk (programming language; POSIX with extensions, gensub requires GNU awk)

    $ echo "hi there" |
      awk '{x = gensub(/(hi|hello) there/, "\\1, yourself", "g", $0);
            printf("%s -> %s\n", $0, x);}'

References