Basic Patterns
The following is a basic description of regular expressions.
A regular expression pattern is a sequence of elements, each of which matches a successive portion of a character string. For example, plain letters are elements that match the same letters in the string. The asterisk indicates that the previous element should be matched 0 or more times. So the pattern abcd matches only the exact text abcd in the string, while the pattern ab*cd matches the letter a, followed by 0 or more occurrences of the letter b, followed by the letters cd.
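As a rough illustration, the sketch below tries the two patterns from this paragraph with Python's re module. Python is an assumption made for these examples only; its engine is not the one this guide describes, although it handles these simple patterns the same way.

```python
import re

# A pattern of plain letters must appear in the string exactly as written.
exact = re.search(r"abcd", "xxabcdxx")

# b* allows 0 or more b characters between a and cd.
samples = ["acd", "abcd", "abbbcd", "abxcd"]
starred = [s for s in samples if re.fullmatch(r"ab*cd", s)]

print(exact.group())  # abcd
print(starred)        # ['acd', 'abcd', 'abbbcd']
```

The string abxcd is rejected because x is neither another b nor the start of cd.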
Characters
Non-special characters match exactly. Non-special characters are anything other than:
[ ] ( ) { } $ ^ . * + ? | \
A special character is included as simple text by preceding it with a backslash.
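A minimal sketch of escaping, again using Python's re module as an assumed stand-in. The sample strings are made up for illustration.

```python
import re

text = "total: 3.50 (approx)"

# Unescaped, . matches any character, so the pattern also accepts "3x50".
loose = re.search(r"3.50", "3x50")

# Preceded by a backslash, . matches only a literal period.
strict_miss = re.search(r"3\.50", "3x50")
strict_hit = re.search(r"3\.50", text)

# Parentheses are special as well, so they are escaped the same way.
parens = re.search(r"\(approx\)", text)

print(loose is not None, strict_miss is None)   # True True
print(strict_hit.group(), parens.group())       # 3.50 (approx)
```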
Character sets
The special character . matches any character (except the null character, 0{a.).
The special characters ^ and $ match the start and end of lines.
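The line anchors can be tried with the sketch below, which assumes Python's re module as a stand-in. Two Python-specific details differ from the description above: ^ and $ match at every line boundary only when the re.MULTILINE flag is given, and . does not match a newline unless re.DOTALL is set.

```python
import re

text = "one\ntwo\nthree"

# re.MULTILINE (a Python-specific flag) lets ^ and $ match at each line.
line_starts = re.findall(r"^.", text, flags=re.MULTILINE)
line_ends = re.findall(r".$", text, flags=re.MULTILINE)

print(line_starts)  # ['o', 't', 't']
print(line_ends)    # ['e', 'o', 'e']
```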
Sets of characters are defined by enclosing the list of characters in brackets:
[aeiou] matches a single vowel character
Ranges can also be included within the brackets:
[a-z] matches any lower case letter
Combinations of the above are acceptable:
[a-zA-Z13579] matches any lowercase letter, uppercase letter, or odd digit
Fixed sets (classes) of characters can be included in the list, as a name within bracket-colon pairs:
[#[:digit:]abc] matches the character #, a digit, or any of the letters a, b, or c
The character classes defined are:
alnum    alphanumeric
alpha    alphabetic
blank    tab and space
cntrl    control characters
digit    digits
graph    printable (except space)
lower    lowercase
print    printable
punct    punctuation
space    whitespace
upper    uppercase
xdigit   hex digits
If a set begins with ^, then the set matches any character not listed in it.
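The set forms above can be sketched as follows, assuming Python's re module as a stand-in. Python's re does not recognise the bracket-colon class names, so the explicit range 0-9 takes the place of [:digit:] here. The sample strings are made up for illustration.

```python
import re

text = "Route 66, exit 9B"

vowels = re.findall(r"[aeiou]", text)          # a list of characters
mixed = re.findall(r"[a-zA-Z13579]", text)     # ranges plus single characters
negated = re.findall(r"[^a-zA-Z0-9]", text)    # leading ^ inverts the set

# Stand-in for [#[:digit:]abc], with 0-9 written out in place of the class.
hash_digit_abc = re.findall(r"[#0-9abc]", "#7 cab")

print(vowels)          # ['o', 'u', 'e', 'e', 'i']
print(negated)         # [' ', ',', ' ', ' ']
print(hash_digit_abc)  # ['#', '7', 'c', 'a', 'b']
```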
Subexpressions
A series of elements may be combined by enclosing them in parentheses. Subexpressions are affected by closures such as * just as simple characters are:
([a-z][0-9])* matches any number of occurrences of a letter followed by a digit
A search for a pattern returns a match for the overall pattern and a separate match for each subexpression.
A \ followed by a digit, N, matches the same substring which occurred in the Nth subexpression:
([[:digit:]]+)#\1 matches one or more digits, followed by a #, followed by the same string of digits
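A small sketch of subexpressions and back-references, assuming Python's re module as a stand-in and writing [0-9] in place of [[:digit:]], which Python's re does not support.

```python
import re

# A closure applied to a parenthesised subexpression repeats the whole group.
pairs_ok = re.fullmatch(r"([a-z][0-9])*", "a1b2c3")
pairs_bad = re.fullmatch(r"([a-z][0-9])*", "a1b")

# \1 must repeat exactly what the first subexpression matched.
backref = re.compile(r"([0-9]+)#\1")
same = backref.search("id 42#42 ok")
different = backref.fullmatch("42#43")

whole = same.group(0)  # the overall match
first = same.group(1)  # the text matched by the first subexpression

print(pairs_ok is not None, pairs_bad is None)  # True True
print(whole, first, different)                  # 42#42 42 None
```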
Closures
A * following an element matches 0 or more occurrences of that element:
[aeiou]* matches 0 or more vowels
A + following an element matches 1 or more occurrences of that element:
[[:alpha:]]+ matches 1 or more alphabetic characters
A ? following an element matches 0 or 1 occurrences of that element:
-?[[:digit:]]+ matches an optional hyphen, followed by 1 or more digits
An interval expression, {m,n}, follows an element to allow it to match at least m, and no more than n, occurrences of the element:
[[:digit:]]{3,5} matches 3, 4, or 5 digits
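The four closures can be compared side by side with the sketch below. It assumes Python's re module as a stand-in, with explicit ranges in place of the bracket-colon classes; the helper name matches is hypothetical.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = matches(r"[aeiou]*", ["", "aei", "xyz"])        # 0 or more
plus = matches(r"[a-zA-Z]+", ["", "abc", "a1"])        # 1 or more
optional = matches(r"-?[0-9]+", ["42", "-42", "--42"]) # 0 or 1 hyphen
interval = matches(r"[0-9]{3,5}", ["12", "123", "12345", "123456"])

print(star)      # ['', 'aei']
print(plus)      # ['abc']
print(optional)  # ['42', '-42']
print(interval)  # ['123', '12345']
```

Note that * accepts the empty string while + does not, which is the only difference between the two.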
Alternation
Multiple regular expressions can be separated with a vertical bar | to match any of them:
print|list|exit matches any of the strings print, list, and exit
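A minimal sketch of alternation, assuming Python's re module as a stand-in. The word list and sentence are made up for illustration.

```python
import re

command = re.compile(r"print|list|exit")

words = ["print", "list", "exit", "quit"]
accepted = [w for w in words if command.fullmatch(w)]

# search finds whichever alternative occurs in the text.
found = command.search("please list the files").group()

print(accepted)  # ['print', 'list', 'exit']
print(found)     # list
```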
Matches
When searching for a pattern in a string, it is possible to find multiple substrings which match the pattern. The one that is returned is the one which starts earliest in the string. If more than one match starts at the same place, the longest one is returned.
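The earliest-then-longest rule can be sketched as below. Python's re module is assumed as a stand-in, and it does not follow this rule itself: for alternatives it returns the first one that succeeds. The hypothetical helper leftmost_longest therefore tries each alternative separately at each starting position and keeps the longest result. It is only an approximation of the rule for simple alternatives, not a description of how the engine in this guide works.

```python
import re

def leftmost_longest(alternatives, text):
    """Return (start, length) of the earliest match; ties go to the longest."""
    compiled = [re.compile(p) for p in alternatives]
    for start in range(len(text) + 1):
        lengths = []
        for p in compiled:
            m = p.match(text, start)  # match only at this starting position
            if m:
                lengths.append(m.end() - start)
        if lengths:
            return start, max(lengths)
    return -1, 0  # no alternative matched anywhere (a choice of this sketch)

rule = leftmost_longest(["print", "printer"], "the printer")
python_default = re.search(r"print|printer", "the printer").group()

print(rule)            # (4, 7)
print(python_default)  # print
```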
Even after a particular match is located, there may be several ways to divide it among the subexpressions that make it up. As a rule, subexpressions that begin earlier in the string are made as long as possible.
The result of a match is a table which describes the match. The first row covers the whole match, and subsequent rows describe where the subexpressions in the pattern match in the string. Each row has two elements: the index of the first character of the match, and the length of the match. Any row for a subexpression which doesn't participate in the match is filled with _1 0.
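A table of this shape can be approximated with the sketch below, which assumes Python's re module as a stand-in. The helper match_table is hypothetical. Python reports a start of -1 for a subexpression that took no part in the match, which this sketch assumes corresponds to the _1 0 row described above. Returning an empty list when nothing matches is a choice of the sketch, not something the text above specifies.

```python
import re

def match_table(pattern, text):
    """Return (start, length) rows: row 0 is the whole match,
    followed by one row per subexpression."""
    m = re.search(pattern, text)
    if m is None:
        return []
    rows = []
    for n in range(m.re.groups + 1):
        start, end = m.span(n)  # (-1, -1) if group n did not participate
        rows.append((start, end - start))
    return rows

table = match_table(r"(a+)|(b+)", "xxbbb")
print(table)  # [(2, 3), (-1, 0), (2, 3)]
```

Here the first subexpression (a+) takes no part in the match, so its row is (-1, 0), while the whole match and the second subexpression both start at index 2 with length 3.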