regexp/4
regexp(+RegExp, +String, +Opts, ?SubMatchVars)
Match regular expression RegExp to input string String according to options Opts. SubMatchVars should initially be either unbound or a list of unbound variables; it will be bound to a sequence of matches. That is, the head of SubMatchVars will contain the characters in String that matched the leftmost parenthesized subexpression within RegExp, the next element of the list will contain the characters that matched the next parenthesized subexpression to the right in RegExp, and so on.
The options may be:
- nocase: Causes upper-case characters in String to be treated as lower case during the matching process.
- indices: Changes what is stored in SubMatchVars. Instead of storing the matching characters from String, each variable will contain a term of the form IO-IF giving the indices in String of the first and last characters in the matching range of characters.
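The following is a minimal sketch of how the two options might be tried out. It assumes the predicate is loaded from library(regexp) in YAP; show_match/3 is a hypothetical helper introduced only for this illustration, and the pattern and input strings are made up. No output is shown because the exact printed form of the bindings depends on how the system represents strings.

```prolog
% Assumption: YAP, with the predicate exported by library(regexp).
:- use_module(library(regexp)).

% show_match/3 is a hypothetical helper, not part of the library.
% It prints whatever SubMatchVars is bound to, or reports a failed match.
show_match(RegExp, String, Opts) :-
    (   regexp(RegExp, String, Opts, Matches)
    ->  writeq(Matches), nl
    ;   write('no match'), nl
    ).

demo_options :-
    % No options: upper-case input is compared as written.
    show_match("(b+)c", "AABBBC", []),
    % nocase: upper-case characters in String are treated as lower case.
    show_match("(b+)c", "AABBBC", [nocase]),
    % indices: index ranges are stored instead of the matched characters.
    show_match("(b+)c", "aabbbc", [indices]).
```

Calling demo_options at the top level prints one line per query, which makes it easy to compare the effect of each option on the same pattern.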
In general there may be more than one way to match a regular expression to an input string. For example, consider the command
regexp( "(a*)b*" , "aabaaabb" , [], [X,Y])
Considering only the rules given so far, X and Y could end up with the values "aabb" and "aa", "aaab" and "aaa", "ab" and "a", or any of several other combinations. To resolve this potential ambiguity regexp chooses among alternatives using the rule first then longest. In other words, it considers the possible matches in order working from left to right across the input string and the pattern, and it attempts to match longer pieces of the input string before shorter ones. More specifically, the following rules apply in decreasing order of priority:
1. If a regular expression could match two different parts of an input string then it will match the one that begins earliest.
2. If a regular expression contains "|" operators then the leftmost matching sub-expression is chosen.
3. In *, +, and ? constructs, longer matches are chosen in preference to shorter ones.
4. In sequences of expression components the components are considered from left to right.
In the example above, "(a*)b*" matches "aab": the "(a*)" portion of the pattern is matched first and it consumes the leading "aa"; then the "b*" portion of the pattern consumes the next "b". Or, consider the following example:
regexp( "(ab|a)(b*)c" , "abc" , [], [X,Y,Z])
After this command X will be "abc", Y will be "ab", and Z will be an empty string. Rule 4 specifies that "(ab|a)" gets first shot at the input string and Rule 2 specifies that the "ab" sub-expression is checked before the "a" sub-expression. Thus the "b" has already been claimed before the "(b*)" component is checked and "(b*)" must match an empty string.
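To see how the order of alternatives interacts with Rule 2, the pattern above can be compared with a variant whose alternatives are swapped. This is only a sketch: it assumes the predicate is loaded from library(regexp) in YAP, and compare_alternations/0 is a hypothetical name chosen for this illustration. If the rules apply as described, the swapped pattern "(a|ab)(b*)c" should let the "a" alternative claim only the first character, leaving the "b" for "(b*)".

```prolog
% Assumption: YAP, with the predicate exported by library(regexp).
:- use_module(library(regexp)).

% compare_alternations/0 is a hypothetical helper, not part of the library.
% It runs the documented pattern and a variant with the alternatives swapped,
% then prints both results so they can be compared side by side.
compare_alternations :-
    regexp("(ab|a)(b*)c", "abc", [], First),
    regexp("(a|ab)(b*)c", "abc", [], Second),
    writeq(First), nl,
    writeq(Second), nl.
```

Both queries pass an unbound variable as SubMatchVars, which the description above allows, so the whole list of matches is printed for each pattern.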