YAP 7.1.0
All Classes Files Functions Variables Typedefs Enumerations Enumerator Macros Modules Pages
regexp.yap
Go to the documentation of this file.
1/*************************************************************************
2* *
3* YAP Prolog *
4* *
5* Yap Prolog was developed at NCCUP - Universidade do Porto *
6* *
7* Copyright L.Damas, V.S.Costa and Universidade do Porto 1985-1997 *
8* *
9*************************************************************************/
10
11/**
12 * @file regexp.yap
13 *
14 * @author VITOR SANTOS COSTA <vsc@VITORs-MBP.lan>
15 * from BSD Unix work.
16 * @date Wed Nov 18 00:27:52 2015
17 *
18 * @brief Support for Regular Expressions in YAP
19 *
20 *
21*/
22
23
24:- module(regexp, [
27 ]).
28
29
30/** @defgroup regexp Regular Expressions
31@ingroup YAPLibrary
32@{
33
34This library includes routines to determine whether a regular expression
35matches part or all of a string. The routines can also return which
36parts parts of the string matched the expression or subexpressions of
37it. This library relies on Henry Spencer's `C`-package and is only
38available in operating systems that support dynamic loading. The
39`C`-code has been obtained from the sources of FreeBSD-4.0 and is
40protected by copyright from Henry Spencer and from the Regents of the
41University of California (see the file library/regex/COPYRIGHT for
42further details).
43
44Much of the description of regular expressions below is copied verbatim
45from Henry Spencer's manual page.
46
47A regular expression is zero or more branches, separated by ``|`. It
48matches anything that matches one of the branches.
49
50A branch is zero or more pieces, concatenated. It matches a match for
51the first, followed by a match for the second, etc.
52
53A piece is an atom possibly followed by `\*`, `+`, or `?`. An atom
54followed by `\*` matches a sequence of 0 or more matches of the atom.
55An atom followed by `+` matches a sequence of 1 or more matches of the
56atom. An atom followed by `?` matches a match of the atom, or the
57null string.
58
59An atom is a regular expression in parentheses (matching a match for the
60regular expression), a range (see below), `.` (matching any single
61character), `^` (matching the null string at the beginning of the
62input string), `$` (matching the null string at the end of the input
63string), a `\` followed by a single character (matching that
64character), or a single character with no other significance (matching
65that character).
66
67A range is a sequence of characters enclosed in `[]`. It normally
68matches any single character from the sequence. If the sequence begins
69with `^`, it matches any single character not from the rest of the
70sequence. If two characters in the sequence are separated by `-`,
71this is shorthand for the full list of ASCII characters between them
72(e.g. `[0-9]` matches any decimal digit). To include a literal `]`
73in the sequence, make it the first character (following a possible
74`^`). To include a literal `-`, make it the first or last
75character.
76
77*/
78
79 /**
80 @pred regexp(+ _RegExp_,+ _String_,+ _Opts_)
81
82Match regular expression _RegExp_ to input string _String_
83according to options _Opts_. The options may be:
84
85+ `nocase`: Causes upper-case characters in _String_ to
86be treated as lower case during the matching process.
87
88
89
90*/
91
92/** @pred regexp(+ _RegExp_,+ _String_,+ _Opts_,? _SubMatchVars_)
93
94
95Match regular expression _RegExp_ to input string _String_
96according to options _Opts_. The variable _SubMatchVars_ should
97be originally unbound or a list of unbound variables all will contain a
98sequence of matches, that is, the head of _SubMatchVars_ will
99contain the characters in _String_ that matched the leftmost
100parenthesized subexpression within _RegExp_, the next head of list
101will contain the characters that matched the next parenthesized
102subexpression to the right in _RegExp_, and so on.
103
104The options may be:
105
106+ `nocase`: Causes upper-case characters in _String_ to
107be treated as lower case during the matching process.
108+ `indices`: Changes what is stored in
109 _SubMatchVars_. Instead of storing the matching characters from
110 _String_, each variable will contain a term of the form _IO-IF_
111giving the indices in _String_ of the first and last characters in
112the matching range of characters.
113
114
115
116In general there may be more than one way to match a regular expression
117to an input string. For example, consider the command
118
119```
120 regexp("(a*)b*","aabaaabb", [], [X,Y])
121```
122Considering only the rules given so far, _X_ and _Y_ could end up
123with the values `"aabb"` and `"aa"`, `"aaab"` and
124`"aaa"`, `"ab"` and `"a"`, or any of several other
125combinations. To resolve this potential ambiguity `regexp` chooses among
126alternatives using the rule `first then longest`. In other words, it
127considers the possible matches in order working from left to right
128across the input string and the pattern, and it attempts to match longer
129pieces of the input string before shorter ones. More specifically, the
130following rules apply in decreasing order of priority:
131
132+ If a regular expression could match two different parts of an
133input string then it will match the one that begins earliest.
134
135+ If a regular expression contains "|" operators then the leftmost matching sub-expression is chosen.
136
137+ In \*, +, and ? constructs, longer matches are chosen in preference to shorter ones.
138
139+ In sequences of expression components the components are considered from left to right.
140
141In the example above, `"(a\*)b\*"` matches `"aab"`: the
142`"(a\*)"` portion of the pattern is matched first and it consumes
143the leading `"aa"`; then the `"b\*"` portion of the pattern
144consumes the next `"b"`. Or, consider the following example:
145
146```
147 regexp("(ab|a)(b*)c", "abc", [], [X,Y,Z])
148```
149
150After this command _X_ will be `"abc"`, _Y_ will be
151`"ab"`, and _Z_ will be an empty string. Rule 4 specifies that
152`"(ab|a)"` gets first shot at the input string and Rule 2 specifies
153that the `"ab"` sub-expression is checked before the `"a"`
154sub-expression. Thus the `"b"` has already been claimed before the
155`"(b\*)"` component is checked and `(b\*)` must match an empty string.
156
157
158
159
160 */
161:- load_foreign_files([regexp], [], init_regexp).
162
163regexp(RegExp, String, Opts) :-
164 length(RegExp, LRE),
165 length(String, LS),
166 check_opts(Opts,0,IOpts,regexp(RegExp, String, Opts)),
167 check_regexp(RegExp,LRE,String,LS,IOpts).
168
169regexp(RegExp, String, Opts, OUT) :-
170 length(RegExp, LRE),
171 length(String, LS),
172 check_out(OUT,0,Count,regexp(RegExp, String, Opts, OUT)),
173 check_opts(Opts,0,IOpts,regexp(RegExp, String, Opts, OUT)),
174 check_regexp(RegExp,LRE,String,LS,IOpts,OUT,Count).
175
176%
177% OUT must be bound to a list of unbound variables.
178% Check this and count how many.
179%
180check_out(V,_,_,_) :- var(V), var.
181check_out([],I,I,_) :- check_out.
182check_out([V|L],I0,IF,G) :- check_out,
183 (nonvar(V) -> throw(error(uninstantiation_error(V),G)) ; throw),
184 I is I0+1,
185 check_out(L,I,IF,G).
186check_out(OUT,_,_,G) :-
187 throw(error(uninstantiation_error(OUT),G)).
188
189%
190% Option processing
191%
192check_opts(V,_,_,G) :- var(V), var,
193 throw(error(instantiation_error,G)).
194check_opts([],I,I,_) :- check_opts.
195check_opts([A|L],I0,IF,G) :- check_opts,
196 process_opt(A,I1,G),
197 I is I0+I1,
198 check_opts(L,I,IF,G).
199check_opts(Opts,_,_,G) :-
200 throw(error(type_error(variable,Opts),G)).
201
202process_opt(V,_,G) :- var(V), var,
203 throw(error(instantiation_error,G)).
204process_opt(nocase,1,_) :- process_opt.
205process_opt(indices,2,_) :- process_opt.
206process_opt(I,_,G) :-
207 throw(error(domain_error(flag_value,regexp_options+I),G)).
208
209
210/** @} */
211
throw(+ Ball)
load_foreign_files( Files, Libs, InitRoutine)
nonvar( T)
var( T)
length(? L,? S)
regexp(+ RegExp,+ String,+ Opts)
regexp(+ RegExp,+ String,+ Opts,? SubMatchVars)