Without a title - 4.2 Data-structures

4.2 Data-structures

YAILS is a full memory incremental learning program. This means that all seen examples are kept in memory thus enabling the validation of future modifications of the theory. A key point on the design of the program was the choice of data-structures which should enable not only an efficient way of storing a possibly big quantity of examples, but also the efficient motion through the search space (i.e. application of search operators).

Basically there were two hypoteses, either to store the examples as they were given or to transform them into a more efficient representation. The possible transformation was a table of selectors (TOS) upon which each existing selector was stored with references to the examples that contained it. As an example of such transformations see the following figure :

Example nr. 1 :: color = white /\ height= 1.75 /\ hair = blond /\ eyes = blue
Example nr. 2 :: color = black /\ height= 1.8 /\ hair = black /\ eyes = brown
Example nr. 3 :: color = white /\ height= 1.75 /\ hair = black /\ eyes = brown

Equivalent TOS :

color = white [1,3]
color = black [2]
height = 1.75 [1,3]
height = 1.8 [2]
hair = black [2,3]
hair = blonde [1]
eyes = brown [2,3]
eyes = blue [1]

The chosen representation was the TOS. There are several advantages of this representation over the former. First of all if we assume that the examples have a reasonable ammount of communalities (which is reasonable) then this representation can be more efficient in terms of space. Secondely, and more important, this representation optimizes one of the more important and frequent operations on YAILS. That is finding what examples satisfy a certain complex (a candidate rule). This is true because the final rules are much more general than the examples. Being so the search of YAILS is done most part on the upper (more general) part of the search space. Thus having a representation near this final format of the rules optimizes the search. In effect to known which examples satisfy the complex "color = black /\ eyes = brown" we just need to intersect the list of the respective selectors (on the previous example [2] Ç [2,3] = [2]). This operation is done thousands of times during learning by YAILS so being more effective may represent several seconds or even minutes on learning time. As an example the Lymphography domain has 18 attributes and the obtained rules have 2 to 3 selectors on average. This means that the search is done far away from the given examples thus we can see how much more appropriate is the TOS representation.

In effect the operation of finding the coverage of one complex can became even more effective if we adopt another representation for the information about the examples that satisfy each entry in the TOS. That is using a set of ordered list, one for each concept. So we can have something like this :

The reason for the inverse order instead of standard order lies on the fact that examples have growing id's so this way the insertion of an id on such ordered list is straightforward (put it on the head).

All the main data structures are stored in YAP internal database which is very quick having several indexing mechanisms to enhance its performance. Bellow I present the description in BNF of the main data structures used by YAILS :

EXAMPLE ::: (CLASS,COMPLEX)

RULE ::: (CLASS,COMPLEX,COVERAGE_LIST,QUALITY)

COMPLEX ::: [ ( SELECTOR )^* ]

SELECTOR ::: ATTR OP VALUE | (ATTR OP_I1 number,ATTR OP_I2 number)

COVERAGE_LIST ::: [ ( (CLASS,number,[ ( EX_ID )⁺ ]) )^* ]

QUALITY ::: number

EX_ID ::: number

ATTR ::: symbol

VALUE ::: number | symbol

OP ::: = | > | >= | < | =<

OP_I1 ::: > | >=

OP_I2 ::: < | =<

Note : The second element (number) of the COVERAGE_LIST structure represents the number of examples on the following list (third element). It's there for reasons of efficiency.

<< , >> , up , Title , Contents