The Format of the Data Sets


The data sets in this repository are provided in two separate files. One file with extension *.data contains the cases available for training, while the other file with extension *.domain contains information on the variables (attributes) used to describe each case.

The file with the training cases (*.data), contains one case per line. Each line is a set of attribute values separate either by comma, space or tab characters. The last value in this list is the goal variable value (i.e. the variable to be predicted by the learned model).

The file with the information on the used variables (*.domain) contains as many lines as there are variables. Each line describes one variable. The order of the lines is exactly the order in which the variable values appear in each case of the data file. This means that the first line describes the variable that appears in the first position in the list of values of each case, and so on. Thus the last line describes the goal variable.

The description of each variable has the following format:
variable_name : variable_type.

The types of the variables can be:

An example:


smail
LIACC-CIUP
R. Campo Alegre, 823
4150 PORTO
PORTUGAL
UP
LIACC
MLgroup
Luís Torgo
email : ltorgo@liacc.up.pt
WWW : http://www.liacc.up.pt/~ltorgo
phone : (+351) 2 6078830
fax : (+351) 2 6003654