Detailed description of the data sets used

Abalone – The task in this data set is to predict the age of abalones (actually the number of rings in an abalone) from other biometric measurements of the animals. The data can be obtained from the UCI machine learning repository (Merz & Murphy, 1996).

Pole – Telecommunications data set used in Weiss & Indurkhya (1995), kindly provided by Nitin Indurkhya. Details on the task are not available due to its commercial character.

Elevators and Ailerons – Two data sets from a control problem. The task consists of flying an F-16 fighter, and the two data sets contain traces of control actions on two variables (ailerons and elevators). The data was collected and provided by Rui Camacho (rcamacho@garfield.fe.up.pt).

CompActiv. and CompActiv(s) – These two data sets were obtained from the Delve repository (http://www.cs.toronto.edu/~delve/). They contain a collection of computer activity measures recorded in a university department. The tasks consist of predicting the usr computation time from different measures of computer activity.

Kinematics – This data set was obtained from the Delve repository (http://www.cs.toronto.edu/~delve/). The task consists of predicting the forward kinematics of an 8-link robot arm. Among the available data sets for this task we have used the "8nm" variant, which has 8 input variables, high non-linearity and medium noise.

Fried1 – This is an artificial data set created by Jerome Friedman (Friedman, 1991), also described in Breiman (1996). The example is meant to simulate the impedance of an alternating current circuit. The data consists of 4 continuous input variables and the target is given by:

y = (x₁² + (x₂x₃ – 1/(x₂x₄))²)^(1/2) + ε

where
0 ≤ x₁ ≤ 100 ;
20 ≤ x₂/2π ≤ 280 ;
0 ≤ x₃ ≤ 1 ;
1 ≤ x₄ ≤ 11 and ε is a normally distributed noise signal.
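
As a rough illustration, the sampling scheme for Fried1 could be sketched in Python as follows. The target expression used here is the impedance formula as given for this problem in Breiman (1996); treat it as an assumption to be checked against that paper. The names fried1_target and fried1_sample are hypothetical, and the standard deviation of the noise is not stated in this description, so the noise_sd default is only a placeholder.

```python
import math
import random


def fried1_target(x1, x2, x3, x4):
    """Noise-free impedance target (assumed form, see the note above)."""
    return math.sqrt(x1 ** 2 + (x2 * x3 - 1.0 / (x2 * x4)) ** 2)


def fried1_sample(rng, noise_sd=1.0):
    """Draw one case: four uniform inputs and a noisy target.

    noise_sd is a placeholder; the description does not give its value.
    """
    x1 = rng.uniform(0.0, 100.0)
    # The stated range applies to x2 / (2*pi), so x2 itself is scaled by 2*pi.
    x2 = rng.uniform(20.0 * 2.0 * math.pi, 280.0 * 2.0 * math.pi)
    x3 = rng.uniform(0.0, 1.0)
    x4 = rng.uniform(1.0, 11.0)
    y = fried1_target(x1, x2, x3, x4) + rng.gauss(0.0, noise_sd)
    return (x1, x2, x3, x4), y


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sketch is repeatable
    for _ in range(3):
        print(fried1_sample(rng))
```

Passing the random generator explicitly keeps the sketch repeatable; any other source of uniform and normal variates would serve equally well.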

Census16H and Census8L – These two data sets were obtained from the Delve repository (http://www.cs.toronto.edu/~delve/). They were designed from data provided by the Census Bureau (http://www.census.gov/) of the USA. From the 4 available tasks we have used "house-price-8L", which is considered an easier task of predicting housing prices based on 8 input variables, and "house-price-16H", which is considered a more difficult task. Both data sets use the same original cases but completely different input variables.

2Dplanes – This is an artificial domain described in Breiman et al. (1984), p.238.

Mv1 – This is an artificial data set generated using the following rules :

IF x₂ > 2 THEN y = 35 – 0.5 x₄

IF –2 ≤ x₄ ≤ 2 THEN y = 10 – 2 x₁

IF x₇ = yes THEN y = 3 – x₁/x₄

IF x₈ = normal THEN y = x₆ + x₁

ELSE y = x₁/2

where
x₁ is a uniformly distributed real over the interval –5..5 ;
x₂ is a uniformly distributed real over the interval –15..–10 ;
x₃ has value "green" if x₁ > 0, else it has value "red" with 0.4 probability and value "brown" with 0.6 probability;
x₄ has value x₁/2 with probability 0.2; otherwise it has value x₁ + x₂ × 2 if x₃ = "green", or else value x₂/2;
x₅ is a uniformly distributed real over the interval –1..1 ;
x₆ is given by x₄ times a uniformly distributed real over 0..5;
x₇ has value "yes" with probability 0.3, else it has value "no";
x₈ has value "normal" if x₅ < 0.5, else it has value "large";
x₉ is a uniformly distributed real over the interval 100..500 ;
and x₁₀ is a uniformly distributed real over the interval 1000..1200.
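
The Mv1 generation scheme can be sketched in Python along the following lines. This is one possible reading, not the original generator. Two assumptions are made: the target rules are treated as a cascade in which the first matching rule applies, and x₄ takes the value x₁/2 with probability 0.2, otherwise x₁ + x₂ × 2 when x₃ is "green" and x₂/2 when it is not. The function names mv1_target and mv1_sample are hypothetical.

```python
import random


def mv1_target(x1, x2, x4, x6, x7, x8):
    """Apply the target rules as a cascade (assumed: first match wins)."""
    if x2 > 2:
        # With x2 drawn from -15..-10 as described, this rule never fires.
        return 35 - 0.5 * x4
    if -2 <= x4 <= 2:
        return 10 - 2 * x1
    if x7 == "yes":
        # x4 lies outside [-2, 2] here, so the division is safe.
        return 3 - x1 / x4
    if x8 == "normal":
        return x6 + x1
    return x1 / 2


def mv1_sample(rng):
    """Draw one case: a dict with inputs x1..x10 and target y."""
    x1 = rng.uniform(-5.0, 5.0)
    x2 = rng.uniform(-15.0, -10.0)
    if x1 > 0:
        x3 = "green"
    else:
        x3 = "red" if rng.random() < 0.4 else "brown"
    # Assumed reading of the x4 rule (see the note above).
    if rng.random() < 0.2:
        x4 = x1 / 2
    elif x3 == "green":
        x4 = x1 + x2 * 2
    else:
        x4 = x2 / 2
    x5 = rng.uniform(-1.0, 1.0)
    x6 = x4 * rng.uniform(0.0, 5.0)
    x7 = "yes" if rng.random() < 0.3 else "no"
    x8 = "normal" if x5 < 0.5 else "large"
    x9 = rng.uniform(100.0, 500.0)
    x10 = rng.uniform(1000.0, 1200.0)
    y = mv1_target(x1, x2, x4, x6, x7, x8)
    return {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5,
            "x6": x6, "x7": x7, "x8": x8, "x9": x9, "x10": x10, "y": y}


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the sketch is repeatable
    print(mv1_sample(rng))
```

Note that x₃, x₅, x₉ and x₁₀ never enter the target rules directly; x₃ and x₅ influence y only through x₄ and x₈.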

References :

Breiman, L. (1996): Bagging Predictors. Machine Learning, 24, pp. 123-140. Kluwer Academic Publishers.

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984): Classification and Regression Trees. Wadsworth Int. Group, Belmont, California, USA.

Friedman, J. (1991): Multivariate Adaptive Regression Splines. Annals of Statistics, 19:1, pp. 1-141.

Weiss, S. & Indurkhya, N. (1995): Rule-based Machine Learning Methods for Functional Prediction. Journal of Artificial Intelligence Research (JAIR), 3, pp. 383-403.