
2.1 An illustrative example

In this example we use the servo data set (see details in section 4). We have coupled RECLA with C4.5 and evaluated the learned model with the MAE statistic (see section 3). We have performed two experiments with different discretization methods. In the first experiment we used the VNI search operator with the KM splitting algorithm. Table 1 presents the discretizations of this first experiment (KM+VNI). The first column shows the number of intervals tried in each iteration[2]. The second column shows the intervals obtained by the KM splitting method; the second line of this column gives the median of the values within each interval (used as the "classes"). The last column gives the internal 5-fold CV error estimate of each tentative set of intervals. In this example we used the value 1 for the "Lookahead" parameter mentioned before. The solution of this method is thus 4 intervals (the trial with the best estimated error).


Table 1 - Trace of KM+VNI method.
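To make this search procedure more concrete, below is a minimal sketch of a VNI-style search under some assumptions: scikit-learn's KMeans stands in for the KM splitting method, a DecisionTreeClassifier stands in for C4.5, and the function names (kmeans_split, cv_mae, vni_search) and the exact stopping logic are illustrative rather than RECLA's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5

def kmeans_split(y, k):
    """Group the target values (1-D numpy array y) into k intervals with
    1-D k-means; the median of each group serves as that interval's "class"."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(y.reshape(-1, 1))
    medians = {c: np.median(y[km.labels_ == c]) for c in np.unique(km.labels_)}
    return km.labels_, medians

def cv_mae(X, y, labels, medians, folds=5):
    """Internal 5-fold CV estimate of the MAE of one tentative discretization:
    classify cases into intervals, predict each interval's median, compare with y."""
    clf = DecisionTreeClassifier(random_state=0)
    preds = cross_val_predict(clf, X, labels, cv=folds)
    return np.mean(np.abs(y - np.array([medians[c] for c in preds])))

def vni_search(X, y, lookahead=1, k_max=20):
    """Try a growing number of intervals, stopping once `lookahead`
    consecutive trials fail to improve on the best error estimate so far."""
    best_k, best_err, since_best = None, np.inf, 0
    for k in range(2, k_max + 1):
        labels, medians = kmeans_split(y, k)
        err = cv_mae(X, y, labels, medians)
        if err < best_err:
            best_k, best_err, since_best = k, err, 0
        else:
            since_best += 1
            if since_best >= lookahead:
                break
    return best_k, best_err
```

Under these assumptions, a call like vni_search(X, y, lookahead=1) would produce the kind of trace shown in Table 1, returning the number of intervals with the best internal error estimate.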

In the second experiment we used the same splitting algorithm but with the INI search operator. The results are given in Table 2. We also include the estimated error of each interval (the value in parentheses). The next iteration depends on these estimates: the intervals whose error is greater than the median of the estimates are split in two. For instance, in the third iteration we can observe that the third interval was kept from the second trial, while the others were obtained by splitting a previous interval.


Table 2 - Trace of KM+INI method.
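The selective splitting behind this trace can be sketched as follows, again only as an illustration: intervals are assumed to be represented by their (low, high) boundaries over the target values, the per-interval error estimates are assumed to come from the same internal CV, and 1-D 2-means stands in for the KM splitting of an offending interval.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_in_two(vals):
    """Split one interval's values into two sub-intervals with 1-D 2-means
    (mirroring the KM splitting method)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vals.reshape(-1, 1))
    parts = [vals[km.labels_ == c] for c in (0, 1)]
    return [(p.min(), p.max()) for p in parts if len(p) > 0]

def ini_step(y, intervals, errors):
    """One INI iteration: intervals whose estimated error exceeds the median
    of the estimates are split in two; the remaining ones are kept as-is."""
    threshold = np.median(errors)
    result = []
    for (lo, hi), err in zip(intervals, errors):
        vals = y[(y >= lo) & (y <= hi)]
        if err > threshold and len(vals) > 1:
            result.extend(split_in_two(vals))
        else:
            result.append((lo, hi))
    return sorted(result)
```

Iterating ini_step and re-estimating the error of each tentative set of intervals would give the kind of progression shown in Table 2, where well-performing intervals survive from one trial to the next.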

The two methods follow different strategies for grouping the values. In this example the second alternative led to lower error estimates, and consequently this alternative was preferred by RECLA.

An interesting effect of increasing the number of intervals is that, after some threshold, the algorithm's performance decreases. This may be caused by the decrease in the number of cases per class, which leads to unreliable estimates as a consequence of overfitting the data.

