We have conducted several experiments with four real-world domains.
The main goal of the experiments was to assess the performance of each discretization methodology when the input conditions vary. These conditions are the regression domain, the classification learning system, and the regression error measure used to evaluate the learned models. Some characteristics of the data sets used are summarized in Table 1. These data sets were obtained from the UCI machine learning database repository:
Data Set    N. Examples    N. Attributes
housing         506             13
servo           167              4
auto-mpg        398              7
machine         209              6

Table 1 - The used data sets.
As mentioned earlier, we used C4.5 and CN2 in our experiments. As regression accuracy measures we used the MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error) statistics.
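These two error measures are standard; a minimal sketch of how they are computed (the data values below are illustrative, not taken from our experiments) is:

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute deviation of predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (in %); assumes no zero target values."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative example:
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
print(mae(y_true, y_pred))   # (2 + 2 + 3) / 3 ≈ 2.333
print(mape(y_true, y_pred))  # (20% + 10% + 10%) / 3 ≈ 13.333
```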
We linked our methodologies to each of these learning algorithms, obtaining two new regression learners. We evaluated the performance of these systems with the two statistics on the chosen domains, by means of a 5-fold cross-validation test. On each iteration of this process we forced our discretization system to use each of the six alternative discretization methods, producing six different discrete data sets that were then given to C4.5 and CN2. We collected a set of true prediction errors for each of the regression models learned from the six discrete data sets. The goal of these experiments was to compare the results obtained by each discretization method under different setups on the same data. The 5-CV average predictive accuracy on the "auto-mpg" data set is given in Table 2 (the standard deviations are shown below each average):
                 VNI                         SIC
          EP       EW       KM        EP       EW       KM
MAE
  C4.5   2.877    2.796    2.783*    2.982    3.127    3.134
         0.333    0.308    0.299     0.360    0.282    0.272
  CN2    3.405    4.080    3.597     3.311    3.695    3.053*
         0.266    0.654    0.641     0.188    0.407    0.340
MAPE
  C4.5   12.500   12.556   12.200    12.072   11.600*  11.996
         1.520    1.980    1.623     2.549    2.074    1.195
  CN2    15.189   15.474   15.282    14.485   14.930   13.871*
         1.049    1.631    2.235     0.936    1.291    2.069

Table 2 - Experiments with "auto-mpg".
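The evaluation protocol described above can be sketched as follows. This is a simplified illustration, not our actual implementation: `learn` and `predict` are hypothetical stand-ins for the discretization step plus classifier induction (C4.5 or CN2), and here only MAE is computed.

```python
import random

# The six alternative discretization methods compared in the experiments.
METHODS = ["VNI+EP", "VNI+EW", "VNI+KM", "SIC+EP", "SIC+EW", "SIC+KM"]

def five_fold_indices(n, k=5, seed=0):
    """Partition n example indices into k disjoint folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def evaluate(data, learn, predict):
    """5-fold CV: on each fold, each discretization method yields its own
    model, and the true prediction error (MAE) is measured on the held-out
    fold. Returns the average error per method."""
    errors = {m: [] for m in METHODS}
    for test_idx in five_fold_indices(len(data)):
        test_set = set(test_idx)
        test = [data[j] for j in test_idx]
        train = [data[j] for j in range(len(data)) if j not in test_set]
        for m in METHODS:
            model = learn(train, m)  # discretize targets with method m, induce model
            preds = [predict(model, x) for x, _ in test]
            err = sum(abs(p - y) for p, (_, y) in zip(preds, test)) / len(test)
            errors[m].append(err)
    return {m: sum(v) / len(v) for m, v in errors.items()}
```

As a usage example, a trivial stand-in learner that always predicts the training mean can be plugged in: `evaluate(data, lambda tr, m: sum(y for _, y in tr) / len(tr), lambda model, x: model)`.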
The best score for each setup is marked with an asterisk. For space reasons we summarize the overall results in Table 3, where we present the winning strategy for each data set:
Set Up        Servo     Machine   Housing   Auto-mpg
C4.5 / MAE    VNI+KM    SIC+KM    VNI+EW    VNI+KM
CN2 / MAE     SIC+EP    SIC+KM    SIC+EW    SIC+KM
C4.5 / MAPE   VNI+KM    SIC+KM    SIC+KM    SIC+EW
CN2 / MAPE    SIC+EP    SIC+EW    VNI+KM    SIC+KM

Table 3 - Summary of overall results.
Table 4 gives the rank score of each strategy. The numbers in the columns of this table represent the number of times a strategy was ranked as the Nth best strategy. The last column gives the average ranking order for each method. The methods are presented ordered by average rank (lower values are better) :
          1st   2nd   3rd   4th   5th   6th   Avg. Rank
SIC+KM     6     2     3     1     0     4      2.24
VNI+EP     0     7     3     5     1     0      2.29
VNI+KM     4     1     3     6     2     0      2.33
SIC+EW     3     0     3     2     7     1      2.90
SIC+EP     2     3     3     5     2     3      3.10
VNI+EW     1     3     1     0     3     8      3.48

Table 4 - Rank scores of the methods.
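The rank-score tabulation behind Table 4 can be sketched as follows. This is an illustrative reconstruction of the bookkeeping, under the assumption that methods are ordered by their error on each setup (the toy data below is invented, not our experimental results):

```python
from collections import defaultdict

def rank_counts(results):
    """results: {setup: {method: error}}.
    Returns (counts, avg) where counts[m][p] is how often method m was
    ranked (p+1)-th best, and avg[m] is its average rank."""
    counts = defaultdict(lambda: [0] * 6)
    ranks = defaultdict(list)
    for setup, scores in results.items():
        ordered = sorted(scores, key=scores.get)  # lower error = better rank
        for pos, method in enumerate(ordered):
            counts[method][pos] += 1
            ranks[method].append(pos + 1)
    avg = {m: sum(r) / len(r) for m, r in ranks.items()}
    return dict(counts), avg

# Toy example with two setups and three methods:
results = {
    "C4.5/MAE": {"A": 1.0, "B": 2.0, "C": 3.0},
    "CN2/MAE":  {"B": 1.0, "A": 2.0, "C": 3.0},
}
counts, avg = rank_counts(results)
print(counts["A"])  # [1, 1, 0, 0, 0, 0]: once 1st, once 2nd
print(avg["A"])     # 1.5
```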
The main conclusion to draw from our experiments is the clear dependency of the best methodology on the experimental setup used. There is no clearly winning strategy across all domains. This supports the validity of our search-based approach to class discretization.
Table 3 shows that our selective specialization strategy (SIC) is a component of the best discretization method in most setups (11 out of 16). Looking at the average rank results (Table 4), SIC+KM is the best method, followed by VNI+EP. SIC+KM is also the method that is ranked first most often. The two methods using the EW splitting method have poor average ranks; nevertheless, they are sometimes the best, so they are not useless. In contrast, VNI+EP, the second best on average, was never the best strategy in any of our experiments. Another interesting observation is the regularity observed on the "machine" data set (see Table 3), in contrast with the other data sets.