
1 Introduction

Machine learning (ML) researchers have traditionally concentrated their efforts on classification problems. However, many interesting real-world domains demand regression tools. In this paper we present and evaluate a discretization method that extends the applicability of existing classification systems to regression domains. Discretizing the values of the target variable provides a different granularity of predictions that can be considered more comprehensible. Indeed, it is common practice in statistical data analysis to group the observed values of a continuous variable into class intervals and work with this grouped data (Bhattacharyya & Johnson, 1977). The choice of these intervals is a critical issue: too many intervals impair the comprehensibility of the models, while too few hide important features of the variable's distribution. The methods we propose automatically find the optimal number and width of these intervals. The motivation for transforming regression into classification is to obtain a different trade-off between the comprehensibility and the accuracy of regression models. As a by-product, our methods also broaden the applicability of classification systems.
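
To illustrate the kind of grouping involved, the following sketch shows two standard ways of turning the observed values of a continuous target variable into k class intervals: equal-width intervals (same range per interval) and equal-frequency intervals (same number of cases per interval). This is a generic illustration of interval construction, not the specific methods evaluated later in the paper.

```python
import numpy as np

def equal_width_bins(y, k):
    """Split the range of y into k intervals of equal width."""
    edges = np.linspace(y.min(), y.max(), k + 1)
    # digitize maps each value to the index of its interval (0 .. k-1)
    return np.clip(np.digitize(y, edges[1:-1]), 0, k - 1)

def equal_frequency_bins(y, k):
    """Split y into k intervals holding roughly equal numbers of cases."""
    edges = np.quantile(y, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(y, edges[1:-1]), 0, k - 1)

# Toy sample of a continuous goal variable
y = np.array([1.0, 1.2, 1.3, 5.0, 5.1, 9.8, 10.0, 10.2])
print(equal_width_bins(y, 3))      # interval labels by width
print(equal_frequency_bins(y, 3))  # interval labels by frequency
```

The two strategies generally produce different cut points: equal-width intervals are easy to describe but sensitive to outliers, whereas equal-frequency intervals adapt to the distribution of the observed values.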

We argue that mapping regression into classification is a two-step process. First, we transform the observed values of the goal variable into a set of intervals. These intervals may be regarded as values of an ordinal variable (i.e., discrete values with an implicit ordering among them). Classification systems deal with discrete target variables, but they are unable to take advantage of this ordering. We propose a second step whose objective is to overcome this difficulty: we use misclassification costs, carefully chosen to reflect the ordering of the intervals, to compensate for the information lost by discarding the ordering.
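
One simple way to make costs reflect the ordering, sketched below, is to penalize a prediction in proportion to the distance between the true and predicted intervals, so that confusing adjacent intervals costs less than confusing distant ones. Deriving the costs from interval midpoints is an illustrative assumption here, not necessarily the exact cost definition used in the paper.

```python
import numpy as np

def ordinal_cost_matrix(midpoints):
    """Build a misclassification cost matrix for ordered intervals.

    cost[i, j] grows with the distance between the midpoint of the
    true interval i and that of the predicted interval j, so the
    ordering of the intervals is reflected in the costs.
    """
    m = np.asarray(midpoints, dtype=float)
    return np.abs(m[:, None] - m[None, :])

# Midpoints of three hypothetical intervals of the target variable
costs = ordinal_cost_matrix([2.0, 6.0, 10.0])
print(costs)
```

A plain classifier would treat all off-diagonal entries as equally bad; a cost-sensitive one minimizing expected cost under this matrix prefers predictions close, in the ordinal sense, to the true interval.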

We describe several alternative ways of transforming a set of continuous values into a set of intervals. Initial experiments revealed no clear winner among them. This led us to try a search-based approach to the task of finding an adequate set of intervals. We use a wrapper technique (John et al., 1994; Kohavi, 1995) to find near-optimal settings for this mapping task.
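
A wrapper search of this kind can be sketched as a loop over candidate discretizations, each evaluated by training the target classification system and measuring its estimated accuracy. In the sketch below, the `evaluate` function is a caller-supplied assumption standing in for running C4.5, CN2, or the linear discriminant (e.g. via cross-validation), and equal-frequency discretization is used purely for illustration.

```python
import numpy as np

def wrapper_choose_k(y, candidate_ks, evaluate):
    """Wrapper search over the number of intervals k.

    For each candidate k, discretize the continuous target y and ask
    `evaluate(labels)` for an estimated score of the classifier built
    on those labels; return the k with the best score.
    """
    best_k, best_score = None, float("-inf")
    for k in candidate_ks:
        # Equal-frequency discretization of the continuous target
        edges = np.quantile(y, np.linspace(0, 1, k + 1))
        labels = np.clip(np.digitize(y, edges[1:-1]), 0, k - 1)
        score = evaluate(labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy check with a dummy evaluator that simply favours fewer classes
y = np.random.default_rng(0).normal(size=100)
k = wrapper_choose_k(y, [2, 4, 8], evaluate=lambda lab: -len(set(lab.tolist())))
print(k)  # 2, since the dummy evaluator prefers the smallest number of classes
```

The expensive part in practice is `evaluate`, since each candidate requires retraining the classification system; the candidate grid therefore has to be kept small.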

We have tested our methodology on four regression domains with three different classification systems: C4.5 (Quinlan, 1993), CN2 (Clark & Niblett, 1988), and a linear discriminant (Fisher, 1936; Dillon & Goldstein, 1984). The results show the validity of our search-based approach and the gains in accuracy obtained by adding misclassification costs to classification algorithms.

The next section describes how to transform a continuous target variable into a set of intervals. In section 3 we describe our proposal of using misclassification costs to deal with ordinal variables. The experiments we carried out are described in section 4. Finally, we comment on the relation to other work and present the conclusions of this paper.

