
1. Introduction

Regression is an important problem in data analysis. Given a sample of values of a set of independent variables x1,...,xn together with the corresponding values of a dependent variable y, we try to approximate the unknown function y = f(x1,...,xn).
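This setting can be sketched in a few lines (the data and the candidate approximation below are hypothetical, purely for illustration): given sample pairs of independent and dependent variables, a regression method searches for a function whose predictions are close to the observed y values.

```python
# Hypothetical sample of (x1, y) pairs drawn from an unknown function f.
sample = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]

def f_hat(x1):
    # A hand-made candidate approximation of f (illustrative only).
    return 2.0 * x1 + 1.0

# Mean squared error of the approximation over the sample: the quantity
# a regression learner typically tries to minimise.
mse = sum((y - f_hat(x1)) ** 2 for x1, y in sample) / len(sample)
```

A better approximation of f is simply one with a lower error over (unseen) samples.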

The machine learning (ML) field has been mainly concerned with the problem of classification. Regression differs from classification in that each example is labelled with a real number instead of a label from a finite set of symbolic classes. One of the main advantages of ML systems over systems from other research fields is the comprehensibility of their results [11]. This is a key motivation for trying to apply ML algorithms to regression problems.

In the field of ML most of the existing work on these problems is based on regression trees (CART [2], M5 [12, 13], RETIS [7, 8]). These systems use simple and efficient algorithms. M5 and RETIS are able to learn trees with linear regression models in the leaves, while CART is limited to average values in the leaves, which leads to poor extrapolation from the original data. One exception to regression trees is the work on regression rules presented in [19]. That system transforms the data into a classification problem by a process of discretization, which inevitably carries some loss of information.
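The difference between the two leaf models can be seen in a small hypothetical sketch (toy one-dimensional data, not taken from any of the cited systems): an average-value leaf, as in CART, predicts one constant for every query it covers, whereas a linear-model leaf, as in M5 or RETIS, can follow a trend beyond the observed range.

```python
# Toy training cases covered by one tree leaf: (x1, y) pairs with y = 3*x1.
leaf_cases = [(float(x), 3.0 * x) for x in range(11)]  # x1 in 0..10

# CART-style leaf: predict the average y of the covered cases.
avg = sum(y for _, y in leaf_cases) / len(leaf_cases)

# M5/RETIS-style leaf: a linear model over the covered cases
# (here the exact relation, for illustration).
def linear_leaf(x1):
    return 3.0 * x1

# For a query outside the observed range (x1 = 20), the average leaf
# still answers 15.0, while the linear leaf extrapolates to 60.0.
```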

Regression methods have been extensively studied for many years in the field of statistics. Simple and efficient methods like linear least squares regression have proven to be very effective in many real-world applications. Nevertheless, non-linearity has been receiving increasing attention from the statistics community. Among the new methods that have appeared in recent years we can mention projection pursuit [6] and MARS [5].
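For the single-predictor case, linear least squares has a well-known closed-form solution; a minimal sketch (toy data, purely illustrative) is:

```python
# Toy data: approximately y = 2*x with some noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept that minimise the sum of squared residuals:
# a = cov(x, y) / var(x),  b = mean_y - a * mean_x
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x
```

Methods such as projection pursuit and MARS address the cases where a single global linear fit of this kind is inadequate.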

Neural networks (NNs) are also good function approximators, and much work exists on these tasks (see for instance the work of [9] with back-propagation NNs).

In this paper we present R2, a propositional inductive learning system. This system is able to learn a set of regression rules from samples of a function mapping. Compared to regression trees, R2 has the advantage of a richer descriptive language (rules). The target models of these systems (with the exception of CART) are similar to those of the current version of R2. Nevertheless, we intend to extend the set of models used by our system in the near future.

R2 has the advantage of comprehensibility over the statistical and neural network approaches. This can be an important factor in many domains. However, we should also be concerned with experimental results on accuracy; we have not yet made any comparisons with these types of systems. An interesting account of evaluating techniques from all these fields on common data sets can be found in [11].

Regression problems appear in many real-world applications. Among them we can mention the field of control. Several authors have tried to develop controllers of dynamic systems based on records of human interaction with the system (see for instance [18]). The controlled variables in these problems are usually numeric, so we can look at this as a regression problem. Most of the work on applying ML systems to these tasks has been done using classification systems. Regression systems like R2 could improve the accuracy of these machine-generated controllers.

The following section gives an overview of R2. We describe its main components and give an illustrative example. We then show some results of comparative experiments carried out on artificial domains. We end with some conclusions and future work.

