3.1 The data sets

To attain the goals of the experiments we needed data sets for which results of other systems were available. We decided to use some data sets from a competition held by the DGOR (Deutsche Gesellschaft für Operations Research) [16] in 1982. These problems had the advantage of being well described and of having published results from 12 different competitors. The main disadvantage of these data sets was that they were quite small: the ones we used had around 100 examples for learning. Such a small number of examples can limit the performance of propositional learners, particularly when many attributes are present (which is the case in some of our trials, where many attributes are introduced).

Of the 15 time series problems used in the DGOR competition we had access to 5: ZR03, ZR04, ZR06, ZR11 and ZR15. These were basically industrial data sets (product sales, etc.).

These data sets needed to be pre-processed according to the methods described in section 2.2.1. We compared the following strategies for introducing new attributes. We always tried data sets constructed using 1 to 5 lagged variables. We named these trials with the suffix "_tN", where N is the number of lagged variables used. For instance "ZR03_t4" is the data set built from problem ZR03 using 4 attributes which represent the last four class values. We also tested some variants with differences. They have the suffix "...dN", where N is the level of differences used. For instance "ZR06_t5d2" uses 5 lagged variables and also has 4 attributes of first level differences and 3 of differences of differences (so it uses 12 attributes). We also included in the tests some variants with weighted averages of previous class values (the ones with "sm" in their name). We did not exhaust all possible combinations; we chose some of them heuristically. For instance the trial "ZR03_smt5d2" uses the same attributes as "ZR03_t5d2" but has an extra attribute whose value is calculated as (5*y_{t-1} + 4*y_{t-2} + 3*y_{t-3} + 2*y_{t-4} + y_{t-5})/15. This is a kind of weighted moving average linear model [7]. There were also other variants with different weights on the previous values of the variable.
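
The attribute construction described above can be illustrated with a minimal Python sketch. It is not the code used in the experiments: the function names (`differences`, `build_examples`) and the toy series are hypothetical, and the sign convention for differences (more recent value minus older value) is an assumption, since the text does not state it.

```python
from typing import List, Optional, Sequence, Tuple


def differences(values: Sequence[float]) -> List[float]:
    """Differences between neighbouring values (assumed: newer minus older)."""
    return [values[i] - values[i + 1] for i in range(len(values) - 1)]


def build_examples(
    series: Sequence[float],
    n_lags: int,
    diff_level: int = 0,
    weights: Optional[Sequence[float]] = None,
) -> List[Tuple[List[float], float]]:
    """Turn a time series into (attributes, class value) examples.

    n_lags     -- number of lagged variables (the N in "_tN")
    diff_level -- level of differences (the N in "...dN")
    weights    -- optional weights for the "sm" attribute, most recent lag first
    """
    if weights is not None and len(weights) != n_lags:
        raise ValueError("one weight per lagged variable is required")

    examples = []
    for t in range(n_lags, len(series)):
        # Lagged variables y_{t-1}, ..., y_{t-n_lags}
        lags = [series[t - k] for k in range(1, n_lags + 1)]
        attributes = list(lags)

        # Each level of differences has one attribute fewer than the previous one
        level = lags
        for _ in range(diff_level):
            level = differences(level)
            attributes.extend(level)

        # Weighted average of the previous class values
        if weights is not None:
            total = sum(w * y for w, y in zip(weights, lags))
            attributes.append(total / sum(weights))

        examples.append((attributes, series[t]))
    return examples


# Toy series (made up for illustration, not one of the DGOR data sets)
toy = [1, 2, 4, 7, 11, 16, 22]

t5d2 = build_examples(toy, n_lags=5, diff_level=2)
print(len(t5d2[0][0]))  # 12 attributes: 5 lags + 4 first + 3 second differences

smt5d2 = build_examples(toy, n_lags=5, diff_level=2, weights=[5, 4, 3, 2, 1])
print(len(smt5d2[0][0]))  # 13 attributes: the 12 above plus the weighted average
```

Note that each example consumes `n_lags` earlier values, so a series with around 100 points yields slightly fewer learning examples as the number of lagged variables grows.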
