
Data Mining I: First Assignment
Deadline: November 9th 2015
October 13th, 2015

The main goal of this work is to apply feature importance methods to rank variables according to how predictive they are of cardiac pathologies. Before trying to achieve this goal, it may be necessary to preprocess the data. For example, data cleaning, data transformation, data normalization, and the removal of irrelevant features, such as the ID, will be needed to help improve data quality.

In data cleaning, you need to look for inconsistencies, noise and missing values. For example, you may look for different variable values that convey the same meaning (e.g., ``M'' and ``Masculine'', which both denote the masculine gender) or variables that contain spurious values (for example, impossible values for ages, dates etc). You may also need to remove duplicate entries/objects, remove variables that contain too many missing values (for example, more than 50% missing values, if the missing values carry no meaning), or remove variables that have the same value for all objects.
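As a rough sketch of these cleaning steps, the snippet below uses pandas; the file name and the column names (id, gender, age) are placeholders that must be adapted to the actual data dictionary.

import pandas as pd

# A minimal cleaning sketch; file and column names are assumed for illustration.
df = pd.read_csv("cardiac.csv")

# Harmonize values that convey the same meaning (e.g. "Masculine" vs "M").
df["gender"] = df["gender"].replace({"Masculine": "M", "Feminine": "F"})

# Turn spurious values (e.g. impossible ages) into missing values.
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = pd.NA

# Remove duplicate objects and the irrelevant ID variable.
df = df.drop_duplicates().drop(columns=["id"])

# Drop variables with more than 50% missing values or a single constant value.
df = df.loc[:, df.isna().mean() <= 0.5]
df = df.loc[:, df.nunique(dropna=True) > 1]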

Data transformation involves applying functions that convert variables or objects from one space to another. Commonly used methods are smoothing (binning, clustering or regression, for example), aggregation, generalization (for example, transforming low-level feature values into higher-level ones - age represented as intervals), normalization (for example, $x_{new}=\frac{x-x_{min}}{x_{max}-x_{min}}$) and standardization ($x_{new}=\frac{x-\mu}{\sigma}$) to make data fall into a small specified range, or feature construction (for example, creation of new attributes such as the body mass index from height and weight). Be careful with outliers: if your data has too many outliers, normalization may push all your ``normal'' values into a very small interval. In that case, it is recommended to use standardization.
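As an illustration of these transformations, the snippet below applies min-max normalization, z-score standardization and a constructed body mass index feature with pandas; the column names (age, height, weight) are assumed for illustration.

import pandas as pd

df = pd.read_csv("cardiac.csv")

# Min-max normalization: x_new = (x - x_min) / (x_max - x_min)
x = df["age"]
df["age_minmax"] = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): x_new = (x - mu) / sigma
df["age_zscore"] = (x - x.mean()) / x.std()

# Feature construction: body mass index from height (m) and weight (kg),
# assuming such columns exist in the dataset.
df["bmi"] = df["weight"] / df["height"] ** 2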

After preprocessing, it is usual to proceed with univariate analysis (summaries, histograms, boxplots, mean, standard deviation, range, percentiles, interquartile range etc to measure data spread), bivariate analysis (correlations, for example Pearson, Spearman, Kendall etc), and multivariate analysis (regression, clustering, factor analysis/PCA, mutual information etc). These are useful to study relevant or redundant variables and help to perform a pre-selection of features. Note that strong correlations among variables mean that they are dependent in some way. When performing feature selection, usually (but not always, because it depends on the domain), if two or more variables are strongly correlated, it may mean that they contribute equally to the analysis and some of them may be thrown away. For example, if height and weight are strongly correlated with the body mass index (BMI), we should use only the BMI. Methods based on multiple mutual information perform feature selection by choosing only relevant and non-redundant variables.
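A possible sketch of the univariate and bivariate steps with pandas is given below; the file name and the 0.9 correlation threshold are illustrative choices, not prescribed values.

import pandas as pd

df = pd.read_csv("cardiac.csv")
numeric = df.select_dtypes(include="number")

# Univariate summaries: mean, standard deviation, quartiles, range.
print(numeric.describe())

# Bivariate analysis: pairwise correlations between numeric variables.
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

# List pairs whose absolute Pearson correlation exceeds an (assumed) 0.9 threshold;
# such pairs are candidates for removal during feature pre-selection.
cols = pearson.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        r = pearson.iloc[i, j]
        if abs(r) > 0.9:
            print(cols[i], cols[j], round(r, 3))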

Besides performing statistical analysis, we can also apply machine learning algorithms to extract knowledge from data and generate predictive models. Examples of such algorithms are Support Vector Machines, Decision Trees, Random Forests, Bayesian Networks, Ensemble methods, Neural Networks etc.

There are several libraries available that can perform all of these tasks. Be aware that the algorithms offer a wide range of parameter choices. You need to be careful with those choices, and you need to understand which algorithm is implemented and what the results mean.
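As an illustration, the sketch below trains three such classifiers with scikit-learn; the file name, the class variable name (pathology) and the parameter values are assumptions chosen only to show where such choices appear.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Assumes an already preprocessed, fully numeric dataset with no missing values.
df = pd.read_csv("cardiac_clean.csv")
X = df.drop(columns=["pathology"])
y = df["pathology"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each algorithm exposes parameters whose meaning you should understand before tuning them.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest": RandomForestClassifier(n_estimators=100),
    "svm (rbf)": SVC(kernel="rbf", C=1.0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))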

In order to assess the performance (quality-wise or quantity-wise) of each method on the same dataset, we need to choose some evaluation metric. This can be the error rate, the rate of correctly classified instances (CCI), sensitivity (recall, true positive rate), specificity, precision, etc. The metric used to assess performance will usually depend on the domain at hand. For example, if the data is skewed (unbalanced number of objects per class), the error rate may not be an adequate choice (why?).
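The toy example below, using scikit-learn metrics, hints at the answer: on a dataset where 90% of the objects are negative, a classifier that always predicts the majority class still reaches 90% correctly classified instances while never detecting a positive case.

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Skewed toy data: 90 negative objects, 10 positive objects,
# and a degenerate classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print("accuracy :", accuracy_score(y_true, y_pred))                     # 0.90
print("recall   :", recall_score(y_true, y_pred, zero_division=0))      # 0.0
print("precision:", precision_score(y_true, y_pred, zero_division=0))   # 0.0
print(confusion_matrix(y_true, y_pred))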




Inês de Castro Dutra 2015-10-21