-----------------------------------------------------------------
- finished slides about Data Visualization
-----------------------------------------------------------------
- Class on board:
-----------------------------------------------------------------
- Example: Given the following 2 sets of data (3 instances, 2 variables):

      Set 1                Set 2
          X    Y               X    Y
      I1  10   21          I1  10   21
      I2  50   22          I2  90   22
      I3  90   23          I3  50   23

  Calculate the Euclidean distance between I1-I2, I1-I3 and I2-I3. Apply normalization and standardization to both sets and calculate the new distances again. (A small sketch of this exercise appears below.)

  Q1: does the distribution of each data set change after the transformations?
  A1: no. (Explain why)

  Q2: what is the difference between normalizing and standardizing both datasets?
  A2: None. Both maintain the data distribution. Standardization always transforms a variable so that the new values have mean zero and standard deviation equal to 1. Normalization rescales the variable's values to lie between 0 and 1.

  Q3: do the distance distributions change after applying the transformations?
  A3: no. (Explain why)

  Standardization is preferred when your variable has outliers. (Explain why: in the presence of outliers, normalization squeezes the "normal" values into very small intervals.)
-------------------------------------------------------------------------------
- PCA: the idea behind PCA is to reduce the dimension of the data. It is a method for factor analysis and a multivariate method. It reduces the dimension by transforming the original data coordinates and ranking the principal components. Principal components are given by unit vectors that are orthogonal to each other in the various dimensions. PCA finds these orthogonal unit vectors and the data is then represented in the new set of coordinates they define ("principal components"), ranked by importance (the unit vectors are found by maximizing the variance captured along each new axis). Any statistics book explains well how this works. (A short sketch appears below.) Good links: http://www.cs.cmu.edu/~elaw/papers/pca.pdf and https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/

- difference between Pearson correlation and simple linear regression: both calculate the coefficients "a" and "b" of the regression y = ax + b, but regression allows "prediction". In other words, it generates a model that, given a value for x, allows us to infer a value for y. Correlation does not explicitly generate models; it only produces the coefficients and the correlation value. Regression models are only applied when our class variable (the one we are interested in predicting) is numeric. "Logistic" regression can be applied to nominal class variables: it gives the probability of an instance belonging to a given class. (A short sketch contrasting the three appears below.)
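A minimal sketch of the board exercise above, assuming plain NumPy; the function and variable names (pairwise_distances, normalize, standardize, set_1) are illustrative, not part of the class material:

    import numpy as np

    def pairwise_distances(data):
        """Euclidean distance for every pair of instances in a {name: point} dict."""
        names = list(data)
        points = np.array([data[n] for n in names], dtype=float)
        return {(names[i], names[j]): float(np.linalg.norm(points[i] - points[j]))
                for i in range(len(names)) for j in range(i + 1, len(names))}

    def normalize(points):
        """Min-max normalization: rescale each variable to the interval [0, 1]."""
        mins, maxs = points.min(axis=0), points.max(axis=0)
        return (points - mins) / (maxs - mins)

    def standardize(points):
        """Standardization: each variable ends up with mean 0 and standard deviation 1."""
        return (points - points.mean(axis=0)) / points.std(axis=0)

    # Set 1 from the board example; Set 2 only swaps the X values of I2 and I3.
    set_1 = {"I1": (10, 21), "I2": (50, 22), "I3": (90, 23)}

    points = np.array(list(set_1.values()), dtype=float)
    print("original:", pairwise_distances(set_1))
    for label, transform in (("normalized", normalize), ("standardized", standardize)):
        transformed = dict(zip(set_1, transform(points)))
        print(label + ":", pairwise_distances(transformed))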
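A short PCA sketch via the covariance matrix and its eigenvectors, assuming NumPy only; the synthetic two-variable data is made up purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic 2-D data with one dominant direction (illustration only).
    x = rng.normal(size=100)
    data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=100)])

    centered = data - data.mean(axis=0)              # PCA works on mean-centered data
    cov = np.cov(centered, rowvar=False)             # covariance matrix of the variables
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # orthogonal unit vectors (columns)

    # Rank the components by explained variance (largest eigenvalue first).
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order]
    explained = eigenvalues[order] / eigenvalues.sum()

    projected = centered @ components[:, :1]         # keep only the first principal component
    print("explained variance ratio:", explained)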
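A sketch contrasting Pearson correlation, linear regression, and logistic regression, assuming NumPy and scikit-learn; the toy values and labels are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # numeric class variable

    # Pearson correlation: a single coefficient, no explicit predictive model.
    r = np.corrcoef(x.ravel(), y)[0, 1]

    # Linear regression: fits y = a*x + b and can predict y for a new x.
    reg = LinearRegression().fit(x, y)
    prediction = reg.predict([[6.0]])                 # infer y for x = 6

    # Logistic regression: nominal class variable, output is a class probability.
    labels = np.array([0, 0, 0, 1, 1])
    clf = LogisticRegression().fit(x, labels)
    probability = clf.predict_proba([[3.5]])          # P(instance belongs to each class)

    print(r, reg.coef_, reg.intercept_, prediction, probability)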
- Regression as a prediction model and Support Vector Machines (SVM): regression and other linear models can be used for classification when all attributes are numeric. Their biggest disadvantage is that they can only represent linear boundaries between classes. Support Vector Machines use linear models to implement nonlinear class boundaries. They do that by transforming the input data using a nonlinear mapping; in other words, they transform the instance space into a new space. With a nonlinear mapping, a straight line in the new space does not look straight in the original instance space. The idea is to find a special kind of linear model: the maximum margin hyperplane. Having instances of two classes, the maximum margin hyperplane is the one that gives the greatest separation between the classes by maximizing the distance to the support vectors. (A sketch with a nonlinear kernel appears below, after the clustering notes.)
------------------------------------------------------------------------
- Clustering: unsupervised learning (it can be supervised if we have a class variable, but traditionally it is not). The most common algorithm is k-means. In this algorithm we need to select a number of clusters k. k instances are randomly selected as the centers of the k clusters. The algorithm proceeds by calculating the distance/similarity between each center and all the other instances (the ones that are not centers) and assigning every instance to its closest center. Once this step is done, centroids are calculated for each one of the k clusters. Centroids are calculated by averaging each dimension over all instances belonging to a cluster - assume you have the points (1,3,5), (-1,4,2) and (2,3,4) in one of the clusters; the centroid will be (2/3, 10/3, 11/3). Although this instance may not exist in the original dataset, it is used as the new center when the k-means steps are repeated. The clustering process ends when the centroids do not change anymore. (A minimal k-means sketch appears below.)
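A minimal k-means sketch in plain NumPy, following the steps above (random initial centers, assignment to the nearest center, centroid update, stop when the centroids no longer change); the function name k_means and the toy points are illustrative:

    import numpy as np

    def k_means(points, k, seed=0):
        rng = np.random.default_rng(seed)
        # Pick k instances at random as the initial centers.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        while True:
            # Assign each instance to the closest centroid (Euclidean distance).
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            assignment = distances.argmin(axis=1)
            # Recompute each centroid as the average of its cluster, dimension by dimension
            # (empty clusters are not handled in this sketch).
            new_centroids = np.array([points[assignment == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):
                return centroids, assignment
            centroids = new_centroids

    # The centroid example from the notes: the average of these three points is (2/3, 10/3, 11/3).
    cluster = np.array([(1, 3, 5), (-1, 4, 2), (2, 3, 4)], dtype=float)
    print(cluster.mean(axis=0))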
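Going back to the SVM point above, a small sketch assuming scikit-learn; the synthetic concentric-circles data is chosen only because no straight line in the original instance space separates the two classes:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X, y)  # maximum margin hyperplane in the original space
    rbf_svm = SVC(kernel="rbf").fit(X, y)        # nonlinear mapping, linear model in the new space

    print("linear accuracy:", linear_svm.score(X, y))  # poor: the boundary must be a straight line
    print("rbf accuracy:", rbf_svm.score(X, y))        # good: straight line only in the mapped space
    print("number of support vectors:", rbf_svm.n_support_)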