-----------------------------------------------------------------
- finished slides about Data Visualization
-----------------------------------------------------------------
- Class on board:
-----------------------------------------------------------------
- Example: Given the following 2 sets of data (3 instances, 2 variables):

      Set 1                Set 2
          X    Y               X    Y
      I1  10   21          I1  10   21
      I2  50   22          I2  90   22
      I3  90   23          I3  50   23

  Calculate the Euclidean distance between I1-I2, I1-I3 and I2-I3. Apply normalization and standardization to both sets and calculate the new distances again. (A small sketch of this exercise appears below.)

  Q1: does the distribution of each data set change after the transformations?
  A1: no. (Explain why)

  Q2: what is the difference between normalizing and standardizing both datasets?
  A2: None. Both maintain the data distribution. Standardization always transforms a variable so that the new values have mean zero and standard deviation equal to 1. Normalization rescales the variable's values to lie between 0 and 1.

  Q3: do the distance distributions change after applying the transformations?
  A3: no. (Explain why)

  Standardization is preferred when your variable has outliers. (Explain why: in the presence of outliers, normalization squeezes the "normal" values into very small intervals.)
-------------------------------------------------------------------------------
- PCA: the idea behind PCA is to reduce the dimension of the data. It is a method for factor analysis and a multivariate method. It reduces the dimension by transforming the original data coordinates and ranking the principal components. Principal components are given by unit vectors that are orthogonal to each other in the various dimensions. PCA finds these orthogonal unit vectors and the data is then represented in the new set of coordinates they define ("principal components"), ranked by importance (the unit vectors are found by maximizing the variance captured along each new axis). Any statistics book explains well how this works. (A short sketch appears below.) Good links: http://www.cs.cmu.edu/~elaw/papers/pca.pdf and https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/

- difference between Pearson correlation and simple linear regression: both calculate the coefficients "a" and "b" of the regression y = ax + b, but regression allows "prediction". In other words, it generates a model that, given a value for x, allows us to infer a value for y. Correlation does not explicitly generate models; it only produces the coefficients and the correlation value. Regression models are only applied when our class variable (the one we are interested in predicting) is numeric. "Logistic" regression can be applied to nominal class variables: it gives the probability of an instance belonging to a given class. (A short sketch contrasting the three appears below.)
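A minimal sketch of the board exercise above, assuming plain NumPy; the function and variable names (pairwise_distances, normalize, standardize, set_1) are illustrative, not part of the class material:

    import numpy as np

    def pairwise_distances(data):
        """Euclidean distance for every pair of instances in a {name: point} dict."""
        names = list(data)
        points = np.array([data[n] for n in names], dtype=float)
        return {(names[i], names[j]): float(np.linalg.norm(points[i] - points[j]))
                for i in range(len(names)) for j in range(i + 1, len(names))}

    def normalize(points):
        """Min-max normalization: rescale each variable to the interval [0, 1]."""
        mins, maxs = points.min(axis=0), points.max(axis=0)
        return (points - mins) / (maxs - mins)

    def standardize(points):
        """Standardization: each variable ends up with mean 0 and standard deviation 1."""
        return (points - points.mean(axis=0)) / points.std(axis=0)

    # Set 1 from the board example; Set 2 only swaps the X values of I2 and I3.
    set_1 = {"I1": (10, 21), "I2": (50, 22), "I3": (90, 23)}

    points = np.array(list(set_1.values()), dtype=float)
    print("original:", pairwise_distances(set_1))
    for label, transform in (("normalized", normalize), ("standardized", standardize)):
        transformed = dict(zip(set_1, transform(points)))
        print(label + ":", pairwise_distances(transformed))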
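A short PCA sketch via the covariance matrix and its eigenvectors, assuming NumPy only; the synthetic two-variable data is made up purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic 2-D data with one dominant direction (illustration only).
    x = rng.normal(size=100)
    data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=100)])

    centered = data - data.mean(axis=0)              # PCA works on mean-centered data
    cov = np.cov(centered, rowvar=False)             # covariance matrix of the variables
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # orthogonal unit vectors (columns)

    # Rank the components by explained variance (largest eigenvalue first).
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order]
    explained = eigenvalues[order] / eigenvalues.sum()

    projected = centered @ components[:, :1]         # keep only the first principal component
    print("explained variance ratio:", explained)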
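A sketch contrasting Pearson correlation, linear regression, and logistic regression, assuming NumPy and scikit-learn; the toy values and labels are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # numeric class variable

    # Pearson correlation: a single coefficient, no explicit predictive model.
    r = np.corrcoef(x.ravel(), y)[0, 1]

    # Linear regression: fits y = a*x + b and can predict y for a new x.
    reg = LinearRegression().fit(x, y)
    prediction = reg.predict([[6.0]])                 # infer y for x = 6

    # Logistic regression: nominal class variable, output is a class probability.
    labels = np.array([0, 0, 0, 1, 1])
    clf = LogisticRegression().fit(x, labels)
    probability = clf.predict_proba([[3.5]])          # P(instance belongs to each class)

    print(r, reg.coef_, reg.intercept_, prediction, probability)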
- Regression as a prediction model and Support Vector Machines (SVM): regression and other linear models can be used for classification when all attributes are numeric. Their biggest disadvantage is that they can only represent linear boundaries between classes. Support Vector Machines use linear models to implement nonlinear class boundaries. They do that by transforming the input data using a nonlinear mapping; in other words, they transform the instance space into a new space. With a nonlinear mapping, a straight line in the new space does not look straight in the original instance space. The idea is to find a special kind of linear model: the maximum margin hyperplane. Having instances of two classes, the maximum margin hyperplane is the one that gives the greatest separation between the classes by maximizing the distance to the support vectors. (A sketch with a nonlinear kernel appears below, after the clustering notes.)
------------------------------------------------------------------------
- Clustering: unsupervised learning (it can be supervised if we have a class variable, but traditionally it is not). The most common algorithm is k-means. In this algorithm we need to select a number of clusters k. k instances are randomly selected as the centers of the k clusters. The algorithm proceeds by calculating the distance/similarity between each center and all the other instances (the ones that are not centers) and assigning every instance to its closest center. Once this step is done, centroids are calculated for each one of the k clusters. Centroids are calculated by averaging each dimension over all instances belonging to a cluster - assume you have the points (1,3,5), (-1,4,2) and (2,3,4) in one of the clusters; the centroid will be (2/3, 10/3, 11/3). Although this instance may not exist in the original dataset, it is used as the new center when the k-means steps are repeated. The clustering process ends when the centroids do not change anymore. (A minimal k-means sketch appears below.)
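A minimal k-means sketch in plain NumPy, following the steps above (random initial centers, assignment to the nearest center, centroid update, stop when the centroids no longer change); the function name k_means and the toy points are illustrative:

    import numpy as np

    def k_means(points, k, seed=0):
        rng = np.random.default_rng(seed)
        # Pick k instances at random as the initial centers.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        while True:
            # Assign each instance to the closest centroid (Euclidean distance).
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            assignment = distances.argmin(axis=1)
            # Recompute each centroid as the average of its cluster, dimension by dimension
            # (empty clusters are not handled in this sketch).
            new_centroids = np.array([points[assignment == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):
                return centroids, assignment
            centroids = new_centroids

    # The centroid example from the notes: the average of these three points is (2/3, 10/3, 11/3).
    cluster = np.array([(1, 3, 5), (-1, 4, 2), (2, 3, 4)], dtype=float)
    print(cluster.mean(axis=0))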
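Going back to the SVM point above, a small sketch assuming scikit-learn; the synthetic concentric-circles data is chosen only because no straight line in the original instance space separates the two classes:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X, y)  # maximum margin hyperplane in the original space
    rbf_svm = SVC(kernel="rbf").fit(X, y)        # nonlinear mapping, linear model in the new space

    print("linear accuracy:", linear_svm.score(X, y))  # poor: the boundary must be a straight line
    print("rbf accuracy:", rbf_svm.score(X, y))        # good: straight line only in the mapped space
    print("number of support vectors:", rbf_svm.n_support_)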