List of Exercises: Data Mining 1
October 26th, 2015
- In a given application, we have information about the ages of a
set of 12 people. Their values are 12, 30, 24, 10, 10, 23, 43, 67, 79, 34, 56, 51.
- What is the median of these ages? Explain.
- What is the mode of these ages? Explain.
- How would you obtain the difference between the 99th percentile
and the 10th percentile of this set of ages, in R?
- What are the results of normalizing and standardizing these
values?
- Suppose that we add two more age values to the set mentioned in
item (1): 10 months and 100 years.
- Apply normalization and standardization to this new set of data.
- Should you prefer normalization or standardization for this new
set of data? Why?
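As a starting point for the item above, the following sketch applies both rescalings to the extended set of ages. It is written in Python rather than R, and it rests on assumptions the exercise does not state: that ``normalization'' means min-max scaling to [0, 1], that ``standardization'' means z-scores computed with the sample standard deviation, and that the age of 10 months is expressed as 10/12 years. The function names are illustrative only.

```python
from statistics import mean, stdev

# Ages from item (1) plus the two new values.
# Assumption: 10 months is converted to years (10/12) so all values share one unit.
ages = [12, 30, 24, 10, 10, 23, 43, 67, 79, 34, 56, 51, 10 / 12, 100]

def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

normalized = min_max_normalize(ages)
standardized = standardize(ages)
```

Comparing how the two new extreme values affect `normalized` and `standardized` may help when answering the question about which rescaling to prefer.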
- Answer the following questions:
- What is the objective of boxplot graphs?
- What are the functions of the spread measures ``range'' and
``interquartile range''? Is there any advantage to using one over
the other?
- What other spread measures can we use to analyse data?
- Figure 1 shows a ``scatterplot''. What information
can you infer from this graph?
- Figure 2 shows a ``parallel plot''. What kind of
information can you infer from this graph?
Figure 2: Parallel plot example
- When visualizing data, it may be important to reorder/rearrange variables or
sort variable values. Give an example where this reordering yields a
better visualization than the original, unordered data.
- The ``nearest neighbours'' strategy can also be used to ``impute''
unknown variable values. Explain how you can use nearest
neighbours to impute missing variable values.
- What is the main idea behind PCA (Principal Component
Analysis), and why is it useful?
- Give a brief description of the k-means clustering algorithm.
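One possible way to make the description concrete is the sketch below. It is a minimal pure-Python version of the standard k-means iteration (Lloyd's algorithm), not a reference implementation. The function name `kmeans`, its parameters, the random initialization from the data points, and the toy data are all assumptions made for illustration.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Alternate an assignment step and a centroid-update step until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k data points chosen at random
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance is enough for comparing distances).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(tuple(sum(xs) / len(cluster) for xs in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:  # no centroid moved, so assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two groups of points in the plane.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, clusters = kmeans(data, k=2)
```

The result depends on the initial centroids, which is why practical implementations usually run several random restarts and keep the best solution.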
- Suppose you are given a CSV (Comma-Separated Values) table. When
you read this table using the R function read.csv, what type is
used internally to store a variable that has only two
values? What about when you read the same data in the WEKA software?
- What are the basic variable types used in data analysis?
- Explain the difference between the distance calculated using
``simple matching'' and the ``Jaccard'' distance. In what situations
would we apply one or the other?
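To explore the two measures, the sketch below computes both on a pair of binary vectors. It assumes the usual definitions: simple matching counts 0-0 agreements as matches, while Jaccard ignores positions where both vectors are 0. The function names and the example vectors are hypothetical.

```python
def simple_matching_distance(x, y):
    """Fraction of all positions where two binary vectors disagree."""
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)

def jaccard_distance(x, y):
    """Fraction of disagreements among positions where at least one vector is 1.

    Assumption: at least one position holds a 1 in x or y; otherwise the
    denominator is zero and the distance is undefined.
    """
    mismatches = sum(a != b for a, b in zip(x, y))
    both_present = sum(a == 1 and b == 1 for a, b in zip(x, y))
    return mismatches / (mismatches + both_present)

# Hypothetical sparse binary vectors (for example, items bought by two customers).
x = [1, 0, 0, 0, 0, 1, 0, 0]
y = [1, 0, 0, 0, 0, 0, 1, 0]

smd = simple_matching_distance(x, y)  # 2 mismatches out of 8 positions
jd = jaccard_distance(x, y)           # 2 mismatches out of 3 positions with a 1
```

Here the shared zeros make the vectors look close under simple matching (0.25), while Jaccard reports a larger distance (2/3) because it only considers the positions where an attribute is present.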
- What is ``supervised'' machine learning?
- What is ``cluster analysis'' used for?
- What is the difference between Pearson correlation and simple
- Consider the following data table:
Given a new observation (1.4, 1.6), which two observations in the
table are nearest to the new data point, using the Euclidean distance?
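The procedure can be sketched as follows. The observations in `table` are hypothetical placeholders, not the data table of this exercise, and the labels A to D and the function name `two_nearest` are likewise invented for illustration. Only the query point (1.4, 1.6) comes from the text.

```python
import math

# Hypothetical observations standing in for the exercise's data table.
table = {"A": (1.0, 1.0), "B": (1.5, 2.0), "C": (3.0, 4.0), "D": (5.0, 7.0)}
new_point = (1.4, 1.6)

def two_nearest(observations, query):
    """Return the labels of the two observations closest to the query point."""
    ranked = sorted(observations, key=lambda name: math.dist(observations[name], query))
    return ranked[:2]

nearest = two_nearest(table, new_point)
```

With these placeholder values, B is at distance sqrt(0.17) and A at distance sqrt(0.52), so `nearest` is `["B", "A"]`. Substituting the actual table rows gives the answer to the exercise.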
- What is the difference between hierarchical agglomerative
clustering and hierarchical divisive clustering?
- Explain why using the ``Error rate'' to evaluate a
classification model may not be a good approach.
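One common situation behind this question is class imbalance. The sketch below uses an invented test set (95 negative and 5 positive cases) and a trivial ``model'' that always predicts the majority class. Both are assumptions made purely for illustration.

```python
# Hypothetical test set: 95 negative cases (0) and 5 positive cases (1).
actual = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
predicted = [0] * 100

# Error rate: fraction of cases where the prediction differs from the truth.
error_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)

# Recall on the positive class: fraction of true positives that were found.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)
```

The error rate is only 0.05, yet the recall is 0.0: the model never identifies a single positive case, which the error rate alone does not reveal.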
- Which approach would you use to plot a histogram for a
continuous numeric variable?
- What is the objective of data sampling?
Inês de Castro Dutra