List of Exercises: Data Mining 1
November 4th, 2015
- Classify the following attributes as binary, discrete, or continuous. Also
classify them as qualitative (nominal or ordinal) or quantitative (interval or
ratio). Some cases may have more than one interpretation, so briefly indicate
your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
- Time in terms of AM or PM.
- Brightness as measured by a light meter.
- Brightness as measured by people’s judgments.
- Angles as measured in degrees between 0 and 360.
- Bronze, Silver, and Gold medals as awarded at the Olympics.
- Height above sea level.
- Number of patients in a hospital.
- ISBN numbers for books.
- Ability to pass light in terms of the following values: opaque, translucent, transparent.
- Military rank.
- Distance from the center of campus.
- Density of a substance in grams per cubic centimeter.
- Coat check number. (When you attend an event, you can often give
your coat to someone who, in turn, gives you a number that you can
use to claim your coat when you leave.)
- You are approached by the marketing director of a local company,
who believes that he has devised a foolproof way to measure
customer satisfaction. He explains his scheme as follows: "It's so
simple that I can't believe that no one has thought of it before. I
just keep track of the number of customer complaints for each
product. I read in a data mining book that counts are ratio
attributes, and so, my measure of product satisfaction must be a
ratio attribute. But when I rated the products based on my new
customer satisfaction measure and showed them to my boss, he told me
that I had overlooked the obvious, and that my measure was
worthless. I think that he was just mad because our best-selling
product had the worst satisfaction since it had the most
complaints." Who is right, the marketing director or his boss?
If you answered his boss, what would you do to fix the measure of
satisfaction?
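As a starting point for thinking about this exercise, the sketch below compares raw complaint counts with complaints per unit sold. The function name `complaint_rate` and all figures are hypothetical, invented only for illustration; normalizing by units sold is one possible adjustment, not necessarily the only reasonable one.

```python
def complaint_rate(complaints, units_sold):
    """Return complaints per unit sold (lower suggests higher satisfaction)."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return complaints / units_sold


# Hypothetical figures, not taken from the exercise.
products = {
    "best_seller": {"complaints": 120, "units_sold": 60000},
    "niche_item": {"complaints": 15, "units_sold": 500},
}

rates = {
    name: complaint_rate(p["complaints"], p["units_sold"])
    for name, p in products.items()
}

for name, rate in rates.items():
    print(f"{name}: {products[name]['complaints']} complaints, rate {rate:.4f}")
```

With these made-up numbers the best seller has the most complaints but the lower complaint rate, which is the kind of reversal the exercise hints at.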
- An educational psychologist wants to use association analysis to
analyze test results. The test consists of 100 questions with four
possible answers each.
- How would you convert this data into a form suitable
for association analysis?
- In particular, what type of attributes would you have and how
many of them are there?
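One possible encoding, sketched under the assumption that the four answers to each question are labelled A to D, turns every (question, answer) pair into its own binary item. The helper names `all_items` and `binarize_answers` are hypothetical, and treating an unanswered question as contributing no item is a design choice rather than something the exercise specifies.

```python
def all_items(num_questions=100, choices="ABCD"):
    """List every possible binary item, one per (question, choice) pair."""
    return [f"Q{q}={c}" for q in range(1, num_questions + 1) for c in choices]


def binarize_answers(answers, choices="ABCD"):
    """Convert one student's answers into a transaction (a set of items).

    `answers` holds one entry per question; entries that are not a valid
    choice (for example None for a skipped question) add no item.
    """
    transaction = set()
    for q, answer in enumerate(answers, start=1):
        if answer is not None and answer in choices:
            transaction.add(f"Q{q}={answer}")
    return transaction


# Hypothetical student who answered three questions and skipped the third.
example = binarize_answers(["A", "C", None, "D"])
print(sorted(example))
print(len(all_items()))
```

Each student then becomes one transaction, and each item is present or absent, which is the asymmetric binary form that association analysis usually expects.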
- Distinguish between noise and outliers. Be sure to consider the following questions.
- Is noise ever interesting or desirable? Outliers?
- Can noise objects be outliers?
- Are noise objects always outliers?
- Are outliers always noise objects?
- Can noise make a typical value into an unusual one, or vice versa?
- The following attributes are measured for members of a herd of
Asian elephants: weight, height, tusk length, trunk length, and ear
area. Based on these measurements, what sort of similarity measure
would you use to compare or group these elephants?
Justify your answer and explain any special circumstances.
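Because the five measurements use very different units and ranges, one common preparatory step is to standardize each attribute before computing distances. The sketch below does this with z-scores followed by Euclidean distance. It is only one candidate approach (correlation or Mahalanobis distance are alternatives worth weighing), and the herd values are invented for illustration.

```python
import math
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit (population) std deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical elephants: weight (kg), height (m), tusk length (cm),
# trunk length (cm), ear area (m^2).
herd = [
    [4000.0, 2.7, 150.0, 180.0, 0.9],
    [3000.0, 2.4, 0.0, 170.0, 0.8],
    [5200.0, 3.0, 170.0, 200.0, 1.0],
]

z = standardize(herd)
print(euclidean(z[0], z[1]), euclidean(z[0], z[2]))
```

Without the standardization step, weight would dominate the distance simply because its numeric range is the largest.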
- You are given a set of m objects that is divided into K groups,
where the i-th group is of size m_i. If the goal is to obtain a
sample of size n < m, what is the difference between the following
two sampling schemes? (Assume sampling with replacement.)
- We randomly select n × m_i / m elements from each group.
- We randomly select n elements from the data set, without regard for
the group to which an object belongs.
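To experiment with the two schemes, the sketch below assumes the first one draws n × m_i / m elements from group i (proportional allocation), which is an interpretation of the exercise rather than something confirmed here. The function names, group labels, and sizes are hypothetical; the sizes are chosen so that n × m_i / m is a whole number.

```python
import random
from collections import Counter


def stratified_sample(groups, n, rng):
    """Draw round(n * m_i / m) elements, with replacement, from each group."""
    m = sum(len(members) for members in groups.values())
    return {
        name: rng.choices(members, k=round(n * len(members) / m))
        for name, members in groups.items()
    }


def simple_random_sample(groups, n, rng):
    """Draw n (group, object) pairs with replacement, ignoring group membership."""
    labelled = [(name, obj) for name, members in groups.items() for obj in members]
    return rng.choices(labelled, k=n)


rng = random.Random(0)  # fixed seed so the run is repeatable
groups = {
    "g1": list(range(0, 600)),
    "g2": list(range(600, 900)),
    "g3": list(range(900, 1000)),
}

strat = stratified_sample(groups, 100, rng)
simple = simple_random_sample(groups, 100, rng)

print({name: len(s) for name, s in strat.items()})  # fixed per-group counts
print(Counter(name for name, _ in simple))          # counts vary from run to run
```

Comparing the two printed summaries over several seeds shows which scheme fixes the number of objects taken from each group and which only matches it on average.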
- Explain why computing the proximity between two attributes is
often simpler than computing the similarity between two objects.
- Describe how you would create visualizations to display information that
describes each of the four types of systems listed at the end of this exercise.
For each system, be sure to address the following issues:
- Representation. How will you map objects, attributes, and relationships to visual elements?
- Arrangement. Are there any special considerations that need to be
taken into account with respect to how visual elements are displayed?
Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.
- Selection. How will you handle a large number of attributes and data objects?
- Computer networks. Be sure to include both the static aspects of the
network, such as connectivity, and the dynamic aspects, such as traffic.
- The distribution of specific plant and animal species around the world for a specific moment in time.
- The use of computer resources, such as processor time, main memory,
and disk, for a set of benchmark database programs.
- The change in occupation of workers in a particular country over the
last thirty years. Assume that you have yearly information about each
person that also includes gender and level of education.