You have seen several ways of finding a typical or central element in a data collection, and
several ways of measuring how spread out the data is. Which statistics best summarize the
important facts about your data depends on where the data comes from. In particular, it depends
on how important the extreme values, or outliers, are. We’ll now look at some examples of this.
Test scores
The six students in a class got the following scores on a
test (out of 100):
Find the median, quartiles, and
interquartile range of these test scores.
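One way to carry out this kind of calculation is sketched below in Python. The scores in the sketch are hypothetical stand-ins, not the class's actual scores, and the function name `quartile_summary` is made up for illustration. The sketch also assumes the convention that the first and third quartiles are the medians of the lower and upper halves of the sorted data; other conventions exist and can give slightly different values.

```python
from statistics import median

def quartile_summary(values):
    """Return (Q1, median, Q3, IQR) using the median-of-halves convention."""
    if len(values) < 2:
        raise ValueError("need at least two values to form two halves")
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the middle position
    upper = data[-half:]   # values above the middle position
    q1, q3 = median(lower), median(upper)
    return q1, median(data), q3, q3 - q1

# Hypothetical scores (assumption): NOT the scores from this lesson.
scores = [55, 62, 70, 74, 81, 90]
print(quartile_summary(scores))
```

With an even number of values, the median is the average of the two middle values, and each half contains exactly half of the data. With an odd number of values, this convention leaves the middle value out of both halves.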
What is the mean score on the test?
The squared distance from the mean of each
student’s test score is shown in the table below.
Student | Score | Squared deviation between score and mean
|
What is the variance of the test scores? (That is, the mean of the squared deviations
computed in the table above.) Round your answer to two decimal places.
What is the standard deviation of the test scores? (That is, the
square root of the variance.) Round your answer to two decimal places.
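The definitions used in the last few questions (the variance is the mean of the squared deviations, and the standard deviation is its square root) can be written out as a short Python sketch. The scores are hypothetical, not the ones from this lesson, and the function names are chosen only for illustration.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean: the sum divided by the number of values."""
    return sum(values) / len(values)

def population_variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    m = mean(values)
    return mean([(x - m) ** 2 for x in values])

def population_std(values):
    """Square root of the variance."""
    return sqrt(population_variance(values))

# Hypothetical scores (assumption): NOT the scores from this lesson.
scores = [55, 62, 70, 74, 81, 90]
print(round(population_variance(scores), 2), round(population_std(scores), 2))
```

Note that this divides by the number of values, matching the definition above. Python's standard `statistics.pvariance` and `statistics.pstdev` compute the same quantities, while `statistics.variance` and `statistics.stdev` divide by one less than the number of values and so give slightly larger results.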
We want to use statistics to understand how hard the test was for most students.
We’d like to know a typical or central score, and also how spread out the scores are, due to the
difficulty of the test. However, Fiona was sick during the test and had to leave early, which is
why she got such a low score. So her score doesn’t say very much about how hard the test
was.
The scores of all the students except Fiona are plotted to the left and shown in the table
below.
Find the median, quartiles, and interquartile range of the test scores other than
Fiona’s.
What is the mean of all the test scores other than Fiona’s?
The squared distance from the mean of each
student’s test score is shown in the table below.
Student | Score | Squared deviation between score and mean
|
What is the variance of these test scores?
What is the standard deviation of these test scores? Round your answer
to two decimal places.
Before removing Fiona’s score, you computed the median to be
79 and the mean to be 71. Which gives a better estimate of a typical score
after Fiona’s score is removed: the median of all the scores, or the mean of
all the scores?
Because Fiona was sick, the teacher allows her to take a makeup test. The table
below shows you the test scores of the six students, including Fiona’s makeup exam.
Find the median, quartiles, and interquartile range of the test scores after Fiona’s
makeup exam.
The means, standard deviations, medians, and interquartile ranges you’ve computed for the
original scores, for the scores without Fiona, and for the scores after the makeup exam are
summarized in the table below. The mean and standard deviation of the test scores after
Fiona’s makeup exam have also been filled in for you.
Statistic | Value before makeup exam | Value without Fiona | Value after makeup exam
|
Which changed more when Fiona’s original score was removed: the
median or the mean?
Which changes more when Fiona’s original score is replaced by her
makeup exam score: the median or the mean?
Which changed more when Fiona’s original score was removed: the
interquartile range or the standard deviation?
Which changes more when Fiona’s original score is replaced by her
makeup exam score: the interquartile range or the standard deviation?
A data value that lies far from the usual pattern of values (like Fiona’s original
score) is called an outlier. Test scores in a class are an example of a data collection
where outliers may not accurately reflect a test’s difficulty: those scores are likely to
come from students who were absent, sick, or otherwise not doing as well as they could.
If you want to summarize test scores using statistics that aren’t affected as much by
outliers, which are better to use?
A: The median and interquartile range
B: The mean and standard deviation
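If you’d like to experiment with how a single outlier moves each statistic, here is a small Python sketch. The scores are hypothetical and are not the scores from this lesson; the low value of 12 simply plays the role of an outlier.

```python
from statistics import mean, median

# Hypothetical scores (assumption): NOT the scores from this lesson.
typical = [70, 74, 78, 82, 86]
with_outlier = typical + [12]   # one score far below the rest

# How far each statistic moves when the outlier is included.
mean_shift = abs(mean(with_outlier) - mean(typical))
median_shift = abs(median(with_outlier) - median(typical))
print("mean moved by", mean_shift)
print("median moved by", median_shift)
```

Try changing the outlier to an even more extreme value and running the sketch again to see which statistic responds.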
Flooding damage
In the previous questions, we looked at a situation where it was better to use
statistics that weren’t strongly affected by outliers. We’ll now look at a different
situation, where the opposite is true.
The table below shows how many billions of dollars of flooding damage occurred in the United
States in each year from 2001 through 2010. This data is also graphed to the left.
Year | Flooding damage (billions of dollars)
|
This data collection has one outlier. In which year did that outlier
occur?
Is that outlier much larger than the rest of the data
collection, or much smaller than the rest of the data collection?
Sort the costs of flooding damage in each year from lowest to highest.
What are the median, quartiles, and interquartile range of the costs of flooding damage?
First quartile |
Median |
Third quartile |
Interquartile range |
What was the total amount of flooding damage over all the years given?
(Use scratch paper.)
| billion dollars |
What was the mean amount of flooding damage over the years given?
| billion dollars |
The squared distance from the mean of the amount of flooding damage in each year is given in
the table below, rounded to one decimal place.
Year | Flooding damage (billions of dollars) | Squared distance from the mean
|
What is the variance of the flooding damage? (That is, the mean of
the squared distances in the table above.)
What is the standard deviation of the flooding damage? (That is,
the square root of the variance.) Round to two decimal places.
| billion dollars |
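In symbols, the mean, variance, and standard deviation used in the last three questions can be written as follows. The notation is an assumption introduced here for illustration: $x_1, \dots, x_n$ stand for the yearly amounts of flooding damage, $n$ for the number of years, $\bar{x}$ for the mean, and $\sigma$ for the standard deviation.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
\sigma = \sqrt{\sigma^2}
\]
```

The variance divides by $n$ because it is defined here as the mean of the squared distances.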
Let’s use these data values to get an idea of how much money should be saved to pay for
repairing flooding damage in the next ten years. Suppose everyone decides to save enough money
to repair the damage from one moderately bad year, plus nine typical years. We want to know
which statistics would be best to use in this calculation.
If you think the median and quartiles are most useful, you might
recommend saving the third quartile of the cost for one year, and the median cost for each of
the other nine years. How much money would then be saved in total?
Would that have been enough to pay for the total flooding damage in
the years 2001–2010 (112.3 billion dollars)?
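The quartile-based savings plan described above can be sketched in Python as follows. The yearly costs are hypothetical, not the flooding data from this lesson, and the function name `quartile_plan` is made up for illustration. The third quartile is taken to be the median of the upper half of the sorted data, which is one common convention.

```python
from statistics import median

def quartile_plan(costs):
    """Savings for one moderately bad year (Q3) plus nine typical years (median)."""
    data = sorted(costs)
    half = len(data) // 2
    q3 = median(data[-half:])   # median of the upper half
    return q3 + 9 * median(data)

# Hypothetical yearly costs in billions of dollars (assumption),
# with one year far larger than the rest.
costs = [2, 3, 3, 4, 4, 5, 5, 6, 8, 60]
print(quartile_plan(costs), "saved versus", sum(costs), "in actual damage")
```

Because neither the median nor the third quartile changes when the largest value grows, this plan saves the same amount no matter how bad the worst year was.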
If you think the mean and standard deviation are most useful, you might
recommend saving the mean plus the standard deviation of the cost for one year, and the
mean cost for each of the other nine years. How much money would then be saved in total?
Would that have been enough to pay for the total flooding damage in
the years 2001–2010?
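The savings plan based on the mean and standard deviation can be sketched the same way. Again, the yearly costs are hypothetical rather than the flooding data from this lesson, and the function name `mean_plan` is made up for illustration.

```python
from math import sqrt

def mean_plan(costs):
    """Savings for one bad year (mean + standard deviation) plus nine mean years."""
    m = sum(costs) / len(costs)
    # Standard deviation: square root of the mean squared distance from the mean.
    sd = sqrt(sum((c - m) ** 2 for c in costs) / len(costs))
    return (m + sd) + 9 * m

# Hypothetical yearly costs in billions of dollars (assumption),
# with one year far larger than the rest.
costs = [2, 3, 3, 4, 4, 5, 5, 6, 8, 60]
print(round(mean_plan(costs), 2), "saved versus", sum(costs), "in actual damage")
```

For a ten-year record, ten times the mean equals the total, so this plan always amounts to the total damage plus one standard deviation.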
If you want to summarize the cost of flooding using statistics that treat outliers as
important values, which are better to use?
A: The median and quartiles
B: The mean and standard deviation