Analyzing Outliers and Choosing Statistics

You have seen several ways of finding a typical or central element in a data collection, and several ways of measuring how spread out the data is. Which statistics best summarize the important facts about your data depends on where the data comes from. In particular, it depends on the importance of extreme values, or outliers. We’ll now look at some examples of this.


Test scores

The six students in a class got the following scores on a test (out of 100):

Student | Score

Find the median, quartiles, and interquartile range of these test scores.
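As a rough illustration of the method, the Python sketch below computes the median, quartiles, and interquartile range. It assumes the convention that each quartile is the median of the lower or upper half of the sorted data; other conventions exist and can give slightly different quartiles. The scores are invented for illustration and are not the values from the class table, and `quartile_summary` is a hypothetical helper name.

```python
from statistics import median


def quartile_summary(values):
    """Return (Q1, median, Q3, IQR) using the median-of-halves convention.

    Assumption: when the number of values is odd, the middle value is
    left out of both halves.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form two halves")
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # the smallest `half` values
    upper = data[-half:]   # the largest `half` values
    q1, q3 = median(lower), median(upper)
    return q1, median(data), q3, q3 - q1


# Hypothetical scores (NOT the lesson's table): five ordinary scores, one very low.
scores = [12, 72, 78, 80, 88, 96]
q1, med, q3, iqr = quartile_summary(scores)
print("Q1:", q1, "median:", med, "Q3:", q3, "IQR:", iqr)
```

The same function can be applied to any list of scores, including a list with one score left out.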

What is the mean score on the test?

The squared distance from the mean of each student’s test score is shown in the table below.

Student | Score | Squared deviation between score and mean

What is the variance of the test scores? (That is, the mean of the squared deviations computed in the table above.) Give your answer to two decimal places.

What is the standard deviation of the test scores? (That is, the square root of the variance.) Give your answer to two decimal places.
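The two definitions above (variance as the mean of the squared deviations, standard deviation as its square root) can be sketched in Python as follows. The scores are invented for illustration and are not the values from the class table. Note that this divides by the number of values, as the text describes, rather than by one less than that number.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean: the total divided by the number of values."""
    return sum(values) / len(values)


def variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))


# Hypothetical scores (NOT the lesson's table).
scores = [12, 72, 78, 80, 88, 96]
print("mean:", round(mean(scores), 2))
print("variance:", round(variance(scores), 2))
print("standard deviation:", round(standard_deviation(scores), 2))
```

Python's standard `statistics` module offers `pvariance` and `pstdev`, which use the same divide-by-n definition.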

We want to use statistics to understand how hard the test was for most students. We’d like to know a typical or central score, and also how spread out the scores are, due to the difficulty of the test. However, Fiona was sick during the test and had to leave early, which is why she got such a low score. So her score doesn’t say very much about how hard the test was.

The scores of all the students except Fiona are plotted to the left and shown in the table below.

Student | Score

Find the median, quartiles, and interquartile range of the test scores other than Fiona’s.

What is the mean of all the test scores other than Fiona’s?

The squared distance from the mean of each student’s test score is shown in the table below.

Student | Score | Squared deviation between score and mean

What is the variance of these test scores?

What is the standard deviation of these test scores? Give your answer to two decimal places.

Before removing Fiona’s score, you computed the median to be 79 and the mean to be 71. Which gives a better estimate of a typical score after Fiona’s score is removed: the median of all the scores, or the mean of all the scores?

Because Fiona was sick, the teacher allows her to take a makeup test. The table below shows you the test scores of the six students, including Fiona’s makeup exam.

Student | Score

Find the median, quartiles, and interquartile range of the test scores after Fiona’s makeup exam.

The mean, standard deviation, medians, and interquartile ranges you’ve computed in the last two questions and this one are summarized in the table below. The mean and standard deviation of the test scores after Fiona’s makeup exam have also been filled in for you.

Statistic | Value before makeup exam | Value without Fiona | Value after makeup exam

Which changed more when Fiona’s original score was removed: the median or the mean?
Which changes more when Fiona’s original score is replaced by her makeup exam score: the median or the mean?
Which changed more when Fiona’s original score was removed: the interquartile range or the standard deviation?
Which changes more when Fiona’s original score is replaced by her makeup exam score: the interquartile range or the standard deviation?
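One way to explore comparisons like these is to compute each statistic with and without the extreme value and look at how far it moves. The sketch below does this in Python for an invented set of scores (not the values from the class table), using the median-of-halves convention for quartiles as an assumption. The names `iqr` and `summary` are hypothetical helpers.

```python
from statistics import mean, median, pstdev


def iqr(values):
    """Interquartile range, taking each quartile as the median of a half."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[-half:]) - median(data[:half])


def summary(values):
    """Collect the four statistics discussed in the text."""
    return {
        "median": median(values),
        "mean": mean(values),
        "IQR": iqr(values),
        "std dev": pstdev(values),  # divides by n, like the text's definition
    }


# Hypothetical scores (NOT the lesson's table); 12 plays the role of the outlier.
with_outlier = [12, 72, 78, 80, 88, 96]
without_outlier = [s for s in with_outlier if s != 12]

before, after = summary(with_outlier), summary(without_outlier)
changes = {name: abs(after[name] - before[name]) for name in before}
for name, change in changes.items():
    print(f"{name}: changed by {change:.2f}")
```

Replacing the low score with a different value, rather than removing it, can be explored the same way by building a third list.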

A data value that is far away from the normal pattern of values (like Fiona’s original score) is called an outlier. The test scores of students in a class provide an example of a data collection where outliers may not accurately reflect a test’s difficulty: those scores are likely to be from students who are absent, sick, or otherwise not doing as well as they could.

If you want to summarize test scores using statistics that aren’t affected as much by outliers, which are better to use?

A: The median and interquartile range
B: The mean and standard deviation

Flooding damage

In the previous questions, we looked at a situation where it was better to use statistics that weren’t strongly affected by outliers. We’ll now look at a different situation, where the opposite is true.

The table below shows how many billions of dollars of flooding damage occurred in the United States in each year from 2001 to 2010. This data is also graphed to the left.

Year | Flooding damage (billions of dollars)

This data collection has one outlier. In which year did that outlier occur?
Is that outlier much larger than the rest of the data collection, or much smaller than the rest of the data collection?

Sort the costs of flooding damage in each year from lowest to highest.

What are the median, quartiles, and interquartile range of the costs of flooding damage?

First quartile
Median
Third quartile
Interquartile range

What was the total amount of flooding damage over all the years given, in billions of dollars? (Use scratch paper.)

What was the mean amount of flooding damage over the years given, in billions of dollars?

The squared distance from the mean of the amount of flooding damage in each year is given in the table below, rounded to the nearest tenth.

Year | Flooding damage (billions of dollars) | Squared distance from the mean

What is the variance of the flooding damage? (That is, the mean of the squared distances in the table above.)

What is the standard deviation of the flooding damage, in billions of dollars? (That is, the square root of the variance.) Round to two decimal places.

Let’s use these data values to get an idea of how much money should be saved to pay for repairing flooding damage in the next ten years. Suppose everyone decides to save enough money to repair the damage from one moderately bad year, plus nine typical years. We want to know which statistics would be best to use in this calculation.

If you think the median and quartiles are most useful, you might recommend saving the third quartile of the cost for one year, and the median cost for each of the other nine years. How much money would then be saved in total, in billions of dollars?

Would that have been enough to pay for the total flooding damage in the years 2001–2010 (112.3 billion dollars)?

If you think the mean and standard deviation are most useful, you might recommend saving the mean plus the standard deviation of the cost for one year, and the mean cost for each of the other nine years. How much money would then be saved in total, in billions of dollars?

Would that have been enough to pay for the total flooding damage in the years 2001–2010?
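The two savings plans can be compared with a short Python sketch. The ten damage values below are invented: they were chosen only so that they contain one large outlier and sum to the 112.3 billion dollars quoted in the text, and they are not the real yearly figures. The quartile convention (median of each half) is also an assumption.

```python
from statistics import mean, median, pstdev

# Hypothetical yearly damage in billions of dollars (NOT the lesson's table).
damage = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 80.3]

data = sorted(damage)
half = len(data) // 2
third_quartile = median(data[-half:])  # median of the upper half

total = sum(damage)

# Plan 1: third quartile for one bad year, median for the other nine years.
median_plan = third_quartile + 9 * median(data)

# Plan 2: mean plus standard deviation for one bad year, mean for the other nine.
mean_plan = (mean(damage) + pstdev(damage)) + 9 * mean(damage)

print(f"total damage: {total:.2f}")
print(f"median plan:  {median_plan:.2f} (enough: {median_plan >= total})")
print(f"mean plan:    {mean_plan:.2f} (enough: {mean_plan >= total})")
```

For ten years of data, ten times the mean is exactly the total, so the second plan always equals the total plus one standard deviation. The first plan ignores how large the outlier is, which is why it can fall far short.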

If you want to summarize the cost of flooding using statistics that treat outliers as important values, which are better to use?

A: The median and quartiles
B: The mean and standard deviation