Box Plots

In the last lesson, we looked at ways to find a “typical” or “average” value in a collection of data. We will now look at one way to measure the “spread” of a data collection using quartiles, and a way to visually summarize the data in a box plot. This will also allow us to compare two different collections of data, such as comparing the amount of rainfall in two different cities to see which of the cities is rainier.


Quartiles

The dot plot to the left shows the number of days that it rained in Nashville, Tennessee in 41 recent years.

If we are looking at 41 data values, which of those data values is the median? (Remember that, if $n$ is odd, the median of $n$ data values is the $$\frac{n+1}{2}$$th smallest data value.) The ___st smallest
What is the median number of days that it rained in Nashville in the years plotted?

The median of our data values gives a typical number of rainy days in a year in Nashville, since it splits the data collection into equal halves. Now we’ll estimate how much the number of rainy days in a year tends to vary, by splitting the data collection into quarters.

The second dot plot to the left shows only the driest half of the years from the top dot plot. Because we started with 41 values, an odd number, we’re looking at the $$\frac{41-1}{2}=20$$ smallest data values. (Other books or programs may use a different rule to split data collections in half when they contain an odd number of values.)
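The splitting rule can be sketched in a few lines of Python. This is only an illustration: the function name `split_halves` and the sample data are made up, and other tools may split odd-sized collections differently.

```python
def split_halves(values):
    """Split a data collection into its lower and upper halves.

    Follows the rule used in this lesson: when the number of values is odd,
    the middle value (the median) belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2  # n/2 if n is even, (n-1)/2 if n is odd
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return lower, upper


# Hypothetical counts of rainy days in 7 years (not real data).
rainy_days = [98, 110, 104, 121, 93, 115, 107]
lower, upper = split_halves(rainy_days)
print(lower)  # [93, 98, 104]
print(upper)  # [110, 115, 121]
```

With 7 values, each half holds 3 values and the middle value, 107, is left out of both.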

What is the median number of rainy days for just the driest 20 years? (Remember to use the definition of median for an even number of data values.)

Click to see a plot of the 20 years in Nashville when it rained the most.

What is the median number of rainy days in the wettest 20 years?

The median value of the lower half of a data collection is about one quarter of the way into the entire collection, so it is called the first quartile of the collection. The median value of the upper half of a data collection is about three quarters of the way into the entire collection, so it is called the third quartile of the collection.

What are the first and third quartiles of the number of rainy days in Nashville in the years shown in the top dot plot?
First quartile:
Third quartile:

The difference between the third and first quartiles gives a rough way of measuring how spread out a data collection is. It is called the interquartile range of the data collection.

What is the interquartile range of the number of rainy days in a year in Nashville?

The first quartile of a collection of $n$ data values is the median of the lower half of those data values: the median of the $$\frac{n}{2}$$ smallest values if $n$ is even, or of the $$\frac{n-1}{2}$$ smallest if $n$ is odd.

The third quartile of a collection of $n$ data values is the median of the upper half of those data values: the median of the $$\frac{n}{2}$$ largest values if $n$ is even, or of the $$\frac{n-1}{2}$$ largest if $n$ is odd.

The interquartile range of a collection of data values is the difference between its third quartile and its first quartile.
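These three definitions translate fairly directly into code. The sketch below is one possible Python version, assuming at least two data values. The names `quartiles` and `interquartile_range` and the sample data are hypothetical. Library routines such as Python's `statistics.quantiles` use other rules and may give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (first quartile, third quartile) using this lesson's rule.

    The lower and upper halves each hold n/2 values if n is even,
    or (n-1)/2 values if n is odd (the middle value is left out).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])                  # median of the lower half
    q3 = median(ordered[len(ordered) - half:])   # median of the upper half
    return q1, q3


def interquartile_range(values):
    """Return the third quartile minus the first quartile."""
    q1, q3 = quartiles(values)
    return q3 - q1


# Hypothetical data collection with 9 values (not real data).
data = [12, 15, 9, 20, 17, 11, 14, 22, 18]
print(quartiles(data))            # (11.5, 19.0)
print(interquartile_range(data))  # 7.5
```

Here the sorted values are 9, 11, 12, 14, 15, 17, 18, 20, 22. The lower half is 9, 11, 12, 14 and the upper half is 17, 18, 20, 22, so the quartiles are 11.5 and 19.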

Box plots

The dot plot to the left shows the number of days it rained in Memphis, Tennessee in each of the 42 years from 1973 to 2014.

What is the median number of days it rained in Memphis?
If we are looking at 42 data values, how many of those data values lie in the lower half (the driest half)?
Which data value gives the first quartile of the rainy day data (the median of the lower half) for Memphis? The ___th lowest
What is the first quartile of the rainy day data for Memphis?
How many data values lie in the upper half?

When finding the third quartile (the median of the upper half), it’s best to count from the top, so you don’t have to look at the lower half at all.

Which data value gives the third quartile of the rainy day data? The ___th highest
What is the third quartile of the rainy day data for Memphis?
What is the interquartile range of the rainy day data for Memphis?

If you know the median, the first and third quartiles, and the smallest and largest values of the number of rainy days in Memphis over a long time period, that’s already enough to give a pretty good sense of how many rainy days you would see in a typical year.

Because of this, we’ll often summarize those five numbers in a box plot. A box plot has a box going from the first to the third quartile of a data collection, so the width of the box gives the interquartile range. A line is drawn within the box to show the median of the collection, and there are “whiskers” drawn out of the edges of the box to show the extremes (least and greatest values) of the collection. A box plot of the number of rainy days in Memphis in each year from 1973 to 2014 is shown to the left, below the dot plot of that data.
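The five numbers that a box plot displays can be computed with a short function. The following is a minimal Python sketch under this lesson's quartile rule, assuming at least two data values. The name `five_number_summary` and the data are invented for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return the five numbers a box plot shows, using this lesson's rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return {
        "min": ordered[0],                            # end of the left whisker
        "q1": median(ordered[:half]),                 # left edge of the box
        "median": median(ordered),                    # line inside the box
        "q3": median(ordered[len(ordered) - half:]),  # right edge of the box
        "max": ordered[-1],                           # end of the right whisker
    }


# Hypothetical data collection with 8 values (not real data).
data = [31, 25, 40, 28, 36, 22, 33, 45]
summary = five_number_summary(data)
print(summary)
# {'min': 22, 'q1': 26.5, 'median': 32.0, 'q3': 38.0, 'max': 45}
print(summary["q3"] - summary["q1"])  # width of the box: 11.5
```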

What is the least number of days it rained in Memphis in any year in the box plot?
What is the greatest number of days it rained in Memphis in any year in the box plot?

Click to see a box plot of the number of rainy days in Los Angeles, California over this same time period.

What is the approximate median of the rainy day data for Los Angeles? (That is, the approximate location of the line within the box.)
What are the approximate first and third quartiles of the rainy day data for Los Angeles? (That is, the left and right edges of the box.)
First quartile:
Third quartile:
What is the approximate interquartile range of the rainy day data for Los Angeles? (That is, the approximate width of the box.)

Comparing two box plots

Box plots are often useful for comparing two different data collections. In earlier questions, we drew box plots from left to right so they would line up with our dot plots. In the next few questions, we’ll draw them from bottom to top. (So the interquartile range is now given by the height of the box rather than the width.)

The rainy days in each year between 1973 and 2014 in both Memphis and Los Angeles are summarized in box plots to the left.

Is the bottom of the Memphis plot above, below, or at about the same height as the top of the Los Angeles plot?
In the driest year in Memphis between 1973 and 2014, did it rain on more days, fewer days, or about the same number of days as the wettest year in Los Angeles over that time?
In general, which city got more rain between 1973 and 2014: Memphis or Los Angeles?

For Memphis, the first quartile of the rainy day data is 65, the median is 74, and the third quartile is 82. This means that it rained on fewer than 65 days in about a quarter of the years, fewer than 74 days in about half of the years, and fewer than 82 days in about three quarters of the years.
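To see why the quartiles and median mark off roughly one quarter, one half, and three quarters of the data, it can help to count directly. The Python sketch below uses a made-up data collection, not the Memphis data, and a hypothetical helper named `fraction_below`. The fractions come out exact here only because the sample is small and has no repeated values.

```python
def fraction_below(values, threshold):
    """Return the fraction of data values that are less than threshold."""
    return sum(1 for v in values if v < threshold) / len(values)


# Hypothetical data collection with 8 values (not real data).
# By this lesson's rule its first quartile is 26.5, its median is 32,
# and its third quartile is 38.
data = [22, 25, 28, 31, 33, 36, 40, 45]

print(fraction_below(data, 26.5))  # 0.25 (below the first quartile)
print(fraction_below(data, 32))    # 0.5  (below the median)
print(fraction_below(data, 38))    # 0.75 (below the third quartile)
```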

In Los Angeles, the first quartile of the rainy day data is 16, the median is 19.5, and the third quartile is 29.

How often did it rain on fewer than 16 days in Los Angeles: in about one quarter of the years, in about half of the years, or in about three quarters of the years?
How often did it rain on fewer than 19.5 days in Los Angeles: in about one quarter of the years, in about half of the years, or in about three quarters of the years?
How often did it rain on fewer than 29 days in Los Angeles: in about one quarter of the years, in about half of the years, or in about three quarters of the years?

It rains so much more in Memphis than in Los Angeles that you probably don’t need a box plot to compare them. In this question, we’ll look at two cities with much more similar amounts of rainfall (Philadelphia, Pennsylvania and Indianapolis, Indiana), and see if we can still figure out which one is usually rainier.

Which city had the rainiest year: Philadelphia or Indianapolis?
Which city had the driest year: Philadelphia or Indianapolis?
Would you be able to tell which of these two cities was rainier by looking just at their rainiest and least rainy years?

The median number of rainy days in Philadelphia (shown by the line within the box) is close to one of the values plotted for Indianapolis. Is it closest to Indianapolis’s first quartile (the bottom of the box), median (the line within the box), or third quartile (the top of the box)?

The median of the rainy day data for Philadelphia is 72 days.

How often did it rain on fewer than 72 days in Philadelphia: in about one quarter of the years, in about half of the years, or in about three quarters of the years?
How often did it rain on fewer than 72 days in Indianapolis: in about one quarter of the years, in about half of the years, or in about three quarters of the years?
Based on your last two answers, in which city is there more likely to be fewer than 72 days of rain in a year: Philadelphia or Indianapolis?

The median number of rainy days in Indianapolis (shown by the line within the box) is close to one of the values plotted for Philadelphia. Is it closest to Philadelphia’s first quartile, median, or third quartile?

The median of the rainy day data for Indianapolis is 78.5 days.

How often did it rain on fewer than 78.5 days in Indianapolis: in about one quarter of the years, in about half of the years, or in about three quarters of the years?
How often did it rain on fewer than 78.5 days in Philadelphia: in about one quarter of the years, in about half of the years, or in about three quarters of the years?
Based on your last two answers, in which city is there more likely to be fewer than 78.5 days of rain in a year: Philadelphia or Indianapolis?

Suppose that you would find a year unusually rainy if it rains more frequently than it has in at least three quarters of the years you’ve experienced, and unusually dry if it rains less frequently than it has in at least three quarters of the years you’ve experienced.
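One rough way to apply this rule is to compare a year's number of rainy days with the quartiles of the city you are used to: above the third quartile counts as unusually rainy, and below the first quartile counts as unusually dry. The Python sketch below makes that approximation explicit. The function name `how_a_year_feels` is hypothetical, and the quartiles and medians are the Memphis and Los Angeles values quoted earlier in this lesson.

```python
def how_a_year_feels(rainy_days, home_q1, home_q3):
    """Classify a year relative to the quartiles of a familiar city.

    Assumption: "more than in at least three quarters of the years" is
    approximated by "greater than the third quartile", and "less than in
    at least three quarters of the years" by "less than the first quartile".
    """
    if rainy_days > home_q3:
        return "unusually rainy"
    if rainy_days < home_q1:
        return "unusually dry"
    return "neither"


# Someone used to Memphis (quartiles 65 and 82) sees a median
# Los Angeles year (19.5 rainy days).
print(how_a_year_feels(19.5, home_q1=65, home_q3=82))  # unusually dry

# Someone used to Los Angeles (quartiles 16 and 29) sees a median
# Memphis year (74 rainy days).
print(how_a_year_feels(74, home_q1=16, home_q3=29))    # unusually rainy
```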

If you lived for a long time in Philadelphia and moved to Indianapolis, would a new typical (median) year seem unusually rainy, unusually dry, or neither?
If you lived for a long time in Indianapolis and moved to Philadelphia, would a new typical (median) year seem unusually rainy, unusually dry, or neither?
Which generally gets more rain: Philadelphia or Indianapolis?

Box plots of the number of rainy days in a year in both Detroit, Michigan and Jacksonville, Florida are shown to the left.

Which city has a smaller median number of rainy days: Detroit or Jacksonville?
Is the median of the rainy day data for Detroit less than the first quartile of the data for Jacksonville, between the first and third quartiles of the data for Jacksonville, or greater than the third quartile of the data for Jacksonville?
Is the median of the rainy day data for Jacksonville less than the first quartile of the data for Detroit, between the first and third quartiles of the data for Detroit, or greater than the third quartile of the data for Detroit?
If you live for a long time in Detroit and then move to Jacksonville, would a new typical (median) year seem unusually rainy, unusually dry, or neither?
If you live for a long time in Jacksonville and then move to Detroit, would a new typical (median) year seem unusually rainy, unusually dry, or neither?
Given the answers to your last two questions, does Detroit tend to be a lot rainier than Jacksonville, a lot less rainy, or about as rainy?
Which box is taller: the one showing the quartiles of the Detroit distribution, or the one showing the quartiles of the Jacksonville distribution?
In which city does the number of rainy days have a larger interquartile range: Detroit or Jacksonville?
Is the number of rainy days more predictable in Detroit or in Jacksonville? That is, in which city are you more likely to see a year with a number of rainy days that’s close to the median number?

The interquartile range is one way to measure how spread out a collection of data is. Another way to measure spread is to see how far data values typically are from a single central value. You will learn about this in another lesson.