In the last lesson, we looked at ways to find a “typical” or “average” value in a collection of data. We will now look at one way to measure the “spread” of a data collection using quartiles, and a way to visually summarize the data in a box plot. This will also allow us to compare two different collections of data, such as comparing the amount of rainfall in two different cities to see which of the cities is rainier.
The dot plot to the left shows the number of days that it rained in Nashville, Tennessee in each of 41 recent years.
The median of our data values gives a typical number of rainy days in a year in Nashville, since it splits the data collection into equal halves. Now we’ll estimate how much the number of rainy days in a year tends to vary, by splitting the data collection into quarters.
The second dot plot to the left shows only the driest half of the years from the top dot plot. Because we started with 41 values, an odd number, we leave out the middle value (the median itself) and look at the $$(41-1)/2=20$$ smallest data values. (Other books or programs may use a different rule to split data collections in half when they contain an odd number of values.)
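The splitting rule described here can be sketched in a few lines of Python. This is only an illustration: the function name `split_halves` and the 41 made-up values are assumptions for the example, not the actual Nashville data.

```python
def split_halves(values):
    """Return (lower_half, upper_half) of the sorted data.

    When the number of values is odd, the middle value (the median)
    is left out of both halves, matching the rule used in this lesson.
    """
    ordered = sorted(values)
    half = len(ordered) // 2            # n/2 if n is even, (n-1)/2 if n is odd
    lower = ordered[:half]              # the `half` smallest values
    upper = ordered[len(ordered) - half:]  # the `half` largest values
    return lower, upper


# Hypothetical data: 41 made-up yearly counts (50, 51, ..., 90),
# NOT the real Nashville record.
example_counts = list(range(50, 91))
lower, upper = split_halves(example_counts)
print(len(lower), len(upper))  # 20 20
```

With 41 values, each half holds 20 values and the single middle value belongs to neither half.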
Click to see a plot of the 20 years in Nashville when it rained the most.
The median value of the lower half of a data collection is about one quarter of the way into the entire collection, so it is called the first quartile of the collection. The median value of the upper half of a data collection is about three quarters of the way into the entire collection, so it is called the third quartile of the collection.
The difference between the third and first quartiles gives a rough way of measuring how spread out a data collection is. It is called the interquartile range of the data collection.
The first quartile of a collection of $n$ data values is the median of the lower half of those data values: the median of the $$n/2$$ smallest values if $n$ is even, or of the $$(n-1)/2$$ smallest if $n$ is odd.
The third quartile of a collection of $n$ data values is the median of the upper half of those data values: the median of the $$n/2$$ largest values if $n$ is even, or of the $$(n-1)/2$$ largest if $n$ is odd.
The interquartile range of a collection of data values is the difference between its third quartile and its first quartile.
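These three definitions translate fairly directly into code. The sketch below assumes the splitting rule stated above (the median is left out of both halves when $n$ is odd). The function names and the sample data are hypothetical, chosen only for illustration.

```python
from statistics import median


def quartiles(values):
    """Return (first_quartile, third_quartile) using this lesson's rule."""
    ordered = sorted(values)
    half = len(ordered) // 2  # n/2 if n is even, (n-1)/2 if n is odd
    if half == 0:
        raise ValueError("need at least two data values")
    first = median(ordered[:half])                   # median of the lower half
    third = median(ordered[len(ordered) - half:])    # median of the upper half
    return first, third


def interquartile_range(values):
    """Return the third quartile minus the first quartile."""
    first, third = quartiles(values)
    return third - first


# Hypothetical sample of 9 values (not taken from the rainfall data).
sample = [3, 7, 8, 5, 12, 14, 21, 13, 18]
print(quartiles(sample))            # (6.0, 16.0)
print(interquartile_range(sample))  # 10.0
```

Library routines such as `statistics.quantiles` in Python or `numpy.percentile` follow interpolation rules of their own, so they may report slightly different quartiles for the same data.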
The dot plot to the left shows the number of days it rained in Memphis, Tennessee in each of the 42 years from 1973 to 2014.
When finding the third quartile (the median of the upper half), it’s best to count from the top, so you don’t have to look at the lower half at all.
If you know the median, the first and third quartiles, and the smallest and largest values of the number of rainy days in Memphis over a long time period, that’s already enough to give a pretty good sense of how many rainy days you would see in a typical year.
Because of this, we’ll often summarize those five numbers in a box plot. A box plot has a box going from the first to the third quartile of a data collection, so the width of the box gives the interquartile range. A line is drawn within the box to show the median of the collection, and there are “whiskers” drawn out of the edges of the box to show the extremes (least and greatest values) of the collection. A box plot of the number of rainy days in Memphis in each year from 1973 to 2014 is shown to the left, below the dot plot of that data.
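The five numbers that a box plot displays can be collected with a short helper. This is a minimal sketch under the lesson's quartile rule. The name `five_number_summary` and the sample values are assumptions made for the example.

```python
from statistics import median


def five_number_summary(values):
    """Return the five numbers a box plot is drawn from."""
    ordered = sorted(values)
    half = len(ordered) // 2  # size of each half; the median is skipped if n is odd
    if half == 0:
        raise ValueError("need at least two data values")
    return {
        "min": ordered[0],                              # end of the lower whisker
        "q1": median(ordered[:half]),                   # lower edge of the box
        "median": median(ordered),                      # line inside the box
        "q3": median(ordered[len(ordered) - half:]),    # upper edge of the box
        "max": ordered[-1],                             # end of the upper whisker
    }


# Hypothetical sample of 9 values (not the Memphis data).
sample = [3, 7, 8, 5, 12, 14, 21, 13, 18]
summary = five_number_summary(sample)
print(summary)
```

For this sample the box would run from 6 to 16 with the median line at 12, and the whiskers would reach out to 3 and 21.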
Click to see a box plot of the number of rainy days in Los Angeles, California over this same time period.
Box plots are often useful for comparing two different data collections. In earlier questions, we drew box plots from left to right so they would line up with our dot plots. In the next few questions, we’ll draw them from bottom to top. (So the interquartile range is now given by the height of the box rather than the width.)
The rainy days in each year between 1973 and 2014 in both Memphis and Los Angeles are summarized in box plots to the left.
For Memphis, the first quartile of the rainy day data is 65, the median is 74, and the third quartile is 82. This means that it rained on fewer than 65 days in about a quarter of the years, fewer than 74 days in about half of the years, and fewer than 82 days in about three quarters of the years.
In Los Angeles, the first quartile of the rainy day data is 16, the median is 19.5, and the third quartile is 29.
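As a quick check on how spread out each city's data is, the interquartile ranges can be computed from the quartiles quoted above. The dictionary layout and variable names below are just one hypothetical way to organize the numbers.

```python
# Quartiles as stated in this lesson for 1973 to 2014.
quartile_summaries = {
    "Memphis": {"q1": 65, "median": 74, "q3": 82},
    "Los Angeles": {"q1": 16, "median": 19.5, "q3": 29},
}

# Interquartile range = third quartile - first quartile.
iqr_by_city = {
    city: summary["q3"] - summary["q1"]
    for city, summary in quartile_summaries.items()
}

for city, iqr in iqr_by_city.items():
    print(f"{city}: interquartile range = {iqr} days")
# Memphis: interquartile range = 17 days
# Los Angeles: interquartile range = 13 days
```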
It rains so much more in Memphis than in Los Angeles that you probably don’t need a box plot to compare them. In this question, we’ll look at two cities with much more similar amounts of rainfall (Philadelphia, Pennsylvania and Indianapolis, Indiana), and see if we can still figure out which one is usually rainier.
The median of the rainy day data for Philadelphia is 72 days.
The median of the rainy day data for Indianapolis is 78.5 days.
Suppose that you would find a year unusually rainy if it rains more frequently than it has in at least three quarters of the years you’ve experienced, and unusually dry if it rains less frequently than it has in at least three quarters of the years you’ve experienced.
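One rough way to apply this idea is to compare a year's count with the first and third quartiles. The sketch below is an approximation that ignores ties. The function name and the "typical" label are assumptions, and the example borrows the Memphis quartiles (65 and 82) quoted earlier purely for illustration.

```python
def describe_year(rainy_days, first_quartile, third_quartile):
    """Label a year relative to the quartiles of past years.

    A count above the third quartile is larger than roughly three
    quarters of past years; a count below the first quartile is
    smaller than roughly three quarters of them.
    """
    if rainy_days > third_quartile:
        return "unusually rainy"
    if rainy_days < first_quartile:
        return "unusually dry"
    return "typical"  # hypothetical label for everything in between


# Memphis quartiles from this lesson; the yearly counts are made up.
print(describe_year(90, 65, 82))  # unusually rainy
print(describe_year(60, 65, 82))  # unusually dry
print(describe_year(74, 65, 82))  # typical
```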
Box plots for the amount of rain in both Detroit, Michigan and Jacksonville, Florida are shown to the left.
The interquartile range is one way to measure how spread out a collection of data is. Another way to measure spread is to see how far data values typically are from a single central value. You will learn about this in another lesson.