Scatter Plots

In this lesson, we will compare pairs of numerical facts about the elements of a group, such as the location and maximum elevation of U.S. states. You will learn about some kinds of patterns that can occur when facts are compared in this way.


Clustering and outliers

Table columns: $d$, $h$, State

The map to the left shows the geographical center of the contiguous United States as well as the geographical center of most of the states (the centers of some of the smaller East Coast states are not drawn). The $d$ column of the table to the left gives the east-west distance from the geographical center of the contiguous United States to the geographical center of each state, in miles. Positive numbers mean the state is east of the center, and negative numbers mean it’s west of the center. The $\cl"red"h$ column gives the highest elevation in each state, in feet above sea level.

The points $\cl"red"{(d,h)}$ are plotted on the grid to the left. A plot like this, which shows the relationship between two different values for each element in a collection, is called a scatter plot.

Notice that most of the red points are near the bottom right corner of the grid. These correspond to states which are east of the center of the United States and have fairly low maximum elevations. Are there any states which are near the top right (east of the center of the United States, with high maximum elevations)?
Are there any states which are near the bottom left (far west of the center of the United States, with low maximum elevations)?
Give the two-letter abbreviation for the state nearest the bottom right corner. (Hint: Its $d$ value is more than 1400, and its $\cl"red"h$ value is less than 1000.)
The points in the bottom right portion of the plot have $d$ values which are greater than 100 (they are more than 100 miles east of the center of the United States) and $h$ values which are less than 8000 (their maximum elevation is less than 8000 feet). For example, Alabama (AL) is 624 miles east of the center of the United States, and has a maximum elevation of 2413 feet. Give the two-letter abbreviation for another state in the bottom right portion of the plot.

Notice that there are two points much farther to the left (west) than any of the others, corresponding to Alaska (AK) and Hawaii (HI). Points on a scatter plot which “lie outside” most of the plot (have an $x$ or $y$ coordinate which is exceptionally large or small) are called outliers.
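A short program can flag points like these. The lesson does not give an exact cutoff for "exceptionally large or small", so the sketch below borrows a common rule of thumb as an assumption: a value counts as an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles. The function name `find_outliers` and the sample distances are made up for illustration (only 624, Alabama's $d$ value, comes from the lesson).

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # first, second, third quartile
    iqr = q3 - q1                        # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical east-west distances in miles (illustrative, not the table's data).
sample_d = [-3700, -3000, -700, -500, -300, -100, 100, 300, 500, 624, 800, 1000]
outliers = find_outliers(sample_d)
print(outliers)
```

With these sample values, the two distances far to the west are flagged and the rest are not. A different cutoff `k` would make the rule stricter or looser.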

Click to see a scatter plot comparing east-west distance to highest elevation, with the Alaska and Hawaii points removed.

There is a group of red points in the top left of the grid. These are states that have high maximum elevations (more than 10000 feet) and are in the western part of the United States (more than 300 miles west of the center). Give the two-letter abbreviation for one of these states.

A group of points which lie in the same region of a scatter plot is called a cluster.

You can often use clusters to help analyze patterns. For example, there is a cluster of points in the top left of this scatter plot because there are several tall mountain ranges in the western U.S. (like the Rockies, the Sierra Nevada, and the Cascades), and these ranges are all about the same height.
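One simple way to pick out the points in a region of a scatter plot is to test each point against the region's boundaries. The sketch below uses the boundaries described for the top-left cluster (more than 300 miles west, more than 10000 feet). The helper name `in_top_left` and every point except Alabama's are invented for illustration.

```python
# Each point is (label, d in miles, h in feet). Only the AL values come from
# the lesson; the other points are hypothetical.
points = [
    ("AL", 624, 2413),
    ("W1", -900, 14400),
    ("W2", -650, 13800),
    ("W3", -1100, 12600),
    ("E1", 1200, 5300),
]

def in_top_left(d, h):
    """True if a point is more than 300 miles west and above 10000 feet."""
    return d < -300 and h > 10000

cluster = [label for label, d, h in points if in_top_left(d, h)]
print(cluster)
```

A rectangular test like this only works when you already know where the cluster is. It does not find clusters on its own.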

Positive and negative association

Table columns: $d$, $p$, State

The $d$ column of the table to the left again gives the east-west distance from the geographical center of the contiguous United States to the geographical center of each state, in miles. The $\cl"red"p$ column gives the population density of each state (the number of people per square mile). These points $\cl"red"{(d,p)}$ are plotted on the grid to the left. (Alaska and Hawaii are left off the plot to save space.)

Find the leftmost (westernmost) point which is plotted. What state in the table does this correspond to? Ignore Alaska and Hawaii, which are not plotted.
What is the population density of that state, in people per square mile?
Is that population density high or low compared to the other states? (As you can see on the scatter plot, the population densities of the states range from a little over 1 to a little over 1200.)
Find the topmost (highest population density) point which is plotted. What state in the table does this correspond to? (It is the only state with a population density larger than 1200.)
How far east of the center of the contiguous United States is that state, in miles?
Is that state in the eastern or western part of the country?
As you move from left to right in the grid, do the points generally get higher or lower?
As you move east through the country, does state population density generally increase or decrease?
Table columns: $r$, $e$, State

The $r$ column of the table to the left gives the average yearly amount of precipitation (rain plus snow) in each state, in inches. The $\cl"red"e$ column gives the average elevation of each state, in feet above sea level. These points $\cl"red"{(r,e)}$ are plotted on the grid to the left.

Find the topmost (highest average elevation) point which is plotted. What state in the table does this correspond to? (It has an average elevation of 6800 feet.)
What is the average yearly precipitation in that state, in inches?
Is that state wet (high precipitation) or dry (low precipitation) compared to the other states?
The rightmost (wettest) state plotted is Hawaii. What is the second wettest state plotted?
What is the average elevation of that state, in feet above sea level?
Is the average elevation of that state high or low compared to other states?
As you move from left to right in the grid, do the points generally get higher or lower?
Are states with a lot of precipitation (wet states) generally at a higher or lower elevation than states without much precipitation (dry states)?

Two values have a positive association if they tend to increase together (when one of them is large, so is the other one). Two values have a negative association if they tend to move in opposite directions (when one of them is large, the other one is small).
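The lesson judges association by eye, but the idea can also be checked with a little arithmetic. The sketch below is one possible approach, not a method the lesson defines: for each point, multiply how far its $x$ value is from the average $x$ by how far its $y$ value is from the average $y$, then add up the products. A positive total suggests the values tend to rise together, and a negative total suggests they move in opposite directions. The function name `association_sign` and the sample numbers are invented for illustration.

```python
def association_sign(xs, ys):
    """Classify the association by the sign of the summed deviation products."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neither"

# Invented values for illustration, not the lesson's data.
rising = association_sign([1, 2, 3, 4], [10, 25, 30, 50])
falling = association_sign([1, 2, 3, 4], [50, 30, 25, 10])
print(rising, falling)
```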

Is there a positive or a negative association between average yearly precipitation ($r$) and average elevation ($\cl"red"e$)?
Click to see the earlier scatter plot again. Remember that it shows the relationship between east-west distance ($d$) and population density ($\cl"red"p$). Is there a positive or a negative association between these two values?

Linear association

Table columns: $d$, $A$, State

The $d$ column of the table to the left again gives the east-west distance from the geographical center of the contiguous United States to the geographical center of each state, in miles. The $\cl"red"A$ column gives the area of each state, in thousands of square miles. These points $\cl"red"{(d,A)}$ are plotted on the grid to the left. Alaska and Hawaii are omitted in this question.

Is there a positive or a negative association between east-west distance ($d$) and area ($\cl"red"A$)?

The points plotted to the left look like they are close to a straight line. When this happens, we say the two variables (in this case $d$ and $\cl"red"A$) have a linear association. Click to graph a line on the grid to the left. The number 51.11 after the $~$, shown to the right of the graph, is called the root-mean-square error for the graphed line ($A=40$). This number measures roughly how far the points are from the line, on average (a smaller number means the line is a better “fit” to the points).
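To see where a root-mean-square error comes from, here is a minimal sketch of the standard calculation: find the vertical distance from each point to the line, square each distance, average the squares, and take the square root. The function name `rmse` and the sample points are invented for illustration, so the results will not match the 51.11 shown for the lesson's data.

```python
import math

def rmse(points, slope, intercept):
    """Root-mean-square error of the line y = slope * x + intercept."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))

# Invented (d, A) pairs for illustration, not the lesson's table.
sample = [(-1000, 120), (-500, 100), (0, 80), (500, 50), (1000, 40)]
flat = rmse(sample, 0, 40)         # the horizontal line A = 40
tilted = rmse(sample, -0.04, 78)   # a line that slopes downward
print(round(flat, 2), round(tilted, 2))
```

For these sample points the sloped line has a much smaller error than the horizontal one, which matches what you would see by eye.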

Use the sliders below the graph to change the slope and $A$-intercept of the red line, to find a line which gives a better approximation to the points on the grid (that is, has a smaller root-mean-square error). What is the equation of a line whose root-mean-square error is less than 35? (The equation for the line you are looking at is written below the grid.)

Click to see a line whose root-mean-square error is as small as possible. You can use this line to estimate the area of a state, if you only know its east-west position.
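The line with the smallest root-mean-square error is commonly called the least-squares line, and it can be computed directly instead of found with sliders. The sketch below uses the standard formulas for its slope and intercept. Whether the lesson's grid computes its line this way is an assumption, and the function name `best_fit_line` and the sample points are invented for illustration.

```python
def best_fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through the means
    return slope, intercept

# Invented (d, A) pairs for illustration, not the lesson's table.
sample = [(-1000, 120), (-500, 100), (0, 80), (500, 50), (1000, 40)]
slope, intercept = best_fit_line(sample)
print(slope, intercept)
```

The formula needs at least two different $x$ values. If every point had the same $x$, `sxx` would be zero and the division would fail.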

The slope of this line is $-0.044$. This means that on average, states decrease in area by 0.044 thousand square miles for every mile farther east they are. For example, if one state is 500 miles east of another, you would expect it to be $500(0.044)=22$ thousand square miles smaller in area. If a state is 1000 miles east of another, how much smaller would you expect its area to be, in thousands of square miles?
The $A$-intercept of this line is $78$. If a state is right at the center of the United States (has $d=0$), about how big would you expect its area to be, in thousands of square miles? (Hint: Substitute 0 for $d$ in the equation for the line.)
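With the slope and $A$-intercept given above, the line can be written as $A=-0.044d+78$. The sketch below turns that equation into a small function and tries it on Alabama's $d$ value of 624 from earlier in the lesson. The function name `estimated_area` is made up for illustration, and the result is only the line's estimate, not Alabama's actual area.

```python
def estimated_area(d):
    """Estimated area in thousands of square miles at east-west distance d."""
    return -0.044 * d + 78

# Alabama's d value (624 miles east of the center) is taken from the lesson.
alabama_estimate = estimated_area(624)

# Two positions 500 miles apart should differ by 500 * 0.044 = 22.
difference = estimated_area(100) - estimated_area(600)
print(round(alabama_estimate, 1), round(difference, 1))
```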

Nonlinear association

Table columns: $r$, $p$, State

The $r$ column of the table to the left again gives the average yearly amount of precipitation in each state, in inches. The $\cl"red"p$ column again gives the population density of each state (the number of people per square mile). These points $\cl"red"{(r,p)}$ are plotted on the grid to the left.

What is the leftmost (driest) state plotted on the grid?
What is the population density of that state, in people per square mile?
Compared to all the other states (whose population density goes from a little over 1 to a little over 1200), is that state’s density high or low?
As you saw in the earlier question about precipitation and elevation, the two wettest states are Louisiana (LA) and Hawaii (HI). What are the population densities of those states?
LA: people per square mile
HI: people per square mile
Compared to all the other states, are those densities high or low? (Are the points for those states high or low?)
In general, is the population density of dry states high or low?
In general, is the population density of very wet states high or low?
In general, are the states with high population density very wet, dry, or in between?

As you can see, there is an association between population density and precipitation: knowing how much rain a state gets tells you something about how densely populated the state is. But the association is not simply positive or negative (because high-precipitation states and low-precipitation states are both lower-density than many of the states in the middle). Also, the association is not linear, since the points are not close to a single straight line. When an association is not linear, we say the two variables (in this case $r$ and $\cl"red"p$) have a nonlinear association.
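A straight line can miss a pattern like this completely. The sketch below uses invented points that rise and then fall, a rough stand-in for the shape described above and not the lesson's data, and computes the slope of the least-squares line through them.

```python
# Invented (r, p) pairs that rise and then fall, for illustration only.
points = [(10, 5), (20, 100), (30, 300), (40, 100), (50, 5)]

n = len(points)
mean_r = sum(r for r, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Least-squares slope: summed deviation products over summed squared r deviations.
sxy = sum((r - mean_r) * (p - mean_p) for r, p in points)
sxx = sum((r - mean_r) ** 2 for r, _ in points)
slope = sxy / sxx

print(slope)  # 0.0: the best straight line through this symmetric hump is flat
```

A flat line would suggest that $r$ tells you nothing about $p$, even though these points follow a clear pattern. That is why it helps to look at the scatter plot itself before relying on a line.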