In this lesson, we will compare pairs of numerical facts about the elements of a group, such
as the location and maximum elevation of U.S. states. You will learn about some kinds of
patterns that can occur when facts are compared in this way.
Clustering and outliers
The map to the left shows the geographical center of the
contiguous United States as well as the geographical center of most of
the states (the centers of some of the smaller East Coast states are not drawn). The $d$
column of the table to the left gives the east-west distance from the
geographical center of the contiguous United States to the
geographical center of each state, in miles. Positive numbers mean
the state is east of the center, and negative numbers mean it’s west of the center. The
$\cl"red"h$ column gives the highest elevation in each state, in feet above sea level.
These points $\cl"red"{(d,h)}$ are plotted on
the grid to the left. A plot like this — showing the relationship between two different values
for each element in a collection — is called a scatter plot.
Notice that most of the red points are
near the bottom right corner of the grid. These correspond to states which are east of the center
of the United States and have fairly low maximum elevations. Are there any states which are near
the top right (east of the center of the United States, with high maximum elevations)?
| |
Are there any states which are near the bottom left (far west of
the center of the United States, with low maximum elevations)?
| |
Give the two-letter abbreviation for the state nearest the
bottom right corner. (Hint: Its $d$ value is more than 1400, and its
$\cl"red"h$ value is less than 1000.)
| |
The points in the bottom right portion of the plot have $d$ values
which are greater than 100 (they are more than 100 miles east of the center of the United
States) and $h$ values which are less than 8000 (their maximum elevation is less than 8000
feet). For example, Alabama (AL) is 624 miles east of the center of the United States, and has a
maximum elevation of 2413 feet. Give the two-letter abbreviation for another state in the bottom
right portion of the plot.
| |
Notice that there are two points much farther to the left (west) than any of the others,
corresponding to Alaska (AK) and Hawaii (HI). Points on a scatter plot which “lie outside” most
of the plot (have an $x$ or $y$ coordinate which is exceptionally large or small) are called
outliers.
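One rough way to flag points like these numerically is the 1.5 × IQR rule, a common convention that this lesson itself does not use. The Python sketch below applies it to the $d$ coordinate only; the `find_outliers` helper and the distances are made up for illustration and are not the values from the table.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical east-west distances in miles (negative means west of the center).
d_values = [-4000, -3500, -900, -700, -400, -100, 0,
            200, 400, 624, 800, 1000, 1200, 1400]

print(find_outliers(d_values))  # prints [-4000, -3500]
```

Here the two westernmost values are flagged, much as the Alaska and Hawaii points stand apart from the rest of the plot.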
Click to see a scatter plot comparing
east-west distance to highest elevation, with the Alaska and Hawaii points removed.
There is a group of red points in the top
left of the grid. These are states that have high maximum elevations (more than 10000 feet) and
are in the western part of the United States (more than 300 miles west of the center). Give the
two-letter abbreviation for one of these states.
| |
A group of points which lie in the same region of a scatter plot is called
a cluster.
You can often use clusters to help analyze patterns. For example, there is a cluster of
points in the top left of this scatter plot because there are several tall mountain ranges in
the western U.S. (like the Rockies, the Sierra Nevada, and the Cascades), and they are about
the same height as each other.
Positive and negative association
The $d$ column of the table to the left again gives the
east-west distance from the geographical center of the contiguous United States to the
geographical center of each state, in miles. The $\cl"red"p$ column gives the population
density of each state (the number of people per square mile). These points $\cl"red"{(d,p)}$ are
plotted on the grid to the left. (Alaska and Hawaii are left off the plot to save space.)
Find the leftmost (westernmost) point which is plotted. What state
in the table does this correspond to? Ignore Alaska and Hawaii, which are not plotted.
| |
What is the population density of that state?
| people per square mile |
Is that population density high or low compared to the
other states? (As you can see on the scatter plot, the population densities of the states range
from a little over 1 to a little over 1200.)
| |
Find the topmost (highest population density) point which is plotted.
What state in the table does this correspond to? (It is the only state with
a population density larger than 1200.)
| |
How far east of the center of the contiguous United States is that
state?
| miles |
Is that state in the eastern or western part of the
country?
| |
As you move from left to right in the grid, do the points
generally get higher or lower?
| |
As you move east through the country, does state population density
generally increase or decrease?
| |
The $r$ column of the table to the left gives the
average yearly amount of precipitation (rain plus snow) in each state, in inches. The
$\cl"red"e$ column gives the average elevation of each state, in feet above sea level. These
points $\cl"red"{(r,e)}$ are plotted on the grid to the left.
Find the topmost (highest average elevation) point which is plotted.
What state in the table does this correspond to? (It has an average elevation
of 6800 feet.)
| |
What is the average yearly precipitation in that state?
| inches |
Is that state wet (high precipitation) or dry (low
precipitation) compared to the other states?
| |
The rightmost (wettest) state plotted is Hawaii. What is the second
wettest state plotted?
| |
What is the average elevation of that state?
| feet above sea level |
Is the average elevation of that state high or low
compared to other states?
| |
As you move from left to right in the grid, do the points
generally get higher or lower?
| |
Are states with a lot of precipitation (wet states) generally at a
higher or lower elevation than states without much precipitation (dry states)?
| |
Two values have a positive association if they
tend to increase together (when one of them is large, so is the other one). Two values have a
negative association if they tend to move in opposite directions (when
one of them is large, the other one is small).
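One way to check the direction of an association numerically is the sign of the sample covariance. That is a standard statistic rather than something this lesson defines, and the `association` helper and the data below are hypothetical.

```python
def association(xs, ys):
    """Classify an association by the sign of the sample covariance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Each product is positive when x and y sit on the same side of their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    if cov > 0:
        return "positive"
    if cov < 0:
        return "negative"
    return "neither"

# Made-up values, not taken from the lesson's tables.
x = [1, 2, 3, 4, 5]
rising = [10, 14, 15, 21, 24]    # tends to grow as x grows
falling = [30, 26, 27, 18, 12]   # tends to shrink as x grows

print(association(x, rising))    # prints positive
print(association(x, falling))   # prints negative
```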
Is there a positive or a negative association between
average yearly precipitation ($r$) and average elevation ($\cl"red"e$)?
| |
Click to see the first scatter plot from
this section again. Remember that this shows the relationship between
east-west distance ($d$) and population density ($\cl"red"p$). Is there a positive or a
negative association between east-west distance ($d$) and population density
($\cl"red"p$)?
| |
Linear association
The $d$ column of the table to the left again gives the east-west distance from
the geographical center of the contiguous United States to the geographical center of each
state, in miles. The $\cl"red"A$ column gives the area of each state, in thousands of square
miles. These points $\cl"red"{(d,A)}$ are plotted on the grid to the left. Alaska and Hawaii
are omitted in this question.
Is there a positive or a negative association between
east-west distance ($d$) and area ($\cl"red"A$)?
| |
The points plotted to the left look like they are close to a straight line. When this
happens, we say the two variables (in this case $d$ and $\cl"red"A$) have a
linear association. Click to graph a
line on the grid to the left. The number 51.11 after the $\sim$, shown to
the right of the graph, is called the root-mean-square error for the
graphed line ($A=40$). This number tells you
how far away the line is from the points (a
smaller number means the line is a better “fit” to the
points).
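The root-mean-square error can be sketched in a few lines of Python. This assumes the usual definition (square each vertical gap between a point and the line, average the squares, then take the square root); the lesson does not spell out the exact formula behind the 51.11, and the `rmse` helper and the five points are invented for illustration.

```python
import math

def rmse(points, slope, intercept):
    """Root-mean-square vertical gap between the points and A = slope*d + intercept."""
    squared_errors = [(a - (slope * d + intercept)) ** 2 for d, a in points]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (d, A) pairs: miles east of the center, area in thousands of square miles.
points = [(-1000, 125), (-500, 98), (0, 80), (500, 55), (1000, 35)]

flat_line = rmse(points, slope=0, intercept=40)          # the horizontal line A = 40
tilted_line = rmse(points, slope=-0.044, intercept=78)   # a line that follows the trend

print(flat_line > tilted_line)  # prints True: the tilted line is the better fit
```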
Use the sliders below the graph to change the slope and $A$-intercept
of the red line, to find a line which gives a
better approximation to the points on the grid (that is, has a smaller
root-mean-square error). What is the equation of a line whose
root-mean-square error is less than 35? (The equation for the
line you are looking at is written below the grid.)
| |
Click to see a line whose
root-mean-square error is as small as possible. You can use this line
to estimate the area of a state, if you only know its east-west position.
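The line with the smallest root-mean-square error is the least-squares line, and its slope and intercept have a closed form. The sketch below is one way to compute it; the `best_fit_line` helper and the points are hypothetical, so the result only roughly resembles the line in the lesson.

```python
def best_fit_line(points):
    """Return (slope, intercept) of the least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical (d, A) pairs: miles east of the center, area in thousands of square miles.
points = [(-1000, 125), (-500, 98), (0, 80), (500, 55), (1000, 35)]

slope, intercept = best_fit_line(points)
print(round(slope, 4), round(intercept, 1))  # prints -0.0446 78.6

# Estimated area for a state 200 miles east of the center.
print(round(slope * 200 + intercept, 2))     # prints 69.68
```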
The slope of this line is $-0.044$. This
means that on average, states decrease in area by 0.044 thousand square miles for every mile
further east they are. For example, if one state is 500 miles east of another, you would expect
it to be $500(0.044)=22$ thousand square miles smaller in area. If a state is 1000 miles east of
another, how much smaller would you expect its area to be?
| thousand square miles |
The $A$-intercept of this line is $78$. If a
state is right at the center of the United States (has $d=0$), about how big would you expect
its area to be? (Hint: Substitute 0 for $d$ in the equation for the
line.)
| thousand square miles |
Nonlinear association
The $r$ column of the table to the left again gives the average yearly amount of
precipitation in each state, in inches. The $\cl"red"p$ column again gives the population
density of each state (the number of people per square mile). These points $\cl"red"{(r,p)}$ are
plotted on the grid to the left.
What is the leftmost (driest) state plotted on the grid?
| |
What is the population density of that state?
| people per square mile |
Compared to all the other states (whose population density goes from
a little over 1 to a little over 1200), is that state’s density high or low?
| |
As you saw earlier, the
two wettest states are Louisiana (LA) and Hawaii (HI). What are the population densities of
those states?
| LA: | people per square mile |
| HI: | people per square mile |
Compared to all the other states, are those densities high or
low? (Are the points for those states high or
low?)
| |
In general, is the population density of dry states high or
low?
| |
In general, is the population density of very wet states high
or low?
| |
In general, are the states with high population density very
wet, dry, or in between?
| |
As you can see, there is an association between population density and precipitation: knowing
how much precipitation a state gets tells you something about how densely populated it is. But
the association is not simply positive or negative (because high-precipitation states and
low-precipitation states are both lower-density than many of the states in the middle). Also,
the association is not linear, since the points are not close to a single straight line. When
an association is not linear, we say the two variables (in this case $r$ and $\cl"red"p$) have a
nonlinear association.
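A small numerical sketch can show why such a pattern is neither positive nor negative. The data below are invented to be hump-shaped (low at both ends, high in the middle) and are not the lesson's table. The sample covariance, a standard statistic this lesson does not use, comes out to zero even though the two values are clearly related.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance of paired values."""
    mean_x, mean_y = mean(xs), mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Made-up precipitation (inches) and population density (people per square mile).
r = [10, 20, 30, 40, 50, 60]
p = [5, 100, 300, 300, 100, 5]

print(covariance(r, p))  # prints 0.0: no overall upward or downward trend

# Average density for the two driest, two middle, and two wettest points.
print(mean(p[:2]), mean(p[2:4]), mean(p[4:]))  # prints 52.5 300.0 52.5
```

The group averages show the pattern that a single positive or negative label misses: density is low at both extremes and high in between.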