You will often be able to measure two different quantities and will want to know how they
relate to each other. In this lesson, we will look at the relationship between the
temperature in a city and how far north it is. We will try to answer questions like, “As you
travel south through the United States, how much warmer does it get?” and “Which has a larger
impact on temperature, traveling south or traveling east?”
In your last two answers, you found linear models with one or two residuals close
to zero. We would like to find a linear model whose residuals are all close to zero, so it does
a good job of estimating all the temperatures we know about.
To figure out how good a job a linear model does, we can't just add up all the residual
errors to get a total error in the model, or average them to get a mean error, because positive
and negative residuals could cancel each other out (add up to 0). We had the same problem in an
earlier lesson with deviations when measuring spread. So in the definitions of
variance and standard deviation we squared
each deviation to get a number that wasn’t negative. We can do the same thing to residuals, to
help combine them into a single number that measures the overall error in our model.
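To see the cancellation problem concretely, here is a minimal Python sketch. The five residuals are made up for illustration and are not the lesson's data. Their sum is zero even though every one of them is an error of at least 1.5 °F, while the mean of their squares is clearly not zero.

```python
# Hypothetical residuals in °F (invented for illustration only).
residuals = [4.0, -3.0, 2.5, -1.5, -2.0]

total = sum(residuals)                    # positives and negatives cancel
squares = [e ** 2 for e in residuals]     # every square is non-negative
mean_square = sum(squares) / len(squares)

print(total)        # 0.0
print(mean_square)  # 7.5
```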
The table below shows each residual, its square (rounded to three decimal places), and the
mean of all the squares at the bottom. It is automatically updated whenever you slide the
sliders.
(Interactive table with the columns City, Residual, and Square of residual. Its last row shows the mean of the squares.)
The mean of the squares of the residuals gives us a typical square of an error, so its
square root gives us a typical error. This square root of the mean of the
squares of the residual errors is called the
root-mean-square error of the model. The root-mean-square error is the
number shown after the $±$ symbol (“plus or minus”) to the right of the top grid.
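The definition above can be written as a short Python function. This is only a sketch: the name `rms_error` and the sample residuals are assumptions made for illustration, not part of the lesson's software.

```python
import math

def rms_error(residuals):
    """Square root of the mean of the squared residuals."""
    mean_square = sum(e ** 2 for e in residuals) / len(residuals)
    return math.sqrt(mean_square)

# Hypothetical residuals in °F; their mean square is 15.
sample = [3.0, -4.0, 0.0, 5.0, -5.0]
print(round(rms_error(sample), 3))  # 3.873
```

Notice that the result is in the same units as the residuals (°F), which is why it can be read as a typical error.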
What is a linear model whose mean squared error is less than 25? (That
is, whose root-mean-square error is less than 5?)
 
Click to see the linear model that
makes the root-mean-square error as small as possible. The model that does this is sometimes
called the best fit linear model for a collection of data. What is the residual $e$ for
Philadelphia when this model is used? Round your answer to one decimal place.
(Look in the table of residuals above, or click on the middle dot on the
bottom grid.)
 degrees Fahrenheit (°F) 
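If you are curious how such a model can be found without sliders, the following Python sketch uses the standard least-squares formulas for the slope and intercept, which give the line with the smallest root-mean-square error. The data values and function names are hypothetical, and the residual is assumed here to be the actual temperature minus the estimated temperature.

```python
import math

def best_fit(xs, ys):
    """Least-squares slope m and intercept b for the model y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def rms_error(xs, ys, m, b):
    """Root-mean-square error of the model y = m*x + b on the data."""
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))

# Hypothetical (distance north in miles, temperature in °F) data.
d_north = [0.0, 200.0, 400.0, 600.0, 800.0]
temps = [80.0, 74.0, 71.0, 64.0, 61.0]

m, b = best_fit(d_north, temps)
best = rms_error(d_north, temps, m, b)

# Nudging the slope or intercept away from the best fit should never
# give a smaller root-mean-square error.
nudged = [rms_error(d_north, temps, m + dm, b + db)
          for dm, db in [(0.001, 0.0), (-0.001, 0.0), (0.0, 0.5), (0.0, -0.5)]]
print(m, b, best)
print(all(e >= best for e in nudged))
```

Moving the sliders away from the best fit in the lesson has the same effect as the nudges in the last few lines: the number after the $±$ symbol can only go up.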
Which of the five cities has a residual that is farthest from zero
when this model is used?
 
By clicking in the top grid or using the input box to its right, set
the vertical bar on the graph to $d_\N=400$. What value of $T$ does this correspond to
when this model is used? (The value of $T$ is shown to the right of the top
grid. It is the number before the $±$ symbol.)
 °F 
The value you just found is the estimate which this linear model gives for the mean daily
high temperature of a city 400 miles north of Houston.
What estimate does this linear model give for the mean daily high
temperature of a city 500 miles north of Houston?
 °F 
The top scatter plot to the left now shows the distance north of Houston ($d_\N$)
and mean daily high temperature ($T$) of the 50 largest cities in the United States. (Two of
these cities — San Antonio and Miami — are south of Houston, so the $d_\N$ value for those
cities is negative.) The best fit linear model for this data is also graphed, and the
temperatures within 1 standard deviation of the mean are shaded.
What estimate does this new linear model give for the mean daily
high temperature of a city 400 miles north of Houston? (If you have slid the
sliders since you started this question, click the Next Question
button above to see the best fit linear model again.)
 °F 
What estimate does this linear model give for the mean daily
high temperature of a city 0 miles north of Houston (that is, for a city which is
exactly as far north as Houston)?
 °F 
Notice that the estimate for a city which is exactly as far north as Houston is the same
as the $T$-intercept of the linear model. This will be true for any linear model of temperature
in terms of distance: if the model is of the form $T=md_\N + b$, it predicts that a city with
$d_\N=0$ will have temperature $b$.
The slope $m$ of a model $T=md_\N + b$ tells you how much the temperature $T$ changes when
the distance north $d_\N$ increases by 1 mile. For example, the best fit linear model for these
50 cities has a slope of $-0.0225$. So it predicts that, for every mile you go north, the
temperature will change by $-0.0225$ °F (that is, decrease by 0.0225 °F).
The city of Detroit, Michigan is 165 miles north of the city of
Columbus, Ohio. How many degrees Fahrenheit colder than Columbus does this model
predict that Detroit will be? Round to one decimal place.
 °F 
In general:
For a linear model of the form $y=mx+b$, $b$ is the model’s estimate for
$y$ when $x=0$, and the slope $m$ equals the change in the estimated $y$ when $x$ increases by
1.
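Here is a small Python sketch of both facts. The slope of -0.0225 °F per mile matches the decrease of 0.0225 °F per mile described for the 50-city model, but the intercept of 80 is a made-up value chosen only for illustration, as is the function name `predict`.

```python
def predict(m, b, x):
    """Estimate y from the linear model y = m*x + b."""
    return m * x + b

m = -0.0225   # °F per mile north (the 50-city model's slope)
b = 80.0      # hypothetical intercept, for illustration only

at_zero = predict(m, b, 0)                          # equals b
one_mile = predict(m, b, 301) - predict(m, b, 300)  # approximately m
two_hundred_miles = m * 200                         # about -4.5 °F

print(at_zero, one_mile, two_hundred_miles)
```

Because the change per mile is constant, the predicted change over any distance is just the slope times that distance, whatever the intercept is.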
Correlation coefficient
The scatter plot to the left now shows the distance east of Houston
($d_\E$) and temperature ($T$) of each of the 50 largest cities in the United States. (For
cities which are west of Houston, $d_\E$ is negative. By “distance east of Houston” we mean the
distance you would travel east from Houston, before traveling north or south to reach the
city.) The best fit linear model for this data is also graphed.
Does this best fit linear model give accurate estimates for the
temperatures of most of the cities?
 
Is there a strong relationship between $T$ and $d_\E$? That is, if
all you knew about a city was how far east of Houston it was, would that tell you very much
about how warm that city was?
 
You’ve seen that a city’s temperature $T$ is strongly related to $d_\N$, and not so strongly
related to $d_\E$. We would like to have a number which measures how much two quantities relate
to each other, or correlate, by measuring how accurate their best fit linear model
is.
The standard deviation of the 50 temperatures is roughly $8.047$.
What is the root-mean-square error of the best fit linear model for $T$ in terms of $d_\E$?
(The root-mean-square error is the number shown after the $±$ symbol to the
right of the grid. If you’ve slid the sliders, click the Next
Question button above to see the best fit linear model again.)
 
Click to see the scatter plot of
$d_\N$ and $T$ again, along with its best fit linear model. What is the root-mean-square error
of this linear model?
 
Notice that for both $d_\E$ and $d_\N$, the root-mean-square error of the best fit linear
model is smaller than the standard deviation of temperature (8.047).
Does the quantity with the stronger relationship to temperature
($d_\N$) have a larger or smaller root-mean-square error than the quantity with
the weaker relationship ($d_\E$)?
 
You can think of the root-mean-square error as telling us how much of the spread in
temperature isn’t described by the model. Because standard deviation is a measure of the
total amount of spread, and both root-mean-square error and standard deviation are square roots,
we’ll consider the quantity
$$
(\text"root-mean-square error")^2/(\text"standard deviation of temperature")^2
$$
to be the fraction of the spread which isn’t described by the model.
What fraction of the spread in $T$ isn’t described by the best fit linear model using $d_\N$?
Round your final answer to four decimal places. (Give as many decimal places as you can fit
for the intermediate steps.)
You have found the fraction of the spread that is not described by this
model. We want to know the fraction of the spread that is described by the
model, which is the difference between 1 and the fraction you found:
$$
1-(\text"root-mean-square error")^2/(\text"standard deviation of temperature")^2
$$
For the best fit linear model, this difference is called the square of the
correlation coefficient $r$. That is,
$$
r^2=1-(\text"root-mean-square error of best fit linear model")^2/
(\text"standard deviation of temperature")^2
$$
Use your previous answer to find the value of $r^2$ for $T$ in terms of $d_\N$.
Click to see the scatter plot of $d_\E$ and $T$ again, along
with its best fit linear model. What is the value of $r^2$ for $T$ in terms of $d_\E$? Round
your final answer to four decimal places.
The correlation coefficient $r$ itself is one of the two square roots of $r^2$. If the slope
of the best fit linear model is positive, $r$ is the positive square root; if the slope is
negative, $r$ is the negative square root.
For example, the best fit linear model for $T$ in terms of $d_\N$ had a slope of roughly
$-0.0225$, so the correlation coefficient for that relationship is negative. Since the $r^2$ for
the relationship was roughly $0.7593$, this means $r≈-√{0.7593}≈-0.8714$.
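The two steps (finding $r^2$, then choosing the sign of $r$ from the slope) can be sketched in Python as follows. The function names are assumptions made for illustration. The first example reuses the values quoted above for $d_\N$; the second uses a made-up root-mean-square error of 3 and standard deviation of 5.

```python
import math

def r_squared(rms_error, std_dev):
    """Fraction of the spread that is described by the best fit model."""
    return 1 - rms_error ** 2 / std_dev ** 2

def signed_r(r2, slope):
    """Square root of r2 that has the same sign as the slope (0 if slope is 0)."""
    if slope == 0:
        return 0.0
    return math.copysign(math.sqrt(r2), slope)

# Values quoted in the lesson for T in terms of d_N: the slope is negative,
# so r is the negative square root of 0.7593.
print(round(signed_r(0.7593, -0.0225), 4))   # -0.8714

# Hypothetical numbers: r^2 is about 0.64, so r is about 0.8 for a positive slope.
r2 = r_squared(3.0, 5.0)
print(round(signed_r(r2, 0.1), 4))           # 0.8
```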
Is the slope of the best fit linear model for $T$ in terms of $d_\E$
positive or negative?
 
Is the correlation coefficient for $T$ in terms of $d_\E$
positive or negative?
 
What is the correlation coefficient for $T$ in terms of $d_\E$? Use
the value of $r^2$ for this relationship that you found above. Round your answer to four decimal
places.
 
In general:
If the best fit linear model for the quantities $x$ and $y$ is
$y=mx+b$, the correlation coefficient $r$ for $y$ in terms of $x$ can
be found by using the equation
$$
r^2=1-(\text"root-mean-square error of best fit linear model")^2 /
(\text"standard deviation of " y \text" data")^2
$$
The correlation coefficient $r$ is positive if $m$ is positive, negative if $m$ is negative,
and 0 if $m$ is 0.
A correlation coefficient near 0 shows a weak relationship between $x$ and $y$. A correlation
coefficient near $1$ or $-1$ shows a strong relationship. A positive correlation coefficient
means $y$ increases when $x$ increases; in this case $x$ and $y$ are said to be positively
correlated. A negative correlation coefficient means $y$ decreases when $x$ increases; in
this case $x$ and $y$ are said to be negatively correlated.
In fact, if you have two quantities $x$ and $y$, the correlation coefficient for $y$ in terms
of $x$ is always equal to the correlation coefficient for $x$ in terms of $y$. So people
usually refer to either one as simply the correlation coefficient of $x$ and $y$.
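You can check this symmetry numerically. The Python sketch below computes $r$ from the definition in this lesson (fit the best fit linear model, then compare its mean squared error with the spread of the $y$ data) and then swaps the roles of the two quantities. The data are hypothetical, the function name is an assumption, and the standard deviation is assumed to be the population version (dividing by the number of data points), which is what makes the formula for $r^2$ work with a root-mean-square error.

```python
import math

def correlation_via_rmse(xs, ys):
    """r for y in terms of x, using r^2 = 1 - RMSE^2 / SD(y)^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx                        # slope of the best fit model
    b = mean_y - m * mean_x              # intercept of the best fit model
    mse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n   # population variance of y
    r2 = max(1 - mse / var_y, 0.0)       # guard against tiny negative rounding
    return math.copysign(math.sqrt(r2), m)

# Hypothetical (distance north in miles, temperature in °F) data.
xs = [0.0, 200.0, 400.0, 600.0, 800.0]
ys = [80.0, 74.0, 71.0, 64.0, 61.0]

r_y_from_x = correlation_via_rmse(xs, ys)   # T in terms of distance
r_x_from_y = correlation_via_rmse(ys, xs)   # distance in terms of T
print(round(r_y_from_x, 4), round(r_x_from_y, 4))
```

The two best fit linear models are different lines with very different root-mean-square errors (one is in °F, the other in miles), yet the two values of $r$ agree.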
In the last question, you saw that distance north of Houston had a strong
relationship with temperature. This is because going north causes temperature to drop:
in the Northern Hemisphere, cities that are further north will get less sunlight and therefore
be colder. However, not all strong relationships between quantities can be explained in this
way.
The scatter plot to the left now shows the shortest distance $d_\M$ to Mumbai (the largest
city in India) and the temperature $T$ for each of the 50 largest cities in the United
States, along with a graph of the best fit linear model.
What is the root-mean-square error of the best fit linear model for
$T$ in terms of $d_\M$? (The root-mean-square error is the number shown
after the $±$ symbol to the right of the grid.)
 
Is the correlation between $T$ and $d_\M$ positive or
negative? (That is, is the slope of the best fit linear model positive or negative?)
 
What is $r^2$ (the square of the correlation coefficient) for $T$ and $d_\M$? Round your
final answer to four decimal places.
What is the correlation coefficient of $T$ and $d_\M$? Round your
answer to four decimal places.
 
Based on the value of the correlation coefficient as well as the
scatter plot, is the relationship between $T$ and $d_\M$ strong or weak?
 
Looking at the graph, are the cities that are closer to Mumbai (have smaller $d_\M$)
hotter or colder than the cities that are farther from Mumbai?
 
The mean daily high temperature in Mumbai is 89.2 degrees Fahrenheit,
which is hotter than the mean daily high temperature of any major city in the United States. Is
it possible that getting closer to Mumbai causes cities in general to be colder?
 
In fact, the shortest route from anywhere in the continental United States to Mumbai
goes close to the North Pole. So it will be shorter for cities which are closer to the North
Pole: that is, further north. In other words, being further north causes cities (in the United
States) to be closer to Mumbai, and also causes cities to be colder. But being closer to Mumbai
does not directly cause cities to be colder.
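A small simulation can make this concrete. In the Python sketch below, every number is invented: distance north is the only thing that drives both the temperature and the distance to Mumbai, and the two are otherwise unrelated. The raw correlation between temperature and distance to Mumbai should still come out strong. One way to check that the shared cause is responsible (an approach assumed here, not something the lesson itself does) is to subtract each quantity's best fit linear model in terms of distance north and correlate what is left over.

```python
import math
import random

def pearson(xs, ys):
    """Correlation coefficient of xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals_after_fit(xs, ys):
    """Residuals of ys after subtracting the best fit linear model in xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    m = sxy / sxx
    b = my - m * mx
    return [y - (m * x + b) for x, y in zip(xs, ys)]

rng = random.Random(0)   # fixed seed so the run is repeatable

# Invented data for 200 imaginary cities (not real measurements).
north = [rng.uniform(0, 1500) for _ in range(200)]                # miles north
temp = [85 - 0.02 * d + rng.gauss(0, 3) for d in north]           # °F
to_mumbai = [9000 - 0.6 * d + rng.gauss(0, 100) for d in north]   # miles

r_raw = pearson(to_mumbai, temp)
r_controlled = pearson(residuals_after_fit(north, to_mumbai),
                       residuals_after_fit(north, temp))
print(round(r_raw, 2), round(r_controlled, 2))
```

The first number printed should be strongly positive (closer to Mumbai goes with colder), while the second should be near 0 once the effect of distance north has been removed from both quantities.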
This situation — where there is a strong relationship between two variables even though one
doesn’t directly cause the other — is very common. So there’s a standard slogan people use to
remind each other that it might be happening: “Correlation is not causation.”