Data Relating Two Measurements

You will often be able to measure two different quantities and will want to know how they relate to each other. In this lesson, we will look at the relationship between the temperature in a city and how far north it is. We will try to answer questions like, “As you travel south through the United States, how much warmer does it get?” and “Which has a larger impact on temperature, traveling south or traveling east?”


Best fit linear model

The table below lists the five largest (most populous) cities in the United States, how far north of Houston (the farthest south of the cities) each city is, and each city’s mean daily high temperature. This information is also plotted on the grid to the left as a scatter plot. That is, we plot one point for each city, with the point’s coordinates $(d_\N, T)$ given by that city’s distance north of Houston ($d_\N$) and its mean daily high temperature ($T$).

City | Distance $d_\N$ north of Houston (miles) | Mean daily high temperature $T$ (degrees Fahrenheit)

Which city in the table is the farthest north?
Which city in the table is the warmest?
Which city in the table is the coolest?
Are cities that are farther north warmer on average, or cooler?

The sliders for $m$ and $b$ to the left allow you to move the graph of $T = md_\N + b$ by setting its slope and $T$-intercept. Try to find a line that comes close to passing through all the points.

Will a line that comes close to passing through all the points have a positive slope, or a negative slope?

As you saw in the last question, you can find a line that nearly (but not quite) passes through the five points representing the five cities’ distance north and temperature. You can think of the equation for a line like this as an attempt to estimate the temperature of a city when all you know about the city is how far north it is. A linear equation that is used to estimate one quantity in terms of another is called a linear model. For example, the center equation to the left is a linear model for temperature ($T$) in terms of distance north of Houston ($d_\N$).

The bottom graph to the left shows the difference between the actual temperature for each city and the temperature estimated by the linear model. This difference is called a residual error of the model ($e$) or simply a residual. The residuals tell us how far the model is from being exactly right. Move the sliders and try to bring different residuals close to 0.
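As a rough illustration of how residuals are computed, here is a minimal Python sketch. The four (distance, temperature) pairs, the slope `m`, and the intercept `b` are made-up values rather than the lesson's table, and the sign convention (actual minus estimated) is an assumption.

```python
# Hypothetical data: (miles north, mean daily high in °F). NOT the lesson's table.
distances = [0, 300, 600, 900]
temps = [80.0, 72.0, 68.0, 59.0]


def predict(d_n, m, b):
    """Temperature estimated by the linear model T = m * d_N + b."""
    return m * d_n + b


def residuals(xs, ys, m, b):
    """Residual e for each point, assumed here to be actual minus estimated."""
    return [y - predict(x, m, b) for x, y in zip(xs, ys)]


# An arbitrary trial line, like one you might set with the sliders.
m, b = -0.02, 80.0
for d, e in zip(distances, residuals(distances, temps, m, b)):
    print(f"d_N = {d:4d} miles, residual e = {e:+.1f} °F")
```

With this trial line, the first and third points have residuals of essentially zero, while the other two points sit below the line and so have negative residuals.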

What is a linear model whose residual error $e$ for Houston (the leftmost point) is between $-2$ and 2?
What is a linear model whose residual error $e$ for Chicago (the rightmost point) is between $-2$ and 2?
What is a linear model whose residual error $e$ for both Houston and Chicago is between $-2$ and 2?

In your last three answers, you found linear models with one or two residuals close to zero. We would like to find a linear model whose residuals are all close to zero, so it does a good job of estimating all the temperatures we know about.

To figure out how good a job a linear model does, we can’t just add up all the residual errors to get a total error, or average them to get a mean error, because positive and negative residuals could cancel each other out (and even add up to 0). We had the same problem with deviations in an earlier lesson on measuring spread. There, in the definitions of variance and standard deviation, we squared each deviation to get a number that couldn’t be negative. We can do the same thing with residuals to combine them into a single number that measures the overall error in our model.

The table below shows each residual, its square (rounded to three decimal places), and the mean of all the squares at the bottom. It is automatically updated whenever you slide the sliders.

City | Residual | Square of residual
Mean of squares

The mean of the squares of the residuals gives us a typical square of an error, so its square root gives us a typical error. This square root of the mean of the squares of the residual errors is called the root-mean-square error of the model. The root-mean-square error is the number shown after the $±$ symbol (“plus or minus”) to the right of the top grid.
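The calculation described above can be sketched in a few lines of Python. The residual values are hypothetical, chosen so that they cancel when averaged but still give a root-mean-square error of 5.

```python
import math


def mean_squared_error(residuals):
    """Mean of the squares of the residuals."""
    return sum(e * e for e in residuals) / len(residuals)


def root_mean_square_error(residuals):
    """Square root of the mean squared error: a 'typical' error size."""
    return math.sqrt(mean_squared_error(residuals))


# Hypothetical residuals (°F) chosen so the arithmetic is easy to follow.
example = [1.0, -1.0, 7.0, -7.0]

print(sum(example) / len(example))      # plain mean: 0.0, the errors cancel
print(mean_squared_error(example))      # (1 + 1 + 49 + 49) / 4 = 25.0
print(root_mean_square_error(example))  # sqrt(25.0) = 5.0
```

The plain mean hides the two large errors completely, while the root-mean-square error does not.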

What is a linear model whose mean squared error is less than 25? (That is, whose root-mean-square error is less than 5?)
Click to see the linear model that makes the root-mean-square error as small as possible. The model that does this is sometimes called the best fit linear model for a collection of data. What is the residual $e$ for Philadelphia when this model is used? Round your answer to one decimal place. (Look in the table of residuals above, or click on the middle dot on the bottom grid.) degrees Fahrenheit (°F)
Which of the five cities has a residual that is farthest from zero when this model is used?
By clicking in the top grid or using the input box to its right, set the vertical bar on the graph to $d_\N=400$. What value of $T$ does this correspond to when this model is used? (The value of $T$ is shown to the right of the top grid. It is the number before the $±$ symbol.) °F

The value you just found is the estimate which this linear model gives for the mean daily high temperature of a city 400 miles north of Houston.

What estimate does this linear model give for the mean daily high temperature of a city 500 miles north of Houston? °F
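The lesson finds the best fit model for you with a click. One common way to compute it by hand is the standard least-squares formula, sketched below in Python. The data points are hypothetical (not the five cities), and the closed-form formulas in `best_fit` are the usual textbook ones rather than anything stated in this lesson.

```python
import math

# Hypothetical data: (miles north, mean daily high in °F). NOT the lesson's table.
xs = [0, 200, 400, 600, 800]
ys = [80.0, 75.0, 70.0, 67.0, 63.0]


def best_fit(xs, ys):
    """Least-squares slope and intercept (standard closed-form formulas)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b


def rmse(xs, ys, m, b):
    """Root-mean-square error of the model T = m * d_N + b on the data."""
    squares = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squares) / len(squares))


m, b = best_fit(xs, ys)
print(f"best fit: T = {m:.4f} * d_N + {b:.2f}")
print(f"RMSE at best fit:        {rmse(xs, ys, m, b):.4f}")
# Nudging either parameter away from the best fit should not lower the RMSE.
print(f"RMSE with steeper slope: {rmse(xs, ys, m - 0.002, b):.4f}")
print(f"RMSE with higher line:   {rmse(xs, ys, m, b + 1.0):.4f}")
print(f"estimate at d_N = 400:   {m * 400 + b:.2f} °F")
```

Minimizing the root-mean-square error is the same as minimizing the mean of the squared residuals, which is why the least-squares line is the one the sliders converge toward.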

The top scatter plot to the left now shows the distance north of Houston ($d_\N$) and mean daily high temperature ($T$) of the 50 largest cities in the United States. (Two of these cities — San Antonio and Miami — are south of Houston, so the $d_\N$ value for those cities is negative.) The best fit linear model for this data is also graphed, and the temperatures within 1 standard deviation of the mean are shaded.

What estimate does this new linear model give for the mean daily high temperature of a city 400 miles north of Houston? (If you have slid the sliders since you started this question, click the Next Question button above to see the best fit linear model again.) °F
What estimate does this linear model give for the mean daily high temperature of a city 0 miles north of Houston (that is, for a city which is exactly as far north as Houston)? °F

Notice that the estimate for a city which is exactly as far north as Houston is the same as the $T$-intercept of the linear model. This will be true for any linear model of temperature in terms of distance: if the model is of the form $T=md_\N + b$, it predicts that a city with $d_\N=0$ will have temperature $b$.

The slope $m$ of a model $T=md_\N + b$ tells you how much the temperature $T$ changes when the distance north $d_\N$ increases by 1 mile. For example, the best fit linear model for these 50 cities has a slope of $-0.0225$. So it predicts that, for every mile you go north, the temperature will change by $-0.0225$ °F (that is, decrease by 0.0225 °F).

The city of Detroit, Michigan is 165 miles north of the city of Columbus, Ohio. How many degrees Fahrenheit colder than Columbus does this model predict that Detroit will be? Round to one decimal place. °F

In general:

For a linear model of the form $y=mx+b$, $b$ is the model’s estimate for $y$ when $x=0$, and the slope $m$ equals the change in the estimated $y$ when $x$ increases by 1.
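A quick Python check of both statements, as a sketch: the slope is the one quoted above for the 50 cities, but the intercept `b = 80.0` is a hypothetical value chosen only for illustration.

```python
def predict(x, m, b):
    """Estimate y from x using the linear model y = m * x + b."""
    return m * x + b


m = -0.0225  # slope quoted in the lesson (°F per mile north)
b = 80.0     # hypothetical intercept, chosen only for illustration

at_zero = predict(0, m, b)
per_mile = predict(251, m, b) - predict(250, m, b)
per_100_miles = predict(350, m, b) - predict(250, m, b)

print(at_zero)                  # the intercept b
print(round(per_mile, 4))       # about -0.0225, the slope m
print(round(per_100_miles, 2))  # about -2.25, which is 100 * m
```

The change over 100 miles does not depend on the starting point or on `b`: only the slope matters when comparing two locations.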

Correlation coefficient

City | $d_\N$ | $d_\E$ | $T$

The scatter plot to the left now shows the distance east of Houston ($d_\E$) and temperature ($T$) of each of the 50 largest cities in the United States. (For cities which are west of Houston, $d_\E$ is negative. By “distance east of Houston” we mean the distance you would travel east from Houston, before traveling north or south to reach the city.) The best fit linear model for this data is also graphed.

Does this best fit linear model give accurate estimates for the temperatures of most of the cities?
Is there a strong relationship between $T$ and $d_\E$? That is, if all you knew about a city was how far east of Houston it was, would that tell you very much about how warm that city was?

You’ve seen that a city’s temperature $T$ is strongly related to $d_\N$, and not so strongly related to $d_\E$. We would like to have a number which measures how much two quantities relate to each other, or correlate, by measuring how accurate their best fit linear model is.

The standard deviation of the 50 temperatures is roughly $8.047$. What is the root-mean-square error of the best fit linear model for $T$ in terms of $d_\E$? (The root-mean-square error is the number shown after the $±$ symbol to the right of the grid. If you’ve slid the sliders, click the Next Question button above to see the best fit linear model again.)
Click to see the scatter plot of $d_\N$ and $T$ again, along with its best fit linear model. What is the root-mean-square error of this linear model?

Notice that for both $d_\E$ and $d_\N$, the root-mean-square error of the best fit linear model is smaller than the standard deviation of temperature (8.047).

Does the quantity with the stronger relationship to temperature ($d_\N$) have a larger or smaller root-mean-square error than the quantity with the weaker relationship ($d_\E$)?

You can think of the root-mean-square error as telling us how much of the spread in temperature isn’t described by the model. Because standard deviation measures the total amount of spread, and both root-mean-square error and standard deviation are square roots, we’ll square both and consider the quantity

$$ (\text"root-mean-square error")^2/(\text"standard deviation of temperature")^2 $$

to be the fraction of the spread which isn’t described by the model.

What fraction of the spread in $T$ isn’t described by the best fit linear model using $d_\N$? Round your final answer to four decimal places. (Keep as many decimal places as you can in the intermediate steps.)

You have found the fraction of the spread that is not described by this model. We want to know the fraction of the spread that is described by the model, which is the difference between 1 and the fraction you found:

$$ 1-(\text"root-mean-square error")^2/(\text"standard deviation of temperature")^2 $$

For the best fit linear model, this difference is called the square of the correlation coefficient $r$. That is,

$$ r^2=1-(\text"root-mean-square error of best fit linear model")^2/ (\text"standard deviation of temperature")^2 $$
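The formula above translates directly into code. In this minimal sketch the input numbers are hypothetical, and it assumes the root-mean-square error and the standard deviation are both computed as means over all the data points (dividing by the number of points).

```python
def r_squared(rmse, sd):
    """Fraction of the spread in y described by the best fit linear model."""
    return 1 - (rmse ** 2) / (sd ** 2)


# Hypothetical values: a typical error half as large as the standard deviation.
print(r_squared(4.0, 8.0))  # 1 - 16/64 = 0.75

# If the model is no better than the spread itself, nothing is described.
print(r_squared(8.0, 8.0))  # 0.0
```

Halving the typical error leaves only a quarter of the spread undescribed, because the formula compares squares.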

Use your previous answer to find the value of $r^2$ for $T$ in terms of $d_\N$.

Click to see the scatter plot of $d_\E$ and $T$ again, along with its best fit linear model. What is the value of $r^2$ for $T$ in terms of $d_\E$? Round your final answer to four decimal places.

The correlation coefficient $r$ itself is one of the two square roots of $r^2$. If the slope of the best fit linear model is positive, $r$ is the positive square root; if the slope is negative, $r$ is the negative square root.

For example, the best fit linear model for $T$ in terms of $d_\N$ had a slope of roughly $-0.0225$, so the correlation coefficient for that relationship is negative. Since the $r^2$ for the relationship was roughly $0.7593$, this means $r≈-√{0.7593}≈-0.8714$.
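The sign rule can be written as a small helper. This sketch reuses the two values quoted in the example above; the function name is just an illustrative choice.

```python
import math


def correlation_coefficient(r_squared, slope):
    """Pick the square root of r^2 whose sign matches the slope."""
    if slope > 0:
        return math.sqrt(r_squared)
    if slope < 0:
        return -math.sqrt(r_squared)
    return 0.0


r = correlation_coefficient(0.7593, -0.0225)
print(round(r, 4))  # -0.8714
```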

Is the slope of the best fit linear model for $T$ in terms of $d_\E$ positive or negative?
Is the correlation coefficient for $T$ in terms of $d_\E$ positive or negative?
What is the correlation coefficient for $T$ in terms of $d_\E$? Use the value of $r^2$ for this relationship that you found above. Round your answer to four decimal places.

In general:

If the best fit linear model for the quantities $x$ and $y$ is $y=mx+b$, the correlation coefficient $r$ for $y$ in terms of $x$ can be found by using the equation

$$ r^2=1-(\text"root-mean-square error of best fit linear model")^2 / (\text"standard deviation of " y \text" data")^2 $$

The correlation coefficient $r$ is positive if $m$ is positive, negative if $m$ is negative, and 0 if $m$ is 0.

A correlation coefficient near 0 shows a weak relationship between $x$ and $y$. A correlation coefficient near $1$ or $-1$ shows a strong relationship. A positive correlation coefficient means $y$ tends to increase when $x$ increases; in this case $x$ and $y$ are said to be positively correlated. A negative correlation coefficient means $y$ tends to decrease when $x$ increases; in this case $x$ and $y$ are said to be negatively correlated.

In fact, if you have two quantities $x$ and $y$, the correlation coefficient for $y$ in terms of $x$ is always equal to the correlation coefficient for $x$ in terms of $y$. So people usually refer to either one as simply the correlation coefficient of $x$ and $y$.
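You can check this symmetry numerically. The sketch below follows the recipe from this lesson (fit the best line, compute $r^2$ from the root-mean-square error and the standard deviation, then take the sign from the slope) in both directions. The data is hypothetical, the least-squares formulas in `best_fit` are the standard ones, and `pstdev` is used on the assumption that the standard deviation divides by the number of points.

```python
import math
from statistics import fmean, pstdev

# Hypothetical data, NOT the lesson's cities.
xs = [0, 200, 400, 600, 800]
ys = [80.0, 75.0, 70.0, 67.0, 63.0]


def best_fit(xs, ys):
    """Least-squares slope and intercept for ys in terms of xs."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar


def correlation(xs, ys):
    """r for ys in terms of xs: r^2 = 1 - RMSE^2 / SD^2, signed like the slope."""
    m, b = best_fit(xs, ys)
    mse = fmean([(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)])
    r2 = max(0.0, 1 - mse / pstdev(ys) ** 2)  # guard against tiny rounding error
    if m == 0:
        return 0.0
    return math.copysign(math.sqrt(r2), m)


r_y_from_x = correlation(xs, ys)
r_x_from_y = correlation(ys, xs)
print(round(r_y_from_x, 4), round(r_x_from_y, 4))
```

The two best fit lines are different (one estimates $y$ from $x$, the other $x$ from $y$), but the two printed values should agree.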

Earlier, you saw that distance north of Houston had a strong relationship with temperature. This is because going north causes temperature to drop: in the Northern Hemisphere, cities that are farther north generally get less direct sunlight and therefore tend to be colder. However, not all strong relationships between quantities can be explained in this way.

The scatter plot to the left now shows the shortest distance $d_\M$ to Mumbai (the largest city in India) and the temperature $T$ for each of the 50 largest cities in the United States, along with a graph of the best fit linear model.

What is the root-mean-square error of the best fit linear model for $T$ in terms of $d_\M$? (The root-mean-square error is the number shown after the $±$ symbol to the right of the grid.)
Is the correlation between $T$ and $d_\M$ positive or negative? (That is, is the slope of the best fit linear model positive or negative?)

What is $r^2$ (the square of the correlation coefficient) for $T$ and $d_\M$? Round your final answer to four decimal places.

What is the correlation coefficient of $T$ and $d_\M$? Round your answer to four decimal places.
Based on the value of the correlation coefficient as well as the scatter plot, is the relationship between $T$ and $d_\M$ strong or weak?
Looking at the graph, are the cities that are closer to Mumbai (have smaller $d_\M$) hotter or colder than the cities that are farther from Mumbai?
The mean daily high temperature in Mumbai is 89.2 degrees Fahrenheit, which is hotter than the mean daily high temperature of any major city in the United States. Is it possible that getting closer to Mumbai causes cities in general to be colder?

In fact, the shortest route from anywhere in the continental United States to Mumbai goes close to the North Pole. So the route is shorter for cities that are closer to the North Pole, that is, farther north. In other words, being farther north causes cities (in the United States) to be closer to Mumbai, and also causes cities to be colder. But being close to Mumbai does not directly cause cities to be colder.

This situation — where there is a strong relationship between two variables even though one doesn’t directly cause the other — is very common. So there’s a standard slogan people use to remind each other that it might be happening: “Correlation is not causation.”
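A small simulation can make the point concrete. In this sketch every number is invented (the latitude range, the coefficients, and the noise levels are not real geography): a made-up "latitude" drives both a simulated temperature and a simulated distance, and neither of those two is ever computed from the other. The `pearson_r` helper uses the standard direct formula for the correlation coefficient, which is not the route taken in this lesson but gives the same number.

```python
import math
import random


def pearson_r(xs, ys):
    """Correlation coefficient computed with the standard direct formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Invented "latitudes" for 200 imaginary cities.
latitude = [rng.uniform(25, 48) for _ in range(200)]

# Both quantities depend on latitude plus noise; neither depends on the other.
temperature = [100 - 1.2 * lat + rng.gauss(0, 3) for lat in latitude]
distance = [10000 - 50 * lat + rng.gauss(0, 100) for lat in latitude]

r = pearson_r(distance, temperature)
print(f"correlation of distance and temperature: {r:.3f}")
```

The printed correlation comes out strongly positive even though the code that generates `temperature` never looks at `distance`: the shared dependence on latitude is enough to produce it.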