Finding Formulas for Approximately Linear Data

In the previous lessons you have worked with linear equations. You are now going to use what you learned in those lessons to find equations that describe the relationship between points that appear to be roughly in a straight line.

The data that you will use in this lesson shows the relationship between the number of cigarettes sold per person and the death rates from several types of cancer in different states during 1960. This data was taken from http://lib.stat.cmu.edu/DASL/Datafiles/cigcancerdat.html.


Bladder cancer

$x$     $y$    State
30.34   3.46   AK
18.20   2.90   AL
18.24   2.99   AR
25.82   3.52   AZ
28.60   4.46   CA
31.10   5.11   CT
40.46   5.60   DC
33.60   4.78   DE
28.27   4.46   FL
22.12   4.23   IA
20.10   3.08   ID
27.91   4.75   IL
26.18   4.09   IN
21.84   2.91   KS
23.44   2.86   KY
21.58   4.65   LA
26.92   4.69   MA
25.91   5.21   MD
28.92   4.79   ME
24.96   5.27   MI
22.06   3.72   MN
27.56   4.04   MO
16.08   3.06   MS
23.75   3.95   MT
19.96   2.89   ND
23.32   3.72   NE
28.64   5.98   NJ
21.16   2.90   NM
42.40   6.54   NV
29.14   5.30   NY
26.38   4.47   OH
23.44   2.93   OK
23.78   4.89   PA
29.18   4.99   RI
18.06   3.25   SC
20.94   3.64   SD
20.08   2.94   TN
22.57   3.21   TX
14.00   3.31   UT
25.89   4.63   VT
21.17   4.04   WA
21.25   5.14   WI
22.86   4.78   WV
28.04   3.20   WY

The numbers in the $x$ column of this table tell us how many hundreds of cigarettes were sold per person in each state during 1960. For example, in Alabama (AL), the number in the $x$ column is 18.2. This means that for each person living in Alabama in 1960, 1,820 cigarettes were sold (that is, 18.2 hundreds).

In 1960, how many hundreds of cigarettes were sold per person in Arizona (AZ)?
Which state sold the most cigarettes per person?
Which state sold the fewest cigarettes per person?

The numbers in the $y$ column tell us how many people died of bladder cancer for each 100,000 people living in the state. For example, in Alabama (AL), for each 100,000 people, 2.9 people died of bladder cancer.
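
The two unit conventions described above (hundreds of cigarettes per person for $x$, deaths per 100,000 people for $y$) can be turned into everyday numbers with a little arithmetic. The sketch below uses the Alabama values quoted in the text; the function names and the population of 1,000,000 are made up for illustration and are not part of the data set.

```python
CIGARETTES_PER_X_UNIT = 100    # x is measured in hundreds of cigarettes per person
PEOPLE_PER_Y_UNIT = 100_000    # y is measured in deaths per 100,000 people


def cigarettes_per_person(x):
    """Convert an x value (hundreds of cigarettes) to cigarettes per person."""
    return x * CIGARETTES_PER_X_UNIT


def expected_deaths(y, population):
    """Scale a death rate per 100,000 people to a population of a given size."""
    return y * population / PEOPLE_PER_Y_UNIT


# Alabama's values from the text: x = 18.2 and y = 2.9.
print(round(cigarettes_per_person(18.2)))         # cigarettes sold per person
# Hypothetical population of 1,000,000 people (an assumption, not table data).
print(round(expected_deaths(2.9, 1_000_000), 1))  # deaths in that population
```

A rate of 2.9 per 100,000 corresponds to about 29 deaths in a population of one million, which is the same rate written on a different scale.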

In 1960, for every 100,000 people living in Idaho (ID), how many died of bladder cancer?
Which state had the highest death rate due to bladder cancer?
Which state had the lowest death rate due to bladder cancer?

Finding the “best fit” line

Just by looking at the table, it’s hard to tell whether there is a relationship between the number of cigarettes sold in each state and the number of deaths due to bladder cancer. Another way to look at the data in the table is to graph it. The graph to the left shows the points that represent the data.

When the points on a grid are not all on a straight line, but seem to have a somewhat linear pattern, you can find a line that is the “best fit line” (closest to the pattern).

The number 2.981, shown to the right of the graph, is called the root-mean-square error for the graphed line ($y=0.05x$). This number measures how far the points are, on the whole, from the line; a smaller number means the line is a better “fit” to the points. (In this course, you will not usually be asked to compute the root-mean-square error of a line; it will be given to you.)
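
Although you will not need to compute this error yourself, it can help to see what goes into it. The sketch below assumes the usual definition: take the vertical distance from each point to the line, square it, average the squares, and take the square root. The page may compute its number slightly differently. The function name `rmse` is chosen here for illustration, and only the first five rows of the bladder cancer table are used, so the printed value will not equal the error shown for all the states.

```python
import math

# First five rows of the bladder cancer table, as (x, y) pairs.
# This is a sample only, not the full data set.
sample = [(30.34, 3.46), (18.20, 2.90), (18.24, 2.99), (25.82, 3.52), (28.60, 4.46)]


def rmse(m, b, points):
    """Root-mean-square error of the line y = m*x + b for the given points."""
    # Squared vertical distance from each point to the line.
    squared = [(y - (m * x + b)) ** 2 for x, y in points]
    # Average the squares, then take the square root.
    return math.sqrt(sum(squared) / len(squared))


# Error of the starting line y = 0.05x (so m = 0.05 and b = 0) on the sample.
print(rmse(0.05, 0.0, sample))
```

Because the distances are squared before averaging, points far from the line count much more heavily than points close to it.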

Use both sliders in the lower left portion of your screen to change the values of $m$ and $b$ in the equation $y=mx+b$, and find an equation for a line that is a better approximation of the points on the grid (i.e. has a smaller root-mean-square error). The equation for the line you are looking at is written below the grid. Try to make the root-mean-square error smaller than 0.69. (Hint: Set the $b$ slider to each of $0.05$, $1.10$, and $1.75$, and then experiment with the $m$ slider. For one of those three $b$ values, you should be able to find an $m$ value that makes the root-mean-square error small enough.)

What value for $m$ did you find?
What value for $b$ did you find?
What is your equation for $y$?
What is the root-mean-square error for your line?
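
The slider experiment can also be imitated in code: fix $b$ at one of the hinted values, step $m$ through a range of positions, and keep the slope with the smallest error. This is only a sketch. It assumes the error is the square root of the mean squared vertical distance, and it assumes the $m$ slider moves from 0.00 to 0.30 in steps of 0.01; neither detail is stated in the lesson. The names `rmse`, `best_slope`, and `M_VALUES` are made up for this example.

```python
import math

# Bladder cancer table from this lesson as (x, y) pairs, in table order.
BLADDER = [
    (30.34, 3.46), (18.20, 2.90), (18.24, 2.99), (25.82, 3.52),
    (28.60, 4.46), (31.10, 5.11), (40.46, 5.60), (33.60, 4.78),
    (28.27, 4.46), (22.12, 4.23), (20.10, 3.08), (27.91, 4.75),
    (26.18, 4.09), (21.84, 2.91), (23.44, 2.86), (21.58, 4.65),
    (26.92, 4.69), (25.91, 5.21), (28.92, 4.79), (24.96, 5.27),
    (22.06, 3.72), (27.56, 4.04), (16.08, 3.06), (23.75, 3.95),
    (19.96, 2.89), (23.32, 3.72), (28.64, 5.98), (21.16, 2.90),
    (42.40, 6.54), (29.14, 5.30), (26.38, 4.47), (23.44, 2.93),
    (23.78, 4.89), (29.18, 4.99), (18.06, 3.25), (20.94, 3.64),
    (20.08, 2.94), (22.57, 3.21), (14.00, 3.31), (25.89, 4.63),
    (21.17, 4.04), (21.25, 5.14), (22.86, 4.78), (28.04, 3.20),
]


def rmse(m, b, points):
    """Root-mean-square error of the line y = m*x + b (assumed definition)."""
    total = sum((y - (m * x + b)) ** 2 for x, y in points)
    return math.sqrt(total / len(points))


def best_slope(b, points, m_values):
    """Try every slope in m_values with b held fixed; return (m, error) with the smallest error."""
    return min(((m, rmse(m, b, points)) for m in m_values), key=lambda pair: pair[1])


# Assumed slider positions for m: 0.00, 0.01, ..., 0.30.
M_VALUES = [i / 100 for i in range(31)]

# Error of the starting line y = 0.05x, for comparison with the number on the page.
print(f"start: error = {rmse(0.05, 0.0, BLADDER):.3f}")

for b in (0.05, 1.10, 1.75):  # the three b values suggested in the hint
    m, err = best_slope(b, BLADDER, M_VALUES)
    print(f"b = {b:.2f}: best m = {m:.2f}, error = {err:.3f}")
```

If the page uses the same definition of the error, the first printed value should be close to the 2.981 shown beside the graph, and the three remaining lines show which hinted $b$ value can reach the target.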

Lung cancer

$x$     $y$     State
30.34   25.88   AK
18.20   17.05   AL
18.24   15.98   AR
25.82   19.80   AZ
28.60   22.07   CA
31.10   22.83   CT
40.46   27.27   DC
33.60   24.55   DE
28.27   23.57   FL
22.12   16.59   IA
20.10   13.58   ID
27.91   22.80   IL
26.18   20.30   IN
21.84   16.84   KS
23.44   17.71   KY
21.58   25.45   LA
26.92   22.04   MA
25.91   26.48   MD
28.92   20.94   ME
24.96   22.72   MI
22.06   14.20   MN
27.56   20.98   MO
16.08   15.60   MS
23.75   19.50   MT
19.96   12.12   ND
23.32   16.70   NE
28.64   25.95   NJ
21.16   14.59   NM
42.40   23.03   NV
29.14   25.02   NY
26.38   21.89   OH
23.44   19.45   OK
23.78   12.11   PA
29.18   23.68   RI
18.06   17.45   SC
20.94   14.11   SD
20.08   17.60   TN
22.57   20.74   TX
14.00   12.01   UT
25.89   21.22   VT
21.17   20.34   WA
21.25   20.55   WI
22.86   15.53   WV
28.04   15.92   WY

Let’s look at deaths due to lung cancer. The numbers in the $x$ column of this table are the same as before, but the numbers in the $y$ column tell us how many people died of lung cancer for each 100,000 people living in the state.

For each 100,000 people living in Kansas (KS) in 1960, how many died of lung cancer?
Which state (or district) had the highest death rate due to lung cancer?
Which state had the lowest death rate due to lung cancer?

Find the point on the grid that represents the cigarette sales and lung cancer death rates for Nevada (NV) in 1960.

Is this point above all the other points on the grid, below all the other points, to the right of all the other points, to the left of all the other points, or none of these?
Does this mean that Nevada had the highest death rate from lung cancer, the lowest death rate from lung cancer, the most cigarette sales, the fewest cigarette sales, or none of these?
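
The link between a point's position on the grid and what it says about a state can be spelled out in code: the point farthest to the right has the largest $x$, and the highest point has the largest $y$. The points and labels below are hypothetical (they are not rows from the table), and the function name `extremes` is made up for this sketch.

```python
# Hypothetical points as (label, x, y); these are not rows from the lesson's table.
points = [("P", 20.0, 15.0), ("Q", 35.0, 18.0), ("R", 25.0, 24.0), ("S", 15.0, 12.0)]


def extremes(points):
    """Return the labels of the rightmost, leftmost, highest, and lowest points."""
    return {
        "rightmost": max(points, key=lambda p: p[1])[0],  # largest x
        "leftmost": min(points, key=lambda p: p[1])[0],   # smallest x
        "highest": max(points, key=lambda p: p[2])[0],    # largest y
        "lowest": min(points, key=lambda p: p[2])[0],     # smallest y
    }


print(extremes(points))
```

In this made-up example the rightmost point (Q) is not the highest point (R), so having the largest $x$ value does not by itself mean having the largest $y$ value.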

Look at the graph to see how the death rate from lung cancer relates to the cigarette sales. In states where more cigarettes were sold, do there tend to be more deaths from lung cancer, fewer deaths from lung cancer, or about the same number of deaths from lung cancer?

Slide $m$ and $b$ to find an equation for a line that is a better approximation of the points on the graph. Try to make your root-mean-square error smaller than 3.2. (One of $m=0.64$, $m=0.75$, or $m=0.81$ will work with some $b$.)

What value for $m$ did you find?
What value for $b$ did you find?
What is your equation for $y$?
What is the root-mean-square error for your line?
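
As a cross-check on a line found with the sliders, the slope and intercept that minimize the sum of squared vertical distances can be computed directly. This is the standard least-squares formula, which is not a method this lesson asks you to use. The sketch below applies it to the first six rows of the lung cancer table only, so its result will differ from a line fitted to all the states; the function name `least_squares_line` is made up for this example.

```python
def least_squares_line(points):
    """Return (m, b) for the line y = m*x + b that minimizes the sum of squared vertical distances."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    m = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# First six rows of the lung cancer table, as (x, y) pairs (a sample only).
sample = [(30.34, 25.88), (18.20, 17.05), (18.24, 15.98),
          (25.82, 19.80), (28.60, 22.07), (31.10, 22.83)]

m, b = least_squares_line(sample)
print(f"y = {m:.2f}x {b:+.2f}")
```

Minimizing the sum of squared distances also minimizes the root-mean-square error, as long as that error is defined as the square root of the average squared distance.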

Leukemia

$x$     $y$    State
30.34   4.90   AK
18.20   6.15   AL
18.24   6.94   AR
25.82   6.61   AZ
28.60   7.06   CA
31.10   7.20   CT
40.46   7.08   DC
33.60   6.45   DE
28.27   6.07   FL
22.12   7.69   IA
20.10   6.62   ID
27.91   7.27   IL
26.18   7.00   IN
21.84   7.42   KS
23.44   6.41   KY
21.58   6.71   LA
26.92   6.89   MA
25.91   6.81   MD
28.92   6.24   ME
24.96   6.91   MI
22.06   8.28   MN
27.56   6.82   MO
16.08   6.08   MS
23.75   6.90   MT
19.96   6.99   ND
23.32   7.80   NE
28.64   7.12   NJ
21.16   5.95   NM
42.40   6.67   NV
29.14   7.23   NY
26.38   7.38   OH
23.44   7.46   OK
23.78   6.83   PA
29.18   6.35   RI
18.06   5.82   SC
20.94   8.15   SD
20.08   6.59   TN
22.57   7.02   TX
14.00   6.71   UT
25.89   6.56   VT
21.17   7.48   WA
21.25   6.73   WI
22.86   7.38   WV
28.04   5.78   WY

This time, let’s analyze the relationship between the number of cigarettes sold and the number of people who died from leukemia.

Which state is represented by the highest dot on this graph?
Think about what it means to be represented by the highest dot. Does that state have the highest death rate from leukemia, the lowest death rate from leukemia, the most cigarette sales, or the fewest cigarette sales?
In states where more cigarettes were sold, do there tend to be more deaths from leukemia, fewer deaths from leukemia, or about the same number of deaths from leukemia?

Use the sliders for $m$ and $b$ to find a line with the smallest root-mean-square error you can. Try to get below 0.66. (One of $b=5.8$, $b=6.0$, or $b=6.8$ will work.)

What value for $m$ did you find?
What value for $b$ did you find?
What is your equation for $y$?
What is the root-mean-square error for your line?

Kidney cancer

$x$     $y$    State
30.34   4.32   AK
18.20   1.59   AL
18.24   2.02   AR
25.82   2.75   AZ
28.60   2.66   CA
31.10   3.35   CT
40.46   3.13   DC
33.60   3.36   DE
28.27   2.41   FL
22.12   2.90   IA
20.10   2.46   ID
27.91   2.95   IL
26.18   2.81   IN
21.84   2.88   KS
23.44   2.13   KY
21.58   2.30   LA
26.92   3.03   MA
25.91   2.85   MD
28.92   3.22   ME
24.96   2.97   MI
22.06   3.54   MN
27.56   2.55   MO
16.08   1.77   MS
23.75   3.43   MT
19.96   3.62   ND
23.32   2.92   NE
28.64   3.12   NJ
21.16   2.52   NM
42.40   2.85   NV
29.14   3.10   NY
26.38   2.95   OH
23.44   2.45   OK
23.78   2.75   PA
29.18   2.84   RI
18.06   2.05   SC
20.94   3.11   SD
20.08   2.18   TN
22.57   2.69   TX
14.00   2.20   UT
25.89   3.17   VT
21.17   2.78   WA
21.25   2.34   WI
22.86   3.28   WV
28.04   2.66   WY

You are looking at the data for deaths from kidney cancer. Use the sliders to find the line with the smallest root-mean-square error that you can. Try to get your root-mean-square error below 0.49. (One of $m=0.008$, $m=0.048$, or $m=0.092$ will work.)

What value for $m$ did you find?
What value for $b$ did you find?
What is your equation for $y$?
What is the root-mean-square error for your line?

In states where more cigarettes were sold, do there tend to be more deaths from kidney cancer, fewer deaths from kidney cancer, or about the same number of deaths from kidney cancer?

The meaning of the slope

From the table below, you can see that in the case of bladder cancer, the slope of the line is ?. This means that for every extra 100 cigarettes sold per person in 1960, the death rate from bladder cancer increased by about ? people per 100,000 population.
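
This reading of the slope can be checked with a small calculation: raising $x$ by 1 (one more hundred cigarettes per person) raises the predicted $y$ by exactly $m$, whatever the value of $b$. The line used below is hypothetical and is not one of the lines from this lesson; the function name `predicted_rate` is made up for this sketch.

```python
def predicted_rate(m, b, x):
    """Death rate per 100,000 people predicted by the line y = m*x + b."""
    return m * x + b


# Hypothetical line, chosen only for illustration (not a result from this lesson).
m, b = 0.10, 1.00

low = predicted_rate(m, b, 20.0)   # 2,000 cigarettes per person
high = predicted_rate(m, b, 21.0)  # 100 more cigarettes per person

# The difference equals the slope m; the intercept b cancels out.
print(round(high - low, 2))
```

The same subtraction works for any pair of $x$ values that differ by 1, which is why the slope alone answers questions about "every extra 100 cigarettes."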

Disease          slope ($m$)   $y$-intercept ($b$)   Equation   Error
Bladder Cancer
Lung Cancer
Leukemia
Kidney Cancer

Think about what the slope tells you in the case of lung cancer. For every extra 100 cigarettes sold per person in 1960, by about how much did the death rate from lung cancer per 100,000 people increase?
For every extra 100 cigarettes sold per person in 1960, by about how much did the death rate from leukemia per 100,000 people increase?
For every extra 100 cigarettes sold per person in 1960, by about how much did the death rate from kidney cancer per 100,000 people increase?
Which of these four death rates appears to be most affected by cigarette sales?
Which of these four death rates appears to be least affected by cigarette sales?