In the previous lessons you have worked with linear equations. You are now going to use
what you learned in those lessons to find equations that describe the relationship between
points that appear to be roughly in a straight line.
The data that you will use in this lesson shows the relationship between the number of
cigarettes bought and the death rates from cancer in different states during 1960. This
data was taken from http://lib.stat.cmu.edu/DASL/Datafiles/cigcancerdat.html.
Bladder cancer
$x$ (hundreds of cigarettes sold per person)  $y$ (bladder cancer deaths per 100,000 people)  State
30.34  3.46  AK

18.20  2.90  AL

18.24  2.99  AR

25.82  3.52  AZ

28.60  4.46  CA

31.10  5.11  CT

40.46  5.60  DC

33.60  4.78  DE

28.27  4.46  FL

22.12  4.23  IA

20.10  3.08  ID

27.91  4.75  IL

26.18  4.09  IN

21.84  2.91  KS

23.44  2.86  KY

21.58  4.65  LA

26.92  4.69  MA

25.91  5.21  MD

28.92  4.79  ME

24.96  5.27  MI

22.06  3.72  MN

27.56  4.04  MO

16.08  3.06  MS

23.75  3.95  MT

19.96  2.89  ND

23.32  3.72  NE

28.64  5.98  NJ

21.16  2.90  NM

42.40  6.54  NV

29.14  5.30  NY

26.38  4.47  OH

23.44  2.93  OK

23.78  4.89  PA

29.18  4.99  RI

18.06  3.25  SC

20.94  3.64  SD

20.08  2.94  TN

22.57  3.21  TX

14.00  3.31  UT

25.89  4.63  VT

21.17  4.04  WA

21.25  5.14  WI

22.86  4.78  WV

28.04  3.20  WY

The numbers in the $x$ column of this table tell us how
many hundreds of cigarettes were sold per person in each state during 1960. For example,
in Alabama (AL), the number in the $x$ column is 18.2.
This means that for each person living in Alabama in 1960, 1,820 cigarettes
were sold (that is, 18.2 hundreds).
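The unit conversion described above can be sketched in a few lines of Python. The function name `hundreds_to_cigarettes` is only illustrative; the value 18.20 is the Alabama entry from the table.

```python
def hundreds_to_cigarettes(x_hundreds):
    """Convert an x value (hundreds of cigarettes per person) to cigarettes per person."""
    # round() guards against tiny floating-point errors in the multiplication
    return round(x_hundreds * 100)

alabama_x = 18.20  # x value for Alabama (AL) from the table
print(hundreds_to_cigarettes(alabama_x))  # 1820
```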
In 1960, how many hundreds of cigarettes were sold per person in
Arizona (AZ)?
 
The numbers in the $y$ column tell us how
many people died of bladder cancer for each 100,000 people living in the state.
For example, in Alabama (AL), for each 100,000 people, 2.9 people died of bladder cancer.
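To see what a rate "per 100,000 people" means, the short Python sketch below scales a rate up to a whole population. The population of 3,000,000 is a made-up round number chosen for illustration, not a figure from the data set, and the function name is hypothetical.

```python
def deaths_for_population(rate_per_100k, population):
    """Scale a death rate per 100,000 people to a population of the given size."""
    return rate_per_100k * population / 100_000

alabama_rate = 2.90                  # y value for Alabama (AL) from the table
hypothetical_population = 3_000_000  # assumed for illustration only

print(round(deaths_for_population(alabama_rate, hypothetical_population)))  # 87
```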
In 1960, for every 100,000 people living in Idaho (ID), how many
died of bladder cancer?
 
Finding the “best fit” line
Just by looking at the table, it’s hard to tell whether there is a relationship between the
number of cigarettes sold in each state and the number of deaths due to bladder cancer.
Another way to look at the data in the table is to graph it. The graph to the left shows the
points that represent the data.
When the points on a grid are not all on a straight line, but
seem to have a somewhat linear pattern, you can find a
“best fit” line: the line that comes closest to the pattern.
The number 2.981, shown to the right of the graph, is called the
root-mean-square error for the graphed line
($y=0.05x$).
This number tells you how far the line is from the
points overall (a smaller number means the line is a
better “fit” to the points). (In this course, you will not usually be
asked to compute the root-mean-square error of a line; it will be given to you.)
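Although you will not be asked to compute it, the following Python sketch shows one common way such an error is calculated. It assumes the standard definition (square each vertical gap between a point and the line, average the squares, then take the square root); the lesson does not state the exact formula used by the graph. The name `rms_error` and the four-row sample are illustrative only.

```python
import math

def rms_error(points, m, b):
    """Root-mean-square of the vertical gaps between the points and y = m*x + b."""
    squared_gaps = [(y - (m * x + b)) ** 2 for x, y in points]
    return math.sqrt(sum(squared_gaps) / len(squared_gaps))

# First four rows of the bladder cancer table (AK, AL, AR, AZ) as (x, y) pairs
sample = [(30.34, 3.46), (18.20, 2.90), (18.24, 2.99), (25.82, 3.52)]

# Error of the line y = 0.05x on this small sample
print(round(rms_error(sample, 0.05, 0.0), 3))
```

Because the sample uses only four states, the printed value will not match the 2.981 shown for the full data set.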
Use both sliders in the lower left portion of your screen to change the values of $m$
and $b$ in the equation $y=mx+b$,
and find an equation for a line that is a better approximation of the
points on the grid (that is, one with a smaller root-mean-square error).
The equation for the line you are looking at is written below the grid.
Try to make the root-mean-square error smaller than 0.69. (Hint: Set the $b$
slider to each of $0.05$, $1.10$, and $1.75$, and then experiment with the $m$ slider. For one
of those three $b$ values, you should be able to find an $m$ value that makes the
root-mean-square error small enough.)
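The slider experiment in the hint can be imitated in Python: hold $b$ fixed, step through candidate $m$ values, and keep the one with the smallest error. This is only a sketch. The error formula is the standard root-mean-square definition (an assumption about the graph), the slider range and step size are guesses, and the four-row sample stands in for the full table.

```python
import math

def rms_error(points, m, b):
    """Root-mean-square of the vertical gaps between the points and y = m*x + b."""
    return math.sqrt(sum((y - (m * x + b)) ** 2 for x, y in points) / len(points))

def best_m_for_b(points, b, m_values):
    """Return (m, error) for the candidate slope with the smallest error at this b."""
    return min(((m, rms_error(points, m, b)) for m in m_values),
               key=lambda pair: pair[1])

# First four rows of the bladder cancer table (AK, AL, AR, AZ) as (x, y) pairs
sample = [(30.34, 3.46), (18.20, 2.90), (18.24, 2.99), (25.82, 3.52)]

# Assumed slider positions for m: 0.00, 0.01, ..., 0.30
m_steps = [i / 100 for i in range(31)]

for b in (0.05, 1.10, 1.75):  # the three b values from the hint
    m, err = best_m_for_b(sample, b, m_steps)
    print(b, m, round(err, 3))
```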
What value for $m$ did you find?
 
What value for $b$ did you find?
 
What is your equation for $y$?
 
What is the root-mean-square error for your line?
 
Lung cancer
$x$ (hundreds of cigarettes sold per person)  $y$ (lung cancer deaths per 100,000 people)  State
30.34  25.88  AK

18.20  17.05  AL

18.24  15.98  AR

25.82  19.80  AZ

28.60  22.07  CA

31.10  22.83  CT

40.46  27.27  DC

33.60  24.55  DE

28.27  23.57  FL

22.12  16.59  IA

20.10  13.58  ID

27.91  22.80  IL

26.18  20.30  IN

21.84  16.84  KS

23.44  17.71  KY

21.58  25.45  LA

26.92  22.04  MA

25.91  26.48  MD

28.92  20.94  ME

24.96  22.72  MI

22.06  14.20  MN

27.56  20.98  MO

16.08  15.60  MS

23.75  19.50  MT

19.96  12.12  ND

23.32  16.70  NE

28.64  25.95  NJ

21.16  14.59  NM

42.40  23.03  NV

29.14  25.02  NY

26.38  21.89  OH

23.44  19.45  OK

23.78  12.11  PA

29.18  23.68  RI

18.06  17.45  SC

20.94  14.11  SD

20.08  17.60  TN

22.57  20.74  TX

14.00  12.01  UT

25.89  21.22  VT

21.17  20.34  WA

21.25  20.55  WI

22.86  15.53  WV

28.04  15.92  WY

Let’s look at deaths due to lung cancer. The numbers in the $x$ column of
this table are the same as before, but the numbers in the $y$ column
tell us how many people died of lung cancer for each 100,000 people living in the state.
In 1960, for every 100,000 people living in Kansas (KS), how many died of
lung cancer?
 
Find the point on the grid that represents
the cigarette sales and lung cancer death rates for Nevada (NV) in 1960.
Is this point above all
the other points on the grid, below all the other points, to the
right of all the other points, to the left of all the other points, or
none of these?
 
Does this mean that Nevada has the highest death
rate from lung cancer, the lowest death rate from lung cancer, the
most cigarette sales, the fewest cigarette sales, or
none of these?
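A point's position on the grid comes straight from its coordinates: the point farthest to the right has the largest $x$ value, and the highest point has the largest $y$ value. The Python sketch below illustrates this with four rows picked from the lung cancer table (AL, CA, CT, FL); the variable names are illustrative only.

```python
# (state, x, y) rows from the lung cancer table
rows = [
    ("AL", 18.20, 17.05),
    ("CA", 28.60, 22.07),
    ("CT", 31.10, 22.83),
    ("FL", 28.27, 23.57),
]

rightmost = max(rows, key=lambda row: row[1])  # largest x: most cigarette sales
highest = max(rows, key=lambda row: row[2])    # largest y: highest death rate

print(rightmost[0])  # CT
print(highest[0])    # FL
```

Within this small sample the rightmost point and the highest point belong to different states, so being farthest right does not by itself mean being highest.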
 
Look at the graph to see how the death rate from lung
cancer relates to the cigarette sales. In states where more cigarettes were sold, do there tend
to be more deaths from lung cancer, fewer deaths from lung
cancer, or about the same number of deaths from lung cancer?
 
Slide $m$ and $b$ to find an equation for a line
that is a better approximation of the points on the graph.
Try to make your root-mean-square error smaller than 3.2.
(One of $m=0.64$, $m=0.75$, or $m=0.81$ will work with some $b$.)
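The sliders find $m$ and $b$ by trial and error. As a side note, the line with the smallest possible root-mean-square error can also be computed directly with the usual least-squares formulas, since minimizing the root-mean-square error is the same as minimizing the sum of squared vertical gaps. The Python sketch below is not the method this lesson asks you to use; the function name is illustrative and the four rows are a small sample from the lung cancer table.

```python
def least_squares_line(points):
    """Return (m, b) for the line y = m*x + b that minimizes the sum of squared vertical gaps."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    m = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

# Four rows from the lung cancer table (AL, CA, CT, FL) as (x, y) pairs
sample = [(18.20, 17.05), (28.60, 22.07), (31.10, 22.83), (28.27, 23.57)]
m, b = least_squares_line(sample)
print(round(m, 2), round(b, 2))
```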
What value for $m$ did you find?
 
What value for $b$ did you find?
 
What is your equation for $y$?
 
What is the root-mean-square error for your line?
 
Leukemia
$x$ (hundreds of cigarettes sold per person)  $y$ (leukemia deaths per 100,000 people)  State
30.34  4.90  AK

18.20  6.15  AL

18.24  6.94  AR

25.82  6.61  AZ

28.60  7.06  CA

31.10  7.20  CT

40.46  7.08  DC

33.60  6.45  DE

28.27  6.07  FL

22.12  7.69  IA

20.10  6.62  ID

27.91  7.27  IL

26.18  7.00  IN

21.84  7.42  KS

23.44  6.41  KY

21.58  6.71  LA

26.92  6.89  MA

25.91  6.81  MD

28.92  6.24  ME

24.96  6.91  MI

22.06  8.28  MN

27.56  6.82  MO

16.08  6.08  MS

23.75  6.90  MT

19.96  6.99  ND

23.32  7.80  NE

28.64  7.12  NJ

21.16  5.95  NM

42.40  6.67  NV

29.14  7.23  NY

26.38  7.38  OH

23.44  7.46  OK

23.78  6.83  PA

29.18  6.35  RI

18.06  5.82  SC

20.94  8.15  SD

20.08  6.59  TN

22.57  7.02  TX

14.00  6.71  UT

25.89  6.56  VT

21.17  7.48  WA

21.25  6.73  WI

22.86  7.38  WV

28.04  5.78  WY

This time, let’s analyze the relationship between the number of cigarettes
sold and the number of people who died from leukemia.
Think about what it means to be represented by the highest
dot. Does that state have the highest death rate from
leukemia, the lowest death rate from leukemia, the most cigarette
sales, or the fewest cigarette sales?
 
In states where more cigarettes were sold, do there tend
to be more deaths from leukemia, fewer deaths from leukemia,
or about the same number of deaths from leukemia?
 
Use the sliders for $m$ and $b$ to find a line
with the smallest root-mean-square error you can. Try to get below 0.66.
(One of $b=5.8$, $b=6.0$, or $b=6.8$ will work.)
What value for $m$ did you find?
 
What value for $b$ did you find?
 
What is your equation for $y$?
 
What is the root-mean-square error for your line?
 
Kidney cancer
$x$ (hundreds of cigarettes sold per person)  $y$ (kidney cancer deaths per 100,000 people)  State
30.34  4.32  AK

18.20  1.59  AL

18.24  2.02  AR

25.82  2.75  AZ

28.60  2.66  CA

31.10  3.35  CT

40.46  3.13  DC

33.60  3.36  DE

28.27  2.41  FL

22.12  2.90  IA

20.10  2.46  ID

27.91  2.95  IL

26.18  2.81  IN

21.84  2.88  KS

23.44  2.13  KY

21.58  2.30  LA

26.92  3.03  MA

25.91  2.85  MD

28.92  3.22  ME

24.96  2.97  MI

22.06  3.54  MN

27.56  2.55  MO

16.08  1.77  MS

23.75  3.43  MT

19.96  3.62  ND

23.32  2.92  NE

28.64  3.12  NJ

21.16  2.52  NM

42.40  2.85  NV

29.14  3.10  NY

26.38  2.95  OH

23.44  2.45  OK

23.78  2.75  PA

29.18  2.84  RI

18.06  2.05  SC

20.94  3.11  SD

20.08  2.18  TN

22.57  2.69  TX

14.00  2.20  UT

25.89  3.17  VT

21.17  2.78  WA

21.25  2.34  WI

22.86  3.28  WV

28.04  2.66  WY

You are looking at the data for deaths from kidney cancer. Use the sliders to
find the line with the smallest root-mean-square error that you can.
Try to get your root-mean-square error below 0.49. (One of $m=0.008$,
$m=0.048$, or $m=0.092$ will work.)
What value for $m$ did you find?
 
What value for $b$ did you find?
 
What is your equation for $y$?
 
What is the root-mean-square error for your line?
 
In states where more cigarettes were sold, do there
tend to be more deaths from kidney cancer, fewer deaths from
kidney cancer, or about the same number of deaths from kidney cancer?
 
The meaning of the slope
From the table below, you can see that in the case of
bladder cancer, the slope of the line is ?. This means that for
every extra 100 cigarettes sold per person in 1960, the death rate from bladder cancer
increased by about ? people per 100,000 population.
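The Python sketch below shows why the slope has this meaning. Because $x$ is measured in hundreds of cigarettes, raising $x$ by 1 corresponds to an extra 100 cigarettes per person, and the predicted $y$ changes by exactly $m$. The values of `m_hyp` and `b_hyp` are hypothetical numbers chosen for easy arithmetic; they are not the fitted values for any of the four data sets.

```python
def predicted_rate(m, b, x):
    """Death rate per 100,000 people predicted by the line y = m*x + b."""
    return m * x + b

m_hyp, b_hyp = 0.5, 2.0  # hypothetical slope and intercept, for illustration only

# Compare two x values that differ by 1 (that is, by 100 cigarettes per person)
change = predicted_rate(m_hyp, b_hyp, 21.0) - predicted_rate(m_hyp, b_hyp, 20.0)
print(change)  # 0.5
```

The printed change equals the slope, whatever the intercept is.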
Think about what the slope tells you in the case of lung cancer. For
every extra 100 cigarettes sold per person in 1960, by about how much did the death rate from lung
cancer per 100,000 people increase?
 
For every extra 100 cigarettes sold per person in 1960, by about how much
did the death rate from leukemia per 100,000 people increase?
 
For every extra 100 cigarettes sold per person in 1960, by about how much
did the death rate from kidney cancer per 100,000 people increase?
 
Which of these four death rates appears to be most affected by cigarette sales?
 
Which of these four death rates appears to be least affected by cigarette sales?
 