Lab 4B

Directions: Follow along with the slides, completing
the questions in **blue** on your
computer, and answering the questions in **red** in your
journal.


- In the previous lab, we learned we could make predictions about one variable by utilizing the information of another.
- In this lab, we will learn how to measure the accuracy of our
predictions.
- This in turn will let us evaluate how well a model performs at making predictions.
- We’ll also use this information later to compare different models to find which model makes the best predictions.

- Load the `arm_span` data again.
- Create an `xyplot` with `height` on the y-axis and `armspan` on the x-axis.
- Type `add_line()` to run the `add_line` function; you’ll be prompted to click twice in the plot window to create a line that you think fits the data well.
- Fill in the blanks below to create a function that will make predictions of people’s `height`s based on their `armspan`:

- Fill in the blanks to include your predictions in the `arm_span` data.
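The two steps above can be sketched in base R. This is only an illustration under stated assumptions: `toy` is a small made-up data frame (not the lab's `arm_span` data), `predict_height` is a hypothetical function name, and the slope `0.9` and intercept `7` are placeholders for whatever line you drew with `add_line()`.

```r
# Made-up measurements for illustration only; not the lab's arm_span data.
toy <- data.frame(
  armspan = c(60, 62, 65, 68, 70, 72),
  height  = c(61, 62, 66, 67, 71, 72)
)

# Hypothetical slope and intercept; yours come from the line you drew.
predict_height <- function(armspan) {
  0.9 * armspan + 7
}

# Add the predictions as a new column (a base R stand-in for a mutate step).
toy$predicted_height <- predict_height(toy$armspan)
toy
```

Writing the line as a function means the same rule is applied to every person in the data, and it can be reused for arm spans that are not in the data.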

- Now that we’ve made our predictions, we’ll need to figure out a way to decide how accurate our predictions are.
- We’ll want to compare our *predicted heights* to the *actual heights*.
- At the end, we’ll want to come up with a single number summary that describes our model’s accuracy.

- A *residual* is the difference between the actual and predicted value of a quantity of interest.
- Fill in the blanks below to add a column of residuals to `arm_span`.
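As a rough sketch of the definition above, a residual column is the actual value minus the predicted value. The data frame `toy` and its numbers are made up for illustration, and the column names `predicted_height` and `residual` are assumptions rather than the names the lab expects.

```r
# Made-up actual and predicted heights, for illustration only.
toy <- data.frame(
  height           = c(61, 62, 66, 67, 71, 72),
  predicted_height = c(61.0, 62.8, 65.5, 68.2, 70.0, 71.8)
)

# Residual = actual - predicted.
# Positive: the prediction was too low. Negative: the prediction was too high.
toy$residual <- toy$height - toy$predicted_height
toy
```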

**What do the residuals measure?**

One method we might consider to measure our model’s accuracy is to sum the residuals.

Fill in the blanks below to calculate our accuracy summary.

Hint: Like `mutate`, the first argument of `summarize` is a dataframe, and the second argument is the action to perform on a column of the dataframe. Whereas the output of `mutate` is a column, the output of `summarize` is (usually) a single number summary.

**Describe and interpret, in words, what the output of your accuracy summary means.**

**Write down why adding positive and negative errors together is problematic for assessing prediction accuracy.**

When adding residuals, the positive errors in our predictions (underestimates) are cancelled out by the negative errors (overestimates), which gives the impression that our model is making better predictions than it actually is.
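A small made-up example shows the cancellation. The vectors `residuals_a` and `residuals_b` are hypothetical: the first comes from a model that misses every prediction, the second from a model that is exactly right every time.

```r
# Hypothetical residuals from two models (made-up numbers).
residuals_a <- c(-3, 3, -2, 2)  # every prediction is off by 2 or 3 units
residuals_b <- c(0, 0, 0, 0)    # every prediction is exactly right

# Both sums are 0, so the sum alone cannot tell the two models apart.
sum(residuals_a)
sum(residuals_b)
```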

To solve this problem we calculate the squared values of the errors, because squared values are never negative.

The *mean squared error* (MSE) is calculated by squaring all of the residuals, and then taking the mean of the squared residuals.

Fill in the blanks below to calculate the MSE of your line.
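The calculation described above might look like this in base R. The `residuals` vector holds made-up values for illustration; in the lab you would use the residual column you added to `arm_span`.

```r
# Made-up residuals, for illustration only.
residuals <- c(0, -0.8, 0.5, -1.2, 1.0, 0.2)

# Square each residual, then take the mean of the squares.
squared_residuals <- residuals^2
mse <- mean(squared_residuals)
mse
```

Because every squared residual is zero or larger, overestimates and underestimates can no longer cancel each other out, and a smaller MSE means predictions that sit closer to the actual values.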

**Compare your MSE with a neighbor. Whose line was more accurate and why?**

- If you were to go around your class, each student would have created a different line that they feel *fits* the data best.
- This is a problem because everyone’s line will make slightly different predictions.

- To avoid this variation in predictions, data scientists will use *regression lines*.
- We also refer to *regression lines* as *linear models*.
- This line connects the mean `height` of people with similar `armspan`s.
- Fill in the blanks below to create a *regression line* using `lm`, which stands for *linear model*:
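For reference, a minimal sketch of fitting a regression line with `lm` on made-up data. The data frame `toy` and the name `toy_fit` are assumptions for illustration; the lab itself uses the `arm_span` data and stores the model as `best_fit`.

```r
# Made-up measurements for illustration only; not the lab's arm_span data.
toy <- data.frame(
  armspan = c(60, 62, 65, 68, 70, 72),
  height  = c(61, 62, 66, 67, 71, 72)
)

# The formula height ~ armspan reads "predict height using armspan".
toy_fit <- lm(height ~ armspan, data = toy)

# The intercept and slope of the fitted line.
coef(toy_fit)
```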

Type `best_fit` into the console to see the slope and intercept of the regression line.

Add this line to a scatterplot by filling in the blanks below.

- Making predictions with models that `R` is familiar with is simpler than making them with lines, or models, that we come up with ourselves.
- Fill in the blanks to make predictions using `best_fit`:

- Hint: the `predict` function takes a linear model as input, and outputs the predictions of that model.
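As a sketch of how `predict` behaves, the example below fits a model to made-up data and asks it for predictions. The names `toy` and `toy_fit` and the arm span value `66` are hypothetical; the lab's model is `best_fit`.

```r
# Made-up measurements for illustration only; not the lab's arm_span data.
toy <- data.frame(
  armspan = c(60, 62, 65, 68, 70, 72),
  height  = c(61, 62, 66, 67, 71, 72)
)
toy_fit <- lm(height ~ armspan, data = toy)

# With only the model as input, predict() returns one prediction
# for each row of the data used to fit the model.
toy$predicted_height <- predict(toy_fit)

# predict() can also be given new data, here a single hypothetical arm span.
predict(toy_fit, newdata = data.frame(armspan = 66))
```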

The `lm()` function creates the *line of best fit* equation by finding the line that minimizes the *mean squared error*. In that sense, it’s the *best fitting line possible*.

Calculate the MSE for the values predicted using the regression line.
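One way to check the claim about minimizing the MSE is to compute the MSE of a hand-picked line and of the `lm()` line on the same data. Everything here is illustrative: `toy` is made-up data, and the slope `0.9` and intercept `7` stand in for a line drawn by eye.

```r
# Made-up measurements for illustration only; not the lab's arm_span data.
toy <- data.frame(
  armspan = c(60, 62, 65, 68, 70, 72),
  height  = c(61, 62, 66, 67, 71, 72)
)

# MSE of a hand-picked line (hypothetical slope and intercept).
hand_predictions <- 0.9 * toy$armspan + 7
mse_hand <- mean((toy$height - hand_predictions)^2)

# MSE of the regression line fitted by lm().
toy_fit <- lm(height ~ armspan, data = toy)
mse_lm <- mean(residuals(toy_fit)^2)

# On the data used to fit it, the lm() line can tie a hand-picked line
# but cannot have a larger MSE.
c(hand = mse_hand, lm = mse_lm)
mse_lm <= mse_hand
```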

**Compare the MSE of the linear model you fitted using `add_line()` to the MSE of the linear model obtained with `lm()`. Which linear model performed better?**

**Ask your neighbors if any of their lines beat the `lm` line in terms of the MSE. Were any of them successful?**