Lab 4F

Directions: Follow along with the slides, completing
the questions in **blue** on your
computer, and answering the questions in **red** in your
journal.


- So far, in the labs, we’ve learned how to make predictions using the
*line of best fit*, also known as *linear models* or *regression models*.
- We’ve also learned how to measure our model’s prediction accuracy by cross-validation.
- In this lab, we’ll investigate the following question:

*Will including more variables in our model improve its
predictions?*

- Start by loading the `movie` data and split it into two sets (see Lab 4C for help).
  - A set named `training` that includes 75% of the data.
  - A set named `test` that includes the remaining 25%.
  - Remember to use `set.seed`.
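
The splitting step can be sketched in base R as follows. This is only an illustration under stated assumptions: it uses the built-in `mtcars` data as a stand-in so that it runs anywhere, the seed value `123` and the names `demo_data`, `n_rows`, and `train_rows` are arbitrary choices, and Lab 4C may use different functions to do the same job.

```r
# Stand-in data (assumption): the built-in mtcars data set is used only so
# this sketch runs anywhere. In the lab you would use the movie data instead.
demo_data <- mtcars

# Fix the random seed so the same rows are chosen every time the code runs.
set.seed(123)

# Randomly pick 75% of the row numbers for the training set.
n_rows <- nrow(demo_data)
train_rows <- sample(n_rows, size = round(0.75 * n_rows))

# The picked rows form the training set; every other row goes to the test set.
training <- demo_data[train_rows, ]
test <- demo_data[-train_rows, ]

# Check the sizes of the two sets (mtcars has 32 rows, so 24 and 8).
nrow(training)
nrow(test)
```

Because the test rows are everything that was not sampled, no observation can end up in both sets.
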
- Create a linear model, using the `training` data, that predicts `gross` using `runtime`.
  - Calculate the MSE of the model by making predictions for the `test` data.

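
As a rough sketch of fitting a one-variable model and scoring it on held-out data, the example below again uses the built-in `mtcars` data as a stand-in (an assumption, not the lab's data), predicting `mpg` from `wt`. The names `one_var_model`, `test_predictions`, and `one_var_mse` are hypothetical.

```r
# Stand-in data and split (assumption: mtcars replaces the movie data here).
demo_data <- mtcars
set.seed(123)
train_rows <- sample(nrow(demo_data), size = round(0.75 * nrow(demo_data)))
training <- demo_data[train_rows, ]
test <- demo_data[-train_rows, ]

# Fit the line of best fit using only the training data.
one_var_model <- lm(mpg ~ wt, data = training)

# Use the fitted model to predict the outcome for the unseen test rows.
test_predictions <- predict(one_var_model, newdata = test)

# Mean squared error: the average of the squared prediction errors.
one_var_mse <- mean((test$mpg - test_predictions)^2)
one_var_mse
```

The model never sees the `test` rows while it is being fit, so the MSE measures how well it predicts new observations.
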
**Do you think that a movie’s `runtime` is the only factor that goes into how much a movie will make? What else might affect a movie’s `gross`?**

- Data scientists often find that including more relevant information
in their models leads to better predictions.
  - Fill in the blanks below to predict `gross` using `runtime` and `reviews_num`.
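
The general pattern for adding a second predictor is to join the variables with `+` on the right-hand side of the formula. The sketch below shows this on the built-in `mtcars` data (a stand-in, not the `movie` data), with the hypothetical names `one_var_model`, `two_var_model`, and `mse_on_test`.

```r
# Stand-in data and split (assumption: mtcars replaces the movie data here).
demo_data <- mtcars
set.seed(123)
train_rows <- sample(nrow(demo_data), size = round(0.75 * nrow(demo_data)))
training <- demo_data[train_rows, ]
test <- demo_data[-train_rows, ]

# One predictor: outcome ~ predictor.
one_var_model <- lm(mpg ~ wt, data = training)

# Two predictors: list both on the right-hand side, separated by +.
two_var_model <- lm(mpg ~ wt + hp, data = training)

# Helper (hypothetical name) that computes the test MSE for a fitted model.
mse_on_test <- function(fitted_model) {
  predictions <- predict(fitted_model, newdata = test)
  mean((test$mpg - predictions)^2)
}

# Compare the two models on the same test data.
mse_on_test(one_var_model)
mse_on_test(two_var_model)
```

Both models are scored on the same `test` rows, so the smaller MSE points to the model with the more accurate predictions on that data.
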

**Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.**

**Write down the code you would use to include a 3rd variable, of your choosing, in your `lm()`.**

**Write down which other variables in the `movie` data you think would help you make better predictions. Are there any variables that you think would not improve our predictions?**

- Create a model for all of the variables you think
are relevant.

**Assess whether your model makes more accurate predictions for the `test` data than the model that included only `runtime` and `reviews_num`.**

**With your neighbors, determine which combination of variables leads to the best predictions for the `test` data.**
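
One possible way to compare several combinations of variables at once is to store each candidate formula in a list and compute the test MSE for each. This is a minimal sketch on the built-in `mtcars` data (a stand-in, not the `movie` data); the names `candidate_formulas`, `test_mse`, and `mse_by_model` are hypothetical.

```r
# Stand-in data and split (assumption: mtcars replaces the movie data here).
demo_data <- mtcars
set.seed(123)
train_rows <- sample(nrow(demo_data), size = round(0.75 * nrow(demo_data)))
training <- demo_data[train_rows, ]
test <- demo_data[-train_rows, ]

# Each candidate model is described by a formula with a different
# combination of predictors.
candidate_formulas <- list(
  one_variable = mpg ~ wt,
  two_variables = mpg ~ wt + hp,
  three_variables = mpg ~ wt + hp + qsec
)

# Fit a formula on the training data and return its MSE on the test data.
test_mse <- function(model_formula) {
  fitted_model <- lm(model_formula, data = training)
  predictions <- predict(fitted_model, newdata = test)
  mean((test$mpg - predictions)^2)
}

# Compute the test MSE for every candidate and list them smallest first.
mse_by_model <- sapply(candidate_formulas, test_mse)
sort(mse_by_model)
```

The ranking can change with a different seed or split, so it is worth checking whether the same combination still comes out on top when the split is repeated.
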