Lab 4E

Directions: Follow along with the slides and answer the questions in **red** font in your journal.

- So far, in the labs, we’ve learned how to make predictions using the *line of best fit*.
  - Which we also call *linear models* or *regression models*.
- We’ve also learned how to measure our model’s prediction accuracy by cross-validation.
- In this lab, we’ll investigate the following question:

*Will including more variables in our model improve its predictions?*

- Start by loading the `movie` data and split it into two sets (see Lab 4C for help). Remember to use `set.seed`.
  - A set named `training` that includes 75% of the data.
  - A set named `testing` that includes the remaining 25%.
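The split described above can be sketched in base R as follows. This is only one possible approach and may differ from the method shown in Lab 4C. The simulated `movie` data frame, the seed value `123`, and the name `train_rows` are assumptions made so the example runs on its own; in the lab you would use the real `movie` data.

```r
# Hypothetical stand-in for the `movie` data (simulated values, not real movies).
set.seed(123)
n <- 200
movie <- data.frame(
  runtime     = round(runif(n, 80, 180)),
  reviews_num = round(runif(n, 10, 500))
)
movie$gross <- 1e5 * movie$runtime + 2e5 * movie$reviews_num + rnorm(n, sd = 1e7)

# Setting the seed makes the random split reproducible.
set.seed(123)

# Randomly choose 75% of the row numbers for the training set.
train_rows <- sample(1:nrow(movie), size = round(0.75 * nrow(movie)))

training <- movie[train_rows, ]   # the sampled 75% of rows
testing  <- movie[-train_rows, ]  # the remaining 25% of rows
```

Because `testing` is built from the rows that were not sampled, no movie appears in both sets.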
- Create a linear model, using the `training` data, that predicts `gross` using `runtime`.
  - Compute the MSE of the model by making predictions for the `testing` data.
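A minimal sketch of fitting the one-variable model and computing its MSE on the `testing` data is shown below. It assumes a simulated stand-in for `movie` and a base R split; the names `fit_runtime`, `predicted`, and `mse_runtime` are chosen here for illustration only.

```r
# Hypothetical stand-in for the `movie` data (simulated values, not real movies).
set.seed(123)
n <- 200
movie <- data.frame(
  runtime     = round(runif(n, 80, 180)),
  reviews_num = round(runif(n, 10, 500))
)
movie$gross <- 1e5 * movie$runtime + 2e5 * movie$reviews_num + rnorm(n, sd = 1e7)

# 75% / 25% split into training and testing sets.
train_rows <- sample(1:nrow(movie), size = round(0.75 * nrow(movie)))
training <- movie[train_rows, ]
testing  <- movie[-train_rows, ]

# Fit the linear model on the training data only.
fit_runtime <- lm(gross ~ runtime, data = training)

# Predict gross for the movies the model has not seen.
predicted <- predict(fit_runtime, newdata = testing)

# MSE: the average of the squared differences between actual and predicted values.
mse_runtime <- mean((testing$gross - predicted)^2)
mse_runtime
```

The model is fit on `training` and scored on `testing`, so the MSE reflects how well it predicts movies it was not built from.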
**Do you think that a movie’s `runtime` is the only factor that goes into how much a movie will make? What else might affect a movie’s `gross`?**

- Data scientists often find that including more relevant information in their models leads to better predictions.
- Fill in the blanks below to predict `gross` using `runtime` and `reviews_num`.
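One way the two-variable model might look once the blanks are filled in is sketched here. In an `lm()` formula, additional predictors are joined with `+`. The simulated `movie` data and the names `fit_one`, `fit_two`, and `test_mse` are assumptions for illustration, not part of the lab's own code.

```r
# Hypothetical stand-in for the `movie` data (simulated values, not real movies).
set.seed(123)
n <- 200
movie <- data.frame(
  runtime     = round(runif(n, 80, 180)),
  reviews_num = round(runif(n, 10, 500))
)
movie$gross <- 1e5 * movie$runtime + 2e5 * movie$reviews_num + rnorm(n, sd = 1e7)

# 75% / 25% split into training and testing sets.
train_rows <- sample(1:nrow(movie), size = round(0.75 * nrow(movie)))
training <- movie[train_rows, ]
testing  <- movie[-train_rows, ]

# Helper (hypothetical name): MSE of a fitted model on the testing data.
test_mse <- function(fit) {
  mean((testing$gross - predict(fit, newdata = testing))^2)
}

# One predictor versus two predictors; `+` adds a variable to the model.
fit_one <- lm(gross ~ runtime, data = training)
fit_two <- lm(gross ~ runtime + reviews_num, data = training)

# Compare the two models on the same testing data; a smaller MSE
# means more accurate predictions.
mse_one <- test_mse(fit_one)
mse_two <- test_mse(fit_two)
c(runtime_only = mse_one, runtime_and_reviews = mse_two)
```

Both models are scored on the same `testing` rows, so their MSE values can be compared directly.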

**Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.**

**Write down the code you would use to include a 3rd variable, of your choosing, in your `lm()`.**

**Write down which other variables in the `movie` data you think would help you make better predictions.**

**Are there any variables that you think would not improve our predictions?**

**Create a model for all of the variables you think are relevant.**

**Assess whether your model makes more accurate predictions for the `testing` data than the model that included only `runtime` and `reviews_num`.**

**With your neighbors, determine which combination of variables leads to the best predictions for the `testing` data.**
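When comparing several combinations of variables, it can help to compute the testing MSE for each candidate model in one pass. The sketch below assumes a simulated stand-in for `movie`; the `budget` column and the names `candidates` and `test_mse` are hypothetical and used only for illustration, so substitute variables that actually appear in your data.

```r
# Hypothetical stand-in for the `movie` data (simulated values, not real movies).
# `budget` is an assumed column name used only for this illustration.
set.seed(123)
n <- 200
movie <- data.frame(
  runtime     = round(runif(n, 80, 180)),
  reviews_num = round(runif(n, 10, 500)),
  budget      = runif(n, 1e6, 2e8)
)
movie$gross <- 1e5 * movie$runtime + 2e5 * movie$reviews_num +
  0.5 * movie$budget + rnorm(n, sd = 1e7)

# 75% / 25% split into training and testing sets.
train_rows <- sample(1:nrow(movie), size = round(0.75 * nrow(movie)))
training <- movie[train_rows, ]
testing  <- movie[-train_rows, ]

# Each entry is one combination of predictors to try.
candidates <- list(
  runtime_only = gross ~ runtime,
  two_vars     = gross ~ runtime + reviews_num,
  three_vars   = gross ~ runtime + reviews_num + budget
)

# Fit each model on the training data and score it on the testing data.
test_mse <- sapply(candidates, function(f) {
  fit <- lm(f, data = training)
  mean((testing$gross - predict(fit, newdata = testing))^2)
})

# Smallest testing MSE first.
sort(test_mse)
```

Adding a formula to `candidates` is enough to include another combination in the comparison.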