This model is big enough for all of us!

Lab 4E

Building better models

• So far in the labs, we've learned how to make predictions using the line of best fit, which we also call a linear model or regression model.
• We've also learned how to measure our model's prediction accuracy by cross-validation.
• In this lab, we'll investigate the following question:

Will including more variables in our model improve its predictions?

Divide & Conquer

• Start by loading the movie data and splitting it into two sets (see Lab 4C for help). Remember to use set.seed so that your split is reproducible.
• A set named training that includes 75% of the data.
• A set named testing that includes the remaining 25%.
• Create a linear model, using the training data, that predicts gross using runtime.
• Compute the mean squared error (MSE) of the model by making predictions for the testing data.
• Do you think that a movie's runtime is the only factor that goes into how much a movie will make? What else might affect a movie's gross?
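The split, fit, and MSE steps above can be sketched as follows. This is a minimal sketch in base R, not necessarily the approach from Lab 4C: the data frame movie_sim, its values, and the helper names train_rows, fit_runtime, and mse_runtime are all made up for illustration, so swap in the real movie data when you work through the lab.

```r
# Simulated stand-in for the movie data (hypothetical values, not the real data set).
set.seed(123)
n <- 200
movie_sim <- data.frame(runtime = round(rnorm(n, mean = 110, sd = 15)))
movie_sim$gross <- 500000 * movie_sim$runtime + rnorm(n, mean = 0, sd = 1e7)

# Randomly pick 75% of the rows for training; the remaining 25% become testing.
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
training <- movie_sim[train_rows, ]
testing <- movie_sim[-train_rows, ]

# Fit the linear model using the training data only.
fit_runtime <- lm(gross ~ runtime, data = training)

# Predict gross for the testing data, then average the squared prediction errors.
preds <- predict(fit_runtime, newdata = testing)
mse_runtime <- mean((testing$gross - preds)^2)
mse_runtime
```

Because the model never sees the testing rows while it is being fit, the MSE computed this way estimates how well the model predicts new, unseen movies.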

• Data scientists often find that including more relevant information in their models leads to better predictions.
• Fill in the blanks below to predict gross using runtime and reviews_num.
lm(____ ~ ____ + ____, data = training)

• Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.
• Write down the code you would use to include a 3rd variable, of your choosing, in your lm().
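One way to compare two models is to fit both on the same training data and compute each one's MSE on the same testing data. The sketch below uses simulated data with the placeholder variable names y, x1, and x2 (all hypothetical), so the blanks above are still yours to fill in for the movie data.

```r
# Simulated data: y depends on both x1 and x2 (made-up values for illustration).
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 * sim$x1 + 3 * sim$x2 + rnorm(n)

# Same 75% / 25% split for both models, so the comparison is fair.
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
training <- sim[train_rows, ]
testing <- sim[-train_rows, ]

# Model 1 uses one predictor; model 2 adds a second predictor with `+`.
fit_one <- lm(y ~ x1, data = training)
fit_two <- lm(y ~ x1 + x2, data = training)

# Compute the testing MSE of each model.
mse_one <- mean((testing$y - predict(fit_one, newdata = testing))^2)
mse_two <- mean((testing$y - predict(fit_two, newdata = testing))^2)

c(one_variable = mse_one, two_variables = mse_two)
```

The model with the smaller testing MSE makes the more accurate predictions. In this simulation the second model should come out ahead because x2 was built into y; with real data, an extra variable only helps if it carries relevant information.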

• Write down which other variables in the movie data you think would help you make better predictions.
• Build a model that includes those variables, then assess whether it makes more accurate predictions for the testing data than the model that included only runtime and reviews_num.
• With your neighbors, determine which combination of variables leads to the best predictions for the testing data.
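When there are several candidate combinations to try, it can help to wrap the fit-and-score steps in a small function and apply it to each formula. The following is only a sketch on simulated data: the variables y, x1, x2, and x3, the helper test_mse, and the list candidates are hypothetical names, and x3 is deliberately unrelated to y.

```r
# Simulated data: y depends on x1 and x2 but not on x3 (made-up values).
set.seed(7)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 * sim$x1 + 3 * sim$x2 + rnorm(n)

# One shared 75% / 25% split for every candidate model.
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
training <- sim[train_rows, ]
testing <- sim[-train_rows, ]

# Hypothetical helper: fit a model on training, return its MSE on testing.
test_mse <- function(f) {
  fit <- lm(f, data = training)
  mean((testing$y - predict(fit, newdata = testing))^2)
}

# Candidate combinations of variables to compare.
candidates <- list(
  "x1" = y ~ x1,
  "x1 + x2" = y ~ x1 + x2,
  "x1 + x2 + x3" = y ~ x1 + x2 + x3
)

# Testing MSE for each candidate, sorted from smallest to largest.
results <- sapply(candidates, test_mse)
sort(results)
```

The combination at the top of the sorted output has the smallest testing MSE. Adding a variable that is unrelated to the outcome, like x3 here, is not guaranteed to lower the testing MSE and can even raise it slightly.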