Where we left off

In the last lab, we looked at how we can use computer simulations to compute estimates of simple probabilities.
- Like the probability of drawing a song genre from a playlist.
We also saw that performing more simulations:
- Took longer to finish.
- Had estimates that varied less.
In this lab, we’ll extend our simulation methods to cover situations that are more complex.
- We’ll learn how to estimate their probabilities.
- We will also look at the role of sampling with or without replacement.

Back to songs

In R, simulate a playlist of songs containing 30 "rap" songs, 23 "country" songs and 47 "rock" songs.
- Assign the combined playlist the name songs.
Simulate choosing a single song 50 times. Then use your simulated draws to estimate the probability of choosing a rap song.
- The actual (theoretical) probability of choosing a rap song in this case is 0.30.
- Write a sentence comparing your estimated probability to the actual probability.

With or Without?

So far, you’ve selected songs with replacement.
- We called it that, because each time you made a selection, you started with the same playlist. That is, you chose a song, wrote down its data, and then placed it back on the list.
It’s also possible to select without replacement by setting the replace option in the sample function to FALSE.
Take a sample of size 100 from our playlist of songs without replacement. Assign this sample the name without.
- Run tally(without) and describe the output. Does something similar happen if you sample with replacement?
  - Notice that the tilde ~ was not needed with the tally function. This is because without was not a variable within a data frame but rather a vector which acts like a lone variable.
- What happens if size = 101 and replace = FALSE?

Sample with? Or without?

Imagine the following two scenarios.
1. You have a coin with two sides: Heads and Tails. You’re not sure if the coin is fair and so you want to estimate the probability of getting a Head.
2. A child reaches into a candy jar with 10 strawberry, 50 chocolate and 25 watermelon candies. The child is able to grab three candies with their hand and you’re interested in the probability that all three candies will be chocolate.
Which of these scenarios would you sample with replacement and which would you sample without replacement? Why?
- Write down the line of code you would run to sample from the candy jar. Assume the simulated jar is named candies.

Simulations at work

In reality, songs from a playlist are chosen without replacement.
- This way, you won’t hear the same song several times in a row.
Let’s write a more realistic simulation and estimate the probability that if we select two songs at random, without replacement, that both are rap songs.
- Use the do function to perform 10 simulated samples of size 2, without replacement and assign the simulations the name draws and then View your file. Use set.seed(1).
What are the variable names? What happened in the first simulation? Did any of your 10 simulations contain two rap songs?

Simulations and probability

To estimate the probability from our simulations, we need to find the proportion of times that the event we’re interested in occurs in the simulations.
In other words, we need to count the number of times the desired events occurred, divided by the number of attempts we made (the number of simulations).
The next slides will show you two ways to do this.

Counting similar outcomes

One way we can estimate the probability of drawing two songs of the same genre is to use the following trick to count the number of rap songs in each of the 10 simulations:

mutate(draws, nrap = rowSums(draws == "rap"))

Let’s break down the code above by running each part of the code one piece at a time. As you run each line of code below describe the output.

draws == "rap"

rowSums(draws == "rap")

mutate(draws, nrap = rowSums(draws == "rap"))

Remember to assign a name to your mutated dataset.

Counting other outcomes

Another method we can use to estimate the probability of complex events is to use the following 2-step procedure:
1. Subset or filter the rows of the simulations that match our desired outcomes.
2. Count the number of rows in the subset and divide by the number of simulations.
The result that you obtain is an estimate of the probability that a specific combination of events occurred.
We’ll see an example of this method on the next slide.

Step 1: Creating a subset

Fill in the blanks below to:
1. Create a subset of our simulations when both draws were "rap" songs.
2. Count the number of rows in this subset.
3. And divide by the total number of repeated simulations.

draws_sub <- filter(draws, ___ == "rap",  ___ == "rap")

nrow(___) / ___

Estimating probabilities

Answer the following questions by performing 500 simulations of sampling 2 songs from a playlist of 30 "rap", 23 "country" and 47 "rock" songs. You might consider running set.seed so that your results can be reproduced:
Calculate estimated probabilities for the following situations:
1. You draw two "rap" songs.
2. You draw a "rap" song in the first draw and a "country" song in the 2nd.
Create a histogram that displays the number of times a "rap" song occurred in each simulation. That is, how often were zero rap songs drawn? A single rap song? Two rap songs?

On your own

Using what you’ve learned in the previous two labs, answer the following question by performing two computer simulations with 500 repetitions a piece:

If we draw 5 songs from a playlist of 30 rap, 23 country and 47 rock songs, how does the estimated probability of all 5 songs being rap songs change if we draw the songs with or without replacement?

For each simulation:
- Create a histogram for the number of rap songs that occurred for each of the 500 repetitions.
Describe how the distribution of the number of rap songs changes depending on if we use replacement or not.

Queue it up!