# A new direction

• For the past two labs, we’ve looked at ways that we can summarize data with numbers.
• Specifically, you learned how to describe the center, shape and spread of variables in our data.
• In this lab, we’re going to estimate the probability that a rap song will be chosen from a playlist with both rap and rock songs, if the choice is made at random.
• The playlist we’ll work with has 100 songs: 39 are rap and 61 are rock.

# Estimate what … ?

• To estimate the probability, we’re going to imagine that we select a song at random, write down its genre (rock or rap), put the song back in the playlist, and repeat 499 more times for a total of 500 times.
• The statistical question we want to address is: On average, what proportion of our selections will be rap?
• Why do we put a song back each time we make a selection?
• What would happen in our little experiment if we did not do this?

# Calculating probabilities

• Remember that a probability is the long-run proportion of times an event occurs.
• Many probabilities can be calculated exactly with just a little math.
• The probability we draw a single rap song from our playlist of 39 rap and 61 rock songs is 39/100, or 39%.
• Probabilities can also be found exactly if we are willing to randomly select a song from the playlist, write down its genre, place the song back in the list, and repeat this forever.
• Literally, forever.
• But we don’t have that much time, so we’re only going to do it 500 times, which will give us an estimate of the probability.
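The exact calculation described above can be written out in R. This is a minimal sketch; the variable names `n_rap`, `n_rock` and `p_rap` are chosen here for illustration and are not part of the lab.

```r
# Counts taken from the playlist described in this lab
n_rap <- 39
n_rock <- 61

# Exact probability of drawing one rap song:
# number of rap songs divided by the total number of songs
p_rap <- n_rap / (n_rap + n_rock)
p_rap  # 0.39
```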

# Estimating probabilities

• You might ask: why are we estimating the probability if we know the answer is 39%?
• Sometimes, probabilities are too hard to calculate with simple division as we did above. In those cases, we can often program a computer to run an experiment that estimates the probability.
• We refer to these programs as simulations.
• The techniques you learn in this lab could be applied to very simple probability calculations or very hard and complex calculations.
• In both cases, with enough repetitions, your estimated probability should be close to the actual probability.

• Simulations are meant to mimic what happens in real life using randomness and computers.
• Before we can start simulating picking songs from a playlist, we need to simulate that playlist in R.
• To simulate our 39 rap songs, we’ll use the repeat (rep) function.
rap <- rep("rap", times = 39)
• Use a similar line of code to simulate the rock songs in our playlist of 100.
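To see what `rep()` does on its own, here is a small sketch that uses a made-up genre, `"jazz"`, which is not part of the lab's playlist. The name `jazz` is hypothetical and only serves the illustration.

```r
# "jazz" is a made-up genre used only to illustrate rep()
jazz <- rep("jazz", times = 3)

# Printing the vector shows the same value repeated 3 times
jazz

# length() confirms how many songs were created
length(jazz)  # 3
```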

# Put the songs in the playlist

• Now that we’ve got some different songs, we need to combine them together.
• To do this, we can use the combine function in R, c().
• Fill in the blanks to combine your different songs:
songs <- __(rap, ____)
• And with that, our playlist of songs should be ready to go.
• Type songs into the console and hit enter to see your individual songs.
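As an illustration of how `c()` joins vectors end to end, the sketch below combines two made-up genres. The names `jazz`, `pop` and `mix` are hypothetical and are not the lab's playlist; `table()` is a base R function that counts how often each value appears.

```r
# Hypothetical vectors, not the lab's playlist
jazz <- rep("jazz", times = 2)
pop <- rep("pop", times = 3)

# c() places the second vector after the first
mix <- c(jazz, pop)
mix

# The combined vector has 2 + 3 = 5 entries
length(mix)  # 5

# Count how many songs of each genre ended up in the combined vector
table(mix)
```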

# Pick a song, any song

• Data scientists call the act of randomly choosing things from a set sampling.
• We can randomly choose a song from our playlist by using:
sample(songs, size = 1, replace = TRUE)
• Run this code 10 times and compute the proportion of "rap" songs you drew from the 10.
• Once everyone in your class has computed their proportions, calculate the range of proportions (the largest proportion minus the smallest proportion) for your class and write it down.
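One way to compute the proportion without counting by hand is sketched below. It assumes that drawing 10 songs at once with replacement is an acceptable stand-in for running the single-draw line 10 times. The names `ten_draws` and `prop_rap` are made up for this example, and the playlist is rebuilt in a compact form so the sketch runs on its own.

```r
# Rebuild the 100-song playlist compactly: 39 "rap" followed by 61 "rock"
songs <- rep(c("rap", "rock"), times = c(39, 61))

# Ten draws with replacement
ten_draws <- sample(songs, size = 10, replace = TRUE)
ten_draws

# ten_draws == "rap" is TRUE or FALSE for each draw;
# mean() of those values is the proportion of draws that were "rap"
prop_rap <- mean(ten_draws == "rap")
prop_rap
```

Because the draws are random, the proportion will usually differ from one run to the next.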

# Now do() it some more

• Instead of running the same line of code multiple times ourselves, we can use R to do() multiple repetitions for us.
• Fill in the blanks below to run the sample code from the previous slide 50 times:
do(___) * sample(___, ___ = ___, ___ = ___)
• Assign the 50 selected songs the name draws. Then fill in the blank below to tally how often each genre was selected:
tally(~___, data = draws)
• Compute the proportion of "rap" songs for your 50 draws and find out if the range for your class’s proportions is bigger or smaller than when we drew 10 songs.
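For comparison, the same kind of repetition can be sketched in base R with `replicate()` and `table()`. This is an alternative to the lab's do() and tally() functions, not a replacement for the exercise above, and the name `draws_base` is made up to keep it separate from `draws`.

```r
# Rebuild the 100-song playlist compactly: 39 "rap" followed by 61 "rock"
songs <- rep(c("rap", "rock"), times = c(39, 61))

# replicate() evaluates the single-draw expression 50 times
# and collects the results in a character vector
draws_base <- replicate(50, sample(songs, size = 1, replace = TRUE))

# Count how often each genre was drawn
table(draws_base)

# Convert the counts to proportions of the 50 draws
prop.table(table(draws_base))
```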

# Proportions vs. Probability

• To review, so far in this lab we’ve:
• Simulated a “playlist” of songs.
• Repeatedly simulated drawing a song from the playlist, noting its genre and placing it back in the playlist.
• Computed the proportion of the draws that were "rap".
• These proportions are all estimates of the theoretical probability of choosing a rap song from a playlist.
• As we increase the number of draws, the range of proportions should shrink.

When using simulations to estimate probabilities, a larger number of repetitions is better because the estimates vary less, so we can be more confident that an estimate is close to the actual value.
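The shrinking range can be explored with a short simulation. The sketch below assumes a hypothetical class of 30 students who each compute one proportion; the helper functions `prop_rap()` and `class_range()`, the class size, and the seed value are all made up for illustration.

```r
# Rebuild the 100-song playlist compactly: 39 "rap" followed by 61 "rock"
songs <- rep(c("rap", "rock"), times = c(39, 61))

# Proportion of "rap" in n draws with replacement (one student's result)
prop_rap <- function(n) {
  mean(sample(songs, size = n, replace = TRUE) == "rap")
}

# Range (largest minus smallest proportion) across a hypothetical class
class_range <- function(n, students = 30) {
  props <- replicate(students, prop_rap(n))
  max(props) - min(props)
}

set.seed(2024)  # arbitrary seed so the sketch gives the same result each run

# Compare the class range for 10, 50 and 500 draws per student
ranges <- sapply(c(10, 50, 500), class_range)
names(ranges) <- c("10 draws", "50 draws", "500 draws")
ranges
```

The range for 500 draws will typically be much smaller than the range for 10 draws, which is the pattern described above.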

# Non-random Randomness

• We’ve seen that random simulations can produce many different outcomes.
• Some estimated probabilities in your class were smaller or larger than others.
• There are instances where you might like the same random events to occur for everyone.
• We can do this by using set.seed().
• For example, the output of this code will always be the same:
set.seed(123)
sample(songs, size = 1, replace = TRUE)
## [1] "rap"
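A quick way to check that a seed makes the draws repeatable is to run the same sampling code twice with the same seed and compare the results. In this sketch the names `first_run` and `second_run` are made up, and five draws are used instead of one so the comparison is more convincing.

```r
# Rebuild the 100-song playlist compactly: 39 "rap" followed by 61 "rock"
songs <- rep(c("rap", "rock"), times = c(39, 61))

# First run with seed 123
set.seed(123)
first_run <- sample(songs, size = 5, replace = TRUE)

# Second run, resetting the same seed just before sampling
set.seed(123)
second_run <- sample(songs, size = 5, replace = TRUE)

# identical() is TRUE because both runs started from the same seed
identical(first_run, second_run)
```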

# Playing with seeds

• With a partner, choose a number to include in set.seed(), then redo the simulation of 50 songs.
• Both partners should run set.seed(___) just before simulating the 50 draws.
• The blank in set.seed(___) should be the same number for both partners.
• Verify that both partners compute the same proportion of "rap" songs.
• Redo the simulation of 50 draws one last time, but have each partner choose a different number for set.seed(___).
• Are the proportions still the same? If so, can you find two different values for set.seed that give different answers?

• Include set.seed(123) in your code before you do 500 repeated samples.