Zooming Through Data

Lab 1D

Directions: Follow along with the slides and answer the questions in red font in your journal.

Data with Clarity

  • Previously, we've looked at graphs of entire variables (By looking at all of their values).
    • Doing this is helpful to get a big picture idea of our data.
  • In this lab, we'll learn how to zoom in on our data by learning how to subset.
    • We'll also learn a few ways to manipulate the plots we've been making to make them easier to use for analyses.
  • Import the data from your class' Food Habits campaign and name it food.

Splitting data sets

  • In lab 1B, we learned that we can facet (or split) our data based on a categorical variable.
  • Use the dotPlot() function to create a dotPlot of the amount of sugar in our food data.
    • The code to create a dotPlot is exactly like you'd use to make a histogram.
    • Make sure to use a capital P in dotPlot.
  • Split the dotPlot in two by faceting on our observations' salty/sweet variable.
    • Describe how R decides which observations go into the left or right plot.
    • What does each dot in the plot represent?

Altering the layout

  • It would be much easier to compare the sugar levels of salty and sweet snacks if the dotPlots were stacked on top of one another.
  • We can change the layout of our separated plots by including the layout option in our dotPlot function.
    • Add the following option to the code you used create the dotPlot split by salty_sweet
layout = c(1,2)
  • Hint: Use your history pane to see how we handled options with the bargraph function. Use a similar syntax to add the layout option to the dotPlot function.

Subsetting

  • Subsetting is a term we use to describe the process of looking at only the data that conforms to some set of rules:
    • Geologists may subset earthquake data by looking at only large earthquakes.
    • Stock market traders may subset their trading data by looking only at the previous day's trades.
  • There's many ways to subset data using RStudio, we'll focus on learning the most common methods.

The filter function

  • Creating two plots, one for salty and one for sweet is useful for comparing salty and sweet But what if we want to examine only one group by itself?
  • Start by creating a subset of the data:
    • Fill in the blanks below with the data and variable names needed to filter our food data based on people who ate Salty snacks:
food_salty <- filter(____ , ____ == "Salty")
  • View food_salty and write down the number of observations in it. Then use the subset data to make a dotPlot of the sodium in our Salty snacks.

So what's really going on?

  • Coding in R is really just about supplying directions in a way that R understands.
    • We'll start by focusing on everything to the right of the “<-” symbol
food_salty <- filter(____ , ____ == "Salty")
  • filter() tells R that we're going to look at only the values in our data that follow a rule.
  • The first blank should be the data we're going to filter down into a smaller set (Based on our rule).
  • salty_sweet == "Salty" is the rule to follow.

3 parts of defining rules

  • We can decompose our rule, salty_sweet == "Salty", into 3 parts:
    • (1) salty_sweet, is the particular variable we want to use to select our subset.
    • (2) "Salty", is the value of the variable that we want to select. We only want to see data with the value Salty for the variable salty_sweet.
    • (3) == describes how we want to relate our variable (salty_sweet) to our value ("Salty"). In this case, we want values of salty_sweet that are exactly equal to "Salty".
  • Notice: Values (that are also words) have quotation marks around them. Variables do not.

More on ==

  • We can use the head() function to help us see what's happening when we write salty_sweet == "Salty".
    • head() returns the values of the first 6 observations.
    • The tail() function returns the last 6 observations.
  • Run the following code and answer the question below:
head(~salty_sweet == "Salty", data = food)
  • What do the values TRUE and FALSE tell us about how our rule applies to the first six snacks in our data? Which of the first six observations were Salty?

Saving values

  • To use our subset data we need to save it first.
    • When we save something in R what we are really doing is giving a value, or set of values, a specific name for us to use later.
  • The arrow <- is called the “assignment” operator. It assigns names (on the left) to values (on the right)
    • We now focus on everything to the left of, and including, the “<-” symbol
food_salty <- filter(____ , ____ == "Salty")

Saving our subset

food_salty <- filter(____ , ____ == "Salty")
  • This code then:
    • takes our subset data, (everything to the right of “<-”) …
    • and assigns the subset data, by using the arrow “<-” …
    • the name food_salty.
  • We can now use food_salty to do anything we could do with the regular food data …
    • but only including those snacks who reported being Salty.

Including more filters

  • We often want to filter our data based on multiple rules.
    • For instance, we might want to filter our food data based on the food being salty AND costing less than 2 dollars.
  • We can include multiple filters to our subsets by separating each rule with a comma like so:
my_sub <- filter(food , salty_sweet == "Salty", cost <= 2)
  • View the my_sub data we filtered in the above line of code and verify that it only includes salty snacks that cost less than 2 dollars.

Put it all together

  • Use an appropriate dotPlot to answer each of the following questions:
    • About how much fat does the typical sweet snack have?
    • How does the typical amount of fat compare when healthy_level < 3 and when healthy_level > 3?
  • It can sometime be helpful to change the number of intervals, or bins, used in dotPlots and histograms.
    • To change the number of intervals in your plots, include the nint option. For instance, to have a plot with 3 bins, use nint = 3. To have a plot with 30 bins, use nint = 30.