Lab 1G

Directions: Follow along with the slides, completing
the questions in **blue** on your
computer, and answering the questions in **red** in your
journal.


- In Lab 1F, we saw how we could *clean* data to make it easier to use and analyze.
  - You cleaned a small set of variables from the American Time Use (ATU) survey.
- The process of cleaning and then analyzing data is *very* common in Data Science.

- In this lab, we’ll learn how we can create frequency tables to detect relationships between categorical variables.
- For the sake of consistency, rather than using the data that you cleaned, you will use the pre-loaded ATU data.
- Use the `data()` function to load the `atu_clean` data file to use in this lab.

- When we’re dealing with categorical variables, we can’t just calculate an *average* to describe a *typical* value.
  - (Honestly, what’s the average of categories *orange*, *apple* and *banana*, for instance?)
- When trying to describe categorical variables with numbers, we calculate *frequency tables*.
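
To make the idea concrete, here is a minimal base-R sketch of a one-way frequency table. The `fruit` vector is made-up toy data (not from the ATU survey), and base R's `table()` stands in here for the `tally` function used in this lab.

```r
# Hypothetical toy data: these fruit values are made up for illustration.
fruit <- c("orange", "apple", "banana", "apple", "orange", "apple")

# table() counts how many times each category appears.
fruit_counts <- table(fruit)
print(fruit_counts)

# Comparisons between categories come straight from the counts.
apple_minus_banana <- as.integer(fruit_counts["apple"]) -
  as.integer(fruit_counts["banana"])
print(apple_minus_banana)
```

There is no meaningful "average fruit", but the counts still describe which categories are common and which are rare.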

- When it comes to categories, about all you can do is *count* or *tally* how often each category comes up in the data.
- **Fill in the blanks below to answer the following: How many more *females* than *males* are there in our ATU data?**

- Counting the categories of a single variable is nice, but oftentimes we want to make comparisons.
- For example, what if we wanted to answer the question:
  - Does one `gender` seem to have a higher occurrence of physical challenges than the other?
- We could use the following plot to try and answer this question:

- The split `bargraph` helps us get an idea of the answer to the question, but we need to provide precise values.
- Use a line of code, similar to how we facet plots, to obtain a `tally` of the number of people with physical challenges and their genders.
- **Write down the resulting table.**

- Recall that there were 1153 more women than men in our dataset.
- If there are more women, then we might expect women to have more physical challenges (compared to men).

- Instead of using *counts* we use *percentages*.
- Include `format = "percent"` as an option to the code you used to make your 2-way frequency table.
- **Does one `gender` seem to have a higher occurrence of physical challenges than the other? If so, which one? Explain your reasoning.**

- It’s often helpful to display totals in our 2-way frequency tables.
  - To display them, add `margin = TRUE` as an option in the `tally` function.

- There is a difference between `phys_challenge | gender` and `gender | phys_challenge`!

```
##                 gender
## phys_challenge   Male Female
##   No difficulty  4140   5048
##   Has difficulty  530    775
##   Total          4670   5823
```

```
##         phys_challenge
## gender   No difficulty Has difficulty
##   Male            4140            530
##   Female          5048            775
##   Total           9188           1305
```
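
The two layouts above can be rebuilt by hand to see where the totals come from. This is only a sketch in base R: the counts are typed in from the tables above rather than loaded from `atu_clean`, the object names are made up, and `addmargins()` stands in for the `margin` option of `tally` (it labels the totals row `Sum` instead of `Total`).

```r
# Counts typed in from the tables above (not loaded from atu_clean).
# matrix() fills column by column: first the Male column, then Female.
counts <- matrix(
  c(4140, 530,
    5048, 775),
  nrow = 2,
  dimnames = list(
    phys_challenge = c("No difficulty", "Has difficulty"),
    gender = c("Male", "Female")
  )
)
by_gender <- as.table(counts)

# margin = 1 sums over the rows, which appends a "Sum" row of column totals.
by_gender_totals <- addmargins(by_gender, margin = 1)
print(by_gender_totals)

# Swapping rows and columns gives the second layout, with different totals.
by_challenge <- t(by_gender)
by_challenge_totals <- addmargins(by_challenge, margin = 1)
print(by_challenge_totals)
```

The four inner counts are identical in both layouts. Only the variable placed in the columns changes, and with it what the totals row adds up.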

At first glance, the two-way frequency tables might look similar (especially when the `margin` option is excluded). Notice, however, that the totals are different.

The totals are telling us that `R` calculates conditional frequencies by column!

What does this mean?

- In the first two-way frequency table the groups being compared are `Male` and `Female` on the distribution of physical challenges.
- In the second two-way frequency table the groups being compared are people with `No difficulty` and `Has difficulty` on the distribution of gender.
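
One way to see what "by column" means is to work out column percentages by hand. The sketch below uses base R's `prop.table()` on counts typed in from the tables above; the object names are made up, and this is an illustration rather than the `tally` code the lab asks you to write.

```r
# Counts typed in from the tables above (not loaded from atu_clean).
counts <- as.table(matrix(
  c(4140, 530,
    5048, 775),
  nrow = 2,
  dimnames = list(
    phys_challenge = c("No difficulty", "Has difficulty"),
    gender = c("Male", "Female")
  )
))

# margin = 2 divides each cell by its column total,
# so every column (each gender) adds up to 100 percent.
pct_within_gender <- prop.table(counts, margin = 2) * 100
print(round(pct_within_gender, 1))

# After swapping rows and columns, the columns are the physical challenge
# groups, so the percentages now describe gender within each group.
pct_within_challenge <- prop.table(t(counts), margin = 2) * 100
print(round(pct_within_challenge, 1))
```

Because each column is scaled by its own total, groups of different sizes can be compared on equal footing.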

**Add the option `format = "percent"` to the first `tally` function. How were the percents calculated? Interpret what they mean.**

**Describe what happens if you create a 2-way frequency table with a numerical variable and a categorical variable.**

**How are the types of statistical investigative questions that 2-way frequency tables can answer different from those that 1-way frequency tables can answer?**

**Which `gender` has a higher rate of *part time employment*?**