A Diamond in the Rough

Lab 1F

Directions: Follow along with the slides and answer the questions in red font in your journal.

Messy data? Get used to it

Since lab 1, the data we've been using has been pretty clean.
Why do we call it clean?
- Variables were named so we could understand what they were about.
- There didn't seem to be any typos in the values.
- Numerical variables were considered numbers.
- Categorical variables were composed of categories.
Unfortunately, more often than not, data is messy until YOU clean it.
In this lab, we'll learn a few essentials for cleaning dirty data.

Messy data?

What do we mean by messy data?
Variables might have non-descriptive names
- Var01, V2, a, …
Categorical variables might have misspelled categories
- “blue”, “Blue”, “blu”, …
Numerical variables might have been input incorrectly. For example, if we're talk about people's height in inches:
- 64.7, 6.86, 676, …
Numerical variables might be incorrectly coded as categorical variables (Or vice-versa)
- “64.7”, “68.6”, “67.6”

The American Time Use Survey

To show you what dirty data looks like, we'll check out the American Time Use Survey, or ATU survey.
What is ATU survey?
- It's a survey conducted by the US government (Specifically the Bureau of Labor Statistics).
- They survey thousands of people to find out exactly what activities they do throughout a single day.
- These thousands of people combined together give an idea about how much time the typical person living in the US spends doing various activites.

Load and go:

Type the following commands into your console:

data(atu_dirty)

View(atu_dirty)

Just by viewing the data, what parts of our ATU data do you think need cleaning?

Description of ATU Variables

The description of the actual variables:
- caseid: Anonymous ID of survey taker.
- V1: The age of the respondent.
- V2: The gender of the respondent.
- V3: Whether the person is employed full-time or part-time.
- V4: Whether the person has a physical difficulty.
- V5: How long the person sleeps, in minutes.
- V6: How long the survey taker spent on homework, in minutes.
- V7: How long the respondent spent socializing, in minutes.

New name, same old data

To fix the variable names, we need to assign a new set of names in place of the old ones.
- Below is an example of the rename function:

atu_cleaner <- rename(atu_dirty, age = V1,
                       gender = V2)

Use the example code and the variable information on the previous slide to rename the rest of the variables in atu_dirty.
- Names should be short, contain no spaces and describe what the variable is related to. So use abbreviations to your heart's content.

Next up: Strings

In programming, a string is sort of like a word.
- It's a value made up of characters (i.e. letters)
The following are example of strings. Notice that each string has quotes before and after.

"string"

"A1B2c3"

"Hot Cocoa"

"0015"

Numbers are words? (Sometimes)

In some cases, R will treat values that look like numbers as if they were strings.
Sometimes we do this on purpose.
- For example, we can code Yes/No variables as "1"/"0".
Sometimes we don't mean for this to happen.
- The number of siblings a person has should not be a string.
Look at the structure of your data and the variable descriptions from a few slides back:
- Write down the variables that should be numeric but are improperly coded as strings or characters.

Changing strings into numbers

To fix this problem, we need to tell R to think of our "numeric" variables as numeric variables.
We can do this with the as.numeric function.
- An example using this function is below:

as.numeric("3.14")

[1] 3.14

Notice: We started with a string, "3.14", but as.numeric was able to turn it back into a number.

Mutating in action

Look at the variables you thought should be numeric and select one. Then fill in the blanks below to see how we can correctly code it as a number:

atu_cleaner <- mutate(atu_cleaner, 
                 age = as.numeric(age),
                 ___ = as.numeric(___))

Once you have this code working, use a similar line of code to correctly code the other numeric variables as numbers.

Deciphering Categorical Variables

We mentioned earlier that we sometimes code categorical variables as numbers.
- For example, our gender variable uses "01" and "02" for "Male" and "Female", respectively.
It's often much easier to analyze and interpret when we use more descriptive categories, such as "Male" and "Female".

Factors and Levels

R has a special name for categorical variables, called factors.
R also has a special name for the different categories of a categorical variable.
- The individual categories are called levels.
To see the levels of gender and their counts type:

tally(~gender, data = atu_cleaner)

Use similar code as we used above to write down the levels for the three factors in our data.

A level by any other name...

If we know that '01' means 'Male' and '02' means 'Female' then we can use the following code to recode the levels of gender.
Type the following command into your console:

atu_cleaner <- mutate(atu_cleaner, gender = 
                 recode(gender, 
                         "01"="Male", 
                         "02" = "Female"))

This code is definitely a bit of a mouthful. Let's break it down.

Allow me to explain

atu_cleaner <- mutate(atu_cleaner, gender = 
                  recode(gender, "01"="Male", 
                    "02" = "Female"))

This code is saying:
- Replace my current version of atu_cleaner…
- with a mutated one where …
- the gender variable's levels …
- have been recoded…“
- where "01" will now be "Male"…
- and "02" will now be "Female".

Finish it off!

Recode the categorical variable about whether the person surveyed had a physical challenge or not. The coding is currently:
- "01": Person surveyed did not have a physical challenge.
- "02": Person surveyed did have a physical challenge.
Write a script that:
- (1) Loads the atu_dirty data set
- (2) Cleans the the data as we have in this lab
- (3) Saves a copy of the cleaned data (see next slide).

The final lines

The last few lines of your script to clean and then save your American Time Use data might look like:

atu_clean <- atu_cleaner

save(atu_clean, file = "atu_clean.Rda")

Be sure to View your data to make sure it looks clean and tidy before saving.