Lab 4H

Directions: Follow along with the slides and answer the questions in **red** font in your journal.

- We've seen previously that data scientists have methods to predict values of specific variables.
  - We used *regression* to predict numerical values and *classification* to predict categories.
- *Clustering* is similar to classification in that we want to group people into categories. But there's one important difference:
  - In *clustering*, we don't know how many groups to use because we're not predicting the value of a known variable!
- In this lab, we'll learn how to use the k-means clustering algorithm to group our data into clusters.

- The k-means algorithm works by splitting our data into *k* different clusters.
  - The number of clusters, the value of *k*, is chosen by the data scientist.
  - The algorithm works *only* for numerical variables and *only* when we have no missing data.
- To start, use the `data` function to load the `futbol` data set.
  - This data contains 23 players from the US Men's National Soccer team (USMNT) and 22 quarterbacks from the National Football League (NFL).
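
As a side note on the algorithm itself, the two requirements above (numerical variables only, no missing data) and the role of *k* can be sketched with base R's `kmeans` function rather than the lab's `kclusters`. The data frame `toy` and its columns `length_cm` and `mass_g` are made up for illustration and have nothing to do with `futbol`; this is a minimal sketch, not the lab's own method.

```r
# Made-up measurements for six hypothetical items (not the lab's futbol data).
toy <- data.frame(
  length_cm = c(10, 11, 12, 30, 31, NA),
  mass_g    = c(100, 105, 110, 300, 310, 305)
)

# k-means cannot handle missing values, so drop incomplete rows first.
toy_complete <- na.omit(toy)

# The data scientist chooses the number of clusters; here k = 2.
set.seed(1)  # starting centers are picked at random, so fix the seed
fit <- kmeans(toy_complete, centers = 2)

fit$cluster  # one cluster label (1 or 2) for each remaining row
fit$centers  # the mean of each variable within each cluster
```

The labels 1 and 2 are arbitrary names for the groups; the algorithm does not know what the groups mean.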

- Create a scatterplot of the players' `ht_inches` and `wt_lbs` and color each dot based on the `league` they play for.

- After plotting the players' heights and weights, we can see that there are two clusters, or different types, of players:
  - Players in the NFL tend to be taller and weigh more than the shorter and lighter USMNT players.

- Fill in the blanks below to use k-means to cluster the same height and weight data into two groups:

```
kclusters(____~____, data = futbol, k = ____)
```

- Use this code and the `mutate` function to add the values from `kclusters` to the `futbol` data. Call the variable `clusters`.

- In comparing our football and soccer players, we *know* for certain which league each player plays in.
  - We call this knowledge *ground-truth*.
- Knowing the *ground-truth* for this example is helpful to illustrate how k-means works, but in reality, data scientists would run k-means not knowing the *ground-truth*.
- **Compare the clusters chosen by k-means to the ground-truth. How successful was k-means at recovering the `league` information?**
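
One possible way to make a comparison like this is a two-way table of known groups against cluster labels. The sketch below uses base R's `table` on made-up vectors; `truth` and `clusters` here are hypothetical values chosen for illustration, not output from the `futbol` data.

```r
# Hypothetical known groups and cluster labels (illustrative, not lab output).
truth    <- c("A", "A", "A", "B", "B", "B")
clusters <- c(2, 2, 1, 1, 1, 1)

# Each cell counts the points with that combination of group and cluster.
comparison <- table(truth, clusters)
comparison
```

Because cluster numbers are arbitrary labels, cluster 1 need not line up with the first group. What matters is whether each row of the table is concentrated in a single column.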

- Load your class' `timeuse` data (remember to run `timeuse_format` so each row represents the mean time each student spent participating in the various activities).
- Create a scatterplot of the `homework` and `videogames` variables.
  - Based on this graph, identify and remove any outliers by using the `subset` function.
- Use `kclusters` with `k=2` for `homework` and `videogames`.
- **Describe how the groups differ from each other in terms of how long each group spends playing `videogames` and doing `homework`.**
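
For the outlier-removal step above, here is a minimal sketch of how `subset` keeps only the rows that meet a condition. The data frame `toy`, its columns `reading` and `exercise`, and the cutoff of 300 are all made up for illustration; a suitable cutoff for your class data has to come from your own scatterplot.

```r
# Made-up daily minutes for five hypothetical students (not real timeuse data).
toy <- data.frame(
  reading  = c(30, 45, 20, 600, 40),
  exercise = c(60, 30, 45, 50, 20)
)

# Keep only the rows where reading is below a cutoff chosen by eye from a plot.
toy_trimmed <- subset(toy, reading < 300)

nrow(toy)          # 5 rows before
nrow(toy_trimmed)  # 4 rows after
```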