# Scraping web data

Lab 3E

### The web as a data source

• The internet contains huge amounts of information.
• Using computers to gather this information in an automated fashion is referred to as scraping web data.
• Scraping data from the web can be difficult because each website displays and stores its data differently.
• In this lab, we'll learn how to scrape data in two steps:
• Step 1: Gather information from the web.
• Step 2: Clean it up and turn it into a usable data frame for Lab 3F.

### Our first web scraper

• Copy and paste the link below into a web browser to view the website whose data we'd like to scrape and analyze.

https://labs.idsucla.org/extras/webdata/mountains.html

• Briefly describe what the data on the website is about.
• Then write down 3 questions you'd be interested in answering by analyzing this data.

### HTML

• HTML is the code that's used to render every website you've ever visited.
• The following slide shows the HTML code used to create the first two rows of the web data.
• How is the data table in HTML different from the data tables we're used to seeing in R, for example, when we use the View() function?
• What do you think the tags <TABLE>, <TR>, <TH>, <TD> mean? How does HTML use these tags to display the table?

<TABLE>
<TR>
<TH>peak</TH>
<TH>range</TH>
<TH>state</TH>
<TH>long</TH>
<TH>lat</TH>
<TH>elev_ft</TH>
<TH>elev_m</TH>
<TH>prominence_ft</TH>
<TH>prominence_m</TH>
<TH>rank</TH>
</TR>
<TR>
<TD>Denali (Mount McKinley)</TD>
<TD>-151.0063</TD>
<TD>63.0690</TD>
<TD>20236</TD>
<TD>6168</TD>
<TD>20174</TD>
<TD>6149</TD>
<TD>1</TD>
</TR>
</TABLE>
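To see how these tags turn into rows and columns in R, here is a minimal sketch that parses a small, made-up HTML table from a text string. It assumes the XML package is installed (that package provides readHTMLTable() and htmlParse()); the table contents and the names html_text and fruit are invented for illustration and are not part of the lab's data.

```r
library(XML)

# A tiny made-up table:
# <TR> marks a table row, <TH> a header cell, <TD> a data cell
html_text <- "
<TABLE>
  <TR><TH>fruit</TH><TH>count</TH></TR>
  <TR><TD>apple</TD><TD>3</TD></TR>
  <TR><TD>pear</TD><TD>5</TD></TR>
</TABLE>"

# Parse the text as HTML, then pull out every table it contains
doc <- htmlParse(html_text, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

fruit <- tables[[1]]  # the first (and only) table in the list
names(fruit)          # column names are taken from the <TH> cells
nrow(fruit)           # one row for each <TR> made of <TD> cells
fruit
```

Notice that the values inside the <TD> tags come back as text, even when they look like numbers.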


### Get to scraping!

• Use your browser to go back to the website with the data we're interested in scraping.
• Copy the site's URL and assign it the name data_url in R.
• Then fill in the blanks below to have R scrape every web table available on the site:
tables <- readHTMLTable(____)


### Find our data

• Since readHTMLTable() scrapes every table on a given web page, we need to find out which table has the data we're interested in.
• For example, articles on wikipedia.org often contain 3 or more tables.
• If a page has 3 tables, we need to check all 3 to find the one we're interested in.
• Use the length() function to find out how many tables of data were scraped in our set of tables.

### Saving tables

• Now that we know how many tables we've scraped, we can go back and scrape individual tables by adding the which argument to the readHTMLTable() function.
• Use readHTMLTable() to re-scrape the data from the web, but this time use the which argument to scrape just the one table we want.
• The which argument should be the integer denoting which table you want scraped. For example, which = 1 scrapes the first table on the page.
• Assign the scraped data the name mtns.
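The counting and re-scraping steps can be sketched on a made-up page that holds two tables. This again assumes the XML package is installed; the page contents and the names html_text, all_tables, n_tables and animals are hypothetical stand-ins, not the lab's data or its answers.

```r
library(XML)

# Made-up page holding two small tables (not the lab's data)
html_text <- "
<html><body>
<table>
  <tr><th>fruit</th><th>count</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
<table>
  <tr><th>animal</th><th>legs</th></tr>
  <tr><td>dog</td><td>4</td></tr>
  <tr><td>bird</td><td>2</td></tr>
</table>
</body></html>"

doc <- htmlParse(html_text, asText = TRUE)

# Scrape everything: the result is a list with one element per table
all_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
n_tables <- length(all_tables)
n_tables

# Scrape again, this time asking only for table number 2
animals <- readHTMLTable(doc, which = 2, stringsAsFactors = FALSE)
animals
```

Without which, the result is a list of tables; with a single integer for which, the result is just that one table as a data frame.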

### Check, save and use!

• After scraping the data, the only thing left to do is to save it and use it.

• Fill in the blanks to save the data and give it a file name:

save(____, file = "____.Rda")
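As a rough illustration of how save() pairs with load(), the sketch below saves a made-up data frame and reads it back. The data frame pets and the path rda_path are hypothetical; a temporary folder is used here only so the example leaves no file behind in your working directory.

```r
# Made-up data frame standing in for scraped data
pets <- data.frame(name = c("Rex", "Tom"), legs = c(4, 4))

# Build a file path inside R's temporary folder
rda_path <- file.path(tempdir(), "pets.Rda")

# save() takes the object first, then the file to write it to
save(pets, file = rda_path)

# Remove the object from the workspace, then restore it from the file
rm(pets)
load(rda_path)
pets
```

load() brings the object back under the same name it had when it was saved.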

• What are the mean and standard deviation of elev_ft?
• Which state has the most mountains in our data?
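Questions like these can be approached along the following lines. This is a sketch on made-up data using base R functions only; the data frame scraped and its columns are hypothetical. It assumes the number column arrived as text, which is common for scraped tables, so it is converted with as.numeric() before any calculation.

```r
# Made-up data; the numbers are stored as text, as scraped values often are
scraped <- data.frame(
  city      = c("A", "B", "C", "D"),
  region    = c("north", "south", "north", "north"),
  height_ft = c("100", "250", "175", "300"),
  stringsAsFactors = FALSE
)

# Convert the text column to numbers before summarizing it
scraped$height_ft <- as.numeric(scraped$height_ft)

avg_height <- mean(scraped$height_ft)  # average of the column
sd_height  <- sd(scraped$height_ft)    # standard deviation of the column

# Count the rows in each region, then find the most common region
region_counts <- table(scraped$region)
top_region <- names(which.max(region_counts))

avg_height
sd_height
region_counts
top_region
```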