Webscraping in R and Python
Jed Rembold
Monday, November 17, 2025
Announcements
- Homework 10 due on Wednesday
- 1 problem, 1 extra credit
- Projects!
- You have your project groups!
- I’ll be leaving time in the latter half of every class this week for
you to work with your group
Webscraping
- Webscraping is the act of extracting information from a webpage so
that it can be collected or otherwise used elsewhere
- It can take multiple forms, with varying degrees of complexity (most
of which depends on the website)
- Extracting information from a table of data
- Extracting other information on a webpage that is not necessarily
formatted
- Extracting information using a provided API endpoint
- Our goal today is to touch on how you could do each of these in
either Python or R
- In Python we’ll be using the requests, pandas, beautifulsoup and
json libraries
- In R we’ll be using httr, rvest (which comes as part of tidyverse),
and jsonlite
Step 1: Get the HTML
Regardless of what language you are using, the first step is to
grab the necessary html
This is exactly what your browser is doing when it accesses a
webpage
In Python, this is done using the get function from the requests
library:
html = requests.get(url).text
In R, this can be done using the GET function from httr:
html <- content(GET(url))
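As a slightly more defensive take on the Python one-liner above, here is a minimal sketch. The function name fetch_html and the 10 second timeout are illustrative choices, not anything required by requests. The idea is to fail loudly on a bad response instead of quietly handing an error page to the next step.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text.

    fetch_html is a hypothetical helper name; the timeout value is an
    arbitrary choice for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses rather than returning
    # the text of an error page.
    response.raise_for_status()
    return response.text
```

You would then call it as html = fetch_html(url), exactly where the one-liner was used.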
Option 1: Data from Tables
Data on a webpage is frequently stored in tables
- These are surrounded by the <table> tag
So long as the table is fairly simple, both Python and R have
very easy ways of grabbing the table information directly into a
corresponding dataframe
| Language | Example |
| Python | df = pandas.read_html(html) |
| R | df <- html %>% html_table |
These will automatically correct for things like cells spanning
multiple rows, which is very nice
By default, both options technically return a list of dataframes,
one for each table on the page
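To make the table route concrete, here is a small Python sketch using a made-up schedule table (the HTML and its contents are invented for illustration). It wraps the HTML in StringIO because newer pandas versions warn when handed a raw HTML string, and it assumes an HTML parser such as lxml is installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Invented sample page: one table where the "Week" cell spans two rows.
SAMPLE_HTML = """
<table>
  <tr><th>Week</th><th>Day</th><th>Topic</th></tr>
  <tr><td rowspan="2">1</td><td>Mon</td><td>Intro</td></tr>
  <tr><td>Wed</td><td>Tables</td></tr>
</table>
"""

# read_html returns a list with one dataframe per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))
schedule = tables[0]

# The spanned "Week" cell is repeated on both rows, so the frame is
# rectangular and ready to export.
csv_text = schedule.to_csv(index=False)
```

Picking tables[0] works here because the page has a single table; on a real page you would inspect the list to find the one you want.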
Option 2: Other Data on a Page
- Sometimes the data you want from a page isn’t clearly going to be
the text in a table
- Maybe it is the url from a link, or an image, or any other text or
number not in a table
- In these cases you need to rely on the tag structure of the html
document to select only what you are interested in
- You may also need to access the tag attributes to get
information such as link or image urls
- When selecting the tags you want, you can provide multiple tag names
separated by spaces to describe a hierarchy of what you are looking for
- Looking for 'tr td' says you want all the td tags that are inside a
tr tag
- Gathering the data in this way may generate a list of content, but
it won’t generally create more complicated tables of information, so you
would need to craft those yourself
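The selection ideas above can be sketched in Python with BeautifulSoup (imported as bs4). The sample HTML, its link paths, and the "name"/"url" column names are all invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample page with a small table and one link outside it.
SAMPLE_HTML = """
<table>
  <tr><td>Midterm</td><td><a href="/exams/midterm.html">details</a></td></tr>
  <tr><td>Final</td><td><a href="/exams/final.html">details</a></td></tr>
</table>
<p>Not in a table: <a href="/syllabus.html">syllabus</a></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 'tr td' selects every td tag that sits inside a tr tag.
cell_text = [td.get_text(strip=True) for td in soup.select("tr td")]

# Tag attributes are read like dictionary keys; here, the link urls
# of only those links that live inside a table cell.
table_links = [a["href"] for a in soup.select("tr td a")]

# Every link on the page, in or out of the table.
all_links = [a["href"] for a in soup.select("a[href]")]

# Selection only gives lists, so any table-like structure has to be
# assembled by hand, for example one dictionary per row.
rows = [
    {
        "name": tr.select("td")[0].get_text(strip=True),
        "url": tr.select_one("a")["href"],
    }
    for tr in soup.select("tr")
]
```

A list of dictionaries like rows can be passed straight to pandas.DataFrame if you want a dataframe at the end.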
Option 3: APIs
- Sometimes the amount of information is just so large that it can’t
fit nicely on a webpage, or data providers don’t want to make you
“scrape” a webpage for the information
- Instead, they might make available a public API where you can access
the information
- Most REST APIs look just like a web address, but if you navigate to
that url, instead of getting HTML to render a webpage, you get the data
directly, most often in a JSON format
- Some APIs also let you add extra information to the url to better
specify exactly what information you want back
- Some APIs will require you to register for a key, which is often
free. This is to safeguard against people slamming their servers with
billions of requests. Be respectful in both your API and
webscraping usage!
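One way to stay respectful is to pause between requests. The sketch below is only an illustration: the helper name fetch_all, the one second delay, the User-Agent text, and the "api_key" parameter name are all assumptions, and each real API documents its own way of passing a key.

```python
import time

import requests


def fetch_all(urls, delay_seconds=1.0, api_key=None):
    """Fetch several pages in sequence, pausing between requests.

    Hypothetical helper: the "api_key" query parameter is a placeholder,
    since the real parameter (or header) name depends on the API.
    """
    session = requests.Session()
    # Identify yourself so the site owner can tell who is asking.
    session.headers["User-Agent"] = "class-project-scraper (student use)"
    params = {"api_key": api_key} if api_key else None

    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request after the first.
            time.sleep(delay_seconds)
        response = session.get(url, params=params, timeout=10)
        response.raise_for_status()
        pages.append(response.text)
    return pages
```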
Option 3: Accessing API data
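A minimal Python sketch of the API route, under some loud assumptions: the helper name fetch_json, the sample JSON, and its "results" key are invented, and a real API will have its own structure that you need to inspect first. Extra information for the url is passed through the params argument of requests.get, and the JSON text is decoded with the json library.

```python
import json

import pandas as pd
import requests


def fetch_json(url, params=None, timeout=10.0):
    """Request an API endpoint and decode its JSON body.

    Hypothetical helper; params holds any extra query information
    to append to the url.
    """
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    # json.loads turns the JSON text into Python dicts and lists.
    return json.loads(response.text)


# Invented example of what an API might send back.
SAMPLE_RESPONSE = """
{"count": 2,
 "results": [{"city": "Salem", "temp_f": 48},
             {"city": "Portland", "temp_f": 51}]}
"""

payload = json.loads(SAMPLE_RESPONSE)

# A list of dictionaries converts directly into a dataframe,
# one row per dictionary.
df = pd.DataFrame(payload["results"])
```

With a live endpoint you would replace the sample with payload = fetch_json(url), then pick out whichever key holds the list of records.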
Practice Time!
- In your language of choice, see if you can:
- Extract information about our class schedule here
into a dataframe that you could then export to CSV
- Suppose you only wanted to get the names of the tests from that
table (the red options). Could you extract only those values?
- The API here
will get you information on all the humans currently in space. Create a
table of the names of the astronauts and what vessel they are currently
on.
Project Time!
- The remainder of class I have set aside for you to meet with your
group and get going on your project
- If you haven’t already talked, your initial discussion probably
needs to revolve around what data you would be interested in grabbing to
analyze in some fashion.
- Recall that you are shooting for bringing data from a
separate source for each member in your group
- Different webpages that need to be scraped
- CSVs from different sources/datasets
- Different API endpoints