Jed Rembold
June 3, 2025
DBI handles much of the general
interaction and background processingdbplyr uses
DBI to seamlessly mesh database tables with
R dataframesSQLAlchemy is my favorite (and generally
seems like the most popular) for handling database interactionPandas provides the dataframes and
meshes smoothly with SQLAlchemy to query your database tablesRPostgres or
RPostgreSQLpsycopg2
dBConnect command from
DBI is used to create a connection
objectRPostgres::Postgres() or
RPostgreSQL::PostgreSQL()dbname for the database namehost for the IP address (or localhost if
on same system)port for a specific portuser for the connecting usernamepassword for the corresponding
passwordcreate_engine command from
SQLAlchemy to create the connection object'postgresql://username:password@host:port/dbname'URL.create('postgresql', username=, password=, etc) Connecting to a generic Postgres database living on the same computer:
con <- dbConnect(RPostgres::Postgres(),
dbname = 'analysis',
host = 'localhost',
port = 5432,
user = 'postgres',
password = 'postgres')
con = create_engine(URL.create('postgres',
database = 'analysis',
host = 'localhost',
port = 5432,
username = 'postgres',
password = 'postgres'))Using dbplyr in R:
output <- tbl(con, sql("SELECT * FROM pokemon"))
output a database
table object, which is essentially a map to the information in the
databaseUsing pandas in Python:
output = pd.read_sql(text("SELECT * FROM pokemon"), con=con)
For writing information to the database, or any other SQL commands that don’t return a table, things are more straightforward (kinda)
Takes two required arguments: the connection and the SQL
For R, I would recommend using
DBI::dbExecute, as I’ve found it to be the
most straightforward and convenient
DBI::dbExecute(con, "INSERT INTO test VALUES ('Jed', 1985)")In Python, you need to open a transaction and then execute
with con.begin() as trans:
trans.execute(text("INSERT INTO test VALUES ('Jed', 1985)"))paste or
glue or Python’s f-strings
Uses ordered placeholders using a dollar sign followed by a
number: $1, $2,
etc
Can provide a parameterization list after the query:
dbExecute(con, "INSERT INTO test VALUES ($1, $2)", list('Jed', 1985))Could also provide a 1-row tibble of the necessary length
In Python, labeled placeholders are referenced in a dictionary:
trans.execute(
text("INSERT INTO test VALUES (:name, :year)"),
{'name': 'Jed', 'year': 1985}
)requests,
pandas,
beautifulsoup and
json librarieshttr,
rvest (with comes as part of tidyverse), and
jsonliteRegardless of what language you are using, the first step is to grab the necessary html
This is exactly what your browser is doing when it accesses a webpage
In Python, this is done using the
requests library
get function:
html = requests.get(url).textIn R, this can be done using the GET
function from httr:
html <- content(GET(url))One method in which data is frequently stored on a webpage is in tables
< table >
tagSo long as the table is fairly simple, both Python and R have very easy ways of grabbing the table information directly into a corresponding dataframe
| Language | Example |
|---|---|
| Python | df = pandas.read_html(html) |
| R | df <- html %>% html_table |
These will automatically correct for things like cells spanning multiple rows, which is very nice
By default, both options technically return a list of dataframes for every table on the page
'tr td' says you want all
the td tags that are inside a
tr tagIn Python, when parsing an html document by tag, it helps greatly to parse the raw html using the BeautifulSoup library
soup = bs4.BeautifulSoup(html)
links = soup.select('td a')In R, if you are using the rvest
library, you just need to pass the html into the proper function:
html_elements
links <- html %>% html_elements('td a')Generally, you then want some information from those tags, either the enclosed text or some attribute
To get the text associated with a tag:
| Language | Example |
|---|---|
| Python | link_text = [tag.text for tag in links] |
| R | link_text <- only_links %>% html_text |
To get an attribute value of a tag (content inside the
< > parts of the
tag)
| Language | Example |
|---|---|
| Python | link_url = [tag['href'] for tag in links] |
| R | link_url <- only_links %>% html_attr('href') |