--- title: DB Interfaces and Data Work author: Jed Rembold date: June 3, 2025 slideNumber: true theme: tokyo-night-light highlightjs-theme: tokyo-night-light width: 1920 height: 1080 transition: slide --- ## Announcements - Project proposal feedback will be out by the end of the week - You have Paper Critique #2 due next Monday night - Live on Canvas after class - Due at the end of the month: Data Summary # Connecting DB to Scripts ## An Interface - SQL is brilliant for working within a relational database - It is not generally as flexible to handle other common scripting tasks - You may at times need an _interface_ between a common scripting language (say R or Python) and SQL - These interfaces generally allow two major things to happen: - You can trigger certain SQL commands to happen in the database from within your other script - You can query results _out_ of a database using SQL and bring them directly into your scripting language, ready for use ## The Triad - In my experience, there are 2 fundamental "pieces" to connecting a scripting language to a database - A library to facilitate the interactions **with** the database (the main scripting tool) - A library to facilitate the connection **to** the database type (the database _driver_) - Then you have the individual commands to interact with the database ## Fundamental Database Libraries - In R: - `DBI` handles much of the general interaction and background processing - `dbplyr` uses `DBI` to seamlessly mesh database tables with R dataframes - In Python: - `SQLAlchemy` is my favorite (and generally seems like the most popular) for handling database interaction - `Pandas` provides the dataframes and meshes smoothly with SQLAlchemy to query your database tables ## Driving the Connection - Regardless of the scripting language, an interface is generally built around an initial _connection_ object that gets defined in the chosen language - These connection objects are often tailored to different types of databases, and thus a particular library might be needed to provide a _driver_ to your database type of choice - In R: `RPostgres` or `RPostgreSQL` - In Python: `psycopg2` ## Visually ![](../images/db_interfaces.excalidraw.png) ## Making the Connection - The connection object is initially formed by requiring information about the database you desire to connect to - IP address / Port - Database name - User you'll be connecting as - User password - All other commands will use the connection object as a method to communicate with the database - Frequently will form one connecting at the start and then maintain it, but you could connect and disconnect as needed ## Making a connection: R :::{style='font-size:.9em'} - The `dBConnect` command from `DBI` is used to create a connection object - Needs at least one input, corresponding to a DBI object representing the type of database to connect to, provided by your driver - For Postgres, that can be `RPostgres::Postgres()` or `RPostgreSQL::PostgreSQL()` - Further parameters can customize aspects of what database is connected to - `dbname` for the database name - `host` for the IP address (or localhost if on same system) - `port` for a specific port - `user` for the connecting username - `password` for the corresponding password - Save the connection object as something! You will need it! ::: ## Making a connection: Python - You use the `create_engine` command from SQLAlchemy to create the connection object - While you won't need to specify the driver, it does need to be installed or SQLAlchemy will fail to make the connection - You still need to specify all the same pieces of information to make the connection, but you can do so with either a connection string or have SQLAlchemy assemble the pieces for you - Connection string: `'postgresql://username:password@host:port/dbname'` - Assembly from `URL.create('postgresql', username=, password=, etc)` - Still save the connection object as something! You will need it! ## Example Connections - Connecting to a generic Postgres database living on the same computer: ```r con <- dbConnect(RPostgres::Postgres(), dbname = 'analysis', host = 'localhost', port = 5432, user = 'postgres', password = 'postgres') ``` ```python con = create_engine(URL.create('postgres', database = 'analysis', host = 'localhost', port = 5432, username = 'postgres', password = 'postgres')) ``` ## Reading information :::{style='font-size:.9em'} - Once you have a connection, you can query the database and output native script dataframes! - Using dbplyr in R: ```r output <- tbl(con, sql("SELECT * FROM pokemon")) ``` - This stores in the `output` a database table object, which is essentially a map to the information in the database - You can filter, aggregate, etc on this object as if it were a local dataframe, but _it is actually still on the server_! - dplyr converts all the R commands into corresponding SQL that is run serverside - Using pandas in Python: ```python output = pd.read_sql(text("SELECT * FROM pokemon"), con=con) ``` - This reads the resulting table directly into a dataframe stored in local memory ::: ## Writing information :::{style='font-size:.9em'} - For writing information to the database, or any other SQL commands that don't return a table, things are more straightforward (kinda) - Takes two required arguments: the connection and the SQL - For R, I would recommend using `DBI::dbExecute`, as I've found it to be the most straightforward and convenient ```r DBI::dbExecute(con, "INSERT INTO test VALUES ('Jed', 1985)") ``` - In Python, you need to open a transaction and then execute ```python with con.begin() as trans: trans.execute(text("INSERT INTO test VALUES ('Jed', 1985)")) ``` ::: ## Parameterizing Information - Often, you have gotten information from another source that you want to embed inside an SQL query - You could do this directly through string processing, using something like `paste` or `glue` or Python's f-strings - Doing this blindly though can open you up to SQL-Injection attacks! - A better way is through _query parameterization_ - Indicate placeholders in the query where you want information to be subbed in, and then supply that information when you execute - The interface library then takes care of sanitizing the input and making sure nothing nefarious is going on before inserting the desired data ## RPostgres Parameterization :::{style='font-size:.9em'} - Uses ordered placeholders using a dollar sign followed by a number: `$1`, `$2`, etc - This exact syntax can vary some from library to library - Can provide a parameterization list after the query: ```R dbExecute(con, "INSERT INTO test VALUES ($1, $2)", list('Jed', 1985)) ``` - Could also provide a 1-row tibble of the necessary length - In Python, labeled placeholders are referenced in a dictionary: ```python trans.execute( text("INSERT INTO test VALUES (:name, :year)"), {'name': 'Jed', 'year': 1985} ) ``` ::: # The Art of Scraping ## Webscraping :::{style='font-size:.9em'} - Webscraping is the act of extracting information from a webpage so that it can be collected or otherwise used elsewhere - It can take multiple forms, with varying degrees of complexity (most of which depends on the website) - Extracting information from a table of data - Extracting other information on a webpage that in not necessarily formatted - Extracting information using a provided API endpoint - Our goal today is to touch on how you could do each of these in either Python or R - In Python we'll be using the `requests`, `pandas`, `beautifulsoup` and `json`{.python} libraries - In R we'll be using `httr`, `rvest` (with comes as part of tidyverse), and `jsonlite` ::: ## Step 1: Get the HTML - Regardless of what language you are using, the first step is to grab the necessary html - This is exactly what your browser is doing when it accesses a webpage - In Python, this is done using the `requests` library `get` function: ```python html = requests.get(url).text ``` - In R, this can be done using the `GET` function from `httr`: ```R html <- content(GET(url)) ``` ## Step 2: Understanding Tags - HTML is comprised by information nested within what are called _tags_ - Tags can also be nested inside other tags - Extracting information from a webpage is frequently about knowing which tags have the information you want - Take advantage of the "Inspect" tool on most browsers, accessed by right clicking on a web page - Will give you the option to explore both the HTML and mouse around the webpage and highlight the corresponding HTML tags - At the very least, you should normally try to identify the overall tag that surrounds your data of interest ## Option 1: Data from Tables :::{style='font-size: .9em'} - One method in which data is frequently stored on a webpage is in tables - These are surrounded by the `< table >`{.html} tag - So long as the table is fairly simple, both Python and R have very easy ways of grabbing the table information directly into a corresponding dataframe | Language | Example | |---------:|-------------------------------| | Python | `df = pandas.read_html(html)` | | R | `df <- html %>% html_table` | - These will automatically correct for things like cells spanning multiple rows, which is very nice - By default, both options technically return a list of dataframes for every table on the page ::: ## Option 2: Other Data on a Page :::{style='font-size:.9em'} - Sometimes the data you want from a page isn't clearly going to be the text in a table - Maybe it is the url from a link, or an image, or any other text or number not in a table - In these cases you need to rely on the tag structure of the html document to select purely what you are interested - You may also need to access the tag _attributes_ to get information such as link or image urls - When selecting the tags you want, you can provide multiple separated by spaces to provide a hierarchy of what you are looking for - Looking for `'tr td'` says you want all the `td` tags that are inside a `tr` tag - Gathering the data in this way may generate a list of content, but it won't generally create more complicated tables of information, so you would need to craft those yourself ::: ## Option 2: Selecting the desired tags - In Python, when parsing an html document by tag, it helps greatly to parse the raw html using the BeautifulSoup library - Beautiful soup then gives you each methods to select only certain tags from the larger structure ```python soup = bs4.BeautifulSoup(html) links = soup.select('td a') ``` - In R, if you are using the `rvest` library, you just need to pass the html into the proper function: `html_elements` ```R links <- html %>% html_elements('td a') ``` ## Option 2: Information from tags :::{style='font-size:.9em'} - Generally, you then want some information _from_ those tags, either the enclosed text or some attribute - To get the text associated with a tag: | Language | Example | |---------:|----------------------------------------------------| | Python | `link_text = [tag.text for tag in links]`{.python} | | R | `link_text <- only_links %>% html_text`{.r} | - To get an attribute value of a tag (content inside the `<` `>` parts of the tag) | Language | Example | |---------:|------------------------------------------------------| | Python | `link_url = [tag['href'] for tag in links]`{.python} | | R | `link_url <- only_links %>% html_attr('href')`{.r} | ::: ## Dynamic Pages - Many modern webpages utilize Javascript to make calls to a database when the page loads, populating the page _dynamically_ - When you **first** access the page then, none of the desired information is present, and thus standard scraping will not work - What you need is a programmatic way to "mimic" a web browser: allowing javascript and the page to dynamically populate before you access it. ## Enter Selenium - _Selenium_ is software to accomplish exactly this - Libraries available for both R and Python - Usually requires downloading an external _driver_ as well: what browser do you want to spoof? - Commands to simulate user interactions: typing something in a space, or clicking on a button - Still need to understand tags and the structure of the page to grab the information you want - Selenium is more advanced that normal webscraping, but it is also extremely powerful. Getting some experience with it, should you choose, is a great skill to develop. # Break Time! # Team Data Work ## Team work - The rest of the evening is set aside for you to start to planning for your data acquisition and organization. - Do you need to be scraping? How will you do that? - Where will you be storing your data? Keep in mind you want it frequently synched across at least **two** computers! - How do you want to model the data? Just because you may have accessed it in a certain form does often **not** mean it is the best for your purposes - Work out a monthly timeline and add it to your project planning on GH