Joins

Jed Rembold

Datafest 2023

What are joins?

  • The act of collecting data from multiple tables based on particular rows and columns is called a join
  • A join pulls information from multiple tables into a new table
  • The columns that are matched across tables are called keys

Big Picture

  • The general idea is then to:
    • Set up your selection as usual from a single table
    • Join to that table another table
    • Specifying what columns in each table will act as keys along with a conditional relating them
      • Most common condition is equality
  • The result is a new table with columns from both the original tables, with rows matched according to your conditions
    • Depending on your implementation, the matched columns may or may not be duplicated as well

(Inner) Join

  • The basic join only keeps rows from table 1 and table 2 that matched on the given column keys
    • This is also called an inner join
  • Any row in table 1 that had no counterpart in table 2 is left out
    • Identically for any row in table 2 that had no counterpart in table 1
  • The key take-away is that it keeps what was in both tables
  • If a value appears twice in one table, it will be duplicated in the joined table as well
    • One reason that many times people try to join on columns that hold unique values, but not always necessary

Inner Joins Visualized

Inner Joins in R

  • If no condition provided, tries to match across all shared columns

    facebook %>% inner_join(linkedin)
  • If needing to match differently named columns across both tables, or only a sub-selection of similarly named columns

    facebook %>% inner_join(linkedin, 
                            by=join_by(name == name2)
                            )

Inner Joins in Pandas (Python)

  • Pandas uses the same method for all joins, but you need to specify the how

  • Not providing a condition here will match on the index

    • Unless you’ve set up your indexes, you’ll probably want to provide an on parameter

    • Even then, this will attempt to map the index of the second dataframe to the specified column in the first

      facebook.join(linkedin, on='name', how='inner')
  • Frequently will want to set the index to the desired column to aid matching

    facebook.set_index('name')

Inner Joins in SQL

  • Can run a lightweight SQL server using SQLite

  • SQL is where joins originated, so they are very well supported:

    SELECT *
    FROM facebook
    JOIN linkedin
      ON facebook.name = linkedin.name

Left and Right Join

  • Sometimes, you don’t want to include only the rows that were in both tables
  • Maybe you want all the rows from one table, but joining the other data when it is available
  • In these cases, you can use a LEFT JOIN or RIGHT JOIN
    • LEFT JOIN is decidedly the more common, and you can make any RIGHT JOIN a LEFT JOIN just by flipping the table ordering
  • Rows still need to have the same number of columns, so NULL or NAN values will be inserted for the secondary table columns if no match is found

Left Joins Visualized

Implementations

  • In R:

    facebook %>% left_join(linkedin, 
                           by=join_by(name == name))
  • In Pandas:

    facebook.join(linkedin, how='left')
  • In SQL:

    SELECT *
    FROM facebook LEFT JOIN linkedin
      ON facebook.name = linkedin.name

FULL OUTER JOIN

  • On occasion, you just want all the data from both tables
    • Matching where possible
    • But keeping data from both left or right tables if no match
  • In these cases, a FULL OUTER JOIN will do what you want
    • Essentially does a LEFT JOIN followed by a RIGHT JOIN with the existing table
  • Anything without a match is still represented with NULL or NAN values

Outer Joins Visualized

Implementations

  • In R:

    facebook %>% full_join(linkedin, 
                           by=join_by(name == name))
  • In Pandas:

    facebook.join(linkedin, how='outer')
  • In SQL:

    SELECT *
    FROM facebook FULL OUTER JOIN linkedin
      ON facebook.name = linkedin.name

Cross Join

  • Lastly, sometimes you want to see all the possible combinations between the rows of two tables
    • Sometimes called a cartesian product
  • A CROSS JOIN will return a table of all of these possibilities
  • Could imagine cross joining all the card ranks with the card suits to generate your standard 52 card playing deck
  • These can get very large very fast!
    • Do not run on tables of millions of rows!
  • You do not need a matching key for cross joins! They operate by row.

Cross Joins Visualized

Implementations

  • In R:

    facebook %>% cross_join(linkedin)
  • In Pandas:

    facebook.join(linkedin, how='cross')
  • In SQL:

    SELECT *
    FROM facebook CROSS JOIN linkedin

Go forth and join!

// reveal.js plugins // Added plugins