Homework 06

Data Management with SQL

There are just two problems this week, both of which deal with SQL grouping. For both problems, the repository contains a zip file which can be used to recreate a collection of tables of airplane flights over the month of January, 2020. If you are interested in more, similar type data, I grabbed the original data from the Bureau of Transportation Statistics here.

Accept Assignment


In the repository lives a zip file named hw6.zip. You should download that zip file and extract it, which has inside two SQL files (compressing made them small enough to upload to GitHub). These .sql files are a bit different from what you have seen in the past though, as they technically represent a database dump. These are a common way to generate exact duplicates of the contents of a database. However, interacting with them can have some differences. The easiest, and fastest, method is to use the command line (terminal) tool psql that should have been installed along with the rest of PostgreSQL at the start of the semester (though it may or may not have gotten correctly added to your system PATH). If you open a terminal (or command prompt) at the location of the hw6_fast.sql file (or open a terminal anywhere and then navigate to that location), the command that you want to run is simply:

psql -d analysis -U yourusername -f hw6_fast.sql

where analysis is presumably the name of the database where you’ve been placing everything, and yourusername is whatever user-name you’ve been using with your database. If you happened to set up the database server on a non-default port, you’ll also need to include -p portnum. Assuming all goes successfully, this will create a new hw6 schema in your analysis database and set up the 6 different tables, complete with data, constraints, and primary and foreign keys. You should be able to confirm this by opening whatever client you usually use and querying the tables.

If this is not working, for whatever reason, I’ve tried to give you another option as well. The hw6_slow.sql file is also a database dump, but one which has been tweaked to use standard INSERT INTO statements. The result is that you should be able to run this file like any other collection of SQL commands in your client, but the price is that it will be MUCH slower to populate the initial tables. I’d recommend the first option if possible, but if you are having absolutely no luck getting psql to work in the terminal, I think this will work. If you are still having trouble getting the tables generated, please reach out! I’m happy to help you through this!

The benefit of loading a database dump, of course, is that the database structure is all ready for you, with a central table flights having information about individual flights as well as references to other tables with information about airports, carriers, and city markets (to name a few). An image of the ERD is included below, as well as in the repository, for your reference.

Entity Relationship Diagram of the airlines database table structure.

Entity Relationship Diagram of the airlines database table structure.

Problem 1: Birds of a Feather

This initial batch of questions will focus on being able to use keywords such as GROUP BY effectively. For each you should include any queries you used to determine the answer, as well as the answer itself. You may need several queries or temporary tables to answer some parts. Show all your work.

  1. What is the name of the airport which, over the given month’s data, has the greatest average delay in arriving flights? Limit your answer to only include airports with at least 100 arriving flights over the course of the month.
  2. Which 3 US cities have the greatest number of inbound flights across all airports serving that city? (Some cities have multiple airports, which is what the city_market_id is supposed to assist with.)
  3. Which 5 airports have the greatest number of outgoing flights per day on average? How many flights (on average) are leaving those airports each day?
  4. For each major airline carrier, which two airports represent the endpoints of that carrier’s longest flight (by distance)? There are 17 distinct carriers in the month represented in the dataset, so your output table should also have 17 rows with the name of the carrier and the name of both airports. Dealing with round trip duplicates here can be a little tricky, so think about how you could reliably filter out one of the directions. It doesn’t matter whether you choose PDX to SEA or SEA to PDX (for instance), you just need one of the pairings for each carrier. Export the results to a CSV named longest_flights.csv with a header and upload it back to GitHub.

Problem 2: Some Plane Visualization

Just looking at the numeric outputs from our queries isn’t always the most enlightening. Sometimes a little data visualization goes a long way. Depending on what you have done in the past, you may have a host of data visualization tools in your toolbox, ranging from R through Python to Excel. Any are completely appropriate for the following questions, but regardless of what software you use, ensure that your charts have clearly labeled axes and a title describing what they depict. You may need to save the output of your query in a format that your visualization software can understand (frequently CSV).

  1. Create a simple bar chart showing the total number of departing flights each day of the week.
  2. This second part is more open-ended. Define a question that you are interested in investigating in this current data set. There is lots of potential data here that I haven’t already asked questions about that might be interesting! The only requirement is that you must use GROUP BY in the course of answering your question, and you must include a visualization to help you answer the question. Clearly state both your question, your work to answer the question, and your visualization in your submission.