As per usual, I have provided a markdown template file for each problem on this assignment. All problems this week deal with calculations. I’m going to ask that you only use techniques that we have talked about in class on this week’s homework. We will discuss GROUP BY later which makes some of this easier, but working with more limited options is good practice when learning those specific skills.
The data CSV for the problems this week is too large to store in a GitHub repository, so I’ve stored it in my public Willamette folder, linked below. It is a large CSV (about 430 MB) so make sure you have a decent internet connection when you go to download it. I am also linking you to the official data dictionary for this table.
Problem 1: The Taxi Table
The data for this problem is based on all the yellow taxi rides that took place within New York City over the month of June 2024 (the most recently available data). Once you’ve downloaded the data, you’ll need to begin by creating a new table named taxi_rides and populating it with the necessary columns and corresponding data types. The provided (and official!) data dictionary should help here with getting column data types correct, though you will probably still want to look at the CSV itself. In particular, note that the columns in the CSV are not entirely in the same order as provided in the data dictionary.
In addition to the columns provided within the CSV, you should add an extra column, named ride_id which should be designated with a SERIAL type or with GENERATE ALWAYS AS IDENTITY. This ensures that we have an easy way to talk about a unique taxi ride, which is otherwise a bit cumbersome with this data.
Copy the CSV data into your database table. Keep in mind that because you have an unequal number of columns in the CSV vs your table, you’ll need to specify the columns that the CSV data should be copied into. I know it is annoying, but it is necessary here.
Importing in the CSV data may take a little time, as it is a large CSV file, so be patient!
Problem 2: In a New York Minute
Now that you’ve gotten everything populated properly, you can proceed to answer the following questions. Some of these may require multiple queries, so just include all queries you use in your solution and explain how you used them in your description.
Remember to only use SQL that we have discussed in class within these queries.
This data supposedly contains all yellow taxi rides that took place in the month of June, 2025. How many trips are included in the data which both started and ended outside the supposed June 2025 time span?
What percentage of the total trips had a disputed charge? (Consult the data guide for how to identify a disputed charge!)
What was the median amount charged per passenger across all rides? Ignore any rides that report 0 passengers.
What is the most common pickup location? Drop-off location? What about the most common route (pair of pickup + drop-off locations)? If you want to know what these correspond to, there is a separate officially provided location lookup table here.
TipYou can get the mode of a pair of values together if you include them both in a set of parentheses in the ordering statement, e.g.
(ORDER BY (col1, col2))How many taxi rides seemingly took their passengers backward in time?! What was the average duration that they traveled back in time?
Problem 3: Speedin’ Taxis
This question is still using the taxi rides data set, but is a bit more involved, so I wanted to give it more room for an explanation. Suppose you wanted information on the statistical speeds of taxi drivers across the city. You have access to both distances and times, so calculating a speed should be straightforward. However, there are a few things to consider:
As seen above, not all time durations seem like real physical trips. So you should only consider rides that were at least 30 seconds long.
Also, some trip distances are somehow reported as 0. Those should be ignored.
Finally, by default, when you subtract two timestamps you will get an interval. Unfortunately, the way that Postgres stores intervals is not conducive to the sorts of arithmetic you need to do here, as you need to be able to divide by a time in a known unit. So to get an interval in hours that you can actually use to calculate a speed, you can do the following:
(EXTRACT( EPOCH FROM (dropoff_time - pickup_time)) / 3600 )where
dropoff_timeandpickup_timeare whatever you named those columns in your table. We’ll talk more about how these functions involving times and intervals work later in the semester (Ch 12), but for now you can use the above to get the trip duration in hours, which will allow you to compute a speed in miles per hour.
Your primary objective here is to identify all the trips with speeds that meet the above criteria but which are outliers of the main speed distribution. Here we are going to define outliers as points that are below the lower whisker or above the upper whisker on a classic boxplot, for which I am including a diagram below. In this case, the whiskers are located 1.5 IQR below and above the 1st (25%) and 3rd (75%) quartiles, respectively.
You need to identify all the valid rides that fall into these ranges (and are thus an outlier), and write them out to a CSV file entitled bad_taxi_mphs.csv. The file should be ordered by increasing speed and should include just the taxi ride ID and the speed of that ride. Include a header at the top and make sure to upload your CSV back to GitHub (it shouldn’t be that large, ~6MB).
Do not feel like you need to do this all in a single query! In fact, please do not! Break it up as you need, explaining what you are doing. You are free to do simple calculations like arithmetic either with your own calculator or using Postgres as a calculator.
There are a LOT of speeds very close to one another. Keep the maximum amount of decimals returned to you when computing your whiskers or else you are liable to miss some rides in your filter!
Have you remembered to upload your bad_taxi_mphs.csv back to GitHub?!