Homework 02

Data Management with SQL

There are 4 problems on this assignment. The first is about choosing columns and data types to fit some target questions. The remaining three problems are about building practice with the same general steps: creating tables with appropriate data types, importing data into those tables, and then doing some queries to answer some questions or export some data. The data to import for all three problems resides in the data folder in the repository, but you may have to copy this elsewhere on your system to properly import it. Also in the data folder are corresponding markdown files that briefly describe each of the columns in the CSVs, which may help as you name your columns or pick your data types. For the last 2 problems, you should show the commands you used to create and populate the table, as well as whatever query commands you used to answer the individual questions. Upload your created CSV files as well as your filled out template files all back to GitHub.

Accept Assignment


Problem 1: Conscientious Eating

Suppose you wanted to track the food you purchase and eat over the course of several months, and decide that storing such information in a database might prove useful. In particular, you are interested in being able to answer questions such as:

  • How much am I spending in a given month on certain types of food?
  • How much am I eating out vs cooking at home?
  • Should I be shopping more for certain items at certain stores because they are generally cheaper there?

Write out a possible CREATE TABLE statement that includes all the columns you think you might need to answer the above questions. Include a comment for each column about why you chose the data type that you did for that column.

Problem 2: Gotta Catch ’em All

Given that we will not be introducing how to import or export data until Monday, to get you started here I am providing you with some SQL to create a table and populate it with information about all the available Pokemon through Gen 61. The file called pokemon_generator.sql in the data folder has the table creation and insertion lines that you need to run, so you can just copy and paste or drag it into your client to open it and run the commands. Then use the created table to answer the below questions. You may need to wait on the last question until after Monday to be able to export the CSV (or read ahead some).

  • How many legendary Pokemon might we consider “defensive”, in that they have a defense stat which is higher than their offense? We are talking just normal offense and defense here, not special offense or defense.
  • What 2nd generation Pokemon that is not a Mega form (with “Mega” in its name) has the greatest sum of its stats? In case of a tie choose the one whose name comes last in the alphabet.
  • Save to a CSV file (named type_combos.csv) a table of all the unique combinations of dual-type Pokemon (Pokemon with two different, non-null types). Your file should include just the two types, and should be ordered alphabetically by type 1 and then type 2. No header line should be present at the top. Don’t forget to upload this file!

Problem 3: A Real Cinephile

The movie dataset (movies.csv) contains over nine thousand movies with information pulled from The Movie Database website. Create a table within your database called movies, set some reasonable data types for the columns, and then import the data from the CSV file. Sometimes it can be hard to anticipate the best data type for a column, and so occasionally your import might fail because of an illegal data type. It happens! If that occurs to you, the easiest fix is to just delete (or in SQL parlance, DROP) the table by doing DROP TABLE your_table_name;, and then recreate the table with improved data types. After you have successfully gotten the table created and the data imported, use it to answer the following questions.

  • What is the most popular movie released in the 80s (1980 - 1989)? In case of a tie, choose the movie who name comes first in the alphabet.
  • What are the top 5 movies with the highest vote average and minimum of 10000 votes to come out since the start of 2010?
  • Write to a CSV (called spider_not_man.csv) a table of all movies that reference spiders in either the title or the overview, but which are not a Spider-Man movie. Your table should include 3 columns: the release date, the title, and the popularity, and be ordered from the oldest release date to newest. Include a header line at the top. Don’t forget to upload this file!

Problem 4: New Year New You

Finally, the New Year’s Resolutions dataset (resolutions.csv) contains just over 5 thousand tweets that people made at the start of 2015 containing the hashtag #NewYearsResolution. Create a table within your database called resolutions, set some reasonable data types for the columns, and then import the data from the CSV file. Once you have the data imported, answer the below questions.

Warning

The creator of this dataset (not me!) made some interesting choices in how certain columns are encoded. Be prepared to need to adjust some column types from what you might have expected…

  • Who was the last person to post a primary topic/theme pertaining to “time management” before the start of the new year?
  • What female from the Midwest got the greatest number of retweets?
  • Some individuals included more hashtags than just the #NewYearsResolution tag. Export to a CSV called hashtag_happy.csv a table of all the individuals from Oregon who included more than one hashtag in their tweet. Some clever pattern matching will be your friend here! Your table should include just the username and the category of tweet, and should be ordered alphabetically by the username. Include a header line at the top. Don’t forget to upload the file!

Footnotes

  1. I know this is not particularly up-to-date, but other datasets tended to have around 40-50 columns, and I didn’t want you to have to deal with that!↩︎