Introduction
Your last assignment in this course is a group project in which you compile and then utilize a database to construct and answer a homework problem of your own devising. You are entirely free to choose a topic and database centered around whatever types of data you’d prefer, but the following should give you some guidance about the different “pieces” of the project. Deliverables are detailed further down the page and will be due at the end of the day on the last day of class (though some groups will present on that Monday). Finally, some information on the grading rubric is given at the end.
Project deliverables will be uploaded to GitHub as per usual, so you can accept the assignment below to create and gain access to that shared repository. Since you are working in groups, the first member to accept will need to create the group: please name it according to your group member’s names. Then future members when accepting will just be able to select the existing group. Try to make sure you do not accidentally join the wrong group, as the only remedy in that case is to contact me so that I can manually remove you.
Accept ProjectProject Pieces
Database Construction
Before anything else can happen, you need to construct your database. Which means choosing your data sources and ingesting that information into an organized format in your database.
Data Sources
Your created database should incorporate information taken from a number of different sources equal to the number of members in your group. Reasonably, that means it should be comprised of at least 3 or more database tables with shared relationships between them. If you are searching for good data sources, some excellent places to start might include, but are certainly not limited to:
If you are willing to do a bit of programming in either Python or R, you can access other large repositories of information by utilizing publically available APIs or some basic webscraping. A great resource for free web APIs is here
While everyone is running their own Postgres server on their system, it is sometimes useful to have a server that can be more easily accessed by all group members. There are some good free options out there that will allow you to easily set up a database in the cloud and then connect to it with Beekeeper just like your local database, provided you don’t try to store massive amounts of data on it. Two good options are Railway and Supabase. I’ll try to demo these in class if we have time, or I am happy to help you set them up in office hours if it is of interest to you. You can absolutely still do this project without them, it just requires one person handle the actually database largely on their own system.
Database Organization
You should take care to follow best practices when organizing your eventual database from the information that you are using. This means:
- Ensuring all tables are in 3NF, unless you have a very convincing reason for why they should not be.
- Making sure all table and column identifiers are clear, concise, and consistent
- Clearly indicating relationships between the data with foreign keys
- Creating indexes where appropriate on columns that might be heavily searched
- Defining other constraints where appropriate, be they
CHECK,NOT NULLorUNIQUEconstraints. Think about how to ensure that the data in your tables could not be corrupted with invalid inputs.
Imagine that this database that you are creating will need to stand on its own for 3-5 years. You might not be actively adding new data to it over those years, but construct it as if that would be the case.
It will make your life much easier later if you create all of your tables for this project in their own separate database, or at least in their own schema. Remember you can create a new database with CREATE DATABASE new_name; or a new schema with CREATE SCHEMA new_name;. This will just streamline later exporting your database.
Problem Construction
Once you have a constructed database, the follow-up portion of this project is to write homework question that utilizes your database to meet 3 or more learning objectives from different chapters. I will include a list of learning objectives that have governed the homework questions you have completed over the course of the semester that you can pull from. Your homework question may have a few parts as necessary to meet several learning objectives, but it is not required. Try to write a problem that would interest your peers and showcase some interesting insights about the data that you have collected. A good problem:
- Is well articulated and clearly explained. The data and database is often introduced so that the student understands what they will be working with.
- Requires the student to directly meet/demonstrate the learning objectives in order to solve the problem
- Highlights interesting features of the dataset. These problems revolve around real world data, and those data tell a story. A good problem can feature parts of that story.
- Is self-contained. It does not require important information not included in the problem or critical techniques not showcased in class.
In addition to creating the problem, you should be able to correctly answer the problem and write up a solution key. The solution should be more than just raw SQL queries: it should also include some exposition and explanations about what was done and why.
You have over 20-30 examples of what I consider largely “good” homework questions that you can look at for examples of some of this! Similarly, you have access to solutions that might inform how you write up your own solutions.
Deliverables
There will be three deliverables that you will be responsible for. All should be uploaded to the GitHub repository, or have links to where they can be found added to the README file in the repository.
Presentation
You and your group will make an 8 minute presentation on one of the last two days of class, which will then be followed by 2 minutes of questions. You should plan to cover within your presentation such topics as:
- What in the main topic that your data revolves around?
- Where did you get your data from, and what did it look like before being added to your database?
- How did you decide to organize your database? Were any choices particularly difficult?
- How have you endeavored to ensure that the data within your database will remain valid and uncorrupted?
- What was main idea behind your homework question?
- What learning objectives does your question meet?
- Does the solution to your question highlight any interesting features in the data?
As a general rule of thumb, you should plan to spend about 2/3 of your time discussing you data and the database, and only about 1/3 of your time discussing your written question. Practice good presentation skills here, with clean, attractive slides. Slides should not be packed full of text, and any visuals should be large enough for them to be easily visible. Practice makes perfect, and your delivery should be shared equally between group members, practiced, and articulate.
Please either upload your slideshow or add a link to where it can be accessed.
The Question and Answer
You will need to submit a written copy of both your homework question and corresponding solution. These copies should be in such a state that they could be handed to another student and that student could understand and complete your problem (alongside your database dump) without any further explanations or information. The question should be in a separate document from the solution, and both should be submitted in PDF form.
The Database
For the purposes of a student being able to access your database to answer your question, you will be expected to export your database to a .sql file that could then be imported into another PostgreSQL database. This will involve a command line tool that will be showcased in class the Monday before Thanksgiving.
Rubric
Projects will be scored according to the following point breakdowns, so ensure that you provide evidence of the following in each area.
| Points | Description |
|---|---|
| 10 | The appropriate number of unique data sources is used |
| 20 | Database includes at least 3 or more tables in 3rd normal form |
| 10 | Primary and foreign keys are defined and represent real relationships between the data |
| 10 | Indexes of the appropriate type have been created where reasonable, or justification given as to why they are not useful in this context |
| 5 | Good identifier naming conventions have been followed and identifiers are consistent |
| 10 | Other constraints have been added to columns to help maintain data integrity |
| Points | Description |
|---|---|
| 10 | The problem is clearly written and requires no outside information |
| 15 | The learning objectives are clearly indicated and met in the course of answering the problem |
| 5 | The problem is standalone and could be handed to another student to complete |
| 10 | The solution to the problem is both correct and complete with queries and explanations |
| Points | Description |
|---|---|
| 15 | Slides well constructed, aesthetic, and not walls of text or unreadable database tables |
| 15 | Talk feels smooth and practiced, good sharing of responsibilities, presenters feel knowledgable |
| 5 | Presenters are able to answer questions effectively at the end |