The data summary is meant to serve as a checkpoint that you have the vast majority of your data collection and organization complete. As such, its requirements should be fairly easy to accomplish provided that your data collection has been completed. Data summaries should be saved or exported as either PDF or HTML and uploaded to your team GitHub repository. A data summary is an expository work, and should utilize full sentences and paragraphs.
Please ensure that your summary addresses all of the requirements below.
Data Ingestion
Briefly cover any preprocessing or data cleaning that you had to perform on each of your data sources before storage. What issues were present, and how did you resolve them? The goal here is not to give hyper-in-depth explanations, but rather to convey clearly what problems you had and how they were overcome. That way, if you or others come across similar difficulties in the future, you will have a document to remind you how those problems were solved.
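One way to make this write-up easier is to have your cleaning code count each issue as it fixes it, so the numbers can go straight into your summary. The following is only a minimal sketch under assumed conditions: the column names (`station`, `date`, `temp_c`), the `NA` marker for missing values, and the function name `clean_rows` are all hypothetical and stand in for whatever your own sources contain.

```python
import csv
import io
from collections import Counter


def clean_rows(raw_csv_text):
    """Clean a hypothetical readings CSV and tally every issue that was fixed.

    Returns (cleaned_rows, issue_counts). Column names are illustrative only.
    """
    issues = Counter()
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv_text)):
        station = row["station"].strip()
        date = row["date"].strip()
        temp = row["temp_c"].strip()

        if station != row["station"]:
            issues["whitespace trimmed from station"] += 1

        # Assumption: missing temperatures arrive as "" or "NA".
        if temp in ("", "NA"):
            issues["missing temp_c stored as NULL"] += 1
            temp_value = None
        else:
            temp_value = float(temp)

        # Assumption: (station, date) identifies a reading uniquely.
        key = (station, date)
        if key in seen:
            issues["duplicate row dropped"] += 1
            continue
        seen.add(key)
        cleaned.append({"station": station, "date": date, "temp_c": temp_value})
    return cleaned, dict(issues)
```

The returned tally gives you a short, factual list of problems and remedies per data source, which is the level of detail this section asks for.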
Data Organization
Construct and include a detailed Entity Relationship Diagram (ERD) showcasing how your data is stored across different tables, including chosen data types, constraints and relations. Sites like DrawSQL or DBDiagram can help. These tables should be in third normal form unless you have a very compelling reason for denormalized data (which you should clearly explain). Take your time on this and ensure that it includes all necessary details and corresponds with your actual database storage.
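To illustrate the level of detail expected behind an ERD, here is a small sketch of two related tables with explicit types, primary keys, a foreign key, and a few other constraints. It assumes SQLite accessed through Python's built-in `sqlite3` module; the tables `station` and `reading`, their columns, and the helper `build_database` are hypothetical examples, not a required design.

```python
import sqlite3

# Hypothetical schema for illustration only; your tables will differ.
SCHEMA = """
CREATE TABLE station (
    station_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    latitude   REAL NOT NULL CHECK (latitude BETWEEN -90 AND 90),
    longitude  REAL NOT NULL CHECK (longitude BETWEEN -180 AND 180)
);

CREATE TABLE reading (
    reading_id  INTEGER PRIMARY KEY,
    station_id  INTEGER NOT NULL REFERENCES station (station_id),
    observed_on TEXT NOT NULL,  -- ISO 8601 date stored as text
    temp_c      REAL,           -- nullable: a reading may lack a temperature
    UNIQUE (station_id, observed_on)
);
"""


def build_database(path=":memory:"):
    """Create the example tables and return the open connection."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Station attributes live only in `station`, and `reading` refers to them by key rather than repeating them, which is the kind of separation third normal form calls for. Your ERD should show the same information (types, keys, and relations) for every table you actually store.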
Data Compilation
Projects bring in data from a variety of sources, so the last portion of the data summary should show that you have organized and formatted your data well enough to join easily across those disparate sources. Include 2-3 visuals showcasing a bit of exploratory data analysis that pulls information from multiple tables and data sources. If you have done your organization well above, this should be easy. This requirement merely serves to ensure that you are ready to begin your analysis and that your data organization or format won’t get in the way. For each visual, explain what it is showing and which data sources it pulled information from.
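As a rough illustration of the kind of query that could sit behind one of these visuals, the sketch below joins two tables and aggregates the result into rows that are ready to plot. It assumes SQLite through Python's built-in `sqlite3` module; the tables, the sample values, and the function names `build_demo_db` and `monthly_mean_by_region` are all made up for the example.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE station (
            station_id INTEGER PRIMARY KEY,
            region     TEXT NOT NULL
        );
        CREATE TABLE reading (
            reading_id  INTEGER PRIMARY KEY,
            station_id  INTEGER NOT NULL REFERENCES station (station_id),
            observed_on TEXT NOT NULL,
            temp_c      REAL
        );
        INSERT INTO station VALUES (1, 'North'), (2, 'South');
        INSERT INTO reading (station_id, observed_on, temp_c) VALUES
            (1, '2024-01-05', 1.0),
            (1, '2024-01-20', 3.0),
            (2, '2024-01-05', 10.0);
    """)
    return conn


def monthly_mean_by_region(conn):
    """Join readings to stations and average temperature per region and month."""
    query = """
        SELECT s.region,
               substr(r.observed_on, 1, 7) AS month,
               AVG(r.temp_c) AS mean_temp_c
        FROM reading AS r
        JOIN station AS s ON s.station_id = r.station_id
        GROUP BY s.region, substr(r.observed_on, 1, 7)
        ORDER BY s.region, month
    """
    return conn.execute(query).fetchall()
```

Each returned row is a `(region, month, mean_temp_c)` tuple that can be handed to whichever plotting tool you prefer. In your explanation of the resulting visual, you would name both tables and the original data sources they came from.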
Scoring
| Category | Subcategory | 3 points | 2 points | 1 point | 0 points |
|---|---|---|---|---|---|
| Ingestion | Data Sources | All data sources touched on with | Most data sources mentioned, with fewer than 25% missing | Some data sources mentioned, but 50% or more missing | No specific data sources mentioned |
| | Explanations | Explanations touch on issues as well as the preprocessing steps in a general but comprehensive manner. | Some issues are not mentioned, or described steps are too vague or overly specific. | Some issues are not mentioned, or described steps are too vague or overly specific. | No explanation of preprocessing steps |
| Data Organization | Normalization | All data appears to be in 3NF, or very clear reasons cited if not. | > 50% of tables are in 3NF, but some have interior relations | <= 50% of tables are in 3NF | Tables show little to no signs of normalization |
| | Types | All types are defined and appear reasonable | A small number of columns appear mis-typed or with problematic types | A large number of columns have type issues | Little to no effort has been put into declaring proper types |
| | Constraints | Primary keys defined. Foreign keys defined where appropriate and some consideration shown to other constraints. | Primary keys defined. Some foreign keys or other constraints missing. | Some primary keys missing. Foreign keys or other constraints almost entirely absent. | No primary keys or other constraints present |
| Data Compilation | Combinations | 3 or more tables joined across visuals (data permitting) | 2 tables joined across visuals | Visuals only show content from a single table | Visuals are hard-coded and don’t access data from any tables |
| | Explanations | Explanations clearly describe what is shown and which data sources the information was pulled from. | Explanations not always clear, but sourced data indicated | Some explanations missing, or no clear citing of sourced data | No explanations included with figures at all |