Data Summary Example

The following serves as a synopsis of the data gathered for the proposed capstone project. Here that was investigating the question: “How has the Section Leader program at Willamette University impacted student performance and participation in the CS-151 course?” To answer this question, information about student section attendance, as well as class attendance and scoring, were collected and organized in a relational database.

Data Ingestion

The data for this project was already largely compiled and organized to some extent, though it was spread across multiple locations. For this project student data was anonymized beyond student ID numbers, and so all other student information was stripped out of the currently organized data. Focusing on the ingestion for each main data source:

Section Attendance:: Section leaders record this data in a tabular format for each week. As originally organized, this was not in a tidy format, with students across the rows and weeks across the columns, and so the individual tables were parsed into unique rows, each of which contained a student ID, the unique class ID that student was a part of, and the week number corresponding to each week of section. Students have section on different days of the week, so week number was the best method of unification. If a student did not attend their section a particular week, then there is no row for that student corresponding to that week. In other words, only present students are recorded.
Lecture Attendance:: Course participation is graded based on responses to in-lecture polls. These polls generally have some extra credit attached to encourage serious submissions, but for the sake of this analysis the only point of interest was if the student was in class on a given day. Thus receiving any points on a day was considered being present. For my own classes, this data was already existing in a tidy format, with rows only existing if a student had answered a question, so all that was necessary was to drop the points column.
Scores:: Project and overall scores were already present in my existing data storage, they just needed the appropriate joins to be collected together. The decision to include project scores in addition to overall class scores revolved largely around the idea that many weeks of section are dedicated to purely to assistance with the projects, and thus it might be useful to see if there was an effect there even if there was not on the overall course grade.

For each ingestion I used Python’s SQLAlchemy package to connect to the existing databases to pull out the desired information and then write it into the new capstone database.

Data Organization

As a result of anonymizing any information about students beyond their ID, the table structure ended up more simple than in the source databases. In particular, each fundamental table tracks both the student ID and a unique class ID as the primary key. Tracking the class ID is important as there have been shifts in the number of sections and projects over the time span the data was acquired. Additionally, should a student retake a class after performing poorly the first time, this allows their results to be considered independently. Figure 1 below shows the resulting structure, data types, and relationships across the created tables.

Figure 1: An ER diagram of the created tables and their relationships

Additional information is tracked across unique classes in the classes table to better understand and group students according to what their course content looked like that particular semester.

Data Compilation

As examples of the joinable nature of these tables, two figures have been created showcasing information pulled from several tables. The first is some exploratory data analysis into the relationship between section attendance and class attendance as a scatter plot. While certainly not causal, it would be interesting to know if some students are regularly attendending class, but not section, or vice-versa. This trend is shown in Figure 2.

Figure 2: A theoretical figure showcasing the trend between section and class attendance

Secondly, we have an early look into the relationships between section attendance and project and overall scores. While not initially as interesting to us, we’ll bring in the temporal nature of the data as well by color coding the data according to the year the class was taught, thereby having made connections across all tables. Figure 3 showcases this trend, with overall score and project score being shown with different symbols and year being indicated with color.

Figure 3: A theoretical figure looking at trends between section attendance and project and overall scoring, with color mapping by year.

As such, we can successfully join and pull connected data from any of our created tables, and should be ready for full-scale analysis and machine learning.