The written portion of your Capstone project is perhaps the most involved, as it gives you space to lay out all you have accomplished without constraints on time or space. A good write-up should clearly communicate your problem, your approaches to solving that problem, and your conclusions and their ramifications. The guide below is intended to give you some formatting instructions as well as content reminders and expectations.
Academic Writing
Depending on your background, you may not have had the opportunity to write many scientific manuscripts. Certain expectations of scientific writing can thus at times be missed. We have discussed this in class here, and there is another potentially useful resource here (though a few pieces about article submission won’t pertain). You have also been reading various articles over the course of the semester with the chief aim of increasing your familiarity with how data scientists structure their written communication. If you are looking for more examples, articles from the Harvard Data Science Review are a good source. Perhaps the most important thing to understand is that scientific writing is about communicating what was meaningfully done and the implications of those results. It is not a history of the things you tried and the exact code you used to achieve them.
Online Publication
You will be publishing this publicly online, and thus are free to use any of the powers of a modern web browser to facilitate your communication and explanations. This can include, but is not limited to, hyperlinks for navigation, lightboxing images for easier inspection, and animations. My strong recommendation is to publish a static webpage using GitHub Pages and Quarto as explained here, but if you have strong reasons to deviate from that and can support yourself, then feel free to do your own thing.
Contents
The following sections are topics of content that should appear somewhere within your written capstone. They do not need to be organized as discretely as this, and in fact in many cases doing so would not be the best way to communicate your data story. But they should appear somewhere in some form.
Introduction / Background
You need to set the stage by explaining to your reader what problem it is that you are trying to solve or investigate. In many cases, this also means bringing the reader up to speed on the field that you are investigating. Why is this particular area interesting? What minimum vocabulary or understanding would a reader need to have to comprehend the choices and explanations that you are making? Remember that while you may have come into this area with existing background knowledge, and have then been researching it for two months, a reader will almost certainly not have that same background knowledge. What is the existing state of research in this area? How do you expect your data and research to fit into that broader scientific understanding?
Data Engineering
Gathering data was an important part of getting started on this project. How did you go about gathering your data? Where did they come from? What assumptions did you have to make, and what cleaning did you have to do? How did you eventually end up storing the data? You are likely going to be referring to parts of these data throughout the rest of your manuscript, so take time to ensure that the reader understands where different things are coming from and that you have ways to unambiguously refer to different pieces of the data.
Statistical Thinking
Statistical thinking is a way of understanding our complex world by modeling it in simpler terms that nonetheless capture essential aspects of its structure, and then also quantifying how uncertain we are about our knowledge or conclusions. How have you applied statistical thinking to your problem? What model (or models) did you create to try to better understand the data and how they pertain to your question? What methodologies did you use, and why were they appropriate to this situation? How confident are you in the results of your analysis?
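As one illustration of what "quantifying how uncertain we are" can look like in practice, the sketch below computes a percentile bootstrap confidence interval for a sample mean using only the Python standard library. The function name, its defaults, and the sample values are all made up for illustration. The bootstrap is just one of many reasonable approaches, and whether it is appropriate depends on your data and question.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; not a required method.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Made-up measurements, purely for demonstration.
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7]
estimate, low, high = bootstrap_mean_ci(sample)
print(f"mean = {estimate:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Reporting an interval like this alongside a point estimate gives the reader a direct answer to how confident you are in a result.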
Data Visualization
The beating heart of most scientific publications is their visuals, to the point where you should have almost all your visuals complete before starting to write your results or conclusions. Choose the most appropriate visuals for the data you are trying to show. Visuals should always enhance the understanding that a reader gets from reading the text. Any visual that is not serving that purpose should either be cut or better explained in the text. At no point should a reader encounter a visual in the course of reading your manuscript and be forced to ask “What am I looking at here?” At the same time, a core truth of scientific writing is that the visuals are often the only thing looked at in a manuscript, so ensure that they are self-explanatory and include excellent captions. Interactive visuals are an option here as well, since everything will be published on the web. This can be a great way of conveying more information in a graphic at times, but don’t let it distract from the central role the graphic has to play in enhancing reader understanding of a particular point.
Machine Learning
Machine learning can come in a wide variety of forms, but some analysis using machine learning should be evident in your writing. It may be closely tied to your statistical thinking, stand in opposition to it, or serve as an alternative. Ensure that you clearly explain what it is that you are doing and what algorithms or methods you are using. Why are those methods the best for your current situation? If your machine learning analysis offers alternative or conflicting results to your other work, explain where such discrepancies may have come from, and exactly how they might have caused the results that you are seeing.
Data Ethics
What are the data ethics considerations of your project, and how did you address any concerns throughout? How was privacy handled? What biases may be built into your conclusions based on the data that you started with? Are there other ethical considerations that you needed to make?
Conclusions
Like any good story, scientific writing ends with a conclusion. What were your final results? Were you able to answer your original question adequately? How confident are you in your results? How did your project end up contributing to or fitting in among the existing research in this area? What would be the most useful directions to extend this project in the future? Or is this line of research a dead end, and time and energy could better be spent going in different directions? This should absolutely relate back to what you wrote in the introduction, bringing things full circle and reinforcing what you have contributed and where it fits within our understanding of the problem.
How rough is rough?
While this is a rough draft, the expectation is that you should have evidence of all the above written. Some aspects, especially in results and conclusions, might have some parts missing or need more fleshing out in the future. Other parts, like the data acquisition, you should expect to have written to completion. Visuals should be as close to finished as you can make them. As a general rule of thumb, later parts of the analysis should be at least three-quarters complete, while earlier parts should be 100% complete.
Rubric
| Category | Exceeds (3) | Meets (2) | Developing (1) | Does Not Meet (0) |
|---|---|---|---|---|
| Motivation (x1) | Clearly articulates a compelling and well-justified motivation for the project, with significant relevance and impact. | Articulates a clear motivation for the project with relevance and impact. | Provides some motivation for the project, but it may lack clarity or relevance. | No clear motivation for the project is provided. |
| Background (x1) | Provides a comprehensive introductory history of the topic and problem so that a reader with no prior experience in the topic can easily follow along. | Provides enough history and explanations of the topic and problem so that a reader doesn’t become lost. | Some background information is missing, so that topics or ideas come up that the reader might not understand or grasp the significance of. | No real background information is given, so that a reader without existing experience in the area quickly becomes lost. |
| Statistical Thinking Execution (x3) | Demonstrates strong statistical thinking with appropriate methodologies and clear explanations. | Demonstrates statistical thinking with appropriate methodologies. | Shows some statistical thinking, but methodologies or explanations are lacking. | Shows little to no demonstration of statistical thinking or appropriate methodologies. |
| Data Visualization Execution (x3) | Uses highly effective and relevant data visualizations that enhance understanding and are well-integrated into the document. | Uses effective data visualizations that are relevant and integrated into the write-up. | Uses some data visualizations, but they may lack relevance or integration. | Little to no use of data visualizations shown. |
| Data Engineering Execution (x3) | Demonstrates advanced data engineering skills with well-organized, efficient, and scalable solutions. | Demonstrates competent data engineering skills with organized and efficient solutions. | Demonstrates basic data engineering skills, but lacks efficiency or organization. | Shows little to no use of data engineering skills. |
| Machine Learning Execution (x3) | Showcases machine learning analysis with comprehensive explanations of why techniques were used and evaluation of the results. | Shows machine learning analysis with some explanations of techniques and results. | Attempts some machine learning analysis, but it is poorly explained or irrelevant to the problem at hand. | Shows no evidence of machine learning analysis. |
| Data Ethics Considerations (x2) | Thoroughly addresses data ethics considerations, including privacy, bias, and ethical implications. | Addresses data ethics considerations, including privacy and bias. | Mentions data ethics considerations but lacks depth or detail. | Little to no consideration of data ethics is mentioned. |
| Conclusions (x1) | Conclusions are fully supported by the results and evaluated in the context of their uncertainty. Implications for the field and future work are covered extensively. | Conclusions are fully supported by the results, and implications and future work are covered. | Conclusions are present but may not be fully supported by the results, or implications or future work are covered only briefly. | No real conclusions are evident. |
| Writing Quality (x2) | Writing is clear, coherent, and professional with proper grammar and syntax. Citations are properly formatted and complete. | Writing is clear and coherent with proper grammar and syntax. Citations are present and mostly correct. | Writing is understandable but may lack clarity or professionalism. Citations are incomplete or incorrectly formatted. | Writing is unclear, confusing, or unprofessional. Little or no use of citations. |
| Data Storytelling (x2) | A reader is smoothly guided through an entire data story, from motivation to conclusion, seamlessly. | An entire data story is present, with decent transitions. | A data story is present, but it has many jumps or disconnects that make it difficult for a reader to follow. | No real story is present. The document reads more as a series of disconnected sections. |
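
The rubric does not state how category scores are combined. If you want a rough way to self-assess, one plausible reading is that each category's level (0 to 3) is multiplied by its listed weight and the products are summed. The sketch below works through that arithmetic under that assumption; the names `WEIGHTS` and `weighted_total` are hypothetical and the actual grading procedure may differ.

```python
# Multipliers copied from the "(xN)" labels in the rubric table.
WEIGHTS = {
    "Motivation": 1,
    "Background": 1,
    "Statistical Thinking Execution": 3,
    "Data Visualization Execution": 3,
    "Data Engineering Execution": 3,
    "Machine Learning Execution": 3,
    "Data Ethics Considerations": 2,
    "Conclusions": 1,
    "Writing Quality": 2,
    "Data Storytelling": 2,
}
MAX_LEVEL = 3  # "Exceeds" is worth 3 points before weighting


def weighted_total(levels):
    """Sum of weight * level over all categories (assumed scoring scheme)."""
    missing = set(WEIGHTS) - set(levels)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[category] * levels[category] for category in WEIGHTS)


max_total = MAX_LEVEL * sum(WEIGHTS.values())  # 3 * 21 = 63

# Example self-assessment: "Meets (2)" in every category.
all_meets = weighted_total({category: 2 for category in WEIGHTS})
print(f"{all_meets} out of {max_total}")
```

Under this reading, the four execution categories together carry 12 of the 21 weight units, so they dominate the total.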