Homework 9

For the problem below, you are expected to submit a standalone HTML file (with any images embedded) back to GitHub. Data for the problem is provided in the /data folder of the starting repository. Since this is the second homework in this unit, groups have already been created, so just join your group when you accept the assignment.

Accept Assignment


Problem: Galaxy Classification

The Sloan Digital Sky Survey (SDSS) took images of millions of objects across the night sky, including many galaxies. The Galaxy Zoo project has sought to classify those galaxies by tapping into a community of volunteers. In this problem, you will use the results of their work to build several machine learning models that predict a galaxy’s broad type: elliptical, merger/irregular, or spiral.

The SDSS imaging system operated over a wide range of wavelengths, with different filters covering different portions of the overall wavelength range. As such, when you see terms like u and g showing up, they are referring to the magnitude of the object in that specific filtered band of wavelengths. You can see the breakdown of filters below:

SDSS Wavelength filters

The dataset in galaxies.csv provides 780 labeled galaxies with a variety of other observed and morphological features included. These include:

  • Color information, in the form of magnitude differences (u-g for instance). These may be useful since we expect spiral galaxies to be “bluer” owing to their more recent star formation.
  • Eccentricity information, retrieved from fitting an ellipse to the galaxy and extracting its semi-major and semi-minor axes.
  • The 4th order adaptive moment in each filter, which measures the kurtosis or “tailed-ness” of a brightness distribution.
  • Petrosian magnitudes at the 50th and 90th percentile in each filter. Because galaxies are large enough to be resolved (extend across multiple pixels), you could measure their magnitude at different distances away from the center. The Petrosian magnitudes measure the magnitude of the galaxy 50 percent or 90 percent away from the center of the galaxy. This helps capture the drop-off in the number of stars as one moves away from the galactic center.
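As a rough illustration of how the first two kinds of features are defined, the sketch below computes a color index as a magnitude difference and an eccentricity from a pair of ellipse axes. The function names and the numbers are made up for illustration, and the standard ellipse formula used here is an assumption: the eccentricity column in galaxies.csv may have been computed with a different convention, and the file already contains these features.

```python
import math


def color_index(mag_blue, mag_red):
    """Color as a magnitude difference, e.g. u - g. Smaller values are bluer."""
    return mag_blue - mag_red


def eccentricity(semi_major, semi_minor):
    """Eccentricity of an ellipse with the given semi-axes (0 means a circle)."""
    if semi_major <= 0 or not (0 <= semi_minor <= semi_major):
        raise ValueError("expected semi_major > 0 and 0 <= semi_minor <= semi_major")
    return math.sqrt(1.0 - (semi_minor / semi_major) ** 2)


# Made-up values purely for illustration (not taken from galaxies.csv).
u_mag, g_mag = 19.5, 18.0
print("u-g color:", color_index(u_mag, g_mag))
print("eccentricity:", eccentricity(5.0, 3.0))
```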

Your task here will be to develop three different machine learning models which use this data to predict a galaxy’s type. For all three models, take a reasonable training/testing split of your data so that you have an unbiased held-out dataset against which to evaluate each model.
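One possible way to take such a split is sketched below, assuming Python with scikit-learn (the assignment does not prescribe a library). Because galaxies.csv and its column names are not reproduced here, the sketch uses a synthetic stand-in with 780 rows and three classes; the 75/25 ratio and the fixed random seed are arbitrary choices, not requirements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for galaxies.csv: 780 rows, 3 classes, 12 numeric features.
# With the real file you would load it instead and separate the label column
# from the feature columns.
X, y = make_classification(
    n_samples=780, n_features=12, n_informative=6, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# stratify=y keeps the class proportions similar in both pieces, and a fixed
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print("train class fractions:", np.bincount(y_train) / len(y_train))
print("test class fractions: ", np.bincount(y_test) / len(y_test))
```

Reusing the same split for all three models keeps their confusion matrices and accuracies directly comparable.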

Part A: Logistic Regression

Create a logistic regression model and fit it to your training data. Create a confusion matrix of your testing results and compute the overall accuracy.
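A minimal sketch of this workflow follows, again assuming scikit-learn and a synthetic stand-in for the galaxy data. Scaling the features before the regression and raising `max_iter` are choices made for this sketch to help the solver converge, not requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the galaxy features and labels (3 classes).
X, y = make_classification(
    n_samples=780, n_features=12, n_informative=6, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the features, then fit a multiclass logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Rows of the confusion matrix are true classes, columns are predicted classes.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(cm)
print(f"accuracy = {acc:.3f}")
```

The overall accuracy is the sum of the diagonal of the confusion matrix divided by the total number of test galaxies.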

Part B: Decision Tree Classifier

Create a decision tree classifier model and fit it to your training data. You are free to tune the various tree parameters as you see fit to get a result that you are happy with, but clearly document the values you ended up using. Create a confusion matrix of your testing results and compute the overall accuracy. Additionally, plot the feature importances to show which features were most used in building the tree. Which feature seems to be the most useful in determining galaxy type?
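The sketch below shows one way to fit a tree and rank its feature importances, assuming scikit-learn and synthetic stand-in data. The `max_depth` and `min_samples_leaf` values and the `feature_i` names are placeholders chosen for illustration, not recommended settings or real column names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; feature names are hypothetical placeholders.
X, y = make_classification(
    n_samples=780, n_features=12, n_informative=6, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative hyperparameters: a shallow tree with a minimum leaf size.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}")

# feature_importances_ holds one non-negative weight per feature.
importances = tree.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, weight in ranking:
    print(f"{name:>12s}  {weight:.3f}")
```

The sorted `ranking` list can be passed to a horizontal bar chart (for instance with matplotlib's `barh`) to produce the requested plot. A shallow tree typically assigns zero importance to features it never splits on.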

Part C: Random Forest Classifier

Create a random forest classifier model and fit it to your training data. Again, you are free to tune the random forest parameters as you see fit to get a result you are happy with, but document the values you used. Create a confusion matrix of your testing results and compute the overall accuracy. As before, plot the feature importances to show which features were most used in your random forest model. This will likely look significantly different from your single decision tree! Which feature seemed to be the most useful?
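To see how the two importance profiles can differ, the sketch below fits a single tree and a forest on the same synthetic stand-in data and prints their importances side by side. It assumes scikit-learn; `n_estimators=200`, the tree depth, and the `feature_i` names are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; feature names are hypothetical placeholders.
X, y = make_classification(
    n_samples=780, n_features=12, n_informative=6, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative settings only; tune these for the real data.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy = {acc:.3f}")

# Fit a single shallow tree on the same split for comparison.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print(f"{'feature':>12s}  {'tree':>6s}  {'forest':>6s}")
for name, t_imp, f_imp in zip(
    feature_names, tree.feature_importances_, forest.feature_importances_
):
    print(f"{name:>12s}  {t_imp:6.3f}  {f_imp:6.3f}")
```

The forest averages importances over many trees, each grown on a bootstrap sample with random feature subsets at each split, so its importance tends to be spread across more features than in a single tree.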

Part D: Comparisons

You have now created three different models, and have some confusion matrices and accuracy data with which to compare them. For each of the specific use-cases below, determine which of your models you think would work best, and justify your reasoning.

  • You are interested in doing a study on elliptical galaxies. Because of the nature of the study, you want to ensure that any galaxies you have in your dataset are truly elliptical. You don’t mind if you miss out on a few elliptical galaxies in the process.
  • You are doing a study on star-forming regions and want to include as many spiral galaxies as possible in your dataset. You don’t mind if you get a few mergers in your dataset as well, since they are also commonly star-forming regions.
  • You are interested in exploring the intersection between spiral and elliptical galaxies, and thus would love to build up a dataset of galaxies that could easily be mistaken as either a spiral or an elliptical galaxy.
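When weighing use-cases like these, per-class precision and recall are often more informative than overall accuracy, and predicted class probabilities can flag galaxies a model finds hard to tell apart. The sketch below computes both from plain Python lists. The confusion matrix counts and probability rows are made up for illustration and are not results from any model; the function names and the `margin` threshold are likewise hypothetical.

```python
def per_class_precision_recall(cm, labels):
    """Precision and recall per class; cm rows are true, columns predicted."""
    stats = {}
    for i, name in enumerate(labels):
        true_positive = cm[i][i]
        predicted_as_i = sum(row[i] for row in cm)  # column sum
        actually_i = sum(cm[i])                     # row sum
        precision = true_positive / predicted_as_i if predicted_as_i else 0.0
        recall = true_positive / actually_i if actually_i else 0.0
        stats[name] = (precision, recall)
    return stats


def ambiguous_between(probabilities, i, j, margin=0.15):
    """Indices of rows where classes i and j are the top two and nearly tied."""
    picks = []
    for idx, row in enumerate(probabilities):
        top_two = sorted(range(len(row)), key=lambda k: row[k], reverse=True)[:2]
        if set(top_two) == {i, j} and abs(row[i] - row[j]) <= margin:
            picks.append(idx)
    return picks


labels = ["elliptical", "merger", "spiral"]

# Made-up counts purely for illustration (not real results).
cm = [
    [50, 2, 8],
    [3, 10, 7],
    [6, 4, 60],
]
stats = per_class_precision_recall(cm, labels)
for name, (precision, recall) in stats.items():
    print(f"{name:>10s}  precision={precision:.3f}  recall={recall:.3f}")

# Made-up class probabilities, one row per galaxy, columns ordered as labels.
probs = [
    [0.48, 0.04, 0.48],
    [0.90, 0.05, 0.05],
    [0.45, 0.15, 0.40],
    [0.10, 0.50, 0.40],
]
print(ambiguous_between(probs, labels.index("elliptical"), labels.index("spiral")))
```

Precision for a class is the fraction of galaxies predicted as that class that truly belong to it, while recall is the fraction of true members that the model recovers. With scikit-learn classifiers, probability rows like `probs` are available from the `predict_proba` method.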