Projects I’ve Worked On

 
 

Cluster Analysis of Depression Screening Data

This project was an exercise in cluster analysis. The goal was to use the k-means algorithm to segment respondents to the NHANES depression screener into groups and see whether I could isolate respondents with moderate-to-severe depressive symptoms in a single group. I explored demographic features, compared clustering results on the given features versus features extracted with PCA, used the within-cluster sum of squares and silhouette scores to choose a value for k, and drew conclusions about this group from the analysis.
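As a rough illustration of the within-cluster sum of squares step, the sketch below runs a small k-means on made-up 2-D points and prints the total for several values of k. It is not the project's code: the data, the function names, and the plain-Python k-means are all assumptions chosen to keep the example self-contained.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Basic k-means (Lloyd's algorithm); returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


def within_sum_of_squares(centroids, clusters):
    """Sum of squared distances from every point to its cluster centroid."""
    return sum(
        squared_distance(p, c)
        for c, cluster in zip(centroids, clusters)
        for p in cluster
    )


# Hypothetical toy data: three loose groups of points.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

# Look for the "elbow": the k after which the total stops dropping sharply.
for k in range(1, 6):
    centroids, clusters = kmeans(points, k)
    print(k, round(within_sum_of_squares(centroids, clusters), 3))
```

The total generally shrinks as k grows, so the useful signal is where the decrease levels off rather than the minimum itself. Silhouette scores offer a second check because they also penalize clusters that sit close together.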

 

HR Analytics: Predictive Modeling of Employee Attrition

As part of a course in Business Analytics with R, I led a group project that involved advising a (fictional) company on ways to reduce its employee attrition rate. Using a dataset from Kaggle, we fit a logistic regression model and classification trees to identify statistically significant variables that influence employee attrition, such as travel frequency and perceived work/life balance. We also built a bagging model that predicted employee attrition with 98% accuracy and 97% precision, both well above our benchmarks.

 

RateYourMusic: Interactive Visualization

This was the second part of a comprehensive project involving the online music database RYM. After creating the dataset, I used the Plotly package to create a colorful Flexible Dashboard summarizing different categories of features.

 

RateYourMusic: Scraping the Charts

I wrote a web-crawling program in R using rvest, dplyr, and other Tidyverse packages to extract album chart data from RYM, an online music database. I converted the results into a clean, unified dataset and shared it on Kaggle, where it currently has over 6,000 views and 700 downloads.

 

Health Analytics: Measuring Impact of Various Risk Factors on Diabetes Rates

For an undergraduate Econometrics course, I researched diabetes risk factors and built a logistic regression model in Stata to estimate the impact of various risk factors on diabetes rates among the non-institutionalized, adult, civilian population of the United States. I focused on overweight/obesity and found that it was a statistically significant predictor. The process involved consulting health professionals, data cleaning, and model building.

 

IBM Telco Customer Churn Prediction

In this exercise, I developed predictive models using a classification tree and AdaBoost to predict churn within the last month and compared them to a baseline established by a ZeroR classifier. I also plotted ROC curves to compare the performance of the two models. Both the classification tree and boosting models predicted churn with a nearly 50% better balanced accuracy than the ZeroR baseline, and the boosting model edged out the classification tree by a fraction of a percentage point.
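For context on the baseline, the sketch below shows how a ZeroR classifier behaves: it predicts the majority class for every row, so its plain accuracy can look respectable on imbalanced data while its balanced accuracy on a two-class problem is always 0.5. The labels and function names are hypothetical and are not taken from the Telco dataset or the original analysis.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the majority class, which ZeroR predicts for every row."""
    return Counter(train_labels).most_common(1)[0][0]


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical labels: 7 customers stayed, 3 churned.
y_true = ["stay"] * 7 + ["churn"] * 3

majority = zero_r(y_true)
baseline_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, baseline_pred)) / len(y_true)
print(accuracy)                                  # 0.7
print(balanced_accuracy(y_true, baseline_pred))  # 0.5
```

Because the baseline never identifies a single churner, its recall on the churn class is zero. That is why balanced accuracy is a more informative yardstick than raw accuracy when one class dominates.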

 

Visualizing Black American Population Growth

Using US Census Bureau estimates, I created an interactive dashboard to visualize the growth and relative population density of Black Americans, per state, from 2010 to 2020.

 

Database Project: Data Cleaning and Schema Design

For a graduate course in database foundations, I helped lead a group project where we designed a relational database in MySQL using scanner data of grocery store coffee sales. The process involved exploring and cleaning the data, schema design, schema normalization, and testing with both simple and complex queries.

 

COVID-19 Dashboard

A dashboard visualizing the global impact of the COVID-19 pandemic.