Data science projects completed for CSCI 5523 (Introduction to Data Mining), covering the machine learning pipeline from exploratory analysis through model evaluation.
Project 1: Classification & Analysis
Exploratory Data Analysis
- Telecom customer churn dataset analysis
- Feature distribution visualization
- Correlation analysis and data quality assessment
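As a rough illustration of the data quality and correlation checks listed above, the sketch below runs them on a tiny made-up table. The DataFrame and its column names (`tenure_months`, `monthly_charges`, `support_calls`, `churned`) are assumptions for the example, not the schema of the actual churn dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the churn table; column names are illustrative only.
df = pd.DataFrame({
    "tenure_months": [1, 12, 24, 36, 48, 60],
    "monthly_charges": [70.0, 65.5, np.nan, 55.0, 50.0, 45.5],
    "support_calls": [5, 4, 3, 2, 1, 0],
    "churned": [1, 1, 0, 0, 0, 0],
})

# Data quality: count missing values per column.
missing_per_column = df.isna().sum()

# Feature distributions: summary statistics for each numeric column.
summary = df.describe()

# Pearson correlation of every feature with the target (NaNs skipped pairwise).
corr_with_target = df.corr()["churned"].drop("churned").sort_values()

print(missing_per_column)
print(corr_with_target)
```

On real data the same three calls usually come first, followed by histograms or box plots for the columns that look skewed or incomplete.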
Decision Trees & kNN
- Implementation of classification algorithms
- Hyperparameter tuning and cross-validation
- Performance comparison across algorithms
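A minimal sketch of the tuning and comparison workflow, assuming scikit-learn's `GridSearchCV` with 5-fold cross-validation. The bundled iris data is used only as a convenient stand-in, and the hyperparameter grids are illustrative choices rather than the values used in the project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models with the hyperparameter grids to search (illustrative values).
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 3, 4, None], "min_samples_leaf": [1, 5]},
    ),
    "knn": (
        KNeighborsClassifier(),
        {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # Exhaustive grid search, scored by mean accuracy over 5 stratified folds.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(f"{name}: best params={params}, mean CV accuracy={score:.3f}")
```

Because the best score is selected on the same folds used for tuning, it is slightly optimistic; a held-out test set or nested cross-validation gives a fairer comparison.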
Naive Bayes Spam Classification
- Text preprocessing and feature extraction
- Probabilistic classification for spam detection
- Precision/recall analysis
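The steps above can be sketched with a bag-of-words model feeding a multinomial Naive Bayes classifier. The messages below are invented for the example, and `CountVectorizer` is one plausible feature extractor, not necessarily the one used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 1 = spam, 0 = ham.
train_texts = [
    "win free money now",
    "free prize claim now",
    "cheap loans win cash",
    "meeting agenda for monday",
    "lunch with the project team",
    "please review the attached report",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Lowercased word counts feed a multinomial Naive Bayes model (Laplace smoothing, alpha=1).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

test_texts = ["claim free cash now", "team meeting on monday", "free lunch for the team"]
test_labels = [1, 0, 0]
predicted = model.predict(test_texts)

# Precision: share of predicted spam that is truly spam.
# Recall: share of true spam that was caught.
precision = precision_score(test_labels, predicted, pos_label=1)
recall = recall_score(test_labels, predicted, pos_label=1)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For spam filtering, precision is often weighted more heavily than recall, since a legitimate message sent to the spam folder tends to cost more than a missed spam message.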
Multi-Dataset ML Analysis
- Applied ML pipelines to iris, diabetes, and thyroid datasets
- Comparative model performance evaluation
- ROC curve analysis and model selection
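One way to carry out ROC-based model selection is sketched below. It assumes a binary task and uses synthetic data from `make_classification` in place of the real datasets; the two models compared are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for a real dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

auc_by_model = {}
roc_points = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # ROC analysis needs scores, not hard labels: use P(class = 1).
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    roc_points[name] = (fpr, tpr)  # ready to plot as tpr against fpr
    auc_by_model[name] = roc_auc_score(y_test, scores)

# Model selection: keep the model with the largest area under the ROC curve.
best_model = max(auc_by_model, key=auc_by_model.get)
print(auc_by_model, best_model)
```

For a multi-class dataset such as iris, `roc_auc_score` accepts `multi_class="ovr"` together with the full probability matrix instead of a single score column.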
Project 2: Advanced Analytics
Apriori Algorithm
- Market basket analysis implementation
- Association rule mining with support/confidence metrics
- Frequent itemset discovery
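For reference, a compact pure-Python sketch of Apriori follows: level-wise candidate generation, pruning by the downward-closure property, then rule extraction by confidence. The function names and the toy baskets are assumptions made for the example and do not reproduce the project's implementation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([item]) for b in baskets for item in b})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)
```

With these thresholds, the only rule that survives is beer -> diapers: every basket containing beer also contains diapers, while the reverse rule reaches only 75% confidence.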
Instacart Transaction Analysis
- Large-scale retail transaction data
- Customer purchase pattern identification
- Product association recommendations
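A small sketch of how product associations might be turned into recommendations, using pair co-occurrence counts and lift. The `order_id` and `product_name` columns and the rows themselves are assumptions for illustration; the real Instacart tables are far larger and would need a more memory-conscious approach.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Made-up rows in a long "one row per product per order" layout.
order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "product_name": ["banana", "milk", "bread", "banana", "milk",
                     "banana", "milk", "eggs", "bread", "eggs"],
})

# One set of products per order.
baskets = order_products.groupby("order_id")["product_name"].apply(set)
n_orders = len(baskets)

# How often each product, and each unordered product pair, appears across orders.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def lift(a, b):
    """P(a and b) / (P(a) * P(b)); values above 1 suggest a positive association."""
    p_ab = pair_counts[tuple(sorted((a, b)))] / n_orders
    return p_ab / ((item_counts[a] / n_orders) * (item_counts[b] / n_orders))


def recommend(product, top_n=3):
    """Rank products co-purchased with `product` by lift, highest first."""
    partners = {x for pair in pair_counts if product in pair for x in pair if x != product}
    return sorted(partners, key=lambda other: lift(product, other), reverse=True)[:top_n]


print(recommend("banana"))
```

Ranking by lift rather than raw co-occurrence keeps universally popular products from dominating every recommendation list.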
Cluster Analysis
- K-means and hierarchical clustering
- Cluster quality evaluation (silhouette scores)
- Dendrogram visualization
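The sketch below shows one way to combine these pieces: choose k for k-means by silhouette score, fit a hierarchical model for comparison, and build the linkage matrix behind a dendrogram. The blob data is synthetic, and SciPy (not named in the tools list) is assumed to be available for the linkage step.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (centers chosen by hand for the example).
X, _ = make_blobs(
    n_samples=150, centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6, random_state=0,
)

# Score k-means for several k; silhouette ranges from -1 to 1, higher is better.
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

# Hierarchical (agglomerative) clustering with Ward linkage at the same k.
agglo_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)
agglo_silhouette = silhouette_score(X, agglo_labels)

# Linkage matrix for a dendrogram; no_plot=True returns the layout without drawing.
Z = linkage(X, method="ward")
tree = dendrogram(Z, no_plot=True)

print(best_k, round(silhouette_by_k[best_k], 3), round(agglo_silhouette, 3))
```

Dropping `no_plot=True` draws the dendrogram with matplotlib; large jumps in merge height along the vertical axis suggest natural places to cut the tree.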
COVID-19 Literature Clustering
- CORD-19 research paper analysis
- Text embedding and similarity measures
- Research topic discovery through unsupervised learning
Technical Skills Demonstrated
- Data Preprocessing: Handling missing values, normalization, encoding
- Supervised Learning: Classification, regression, model evaluation
- Unsupervised Learning: Clustering, dimensionality reduction
- Association Mining: Apriori, frequent pattern discovery
- Visualization: feature distributions, ROC curves, dendrograms (matplotlib, seaborn)
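The preprocessing steps named above (missing values, normalization, encoding) can be chained in a single scikit-learn transformer, sketched here under assumed column names. The table, the median imputation strategy, and the choice of one-hot encoding are illustrative, not a record of what each project did.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw table with one missing numeric value and one categorical column.
raw = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "income": [40_000.0, 52_000.0, 85_000.0, 61_000.0],
    "plan": ["basic", "premium", "basic", "standard"],
})

# Numeric columns: fill gaps with the median, then scale to zero mean and unit variance.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        # Categorical column: one indicator column per category.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

features = preprocess.fit_transform(raw)
print(features.shape)
```

Fitting the transformer on training data only and reusing it on test data avoids leaking test-set statistics into the imputation and scaling steps.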
Tools & Libraries
- Python (Jupyter Notebooks)
- pandas, NumPy for data manipulation
- scikit-learn for ML algorithms
- matplotlib, seaborn for visualization
