
Healthcare Fraud Detection – CMS, Kaggle & Synthea Datasets

This project analyzes healthcare fraud patterns using three large-scale datasets:

CMS Medicare Data – Public provider billing records with cost and service metrics.
Kaggle Healthcare Fraud Dataset – Real-world claims data with fraud labels.
Synthea Synthetic Data – Comprehensive synthetic EHR data including patients, conditions, and claims.

📊 Project Objectives:

Explore service volume and financial metrics across providers and states.
Detect patterns of excessive billing and service anomalies.
Prepare datasets for machine learning models focused on fraud detection.
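As a rough illustration of the first two objectives, the sketch below summarizes service volume and billing per state and flags unusually high billing per service with Tukey's IQR rule. The data frame and every column name (`provider_id`, `state`, `total_services`, `total_billed`) are hypothetical placeholders, not the actual CMS, Kaggle, or Synthea schemas, and the IQR rule is one possible choice rather than the project's confirmed method.

```python
import pandas as pd

# Hypothetical provider-level records; the column names are illustrative
# and are NOT taken from the CMS, Kaggle, or Synthea schemas.
df = pd.DataFrame({
    "provider_id": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "state": ["TX", "TX", "TX", "CA", "CA", "CA"],
    "total_services": [120, 135, 110, 140, 125, 130],
    "total_billed": [10000.0, 11000.0, 9500.0, 12000.0, 10500.0, 95000.0],
})

# Service volume and billed amount summarized per state.
state_summary = df.groupby("state")[["total_services", "total_billed"]].sum()

# Billed amount per service as a simple billing-intensity metric.
df["billed_per_service"] = df["total_billed"] / df["total_services"]

# Tukey's rule: flag values above Q3 + 1.5 * IQR as potential anomalies.
q1 = df["billed_per_service"].quantile(0.25)
q3 = df["billed_per_service"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
df["excessive_billing"] = df["billed_per_service"] > upper_fence

# Provider IDs whose billing per service exceeds the upper fence.
flagged = df.loc[df["excessive_billing"], "provider_id"].tolist()
print(state_summary)
print(flagged)
```

A flag from this rule only marks a provider for closer review; it is not evidence of fraud on its own.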

🔧 Tools & Techniques:

Python, Pandas, Seaborn, Scikit-learn, Matplotlib
Data preprocessing, outlier handling, feature creation
Visualization and statistical summary
Data prepared for modeling with textbook methods from An Introduction to Statistical Learning
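The preprocessing steps listed above might look something like the following minimal sketch: percentile clipping for outlier handling, a derived log-scaled feature, and a logistic regression baseline (one of the classification methods covered in An Introduction to Statistical Learning). The data is randomly generated, the column names are hypothetical, and the fraud label is a synthetic stand-in derived from the features, so the resulting score says nothing about real fraud detection performance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated claims; hypothetical columns, not a real dataset schema.
rng = np.random.default_rng(0)
n = 400
claims = pd.DataFrame({
    "claim_amount": rng.lognormal(mean=7.0, sigma=0.5, size=n),
    "num_services": rng.integers(1, 20, size=n),
})

# Synthetic stand-in label: top 10% of amount per service marked as fraud.
ratio = claims["claim_amount"] / claims["num_services"]
claims["is_fraud"] = (ratio > ratio.quantile(0.9)).astype(int)

# Outlier handling: clip claim amounts to the 1st and 99th percentiles.
low, high = claims["claim_amount"].quantile([0.01, 0.99])
claims["claim_amount"] = claims["claim_amount"].clip(lower=low, upper=high)

# Feature creation: log-scaled amount per service.
claims["log_amount_per_service"] = np.log(
    claims["claim_amount"] / claims["num_services"]
)

features = ["claim_amount", "num_services", "log_amount_per_service"]
X_train, X_test, y_train, y_test = train_test_split(
    claims[features], claims["is_fraud"],
    test_size=0.25, random_state=0, stratify=claims["is_fraud"],
)

# Standardize features, then fit a logistic regression baseline.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Because fraud labels are typically imbalanced, accuracy alone can be misleading; precision, recall, or a confusion matrix would usually be worth reporting alongside it.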

Course Project – Statistical Learning (Spring 2025)
Team Members: Nhan, Tan, Andre