erickyegon/README.md

Erick Kiprotich Yegon, PhD

Data Scientist · Healthcare AI & Analytics · Real-World Evidence · Causal Inference

LinkedIn · ORCID · Portfolio · Email

📍 Richmond, Kentucky, USA  |  🇺🇸 U.S. Permanent Resident — No Sponsorship Required


What I Do

I design and ship production data science systems — ML pipelines, causal inference engines, AI platforms, and real-time analytics infrastructure — applied to healthcare and population health problems at scale.

My background combines hands-on engineering with deep quantitative methodology: I build the models and I understand the math behind them.

Core areas:

  • 🤖 AI / LLM Systems — RAG pipelines, multi-agent architectures, healthcare Q&A platforms
  • 🧠 Machine Learning & MLOps — end-to-end pipelines, model validation, SHAP explainability, CI/CD
  • 📊 Healthcare Analytics — risk stratification, population health modeling, clinical decision intelligence
  • 🔬 Causal Inference & RWE — PSM, DiD, ITS, TMLE, SuperLearner — production-grade, not just academic

Philosophy: Models that don't deploy don't matter. Data science should produce systems, not papers.


Impact at a Glance

| What I Built | Result |
| --- | --- |
| ML predictive models for health outcomes | ~30% improvement in prediction accuracy |
| Automated data pipelines (ClickHouse + Python + dbt) | Reporting latency: 10–14 days → real-time |
| Causal inference & RWE studies | 25+ production studies informing program decisions |
| Medicare risk adjustment pipeline (U.S.) | Validated ATT of −$391/member, p<0.0001 |
| Healthcare analytics platforms | Scale: 8.5M+ individuals across multiple health systems |
| Peer-reviewed publications | 30+ articles incl. The Lancet Global Health |

Featured Projects

🤖 AI & LLM Systems

| Project | Description | Stack |
| --- | --- | --- |
| AI-Powered Research Assistant | Production RAG platform for scientific paper intelligence with modular LangGraph workflows | Python · LangGraph · LangChain · ChromaDB · FastAPI |
| Automated Research & Report Generation | Multi-agent AI system for research retrieval, synthesis and structured reporting | Python · LangGraph · FastAPI |
| MultiAgent Research Graph | AI knowledge graph generator from natural language queries | Python · LangGraph · LLMs |
| Healthcare Q&A RAG Platform | Enterprise healthcare knowledge retrieval with vector search and RBAC | Python · FastAPI · ChromaDB |

🧠 Machine Learning & MLOps

| Project | Description | Stack |
| --- | --- | --- |
| Medicare Risk Adjustment Pipeline | Validated U.S. Medicare RAF pipeline: ATT −$391/member, p<0.0001 | Python · R · SQL |
| Insurance Premium Prediction | End-to-end ML pipeline with CI/CD, MLflow tracking and SHAP explainability | Python · XGBoost · MLflow · SHAP |
| DHS RAG System | Semantic intelligence system for Demographic & Health Survey datasets | Python · RAG · Vector Search |
| Multimodal PDF RAG System | Document intelligence platform with OCR, table extraction and semantic search | Python · FastAPI · React |

📊 Healthcare Data Science

| Project | Description | Stack |
| --- | --- | --- |
| Medical Diagnosis AI | ML prototype for clinical diagnostic support | Python · scikit-learn |
| KDHS Memory Bot | Multimodal RAG chatbot for large public health survey datasets | Python · OCR · Vector DB |
| Kenya Community Health AI | AI analytics platform integrating national digital health systems | Python · Multi-Agent AI |

Technical Stack

Languages: Python · R · SQL

Machine Learning: scikit-learn · XGBoost · PyTorch · TensorFlow · MLflow · SHAP · Survival models

AI / LLM: LangChain · LangGraph · RAG · Vector Databases (ChromaDB, Pinecone) · Multi-Agent Systems · Prompt Engineering

Data Infrastructure: AWS (Redshift · Glue · SageMaker · S3) · ClickHouse · PostgreSQL · dbt · FastAPI · Docker · Airflow

Visualization & BI: Power BI · Tableau · Plotly · ggplot2

Causal & Statistical Methods: PSM · Difference-in-Differences · Interrupted Time Series · TMLE · SuperLearner · Bayesian modeling · Mixed-effects models · Pharmacoepidemiology


Education

PhD in Epidemiology (Quantitative Methods, Causal Inference & Health Data Science)

Advanced training in study design, statistical theory, and evidence generation, applied directly to ML model validation, experiment design, and real-world evidence production.

MSc in Health Systems Management

BSc in Statistics

Certifications:

  • Stanford University — Machine Learning in Medicine
  • AWS Certified Data Science & Analytics
  • Google Data Analytics Professional Certificate
  • DataCamp Machine Learning Scientist Track
  • Generative AI (multiple platforms)

Why PhD + Data Science?

A common assumption: PhD = academic researcher = not hands-on.

That's not my profile.

My PhD is in quantitative epidemiology — which means advanced statistics, causal modeling, experimental design, and evidence validation. These are the same foundations that make a data scientist rigorous: knowing why a model works, not just that it works.

In practice, I:

  • Build and ship ML pipelines, not just analyze data
  • Design causal inference studies that hold up to scrutiny
  • Write production Python and SQL, not just R markdown
  • Lead analytics engineering alongside research

The PhD makes the data science better. It doesn't replace it.


Open To

Hands-on and leadership roles across data science, healthcare analytics, and AI:

  • Senior / Principal Data Scientist
  • Healthcare Data Scientist
  • Clinical Data Scientist
  • Population Health Analyst / Analytics Lead
  • Real-World Evidence Scientist / Analyst
  • HEOR Data Scientist
  • Decision Science / Advanced Analytics
  • Director / Associate Director, Data Science or Epidemiology

Target sectors: Pharma · Biotech · CRO · Health tech · Payers & Insurers · Clinical AI · Population health


Data Science · Healthcare Analytics · AI Systems · Causal Inference · Real-World Evidence

📩 keyegon@gmail.com  |  🔗 LinkedIn  |  🌐 Portfolio

Popular repositories

  1. AI-Powered-Research-Assistant-for-Scientific-Papers (Public)

    AI-Powered Research Assistant for Scientific Papers leverages LangGraph, LangServe, and LangChain with Euri LLM to deliver a robust, production-grade platform. It features modular workflows, secure…

    Python 2

  2. Zomato-SQL-Project (Public)

    1 2

  3. Loan-Prediction-Using-PowerBI (Public)

    1

  4. MultiAgentResearchGraph (Public)

    AI research system generating interactive knowledge graphs from natural language queries. Uses three specialized LLM agents (Research, Summarizer, Mapper) orchestrated by LangGraph to transform web…

    Python 1

  5. EnergyDrinkConsumerBehavior-PriceRangePrediction (Public)

    End-to-end ML project predicting energy drink price ranges using survey data. Includes data cleaning, feature engineering, EDA, and MLOps pipeline with MLflow tracking, Great Expectations validatio…

    Python 1

  6. autogen-dsa-solver (Public)

    🧠 A smart multi-agent system leveraging AutoGen to solve DSA problems. It provides step-by-step solutions, generates test cases, and executes code securely via Docker. Features a web UI/CLI & suppo…

    Python 1