DataPilot AI

Your Intelligent Data Analysis Copilot




📑 Table of Contents

🌟 Overview • ✨ Features • 🏗️ Architecture • 🚀 Quick Start • 📖 Documentation
🤖 AutoML • 📈 Time Series • 🔮 Explainability • 🧹 Preprocessing • 🧠 AI Insights
📊 Visualizations • 🐳 Docker • 🗺️ Roadmap • 🤝 Contributing • ❓ FAQ • 💖 Support

🌟 Overview

🎯 What is DataPilot AI?

DataPilot AI is a comprehensive, production-ready data science framework that transforms how you work with data. It combines the power of automated machine learning, explainable AI, and intelligent insights into one seamless toolkit.

"From raw data to actionable insights in minutes, not hours."

Whether you're a data scientist seeking to accelerate workflows, a business analyst needing quick insights, or a developer integrating ML into applications — DataPilot AI has you covered.

10+ ML Algorithms • Auto Preprocessing
SHAP & LIME • Time Series Forecasting
Interactive Dashboard • One-Click Reports

🎪 Key Highlights

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                                                                 │
│   🔥 ZERO-CONFIG AUTOML        📊 BEAUTIFUL VISUALIZATIONS    🧠 AI INSIGHTS   │
│   Train 10+ models with        Publication-ready charts       Smart pattern    │
│   one line of code             with Plotly & Seaborn          detection        │
│                                                                                 │
│   ⚡ BLAZING FAST              🔍 EXPLAINABLE AI              🌐 WEB DASHBOARD │
│   Optimized algorithms         SHAP & LIME built-in          Streamlit UI      │
│   with XGBoost & LightGBM      for model transparency        no coding needed  │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

✨ Features

🔍 Exploratory Data Analysis

Automated Statistical Analysis

  • 📊 Distribution & outlier detection
  • 🔗 Correlation heatmaps
  • 📈 Missing value analysis
  • 📋 Data quality profiling
  • 🎯 Target variable insights

🤖 AutoML Pipeline

Zero-Config Model Training

  • ⚡ 10+ ML algorithms
  • 🎛️ Hyperparameter tuning
  • 📊 Model leaderboard
  • 💾 One-click export
  • 🔄 Cross-validation

📈 Time Series

Forecasting & Analysis

  • 📉 Trend decomposition
  • 🔮 ARIMA/Prophet/ETS
  • ⚠️ Anomaly detection
  • 📅 Seasonality analysis
  • 📊 Confidence intervals

🔮 Model Explainability

Transparent ML Decisions

  • 🎯 SHAP value analysis
  • 🍋 LIME explanations
  • 📊 Feature importance
  • 📈 Partial dependence
  • 🔍 Individual predictions

🧹 Data Preprocessing

Smart Data Cleaning

  • 🔧 Missing value imputation
  • 📏 Feature scaling
  • 🏷️ Categorical encoding
  • 🎯 Outlier treatment
  • ⚖️ Class balancing

🧠 AI Insights

Intelligent Recommendations

  • 🔍 Pattern detection
  • ⚠️ Quality issue alerts
  • 💡 Feature suggestions
  • 📝 Auto report generation
  • 🎯 Actionable insights

🛠️ Tech Stack

🔧 Core Technologies

Python Pandas NumPy SciPy

🤖 Machine Learning & AI

Scikit-learn XGBoost LightGBM CatBoost

📊 Visualization

Plotly Matplotlib Seaborn

🌐 Web & DevOps

Streamlit Docker GitHub Actions

🔍 Explainability

SHAP LIME Statsmodels


🏗️ Architecture

System Overview

                                    ┌─────────────────────────────────────────┐
                                    │           📥 DATA INPUT                 │
                                    │   CSV • Excel • Parquet • DataFrame     │
                                    └─────────────────┬───────────────────────┘
                                                      │
                                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                                 │
│                                    🧠 DATAPILOT AI CORE                                        │
│                                                                                                 │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐       │
│  │  🧹 PREPROCESS   │  │  📊 ANALYSIS     │  │  🤖 ML ENGINE    │  │  🔮 EXPLAINER    │       │
│  │                  │  │                  │  │                  │  │                  │       │
│  │ • Missing Values │  │ • EDA            │  │ • AutoML         │  │ • SHAP Values    │       │
│  │ • Outliers       │  │ • Statistics     │  │ • 10+ Algorithms │  │ • LIME           │       │
│  │ • Encoding       │  │ • Time Series    │  │ • Hyperparameter │  │ • Feature Import │       │
│  │ • Scaling        │  │ • AI Insights    │  │ • CV & Tuning    │  │ • PDP Plots      │       │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘  └──────────────────┘       │
│                                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                      │
                    ┌─────────────────────────────────┼─────────────────────────────────┐
                    │                                 │                                 │
                    ▼                                 ▼                                 ▼
        ┌───────────────────┐             ┌───────────────────┐             ┌───────────────────┐
        │  🎨 DASHBOARD     │             │  💻 CLI           │             │  🐍 PYTHON API    │
        │  Streamlit UI     │             │  Command Line     │             │  Programmatic     │
        │  No-Code          │             │  Automation       │             │  Full Control     │
        └───────────────────┘             └───────────────────┘             └───────────────────┘

📁 Project Structure

📦 DataPilot-AI/
│
├── 🧠 src/                           # Core library modules
│   ├── __init__.py                   # Package exports
│   ├── ai_insights.py                # 🧠 AI-powered insights engine
│   ├── automl.py                     # 🤖 Automated machine learning
│   ├── data_preprocessing.py         # 🧹 Data cleaning & transformation
│   ├── eda.py                        # 📊 Exploratory data analysis
│   ├── explainability.py             # 🔮 SHAP & LIME integrations
│   ├── ml_models.py                  # 📈 ML model training (1200+ lines)
│   ├── time_series.py                # ⏰ Time series forecasting
│   ├── visualization.py              # 🎨 Data visualization utilities
│   ├── report_generator.py           # 📝 Automated report generation
│   └── data_generator.py             # 🎲 Synthetic data generation
│
├── 🎨 dashboard/
│   └── app.py                        # 🌐 Streamlit web interface (550+ lines)
│
├── 🧪 tests/                         # Test suite
│   ├── test_automl.py
│   ├── test_eda.py
│   └── test_preprocessing.py
│
├── 📊 data/
│   └── sample_data.csv               # Sample dataset
│
├── ⚙️ Configuration Files
│   ├── pyproject.toml                # Modern Python config
│   ├── requirements.txt              # Dependencies
│   ├── setup.py                      # Package setup
│   ├── Dockerfile                    # Container config
│   └── .github/workflows/ci.yml      # CI/CD pipeline
│
├── 📚 Documentation
│   ├── README.md                     # You are here! 📍
│   ├── CONTRIBUTING.md               # Contribution guide
│   ├── CHANGELOG.md                  # Version history
│   ├── CODE_OF_CONDUCT.md            # Community guidelines
│   └── SECURITY.md                   # Security policy
│
└── 💻 cli.py                         # Command-line interface

🚀 Quick Start

⚡ Get up and running in under 2 minutes!

Prerequisites

| Requirement | Version | Notes |
|---|---|---|
| 🐍 Python | 3.9+ | 3.11 recommended |
| 📦 pip | Latest | Package manager |
| 🔧 Git | Any | For cloning |

Installation Options

🎯 Option 1: Quick Install (Recommended)

# Clone the repository
git clone https://github.com/VivekGhantiwala/DataPilot-AI.git
cd DataPilot-AI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install package
pip install -e .

📦 Option 2: Dependencies Only

# Clone and install
git clone https://github.com/VivekGhantiwala/DataPilot-AI.git
cd DataPilot-AI

pip install -r requirements.txt

🐳 Option 3: Docker

# Build image
docker build -t datapilot-ai .

# Run container
docker run -p 8501:8501 datapilot-ai

⚙️ Option 4: With Extras

# Install with all optional dependencies
pip install -e ".[all]"

# Or specific extras
pip install -e ".[dev]"      # Development tools
pip install -e ".[ml]"       # Extra ML libraries
pip install -e ".[dashboard]" # Dashboard dependencies

✅ Verify Installation

# Check CLI
python cli.py --help

# Run tests
pytest tests/ -v

# Launch dashboard
python cli.py dashboard

🎉 You're ready to go!


📖 Documentation

Choose Your Path

| 🎨 No Code | ⌨️ CLI | 🐍 Python API |
|---|---|---|
| Interactive Dashboard | Command Line | Full Programmatic Control |
| Upload & Click | Script Automation | Custom Workflows |
| Visual Results | Pipeline Integration | Production Ready |

🎨 Interactive Dashboard

Launch the beautiful Streamlit dashboard for a zero-code experience:

# Using CLI
python cli.py dashboard

# Or directly
streamlit run dashboard/app.py

Then open http://localhost:8501 in your browser.

Dashboard Features: 📊 Data Overview • 📈 EDA • 🔧 Preprocessing • 🤖 ML Training • 🧠 AI Insights • 📥 Export


⌨️ Command Line Interface

# 📊 Exploratory Data Analysis
python cli.py analyze -i data.csv -t target_column -o report.txt

# 🤖 AutoML Training
python cli.py automl -i data.csv -t target --task classification --max-models 10

# 📈 Time Series Forecasting
python cli.py timeseries -i sales.csv --date-column date --value-column sales -f 30

# 🧹 Data Preprocessing
python cli.py preprocess -i raw.csv -o clean.csv --scale --encode

# 📝 Generate Report
python cli.py report -i data.csv -o report.html --title "Analysis Report"

🐍 Python API Examples

🤖 AutoML Pipeline

from src import AutoML
import pandas as pd

# Load your data
data = pd.read_csv("your_data.csv")
X = data.drop(columns=["target"])
y = data["target"]

# Initialize AutoML - it's that simple! 🚀
automl = AutoML(
    task="classification",  # or "regression", "auto"
    max_models=10,          # Number of models to try
    cv_folds=5              # Cross-validation folds
)

# Train all models
automl.fit(X, y)

# View results
print(automl.get_leaderboard())
print(automl.summary())

# Make predictions on new data (same feature columns as X)
predictions = automl.predict(X_new)

# Save best model
automl.save("best_model.pkl")

Output:

╔═══════════════════════════════════════════════════════════════════╗
║                        🏆 Model Leaderboard                        ║
╚═══════════════════════════════════════════════════════════════════╝

 Rank │ Model               │ Accuracy │ F1 Score │ ROC-AUC │ Time(s)
──────┼─────────────────────┼──────────┼──────────┼─────────┼─────────
  1   │ XGBoost             │  0.9421  │  0.9385  │  0.9712 │   2.3
  2   │ LightGBM            │  0.9398  │  0.9362  │  0.9689 │   1.1
  3   │ Random Forest       │  0.9356  │  0.9318  │  0.9645 │   4.7

📊 Exploratory Data Analysis

from src import ExploratoryAnalysis
import pandas as pd

# Load data
data = pd.read_csv("your_data.csv")

# Run comprehensive EDA
eda = ExploratoryAnalysis(data)
eda.print_report()

# Get specific insights
correlations = eda.correlation_analysis()
missing = eda.missing_value_analysis()
outliers = eda.detect_outliers_summary()
stats = eda.get_statistics()

📈 Time Series Forecasting

from src import TimeSeriesAnalyzer
import pandas as pd

# Load time series data
data = pd.read_csv("sales_data.csv")

# Initialize analyzer
ts = TimeSeriesAnalyzer(
    data=data,
    date_column="date",
    value_column="sales"
)

# Run analysis
ts.analyze()

# Generate 30-day forecast
forecast = ts.forecast(periods=30, method="auto")

# Visualize results
ts.plot_forecast(forecast)

# Get detailed report
print(ts.generate_report())

🔮 Model Explainability

from src import ModelExplainer
from sklearn.ensemble import RandomForestClassifier

# Train your model (X_train / y_train prepared beforehand)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Create explainer
explainer = ModelExplainer(
    model=model,
    X_train=X_train,
    feature_names=X_train.columns.tolist(),
    task="classification"
)

# Explain a single prediction
explanation = explainer.explain_prediction(X_test.iloc[0])

# SHAP summary plot
explainer.plot_shap_summary(X_test)

# Feature importance
importance = explainer.get_feature_importance()

# Generate report
report = explainer.generate_report(X_test)

🧹 Data Preprocessing

from src import DataPreprocessor

# Initialize preprocessor
prep = DataPreprocessor()

# Load data
prep.load_data("raw_data.csv")

# Full pipeline (one-liner!)
clean_data = prep.preprocess_pipeline(
    handle_missing=True,
    remove_dups=True,
    handle_outliers_flag=True,
    scale=True,
    encode=True
)

# Or step-by-step with full control
prep.handle_missing_values(strategy="auto")
prep.remove_duplicates()
prep.handle_outliers(method="clip")  # or "iqr", "zscore"
prep.scale_features(method="standard")  # or "minmax", "robust"
prep.encode_categorical(method="onehot")  # or "label"

🧠 AI-Powered Insights

from src import AIInsights
import pandas as pd

data = pd.read_csv("your_data.csv")

# Initialize AI insights engine
ai = AIInsights(data)

# Generate automated report
report = ai.generate_automated_report(target_column="target")
print(report)

# Detect data quality issues
issues = ai.detect_data_quality_issues()

# Get smart recommendations
recommendations = ai.generate_recommendations(task_type="classification")

# Quick insights summary
quick = ai.get_quick_insights()

🤖 AutoML Pipeline

🚀 End-to-End Automated Machine Learning

Zero Config • 10+ Algorithms • Auto Tuning

🔄 Pipeline Flow

┌──────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                      │
│                            🤖 AUTOML PIPELINE WORKFLOW                               │
│                                                                                      │
│   ┌─────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌──────┐ │
│   │  DATA   │───▶│  AUTO       │───▶│  MODEL      │───▶│  HYPER      │───▶│DEPLOY│ │
│   │  INPUT  │    │  PREPROCESS │    │  SELECTION  │    │  TUNING     │    │      │ │
│   └─────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └──────┘ │
│       │                │                  │                  │                │     │
│       ▼                ▼                  ▼                  ▼                ▼     │
│   CSV/Excel      • Missing Values    • Task Detection   • Grid Search    • Export  │
│   DataFrame      • Encoding          • Algorithm Pool   • Random Search  • Predict │
│   Parquet        • Scaling           • Cross-Valid      • Best Params    • Serve   │
│                                                                                      │
└──────────────────────────────────────────────────────────────────────────────────────┘
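The "HYPER TUNING" stage in the diagram corresponds to standard grid or random search over each candidate's hyperparameters. As a rough illustration (not DataPilot AI's internal code), the same idea in plain scikit-learn:

```python
# Illustrative sketch of the hyperparameter-tuning stage using
# scikit-learn's GridSearchCV -- a stand-in, not DataPilot's internals.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,                    # cross-validation, as in the pipeline diagram
    scoring="accuracy",
)
grid.fit(X, y)

print(grid.best_params_)     # best hyperparameters found by the search
print(round(grid.best_score_, 3))
```

Random search (`RandomizedSearchCV`) trades exhaustive coverage for speed on larger grids.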

📊 Supported Algorithms

🎯 Classification Models

| # | Algorithm | Library | Best For |
|---|---|---|---|
| 1 | 🌲 Random Forest | scikit-learn | Robust baseline |
| 2 | 🚀 XGBoost | xgboost | High performance |
| 3 | ⚡ LightGBM | lightgbm | Large datasets |
| 4 | 🐱 CatBoost | catboost | Categorical data |
| 5 | 📈 Logistic Regression | scikit-learn | Interpretable |
| 6 | 🎯 SVM | scikit-learn | High dimensions |
| 7 | 🏠 KNN | scikit-learn | Non-parametric |
| 8 | 🌳 Decision Tree | scikit-learn | Explainable |
| 9 | 🔄 AdaBoost | scikit-learn | Adaptive |
| 10 | 🌿 Extra Trees | scikit-learn | Variance reduction |

📈 Regression Models

| # | Algorithm | Library | Best For |
|---|---|---|---|
| 1 | 📏 Linear Regression | scikit-learn | Simple baseline |
| 2 | 🎚️ Ridge | scikit-learn | L2 regularization |
| 3 | 🎚️ Lasso | scikit-learn | L1 regularization |
| 4 | 🌲 Random Forest | scikit-learn | Non-linear |
| 5 | 🚀 XGBoost | xgboost | High performance |
| 6 | ⚡ LightGBM | lightgbm | Speed |
| 7 | 🔗 ElasticNet | scikit-learn | L1+L2 combined |
| 8 | 🎯 SVR | scikit-learn | Kernel methods |
| 9 | 📊 Gradient Boosting | scikit-learn | Sequential |
| 10 | 🌿 Extra Trees | scikit-learn | Ensemble |
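Conceptually, ranking these candidates comes down to cross-validating each one and sorting by score. A minimal sketch of that idea in plain scikit-learn (illustration only; AutoML's actual implementation may differ):

```python
# Minimal model-leaderboard sketch: cross-validate each candidate
# and rank by mean score. Two models shown for brevity.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

leaderboard = sorted(
    ((name, cross_val_score(m, X, y, cv=5).mean())
     for name, m in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, score) in enumerate(leaderboard, 1):
    print(f"{rank}. {name}: {score:.4f}")
```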

⏰ Time Series Algorithms

| Method | Library | Best For | Features |
|---|---|---|---|
| 📈 ARIMA | statsmodels | Non-seasonal data | Trend, differencing |
| 🔄 SARIMA | statsmodels | Seasonal patterns | Seasonality components |
| 📊 Exponential Smoothing | statsmodels | Trend + seasonality | Level, trend, season |
| 🔮 Prophet | prophet | Business forecasts | Holidays, events |

🏆 Sample Leaderboard Output

╔═══════════════════════════════════════════════════════════════════════════════════╗
║                           🏆 AUTOML MODEL LEADERBOARD                             ║
╠═══════════════════════════════════════════════════════════════════════════════════╣
║                                                                                   ║
║  Rank │ Model               │ Accuracy │ Precision │ Recall │ F1-Score │ AUC     ║
║ ──────┼─────────────────────┼──────────┼───────────┼────────┼──────────┼─────────║
║   🥇  │ XGBoost             │  0.9421  │   0.9398  │ 0.9445 │  0.9421  │ 0.9712  ║
║   🥈  │ LightGBM            │  0.9398  │   0.9375  │ 0.9422 │  0.9398  │ 0.9689  ║
║   🥉  │ Random Forest       │  0.9356  │   0.9334  │ 0.9379 │  0.9356  │ 0.9645  ║
║   4   │ CatBoost            │  0.9312  │   0.9290  │ 0.9335 │  0.9312  │ 0.9612  ║
║   5   │ Gradient Boosting   │  0.9289  │   0.9267  │ 0.9312 │  0.9289  │ 0.9601  ║
║   6   │ Extra Trees         │  0.9234  │   0.9212  │ 0.9257 │  0.9234  │ 0.9567  ║
║   7   │ AdaBoost            │  0.9156  │   0.9134  │ 0.9179 │  0.9156  │ 0.9523  ║
║   8   │ SVM                 │  0.9089  │   0.9067  │ 0.9112 │  0.9089  │ 0.9478  ║
║   9   │ KNN                 │  0.8945  │   0.8923  │ 0.8968 │  0.8945  │ 0.9389  ║
║  10   │ Logistic Regression │  0.8823  │   0.8801  │ 0.8846 │  0.8823  │ 0.9312  ║
║                                                                                   ║
╚═══════════════════════════════════════════════════════════════════════════════════╝
                     ✅ Best Model: XGBoost (Accuracy: 94.21%)

📊 Visualization Gallery

Sample Outputs

┌────────────────────────────────────────────────────────────────────────────┐
│                                                                            │
│   ╔════════════════════════════════════════════════════════════════════╗   │
│   ║               📊 EXPLORATORY DATA ANALYSIS REPORT                  ║   │
│   ╚════════════════════════════════════════════════════════════════════╝   │
│                                                                            │
│   📌 Dataset Overview                                                      │
│   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   │
│   • Total Rows: 10,000                                                     │
│   • Total Columns: 25                                                      │
│   • Memory Usage: 2.4 MB                                                   │
│   • Numerical Columns: 18                                                  │
│   • Categorical Columns: 7                                                 │
│                                                                            │
│   📈 Missing Values Analysis                                               │
│   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   │
│   • income: 5.0% missing                                                   │
│   • credit_score: 3.0% missing                                             │
│                                                                            │
│   🔗 Top Correlations                                                      │
│   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   │
│   • income ↔ savings: 0.89                                                 │
│   • age ↔ years_employed: 0.76                                             │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Available Visualizations

| Type | Description | Method |
|---|---|---|
| 📊 Distribution | Histograms with KDE | `plot_distribution()` |
| 📈 Box Plots | Outlier visualization | `plot_boxplot()` |
| 🔥 Heatmaps | Correlation matrices | `plot_correlation_heatmap()` |
| 🎯 Scatter | Relationship analysis | `plot_scatter()` |
| 📉 Time Series | Temporal patterns | `plot_time_series()` |
| 🏆 Feature Importance | Model insights | `plot_feature_importance()` |
| 🎨 Dashboard | Multi-plot overview | `create_dashboard()` |
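DataPilot's own methods wrap Plotly and Seaborn; as a dependency-light stand-in, the core of a correlation heatmap is just a pandas correlation matrix rendered with matplotlib:

```python
# Stand-in for plot_correlation_heatmap(): pandas corr + matplotlib.
# Not DataPilot's implementation -- just the underlying idea.
import matplotlib
matplotlib.use("Agg")          # headless backend; safe on servers/CI
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df["e"] = df["a"] * 0.9 + rng.normal(0, 0.1, 100)   # strongly correlated pair

corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig("correlation_heatmap.png")
```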

🐳 Docker Deployment

🚢 Deploy Anywhere with Docker

# Build the image
docker build -t datapilot-ai .

# Run with default settings
docker run -p 8501:8501 datapilot-ai

# Run with data volume
docker run -p 8501:8501 -v $(pwd)/data:/app/data datapilot-ai

# Run with environment variables
docker run -p 8501:8501 -e DEBUG=false datapilot-ai

Docker Compose

version: '3.8'
services:
  datapilot:
    build: .
    ports:
      - "8501:8501"
    volumes:
      - ./data:/app/data
    environment:
      - DEBUG=false
      - AUTOML_MAX_MODELS=10

🗺️ Roadmap

📅 Development Timeline

✅ Completed (v1.0.0)

| Feature | Status | Version |
|---|---|---|
| 🔍 Core EDA Module | ✅ Done | v1.0 |
| 🤖 AutoML Pipeline | ✅ Done | v1.0 |
| 🧹 Data Preprocessing | ✅ Done | v1.0 |
| 📈 Time Series Analysis | ✅ Done | v1.0 |
| 🔮 Model Explainability | ✅ Done | v1.0 |
| 🎨 Streamlit Dashboard | ✅ Done | v1.0 |
| 💻 CLI Interface | ✅ Done | v1.0 |
| 🐳 Docker Support | ✅ Done | v1.0 |

🔄 In Progress (v1.1.0)

| Feature | Progress | Expected |
|---|---|---|
| 🧠 Deep Learning Integration | 🟡🟡🟡⚪⚪ 60% | Q1 2026 |
| 💬 NLP Query Interface | 🟡🟡⚪⚪⚪ 40% | Q1 2026 |
| 📊 Advanced Visualizations | 🟡🟡🟡🟡⚪ 80% | Q1 2026 |

📋 Planned (v2.0.0+)

| Feature | Priority | Timeline |
|---|---|---|
| ☁️ Cloud Deployment Templates | 🔴 High | Q2 2026 |
| ⚡ Real-time Streaming Analysis | 🔴 High | Q2 2026 |
| 🗄️ Feature Store Integration | 🟠 Medium | Q3 2026 |
| 🔄 MLOps Pipeline | 🟠 Medium | Q3 2026 |
| 📈 Model Monitoring | 🟡 Normal | Q4 2026 |
| 🧪 A/B Testing Framework | 🟡 Normal | Q4 2026 |

📊 Overall Progress

Completed    ████████████████████████████████░░░░░░░░  80%
In Progress  ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░  30%
Planned      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0%

🤝 Contributing

We 💖 Contributors!

Quick Start for Contributors

# 1. Fork and clone
git clone https://github.com/YOUR_USERNAME/DataPilot-AI.git
cd DataPilot-AI

# 2. Create feature branch
git checkout -b feature/amazing-feature

# 3. Install dev dependencies
pip install -e ".[dev]"

# 4. Make changes and test
pytest tests/ -v
black src/ tests/
flake8 src/ tests/

# 5. Commit with conventional commits
git commit -m "feat(automl): add amazing feature"

# 6. Push and create PR
git push origin feature/amazing-feature

Contribution Types

| Type | Description | Label |
|---|---|---|
| 🐛 Bug Fix | Fix existing issues | `bug` |
| ✨ Feature | New functionality | `enhancement` |
| 📚 Docs | Documentation improvements | `documentation` |
| 🧪 Tests | Add or improve tests | `testing` |
| 🎨 Style | Code style/formatting | `style` |
| ♻️ Refactor | Code improvements | `refactor` |

📖 Read the full guide: CONTRIBUTING.md


❓ Frequently Asked Questions

🤔 What makes DataPilot AI different from other AutoML tools?

DataPilot AI combines AutoML, Explainable AI, Time Series, and AI Insights in one unified toolkit. Unlike tools that focus on just model training, we provide end-to-end coverage from data exploration to model explanation.

🐍 What Python versions are supported?

We support Python 3.9, 3.10, 3.11, and 3.12. Python 3.11 is recommended for best performance.

💾 Can I use my own models with the explainability module?

Yes! The ModelExplainer class works with any scikit-learn compatible model. Just pass your trained model, and you'll get SHAP/LIME explanations instantly.
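SHAP and LIME require their own packages; as a dependency-light illustration of the same "explain any fitted estimator" idea, scikit-learn's `permutation_importance` is also model-agnostic (this is a stand-in, not the `ModelExplainer` API):

```python
# Model-agnostic feature importance without SHAP/LIME installed:
# scikit-learn's permutation_importance works on any fitted estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```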

🌐 Can I deploy the dashboard to the cloud?

Absolutely! The Streamlit dashboard can be deployed to:

  • Streamlit Cloud (free tier available)
  • Heroku
  • AWS/GCP/Azure with Docker
  • Any platform supporting Docker containers

📊 What data formats are supported?

  • CSV (.csv)
  • Excel (.xlsx, .xls)
  • Parquet (.parquet)
  • JSON (.json)
  • Pandas DataFrames (programmatic)
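Each of these formats maps to a standard pandas reader, so loading is uniform regardless of source. A small round-trip sketch (file names are illustrative):

```python
# Each supported format maps to a standard pandas reader.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]})

# CSV round trip
df.to_csv("sample.csv", index=False)
loaded_csv = pd.read_csv("sample.csv")

# JSON round trip
df.to_json("sample.json", orient="records")
loaded_json = pd.read_json("sample.json")

# Excel needs openpyxl; Parquet needs pyarrow or fastparquet:
#   pd.read_excel("file.xlsx"); pd.read_parquet("file.parquet")
print(loaded_csv.equals(df))
```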

⚡ How fast is the AutoML pipeline?

Speed depends on data size and model count, but typical benchmarks:

  • 1,000 rows, 10 models: ~30 seconds
  • 10,000 rows, 10 models: ~2-5 minutes
  • 100,000 rows, 10 models: ~10-20 minutes

LightGBM and XGBoost are particularly optimized for speed.

🔧 Can I customize the preprocessing pipeline?

Yes! You can either use the one-liner preprocess_pipeline() or chain individual methods (handle_missing_values(), scale_features(), etc.) for full control.


💖 Support & Sponsorship

If DataPilot AI helps your work, consider supporting!


Star • Fork • Sponsor



📬 Get in Touch

Report Bug • Request Feature • Discussions

Email • GitHub


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 DataPilot AI Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software...

🙏 Acknowledgments

Built with Amazing Open Source Projects

Scikit-learn XGBoost LightGBM SHAP LIME Streamlit Plotly






Thanks for visiting! Star ⭐ this repo if you found it helpful!


Made with ❤️ by Vivek Ghantiwala and the DataPilot AI Community


