SIVerilyDP/pandas-profiling-app

Pandas Profiling App

A custom Workbench application for generating comprehensive data profiling reports from BigQuery datasets using ydata-profiling.

Features

📋 Data Preview

  • Browse BigQuery datasets and tables
  • Load data with configurable row limits
  • View first/last rows
  • Inspect column types, null values, and statistics
  • Export data to CSV or Parquet

🐍 Python Code Editor

  • Interactive Python code execution environment
  • Write and test pandas operations
  • Transform and analyze data on the fly
  • Pre-populated with common examples

📈 Profile Reports

  • Comprehensive data profiling with ydata-profiling
  • Dataset overview (rows, columns, missing values, duplicates)
  • Variable types and distributions
  • Correlation matrices (Pearson, Spearman, Kendall)
  • Missing value patterns
  • Detailed univariate analysis
  • Interactive visualizations
  • Download reports as HTML

Tech Stack

  • Framework: Streamlit
  • Language: Python 3.11
  • Data Processing: pandas, numpy
  • Profiling: ydata-profiling
  • Data Source: Google BigQuery
  • Visualization: plotly

Project Structure

pandas-profiling-app/
├── .devcontainer/
│   └── devcontainer.json    # Workbench dev container config
├── Dockerfile                # Container image definition
├── requirements.txt          # Python dependencies
├── app.py                    # Main Streamlit application
└── README.md                 # This file

Deployment

Prerequisites

  • Google Cloud Workbench environment
  • BigQuery datasets in your GCP project
  • GitHub repository

Deploy to Workbench

  1. Create GitHub Repository

    git init
    git add .
    git commit -m "Initial commit: Pandas profiling app"
    git remote add origin https://github.com/YOUR_USERNAME/pandas-profiling-app.git
    git push -u origin main
  2. Deploy in Workbench

    • Go to Workbench UI
    • Create new custom application
    • Repository: https://github.com/YOUR_USERNAME/pandas-profiling-app
    • Branch: main
    • Folder path: (leave blank)
    • Cloud: gcp
  3. Access the App

    • Wait for deployment to complete
    • Open the application URL
    • Start profiling your data!

Usage Guide

Step 1: Select Data Source

  1. In the sidebar, select a BigQuery dataset
  2. Choose a table from the dataset
  3. Adjust the row limit (default: 10,000 rows)
  4. Click "📥 Load Data"
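
The load step above can be sketched roughly as follows. This is not taken from app.py, which is not shown here: the function names `build_preview_query` and `load_table` are hypothetical, and the identifier check deliberately accepts only a conservative subset of the names BigQuery allows. The BigQuery import is kept inside `load_table` so the query builder can be tried without cloud access.

```python
import re

# Conservative allow-list for project, dataset, and table names (an assumption
# of this sketch; BigQuery itself permits a wider range of characters).
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def build_preview_query(project: str, dataset: str, table: str,
                        row_limit: int = 10_000) -> str:
    """Return a SELECT statement capped at row_limit rows."""
    for part in (project, dataset, table):
        if not _SAFE_NAME.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    if row_limit <= 0:
        raise ValueError("row_limit must be positive")
    return f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {int(row_limit)}"


def load_table(project: str, dataset: str, table: str, row_limit: int = 10_000):
    """Run the query and return a pandas DataFrame (needs google-cloud-bigquery)."""
    from google.cloud import bigquery  # imported lazily; uses Application Default Credentials

    client = bigquery.Client(project=project)
    sql = build_preview_query(project, dataset, table, row_limit)
    return client.query(sql).to_dataframe()


query = build_preview_query("my-project", "sales", "orders", 500)
```

Note that `LIMIT` returns an arbitrary set of rows rather than a random sample, so a preview loaded this way may not be representative of the full table.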

Step 2: Preview Your Data

Navigate to the "📋 Data Preview" tab to:

  • View first/last 20 rows
  • Inspect column information
  • Review basic statistics
  • Export data if needed
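
For readers who want to reproduce the preview outside the app, the sketch below shows one way to compute the same kind of information with pandas. The helper name `column_info` and the sample data are illustrative assumptions, not part of the app.

```python
import pandas as pd


def column_info(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, null count, and null percentage."""
    nulls = df.isna().sum()
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": nulls,
        "null_pct": (nulls / max(len(df), 1) * 100).round(1),
    })


# Small made-up table standing in for data loaded from BigQuery.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [0.5, None, 0.9, None],
    "label": ["a", "b", None, "d"],
})

first_rows = df.head(20)          # first 20 rows (fewer if the table is shorter)
last_rows = df.tail(20)           # last 20 rows
info = column_info(df)            # column types and null counts
stats = df.describe()             # basic statistics for numeric columns
csv_bytes = df.to_csv(index=False).encode("utf-8")  # payload for a CSV export
```

Parquet export works similarly with `df.to_parquet(...)`, which additionally requires a Parquet engine such as pyarrow to be installed.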

Step 3: (Optional) Run Python Code

Navigate to the "🐍 Python Code Editor" tab to:

  • Write pandas code to transform your data
  • Execute operations interactively
  • Test different analyses
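
How the editor executes code is not described in this README. One common approach, shown here purely as an assumption, is to run the submitted source with `exec` in a namespace that already contains the loaded data and to capture anything printed. The name `run_user_code` is hypothetical.

```python
import contextlib
import io
import traceback
from typing import Optional, Tuple


def run_user_code(source: str, namespace: dict) -> Tuple[str, Optional[str]]:
    """Execute source with namespace as its globals.

    Returns (captured stdout, formatted traceback or None).
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(compile(source, "<editor>", "exec"), namespace)
    except Exception:
        return buffer.getvalue(), traceback.format_exc()
    return buffer.getvalue(), None


# Plain Python rows stand in for the loaded DataFrame in this sketch.
rows = [{"id": 1, "score": 3}, {"id": 2, "score": 4}]
ns = {"rows": rows}
out, err = run_user_code(
    "print(len(rows))\ntotal = sum(r['score'] for r in rows)", ns
)
bad_out, bad_err = run_user_code("1 / 0", {})
```

Variables assigned by the snippet (here `total`) remain in the namespace afterwards, which is what lets later snippets build on earlier ones. `exec` runs arbitrary code with the app's permissions, so this pattern is only appropriate in an isolated environment.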

Step 4: Generate Profile Report

Navigate to the "📈 Profile Report" tab, then:

  1. Choose profiling mode (Explorative or Minimal)
  2. Click "🔍 Generate Profile Report"
  3. Wait for the report to generate
  4. Download the report as HTML for sharing

Configuration

Row Limits

  • Default: 10,000 rows
  • Range: 100 - 50,000 rows
  • Adjust based on dataset size and memory constraints
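
As an illustration of these bounds, the sketch below clamps a requested limit into the documented range. The constant and function names are assumptions; in a Streamlit app the same constraint could also be expressed directly through a slider's `min_value`, `max_value`, and `value` arguments.

```python
MIN_ROWS = 100          # lower bound from the range above
MAX_ROWS = 50_000       # upper bound from the range above
DEFAULT_ROWS = 10_000   # default from the list above


def clamp_row_limit(requested=None) -> int:
    """Return requested forced into [MIN_ROWS, MAX_ROWS]; default if omitted."""
    if requested is None:
        return DEFAULT_ROWS
    return max(MIN_ROWS, min(MAX_ROWS, int(requested)))


limit = clamp_row_limit(1_000_000)  # too large, so it is capped at MAX_ROWS
```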

Profiling Modes

  • Explorative: Comprehensive analysis with all correlations and interactions (slower)
  • Minimal: Basic analysis with essential statistics (faster)
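
The two modes plausibly correspond to the `explorative` and `minimal` flags of ydata-profiling's `ProfileReport`; that mapping is an assumption, since app.py is not shown. A minimal sketch, with hypothetical helper names `profile_options` and `build_report`:

```python
def profile_options(mode: str) -> dict:
    """Map a UI mode label to ProfileReport keyword arguments."""
    normalized = mode.strip().lower()
    if normalized == "minimal":
        return {"minimal": True}
    if normalized == "explorative":
        return {"explorative": True}
    raise ValueError(f"unknown profiling mode: {mode!r}")


def build_report(df, mode: str = "Minimal", title: str = "Profile Report"):
    """Create a report object for df (requires ydata-profiling)."""
    from ydata_profiling import ProfileReport  # imported lazily

    return ProfileReport(df, title=title, **profile_options(mode))


options = profile_options("Explorative")
```

With a DataFrame in hand, `build_report(df, "Minimal").to_file("report.html")` would write the HTML report to disk, and `to_html()` would return it as a string suitable for a download button.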

Environment Variables

The app uses Google Cloud Application Default Credentials, which are configured automatically in the Workbench environment. No environment variables or manual authentication steps are required.

Development

Local Development (Optional)

  1. Install Dependencies

    pip install -r requirements.txt
  2. Set up Google Cloud Authentication

    gcloud auth application-default login
  3. Run the App

    streamlit run app.py
  4. Access the App

    Open http://localhost:8501 in your browser

Testing in Workbench

The app automatically starts when the dev container launches. No manual intervention needed.

Troubleshooting

BigQuery Connection Issues

  • Ensure Application Default Credentials are configured
  • Verify you have access to the BigQuery project
  • Check dataset and table permissions
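
A quick way to check the first item is to ask the `google-auth` library for the default credentials. The sketch below wraps `google.auth.default()`, which returns a `(credentials, project_id)` pair; the wrapper name `check_adc` and its `loader` parameter are hypothetical and exist only so the function can be exercised without real credentials.

```python
from typing import Callable, Optional, Tuple


def check_adc(
    loader: Optional[Callable[[], Tuple[object, Optional[str]]]] = None,
) -> str:
    """Return a one-line status describing Application Default Credentials."""
    if loader is None:
        try:
            import google.auth
        except ImportError:
            return "google-auth is not installed"
        loader = google.auth.default
    try:
        _credentials, project = loader()
    except Exception as exc:  # broad on purpose: this is a diagnostic helper
        return f"credentials not available: {exc}"
    return f"credentials found (project: {project or 'not set'})"


# Stand-in loader so the example runs without cloud access.
status = check_adc(lambda: (object(), "demo-project"))
```

Calling `check_adc()` with no argument performs the real lookup. If it reports that credentials are not available in a local setup, running `gcloud auth application-default login` is the usual fix.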

Memory Issues

  • Reduce row limit for large datasets
  • Use "Minimal" profiling mode for faster processing
  • Consider profiling a sample of your data first
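
The last suggestion can be tried with plain pandas. The helper names below are illustrative assumptions; `memory_usage(deep=True)` and `sample` are standard DataFrame methods.

```python
import pandas as pd


def memory_mb(df: pd.DataFrame) -> float:
    """Approximate in-memory size of df in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 5_000,
                         seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Made-up data standing in for a loaded table.
df = pd.DataFrame({"id": range(20_000), "value": [i % 7 for i in range(20_000)]})
small = sample_for_profiling(df, max_rows=5_000)
```

Fixing `seed` keeps the sample stable between runs, which makes two reports on the same table comparable.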

Report Generation Timeout

  • Large datasets may take several minutes to profile
  • Use a smaller row limit or minimal mode
  • Check browser console for errors

Performance Tips

  1. Start Small: Begin with 1,000-5,000 rows to test
  2. Use Minimal Mode: For quick insights on large datasets
  3. Cache Strategy: The app caches BigQuery query results for 5 minutes
  4. Memory Management: Monitor the memory usage indicator
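
The five-minute cache in tip 3 is a time-to-live (TTL) cache. How app.py implements it is not shown; Streamlit apps commonly use `st.cache_data(ttl=...)` for this. The standalone sketch below illustrates the same idea with the standard library only. The decorator name `ttl_cache` and the injectable `clock` are assumptions made so the behaviour can be demonstrated without waiting.

```python
import functools
import time


def ttl_cache(ttl_seconds: float = 300.0, clock=time.monotonic):
    """Cache results per positional-argument tuple for ttl_seconds."""
    def decorator(func):
        store = {}  # args -> (timestamp, value)

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]          # still fresh: reuse the cached value
            value = func(*args)        # missing or expired: recompute
            store[args] = (now, value)
            return value

        return wrapper
    return decorator


fake_now = [0.0]   # controllable clock for the demonstration
calls = []         # records every real (uncached) fetch


@ttl_cache(ttl_seconds=300, clock=lambda: fake_now[0])
def fetch(table: str) -> str:
    calls.append(table)
    return f"rows from {table}"


fetch("orders")      # real fetch
fetch("orders")      # served from the cache
fake_now[0] = 301.0  # more than five minutes later
fetch("orders")      # entry expired, so a real fetch happens again
```

In practice this means that reloading the same table within five minutes should not trigger a new BigQuery query, while changes made to the table in the meantime will not be visible until the entry expires.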

Security Notes

  • The app runs in an isolated dev container
  • Network access is restricted to app-network
  • Uses Google Cloud's secure authentication
  • No credentials stored in code

Support

For issues or questions:

  • Check Workbench documentation
  • Review BigQuery permissions
  • Verify dev container configuration

License

This application is designed for use within Google Cloud Workbench environments.


Built with Streamlit | Powered by ydata-profiling | Deployed on Google Cloud Workbench
