Pandas Profiling App

A custom Workbench application for generating comprehensive data profiling reports from BigQuery datasets using ydata-profiling.

Features

📋 Data Preview

Browse BigQuery datasets and tables
Load data with configurable row limits
View first/last rows
Inspect column types, null values, and statistics
Export data to CSV or Parquet

🐍 Python Code Editor

Interactive Python code execution environment
Write and test pandas operations
Transform and analyze data on the fly
Pre-populated with common examples

📈 Profile Reports

Comprehensive data profiling with ydata-profiling
Dataset overview (rows, columns, missing values, duplicates)
Variable types and distributions
Correlation matrices (Pearson, Spearman, Kendall)
Missing value patterns
Detailed univariate analysis
Interactive visualizations
Download reports as HTML
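
The three correlation measures in the list above can disagree on the same data. As a rough illustration (plain Python, not how ydata-profiling computes them internally), the sketch below compares Pearson with Spearman on a monotonic but non-linear relationship. The helper names `pearson`, `ranks`, and `spearman` are hypothetical, and Kendall is omitted for brevity.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic in x, but not linear

print(f"Pearson:  {pearson(x, y):.3f}")   # below 1, the relationship is not linear
print(f"Spearman: {spearman(x, y):.3f}")  # the rank order is identical
```

With a DataFrame, `df.corr(method="pearson")`, `df.corr(method="spearman")`, and `df.corr(method="kendall")` give the corresponding matrices directly.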

Tech Stack

Framework: Streamlit
Language: Python 3.11
Data Processing: pandas, numpy
Profiling: ydata-profiling
Data Source: Google BigQuery
Visualization: plotly

Project Structure

pandas-profiling-app/
├── .devcontainer/
│   └── devcontainer.json    # Workbench dev container config
├── Dockerfile                # Container image definition
├── requirements.txt          # Python dependencies
├── app.py                    # Main Streamlit application
└── README.md                 # This file

Deployment

Prerequisites

Google Cloud Workbench environment
BigQuery datasets in your GCP project
GitHub repository

Deploy to Workbench

Create GitHub Repository

git init
git add .
git commit -m "Initial commit: Pandas profiling app"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/pandas-profiling-app.git
git push -u origin main

Deploy in Workbench
- Go to Workbench UI
- Create new custom application
- Repository: https://github.com/YOUR_USERNAME/pandas-profiling-app
- Branch: main
- Folder path: (leave blank)
- Cloud: gcp

Access the App
- Wait for deployment to complete
- Open the application URL
- Start profiling your data!

Usage Guide

Step 1: Select Data Source

In the sidebar, select a BigQuery dataset
Choose a table from the dataset
Adjust the row limit (default: 10,000 rows)
Click "📥 Load Data"
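
Loading a table with a row limit most likely comes down to a capped query. The sketch below shows one way to do that. It is not taken from app.py: `build_preview_query` and `load_table` are hypothetical names, and the identifier checks are deliberately stricter than what BigQuery itself accepts.

```python
import re

# Conservative identifier patterns (an assumption for this sketch; BigQuery
# accepts a wider range of characters in table names).
_PROJECT_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9\-]*")
_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def build_preview_query(project: str, dataset: str, table: str, limit: int = 10_000) -> str:
    """Return a SELECT * query for one table, capped at `limit` rows."""
    if not _PROJECT_RE.fullmatch(project):
        raise ValueError(f"unexpected project id: {project!r}")
    for name in (dataset, table):
        if not _NAME_RE.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {int(limit)}"


def load_table(project: str, dataset: str, table: str, limit: int = 10_000):
    """Run the preview query and return the result as a pandas DataFrame."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)  # picks up Application Default Credentials
    sql = build_preview_query(project, dataset, table, limit)
    return client.query(sql).to_dataframe()
```

For example, `build_preview_query("my-project", "sales", "orders", 500)` yields a query against `my-project.sales.orders` with `LIMIT 500`. Keep in mind that `LIMIT` caps the rows returned but generally does not reduce the bytes BigQuery scans for a `SELECT *`.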

Step 2: Preview Your Data

Navigate to the "📋 Data Preview" tab to:

View first/last 20 rows
Inspect column information
Review basic statistics
Export data if needed

Step 3: (Optional) Run Python Code

Navigate to the "🐍 Python Code Editor" tab to:

Write pandas code to transform your data
Execute operations interactively
Test different analyses
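
An editor tab like this usually runs the submitted text with `exec` and shows whatever was printed. The sketch below illustrates that pattern under stated assumptions: `run_snippet` is a hypothetical helper, not the function used in app.py, and the namespace is expected to hold the loaded DataFrame as `df`.

```python
import contextlib
import io
import traceback
from typing import Any, Dict, Optional, Tuple


def run_snippet(code: str, namespace: Dict[str, Any]) -> Tuple[str, Optional[str]]:
    """Execute `code` with `namespace` as its globals.

    Returns (captured stdout, formatted traceback or None). Names assigned
    by the snippet remain in `namespace` after the call.
    """
    buffer = io.StringIO()
    error = None
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # Report the failure to the caller instead of crashing the app.
        error = traceback.format_exc()
    return buffer.getvalue(), error


# Usage with a plain list standing in for the DataFrame:
ns = {"df": [3, 1, 2]}
output, error = run_snippet("print(sorted(df))\nresult = len(df)", ns)
print(output, end="")
```

In the app, the namespace would presumably contain `df` and `pd`. Note that `exec` offers no sandboxing: the snippet runs with the same permissions as the app itself.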

Step 4: Generate Profile Report

Navigate to the "📈 Profile Report" tab to:

Choose profiling mode (Explorative or Minimal)
Click "🔍 Generate Profile Report"
Wait for the report to generate
Download the report as HTML for sharing

Configuration

Row Limits

Default: 10,000 rows
Range: 100 - 50,000 rows
Adjust based on dataset size and memory constraints
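
The bounds above can be enforced with a small clamp before any query is issued. This is a sketch only: the constant names and `clamp_row_limit` are hypothetical, with values taken from the limits listed in this section.

```python
from typing import Optional

MIN_ROWS = 100
MAX_ROWS = 50_000
DEFAULT_ROWS = 10_000


def clamp_row_limit(requested: Optional[int] = None) -> int:
    """Return a row limit inside [MIN_ROWS, MAX_ROWS], or the default if unset."""
    if requested is None:
        return DEFAULT_ROWS
    return max(MIN_ROWS, min(MAX_ROWS, int(requested)))
```

In Streamlit, a widget such as `st.slider` with `min_value=100` and `max_value=50000` enforces the same range in the UI; clamping again on the server side guards against values arriving from elsewhere.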

Profiling Modes

Explorative: Comprehensive analysis with all correlations and interactions (slower)
Minimal: Basic analysis with essential statistics (faster)
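
In ydata-profiling, these two modes correspond to the `minimal` and `explorative` flags of `ProfileReport`. The sketch below shows one way the UI choice could be mapped onto them. `profile_kwargs` and `build_report_html` are hypothetical names, and how app.py actually wires the modes is not shown in this README.

```python
def profile_kwargs(mode: str) -> dict:
    """Translate the UI mode label into ProfileReport keyword arguments."""
    normalized = mode.strip().lower()
    if normalized == "minimal":
        return {"minimal": True}
    if normalized == "explorative":
        return {"explorative": True}
    raise ValueError(f"unknown profiling mode: {mode!r}")


def build_report_html(df, mode: str = "Minimal", title: str = "Profile Report") -> str:
    """Profile `df` in the requested mode and return the report as an HTML string."""
    # Imported lazily: ydata-profiling is a heavy dependency.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title=title, **profile_kwargs(mode))
    return report.to_html()
```

The returned string can be offered for download, for example with `st.download_button("Download report", data=html, file_name="profile_report.html", mime="text/html")`.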

Environment Variables

The app uses Google Cloud Application Default Credentials, which are automatically configured in the Workbench environment. No manual authentication is required.

Development

Local Development (Optional)

Install Dependencies
```
pip install -r requirements.txt
```
Set up Google Cloud Authentication
```
gcloud auth application-default login
```
Run the App
```
streamlit run app.py
```
Access the App
Open http://localhost:8501 in your browser.

Testing in Workbench

The app starts automatically when the dev container launches. No manual intervention is needed.

Troubleshooting

BigQuery Connection Issues

Ensure Application Default Credentials are configured
Verify you have access to the BigQuery project
Check dataset and table permissions
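
To check the first item quickly, you can ask the `google-auth` library whether it can resolve credentials at all. The sketch below wraps `google.auth.default()`, which returns a `(credentials, project_id)` pair. `check_adc` is a hypothetical helper, and its `default_fn` parameter exists only so the check can be exercised without real credentials.

```python
from typing import Callable, Optional


def check_adc(default_fn: Optional[Callable[[], tuple]] = None) -> str:
    """Return a one-line status saying whether Application Default Credentials resolve."""
    try:
        if default_fn is None:
            # Imported lazily so a missing library is reported, not raised.
            import google.auth

            default_fn = google.auth.default
        _credentials, project = default_fn()
    except Exception as exc:  # e.g. google.auth.exceptions.DefaultCredentialsError
        return f"ADC not available: {exc}"
    return f"ADC OK, default project: {project or 'not set'}"


if __name__ == "__main__":
    print(check_adc())
```

If this reports that ADC is not available on a local machine, running `gcloud auth application-default login` is the usual fix. A successful check only confirms that credentials exist, not that they grant access to a particular dataset.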

Memory Issues

Reduce row limit for large datasets
Use "Minimal" profiling mode for faster processing
Consider profiling a sample of your data first

Report Generation Timeout

Large datasets may take several minutes to profile
Use a smaller row limit or minimal mode
Check browser console for errors

Performance Tips

Start Small: Begin with 1,000-5,000 rows to test
Use Minimal Mode: For quick insights on large datasets
Cache Strategy: The app caches BigQuery queries for 5 minutes
Memory Management: Monitor the memory usage indicator
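
The 5-minute cache means a repeated query within that window is served from memory rather than sent to BigQuery again. In Streamlit this is typically done with `@st.cache_data(ttl=300)`, though this README does not confirm which mechanism app.py uses. The plain-Python sketch below shows the underlying idea. `ttl_cache` is a hypothetical decorator, and the `clock` parameter is there only to make expiry easy to demonstrate.

```python
import functools
import time


def ttl_cache(ttl_seconds: float = 300.0, clock=time.monotonic):
    """Cache results per positional-argument tuple for `ttl_seconds`."""

    def decorator(func):
        store = {}  # args -> (expiry time, cached value)

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator


# Usage with a fake clock so expiry can be shown without waiting.
fake_now = [0.0]
calls = []


@ttl_cache(ttl_seconds=300, clock=lambda: fake_now[0])
def run_query(sql: str) -> str:
    calls.append(sql)  # stands in for a real BigQuery round trip
    return sql.upper()


run_query("select 1")
run_query("select 1")   # served from the cache
fake_now[0] = 301.0
run_query("select 1")   # entry expired, so the function runs again
print(len(calls))       # 2
```

A practical consequence: changes made to the source table may not appear in the app until the cached entry expires.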

Security Notes

The app runs in an isolated dev container
Network access restricted to app-network
Uses Google Cloud's secure authentication
No credentials stored in code

Support

For issues or questions:

Check Workbench documentation
Review BigQuery permissions
Verify dev container configuration

License

This application is designed for use within Google Cloud Workbench environments.

Built with Streamlit | Powered by ydata-profiling | Deployed on Google Cloud Workbench
