A custom Workbench application for generating comprehensive data profiling reports from BigQuery datasets using ydata-profiling.
- Browse BigQuery datasets and tables
- Load data with configurable row limits
- View first/last rows
- Inspect column types, null values, and statistics
- Export data to CSV or Parquet
- Interactive Python code execution environment
- Write and test pandas operations
- Transform and analyze data on the fly
- Pre-populated with common examples
- Comprehensive data profiling with ydata-profiling
- Dataset overview (rows, columns, missing values, duplicates)
- Variable types and distributions
- Correlation matrices (Pearson, Spearman, Kendall)
- Missing value patterns
- Detailed univariate analysis
- Interactive visualizations
- Download reports as HTML
- Framework: Streamlit
- Language: Python 3.11
- Data Processing: pandas, numpy
- Profiling: ydata-profiling
- Data Source: Google BigQuery
- Visualization: plotly
pandas-profiling-app/
├── .devcontainer/
│   └── devcontainer.json   # Workbench dev container config
├── Dockerfile              # Container image definition
├── requirements.txt        # Python dependencies
├── app.py                  # Main Streamlit application
└── README.md               # This file
- Google Cloud Workbench environment
- BigQuery datasets in your GCP project
- GitHub repository
- Create GitHub Repository

      git init
      git add .
      git commit -m "Initial commit: Pandas profiling app"
      git branch -M main
      git remote add origin https://github.com/YOUR_USERNAME/pandas-profiling-app.git
      git push -u origin main
- Deploy in Workbench
  - Go to Workbench UI
  - Create new custom application
  - Repository: https://github.com/YOUR_USERNAME/pandas-profiling-app
  - Branch: main
  - Folder path: (leave blank)
  - Cloud: gcp
- Access the App
  - Wait for deployment to complete
  - Open the application URL
  - Start profiling your data!
- In the sidebar, select a BigQuery dataset
- Choose a table from the dataset
- Adjust the row limit (default: 10,000 rows)
- Click "📥 Load Data"
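The loading steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `build_query` and `load_table` are hypothetical helper names, and the identifier check is a simplified assumption about what dataset and table names look like. The BigQuery import is deferred so the query-building part can be tried without the client library installed.

```python
import re

# Simplified assumption: project, dataset and table names use only these characters.
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_\-]+$")


def build_query(project: str, dataset: str, table: str, limit: int = 10_000) -> str:
    """Build a row-limited SELECT for one table (hypothetical helper)."""
    for part in (project, dataset, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unexpected identifier: {part!r}")
    return f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {int(limit)}"


def load_table(project: str, dataset: str, table: str, limit: int = 10_000):
    """Run the query and return a pandas DataFrame (needs google-cloud-bigquery)."""
    from google.cloud import bigquery  # lazy import; uses Application Default Credentials

    client = bigquery.Client(project=project)
    return client.query(build_query(project, dataset, table, limit)).to_dataframe()


sql = build_query("my-project", "sales", "orders", limit=500)
```

Note that `LIMIT` caps the rows returned, not necessarily the bytes BigQuery scans, so a small row limit does not by itself guarantee a cheap query.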
Navigate to the "📋 Data Preview" tab to:
- View first/last 20 rows
- Inspect column information
- Review basic statistics
- Export data if needed
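As a rough idea of what the preview tab computes, the sketch below uses plain pandas on a small made-up DataFrame. The helper name `column_summary` and the sample data are assumptions for illustration; the app's own implementation may differ.

```python
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, null count and null percentage (hypothetical helper)."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "nulls": df.isna().sum(),
            "null_pct": (df.isna().mean() * 100).round(1),
        }
    )


# Made-up sample standing in for a loaded BigQuery table.
df = pd.DataFrame({"id": [1, 2, 3, 4], "city": ["Oslo", None, "Lima", None]})

first_rows = df.head(20)          # first 20 rows (fewer if the table is smaller)
last_rows = df.tail(20)           # last 20 rows
summary = column_summary(df)      # column types and null values
stats = df.describe()             # basic statistics for numeric columns
csv_text = df.to_csv(index=False) # CSV export as a string
```

For Parquet, `df.to_parquet("data.parquet")` works the same way but needs a Parquet engine such as pyarrow installed.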
Navigate to the "🐍 Python Code Editor" tab to:
- Write pandas code to transform your data
- Execute operations interactively
- Test different analyses
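One common way to build this kind of editor is to run the user's snippet with `exec` against a namespace that contains the loaded data, capturing anything it prints. The sketch below shows that pattern with the standard library only; `run_snippet` is a hypothetical name, and whether the app works this way is an assumption. In the app, the namespace would presumably hold the loaded DataFrame as `df`.

```python
import contextlib
import io
import traceback


def run_snippet(code: str, variables: dict) -> tuple[dict, str]:
    """Execute code with the given variables; return the namespace and captured output."""
    namespace = dict(variables)  # copy so the caller's dict is not mutated
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code, namespace)
        except Exception:
            # Show the error to the user instead of crashing the app.
            buffer.write(traceback.format_exc())
    return namespace, buffer.getvalue()


namespace, output = run_snippet(
    "total = sum(values)\nprint(total)",
    {"values": [1, 2, 3]},
)
```

`exec` is not a sandbox: the snippet runs with the same permissions as the app, which is acceptable only because the app runs in an isolated container for a trusted user.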
Navigate to the "📈 Profile Report" tab to:
- Choose profiling mode (Explorative or Minimal)
- Click "🔍 Generate Profile Report"
- Wait for the report to generate
- Download the report as HTML for sharing
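The two profiling modes plausibly map onto the `minimal` and `explorative` flags of ydata-profiling's `ProfileReport`. The sketch below assumes that mapping; `profile_options` and `build_report_html` are hypothetical helper names, and the import is deferred so the option mapping can be checked without the library installed.

```python
def profile_options(mode: str) -> dict:
    """Translate the UI mode name into ProfileReport keyword arguments (assumed mapping)."""
    if mode == "Minimal":
        return {"minimal": True}
    if mode == "Explorative":
        return {"explorative": True}
    raise ValueError(f"unknown profiling mode: {mode!r}")


def build_report_html(df, mode: str = "Minimal", title: str = "Profile Report") -> str:
    """Generate the profile and return it as an HTML string (needs ydata-profiling)."""
    from ydata_profiling import ProfileReport  # lazy import; heavy dependency

    report = ProfileReport(df, title=title, **profile_options(mode))
    return report.to_html()


minimal_kwargs = profile_options("Minimal")
```

In a Streamlit app, the returned HTML string could be offered for download with `st.download_button(..., data=html, file_name="report.html", mime="text/html")`.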
- Default: 10,000 rows
- Range: 100 - 50,000 rows
- Adjust based on dataset size and memory constraints
- Explorative: Comprehensive analysis with all correlations and interactions (slower)
- Minimal: Basic analysis with essential statistics (faster)
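The row-limit bounds listed above amount to a simple clamp. A minimal sketch, using hypothetical constant and function names:

```python
# Bounds taken from the configuration described above.
MIN_ROWS = 100
MAX_ROWS = 50_000
DEFAULT_ROWS = 10_000


def clamp_row_limit(requested=None) -> int:
    """Return a row limit inside the supported range, or the default if none is given."""
    if requested is None:
        return DEFAULT_ROWS
    return max(MIN_ROWS, min(MAX_ROWS, int(requested)))


limit = clamp_row_limit(75_000)
```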
The app uses Google Cloud Application Default Credentials, which are automatically configured in the Workbench environment. No manual authentication required.
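When debugging outside Workbench, it can help to see where Application Default Credentials would come from. The sketch below checks the two local sources: the `GOOGLE_APPLICATION_CREDENTIALS` environment variable and the file written by `gcloud auth application-default login`. The function name `adc_source` is hypothetical, the file path shown is the Linux/macOS default, and this only approximates the lookup that Google's auth library performs.

```python
import os
from pathlib import Path


def adc_source(env=None, home=None) -> str:
    """Report which local ADC source, if any, is present (approximation)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        return f"env:{explicit}"

    # Default location used by gcloud on Linux and macOS.
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    if well_known.is_file():
        return f"file:{well_known}"

    # On Google Cloud, credentials typically come from the attached service account.
    return "metadata-server-or-none"


source = adc_source(env={"GOOGLE_APPLICATION_CREDENTIALS": "/tmp/key.json"}, home="/nonexistent")
```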
- Install Dependencies

      pip install -r requirements.txt

- Set up Google Cloud Authentication

      gcloud auth application-default login

- Run the App

      streamlit run app.py

- Access: open http://localhost:8501
The app automatically starts when the dev container launches. No manual intervention needed.
- Ensure Application Default Credentials are configured
- Verify you have access to the BigQuery project
- Check dataset and table permissions
- Reduce row limit for large datasets
- Use "Minimal" profiling mode for faster processing
- Consider profiling a sample of your data first
- Large datasets may take several minutes to profile
- Use a smaller row limit or minimal mode
- Check browser console for errors
- Start Small: Begin with 1,000-5,000 rows to test
- Use Minimal Mode: For quick insights on large datasets
- Cache Strategy: The app caches BigQuery queries for 5 minutes
- Memory Management: Monitor the memory usage indicator
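The caching and memory tips above could look something like this in code. It is a sketch under assumptions: the README does not say how the cache is implemented, so the use of Streamlit's `st.cache_data` with a TTL is a guess, and `memory_usage_mb` and `cached` are hypothetical names.

```python
import pandas as pd

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute cache window mentioned above


def memory_usage_mb(df: pd.DataFrame) -> float:
    """Return the DataFrame's in-memory size in MiB, including string contents."""
    return float(df.memory_usage(deep=True).sum()) / (1024 ** 2)


def cached(loader):
    """Wrap a loader function with Streamlit's TTL cache (assumed mechanism)."""
    import streamlit as st  # lazy import so the sizing helper works without Streamlit

    return st.cache_data(ttl=CACHE_TTL_SECONDS)(loader)


# Made-up data standing in for a loaded table.
df = pd.DataFrame({"id": range(1_000), "label": ["x" * 10] * 1_000})
size_mb = memory_usage_mb(df)
```

Passing `deep=True` matters for string columns, where the default shallow measurement can badly underestimate the real footprint.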
- The app runs in an isolated dev container
- Network access restricted to app-network
- Uses Google Cloud's secure authentication
- No credentials stored in code
For issues or questions:
- Check Workbench documentation
- Review BigQuery permissions
- Verify dev container configuration
This application is designed for use within Google Cloud Workbench environments.
Built with Streamlit | Powered by ydata-profiling | Deployed on Google Cloud Workbench