A speech-to-text processing solution using Databricks Asset Bundles for infrastructure-as-code and GitHub Actions for automated CI/CD deployment.
This repository implements an end-to-end speech-to-text (STT) pipeline on Databricks:
- Audio files are ingested from Unity Catalog Volumes using Auto Loader
- Data flows through a Bronze → Silver medallion architecture (Spark Declarative Pipelines)
- Transcription is handled by Whisper Large V3 via a Databricks Model Serving endpoint
- NLP enrichment (sentiment, summary, entities, topic, translation) is applied to every transcription
- Both NLP implementations (AI SQL functions and Foundation Model API) are evaluated with MLflow
- Deployment is automated via GitHub Actions with OIDC authentication
- Infrastructure is managed via Databricks Asset Bundles (DAB)
- ✅ Audio Ingestion & Transcription — Auto Loader picks up audio files and Whisper transcribes them
- ✅ NLP Enrichment — Sentiment, summary, entities, topic, and translation via two parallel implementations
- ✅ MLflow Evaluation — Side-by-side quality comparison of AI SQL functions vs Foundation Model API
- ✅ Automated CI/CD — GitHub Actions deploy to Dev and Prod environments
- ✅ Infrastructure as Code — Databricks Asset Bundle with dev/prod targets
- ✅ Dashboard — Databricks AI/BI dashboard for monitoring transcription and NLP results
- ✅ Genie Space — Natural language interface for querying the gold layer tables
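As a rough illustration of the deterministic side of the MLflow evaluation, the checks below validate one enriched record and compute a pass rate for one implementation. This is a minimal sketch, not the repository's evaluation notebook: the field names (`sentiment`, `summary`), the allowed label set, and the word limit are all assumptions made for the example.

```python
# Hypothetical deterministic validators; field names and limits are assumptions.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}


def validate_enrichment(row: dict, max_summary_words: int = 50) -> dict:
    """Return pass/fail flags for one enriched transcription record."""
    sentiment = str(row.get("sentiment") or "").strip().lower()
    summary = str(row.get("summary") or "").strip()
    return {
        "sentiment_valid": sentiment in ALLOWED_SENTIMENTS,
        "summary_present": bool(summary),
        "summary_within_limit": len(summary.split()) <= max_summary_words,
    }


def pass_rate(rows: list) -> float:
    """Share of rows for which every check passes (0.0 for empty input)."""
    if not rows:
        return 0.0
    passed = sum(all(validate_enrichment(r).values()) for r in rows)
    return passed / len(rows)
```

Running `pass_rate` once per implementation gives two comparable numbers; checks of this kind are cheap and repeatable, which is why they usually sit alongside LLM judges rather than replacing them.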
- Databricks workspace(s) with Unity Catalog enabled
  - Two workspaces (recommended for full CI/CD): one for dev, one for prod; each target in databricks.yml points to a separate workspace, ensuring complete environment isolation
  - One workspace (simplified setup): both dev and prod targets deploy to the same workspace, differentiated only by schema name; suitable for personal projects or demos
- GitHub repository with administrative access
- Databricks CLI installed (optional, for local deployment)
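The difference between the two workspace layouts can be expressed as a small check. The sketch below is illustrative only: the `Target` class, the host URLs, and the schema names are hypothetical and do not come from databricks.yml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical view of one bundle target (not the real databricks.yml schema)."""
    name: str
    workspace_host: str
    schema: str


def isolation_mode(dev: Target, prod: Target) -> str:
    """Report how dev and prod are kept apart: by workspace or only by schema."""
    if dev.workspace_host != prod.workspace_host:
        return "workspace"
    if dev.schema != prod.schema:
        return "schema"
    raise ValueError("dev and prod would share the same workspace and schema")


# Two-workspace setup (placeholder hosts).
dev = Target("dev", "https://dev.example.com", "stt")
prod = Target("prod", "https://prod.example.com", "stt")
print(isolation_mode(dev, prod))  # workspace
```

In the single-workspace setup both targets would share a host, so the schema names must differ; the function raises if they do not.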
- Configure Databricks: create the catalog, service principal, and federation policies → See docs/DATABRICKS_SETUP.md
- Configure GitHub Actions: set up environments, variables, and secrets → See docs/GITHUB_ACTIONS_SETUP.md
- Deploy: push to the dev or main branch to trigger automated deployment
Speech-To-Text-With-Databricks/
├── speech_to_text_asset_bundle/ # Databricks Asset Bundle (DAB)
│ ├── databricks.yml # Bundle config: variables, targets (dev/prod)
│ ├── resources/ # Jobs, pipelines, schemas, volumes, dashboard
│ │ ├── stt_audio_transcription.pipeline.yml # Bronze + Silver transcription pipeline
│ │ ├── stt_nlp_enrichment.pipeline.yml # Silver NLP enrichment pipeline
│ │ ├── stt_gold_layer.pipeline.yml # Gold aggregation pipeline
│ │ ├── stt_dashboard.dashboard.yml # AI/BI dashboard resource
│ │ ├── stt_genie.job.yml # Genie Space setup job
│ │ └── stt_main.job.yml # Orchestration job
│ ├── src/ # Python source code and assets
│ │ ├── stt_audio_transcription/ # Bronze + Silver transcription tables
│ │ ├── stt_nlp_enrichment/ # Silver NLP enrichment tables
│ │ ├── stt_gold_layer/ # Gold detail and aggregate tables
│ │ ├── dashboards/ # AI/BI dashboard definition (Lakeview JSON)
│ │ ├── stt_genie/ # Genie Space setup notebook
│ │ └── stt_nlp_evaluation/ # MLflow quality evaluation notebook
│ ├── tests/ # Unit and integration tests
│ └── pyproject.toml # Python dependencies and tooling
├── .github/workflows/ # CI/CD automation
│ ├── sync_git_folder_and_deploy_adb_dev.yml # Deploy to Dev on push to 'dev'
│ └── sync_git_folder_and_deploy_adb_prod.yml # Deploy to Prod on push to 'main'
├── docs/ # Additional documentation
└── README.md # This file
The core Databricks solution. Contains:
- databricks.yml: Bundle configuration with dev and prod targets and all bundle variables
- resources/: YAML definitions for all pipelines, the AI/BI dashboard, the Genie Space setup job, the orchestration job, schemas, and volumes
- src/stt_audio_transcription/: Bronze and Silver transcription pipeline tables
- src/stt_nlp_enrichment/: Silver NLP enrichment tables (two parallel implementations)
- src/stt_gold_layer/: Gold detail and aggregate tables
- src/dashboards/: AI/BI Lakeview dashboard definition
- src/stt_genie/: Notebook that creates/updates the Genie Space via the Databricks SDK
- src/stt_nlp_evaluation/: MLflow GenAI evaluation notebook
- tests/: Unit tests for transformations
For detailed documentation, see speech_to_text_asset_bundle/README.md
GitHub Actions workflows for CI/CD:
- sync_git_folder_and_deploy_adb_dev.yml: Syncs the Git folder and deploys to Dev when code is pushed to the dev branch
- sync_git_folder_and_deploy_adb_prod.yml: Deploys the asset bundle to Prod when code is pushed to the main branch
Both workflows use GitHub OIDC for secure, token-less authentication with Databricks.
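The branch-to-environment mapping can be sketched as a small helper that builds the `databricks bundle deploy` invocation documented in this README. This is an assumption-laden illustration, not the content of the workflow files: the helper name and mapping table are hypothetical, and the Git folder sync step and OIDC login are left out.

```python
# Hypothetical mapping from Git branch to bundle target.
BRANCH_TO_TARGET = {"dev": "dev", "main": "prod"}


def deploy_command(branch: str, service_principal_id: str) -> list:
    """Build the argv for `databricks bundle deploy` for a pushed branch."""
    target = BRANCH_TO_TARGET.get(branch)
    if target is None:
        raise ValueError(f"no deployment target configured for branch {branch!r}")
    return [
        "databricks", "bundle", "deploy",
        "--target", target,
        f"--var=service_principal_id={service_principal_id}",
    ]


# Placeholder ID; a real service principal ID would be supplied by the CI environment.
print(" ".join(deploy_command("main", "<uuid>")))
```

Returning an argument list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.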
The stt_main job orchestrates all four stages: transcription runs first, then NLP enrichment, and finally the gold layer update and the MLflow evaluation run in parallel.
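The stage ordering can be modelled as a dependency graph. The sketch below uses Python's standard `graphlib` module to group the stages into waves that could run concurrently; the task names are hypothetical labels, not necessarily the task keys used in stt_main.job.yml.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (names are illustrative).
DEPENDENCIES = {
    "transcription": set(),
    "nlp_enrichment": {"transcription"},
    "gold_layer": {"nlp_enrichment"},
    "mlflow_evaluation": {"nlp_enrichment"},
}


def execution_waves(dependencies: dict) -> list:
    """Group tasks into waves; every task in a wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(DEPENDENCIES))
# [['transcription'], ['nlp_enrichment'], ['gold_layer', 'mlflow_evaluation']]
```

The last wave contains two tasks because both depend only on NLP enrichment, which is what allows them to run side by side.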
- Spark Declarative Pipelines (SDP): Serverless streaming pipelines with @dlt.table decorators
- Auto Loader: Incremental ingestion from Unity Catalog Volumes
- Whisper Large V3: Foundation Model for audio transcription via Model Serving endpoint
- Databricks AI SQL functions: Built-in ai_analyze_sentiment, ai_summarize, ai_extract, ai_classify, ai_translate
- Foundation Model API: databricks-meta-llama-3-3-70b-instruct via ai_query() for NLP tasks
- MLflow GenAI evaluation: Side-by-side quality scoring with deterministic validators and LLM judges
- Unity Catalog: Centralized data governance and metadata management
- Databricks Asset Bundles: Infrastructure-as-code for multi-environment deployment
- GitHub Actions + OIDC: Secure CI/CD without long-lived tokens
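To make the AI SQL functions path more concrete, the helper below assembles a query that applies all five built-in functions to a transcription column. It is a sketch under stated assumptions: the table name, column name, entity labels, and topic labels are hypothetical, and the real pipeline code in src/stt_nlp_enrichment/ may be structured differently.

```python
def _sql_array(labels) -> str:
    """Render labels as a SQL ARRAY literal; rejects quotes instead of escaping them."""
    for label in labels:
        if "'" in label:
            raise ValueError(f"label must not contain a single quote: {label!r}")
    return "ARRAY(" + ", ".join(f"'{label}'" for label in labels) + ")"


def build_enrichment_sql(
    table: str,
    text_col: str = "transcription",  # assumed column name
    entity_labels=("person", "organization", "location"),  # assumed labels
    topic_labels=("billing", "support", "sales"),  # assumed labels
    target_lang: str = "en",
) -> str:
    """Build a SELECT that enriches each transcription with the AI SQL functions."""
    return "\n".join([
        "SELECT",
        f"  {text_col},",
        f"  ai_analyze_sentiment({text_col}) AS sentiment,",
        f"  ai_summarize({text_col}) AS summary,",
        f"  ai_extract({text_col}, {_sql_array(entity_labels)}) AS entities,",
        f"  ai_classify({text_col}, {_sql_array(topic_labels)}) AS topic,",
        f"  ai_translate({text_col}, '{target_lang}') AS translation",
        f"FROM {table}",
    ])


print(build_enrichment_sql("my_catalog.my_schema.silver_transcriptions"))
```

The table and column arguments are interpolated as-is, so this helper is only suitable for trusted, hard-coded identifiers.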
Push to dev or main branch to trigger GitHub Actions workflows:
git push origin dev # Deploys to Dev environment
git push origin main # Deploys to Prod environment
Alternatively, deploy locally with the Databricks CLI:
cd speech_to_text_asset_bundle
# Validate configuration
databricks bundle validate --target dev
# Deploy to dev
databricks bundle deploy --target dev --var="service_principal_id=<uuid>"
# Deploy to prod
databricks bundle deploy --target prod --var="service_principal_id=<uuid>"
# Run the full pipeline (transcription → NLP enrichment → gold layer + evaluation)
databricks bundle run stt_main
- Databricks Setup: Service principal, catalog, and federation policy configuration
- GitHub Actions Setup: GitHub environments, variables, and secrets
- Solution Architecture: Technical deep-dive into pipeline design and data flow
- Environment Setup Overview: Quick setup checklist and documentation index
- Bundle README: Pipeline architecture, data schemas, and configuration reference
- Copilot Agents: Custom AI agents available in this repository
See LICENSE for details.
