alessandro9110/Speech-To-Text-With-Databricks


Speech To Text With Databricks

Databricks · Whisper Large V3 · LLaMA 3.3 70B · MLflow · Apache Spark

A speech-to-text processing solution using Databricks Asset Bundles for infrastructure-as-code and GitHub Actions for automated CI/CD deployment.


Overview

This repository implements an end-to-end speech-to-text (STT) pipeline on Databricks:

  • Audio files are ingested from Unity Catalog Volumes using Auto Loader
  • Data flows through a Bronze → Silver medallion architecture (Spark Declarative Pipelines)
  • Transcription is handled by Whisper Large V3 via a Databricks Model Serving endpoint
  • NLP enrichment (sentiment, summary, entities, topic, translation) is applied to every transcription
  • Both NLP implementations (AI SQL functions and Foundation Model API) are evaluated with MLflow
  • Deployment is automated via GitHub Actions with OIDC authentication
  • Infrastructure is managed via Databricks Asset Bundles (DAB)
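
As a rough sketch of the first stage, a Bronze streaming table can ingest audio files from a Unity Catalog Volume with Auto Loader's read_files (the volume path and table name here are illustrative placeholders, not the repository's actual definitions):

```sql
-- Hypothetical Bronze table: incrementally ingest raw audio files from a Volume
CREATE OR REFRESH STREAMING TABLE bronze_audio AS
SELECT
  *,
  _metadata.file_path AS source_file   -- keep the origin path for lineage
FROM STREAM read_files(
  '/Volumes/my_catalog/my_schema/raw_audio/',
  format => 'binaryFile'
);
```

Auto Loader tracks which files have already been processed, so re-running the pipeline only picks up newly arrived audio.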

What's Implemented

  • Audio Ingestion & Transcription — Auto Loader picks up audio files and Whisper transcribes them
  • NLP Enrichment — Sentiment, summary, entities, topic, and translation via two parallel implementations
  • MLflow Evaluation — Side-by-side quality comparison of AI SQL functions vs Foundation Model API
  • Automated CI/CD — GitHub Actions deploy to Dev and Prod environments
  • Infrastructure as Code — Databricks Asset Bundle with dev/prod targets
  • Dashboard — Databricks AI/BI dashboard for monitoring transcription and NLP results
  • Genie Space — Natural language interface for querying the gold layer tables

Quick Start

Prerequisites

  • Databricks workspace(s) with Unity Catalog enabled
    • Two workspaces (recommended for full CI/CD): one for dev, one for prod — each target in databricks.yml points to a separate workspace, ensuring complete environment isolation
    • One workspace (simplified setup): both dev and prod targets deploy to the same workspace, differentiated only by schema name — suitable for personal projects or demos
  • GitHub repository with administrative access
  • Databricks CLI installed (optional, for local deployment)
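
For the two-workspace layout, the dev and prod targets in databricks.yml might look roughly like this (hosts and mode settings are illustrative placeholders, not the repository's actual values):

```yaml
# Illustrative sketch of dev/prod targets in databricks.yml
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
```

In the single-workspace setup, both targets would share the same host and differ only in schema-related variables.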

Setup

  1. Configure Databricks — Create catalog, service principal, and federation policies → See docs/DATABRICKS_SETUP.md

  2. Configure GitHub Actions — Set up environments, variables, and secrets → See docs/GITHUB_ACTIONS_SETUP.md

  3. Deploy — Push to dev or main branch to trigger automated deployment


Project Structure

Speech-To-Text-With-Databricks/
├── speech_to_text_asset_bundle/          # Databricks Asset Bundle (DAB)
│   ├── databricks.yml                    # Bundle config: variables, targets (dev/prod)
│   ├── resources/                        # Jobs, pipelines, schemas, volumes, dashboard
│   │   ├── stt_audio_transcription.pipeline.yml  # Bronze + Silver transcription pipeline
│   │   ├── stt_nlp_enrichment.pipeline.yml       # Silver NLP enrichment pipeline
│   │   ├── stt_gold_layer.pipeline.yml           # Gold aggregation pipeline
│   │   ├── stt_dashboard.dashboard.yml           # AI/BI dashboard resource
│   │   ├── stt_genie.job.yml                     # Genie Space setup job
│   │   └── stt_main.job.yml                      # Orchestration job
│   ├── src/                              # Python source code and assets
│   │   ├── stt_audio_transcription/      # Bronze + Silver transcription tables
│   │   ├── stt_nlp_enrichment/           # Silver NLP enrichment tables
│   │   ├── stt_gold_layer/               # Gold detail and aggregate tables
│   │   ├── dashboards/                   # AI/BI dashboard definition (Lakeview JSON)
│   │   ├── stt_genie/                    # Genie Space setup notebook
│   │   └── stt_nlp_evaluation/           # MLflow quality evaluation notebook
│   ├── tests/                            # Unit and integration tests
│   └── pyproject.toml                    # Python dependencies and tooling
├── .github/workflows/                    # CI/CD automation
│   ├── sync_git_folder_and_deploy_adb_dev.yml   # Deploy to Dev on push to 'dev'
│   └── sync_git_folder_and_deploy_adb_prod.yml  # Deploy to Prod on push to 'main'
├── docs/                                 # Additional documentation
└── README.md                             # This file

/speech_to_text_asset_bundle

The core Databricks solution. Contains:

  • databricks.yml — Bundle configuration with dev and prod targets and all bundle variables
  • resources/ — YAML definitions for all pipelines, the AI/BI dashboard, the Genie Space setup job, the orchestration job, schemas, and volumes
  • src/stt_audio_transcription/ — Bronze and Silver transcription pipeline tables
  • src/stt_nlp_enrichment/ — Silver NLP enrichment tables (two parallel implementations)
  • src/stt_gold_layer/ — Gold detail and aggregate tables
  • src/dashboards/ — AI/BI Lakeview dashboard definition
  • src/stt_genie/ — Notebook that creates/updates the Genie Space via the Databricks SDK
  • src/stt_nlp_evaluation/ — MLflow GenAI evaluation notebook
  • tests/ — Unit tests for transformations

For detailed documentation, see speech_to_text_asset_bundle/README.md.

/.github/workflows

GitHub Actions workflows for CI/CD:

  • sync_git_folder_and_deploy_adb_dev.yml — Syncs Git folder and deploys to Dev when code is pushed to dev branch
  • sync_git_folder_and_deploy_adb_prod.yml — Deploys asset bundle to Prod when code is pushed to main branch

Both workflows use GitHub OIDC for secure, token-less authentication with Databricks.
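
An OIDC-based workflow requests a short-lived ID token via the permissions block; this fragment shows the general shape (it is a sketch of the typical pattern, not this repository's actual workflow file):

```yaml
# Minimal sketch of OIDC permissions in a GitHub Actions workflow
permissions:
  id-token: write   # allows the job to request an OIDC token from GitHub
  contents: read    # allows the job to check out the repository
```

Databricks then exchanges this GitHub-issued token for workspace credentials via the federation policy, so no long-lived secrets are stored in the repository.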


Solution Details

Data Flow

(Diagram: Databricks Audio Intelligence Pipeline)

All four stages are orchestrated by the stt_main job: transcription → NLP enrichment → gold layer update and MLflow evaluation (the last two run in parallel).

Technologies

  • Spark Declarative Pipelines (SDP) — Serverless streaming pipelines with @dlt.table decorators
  • Auto Loader — Incremental ingestion from Unity Catalog Volumes
  • Whisper Large V3 — Foundation Model for audio transcription via Model Serving endpoint
  • Databricks AI SQL functions — Built-in ai_analyze_sentiment, ai_summarize, ai_extract, ai_classify, ai_translate
  • Foundation Model API — databricks-meta-llama-3-3-70b-instruct via ai_query() for NLP tasks
  • MLflow GenAI evaluation — Side-by-side quality scoring with deterministic validators and LLM judges
  • Unity Catalog — Centralized data governance and metadata management
  • Databricks Asset Bundles — Infrastructure-as-code for multi-environment deployment
  • GitHub Actions + OIDC — Secure CI/CD without long-lived tokens
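
To illustrate the two NLP approaches side by side, a query over a transcription table (table and column names here are hypothetical) could combine a built-in AI SQL function with ai_query against the Llama endpoint:

```sql
SELECT
  transcription,
  -- Approach 1: built-in AI SQL function
  ai_analyze_sentiment(transcription) AS sentiment,
  -- Approach 2: Foundation Model API via ai_query
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    CONCAT('Summarize in one sentence: ', transcription)
  ) AS summary
FROM silver_transcriptions;
```

The MLflow evaluation notebook then scores the outputs of both approaches to compare quality.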

Deployment

Automated (Recommended)

Push to dev or main branch to trigger GitHub Actions workflows:

git push origin dev      # Deploys to Dev environment
git push origin main     # Deploys to Prod environment

Manual (Databricks CLI)

cd speech_to_text_asset_bundle

# Validate configuration
databricks bundle validate --target dev

# Deploy to dev
databricks bundle deploy --target dev --var="service_principal_id=<uuid>"

# Deploy to prod
databricks bundle deploy --target prod --var="service_principal_id=<uuid>"

# Run the full pipeline (transcription → NLP enrichment → gold layer + evaluation)
databricks bundle run stt_main

Additional Documentation

  • docs/DATABRICKS_SETUP.md — Catalog, service principal, and federation policy setup
  • docs/GITHUB_ACTIONS_SETUP.md — GitHub Actions environments, variables, and secrets
  • speech_to_text_asset_bundle/README.md — Detailed asset bundle documentation


License

See LICENSE for details.

About

An end-to-end, scalable STT solution on Databricks that transcribes audio into structured text in Delta Lake, ready for analytics, search, and GenAI/RAG.
