Skip to content

CASD-EU/Seminar_workflow_automation

Repository files navigation

Seminar_workflow_automation

In this seminar, we will learn the basics of workflow automation and orchestration framework. We will also play with some workflow automation frameworks, such as: - prefect - airflow

1. What is workflow automation and orchestration framework?

A workflow automation and orchestration framework coordinates multiple tasks(often interdependent) so that complex processes run reliably, observably, and efficiently without human intervention.

  • Automation: Executes pre-defined tasks automatically (Execution.)
  • Orchestration: Ensures multiple automated components run in the right order, with state tracking, error handling, and data passing between them. (Coordination)

2. Key concepts of a workflow orchestration Framework

Component Description Prefect Analogy Airflow Analogy
Task The smallest executable unit (a function, script, or command). @task decorator PythonOperator, BashOperator, or any Airflow operator.
Flow / DAG A sequence of tasks, rules, and dependencies designed to achieve a specific outcome. It can be represented as a directed acyclic graph. @flow decorator DAG object — defines Directed Acyclic Graph structure.
Task Dependency Orders and relationships between tasks (upstream/downstream). Python call order (task_a() → task_b()). Prefect infers dependencies automatically. Explicitly defined using >> or << operators (task_a >> task_b).
Scheduler Decides when to run flows (e.g., cron-like, event triggered). Prefect deployments & schedules Airflow Scheduler with schedule_interval (cron, timedelta, etc.).
Executor / Worker Runs tasks on compute resources (local, Docker, cluster). Prefect worker(pulls jobs from a Work Pool and executes flows.) Executor — runs tasks via Celery, LocalExecutor, or KubernetesExecutor.
State Management Tracks states of each task(e.g. running, success, retry, fail, cancel) Prefect Orion state engine askInstance and DagRun states (success, failed, up_for_retry, etc.).
Observer / Logger Records logs, metrics, and events for debugging. Prefect UI & logging Airflow Web UI (http://localhost:8080) — DAG-based view.
Configuration Store Holds environment variables, credentials, or secrets. Prefect Blocks Connections / Variables in Airflow (stored in metadata DB or environment).
API / Backend Central coordination hub for flows and tasks. Prefect API Server Airflow Webserver + Scheduler + Metadata DB (tightly coupled).

3. Automation vs. Orchestration

| Dimension |Automation |Orchestration| |Scope |Single task or script |Multiple tasks with dependencies| |Goal |Reduce manual execution |Ensure end-to-end process consistency| |Example | Run a Spark job daily | Run extract → transform → validate → load, in sequence|

4.Core Principles Behind Workflow Orchestration

4.1 State Awareness

Every task has a state (e.g., Pending, Running, Completed, Failed). Orchestration Framework tracks transitions between states, which allows:

  • Retry logic
  • Conditional branching (if success → do next step)

4.2. Task Lineage and Dependency Graph(DAGS)

The Orchestration Framework

  • ensures tasks dependency defined in a dag: extract → transform → validate → load → notify
  • tracks inputs and outputs, making debugging easier.

4.3. Scheduling and Triggers

The Orchestration Frameworks manage when to run workflows:

  • Time-based: cron-like (daily, weekly)
  • Event-based: file arrival, API signal, database change
  • Manual trigger: via CLI or API

4.4. Retry and Error Handling

Built-in resilience:

  • Retry failed tasks up to N times
  • Pause or continue depending on conditions
  • Mark run as Failed or Recovered

4.5. Observability

Orchestration provides visibility into:

  • Execution time per task
  • Failure reasons
  • Logs and stdout
  • Resource usage

In Prefect, this is visible in the Orion dashboard and stored in the SQLite/PostgreSQL backend.

4.6. Parameterization and Dynamic Workflows

Flows can take parameters to control behavior:

@flow
def etl(dataset: str, date: str):
    ...

Enables reusing the same logic for multiple datasets.

4.7. Execution Environment Abstraction

The orchestrator separates workflow logic from runtime. With the same workflow code, you can:

  • Run locally for testing.
  • Move to a cluster (Kubernetes, Dask, Ray).

5. Framework Architecture (Generic)

          ┌──────────────────────┐
          │   User / Developer   │
          └─────────┬────────────┘
                    │
                    ▼
           ┌─────────────────┐
           │ Flow Definition │   ← Python, YAML, or JSON
           └────────┬────────┘
                    │
                    ▼
           ┌─────────────────┐
           │  Scheduler / API │   ← Prefect Server / Cloud
           └────────┬────────┘
                    │
         ┌──────────┴──────────┐
         ▼                     ▼
 ┌─────────────────┐    ┌─────────────────┐
 │   Worker Node A  │    │   Worker Node B │   ← Executes tasks
 └─────────────────┘    └─────────────────┘
                    │
                    ▼
           ┌─────────────────┐
           │ Logging & UI    │   ← Observability
           └─────────────────┘

6. Advantages of Using an Orchestration Framework

Category Benefit
Reliability Automated recovery and retries
Reproducibility All runs tracked and logged
Scalability Parallel task execution
Auditability Historical logs and metadata stored
Maintainability Modular and reusable workflow components
Security Credential isolation through secrets and blocks

About

Cours and TPs for workflow automation seminar

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors