
Commit 4c5a137

docs: add M13 production hardening milestone + 18-segment demo outline
Milestone 13 covers:
- Retry on transient failures (spot preemptions, registry throttling)
- Precise build image sizing (per-tool profiles, Helm-configurable)
- Multi-cluster push (remote registries, promotion pipeline, env gates)
- Operational reliability (timeouts, health gates, DB backup)
- Observability (Prometheus metrics, alerting, cost attribution)

Segment outline restructured from 14 to 18 segments:
- 01-06: platform core (refresh)
- 07: merge & release pipeline (new)
- 08-09: infrastructure (renumbered)
- 10: baggage middleware (new)
- 11-13: testing & data (renumbered)
- 14: customization (new)
- 15-17: regression, GUI tour, GUI extension (rewrite)
- 18: M13 roadmap video (new)

Made-with: Cursor
1 parent c4c879b commit 4c5a137

2 files changed

Lines changed: 347 additions & 0 deletions


Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
1+
# Demo Segment Outline — 18 segments
2+
3+
Proposed restructure for the tekton-dag demo video series.
4+
Segments follow the natural lifecycle: understand → setup → build → test → merge → operate → extend.
5+
6+
---
7+
8+
## Segment map
9+
10+
| # | Title | Status | Visual | ~Duration | Notes |
11+
|----|------------------------------|----------|----------|-----------|-------|
12+
| 01 | Architecture overview | Refresh | Manim | 80s | Already strong — minor polish for M12 hook tasks, customization mention |
13+
| 02 | Quickstart | Refresh | VHS | 65s | Update to reflect current scripts, build-image matrix |
14+
| 03 | Bootstrap dataflow | Refresh | Mixed | 70s | Keep propagation walkthrough, update Kaniko cache mention |
15+
| 04 | PR pipeline | Refresh | VHS | 90s | Good as-is — minor wording tweaks |
16+
| 05 | Intercept routing | Refresh | Manim | 75s | Good — keep dual-backend (Telepresence + mirrord) |
17+
| 06 | Local debug | Refresh | Manim | 70s | Good — keep mirrord focus |
18+
| 07 | **Merge & release pipeline** | **New** | Manim | 75s | version-bump → build → crane tag → push version commit |
19+
| 08 | Orchestrator API | Refresh | VHS | 80s | Was 07 — add /api/reload, /api/graph/stats |
20+
| 09 | Multi-team Helm | Refresh | Manim | 75s | Was 08 — minor polish |
21+
| 10 | **Baggage middleware** | **New** | Manim | 70s | Per-framework deep dive: Spring, Node, Flask, PHP. Roles, headers, safety. |
22+
| 11 | Testing ecosystem | Refresh | VHS | 80s | Was 10 — rename, cover Newman + pytest + Playwright + Artillery |
23+
| 12 | Test-trace graph | Refresh | Mixed | 75s | Was 11 — good as-is |
24+
| 13 | Results DB | Refresh | VHS | 65s | Was 09 — good, slight reorder for flow |
25+
| 14 | **Customization** | **New** | Manim | 80s | Hook tasks, build variants, stack schema, language versions |
26+
| 15 | Regression suite | Rewrite | VHS | 75s | Was 12 — expand from 9 lines to real Phase 1/2 walkthrough |
27+
| 16 | Management GUI | Rewrite | Manim | 80s | Was 13 — full screen tour: DAG, runs, triggers, team switcher, logs |
28+
| 17 | Extending the GUI | Rewrite | Manim | 70s | Was 14 — concrete Flask route → Vue store → Playwright example |
29+
| 18 | **What's coming next** | **New** | Manim | 90s | M13 roadmap: retry, sizing, multi-cluster, observability |
30+
31+
**Total: ~23 minutes** (the ~Duration column sums to 1365s; was ~14 minutes with thin 12–14)
32+
33+
---
34+
35+
## Segment details
36+
37+
### 01 — Architecture overview (refresh)
38+
39+
**What stays:** DAG model, three pipelines, polyglot stacks, orchestrator, GUI, Results, Neo4j.
40+
**What changes:** Add explicit mention of merge/release pipeline as third pipeline. Reference M12 customization (hook tasks, build variants). Mention baggage middleware as the routing mechanism.
41+
**Visual:** Manim — update StackDAGScene with merge pipeline box and hook task annotation.
42+
43+
---
44+
45+
### 02 — Quickstart (refresh)
46+
47+
**What stays:** Kind cluster, Tekton install, build images, kubectl apply.
48+
**What changes:** Mention `--matrix` flag for build-image variants. Note optional components (Results, Neo4j, Dashboard, management GUI).
49+
**Visual:** VHS terminal recording of the 4-command setup.
50+
51+
---
52+
53+
### 03 — Bootstrap dataflow (refresh)
54+
55+
**What stays:** resolve-stack → clone → build → deploy → trace request through propagation chain.
56+
**What changes:** Minor wording updates. Mention Kaniko cache PV for faster rebuilds.
57+
**Visual:** Mixed (Manim HeaderPropagationScene + VHS bootstrap terminal).
58+
59+
---
60+
61+
### 04 — PR pipeline (refresh)
62+
63+
**What stays:** Webhook → orchestrator → generate PipelineRun → build → intercept → validate → test → cleanup → PR comment.
64+
**What changes:** Minor polish. Clarify "validation cluster" vs "production cluster" language. Mention query-test-plan integration with Neo4j.
65+
**Visual:** VHS showing generate-run.sh and pipeline execution.
66+
67+
---
68+
69+
### 05 — Intercept routing (refresh)
70+
71+
**What stays:** Blue/green request paths, header routing, dual backend, concurrent PRs, validate propagation/original traffic.
72+
**What changes:** Minor wording. Keep as-is — this is one of the strongest segments.
73+
**Visual:** Manim InterceptRoutingScene.
74+
75+
---
76+
77+
### 06 — Local debug (refresh)
78+
79+
**What stays:** mirrord tunnel, IDE breakpoints, live cluster data, clean disconnect.
80+
**What changes:** Minor polish. Reference VS Code launch configs in the repo.
81+
**Visual:** Manim LocalDebugScene.
82+
83+
---
84+
85+
### 07 — Merge & release pipeline (NEW)
86+
87+
**Story arc:**
88+
1. Open — "PR passed, tests green. What happens when you merge?"
89+
2. Merge trigger — webhook fires, orchestrator generates merge PipelineRun
90+
3. Version bump — `version-bump` task in release mode: strip `-rc.N` → release semver (e.g. `0.1.0-rc.3` → `0.1.0`)
91+
4. Build — full compile + Kaniko containerize with merge tag
92+
5. Tag release — `tag-release-images` uses crane to copy to `registry/app:v0.1.0`
93+
6. Push version commit — bump to next dev cycle (`0.1.1-rc.0`), push `versions.yaml` back to repo
94+
7. Hook tasks — optional pre-build/post-build hooks (image scan, SBOM, Slack)
95+
8. Close — "Released image ready for promotion. From PR to merge to release."
96+
97+
**Visual:** Manim — new MergeReleaseScene showing pipeline flow with version numbers transforming.
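The version transform in steps 3, 5, and 6 can be sketched in a few lines of Python. This is only an illustration for the narration script, not the actual `version-bump` task: the function names are hypothetical, and the patch-level bump is an assumption inferred from the `0.1.0` → `0.1.1-rc.0` example above.

```python
import re

# Hypothetical helpers for illustration; the real `version-bump` task may differ.
_RC_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)-rc\.(\d+)$")


def release_version(rc_version: str) -> str:
    """Strip the -rc.N suffix, e.g. 0.1.0-rc.3 -> 0.1.0."""
    match = _RC_PATTERN.match(rc_version)
    if match is None:
        raise ValueError(f"not an rc version: {rc_version!r}")
    major, minor, patch, _rc = match.groups()
    return f"{major}.{minor}.{patch}"


def next_dev_version(release: str) -> str:
    """Open the next dev cycle, e.g. 0.1.0 -> 0.1.1-rc.0 (patch bump assumed)."""
    major, minor, patch = (int(part) for part in release.split("."))
    return f"{major}.{minor}.{patch + 1}-rc.0"


def release_image_tag(image: str, release: str) -> str:
    """Build the release tag that the crane copy step targets."""
    return f"{image}:v{release}"


if __name__ == "__main__":
    released = release_version("0.1.0-rc.3")
    print(released)                                      # 0.1.0
    print(release_image_tag("registry/app", released))   # registry/app:v0.1.0
    print(next_dev_version(released))                    # 0.1.1-rc.0
```

The three printed values match the numbers that MergeReleaseScene would animate.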
98+
99+
---
100+
101+
### 08 — Orchestrator API (refresh, was 07)
102+
103+
**What stays:** healthz/readyz, /api/stacks, POST /api/run, webhook, test-plan, graph endpoints, /api/runs, /api/teams.
104+
**What changes:** Add /api/reload (hot-reload config). Add /api/graph/stats. Clarify deployment via Helm ConfigMap.
105+
**Visual:** VHS terminal showing curl calls.
106+
107+
---
108+
109+
### 09 — Multi-team Helm (refresh, was 08)
110+
111+
**What stays:** Single team → three teams, values.yaml knobs, team-scoped ConfigMaps, GUI team switcher, webhook isolation.
112+
**What changes:** Reference hook task parameters in values.yaml. Mention ArgoCD ApplicationSet for GitOps provisioning.
113+
**Visual:** Manim MultiTeamScene.
114+
115+
---
116+
117+
### 10 — Baggage middleware (NEW)
118+
119+
**Story arc:**
120+
1. Open — "Intercept routing depends on one thing: every service must propagate the dev-session header. The baggage libraries make this automatic."
121+
2. The three roles — originator sets, forwarder passes through, terminal accepts but does not forward
122+
3. Spring Boot — `@ConditionalOnProperty`, `BaggageContextFilter` + `BaggageRestTemplateInterceptor`, auto-configured
123+
4. Node/Vue — `createBaggageFetch` wraps fetch, `createAxiosInterceptor` for axios, browser env gating via `VITE_BAGGAGE_ENABLED`
124+
5. Flask/Python — `init_app` + `BaggageSession` subclassing `requests.Session`
125+
6. PHP — PSR-15 `BaggageMiddleware` + Guzzle middleware
126+
7. W3C baggage — standard `baggage` header alongside custom `x-dev-session`
127+
8. Production safety — all gated by `BAGGAGE_ENABLED=false` in prod
128+
9. Close — "Five frameworks, one header contract, zero application code changes beyond config."
129+
130+
**Visual:** Manim — new BaggageMiddlewareScene showing header flow through each framework icon.
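The three roles and the production gate can be shown on screen with a framework-free sketch. The code below is not taken from the baggage libraries: `outgoing_headers`, its parameters, and the `dev-session` baggage key are assumptions made for illustration. Only the `x-dev-session` and `baggage` header names and the `BAGGAGE_ENABLED` switch come from the outline.

```python
import os

SESSION_HEADER = "x-dev-session"  # custom header named in the outline
BAGGAGE_HEADER = "baggage"        # W3C baggage header
BAGGAGE_KEY = "dev-session"       # assumed baggage member key (hypothetical)


def outgoing_headers(role, incoming, session=None, enabled=None):
    """Return the headers a service adds to its downstream calls.

    role: "originator" sets the session, "forwarder" passes it through,
    "terminal" accepts it but never forwards it.
    """
    if enabled is None:
        # Disabled unless explicitly switched on, so prod stays inert.
        enabled = os.environ.get("BAGGAGE_ENABLED", "false").lower() == "true"
    if role not in ("originator", "forwarder", "terminal"):
        raise ValueError(f"unknown role: {role!r}")
    if not enabled or role == "terminal":
        return {}
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in incoming.items()}
    value = session if role == "originator" else normalized.get(SESSION_HEADER)
    if not value:
        return {}
    # Simple values only; W3C baggage requires percent-encoding for others.
    return {SESSION_HEADER: value, BAGGAGE_HEADER: f"{BAGGAGE_KEY}={value}"}
```

A forwarder that receives `X-Dev-Session: pr-42` would emit both headers downstream, while a terminal service or any service with the gate off emits none.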
131+
132+
---
133+
134+
### 11 — Testing ecosystem (refresh + rename, was 10)
135+
136+
**What stays:** Newman/Postman for API tests, pytest for orchestrator, tekton-dag-common tests, e2e intercept validation.
137+
**What changes:** Rename from "Newman tests" to "Testing ecosystem." Add Playwright (69 e2e tests for management GUI). Mention Artillery load tests. Clarify stack-scoped vs system-level test distinction.
138+
**Visual:** VHS terminal showing pytest + Newman output.
139+
140+
---
141+
142+
### 12 — Test-trace graph (refresh, was 11)
143+
144+
**What stays:** Neo4j graph model, trace ingestion, query-test-plan, blast radius 1/2, gap detection, focused test execution.
145+
**What changes:** Minor polish. This is already one of the best segments.
146+
**Visual:** Mixed (Manim BlastRadiusScene + VHS graph query terminal).
147+
148+
---
149+
150+
### 13 — Results DB (refresh, was 09)
151+
152+
**What stays:** Tekton Results + Postgres setup, verify-results script, auditability, GUI pipeline monitor.
153+
**What changes:** Slight reorder in the series (was 09, now 13) to group testing/data together. Minor wording.
154+
**Visual:** VHS terminal showing Results API queries.
155+
156+
---
157+
158+
### 14 — Customization (NEW)
159+
160+
**Story arc:**
161+
1. Open — "tekton-dag is designed to be extended without forking. Every integration point is config-driven."
162+
2. Stack schema — `stacks/schema.json` validates stack YAML, prevents typos
163+
3. Add an app — new entry in `apps[]` with role, build tool, propagation-role, downstream, tests
164+
4. Build variants — `compileImageVariants` in Helm for multiple Java/Node/Python/PHP versions
165+
5. Hook tasks — `pre-build-task`, `post-build-task`, `pre-test-task`, `post-test-task` parameters; pipeline `when` skips if empty
166+
6. Example hooks — image scan (`tasks/examples/example-image-scan.yaml`), Slack notification
167+
7. New team onboarding — `teams/<name>/team.yaml` + `values.yaml` → Helm release
168+
8. Registry and infrastructure — change image registry, switch intercept backend
169+
9. Close — "Config-only onboarding. No pipeline forks."
170+
171+
**Visual:** Manim — new CustomizationScene showing schema → stack → hooks → team flow.
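The hook-skipping rule from step 5 is easy to state in code. The sketch below mirrors the idea of the pipeline `when` guard in plain Python. It is not the pipeline definition itself, and `active_hooks` and the task name `example-image-scan` are hypothetical.

```python
# The four hook parameters named in the outline, in pipeline order.
HOOK_PARAMS = (
    "pre-build-task",
    "post-build-task",
    "pre-test-task",
    "post-test-task",
)


def active_hooks(params):
    """Return (parameter, task name) pairs for hooks that would run.

    A hook whose parameter is missing or empty is skipped, which is the
    behaviour the outline attributes to the pipeline's `when` guard.
    """
    return [
        (name, params[name])
        for name in HOOK_PARAMS
        if params.get(name, "").strip()
    ]


if __name__ == "__main__":
    configured = {"post-build-task": "example-image-scan", "pre-test-task": ""}
    for param, task in active_hooks(configured):
        print(f"{param} -> {task}")  # post-build-task -> example-image-scan
```

With no hook parameters set, the list is empty and the pipeline behaves as if hooks did not exist.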
172+
173+
---
174+
175+
### 15 — Regression suite (REWRITE, was 12)
176+
177+
**Story arc:**
178+
1. Open — "How do we know the platform itself works? Unit tests cover Python services. But a CI/CD platform needs cluster-level verification."
179+
2. Phase 1 (no cluster) — stack YAML validation, registry wiring, pytest orchestrator + common + management-gui-backend
180+
3. Phase 2 (with cluster) — port prep, `stack-dag-verify` PipelineRun, wait for Succeeded, Newman against live orchestrator
181+
4. Optional Phase 2+ — Tekton Results verification, management GUI Playwright suite
182+
5. The regression script — `run-regression-agent.sh`: iterates Phase 1 → Phase 2, exits 0 or non-zero
183+
6. CI integration — agent loops until `regression exit code: 0`, reads failures, fixes, re-runs
184+
7. Close — "Layered verification. No cluster? Phase one still catches regressions. Full cluster? Phase two proves the platform end to end."
185+
186+
**Visual:** VHS terminal showing actual regression run with Phase 1 pass, Phase 2 pass, exit code 0.
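The layered flow in steps 2 to 5 can be sketched as a small phase runner. This is an illustration only, not the contents of `run-regression-agent.sh`: the function names, the check lists, and the specific non-zero codes (1 and 2) are assumptions. The outline only states that the script exits 0 or non-zero.

```python
def run_phase(name, checks):
    """Run every (label, callable) check in a phase; True if all pass."""
    all_passed = True
    for label, check in checks:
        passed = bool(check())
        print(f"[{name}] {label}: {'PASS' if passed else 'FAIL'}")
        all_passed = all_passed and passed
    return all_passed


def run_regression(phase1, phase2, have_cluster):
    """Phase 1 always runs; Phase 2 runs only when a cluster is available."""
    if not run_phase("phase1", phase1):
        return 1  # hypothetical code for a Phase 1 failure
    if have_cluster and not run_phase("phase2", phase2):
        return 2  # hypothetical code for a Phase 2 failure
    print("regression exit code: 0")
    return 0


if __name__ == "__main__":
    # Stand-in checks; the real phases validate stack YAML, run pytest, etc.
    phase1 = [("stack YAML validation", lambda: True), ("pytest", lambda: True)]
    phase2 = [("stack-dag-verify PipelineRun", lambda: True)]
    raise SystemExit(run_regression(phase1, phase2, have_cluster=False))
```

Running without a cluster still exercises Phase 1, which is the point the closing line of the segment makes.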
187+
188+
---
189+
190+
### 16 — Management GUI (REWRITE, was 13)
191+
192+
**Story arc:**
193+
1. Open — "The management GUI is a Vue 3 single-page app. The browser never touches the Kubernetes API directly — every action goes through Flask."
194+
2. Team switcher — select active team, all views filter to that team's namespace
195+
3. DAG view — Vue Flow renders the stack graph, click a node to see app details
196+
4. Pipeline runs — monitor active and completed runs, drill into TaskRun logs
197+
5. Triggers — manual bootstrap, PR, or merge trigger from the UI
198+
6. Test results — Newman/Playwright/Artillery results per run
199+
7. Git browser — browse app repos from the GUI
200+
8. Embedded Tekton Dashboard — iframe integration for deeper Tekton inspection
201+
9. Architecture — Vue 3 + Vite frontend, Flask backend, Pinia stores, team-scoped API helpers
202+
10. Testing — pytest backend, Playwright e2e (69 tests), Postman/Newman API tests
203+
11. Close — "One web interface for the whole platform. Team-scoped, cluster-safe, fully tested."
204+
205+
**Visual:** Manim — new ManagementGUITourScene showing each view with transitions.
206+
207+
---
208+
209+
### 17 — Extending the GUI (REWRITE, was 14)
210+
211+
**Story arc:**
212+
1. Open — "Adding a new operator surface — like TaskRun logs or Results views — follows a five-step pattern."
213+
2. Step 1: Flask route — add a JSON endpoint wrapping the Kubernetes client; return stable response shapes
214+
3. Step 2: pytest — mock the K8s client, test the route with the same pattern as existing tests
215+
4. Step 3: Vue store — add a Pinia store using `useApiHelper` and `teamUrl` for team-scoped reads
216+
5. Step 4: View + router — add a Vue component and router entry
217+
6. Step 5: Playwright spec — e2e test the new view end to end
218+
7. Concrete example — walk through adding a "TaskRun Logs" panel: Flask reads logs API → Vue displays in monospace → Playwright verifies
219+
8. Close — "Five files, one pattern. The extension guide in the repo lists more ideas."
220+
221+
**Visual:** Manim — new GUIExtensionPatternScene showing the 5-step flow with code snippets.
222+
223+
---
224+
225+
### 18 — What's coming next (NEW)
226+
227+
**Story arc:**
228+
1. Open — "tekton-dag handles the full lifecycle today: build, test, intercept, merge, release. Here is what we are building next to make it production-grade on real infrastructure."
229+
2. Retry on transient failures — spot instance preemptions, registry throttling, DNS timeouts. Task-level `retries` on build/deploy tasks, but never on test tasks. Structured retry annotations for post-mortem.
230+
3. Precise build image sizing — per-tool resource profiles (Maven needs heap, Node needs less), Helm-configurable `resources.*` values, stack-level overrides for large apps, Kaniko sizing separate from compile, monitoring baseline to inform decisions.
231+
4. Multi-cluster push — today builds and deploys in one cluster. Next: push released images to remote registries (ECR, GCR, Harbor), promotion pipeline with environment gates (staging → production approval), cross-cluster deploy task, full audit trail in Tekton Results.
232+
5. Operational reliability — explicit pipeline timeouts, graceful cleanup on timeout (finally block still runs), health-check gates before tests, Results DB backup, Neo4j persistence.
233+
6. Observability — Prometheus metrics (build duration, retry count, queue time), alerting rules (failure rate, registry push failures), cost attribution labels per team/stack/app.
234+
7. Close — "Infrastructure-grade reliability. Cost-aware sizing. Multi-environment promotion. That is milestone thirteen."
235+
236+
**Visual:** Manim — new RoadmapScene with a timeline/roadmap layout showing each pillar with icons.
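The retry rule in step 2 (retry build and deploy tasks, never test tasks) can be sketched as a small policy function. This is a hedged illustration of the intended behaviour, not M13 code: `run_with_retries`, `TransientError`, and the backoff parameter are hypothetical, and the real mechanism is expected to be the task-level `retries` setting.

```python
import time

RETRYABLE_KINDS = {"build", "deploy"}  # test tasks are never retried


class TransientError(Exception):
    """Stands in for a spot preemption, registry throttle, or DNS timeout."""


def run_with_retries(kind, task, retries=2, backoff_s=0.0):
    """Run task() and retry transient failures for build/deploy kinds only.

    Returns (result, attempts) so the attempt count can be recorded for
    post-mortem analysis.
    """
    allowed = retries if kind in RETRYABLE_KINDS else 0
    attempt = 0
    while True:
        attempt += 1
        try:
            return task(), attempt
        except TransientError:
            if attempt > allowed:
                raise  # retries exhausted, or this kind must fail fast
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```

A test task that hits the same transient error fails on the first attempt, so a flaky test is never masked by a silent rerun.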
237+
238+
---
239+
240+
## Concat groups
241+
242+
| Name | Segments | ~Duration |
243+
|------|----------|-----------|
244+
| `full-demo` | 01–13 | ~16 min |
245+
| `full-demo-complete` | 01–18 | ~23 min |
246+
| `platform-core` | 01–07 | ~8.5 min |
247+
| `operations` | 08–09, 14–15 | ~5 min |
248+
| `testing` | 11–13 | ~3.5 min |
249+
| `gui` | 16–17 | ~2.5 min |
250+
| `roadmap` | 18 | ~1.5 min |
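Group lengths can be cross-checked against the segment map with a short script. The sketch below copies the ~Duration values from the table. The helper names `expand` and `group_minutes` are hypothetical and not part of any build tooling in the repo.

```python
# Per-segment durations in seconds, copied from the segment map.
DURATIONS_S = {
    1: 80, 2: 65, 3: 70, 4: 90, 5: 75, 6: 70, 7: 75, 8: 80, 9: 75,
    10: 70, 11: 80, 12: 75, 13: 65, 14: 80, 15: 75, 16: 80, 17: 70, 18: 90,
}


def expand(spec):
    """Expand a spec such as '08–09, 14–15' into [8, 9, 14, 15]."""
    segments = []
    for part in spec.replace("–", "-").split(","):
        low, _, high = part.strip().partition("-")
        segments.extend(range(int(low), int(high or low) + 1))
    return segments


def group_minutes(spec):
    """Total running time of a concat group, in minutes."""
    return sum(DURATIONS_S[number] for number in expand(spec)) / 60


if __name__ == "__main__":
    for name, spec in [
        ("full-demo", "01–13"),
        ("full-demo-complete", "01–18"),
        ("platform-core", "01–07"),
        ("operations", "08–09, 14–15"),
        ("testing", "11–13"),
        ("gui", "16–17"),
        ("roadmap", "18"),
    ]:
        print(f"{name}: {group_minutes(spec):.2f} min")
```

Rerunning this after any change to the ~Duration column keeps the group table and the headline total in step.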
251+
252+
---
253+
254+
## Visual asset inventory
255+
256+
### Existing Manim scenes (update)
257+
- `StackDAGScene` (01) — add merge pipeline box, hook task annotation
258+
- `InterceptRoutingScene` (05) — keep
259+
- `LocalDebugScene` (06) — keep
260+
- `MultiTeamScene` (09) — add ArgoCD mention
261+
- `BlastRadiusScene` (12) — keep
262+
- `RegressionSuiteScene` (15) — rewrite for Phase 1/2 flow
263+
264+
### New Manim scenes
265+
- `MergeReleaseScene` (07) — version transform pipeline
266+
- `BaggageMiddlewareScene` (10) — header flow through framework icons
267+
- `CustomizationScene` (14) — schema → stack → hooks → team
268+
- `ManagementGUITourScene` (16) — view panels with transitions
269+
- `GUIExtensionPatternScene` (17) — 5-step code flow
270+
- `RoadmapScene` (18) — timeline with M13 pillars: retry, sizing, multi-cluster, observability
271+
272+
### Existing VHS tapes (update)
273+
- `02-quickstart.tape` — minor script name updates
274+
- `03-bootstrap.tape` — keep
275+
- `04-pr-pipeline.tape` — keep
276+
- `07-orchestrator-api.tape` (now 08) — add reload/stats endpoints
277+
- `09-results-db.tape` (now 13) — keep
278+
- `10-newman.tape` (now 11) — add pytest/Playwright output
279+
- `11-graph-tests.tape` (now 12) — keep
280+
281+
### New VHS tapes
282+
- `15-regression.tape` — real regression-agent run showing Phase 1 + 2
283+
284+
---
285+
286+
## Decision: start fresh or refresh?
287+
288+
**Recommendation:** Refresh narration for 01–06, 08–09, 11–13 (they're already good — just align numbering and minor wording). **Write new** narration for 07, 10, 14, 18. **Rewrite** 15, 16, 17.
289+
290+
This avoids re-spending TTS budget on segments that are already strong while filling the real gaps.
