diff --git a/Design & Best Practices Guide.md b/Design & Best Practices Guide.md
new file mode 100644
index 0000000..014a935
--- /dev/null
+++ b/Design & Best Practices Guide.md
@@ -0,0 +1,913 @@
+# SA Workflow Design & Best Practices Guide
+
+[**Overview**](#overview)
+
+[Prerequisites](#π-prerequisites)
+
+[Goals](#π―-goals)
+
+[Non-Goals](#β-non-goals)
+
+[**Preparation**](#preparation)
+
+[For the SA](#for-the-sa)
+
+[For the Customer](#for-the-customer)
+
+[**Workflow Design**](#workflow-design)
+
+[General](#general)
+
+[Common Design Questions](#common-design-questions)
+
+[**Best Practices**](#best-practices)
+
+[Workflows](#workflows)
+
+[Child Workflows](#child-workflows)
+
+[Activities](#activities)
+
+[Signals](#signals)
+
+[Queries](#queries)
+
+[Update](#update)
+
+[Workers & Task Queues](#workers-&-task-queues)
+
+[Timers](#timers)
+
+[Schedules](#schedules)
+
+[Cron Jobs](#cron-jobs)
+
+[Side Effects](#side-effects)
+
+[Data Converter](#data-converter)
+
+[Visibility](#visibility)
+
+[Versioning](#versioning)
+
+[Interceptors](#interceptors)
+
+[Sessions/Worker-Specific Task Queues](#sessions/worker-specific-task-queues)
+
+[Storage Optimization (Long Running Workflows)](#storage-optimization-\(long-running-workflows\))
+
+[**Appendix**](#appendix)
+
+[1\. Best Practices Checklist](#best-practices-checklist)
+
+[2\. Useful Links: SDK Features Matrix](#useful-links:)
+
+# Overview {#overview}
+
+## Prerequisites {#π-prerequisites}
+
+The customer has already vetted Temporal and/or Temporal Cloud as a suitable platform for building their use case(s).
+
+## Goals {#π―-goals}
+
+The goals of a Workflow design session are to:
+
+* Ensure the customer is using the Temporal primitives correctly and aligned with our best practices
+* Answer specific customer questions on Workflow design
+* Advise on any gotchas or potential issues with the Workflow
+* Build customer confidence on their path to production with Temporal
+
+## Non-Goals {#β-non-goals}
+
+* Discover or confirm Temporal or Temporal Cloud suitability for this use case. See the [SA Customer Evaluation Runbook](https://docs.google.com/document/d/18f5K7wmOKy5luT1gajKqkCuys1q9tNkM9DwDNXVsTmw/edit#heading=h.o0epy3bs2n54) for that.
+* Deep debugging or troubleshooting of production issues (enter a support ticket for that)
+* Perfection: no 1-hour session will yield a perfect workflow. It will be an iterative process
+
+# Preparation {#preparation}
+
+## For the SA {#for-the-sa}
+
+The SA must be familiar with all of the Temporal concepts and primitives in this document. This document is not a replacement for the [Temporal documentation](https://docs.temporal.io/), the [Developer's Guide](https://docs.temporal.io/dev-guide), the [SDK Samples](https://github.com/temporalio?q=samples-&type=public&language=&sort=stargazers), the [Community Forums](https://community.temporal.io/), or Slack. Rather, this document highlights (and references) knowledge from those sources that is relevant to Workflow Design and Best Practices.
+
+This guide should help the SA focus on key Workflow design elements and usage of Temporal primitives in a Workflow. However, it will take time for you to review and become familiar with all of the topics. Take that time. Read the docs. Play with the samples. Shadow your fellow SAs. Contribute back to this doc with your newly acquired knowledge. Relax with your favorite beverage as you bask in the growth of your Temporal expertise and the value you bring to our customers every day. Cheers\!
+
+## For the Customer {#for-the-customer}
+
+Prior to the meeting, request that your customer provide the following:
+
+* A description of the use case and the business process during which it runs
+* A diagram of the Workflow with Activities, etc.
+* Access to Workflow Execution history in the Temporal UI (if code has been written and run successfully)
+* Access to code (if code has been written and customer is comfortable with sharing)
+
+## Suggested Session Outline
+
+# Workflow Design {#workflow-design}
+
+The following sections include questions, prompts and areas to guide the user throughout the design session.
+
+## General {#general}
+
+### What is the customer's **level of experience** with Temporal?
+
+* Is this the first Workflow they have designed or implemented?
+* What is their overall comfort level with and knowledge of Temporal, its primitives, and the execution model?
+
+### Have the customer **walk through the diagram**.
+
+* What is the business process?
+* Where does Temporal fit in the overall architecture?
+ * Is everything a Workflow, or is it just being used for one part of the overall system?
+* What invokes the Workflow?
+ * Is it a user action, a system event, or a cron/schedule?
+* How often is the Workflow called?
+ * What is the expected peak volume? (e.g. 10x/day? 10x/second? More? Less?)
+* How long does the Workflow run?
+ * Does it complete in seconds? Months? Longer?
+* How many Activities, or other steps/actions, are in the Workflow?
+ * Are they sequential, or do they run in parallel?
+* What is done in the Activities?
+ * Are Activities too granular or too broad?
+ * Are there multiple steps within an Activity?
+
+At this point, the discussion may still be high level. Perhaps your customer has attempted to steer the conversation in another direction.
+
+* Ensure you have a good high level understanding before going too deep into the weeds.
+* If your customer has immediate questions or areas to focus on, follow their lead, using the sections below to guide your review of their Workflows, Activities, Signals, Queries, etc., as needed.
+* If they bring you deep into code, you can pull back out (if you need to) by asking to view the Workflow history in the UI or by revisiting the diagram.
+
+### **Gut check** for Use Case and Design
+
+* #### Is this an appropriate use-case for Temporal?
+
+ * Almost anything can be a valid use case, but watch out for:
+ * Extreme low latency, e.g. high-frequency algo trading where milliseconds matter
+ * Synchronous read-only operations, i.e. if you just need to query records out of a DB, you don't need Temporal.
+ * Big data passing through Temporal. Temporal is a control plane; it should not be used as the data plane.
+
+* #### Indicators of a good design
+
+ * Workflow [determinism](#β-do-they-understand-workflow-determinism-requirements?)
+ * Activity [idempotency](#β-do-they-understand-activity-idempotency-as-a-best-practice?)
+ * Appropriate management of the [Event History](#β-do-they-understand-the-event-history?) size
+ * Understanding and consideration of [Temporal limits](https://docs.temporal.io/self-hosted-guide/defaults) and [Cloud limits](https://docs.temporal.io/cloud/limits)
+ * Appropriate use of [timeouts](#β-what-are-the-activity-timeout-settings?) and [retry](#β-what-is-the-activity-retry-policy?) options
+
+* #### Indicators of a sub-optimal design
+
+ * DIY state management
+ * Are they unnecessarily persisting status or entities to a database?
+ * It is valid/necessary to persist data for historical, reporting or aggregate query purposes
+ * Are they sending messages (via Kafka, RabbitMQ, etc.) for the purpose of choreography to other components in the system? (perhaps those should be Workflows as well)
+ * Misuse of [Child Workflows](#β-do-they-use-child-workflows?)
+ * Misuse of [Local Activities](#β-do-they-use-local-activities?)
+ * Misuse of [Search Attributes](#β-do-they-use-search-attributes?)
+
+## Common Design Questions {#common-design-questions}
+
+### When to use Local Activity (vs. Activity)?
+
+* Use regular Activities unless your use case requires very high throughput and large Activity fan-outs of very short-lived Activities.
+* Reference:
+ * Best practices for [Local Activities](#β-do-they-use-local-activities?)
+ * Community forum: [Local Activity vs Activity](https://community.temporal.io/t/local-activity-vs-activity/290/3)
+ * Slide deck [Workflow Latency: Regular Activities vs Local Activities](https://docs.google.com/presentation/d/1BFipVEynxs5-fC7aeOsPO4xSeWan5L7-0WTdIi1DZU0/edit#slide=id.p)
+
+### When to use Child Workflow (vs. Activity)?
+
+* When in doubt, use an Activity
+* Reference:
+ * Best practices for [Child Workflows](#β-do-they-use-child-workflows?)
+ * Docs: [When to use a Child Workflow versus an Activity](https://docs.temporal.io/encyclopedia/child-workflows#child-workflow-versus-an-activity)
+
+### Can a WF communicate with another WF?
+
+* Yes, via Nexus sync operations
+* Yes. Can be done with:
+ * Signals
+ * Queries
+ * Updates
+
+ You can Signal from a Workflow and Query/Update through an Activity within a Workflow.
+
+
+### Can a WF communicate with another WF **in a different Namespace**?
+
+* Yes, via Nexus
+* Old way: yes, through an Activity as follows:
+ * Create a new Temporal Client
+ * Use that Client within an Activity to Signal, Query or Update the WF in another Namespace
+ * See for reference: [this sample](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/batch/slidingwindow)
+
+### How to handle large payloads?
+
+* Pass references to data (e.g. filenames/handles)
+* Consider [External Storage](https://docs.temporal.io/external-storage)
+ * (Previously: consider the [Temporal Large Payload Codec](https://github.com/DataDog/temporal-large-payload-codec) from DataDog to automatically replace data with a reference.)
+ * Use caution with a Large Payload Codec:
+ * It causes a remote access for every payload
+ * Do it explicitly on a case-by-case basis (don't do this implicitly)
+ * Lots of remote writes/reads can affect workflow performance.
+* Consider moving the work into activities
+ * The first activity takes the ID and gets the data and continues the workflow
+ * The last activity takes the results, stores it and returns an ID
+ * Challenge is knowing which bits of data are needed inside the workflow logic
+* Store data on local disk and use sticky Activities
+* Consider a broader Activity that gathers AND processes data within a single Activity
+* Reference:
+ * Best practices for [Activity Input and Outputs](#β-what-are-the-input-and-output-sizes-for-activity-payloads?)
+
+### How to handle large Workflow History?
+
+* Use Continue-As-New at or before the WARN levels (10 MB or 10K events)
+ * Note that Cloud users will *not* see the default warning, as it is emitted on the server side; however, they can obtain the WF history length through the SDK.
+ * Also, the SDKs now expose a Continue-As-New suggestion API that can be used to determine when to Continue-As-New (CAN):
+ * Java \- [Workflow.getInfo().isContinueAsNewSuggested()](https://www.javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/workflow/WorkflowInfo.html)
+ * Go \- [GetContinueAsNewSuggested()](https://pkg.go.dev/go.temporal.io/sdk@v1.25.1/internal#GetWorkflowInfo) from GetWorkflowInfo
+ * Python \- [is\_continue\_as\_new\_suggested](https://python.temporal.io/temporalio.workflow.Info.html#is_continue_as_new_suggested) from Workflow.info
+ * TypeScript \- [workflowInfo().continueAsNewSuggested](https://typescript.temporal.io/api/interfaces/workflow.WorkflowInfo#continueasnewsuggested)
+ * Dotnet \- [Workflow.ContinueAsNewSuggested](https://dotnet.temporal.io/api/Temporalio.Workflows.Workflow.html#Temporalio_Workflows_Workflow_ContinueAsNewSuggested)
+* Use Child Workflows to partition the work
+* Reference:
+ * Best practices for [WF Event History](#β-do-they-understand-the-event-history?)
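+
+The Continue-As-New decision above can be sketched as a small helper. This is plain Python rather than SDK code: `should_continue_as_new`, its parameters, and the `headroom` factor are hypothetical names chosen for illustration, and the thresholds are the WARN levels quoted above. In a real Workflow the inputs would come from the SDK's Workflow info.
+
+```python
+WARN_EVENT_COUNT = 10_000               # WARN level for event count quoted above
+WARN_HISTORY_BYTES = 10 * 1024 * 1024   # WARN level for history size quoted above (10 MB)
+
+
+def should_continue_as_new(
+    event_count: int,
+    history_bytes: int,
+    server_suggests: bool,
+    headroom: float = 0.8,  # assumption: act at 80% of the WARN level
+) -> bool:
+    """Return True when the Workflow should Continue-As-New."""
+    # Trust the server-side suggestion whenever it is set.
+    if server_suggests:
+        return True
+    # Otherwise act before either WARN level is reached.
+    return (
+        event_count >= WARN_EVENT_COUNT * headroom
+        or history_bytes >= WARN_HISTORY_BYTES * headroom
+    )
+```
+
+Checking this at a safe point in the main Workflow loop (for example, after each batch of work) keeps the decision in one place.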
+
+### How to run Activities / Child WFs in parallel and aggregate results?
+
+* All executions by the API (e.g. Execute Activity, Execute Child WF) return some form of a Promise, Future, or Awaitable (depending on the SDK).
+* Invocation is asynchronous by default. To run executions in parallel, don't block on the result of each promise as you start it
+* Gather the promises in an array and when appropriate, iterate over the array and Get the results from the promises
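+
+A minimal sketch of this fan-out/aggregate pattern, using plain `asyncio` tasks as a stand-in for the SDK's Promise/Future/Awaitable. `process_item` is a hypothetical placeholder for an Activity or Child Workflow invocation, not a Temporal API.
+
+```python
+import asyncio
+
+
+async def process_item(item: int) -> int:
+    # Hypothetical stand-in for an Activity or Child Workflow invocation.
+    await asyncio.sleep(0.01)
+    return item * 2
+
+
+async def run_in_parallel(items: list[int]) -> list[int]:
+    # Start every execution without awaiting it, collecting the handles.
+    pending = [asyncio.create_task(process_item(i)) for i in items]
+    # Block only after everything has been started, then aggregate in order.
+    return [await handle for handle in pending]
+
+
+results = asyncio.run(run_in_parallel([1, 2, 3]))
+```
+
+The key point is that the `await` happens in a second pass, so all executions are in flight before the first result is read.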
+
+### When to use Worker-specific Activity Task Queues?
+
+Use Worker-specific Activity Task Queues for:
+
+* Special purpose hardware or capabilities for workers (e.g. GPUs)
+* Activities that need to run on the same worker
+ * Such as when workers need to access local filesystem files.
+ * Such as when workers need to run on a machine with elevated access.
+ * Such as when workers need to run on a machine with differentiated hardware, such as GPUs.
+* Rate-limiting
+* Reference:
+ * Best practices on [Activity specific task queues](#β-do-any-activities-run-on-unique-task-queues?)
+
+### When to use a schedule vs a timer?
+
+* Use timers when:
+ * Your delay is relative (e.g. wait 2 days)
+ * Your delay happens inside a Workflow Execution
+ * You would exceed 200 schedule actions per second (default is 10\)
+* Use schedules when:
+ * Your delay is for an entire Workflow Execution
+ * Your delay is for a set time (e.g. 3pm Wednesdays)
+ * Your delay is recurring (though you can limit schedule runs to a single execution)
+
+### Can I intentionally let a task queue backlog grow by underprovisioning workers for some period of time?
+
+* Yes, **BUT**… be aware that Temporal does not guarantee FIFO. Task processing currently prioritizes sync match over async match. Therefore, a task on the backlog may be processed after a newer task is scheduled (when the newer task is able to sync match). \[Issue [2517](https://github.com/temporalio/temporal/issues/2517) will provide a config alternative when implemented\]
+
+# Best Practices {#best-practices}
+
+## Workflows {#workflows}
+
+### Do they understand Workflow [**Determinism**](https://docs.temporal.io/workflows#deterministic-constraints) requirements? {#β-do-they-understand-workflow-determinism-requirements?}
+
+* A Workflow Definition can (and will) be executed many times. Any re-execution must follow the same execution path for a given input to the Workflow Definition.
+* Have they encountered non-determinism errors (NDEs)?
+ * Are they unit testing using the Replayer with past Workflow histories?
+ * If NDEs occur due to code changes, see the Versioning section.
+* Ref: [community post about determinism](https://community.temporal.io/t/workflow-determinism/4027)
+
+### Do they understand the [**Event History**](https://docs.temporal.io/workflows#event-history)? {#β-do-they-understand-the-event-history?}
+
+* A single execution has limits on the size of the Event History. Are they at risk of hitting the limits?
+ * Is there a strategy to use [Continue-As-New](https://docs.temporal.io/workflows#continue-as-new) within a single workflow or use [Child Workflows](https://docs.temporal.io/workflows#child-workflow) to partition the work across multiple Workflows?
+* Temporal scales out incredibly well. Partition work *across* Workflows rather than performing too much work *within* a single Workflow.
+
+### Do they set [**Workflow Id**](https://docs.temporal.io/workflows#workflow-id) to a meaningful business identifier (or not)? {#β-do-they-set-workflow-id-to-a-meaningful-business-identifier-(or-not)?}
+
+* Are they leveraging Workflow Id uniqueness constraints for running Workflows?
+* Do they explicitly set a [Workflow Id Reuse Policy](https://docs.temporal.io/workflows#workflow-id-reuse-policy) for closed Workflows? If so, why?
+* A Workflow Id cannot be reused for Open Workflows.
+ * It can effectively act as an idempotency key in designs where the Workflow may be started more than once.
+* Do not reuse the same Workflow Id with high frequency as it can result in server performance issues \[[ref](https://temporaltechnologies.slack.com/archives/C01H1G7J98F/p1702041104745229?thread_ts=1701805613.351819&cid=C01H1G7J98F)\]
+* In addition, the Workflow ***Run Id*** can change during a Workflow Execution (e.g. during Retry).
+ * Do *not* rely on Run Id for any logical choices in a Workflow, as this will lead to non-determinism issues.
+
+### Do they explicitly set any **Workflow Timeout** options? {#β-do-they-explicitly-set-any-workflow-timeout-options?}
+
+* Generally, the defaults for Workflow Timeouts are sufficient. We do *not recommend changing the defaults*.
+ * Workflow Execution Timeout and Workflow Run Timeout default to infinite
+ * Workflow Task Timeout defaults to 10 seconds
+ * Potential reasons to increase WFT timeout:
+ * Time consuming data converters
+ * Large WF history
+ * Maximum value is 120 seconds
+ * Can be overridden at a namespace level via Dynamic Config via a support ticket
+ * Monitor CPU & Garbage Collection (if applicable) and the following SDK metrics
+ * workflow\_task\_execution\_latency
+ * request\_failure on the API RespondWorkflowTaskCompleted
+ * Possible causes of Workflow Task Timeouts include
+ * CPU going over 100%
+ * Starting a large number of async Activities whose combined inputs exceed 4 MB
+ * Returning more than 2 MB of data
+ * Workflow tasks will retry with a backoff until a maximum retry interval of 10 minutes is reached \[[ref](https://temporaltechnologies.slack.com/archives/CTEFJ76QG/p1721703278870979)\]
+* If you need to timeout a Workflow, use explicit Timers within the Workflow, rather than setting exec or run timeouts.
+
+### Do they understand **Workflow Failure** / error / exception handling? {#β-do-they-understand-workflow-failure-/-error-/-exception-handling?}
+
+* Any error that is **not** a [Temporal Failure](https://docs.temporal.io/references/failures) will fail the WF Task, which will be retried indefinitely with exponential backoff until it succeeds
+ * The exception is the Go SDK, where \`error\` fails the WF and \`panic\` fails the WFT
+
+### Do they set a **Workflow Retry** policy? {#β-do-they-set-a-workflow-retry-policy?}
+
+* If so, why?
+ * This is generally not recommended.
+ * By design, a Workflow should not fail due to intermittent issues
+
+## Child Workflows {#child-workflows}
+
+### Do they use [**Child Workflows**](https://docs.temporal.io/workflows#child-workflow)? {#β-do-they-use-child-workflows?}
+
+* **Do** use Child Workflows strategically to:
+ * Partition large workloads into smaller chunks to stay under history size limits
+ * Target specific hosts (e.g. Task Queue) due to security, workload profile, or other *strategic* reasons
+ * Extract behavior to simplify or explicitly define team ownership (e.g. [Shared Kernel](https://yoan-thirion.gitbook.io/knowledge-base/software-architecture/ddd-re-distilled#shared-kernel))
+* **Do not** use Child Workflows to:
+ * Organize code
+ * Use standard features of your programming language (e.g. packages, objects, structs, etc.) for code organization and modularity
+ * Reduce cost
+ * Child WFs will result in more events and actions than just using an Activity within the main WF (and on cloud Child WF counts as 2 actions)
+* [When to use a Child Workflow versus an Activity](https://docs.temporal.io/encyclopedia/child-workflows#when-to-use-child-workflows)
+ * **When in doubt, use an Activity**
+* [Valid reasons to use a Child Workflow](https://community.temporal.io/t/purpose-of-child-workflows/652/2)
+* As of July 2025, starting hundreds of Child Workflows per parent can cause multi-minute delays
+ * Details [here](https://temporaltechnologies.slack.com/archives/C03HNADRLKY/p1752262100562919?thread_ts=1752262034.541529&cid=C03HNADRLKY)
+
+### **How many** Child WFs are being started by a Parent? {#β-how-many-child-wfs-are-being-started-by-a-parent?}
+
+* A single Parent SHOULD NOT start more than 1000 Children (per [docs](https://docs.temporal.io/workflows#when-to-use-child-workflows))
+ * **Note** that this is *not* a hardcoded limit, but rather a code smell.
+ * See [this thread](https://temporaltechnologies.slack.com/archives/C01RN061UMR/p1660750792060049?thread_ts=1660747402.058829&cid=C01RN061UMR) for a discussion where this "limit" was clarified as being *guidance*, not *absolute.*
+ * Also see this [discussion](https://temporaltechnologies.slack.com/archives/C03BY3HR2RH/p1669136625638709?thread_ts=1669121603.628799&cid=C03BY3HR2RH) where an alternative is proposed to remove the code-smell.
+ * Alternatives are proposed [here](https://community.temporal.io/t/batch-processing-vs-multiple-workflows/1688/2)
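+
+One way to stay under the guidance above is to give each Child a batch of work instead of a single item. The sketch below is an assumption for illustration, not necessarily the approach in the linked threads; `partition` and `max_children` are hypothetical names.
+
+```python
+def partition(items: list, max_children: int) -> list[list]:
+    """Split items into at most max_children contiguous batches."""
+    if max_children < 1:
+        raise ValueError("max_children must be at least 1")
+    # Ceiling division gives the smallest batch size that fits the limit.
+    batch_size = max(1, -(-len(items) // max_children))
+    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
+
+
+# 2500 work items, with the Parent limited to 1000 Children.
+batches = partition(list(range(2500)), max_children=1000)
+```
+
+Each batch would then become the input of one Child Workflow, so the Parent's Child count is bounded regardless of input size.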
+
+### Do the Parent and Child WF **need to share state**? {#β-do-the-parent-and-child-wf-need-to-share-state?}
+
+* If so, they can only communicate via Signals
+ * Local state cannot be shared
+
+### Does the Parent **need to wait** on the Child Workflow result? {#β-does-the-parent-need-to-wait-on-the-child-workflow-result?}
+
+* Review the [Parent Close Policy](https://docs.temporal.io/encyclopedia/child-workflows#parent-close-policy) configuration
+
+## Activities {#activities}
+
+### Do they understand Activity [**Idempotency**](https://docs.temporal.io/activities#idempotency) as a best practice? {#β-do-they-understand-activity-idempotency-as-a-best-practice?}
+
+* An Activity Definition may be executed multiple times during failure scenarios.
+ * An Activity will only be executed *once* if it is successful, but has *at-least-once* semantics due to potential failure during execution
+* We recommend using idempotency keys
+* See [https://temporal.io/blog/idempotency-and-durable-execution](https://temporal.io/blog/idempotency-and-durable-execution)
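+
+A minimal sketch of the idempotency-key idea in plain Python. `PaymentGateway` is a hypothetical downstream service that deduplicates by key, and deriving the key from a Workflow Id plus an Activity Id is an assumption made for illustration; any key that is stable across retries of the same logical operation would serve.
+
+```python
+class PaymentGateway:
+    """Hypothetical downstream service that deduplicates by idempotency key."""
+
+    def __init__(self) -> None:
+        self._receipts: dict[str, str] = {}
+        self.charges_made = 0
+
+    def charge(self, idempotency_key: str, amount_cents: int) -> str:
+        # A repeated key returns the stored result instead of charging again.
+        if idempotency_key in self._receipts:
+            return self._receipts[idempotency_key]
+        self.charges_made += 1
+        receipt = f"receipt-{self.charges_made}"
+        self._receipts[idempotency_key] = receipt
+        return receipt
+
+
+def charge_activity(gateway: PaymentGateway, workflow_id: str,
+                    activity_id: str, amount_cents: int) -> str:
+    # The key is identical on every retry of this Activity.
+    key = f"{workflow_id}:{activity_id}"
+    return gateway.charge(key, amount_cents)
+
+
+gateway = PaymentGateway()
+first = charge_activity(gateway, "order-42", "charge-1", 500)
+retry = charge_activity(gateway, "order-42", "charge-1", 500)  # simulated retry
+```
+
+Both calls return the same receipt, and the customer is charged once.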
+
+### Are Activities **short-term or long-running**? {#β-are-activities-short-term-or-long-running?}
+
+* Are [timeouts](#β-what-are-the-activity-timeout-settings?) set appropriately for the Activity duration?
+* There is not a firm definition of "short" (perhaps a few minutes or less?)
+* "Long" running activities should [Heartbeat](https://docs.temporal.io/activities#activity-heartbeat)
+ * Use a short Heartbeat Timeout value
+ * Heartbeat frequently
+ * Include custom information/payload on the Heartbeat
+ * For saving progress.
+ * Do they understand that Heartbeats are [Throttled](https://docs.temporal.io/activities#throttling) by the SDK?
+ * An activity is a unit of failure detection (through timeouts), retries and visibility. It is OK to pack multiple operations in a single activity if you are OK with specifying a single timeout for all of them together and retrying them together. It also makes troubleshooting harder as it is less clear at which point the process is having issues. (source: Max in [community slack channel](https://temporalio.slack.com/archives/C04S80QKB2Q/p1711302488204919?thread_ts=1711297768.296809&cid=C04S80QKB2Q))
+* OK but what about *really long running (months)?*
+ * See [this excellent thread](https://temporaltechnologies.slack.com/archives/C04NYM5D3U6/p1757449292278879?thread_ts=1757447403.123839&cid=C04NYM5D3U6) in slack
+ * Heartbeats, [signal-back-to workflow pattern](https://docs.temporal.io/activity-execution#when-to-use-async-completion), async completion
+
+### What are the **Input and Output sizes** for Activity Payloads? {#β-what-are-the-input-and-output-sizes-for-activity-payloads?}
+
+* Is there risk of reaching the 2MB Blob Size Limit for Payloads?
+ * Or the History total size limit of 50MB?
+* Should they only pass references to the data, rather than the actual data?
+* Should they use sticky Activity queues (also known as "Worker-specific Activities") to allow for the data to be stored locally and shared across Activities?
+* Are they compressing the payloads via a Data Converter?
+ * Compression is recommended by default.
+
+### Is the Activity performing any **polling**? {#β-is-the-activity-performing-any-polling?}
+
+* [What is the best practice for a polling activity?](https://community.temporal.io/t/what-is-the-best-practice-for-a-polling-activity/328/2)
+* If polling interval is frequent, perform polling within the Activity using iteration
+* If polling interval is infrequent, then perform polling by using Retry Options
+* Can they use a Signal or [Async Activity Completion](https://docs.temporal.io/activities#asynchronous-activity-completion) approach instead?
+
+### Is the Activity **listening** on a port or socket? {#β-is-the-activity-listening-on-a-port-or-socket?}
+
+* The listening process should be run outside the Workflow.
+ * Use Signal or SignalWithStart to start the Workflow from the listening process.
+
+### What are the Activity **Timeout** settings? {#β-what-are-the-activity-timeout-settings?}
+
+* [Schedule-To-Start](https://docs.temporal.io/activities#schedule-to-start-timeout) should generally **not** be set (default is infinite), but the temporal\_activity\_schedule\_to\_start\_latency metric should be monitored. Schedule-To-Start should only be set if the Workflow needs to take action when Workers are busy or otherwise unavailable, for example when using host-specific Task Queues.
+* Either [Start-To-Close](https://docs.temporal.io/activities#start-to-close-timeout) or [Schedule-To-Close](https://docs.temporal.io/activities#schedule-to-close-timeout) **must** be set.
+ * Setting Start-To-Close is **strongly** recommended
+ * This timer resets on each retry
+ * Schedule-To-Close default is infinite
+ * This timer is inclusive of all retries (i.e. it does **not** reset on each retry)
+* Each Activity should be individually considered for its own optimal timeout settings
+ * [One does not simply](https://www.dictionary.com/e/memes/one-does-not-simply/) use the same Timeout settings for every Activity in a WF
+* Do they know Activity Timeout is enforced on the server side?
+ * Timeout setting should be **greater** than the longest potential time in which the Activity would complete under normal circumstances.
+ * In other words, calls made inside the Activity (e.g. DB writes, API requests) should enforce their own **shorter** timeouts on the Worker side. Otherwise, the server-side timeout can fire while a call is still in flight, leading to duplicate actions and resource contention when Activity Retry kicks in.
+
+ [The 4 Types of Activity timeouts](https://temporal.io/blog/activity-timeouts)
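+
+The layering of a shorter Worker-side timeout under the server-side Activity timeout can be sketched in plain `asyncio`. The constants, `slow_api_call`, and `activity_body` are hypothetical and exist only to illustrate the relationship between the two timeouts.
+
+```python
+import asyncio
+
+START_TO_CLOSE_SECONDS = 0.5   # hypothetical server-side Activity timeout
+CALL_TIMEOUT_SECONDS = 0.05    # hypothetical Worker-side timeout, deliberately shorter
+
+
+async def slow_api_call() -> str:
+    # Stand-in for a DB write or API request that hangs.
+    await asyncio.sleep(1.0)
+    return "ok"
+
+
+async def activity_body() -> str:
+    try:
+        # Give up on the call well before the Activity timeout would fire.
+        return await asyncio.wait_for(slow_api_call(), timeout=CALL_TIMEOUT_SECONDS)
+    except asyncio.TimeoutError as exc:
+        # Fail the attempt explicitly so a retry starts from a clean state.
+        raise RuntimeError("downstream call timed out") from exc
+
+
+try:
+    outcome = asyncio.run(activity_body())
+except RuntimeError as err:
+    outcome = f"failed fast: {err}"
+```
+
+Because the call is cancelled locally first, a retried attempt does not overlap with a request that is still in flight.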
+
+### Do Activity Tasks run on Workers separate from the Workflow Tasks? {#β-do-activity-tasks-run-on-workers-separate-from-the-workflow-tasks?}
+
+* Do they require optimized compute resources or hardware? (e.g. CPU/Mem, GPUs)
+
+### Do sequential Activities run on the same Worker (i.e. are they sticky)? {#β-do-sequential-activities-run-on-the-same-worker-(i.e.-are-they-sticky)?}
+
+* Are there large payloads that are used by multiple Activities that can't or shouldn't go through Temporal?
+
+### Do they use [**Local Activities**](https://docs.temporal.io/activities#local-activity)? {#β-do-they-use-local-activities?}
+
+* **We recommend using regular Activities unless your use case requires very high throughput and large Activity fan outs of very short-lived Activities.**
+* How long is the Local Activity expected to run?
+ * Should not run for more than a few seconds, *inclusive of retries*
+* With LA, they lose the ability to rate limit & route tasks to workers
+* Reference: [Local Activity vs Activity](https://community.temporal.io/t/local-activity-vs-activity/290/3)
+* Reference: [Regular Activities vs Local Activities](https://docs.google.com/presentation/d/1BFipVEynxs5-fC7aeOsPO4xSeWan5L7-0WTdIi1DZU0/edit?slide=id.g36c5e7f8258_1_781#slide=id.g36c5e7f8258_1_781)
+
+### What is the [**Activity Retry**](https://docs.temporal.io/retry-policies) policy? {#β-what-is-the-activity-retry-policy?}
+
+* Each Activity should be individually considered for its own optimal retry settings.
+* Do they use the default policy or set a custom policy?
+* Do they configure any specific [Non-Retryable](https://docs.temporal.io/retry-policies#non-retryable-errors) errors?
+* Reference: [https://docs.temporal.io/encyclopedia/detecting-activity-failures](https://docs.temporal.io/encyclopedia/detecting-activity-failures)
+* Reference: [https://temporal.io/blog/failure-handling-in-practice](https://temporal.io/blog/failure-handling-in-practice)
+
+### Do they understand **Activity Failure** / error / exception handling? {#β-do-they-understand-activity-failure-/-error-/-exception-handling?}
+
+* If an Activity Execution fails, the error is returned to the Workflow, which decides how to handle it.
+* Reference: [https://docs.temporal.io/references/failures](https://docs.temporal.io/references/failures)
+
+### Would an Activity need to receive [**Cancellation**](https://docs.temporal.io/activities#cancellation)? {#β-would-an-activity-need-to-receive-cancellation?}
+
+* If so, the Activity *must* Heartbeat (or be a Local Activity in Core-based SDKs only).
+
+### Do they use too many Activities, or too few?
+
+* Guidance here: [https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow](https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow)
+
+## Signals {#signals}
+
+### Do they use [**Signals**](https://docs.temporal.io/workflows#signal)? {#β-do-they-use-signals?}
+
+* If not, should they be?
+ * Do they have a need to update the state of a Workflow during its execution?
+* If so, what is sending the Signal?
+ * Is it another Workflow or a Temporal Client in another application?
+
+### Do they understand that Signals are… {#β-do-they-understand-that-signals-areβ¦}
+
+* Recorded in the Event History (i.e. they will be replayed when a Workflow is replayed)
+* Delivered to a Workflow as part of the next scheduled Workflow Task
+ * Therefore, there may be some latency in delivery, depending on the current Workflow Task completing and the next being scheduled
+* Delivered in the order they are received
+* A single Workflow can only handle a few Signals per second (≤5/sec). Flooding it with Signals can prevent it from continuing-as-new and eventually cause it to hit the Event History limits.
+
+### Are they checking the return value or exceptions from sending a Signal?
+
+* A Signal call can return errors or throw exceptions when:
+ * A workflow execution doesn't exist
+ * A workflow execution is closed
+* See [Problems When Sending a Signal](https://docs.temporal.io/develop/java/message-passing#message-handler-troubleshooting), and SDK-specific info, for example [Java](https://docs.temporal.io/develop/java/message-passing#message-handler-troubleshooting):
+ * The Client can't contact the server: You'll receive a [WorkflowServiceException](https://javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/client/WorkflowServiceException.html) on which the cause is a [StatusRuntimeException](https://grpc.github.io/grpc-java/javadoc/io/grpc/StatusRuntimeException.html) and status of UNAVAILABLE (after some retries).
+ * The Workflow does not exist: You'll receive a [WorkflowNotFoundException](https://javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/client/WorkflowNotFoundException.html).
+
+### Are Signal handlers **idempotent** and **deterministic**? {#β-are-signal-handlers-idempotent-and-deterministic?}
+
+* It is possible (though unlikely) that a Signal could be delivered more than once ([reference](https://docs.temporal.io/encyclopedia/application-message-passing#footnote-label)).
+* Signal-handling code is Workflow code; therefore, it must adhere to the constraints of Workflow code (e.g. determinism)
+
+### Is there a **high rate or volume** of Signals? {#β-is-there-a-high-rate-or-volume-of-signals?}
+
+* The recommended guidance is to limit Signals to ≤5 per second sustained for optimal Workflow performance
+ * The theoretical limit is tied to database latency (1/database\_latency), so \~20/sec with 50ms latency
+* All updates to a Workflow are performed under a single lock, and Workflows need resources for operations beyond Signal processing (executing Workflow Tasks, Activities, etc.), all of which require database updates.
+ * As a result, a high volume of Signals or Updates can prevent Workflow progress
+* Is the total *volume* of Signals a concern for exceeding the Event History size limits?
+ * A single Workflow Execution is limited to 10000 Signals received in Temporal Cloud
+* Does the Workflow receiving the Signal also use Continue-As-New?
+ * Is there potential for the *rate* of Signals to be too fast for Continue-As-New to be successfully invoked?
+ * In order for CAN to run there must be a moment (\~100ms) when there are no unhandled Signals in the Workflow
+ * If the workflow cannot continue-as-new then the event history size increases until a WorkflowExecutionTerminated occurs with reason: Workflow History Size/count exceeds limit
+* Signals briefly lock a workflow execution; many Signals to the same workflow can add latency and limit throughput
+ * [Engineering/latency slack thread](https://temporaltechnologies.slack.com/archives/C02LGH6BG3A/p1684975109797379?thread_ts=1684963896.489869&cid=C02LGH6BG3A)
+ * [SA slack design discussion about signal volume](https://temporaltechnologies.slack.com/archives/C03HNADRLKY/p1752073205981859?thread_ts=1752008525.412319&cid=C03HNADRLKY).
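+
+The rate guidance above can be checked with simple arithmetic. The following is a minimal sketch, not a Temporal API: `max_signal_rate` and `within_guidance` are hypothetical helper names, and the only inputs are the figures quoted above (a ceiling of 1/database\_latency and a recommended sustained rate of 5 Signals per second).
+
+```python
+def max_signal_rate(db_latency_seconds: float) -> float:
+    """Theoretical per-Workflow ceiling: one locked update per database round trip."""
+    if db_latency_seconds <= 0:
+        raise ValueError("database latency must be positive")
+    return 1.0 / db_latency_seconds
+
+
+def within_guidance(signals_per_second: float, recommended: float = 5.0) -> bool:
+    """True when the sustained rate stays at or below the recommended rate."""
+    return signals_per_second <= recommended
+
+
+if __name__ == "__main__":
+    # 50 ms of database latency gives a ceiling of roughly 20 Signals per second,
+    # but that ceiling leaves no room for Workflow Tasks or Activities.
+    print(max_signal_rate(0.050))
+    print(within_guidance(12.0))
+```
+
+The gap between the ceiling and the recommended rate is the headroom the Workflow needs for everything other than Signal handling.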
+
+### β Do Signal handlers **invoke Activities**? {#β-do-signal-handlers-invoke-activities?}
+
+* This should be avoided. Try to limit the scope of the Signal handler to updating the Workflow state. Let the Workflow code react to state changes to invoke Activities.
+ * [Recommendation from Max in the Community Forum](https://community.temporal.io/t/signal-method-invocation-and-workflow-thread-safety/1679/2)
+
+## Queries {#queries}
+
+### β Do they use [**Queries**](https://docs.temporal.io/workflows#query)? {#β-do-they-use-queries?}
+
+* If not, should they be?
+ * Are they manually storing or exporting Workflow state elsewhere during execution? Why?
+
+### β Do they understand that Queries areβ¦ {#β-do-they-understand-that-queries-areβ¦}
+
+* A synchronous operation
+* Available for both *running* and *completed* Workflows
+ * Note: a Worker must be running and listening on the Task Queue
+
+### β Do Query handlers perform **read-only** operations? {#β-do-query-handlers-perform-read-only-operations?}
+
+* Queries must never mutate the state of a Workflow
+
+## Update {#update}
+
+### β Do they use [**Update**](https://docs.temporal.io/workflows#update)? {#β-do-they-use-update?}
+
+* The Update handler function must be idempotent and deterministic
+* The handler function runs as part of the Workflow code and is subject to Workflow code constraints
+
+### β Do they perform **Validation** on the Update request? {#β-do-they-perform-validation-on-the-update-request?}
+
+* This is optional (although recommended)
+ * Validation function cannot mutate the Workflow state
+ * Validators have the same basic restrictions as Queries
+* Updates that are rejected due to Validation are not recorded in the event history.
+
+### β Is it ok to invoke activities within the Update Handler? {#β-is-it-ok-to-invoke-activities-within-the-update-handler?}
+
+* [From Maxim](https://temporaltechnologies.slack.com/archives/CTEFJ76QG/p1713543088839059?thread_ts=1709568936.227519&cid=CTEFJ76QG):
+ * Except in Java, it's OK to invoke Activities within the Update handler. In Java this is advised against because of how the system thread works at the per-handler level.
+
+### β Do they use Early Return to reduce latency?
+
+* See [Temporal Interactive Latency Options](https://docs.google.com/presentation/d/1dYU3lug3PdbliyEH2X_I9L2A27VA9CXfl5Jnw4t_azE/edit?pli=1&slide=id.g3129f517e9e_0_109#slide=id.g3129f517e9e_0_109) for techniques
+
+## Workers & Task Queues {#workers-&-task-queues}
+
+### β How many Workers do they run? {#β-how-many-workers-do-they-run?}
+
+* Run at least 2 for high availability
+
+### β Do Workers register all Workflows and Activities that can be dispatched on the Task Queue? {#β-do-workers-register-all-workflows-and-activities-that-can-be-dispatched-on-the-task-queue?}
+
+* All Workers listening to a given Task Queue must have identical registrations of Activities and/or Workflows
+
+### β Do any Activities run on unique Task Queues? {#β-do-any-activities-run-on-unique-task-queues?}
+
+* Reasons to do so:
+ * Rate-limiting
+ * Activities that need to run on the same worker
+ * Special purpose workers, e.g. GPUs
+ * You can do this for [Differing priority for the work](https://community.temporal.io/t/activity-with-priorities/3398/5) \- but see [TQ Priority and Fairness](https://docs.temporal.io/develop/task-queue-priority-fairness) for an easier way
+
+### β Do they configure rate-limiting for a Worker or Task Queue? {#β-do-they-configure-rate-limiting-for-a-worker-or-task-queue?}
+
+* Worker-side rate limiting
+ * \`maxConcurrentWorkflowTaskExecutionSize\`, \`maxConcurrentActivityExecutionSize\` and \`maxConcurrentLocalActivityExecutionSize\` define the number of total available slots for that Worker
+ * \`**maxWorkerActivitiesPerSecond**\`
+* Server-side rate limiting
+ * \`maxTaskQueueActivitiesPerSecond\`
+ * Must set the *same* value in each Worker that connects to the TQ
+* [Discussion about rate limiting options and best practices](https://community.temporal.io/t/rate-limit-configuration-and-best-practices/5498/2)
+* [Task Queue Activities per second architecture Diagram](https://lucid.app/lucidchart/408e3936-667b-435d-8fc3-603771ce0298/edit?invitationId=inv_465a4565-10e2-41af-b1a3-daa9280c43ac&page=0_0#)
+
+### β Do they require strict ordering for Tasks? {#β-do-they-require-strict-ordering-for-tasks?}
+
+* Without Priority and Fairness, Task Queues **do not** have any ordering guarantees
+ * For example, Tasks *across* separate WF executions may be executed in an order different than they were received.
+ * However, the order of executing *within* a single WF is fully controlled by the WF logic.
+ * Also, Signals for a single WF execution will always be delivered in the order they were received.
+* TQ [Priority and Fairness](https://docs.temporal.io/develop/task-queue-priority-fairness) allow for adjustments to how tasks are distributed in a task queue.
+ * [Priority](https://docs.temporal.io/develop/task-queue-priority-fairness#task-queue-priority) allows [Tasks](https://docs.temporal.io/tasks) to be executed in Priority order.
+ * [Fairness](https://docs.temporal.io/develop/task-queue-priority-fairness#task-queue-fairness) prevents one set of Tasks from blocking others within the same priority level.
+ * You can use Priority and Fairness individually or combine them to express Fairness within a Priority level.
+
+
+### β Can multiple workflows share one Task Queue? When should multiple task queues be used? {#β-can-multiple-workflows-share-one-task-queue?-when-should-multiple-task-queues-be-used?}
+
+* You can use multiple workflows per task queue
+* Reasons to use multiple queues:
+ * Multiple deployment units (services).
+ * Rate limiting and flow control.
+ * Routing to specific hosts.
+ * More info in [this community post from Maxim](https://community.temporal.io/t/in-what-situation-should-we-use-multiple-separated-task-queues/1254).
+
+## Timers {#timers}
+
+### β What is the duration set for a timer or sleep? {#β-what-is-the-duration-set-for-a-timer-or-sleep?}
+
+* The shortest reliable duration is 1 second
+ * Anything less than one second may not be reliable
+
+## Schedules {#schedules}
+
+### β Do they use [**Schedules**](https://docs.temporal.io/workflows#schedule)? {#β-do-they-use-schedules?}
+
+* How are they creating and managing the Schedule?
+ * SDK or CLI or Web UI?
+
+### β How many Schedules will they have? {#β-how-many-schedules-will-they-have?}
+
+* Will many run at the same time?
+ * Consider setting **Jitter** to offset execution, so they do not all run at once
+* What is the potential total Actions Per Second across all Schedules in a namespace?
+ * Temporal Cloud has a default limit of 10 (and can be raised to 100 with a short turnaround \- a couple of business days)
+ * If greater than 100, then the request to raise may take longer and should involve a discussion with Temporal engineering
+
+### β Are they using Schedules to achieve βdelayed startβ? {#β-are-they-using-schedules-to-achieve-βdelayed-startβ?}
+
+* As of 2024-02-20, [Start Delay](https://docs.temporal.io/workflows#delay-workflow-execution) is an experimental feature available in Go, Java and Python, and this should be considered.
+* Alternatively, they should use a normal Workflow with a timer/sleep at the beginning (instead of a Schedule with a run limit of 1).
+ * A normal Workflow will be cheaper for them, and easier to scale.
+
+### β Will they need to [**Pause**](https://docs.temporal.io/workflows#pause) and/or [**Backfill**](https://docs.temporal.io/workflows#backfill) Schedules? {#β-will-they-need-to-pause-and/or-backfill-schedules?}
+
+* Will they allow Backfills to run in parallel (AllowAll overlap policy) or sequentially (BufferAll)?
+ * There is currently an (undocumented) limit of 1000 on the number of executions that can be Buffered. They will need to batch or partition Backfill executions to stay under this limit.
+
+### β What is the configured [**Overlap Policy**](https://docs.temporal.io/workflows#overlap-policy)? {#β-what-is-the-configured-overlap-policy?}
+
+* The default is Skip (i.e. nothing happens; the Workflow Execution is not started)
+
+### β Do executions depend on [**Last Completion Result**](https://docs.temporal.io/workflows#last-completion-result)? {#β-do-executions-depend-on-last-completion-result?}
+
+* If their Overlap Policy allows overlaps, ensure they understand that the last completion means the run that successfully completed when the new run was started (and there may have been other executions started and currently running)
+
+## Cron Jobs {#cron-jobs}
+
+### β Do they use [**Cron Jobs**](https://docs.temporal.io/workflows#temporal-cron-job)? {#β-do-they-use-cron-jobs?}
+
+* Why?
+ * Suggest using [Schedules](#schedules) instead
+
+### β Are they aware of Cron limitations? {#β-are-they-aware-of-cron-limitations?}
+
+* A Cron Workflow should not call Continue As New, as the cron schedule will be dropped/lost
+
+## Side Effects {#side-effects}
+
+### β Do they use [**Side Effects**](https://docs.temporal.io/workflow-execution/event#side-effect)? {#β-do-they-use-side-effects?}
+
+* If there is *any* chance the Side Effect could fail, use an Activity instead.
+* Side Effects are not implemented in Core SDKs. Use local activities instead.
+
+## Data Converter {#data-converter}
+
+### β Do they use a custom [**Data Converter**](https://docs.temporal.io/dataconversion)? {#β-do-they-use-a-custom-data-converter?}
+
+* Why?
+ * For encryption, compression, custom serialization, etc?
+ * Compression is recommended
+* If used for encryption, is there a way to rotate keys?
+* Search Attribute values are *not* processed through a [custom Data Converter](https://docs.temporal.io/dataconversion#custom-data-converter).
+
+### β Are they using datetime/duration data types as input or return parameters? {#β-are-they-using-datetime/duration-data-types-as-input-or-return-parameters?}
+
+* The challenge with datetime/duration data types is that some languages, like Python and Go, don't know how to natively serialize them, and JSON doesn't support a datetime/duration data type. That means it is up to every JSON library to decide whether it implements custom serialization for datetime/duration data types.
+* If you are using Python or Go, and/or plan on using multiple languages, you will need to implement your own custom Data Converter. This will require a common format that each SDK can convert back to its native datetime/duration data type.
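+
+One way to picture the "common format" mentioned above is sketched below using only the Python standard library. This is not the Temporal Data Converter API. The tag key `__type__`, the choice of ISO 8601 in UTC for datetimes, and integer milliseconds for durations are all assumptions made for illustration; a real custom Data Converter would wrap logic like this in the SDK's converter interfaces.
+
+```python
+import json
+from datetime import datetime, timedelta, timezone
+
+TYPE_KEY = "__type__"  # hypothetical tag that marks an encoded value
+
+
+def encode_time_value(value):
+    """json.dumps fallback: turn datetime/timedelta into tagged, portable JSON."""
+    if isinstance(value, datetime):
+        if value.tzinfo is None:
+            raise ValueError("naive datetimes are ambiguous across services")
+        return {TYPE_KEY: "datetime", "value": value.astimezone(timezone.utc).isoformat()}
+    if isinstance(value, timedelta):
+        # Whole milliseconds; any sub-millisecond precision is dropped.
+        return {TYPE_KEY: "duration_ms", "value": value // timedelta(milliseconds=1)}
+    raise TypeError(f"cannot serialize {type(value).__name__}")
+
+
+def decode_time_value(obj):
+    """json.loads object_hook: restore tagged values to native types."""
+    kind = obj.get(TYPE_KEY)
+    if kind == "datetime":
+        return datetime.fromisoformat(obj["value"])
+    if kind == "duration_ms":
+        return timedelta(milliseconds=obj["value"])
+    return obj
+
+
+def dumps(payload) -> str:
+    return json.dumps(payload, default=encode_time_value)
+
+
+def loads(text: str):
+    return json.loads(text, object_hook=decode_time_value)
+```
+
+Whatever format is chosen, every SDK in use needs a matching encoder and decoder, which is why agreeing on it early matters for multi-language deployments.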
+
+### β What latency is added by the custom Data Converter or Payload Codec? {#β-what-latency-is-added-by-the-custom-data-converter-or-payload-codec?}
+
+* Do these functions execute in the Workflow context, and contribute to Workflow Task Execution?
+ * Could they result in timeouts?
+
+## Visibility {#visibility}
+
+### β Do they use [**Search Attributes**](https://docs.temporal.io/visibility#search-attribute)? {#β-do-they-use-search-attributes?}
+
+* These should be used for operational purposes only, and **not part of any business logic** in the Workflows
+ * For Workflows, use Queries or store the data in external datastore
+* They should not contain sensitive data (e.g. PII), as SA values are not encoded by Data Converters or Payload Codecs
+* Are they storing too much data in Search Attributes?
+ * Is the data related to the execution of the Workflow?
+ * Are they going to be querying this data in the UI or the CLI?
+* Are they aware that Search Attribute updates are eventually consistent, i.e. there will be a delay (\~1-2 seconds) before the value is updated in the Visibility data store?
+
+### β Do they use [**Memos**](https://docs.temporal.io/workflow-execution#memo)? {#β-do-they-use-memos?}
+
+* These are *eventually consistent*. The data may not be up-to-date when retrieved through the describe or list workflow operations.
+* Reference: [Memo vs Search Attributes vs Visibility Records](https://community.temporal.io/t/memo-vs-serach-attributes-vs-visibility-records/3003)
+
+### β Do they use any of the Visibility APIs within their app? {#β-do-they-use-any-of-the-visibility-apis-within-their-app?}
+
+* These are *eventually consistent*. The data may not be up-to-date.
+
+## Versioning {#versioning}
+
+### β Do they (plan to) use [**Workflow Versioning**](https://docs.temporal.io/workflow-definition#workflow-versioning) (aka **Patch** Versioning) APIs within their app? {#β-do-they-(plan-to)-use-workflow-versioning-(aka-patch-versioning)-apis-within-their-app?}
+
+* Do they need to, or can they use [**Worker** Versioning](https://docs.temporal.io/worker-versioning) instead?
+ * *Don't recommend yet as it is pre-release and the API is changing*
+ * Worker Versioning replaces **Task Queue** Versioning \- where you would have recommended TQ Versioning in the past, suggest Worker Versioning now.
+* When do they plan to remove the Versions / Patches?
+ * Never let the Versions / Patches accumulate indefinitely; doing so leads to code that is difficult to maintain.
+* How long do the Workflows run for?
+ * They should have a plan to remove the Version/Patches after a finite period of time, e.g. 2 weeks or 30 days.
+ * Long-running Workflows need to keep patches in place until they either terminate or execute a Continue-As-New. Patches also need to be kept around if you plan on Querying them after they are closed. Patches should only be removed once the retention period has expired.
+ * For short running workflows, suggest **Worker** Versioning instead
+
+### β Do they **test** Version changes **using** the Workflow **Replayer**? {#β-do-they-test-version-changes-using-the-workflow-replayer?}
+
+* WorkflowReplayer in [Go](https://docs.temporal.io/dev-guide/go/testing#replay), [Java](https://docs.temporal.io/dev-guide/java/testing#replay)
+* worker.runReplayHistory in [TypeScript](https://docs.temporal.io/dev-guide/typescript/testing#replay)
+* replay\_workflow in [Python](https://docs.temporal.io/dev-guide/python/testing#replay)
+
+## Continue As New
+
+### β Do they use Continue As New?
+
+* With CAN, remember:
+ * Timers are not carried over. Calculate the remaining time, pass it to the new run, and schedule the timers again there.
+ * Child Workflows that are not started with parentClosePolicy ABANDON will be terminated (or sent a cancellation request with REQUEST\_CANCEL, depending on the parentClosePolicy) when the parent workflow closes
+ * Ensure you wait for pending Activities to finish (completed, failed, etc.) before CAN
+ * Use Workflow.isEveryHandlerFinished() to ensure Signal and Update handlers have finished executing before CAN
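+
+The timer point above can be sketched as follows. This is plain Python rather than SDK code, and `remaining_delay` is a hypothetical helper. It assumes the Workflow passes an absolute deadline to the next run as an input, and that `now` comes from the SDK's deterministic Workflow clock rather than the system clock.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+
+def remaining_delay(deadline: datetime, now: datetime) -> timedelta:
+    """Time left on a timer that must be re-created after Continue-As-New.
+
+    Never returns a negative duration: a deadline already in the past
+    means the new run should fire the timer immediately.
+    """
+    return max(deadline - now, timedelta(0))
+
+
+if __name__ == "__main__":
+    deadline = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
+    # The previous run slept for 20 of the 30 minutes before calling CAN.
+    now_in_new_run = datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc)
+    print(remaining_delay(deadline, now_in_new_run))
+```
+
+Passing the absolute deadline, instead of a pre-computed remaining duration, avoids drift from the time that elapses between the old run closing and the new run starting.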
+
+## Interceptors {#interceptors}
+
+Use Interceptors to change workflow and activity behavior at the worker level.
+For example, if you want some code that runs before every workflow starts, put that in a WorkflowInterceptor named "execute\_workflow".
+Interceptors are powerful and advanced, and they can cause confusing behavior if people forget they have them enabled.
+
+- [Blog that's pretty good](https://platformatory.io/blog/Understanding-Temporal-Interceptors/)
+- [Java Workflow Interceptor docs](https://github.com/temporalio/sdk-java/blob/master/temporal-sdk/src/main/java/io/temporal/common/interceptors/WorkerInterceptor.java)
+
+See Mike's GitHub project \- [https://github.com/temporalio/temporal-interceptor-seed](https://github.com/temporalio/temporal-interceptor-seed)
+
+## Sessions/Worker-Specific Task Queues {#sessions/worker-specific-task-queues}
+
+Sessions are available only in Go
+
+* Note that if the worker process dies before all of the Session's Activities have been processed, the retry will run all of the Session-related Activities again.
+
+Worker-Specific Task Queues are available in every SDK
+
+* Use these to route Activities to specific workers
+
+## Storage Optimization (Long Running Workflows) {#storage-optimization-(long-running-workflows)}
+
+See [Cost Optimization \- Storage](https://docs.google.com/document/d/1CVsZ4kGHMI7X79HzZKPf2MDWBPbT4lpUDz-3yZOlE48/edit#heading=h.bsn3wts9qcyw), and in particular the section on [Continue As New](https://docs.google.com/document/d/1CVsZ4kGHMI7X79HzZKPf2MDWBPbT4lpUDz-3yZOlE48/edit#heading=h.spve8qj2rtv9) (also [this](https://docs.google.com/presentation/d/1mZvV53b49HaqPdj8fZnLTwcHZojDgbedclJYhak8d3o/edit#slide=id.g32c7d9e6922_0_232))
+
+# Appendix {#appendix}
+
+1. ## Best Practices Checklist {#best-practices-checklist}
+
+[Workflows](#workflows)
+
+[β Do they understand Workflow Determinism requirements?](#β-do-they-understand-workflow-determinism-requirements?)
+
+[β Do they understand the Event History?](#β-do-they-understand-the-event-history?)
+
+[β Do they set Workflow Id to a meaningful business identifier (or not)?](#β-do-they-set-workflow-id-to-a-meaningful-business-identifier-\(or-not\)?)
+
+[β Do they explicitly set any Workflow Timeout options?](#β-do-they-explicitly-set-any-workflow-timeout-options?)
+
+[β Do they understand Workflow Failure / error / exception handling?](#β-do-they-understand-workflow-failure-/-error-/-exception-handling?)
+
+[β Do they set a Workflow Retry policy?](#β-do-they-set-a-workflow-retry-policy?)
+
+[Child Workflows](#child-workflows)
+
+[β Do they use Child Workflows?](#β-do-they-use-child-workflows?)
+
+[β How many Child WFs are being started by a Parent?](#β-how-many-child-wfs-are-being-started-by-a-parent?)
+
+[β Do the Parent and Child WF need to share state?](#β-do-the-parent-and-child-wf-need-to-share-state?)
+
+[β Does the Parent need to wait on the Child Workflow result?](#β-does-the-parent-need-to-wait-on-the-child-workflow-result?)
+
+[Activities](#activities)
+
+[β Do they understand Activity Idempotency as a best practice?](#β-do-they-understand-activity-idempotency-as-a-best-practice?)
+
+[β Are Activities short-term or long-running?](#β-are-activities-short-term-or-long-running?)
+
+[β What are the Input and Output sizes for Activity Payloads?](#β-what-are-the-input-and-output-sizes-for-activity-payloads?)
+
+[β Is the Activity performing any polling?](#β-is-the-activity-performing-any-polling?)
+
+[β Is the Activity listening on a port or socket?](#β-is-the-activity-listening-on-a-port-or-socket?)
+
+[β What are the Activity Timeout settings?](#β-what-are-the-activity-timeout-settings?)
+
+[β Do Activity Tasks run on Workers separate from the Workflow Tasks?](#β-do-activity-tasks-run-on-workers-separate-from-the-workflow-tasks?)
+
+[β Do sequential Activities run on the same Worker (i.e. are they sticky)?](#β-do-sequential-activities-run-on-the-same-worker-\(i.e.-are-they-sticky\)?)
+
+[β Do they use Local Activities?](#β-do-they-use-local-activities?)
+
+[β What is the Activity Retry policy?](#β-what-is-the-activity-retry-policy?)
+
+[β Do they understand Activity Failure / error / exception handling?](#β-do-they-understand-activity-failure-/-error-/-exception-handling?)
+
+[β Would an Activity need to receive Cancellation?](#β-would-an-activity-need-to-receive-cancellation?)
+
+[Signals](#signals)
+
+[β Do they use Signals?](#β-do-they-use-signals?)
+
+[β Do they understand that Signals areβ¦](#β-do-they-understand-that-signals-areβ¦)
+
+[β Are Signal handlers idempotent and deterministic?](#β-are-signal-handlers-idempotent-and-deterministic?)
+
+[β Is there a high rate or volume of Signals?](#β-is-there-a-high-rate-or-volume-of-signals?)
+
+[β Do Signal handlers invoke Activities?](#β-do-signal-handlers-invoke-activities?)
+
+[Queries](#queries)
+
+[β Do they use Queries?](#β-do-they-use-queries?)
+
+[β Do they understand that Queries areβ¦](#β-do-they-understand-that-queries-areβ¦)
+
+[β Do Query handlers perform read-only operations?](#β-do-query-handlers-perform-read-only-operations?)
+
+[Update](#update)
+
+[β Do they use Update?](#β-do-they-use-update?)
+
+[β Do they perform Validation on the Update request?](#β-do-they-perform-validation-on-the-update-request?)
+
+[β Is it ok to invoke activities within the Update Handler?](#β-is-it-ok-to-invoke-activities-within-the-update-handler?)
+
+[Workers & Task Queues](#workers-&-task-queues)
+
+[β How many Workers do they run?](#β-how-many-workers-do-they-run?)
+
+[β Do Workers register all Workflows and Activities that can be dispatched on the Task Queue?](#β-do-workers-register-all-workflows-and-activities-that-can-be-dispatched-on-the-task-queue?)
+
+[β Do any Activities run on unique Task Queues?](#β-do-any-activities-run-on-unique-task-queues?)
+
+[β Do they configure rate-limiting for a Worker or Task Queue?](#β-do-they-configure-rate-limiting-for-a-worker-or-task-queue?)
+
+[β Do they require strict ordering for Tasks?](#β-do-they-require-strict-ordering-for-tasks?)
+
+[β Can multiple workflows share one Task Queue? When should multiple task queues be used?](#β-can-multiple-workflows-share-one-task-queue?-when-should-multiple-task-queues-be-used?)
+
+[Timers](#timers)
+
+[β What is the duration set for a timer or sleep?](#β-what-is-the-duration-set-for-a-timer-or-sleep?)
+
+[Schedules](#schedules)
+
+[β Do they use Schedules?](#β-do-they-use-schedules?)
+
+[β How many Schedules will they have?](#β-how-many-schedules-will-they-have?)
+
+[β Are they using Schedules to achieve βdelayed startβ?](#β-are-they-using-schedules-to-achieve-βdelayed-startβ?)
+
+[β Will they need to Pause and/or Backfill Schedules?](#β-will-they-need-to-pause-and/or-backfill-schedules?)
+
+[β What is the configured Overlap Policy?](#β-what-is-the-configured-overlap-policy?)
+
+[β Do executions depend on Last Completion Result?](#β-do-executions-depend-on-last-completion-result?)
+
+[Cron Jobs](#cron-jobs)
+
+[β Do they use Cron Jobs?](#β-do-they-use-cron-jobs?)
+
+[β Are they aware of Cron limitations?](#β-are-they-aware-of-cron-limitations?)
+
+[Side Effects](#side-effects)
+
+[β Do they use Side Effects?](#β-do-they-use-side-effects?)
+
+[Data Converter](#data-converter)
+
+[β Do they use a custom Data Converter?](#β-do-they-use-a-custom-data-converter?)
+
+[β Are they using datetime/duration data types as input or return parameters?](#β-are-they-using-datetime/duration-data-types-as-input-or-return-parameters?)
+
+[β What latency is added by the custom Data Converter or Payload Codec?](#β-what-latency-is-added-by-the-custom-data-converter-or-payload-codec?)
+
+[Visibility](#visibility)
+
+[β Do they use Search Attributes?](#β-do-they-use-search-attributes?)
+
+[β Do they use Memos?](#β-do-they-use-memos?)
+
+[β Do they use any of the Visibility APIs within their app?](#β-do-they-use-any-of-the-visibility-apis-within-their-app?)
+
+[Versioning](#versioning)
+
+[β Do they (plan to) use Workflow Versioning (aka Patch Versioning) APIs within their app?](#β-do-they-\(plan-to\)-use-workflow-versioning-\(aka-patch-versioning\)-apis-within-their-app?)
+
+[β Do they test Version changes using the Workflow Replayer?](#β-do-they-test-version-changes-using-the-workflow-replayer?)
+
+2. ## Useful Links: {#useful-links:}
+
+- [Common Design Patterns](https://taonic.github.io/temporal-design-patterns/)
+- [SDK Features Matrix](https://www.notion.so/temporalio/d86479c52be643c6a7c5f22cef5807e4?v=46d47ff7e32643dbb29950136fb3e5cd)
+- [Temporal 102 deck, covers a lot of topics, with diagrams](https://www.google.com/url?q=https://docs.google.com/presentation/d/1DsK9ZE-XHpLac2jBTf29UUSol1HBghCjzV8-3PRWT7Y/edit?slide%3Did.g2d0bcd56d06_0_392%23slide%3Did.g2d0bcd56d06_0_392&sa=D&source=docs&ust=1752690322127243&usg=AOvVaw1pdJ2swgV1ikET-UTTSmyD) that are discussed above
+- [Temporal Python Troubleshooting Guide](https://github.com/temporalio/dev-success/blob/main/python/troubleshooting_guide.md#the-thread-inside-an-async-def-python-function-is-blocked)
+- [Notes for Code Reviews](https://docs.google.com/document/d/1RiFq1ExYvjNqdLvI_Suuo8rGS1DhqQy3SgrOLcwWaqQ/edit?tab=t.0#heading=h.abk8gemnj13u) \- the Code Review companion guide
\ No newline at end of file
diff --git a/batch_workflow_design_best_practices.md b/batch_workflow_design_best_practices.md
new file mode 100644
index 0000000..765b9ca
--- /dev/null
+++ b/batch_workflow_design_best_practices.md
@@ -0,0 +1,286 @@
+# Batch Workflow Best Practices
+
+---
+
+## Table of Contents
+
+- [Schedules](#schedules)
+- [01 Basic Workflow](#01-basic-workflow)
+- [02 Fan-Out using Basic Child Workflows](#02-fan-out-using-basic-child-workflows)
+- [03 Batch Iterator Workflow](#03-batch-iterator-workflow)
+- [04 Sliding Window Workflow](#04-sliding-window-workflow)
+- [05 MapReduce Tree](#05-mapreduce-tree)
+- [06 Batch Signalling](#06-batch-signalling)
+- [07 Limits](#07-limits)
+
+---
+
+## Schedules
+
+Schedules allow Workflows to be executed on a recurring basis. Think of them as a more powerful Cron:
+
+- Supports `start` / `pause` / `stop` / `update` / `backfill` of scheduled workflow executions
+- Can have overlapping schedules, configurable with **Overlap Policies**
+- Full history visibility
+- Schedules can be created via the UI or CLI
+
+**References:**
+- https://temporal.io/blog/temporal-schedules-reliable-scalable-and-more-flexible-than-cron-jobs
+- https://docs.temporal.io/workflows#schedule
+- https://docs.temporal.io/cli/schedule
+
+```bash
+$ temporal schedule create \
+ --schedule-id 'your-schedule-id' \
+ --workflow-id 'your-workflow-id' \
+ --task-queue 'your-task-queue' \
+ --workflow-type 'YourWorkflowType'
+```
+
+---
+
+## 01 Basic Workflow
+
+This is just a standard workflow.
+
+- Workflow fetches, or is started with, record IDs to process
+- Runs activity/activities required to retrieve and process each record:
+ - Activities can be blocking or non-blocking
+ - If non-blocking, the workflow must block to allow all activities to complete
+ - **Can only have 2k in-flight activities; ideally limit to 500**
+- If the workflow history is likely to exceed 2k events (hard 50k limit), and/or you need Continue-as-New, consider the **Batch Iterator** pattern instead
+
+**Pros:** Simple
+**Cons:** Limited number of records that can be processed; can potentially overwhelm downstream systems; all-or-nothing approach to parallelism
+
+```mermaid
+flowchart TD
+ Records["Record IDs\n(fetched or passed in)"]
+ WF["Workflow"]
+ A1["Activity"]
+ A2["Activity"]
+ AN["Activity ..."]
+
+ Records --> WF
+ WF --> A1
+ WF --> A2
+ WF --> AN
+```
+
+---
+
+## 02 Fan-Out using Basic Child Workflows
+
+Slightly better than the Basic Workflow. Useful when you have between **2K and 4M records**.
+
+- Parent workflow assigns blocks of IDs to child workflows:
+ - IDs can be explicit, e.g. `[1, 2, 3, …, n]`
+ - Better: use **offset and length**
+- Child workflows follow the Basic Workflow pattern
+- If the result of processing isn't needed, use `PARENT_CLOSE_POLICY_ABANDON` on child workflows
+- If workflow history is likely to exceed 2k events (hard 50k limit), and/or you need Continue-as-New, consider the **Batch Iterator** pattern instead
+
+**Pros:** Relatively simple
+**Cons:** Limited number of records that can be processed; can potentially overwhelm downstream systems; all-or-nothing approach to parallelism
+
+```mermaid
+flowchart TD
+ Records["Record IDs"]
+ Parent["Parent Workflow"]
+ C1["Child Workflow\n(offset 0, len N)"]
+ C2["Child Workflow\n(offset N, len N)"]
+ C3["Child Workflow\n(offset 2N, len N)"]
+
+ Records --> Parent
+ Parent --> C1
+ Parent --> C2
+ Parent --> C3
+
+ C1 --> A1["Activities"]
+ C2 --> A2["Activities"]
+ C3 --> A3["Activities"]
+```
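+
+The "offset and length" suggestion above can be sketched as a small planning function. This is illustrative Python, not SDK code: `plan_blocks` is a hypothetical helper that a parent workflow might use to decide what input each child receives, so that each child gets two integers instead of a long list of IDs.
+
+```python
+from typing import List, Tuple
+
+
+def plan_blocks(total_records: int, block_size: int) -> List[Tuple[int, int]]:
+    """Split `total_records` into (offset, length) blocks, one per child workflow."""
+    if total_records < 0 or block_size <= 0:
+        raise ValueError("total_records must be >= 0 and block_size must be > 0")
+    return [
+        (offset, min(block_size, total_records - offset))
+        for offset in range(0, total_records, block_size)
+    ]
+
+
+if __name__ == "__main__":
+    # Ten records in blocks of four: the last block is shorter.
+    for offset, length in plan_blocks(10, 4):
+        print(f"child handles records {offset}..{offset + length - 1}")
+```
+
+Keeping child inputs this small also keeps the parent's Event History small, since each child's input is recorded in that history.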
+
+---
+
+## 03 Batch Iterator Workflow
+
+Process a batch of records, then **Continue-as-New** to process the next batch.
+
+- Workflow loads a **page** of record IDs (from an offset)
+- Executes child workflows or activities to process each ID in the page
+- Calls `continue-as-new` with the last page token / offset:
+ - Next run of the workflow does the same with the next page
+- Limited parallelism
+- Continue-as-New manages event history size
+
+**Reference:** https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/batch/iterator
+
+**Pros:** Can rate-limit traffic to downstream systems; no limit to total size of record set
+**Cons:** Batch progresses at the rate of the slowest processor
+
+```mermaid
+flowchart TD
+ Records["Full Record Set"]
+ WF1["Workflow Run 1\n(page 1)"]
+ WF2["Workflow Run 2\n(page 2)"]
+ WF3["Workflow Run 3\n(page N)"]
+ DB[("Data Source\n(paginated)")]
+
+ Records --> DB
+ DB -->|"fetch page 1"| WF1
+ WF1 -->|"process records"| Acts1["Activities"]
+ WF1 -->|"continue-as-new\n(offset = page 2)"| WF2
+ DB -->|"fetch page 2"| WF2
+ WF2 -->|"process records"| Acts2["Activities"]
+ WF2 -->|"continue-as-new\n(offset = page N)"| WF3
+ DB -->|"fetch page N"| WF3
+ WF3 -->|"process records"| Acts3["Activities"]
+```
+
+---
+
+## 04 Sliding Window Workflow
+
+Similar to the Batch Iterator, but maximizes throughput by maintaining a **fixed-size window** of concurrent child workflows. As each child completes, a new one starts immediately for the next record.
+
+- A parent workflow starts a configured number of child workflows in parallel: **one child per record**
+- As each child completes, a new one is started for the next record
+- Limits the number of concurrent child workflows to prevent overwhelming downstream systems
+- The parent calls `continue-as-new` after starting the preconfigured number of children
+- A child signals its completion to the parent (since a parent cannot directly wait for a child started by a previous run)
+
+**Reference:** https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/batch/slidingwindow
+
+**Pros:** Can rate-limit traffic; no limit to total record set size; window progresses at the rate of the **fastest** processor
+**Cons:** Complicated
+
+```mermaid
+flowchart TD
+ Records["Record IDs"]
+ Parent["Parent Workflow\n(window size = W)"]
+ C1["Child 1\n(done)"]
+ C2["Child 2\n(running)"]
+ C3["Child 3\n(running)"]
+ C4["Child 4\n(started)"]
+
+ Records --> Parent
+ Parent --> C1
+ Parent --> C2
+ Parent --> C3
+
+ C1 -->|"Signal: complete"| Parent
+ Parent -->|"slot freed: start next"| C4
+
+ CAN["continue-as-new\n(after W children started)"]
+ Parent -->|"after W children"| CAN
+```
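+
+The window mechanic described above can be sketched with plain `asyncio`. This is an analogy only: tasks stand in for child workflows, awaiting a finished task stands in for the completion Signal, and Continue-As-New is not modelled at all. `run_sliding_window` and `process` are hypothetical names.
+
+```python
+import asyncio
+from typing import Awaitable, Callable, Iterable, List
+
+
+async def run_sliding_window(
+    record_ids: Iterable[int],
+    window_size: int,
+    process: Callable[[int], Awaitable[int]],
+) -> List[int]:
+    """Keep at most `window_size` records in flight; refill as each one finishes."""
+    pending = set()
+    results: List[int] = []
+    for record_id in record_ids:
+        if len(pending) >= window_size:
+            # Window is full: wait until at least one slot frees up.
+            done, pending = await asyncio.wait(
+                pending, return_when=asyncio.FIRST_COMPLETED
+            )
+            results.extend(task.result() for task in done)
+        pending.add(asyncio.create_task(process(record_id)))
+    if pending:
+        # Drain whatever is still running once the input is exhausted.
+        done, _ = await asyncio.wait(pending)
+        results.extend(task.result() for task in done)
+    return results
+```
+
+Results arrive in completion order, not input order, which mirrors why this pattern progresses at the rate of the fastest processor.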
+
+---
+
+## 05 MapReduce Tree
+
+Used for **embarrassingly parallel** workloads where speed matters more than rate-limiting.
+
+- Recordset is received by a **Node** workflow
+- **Map phase:**
+ - If the recordset is small enough to be processed by `n` leaves, start `n` **Leaf** workflows as children
+ - Otherwise, split the recordset into `n` chunks and pass them to `n` **Node** child workflows (recurse)
+- **Reduce phase:**
+ - Results are signalled from child to parent
+ - Parent blocks until all results are received
+ - Can be skipped if results aren't needed
+- External reads *might* be okay β **avoid external/downstream writes**
+- Can be tricky to get correct; track tree depth and fail if too deep
+- If rate limiting is needed (e.g. thundering herd), use **Batch Sliding Window** or **Batch Iterator** instead
+
+**Pros:** No limit to total record set size; entire recordset processed in parallel
+**Cons:** Complicated
+
+```mermaid
+flowchart TD
+ Records["Full Record Set"]
+ Root["Root Node Workflow"]
+
+ Node1["Node Workflow"]
+ Node2["Node Workflow"]
+
+ L1["Leaf Workflow"]
+ L2["Leaf Workflow"]
+ L3["Leaf Workflow"]
+ L4["Leaf Workflow"]
+ L5["Leaf Workflow"]
+ L6["Leaf Workflow"]
+
+ Records --> Root
+ Root -->|"chunk 1"| Node1
+ Root -->|"chunk 2"| Node2
+
+ Node1 --> L1
+ Node1 --> L2
+ Node1 --> L3
+
+ Node2 --> L4
+ Node2 --> L5
+ Node2 --> L6
+
+ L1 -->|"Signal result"| Node1
+ L2 -->|"Signal result"| Node1
+ L3 -->|"Signal result"| Node1
+ L4 -->|"Signal result"| Node2
+ L5 -->|"Signal result"| Node2
+ L6 -->|"Signal result"| Node2
+
+ Node1 -->|"Signal result"| Root
+ Node2 -->|"Signal result"| Root
+```
+
+---
+
+## 06 Batch Signalling
+
+The Temporal CLI batch signal feature notifies multiple workflows with a single command.
+
+**Supported commands:**
+- Signal
+- Reset
+- Cancel
+- Terminate
+
+Use by adding the `--query` parameter to the command.
+
+**Limits:**
+- 1 running batch job per namespace
+- 50 workflows per second per batch
+
+**Reference:** https://docs.temporal.io/cli/batch
+
+```bash
+# Terminate all running workflows of a given type
+temporal workflow terminate \
+    --query 'ExecutionStatus = "Running" AND WorkflowType="SomeWorkflowType"' \
+    --reason "Terminate Test Workflows Batch"
+
+# Signal all running workflows of a given type
+# (--query selects the targets, so no --workflow-id is passed)
+temporal workflow signal \
+    --name MySignal \
+    --input '{"Input": "As-JSON"}' \
+    --query 'ExecutionStatus = "Running" AND WorkflowType="YourWorkflow"' \
+    --reason "Testing"
+```
+
+---
+
+## 07 Limits
+
+Key numbers to know. Full reference: https://docs.temporal.io/cloud/limits
+
+| Limit | Value |
+|---|---|
+| **Actions per second per namespace** | Dynamically allocated based on usage |
+| **Unfinished actions per workflow** | 2,000 max (aim for 500). Includes activities, signals, child workflows, cancellation requests |
+| **Events per workflow** | 50,000 events max (aim for 2,000) **or** 50MB total history size |
+| **Signals per workflow** | 10,000 |
+| **Updates per workflow** | 10 in-flight, 2,000 total |
+| **Batch signalling** | 1 batch job per namespace; 50 workflows/sec per batch |
diff --git a/docs/.vitepress/config.mts b/docs/.vitepress/config.mts
index aa18ed9..c9b355b 100644
--- a/docs/.vitepress/config.mts
+++ b/docs/.vitepress/config.mts
@@ -73,6 +73,16 @@ export default withMermaid(defineConfig({
{ text: 'Fairness', link: '/fairness' }
]
},
+ {
+ text: 'Batch Processing Patterns',
+ items: [
+ { text: 'Overview', link: '/batch-processing-patterns' },
+ { text: 'Fan-Out with Child Workflows', link: '/fanout-child-workflows' },
+ { text: 'Batch Iterator', link: '/batch-iterator' },
+ { text: 'Sliding Window', link: '/sliding-window' },
+ { text: 'MapReduce Tree', link: '/mapreduce-tree' }
+ ]
+ },
],
socialLinks: [
{ icon: 'github', link: 'https://github.com/taonic/temporal-design-patterns' }
diff --git a/docs/batch-iterator.md b/docs/batch-iterator.md
new file mode 100644
index 0000000..6e54651
--- /dev/null
+++ b/docs/batch-iterator.md
@@ -0,0 +1,233 @@
+
+# Batch Iterator
+
+:::info TLDR
+**Process one page at a time** and call Continue-as-New with the next offset after each page so the Workflow's event history never grows without bound. Because every run starts with a fresh history, the number of pages is unlimited. Use this when your record set is arbitrarily large, you need a durable checkpoint after every page, and sequential page-by-page throughput is acceptable.
+:::
+
+## Overview
+
+The Batch Iterator pattern processes a large record set one page at a time. Each Workflow run processes a single page and then calls Continue-as-New with the next offset, producing a chain of short-lived runs that together cover the entire record set without accumulating unbounded event history.
+
+## Problem
+
+A single Workflow run is limited to 50,000 history events (aim for 2,000) and 2,000 in-flight Activities. Processing millions of records in one run is not possible within these bounds.
+
+You need a way to process an arbitrarily large record set reliably, with the ability to resume from a checkpoint if the Workflow is interrupted, and without overwhelming downstream systems with a burst of concurrent requests.
+
+## Solution
+
+Each Workflow run fetches one page of records using a persistent `offset` parameter, processes each record sequentially, and then calls `continueAsNew` with the incremented offset. The next run picks up exactly where the previous one left off.
+
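+Because each run processes only a bounded number of records, history stays well within limits. The offset acts as a durable checkpoint: a Worker crash mid-page resumes from the recorded history without re-running completed Activities, and a run that fails outright can be restarted from the offset of its current page rather than from the beginning of the record set.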
+
+```mermaid
+flowchart TD
+ DB[("Data Source\n(paginated)")]
+ WF1["Workflow Run 1\n(offset=0)"]
+ WF2["Workflow Run 2\n(offset=PAGE_SIZE)"]
+ WF3["Workflow Run N\n(offset=(N-1)×PAGE_SIZE)"]
+ Done(["Complete"])
+
+ DB -->|"fetch page 1"| WF1
+ WF1 -->|"processRecord ×PAGE_SIZE"| Acts1["Activities"]
+ WF1 -->|"continueAsNew\n(offset=PAGE_SIZE)"| WF2
+
+ DB -->|"fetch page 2"| WF2
+ WF2 -->|"processRecord ×PAGE_SIZE"| Acts2["Activities"]
+ WF2 -->|"continueAsNew\n(offset=(N-1)×PAGE_SIZE)"| WF3
+
+ DB -->|"fetch page N"| WF3
+ WF3 -->|"processRecord ×PAGE_SIZE"| Acts3["Activities"]
+ WF3 -->|"last page → return"| Done
+```
+
+The following describes each step in the diagram:
+
+1. The Workflow starts with `offset=0` and calls `fetchPage(offset, pageSize)` to retrieve the first page of records.
+2. It processes each record in the page by executing the `processRecord` Activity.
+3. After the page is fully processed, it calls `continueAsNew` with `offset + pageSize`, passing the updated offset to the next run.
+4. The next run begins with a clean history and repeats the same steps for the next page.
+5. When `fetchPage` returns fewer records than `pageSize`, the Workflow knows it has reached the last page and returns normally.
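+
+The five steps above can be traced with a small simulation that uses no Temporal APIs. It is a sketch under stated assumptions: `run_once` stands in for a single Workflow run, `run_chain` for the chain of Continue-as-New runs, and `PAGE_SIZE = 3` is an arbitrary illustrative value.
+
+```python
+from typing import Optional, Sequence, Tuple
+
+PAGE_SIZE = 3  # hypothetical page size, kept tiny for illustration
+
+
+def fetch_page(records: Sequence[str], offset: int, page_size: int) -> Sequence[str]:
+    """Stand-in for the fetchPage Activity: return one slice of the data source."""
+    return records[offset:offset + page_size]
+
+
+def run_once(records: Sequence[str], offset: int, total: int) -> Tuple[Optional[Tuple[int, int]], int]:
+    """Simulate one Workflow run.
+
+    Returns (next_args, total). next_args holds the continue-as-new arguments,
+    or None when the page was short and the chain ends.
+    """
+    page = fetch_page(records, offset, PAGE_SIZE)
+    total += len(page)  # each record would be one processRecord Activity
+    if len(page) == PAGE_SIZE:
+        return (offset + PAGE_SIZE, total), total
+    return None, total
+
+
+def run_chain(records: Sequence[str]) -> Tuple[int, int]:
+    """Follow the chain of runs; return (records processed, number of runs)."""
+    next_args: Optional[Tuple[int, int]] = (0, 0)
+    total, runs = 0, 0
+    while next_args is not None:
+        next_args, total = run_once(records, *next_args)
+        runs += 1
+    return total, runs
+```
+
+With seven records the chain takes three runs. With exactly six records it also takes three: a run that sees a full page cannot know the source is exhausted, so one extra run fetches an empty page and returns.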
+
+## Implementation
+
+
+
+The following examples show how each SDK implements the Batch Iterator pattern.
+
+::: code-group
+```typescript [TypeScript]
+// workflows.ts
+import { continueAsNew, log, proxyActivities } from "@temporalio/workflow";
+import type * as activities from "./activities";
+import { PAGE_SIZE } from "./shared";
+
+const { fetchPage, processRecord } = proxyActivities<typeof activities>({
+  startToCloseTimeout: "10 seconds",
+});
+
+export async function batchIteratorWorkflow(
+  offset: number = 0,
+  totalProcessed: number = 0
+): Promise<number> {
+  // Fetch one page of records through an Activity.
+  const page = await fetchPage(offset, PAGE_SIZE);
+
+  for (const record of page) {
+    await processRecord(record);
+    totalProcessed++;
+  }
+
+  log.info(`Processed page at offset ${offset} (${page.length} records, running total: ${totalProcessed})`);
+
+  // A full page means more records may remain: hand off to a fresh run.
+  if (page.length === PAGE_SIZE) {
+    await continueAsNew<typeof batchIteratorWorkflow>(offset + PAGE_SIZE, totalProcessed);
+  }
+
+  return totalProcessed;
+}
+```
+
+```python [Python]
+# workflows.py
+from datetime import timedelta
+
+from temporalio import workflow
+
+# Pass application imports through the Workflow sandbox.
+with workflow.unsafe.imports_passed_through():
+    from activities import fetch_page, process_record
+    from shared import PAGE_SIZE
+
+
+@workflow.defn
+class BatchIteratorWorkflow:
+    @workflow.run
+    async def run(self, offset: int = 0, total_processed: int = 0) -> int:
+        # Fetch one page of records through an Activity.
+        page = await workflow.execute_activity(
+            fetch_page,
+            args=[offset, PAGE_SIZE],
+            start_to_close_timeout=timedelta(seconds=10),
+        )
+
+        for record in page:
+            await workflow.execute_activity(
+                process_record,
+                record,
+                start_to_close_timeout=timedelta(seconds=10),
+            )
+            total_processed += 1
+
+        workflow.logger.info(
+            f"Processed page at offset {offset} ({len(page)} records, running total: {total_processed})"
+        )
+
+        if len(page) == PAGE_SIZE:
+            # Multiple arguments must be passed via args=; the call does not return.
+            workflow.continue_as_new(args=[offset + PAGE_SIZE, total_processed])
+
+        return total_processed
+```
+
+```go [Go]
+// workflows.go
+package main
+
+import (
+	"time"
+
+	"go.temporal.io/sdk/workflow"
+)
+
+func BatchIteratorWorkflow(ctx workflow.Context, offset int, totalProcessed int) (int, error) {
+	ao := workflow.ActivityOptions{
+		StartToCloseTimeout: 10 * time.Second,
+	}
+	ctx = workflow.WithActivityOptions(ctx, ao)
+
+	// Fetch one page of records through an Activity.
+	var page []Record
+	if err := workflow.ExecuteActivity(ctx, FetchPage, offset, PageSize).Get(ctx, &page); err != nil {
+		return totalProcessed, err
+	}
+
+	for _, record := range page {
+		if err := workflow.ExecuteActivity(ctx, ProcessRecord, record).Get(ctx, nil); err != nil {
+			return totalProcessed, err
+		}
+		totalProcessed++
+	}
+
+	workflow.GetLogger(ctx).Info("Processed page",
+		"offset", offset,
+		"pageSize", len(page),
+		"totalProcessed", totalProcessed)
+
+	// A full page means more records may remain: hand off to a fresh run.
+	if len(page) == PageSize {
+		return totalProcessed, workflow.NewContinueAsNewError(ctx, BatchIteratorWorkflow, offset+PageSize, totalProcessed)
+	}
+
+	return totalProcessed, nil
+}
+```
+
+```java [Java]
+// BatchIteratorWorkflow.java
+import io.temporal.activity.ActivityOptions;
+import io.temporal.workflow.*;
+import java.time.Duration;
+import java.util.List;
+
+@WorkflowInterface
+public interface BatchIteratorWorkflow {
+  @WorkflowMethod
+  int run(int offset, int totalProcessed);
+}
+
+// BatchIteratorWorkflowImpl.java
+public class BatchIteratorWorkflowImpl implements BatchIteratorWorkflow {
+  private final Activities activities = Workflow.newActivityStub(
+      Activities.class,
+      ActivityOptions.newBuilder()
+          .setStartToCloseTimeout(Duration.ofSeconds(10))
+          .build()
+  );
+
+  @Override
+  public int run(int offset, int totalProcessed) {
+    // Fetch one page of records through an Activity.
+    List<Record> page = activities.fetchPage(offset, Shared.PAGE_SIZE);
+
+    for (Record record : page) {
+      activities.processRecord(record);
+      totalProcessed++;
+    }
+
+    Workflow.getLogger(BatchIteratorWorkflowImpl.class).info(
+        "Processed page at offset " + offset + " (" + page.size() + " records, total: " + totalProcessed + ")"
+    );
+
+    if (page.size() == Shared.PAGE_SIZE) {
+      // Calling the continue-as-new stub ends this run; the call itself never returns.
+      return Workflow.newContinueAsNewStub(BatchIteratorWorkflow.class)
+          .run(offset + Shared.PAGE_SIZE, totalProcessed);
+    }
+
+    return totalProcessed;
+  }
+}
+```
+:::
+
+## Best Practices
+
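+- **Choose a page size that keeps history under 2,000 events.** Each sequentially executed Activity adds about six events: three Activity Task events (scheduled, started, completed) plus the three Workflow Task events that advance the Workflow between Activities. A page of roughly 300 records therefore keeps a run under the 2,000-event target; confirm the per-record count against a real history before settling on a size.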
+- **Include `totalProcessed` (or a similar counter) in the `continueAsNew` args.** This lets you observe overall progress via the Workflow input visible in the UI without querying internal state.
+- **Fetch inside an Activity, not the Workflow.** The `fetchPage` call must be an Activity, not inline Workflow code, so that it can interact with external systems and be retried independently.
+- **Make `processRecord` idempotent.** Activities run at least once: a retry after a timeout or Worker crash can invoke `processRecord` again for a record whose first attempt already reached the downstream system. Activities recorded as completed in history are not re-executed on replay, but restarting a failed run from its page offset reprocesses that page from the start.
+- **Avoid accumulating large local state between pages.** `continueAsNew` does not carry over in-memory state; only the arguments you pass are available in the next run.
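+
+The page-size arithmetic in the first bullet can be made explicit. This is a minimal sketch, assuming you have measured how many history events one record costs in your own Workflow; `max_page_size`, `events_per_record`, and `fixed_overhead` are illustrative names, not SDK settings.
+
+```python
+def max_page_size(event_budget: int, events_per_record: int, fixed_overhead: int = 0) -> int:
+    """Largest page size whose history stays within event_budget.
+
+    events_per_record is an assumption to measure from a real history;
+    fixed_overhead covers per-run events such as start and the page fetch.
+    """
+    if events_per_record <= 0:
+        raise ValueError("events_per_record must be positive")
+    return max(0, (event_budget - fixed_overhead) // events_per_record)
+```
+
+For example, `max_page_size(2000, 6, 20)` gives 330, while the same budget at two events per record gives 1,000, so the measured per-record cost dominates the choice.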
+
+## Common Pitfalls
+
+- **Calling `continueAsNew` unconditionally.** If every run continues as new, the Workflow loops forever even after the data source is exhausted. Continue only when the returned page is full; a page shorter than `pageSize` marks the end.
+- **Passing large objects into `continueAsNew`.** All arguments are serialized into the next run's input. Pass only the minimal state needed (offset, counters), not accumulated results or large data structures.
+- **Sequential processing bottlenecks.** The Batch Iterator processes one record at a time per page. If throughput matters more than rate limiting, consider [Sliding Window](sliding-window) or [MapReduce Tree](mapreduce-tree).
+
+## Related Resources
+
+- [Continue-as-New pattern](continue-as-new): core concepts for history management via `continueAsNew`
+- [Sliding Window](sliding-window): bounded concurrency that progresses at the rate of the fastest processor
+- [MapReduce Tree](mapreduce-tree): fully parallel processing for maximum speed
+- [Temporal limits reference](https://docs.temporal.io/cloud/limits)
+- [Batch samples (Java)](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/batch/iterator)
diff --git a/docs/batch-processing-patterns.md b/docs/batch-processing-patterns.md
new file mode 100644
index 0000000..0d93e9d
--- /dev/null
+++ b/docs/batch-processing-patterns.md
@@ -0,0 +1,143 @@
+
+# Batch Processing Patterns
+
+Patterns for processing large volumes of records reliably, at scale, and without overwhelming downstream systems.
+
+Choose based on your throughput requirements, record set size, and whether you need rate limiting or maximum parallelism.
+
+## When to use which pattern
+
+| Pattern | Record set size | Parallelism model | Workflow-based rate control |
+|---|---|---|---|
+| [Basic Workflow](#basic-workflow-single-tier-fan-out) | Small (up to a few hundred records) | Sequential or parallel activities in one Workflow | No |
+| [Fan-Out with Child Workflows](fanout-child-workflows) | Up to ~4M records | Fixed concurrency (one child per chunk) | No |
+| [Batch Iterator](batch-iterator) | Unlimited | Limited (activities per page) | Yes: fixed page rate |
+| [Sliding Window](sliding-window) | Unlimited | Bounded window of concurrent children | Yes: configurable window |
+| [MapReduce Tree](mapreduce-tree) | Unlimited | Fully parallel recursive tree | No: maximum speed |
+
+
+
+---
+
+## Schedules
+
+Schedules allow Workflows to be executed on a recurring basis; think of them as a more powerful cron.
+
+- Supports `start` / `pause` / `stop` / `update` / `backfill` of scheduled Workflow executions
+- Configurable **Overlap Policies** control what happens when the previous run is still running
+- Full execution history visibility in the Temporal UI
+- Schedules can be created via the UI, CLI, or SDK
+
+```bash
+temporal schedule create \
+ --schedule-id 'your-schedule-id' \
+ --workflow-id 'your-workflow-id' \
+ --task-queue 'your-task-queue' \
+ --workflow-type 'YourWorkflowType'
+```
+
+**References:**
+- [Temporal Schedules](https://docs.temporal.io/workflows#schedule)
+- [CLI schedule commands](https://docs.temporal.io/cli/schedule)
+
+---
+
+## Basic Workflow (single-tier fan-out)
+
+The simplest form of batch processing: the Workflow fetches or receives record IDs and executes one Activity per record.
+
+- Activities can be executed sequentially or concurrently (using the SDK's async primitives)
+- **Limit: 2,000 in-flight Activities per Workflow run** (aim for 500)
+- If total event count is likely to exceed 2,000 (hard limit: 50,000), use the [Batch Iterator](batch-iterator) instead
+
+**Pros:** Simple
+**Cons:** Hard cap on concurrent Activities; all-or-nothing failure model; can overwhelm downstream systems
+
+```mermaid
+flowchart TD
+ Records["Record IDs\n(fetched or passed in)"]
+ WF["Workflow"]
+ A1["Activity"]
+ A2["Activity"]
+ AN["Activity ..."]
+
+ Records --> WF
+ WF --> A1
+ WF --> A2
+ WF --> AN
+```
+
+---
+
+## Batch Signalling
+
+The Temporal CLI lets you signal, reset, cancel, or terminate multiple Workflows with a single command using a visibility query.
+
+- 1 running batch job per namespace
+- 50 Workflows per second per batch
+
+```bash
+# Signal all running Workflows of a given type
+temporal workflow signal \
+ --name MySignal \
+ --input '{"Input": "As-JSON"}' \
+ --query 'ExecutionStatus = "Running" AND WorkflowType="YourWorkflow"' \
+ --reason "Testing"
+
+# Terminate all running Workflows of a given type
+temporal workflow terminate \
+ --query 'ExecutionStatus = "Running" AND WorkflowType="SomeWorkflowType"' \
+ --reason "Terminate Test Workflows"
+```
+
+**Reference:** [CLI batch commands](https://docs.temporal.io/cli/batch)
+
+---
+
+## Key Limits
+
+Full reference: [Temporal Cloud limits](https://docs.temporal.io/cloud/limits)
+
+| Limit | Value |
+|---|---|
+| Unfinished actions per Workflow | 2,000 max (aim for 500). Includes Activities, Signals, Child Workflows, cancellation requests |
+| Events per Workflow history | 50,000 events max (aim for 2,000) **or** 50 MB total history size |
+| Signals per Workflow | 10,000 |
+| Updates per Workflow | 10 in-flight, 2,000 total |
+| Batch Signalling | 1 batch job per namespace; 50 Workflows/sec per batch |
diff --git a/docs/fanout-child-workflows.md b/docs/fanout-child-workflows.md
new file mode 100644
index 0000000..3e38e96
--- /dev/null
+++ b/docs/fanout-child-workflows.md
@@ -0,0 +1,280 @@
+
+# Fan-Out with Child Workflows
+
+:::info TLDR
+Split your record set into fixed-size chunks and start **one child Workflow per chunk** so that each chunk's history stays within Temporal's limits. Use this when you want maximum concurrency with no rate control and you can pre-compute how many chunks you need before the job starts. Keep the total number of children per parent under 1,000; use [Sliding Window](sliding-window) or [Batch Iterator](batch-iterator) for larger workloads.
+:::
+
+## Overview
+
+The Fan-Out pattern distributes a large record set across multiple independent child Workflows, each responsible for processing a fixed-size chunk. The parent Workflow assigns work by offset and length so that no record IDs need to be passed over the wire β only two integers per child.
+
+## Problem
+
+A single Workflow run can have at most 2,000 in-flight Activities (aim for 500) and at most 50,000 history events. Processing millions of records in a single Workflow run is therefore not possible.
+
+You need a way to partition a large record set, process each partition independently, and coordinate the overall job while keeping each Workflow's history within safe bounds.
+
+## Solution
+
+You split the total record count into fixed-size chunks and start one child Workflow per chunk. Each child is given an `offset` and a `length` so it knows which slice of the record set to fetch and process independently.
+
+The parent Workflow starts all children concurrently and waits for them all to complete. If a child fails, the parent can retry that child without re-processing the records handled by other children.
+
+```mermaid
+flowchart TD
+ Records["Total record set\n(N records)"]
+ Parent["Parent Workflow\n(fanOutWorkflow)"]
+ C1["Child Workflow\n(offset=0, length=chunk)"]
+ C2["Child Workflow\n(offset=chunk, length=chunk)"]
+ C3["Child Workflow\n(offset=2×chunk, length=chunk)"]
+
+ Records --> Parent
+ Parent -->|"start child 1"| C1
+ Parent -->|"start child 2"| C2
+ Parent -->|"start child 3"| C3
+
+ C1 --> A1["processRecord ×chunk"]
+ C2 --> A2["processRecord ×chunk"]
+ C3 --> A3["processRecord ×chunk"]
+
+ A1 -->|"done"| Parent
+ A2 -->|"done"| Parent
+ A3 -->|"done"| Parent
+```
+
+The following describes each step in the diagram:
+
+1. The parent Workflow receives the total record count and a configured chunk size.
+2. It divides the total into chunks and starts one child Workflow per chunk, passing only `offset` and `length`.
+3. Each child independently fetches its slice of records (using the offset and length) and calls `processRecord` for each one.
+4. Each child completes and returns its result to the parent.
+5. The parent blocks until all children have completed, then returns the aggregated result.
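+
+Steps 1 and 2 reduce to a small piece of arithmetic that can be checked in isolation. The sketch below is plain Python with no Temporal APIs; `plan_chunks` is a hypothetical helper that returns the child Workflow ID, offset, and length the parent would use for each child.
+
+```python
+from typing import List, Tuple
+
+
+def plan_chunks(parent_id: str, total_records: int, chunk_size: int) -> List[Tuple[str, int, int]]:
+    """Return (child_workflow_id, offset, length) for every chunk.
+
+    The final chunk is shortened so that the lengths sum to total_records.
+    """
+    if chunk_size <= 0:
+        raise ValueError("chunk_size must be positive")
+    return [
+        (f"{parent_id}/batch-{offset}", offset, min(chunk_size, total_records - offset))
+        for offset in range(0, total_records, chunk_size)
+    ]
+```
+
+Calling `plan_chunks("job", 10, 4)` yields three children covering offsets 0, 4, and 8, with the last child receiving only two records.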
+
+## Implementation
+
+
+
+The following examples show how each SDK implements the Fan-Out pattern.
+
+::: code-group
+```typescript [TypeScript]
+// workflows.ts
+import {
+  executeChild,
+  proxyActivities,
+  workflowInfo,
+} from "@temporalio/workflow";
+import type * as activities from "./activities";
+import { TASK_QUEUE, CHUNK_SIZE } from "./shared";
+
+const { processRecord } = proxyActivities<typeof activities>({
+  startToCloseTimeout: "10 seconds",
+});
+
+export async function fanOutWorkflow(
+  totalRecords: number,
+  chunkSize: number = CHUNK_SIZE
+): Promise<number> {
+  const children: Promise<number>[] = [];
+
+  // Start one child per chunk; each child receives only an offset and a length.
+  for (let offset = 0; offset < totalRecords; offset += chunkSize) {
+    const length = Math.min(chunkSize, totalRecords - offset);
+    children.push(
+      executeChild(recordBatchWorkflow, {
+        args: [offset, length],
+        taskQueue: TASK_QUEUE,
+        workflowId: `${workflowInfo().workflowId}/batch-${offset}`,
+      })
+    );
+  }
+
+  // Wait for every child, then aggregate the processed counts.
+  const results = await Promise.all(children);
+  return results.reduce((sum, n) => sum + n, 0);
+}
+
+export async function recordBatchWorkflow(
+  offset: number,
+  length: number
+): Promise<number> {
+  let processed = 0;
+  for (let i = offset; i < offset + length; i++) {
+    await processRecord(i);
+    processed++;
+  }
+  return processed;
+}
+```
+
+```python [Python]
+# workflows.py
+import asyncio
+from datetime import timedelta
+
+from temporalio import workflow
+from temporalio.workflow import ChildWorkflowHandle
+
+# Pass application imports through the Workflow sandbox.
+with workflow.unsafe.imports_passed_through():
+    from activities import process_record
+    from shared import TASK_QUEUE, CHUNK_SIZE
+
+
+@workflow.defn
+class RecordBatchWorkflow:
+    @workflow.run
+    async def run(self, offset: int, length: int) -> int:
+        processed = 0
+        for i in range(offset, offset + length):
+            await workflow.execute_activity(
+                process_record,
+                i,
+                start_to_close_timeout=timedelta(seconds=10),
+            )
+            processed += 1
+        return processed
+
+
+@workflow.defn
+class FanOutWorkflow:
+    @workflow.run
+    async def run(self, total_records: int, chunk_size: int = CHUNK_SIZE) -> int:
+        handles: list[ChildWorkflowHandle] = []
+        parent_id = workflow.info().workflow_id
+
+        # Start one child per chunk; each child receives only an offset and a length.
+        offset = 0
+        while offset < total_records:
+            length = min(chunk_size, total_records - offset)
+            handle = await workflow.start_child_workflow(
+                RecordBatchWorkflow.run,
+                args=[offset, length],
+                id=f"{parent_id}/batch-{offset}",
+                task_queue=TASK_QUEUE,
+            )
+            handles.append(handle)
+            offset += chunk_size
+
+        # Wait for every child, then aggregate the processed counts.
+        results = await asyncio.gather(*handles)
+        return sum(results)
+```
+
+```go [Go]
+// workflows.go
+package main
+
+import (
+	"fmt"
+	"time"
+
+	"go.temporal.io/sdk/workflow"
+)
+
+func FanOutWorkflow(ctx workflow.Context, totalRecords int, chunkSize int) (int, error) {
+	if chunkSize <= 0 {
+		chunkSize = ChunkSize
+	}
+
+	var futures []workflow.Future
+	parentID := workflow.GetInfo(ctx).WorkflowExecution.ID
+
+	// Start one child per chunk; each child receives only an offset and a length.
+	for offset := 0; offset < totalRecords; offset += chunkSize {
+		length := chunkSize
+		if offset+chunkSize > totalRecords {
+			length = totalRecords - offset
+		}
+		off := offset // capture loop variable
+		cwo := workflow.ChildWorkflowOptions{
+			WorkflowID: fmt.Sprintf("%s/batch-%d", parentID, off),
+			TaskQueue:  TaskQueue,
+		}
+		cctx := workflow.WithChildOptions(ctx, cwo)
+		futures = append(futures, workflow.ExecuteChildWorkflow(cctx, RecordBatchWorkflow, off, length))
+	}
+
+	// Wait for every child, then aggregate the processed counts.
+	total := 0
+	for _, f := range futures {
+		var n int
+		if err := f.Get(ctx, &n); err != nil {
+			return total, err
+		}
+		total += n
+	}
+	return total, nil
+}
+
+func RecordBatchWorkflow(ctx workflow.Context, offset int, length int) (int, error) {
+	ao := workflow.ActivityOptions{
+		StartToCloseTimeout: 10 * time.Second,
+	}
+	ctx = workflow.WithActivityOptions(ctx, ao)
+
+	processed := 0
+	for i := offset; i < offset+length; i++ {
+		if err := workflow.ExecuteActivity(ctx, ProcessRecord, i).Get(ctx, nil); err != nil {
+			return processed, err
+		}
+		processed++
+	}
+	return processed, nil
+}
+```
+
+```java [Java]
+// FanOutWorkflow.java
+import io.temporal.activity.ActivityOptions;
+import io.temporal.workflow.*;
+import java.time.Duration;
+import java.util.ArrayList;
+import java.util.List;
+
+@WorkflowInterface
+public interface FanOutWorkflow {
+  @WorkflowMethod
+  int run(int totalRecords, int chunkSize);
+}
+
+// FanOutWorkflowImpl.java
+public class FanOutWorkflowImpl implements FanOutWorkflow {
+  @Override
+  public int run(int totalRecords, int chunkSize) {
+    if (chunkSize <= 0) chunkSize = Shared.CHUNK_SIZE;
+
+    List<Promise<Integer>> promises = new ArrayList<>();
+    String parentId = Workflow.getInfo().getWorkflowId();
+
+    // Start one child per chunk; each child receives only an offset and a length.
+    for (int offset = 0; offset < totalRecords; offset += chunkSize) {
+      int length = Math.min(chunkSize, totalRecords - offset);
+      ChildWorkflowOptions opts = ChildWorkflowOptions.newBuilder()
+          .setWorkflowId(parentId + "/batch-" + offset)
+          .setTaskQueue(Shared.TASK_QUEUE)
+          .build();
+      RecordBatchWorkflow child = Workflow.newChildWorkflowStub(RecordBatchWorkflow.class, opts);
+      promises.add(Async.function(child::run, offset, length));
+    }
+
+    // Wait for every child, then aggregate the processed counts.
+    int total = 0;
+    for (Promise<Integer> p : promises) {
+      total += p.get();
+    }
+    return total;
+  }
+}
+```
+:::
+
+## Best Practices
+
+- **Use offset and length, not explicit IDs.** Pass only two integers to each child rather than a full slice of IDs. The child fetches its own records. This keeps history events small.
+- **Size chunks to stay under the Activity limit.** Each child Workflow can have at most 2,000 in-flight Activities. Aim for chunks of 500 records or fewer if each record maps to one Activity.
+- **Cap concurrent children in the parent.** Starting thousands of child Workflows simultaneously puts pressure on the namespace. Consider batching child starts or using [Sliding Window](sliding-window) if you need tighter concurrency control.
+- **Set `PARENT_CLOSE_POLICY_ABANDON`** for fire-and-forget fan-outs where the parent does not need to collect results. With the default `TERMINATE` policy, any close of the parent (including cancellation or timeout) terminates all in-flight children.
+- **Give each child a deterministic Workflow ID** (`parentId/batch-<offset>`). Temporal rejects a child start whose Workflow ID matches an execution that is still running, so a retried start cannot create a duplicate. The default Workflow ID reuse policy does allow a new execution once the earlier one has closed; to keep already-completed children from re-executing, set the reuse policy to reject duplicates.
+
+## Common Pitfalls
+
+- **Starting too many children at once.** Each child start adds to the parent's history. Keep total children per parent under 1,000 per [Temporal guidance](https://docs.temporal.io/workflows#when-to-use-child-workflows). If you need more children, switch to [MapReduce Tree](mapreduce-tree) or [Sliding Window](sliding-window).
+- **Passing large lists of IDs.** Workflow inputs are stored in event history. Passing millions of record IDs as a list will blow the history size limit. Use offset + length instead.
+- **Ignoring child failures.** A failed child does not automatically fail the parent unless you await all results. Always await child handles and handle errors explicitly.
+
+## Related Resources
+
+- [Child Workflows pattern](child-workflows): core concepts for parent/child Workflow coordination
+- [Batch Iterator](batch-iterator): unbounded record sets with Continue-as-New pagination
+- [Sliding Window](sliding-window): bounded concurrency with maximum throughput
+- [Temporal limits reference](https://docs.temporal.io/cloud/limits)
diff --git a/docs/index.md b/docs/index.md
index 09df9f4..10226a9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -190,3 +190,43 @@ Having these patterns in your toolbox helps you solve recurring problems in a ba
+
+## Batch processing patterns {.pattern-section-title}
+
+