Skip to content

feat: Add GuardComponent framework for pluggable input/output safety guards #378

@nmeisenzahl

Description

@nmeisenzahl

Summary

Flock should provide a generic guard component framework — an abstract base class GuardComponent with a clear contract for scanning inputs and outputs — that allows plugging in any content safety or prompt shielding backend. The first built-in implementation would be an AzurePromptShieldGuard for Azure AI Content Safety Prompt Shields, but the framework is designed to be open for other backends.

Motivation

Flock's blackboard architecture means artifacts published by one agent become input context for downstream agents. A malicious or compromised artifact can contain indirect prompt injection that manipulates a downstream agent's behavior. Guard components that scan both user prompts and context documents are essential for production multi-agent deployments.

Currently Flock has no built-in content safety capability. However, the existing AgentComponent lifecycle hooks (on_pre_evaluate / on_post_evaluate) and priority-based execution order already provide the right integration points — no new architectural patterns needed.

Proposed implementation

GuardComponent — Abstract base class

Extends AgentComponent. Concrete implementations only override scan_input() and/or scan_output(), returning a GuardVerdict. The base class handles lifecycle integration, text extraction from artifacts, and verdict routing (block / warn / annotate).

class GuardVerdict:
    safe: bool
    reason: str | None = None
    details: dict[str, Any] = {}
    provider: str = "unknown"

class GuardComponentConfig(AgentComponentConfig):
    on_input_flagged: Literal["block", "warn", "annotate"] = "block"
    on_output_flagged: Literal["block", "warn", "annotate"] = "warn"
    scan_input: bool = True
    scan_output: bool = False
    scan_context_artifacts: bool = True

class GuardComponent(AgentComponent, abc.ABC):
    @abc.abstractmethod
    async def scan_input(self, text: str, documents: list[str] | None = None, **kwargs) -> GuardVerdict: ...

    async def scan_output(self, text: str, **kwargs) -> GuardVerdict:
        return GuardVerdict(safe=True)  # default pass-through

    # on_pre_evaluate / on_post_evaluate wiring handled by base class

The scan_input(text, documents) signature takes raw strings rather than Flock-specific types, so guard implementations stay backend-focused and don't need to understand Flock internals.

AzurePromptShieldGuard — First built-in implementation

Calls the Prompt Shields REST API (POST /contentsafety/text:shieldPrompt?api-version=2024-09-01) to detect direct jailbreak attacks in user prompts and indirect injection in context documents. Supports both API key and Managed Identity authentication.

class AzurePromptShieldGuard(GuardComponent):
    name: str = "azure_prompt_shield"
    config: AzurePromptShieldConfig  # adds endpoint, max_document_length

    async def scan_input(self, text, documents=None, **kwargs) -> GuardVerdict:
        result = await self._call_api(text, documents or [])
        user_attack = result["userPromptAnalysis"]["attackDetected"]
        doc_attacks = [d["attackDetected"] for d in result.get("documentsAnalysis", [])]

        if user_attack or any(doc_attacks):
            return GuardVerdict(safe=False, reason="Prompt attack detected", 
                                details={"user_attack": user_attack, "doc_attacks": doc_attacks},
                                provider=self.name)
        return GuardVerdict(safe=True, provider=self.name)

Note: The stable azure-ai-contentsafety Python SDK (1.0.0) does not expose a dedicated shield_prompt method. The implementation calls the REST API directly via aiohttp. Should the SDK add native support, the implementation should be updated accordingly.

Usage

agent = Agent(
    name="support_agent",
    engine=DSPyEngine(model="azure/gpt-4.1"),
    components=[
        AzurePromptShieldGuard(
            priority=-10,  # run before other components
            config=AzurePromptShieldConfig(
                on_input_flagged="block",
                scan_context_artifacts=True,
            ),
        ),
    ],
)

Multiple guards compose naturally as separate components — Flock's priority-based ordering ensures they execute sequentially, and if one blocks, the exception propagates before subsequent guards or the engine run.

Related

Metadata

Metadata

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions