Use the master bootstrap script to install all components automatically:
⚠️ Important: The target OpenShift cluster must NOT have RHOAI (Red Hat OpenShift AI) installed before using this bootstrap. SREIPS has its own AI/ML infrastructure and conflicts may occur with existing RHOAI installations.
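A pre-flight check along the following lines could catch an existing RHOAI installation before the bootstrap runs. This is a minimal sketch, not part of bootstrap.sh: the helper name `has_rhoai_csv` is hypothetical, and it assumes the RHOAI operator's ClusterServiceVersion name begins with `rhods-operator`, which should be confirmed for your RHOAI version.

```shell
# Hypothetical helper: reads ClusterServiceVersion names on stdin and
# succeeds (exit 0) if any of them looks like the RHOAI operator.
# Assumption: the RHOAI operator CSV is named "rhods-operator.<version>".
has_rhoai_csv() {
  grep -Eq '(^|/)rhods-operator(\.|$)'
}

# Example wiring (commented out; requires a logged-in oc session):
# if oc get csv -A -o name | has_rhoai_csv; then
#   echo "RHOAI appears to be installed; aborting." >&2
#   exit 1
# fi
```

Reading the names from stdin keeps the check independent of the cluster, so it can be tried against saved output first.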
Before deployment, you need to create a Slack Bot/App for SREIPS notifications:
- Go to your Slack workspace: https://api.slack.com/apps
- Click Create New App → From scratch
- Name it (e.g., "SREIPS Bot") and select your workspace
- Click Create App
- In your app settings, go to OAuth & Permissions
- Scroll down to Scopes section
- Under Bot Token Scopes, add the following permissions:
  - chat:write - Send messages as SREIPS Bot
  - chat:write.public - Send messages to channels that SREIPS Bot isn't a member of
  - files:write - Upload, edit and delete files as SREIPS Bot
  - incoming-webhook - Post messages to specific channels in Slack
- Scroll up to OAuth Tokens for Your Workspace
- Click Install to Workspace
- Review permissions and click Allow
- Copy the Bot User OAuth Token (starts with xoxb-...)
  - This is your SLACK_API_KEY for config.env
- In your app settings, go to Basic Information
- Scroll down to App Credentials section
- Copy the Signing Secret
  - This is your SIGNING_KEY for config.env
  - This is used to verify that requests to your remediation agent are coming from Slack
- In your app settings, go to Interactivity & Shortcuts
- Toggle Interactivity to On
- Set the Request URL to: <remediation-agent-route-url>/remediate
  - Example: https://sreips-remediation-agent-sreips-agent.apps.your-cluster.com/remediate
  - To get the route URL after deployment, run: oc get route sreips-remediation-agent -n sreips-agent -o jsonpath='{.spec.host}'
  - Then use: https://<route-host>/remediate
- Click Save Changes
Note: You'll need to update this URL after deploying SREIPS, as the route won't exist until the remediation agent is deployed.
- Create or choose a Slack channel (e.g., #sreips-helper)
- In the channel, type /invite @SREIPS Bot (or your bot name)
- The channel name you use here is your SLACK_CHANNEL for config.env
# Copy the configuration template
cp config.env.template config.env
# Edit config.env and fill in your values
# Make sure to set SLACK_API_KEY and SLACK_CHANNEL from the steps above
vim config.env
./bootstrap.sh
This will automatically install all SREIPS components in the correct sequence with proper dependency handling.
Before running the bootstrap script, you need to configure the following in config.env:
- Slack API key - Obtained from steps above (starts with xoxb-)
- Slack channel - Channel name where notifications will be sent (e.g., sreips-helper)
- Signing key - Slack signing secret for verifying requests from Slack (from Basic Information → App Credentials)
- Cluster name - Your OpenShift cluster identifier
- Root username - MinIO admin username (minimum 3 characters)
- Root password - MinIO admin password (minimum 8 characters)
- RH API Offline Token - Get from https://access.redhat.com/management/api
- Log in with your Red Hat account
- Navigate to API Tokens section
- Generate or copy your offline token
- Inference model - LLM model name (e.g., Llama-4-Scout-17B-16E-W4A16)
- VLLM URL - Your vLLM inference endpoint
- VLLM API token - Authentication token for vLLM
- VLLM TLS verify - Set to true or false for SSL verification
- Vector database ID - ⚠️ Important: In the RHOAI-3 based implementation, this must be obtained after the data ingestion RAG pipeline completes. RHOAI-3 generates the vector store ID dynamically and no longer uses the given name. See Post-Deployment Steps below for instructions.
See config.env.template for detailed descriptions and example values.
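The constraints listed above can be checked before running the bootstrap. The following is a minimal sketch, not the validation bootstrap.sh performs: `validate_config` is a hypothetical helper, and while SLACK_API_KEY, SLACK_CHANNEL and SIGNING_KEY are named in this README, the variable names MINIO_ROOT_USER, MINIO_ROOT_PASSWORD and VLLM_TLS_VERIFY are assumptions that should be matched against config.env.template.

```shell
# Sketch of pre-bootstrap validation for values loaded from config.env.
# MINIO_ROOT_USER, MINIO_ROOT_PASSWORD and VLLM_TLS_VERIFY are hypothetical
# variable names used for illustration only.
validate_config() {
  _rc=0
  case "${SLACK_API_KEY:-}" in
    xoxb-?*) ;;
    *) echo "SLACK_API_KEY must start with xoxb-" >&2; _rc=1 ;;
  esac
  [ -n "${SLACK_CHANNEL:-}" ] || { echo "SLACK_CHANNEL is empty" >&2; _rc=1; }
  [ -n "${SIGNING_KEY:-}" ] || { echo "SIGNING_KEY is empty" >&2; _rc=1; }
  _user="${MINIO_ROOT_USER:-}"
  _pass="${MINIO_ROOT_PASSWORD:-}"
  [ "${#_user}" -ge 3 ] || { echo "MinIO username needs at least 3 characters" >&2; _rc=1; }
  [ "${#_pass}" -ge 8 ] || { echo "MinIO password needs at least 8 characters" >&2; _rc=1; }
  case "${VLLM_TLS_VERIFY:-}" in
    true|false) ;;
    *) echo "VLLM TLS verify must be true or false" >&2; _rc=1 ;;
  esac
  return "$_rc"
}

# Example wiring (commented out):
# . ./config.env && validate_config && ./bootstrap.sh
```

The function reports every problem it finds rather than stopping at the first, so a single run shows everything that needs fixing.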
- OpenShift CLI (oc) installed and logged in to your cluster
- jq for JSON parsing
- curl for API calls
- Valid credentials for all services (Slack, Red Hat API, VLLM, etc.)
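The tool prerequisites above can be verified with a short check. This is a minimal sketch; `missing_tools` is a hypothetical helper name, not something bootstrap.sh is known to provide.

```shell
# Prints the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example wiring for the tools this README lists (commented out):
# missing="$(missing_tools oc jq curl)"
# [ -z "$missing" ] || { echo "Missing tools: $missing" >&2; exit 1; }
# oc whoami >/dev/null 2>&1 || { echo "Not logged in to OpenShift" >&2; exit 1; }
```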
The SREIPS platform consists of 7 main components that are installed in sequence:
- sreips-core: Core SREIPS monitoring and automation framework based on Robusta
- minio: Object storage for data pipeline artifacts
- ocp-mcp: OpenShift MCP server that provides cluster management capabilities for the remediation agent
- rh-kcs-mcp: Red Hat Knowledgebase Content Services MCP server for KB access
- llamastack: AI/ML pipeline infrastructure with Milvus vector database
- sreips-agent: Main SREIPS agent that orchestrates troubleshooting workflows
- remediation-agent: Automated remediation agent for self-healing capabilities with interactive Slack buttons
For detailed architecture and data flow diagrams, see ARCHITECTURE.md
If you prefer to install components individually or need to re-run specific steps:
# Source the configuration and functions
source config.env
source bootstrap.sh
# Run individual installation functions
install_sreips_core # Step 2: Core monitoring framework
install_minio # Step 3: Object storage
install_ocp_mcp # Step 4: OpenShift MCP server for remediation agent
install_rh_kcs_mcp # Step 5: Red Hat KCS MCP server
install_llamastack # Step 6: AI/ML pipeline infrastructure
install_sreips_agent # Step 7: SREIPS and Remediation agents
Note: Manual deployment requires that you run steps in sequence, as later components depend on earlier ones. The remediation agent specifically requires the OCP MCP server (step 4) to perform cluster operations.
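Because the order matters, a small wrapper that stops at the first failing step can help when re-running steps by hand. This is a minimal sketch; `run_in_order` is a hypothetical helper, and it assumes each install function returns a non-zero status on failure.

```shell
# Runs each named function in order and stops at the first failure.
# Assumption: every step signals failure through a non-zero exit status.
run_in_order() {
  for step in "$@"; do
    echo "==> $step"
    "$step" || { echo "Step failed: $step" >&2; return 1; }
  done
}

# Example wiring, after sourcing config.env and bootstrap.sh (commented out):
# run_in_order install_sreips_core install_minio install_ocp_mcp \
#   install_rh_kcs_mcp install_llamastack install_sreips_agent
```

To resume after a failure, pass only the failed step and the ones that follow it.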
If installation fails:
- Check that you're logged into OpenShift: oc whoami
- Verify all required variables are set in config.env
- Check pod status: oc get pods -n <namespace>
- View pod logs: oc logs -n <namespace> <pod-name>
- The script will provide detailed error messages indicating where the failure occurred
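To narrow the pod status check to the pods that need attention, the output of oc get pods can be filtered. This is a minimal sketch; `unhealthy_pods` is a hypothetical helper, and it relies on the default column layout in which STATUS is the third column.

```shell
# Reads `oc get pods --no-headers` output on stdin and prints the name and
# status of every pod whose STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example wiring; replace the placeholder namespace (commented out):
# oc get pods -n "<namespace>" --no-headers | unhealthy_pods
```

A pod that is Running but not yet ready (for example READY 0/1) is not flagged by this filter, so the READY column is still worth a look.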
After the bootstrap script completes successfully, you need to complete the following steps:
In the new RHOAI-3 based implementation, the VECTOR_DB_ID must be obtained after the data ingestion RAG pipeline completes, as RHOAI-3 generates the vector store ID dynamically and no longer uses the given name.
- Get the Vector Store ID from the Pipeline:
  - Navigate to your RHOAI-3 Data Science Pipelines dashboard
  - Find the completed data ingestion RAG pipeline run
  - Copy the generated vector store ID from the pipeline output/logs
- Update the ConfigMap:
  # Edit the configmap to set VECTOR_DB_ID
  oc edit configmap sreips-agent-config -n sreips-agent
  - Add or update the VECTOR_DB_ID environment variable with the vector store ID obtained from step 1
  - Save and exit
- Restart the SREIPS Agent Pod:
  # Restart the pod to pick up the new configuration
  oc delete pod -l app=sreips-agent -n sreips-agent
  The pod will automatically restart with the new configuration.
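For scripted setups, the interactive edit can be replaced with a merge patch. This is a minimal sketch under two assumptions that should be checked against your deployment: that VECTOR_DB_ID is stored as a top-level key under the ConfigMap's data, and that vector store IDs contain only letters, digits, underscores and hyphens. `vector_db_patch` is a hypothetical helper name.

```shell
# Builds a JSON merge patch that sets VECTOR_DB_ID in a ConfigMap's data.
# Rejects empty IDs and IDs with characters outside [A-Za-z0-9_-], so the
# generated JSON is always valid.
vector_db_patch() {
  case "$1" in
    ''|*[!A-Za-z0-9_-]*) echo "invalid vector store ID: $1" >&2; return 1 ;;
  esac
  printf '{"data":{"VECTOR_DB_ID":"%s"}}\n' "$1"
}

# Example wiring; replace the placeholder ID (commented out):
# oc patch configmap sreips-agent-config -n sreips-agent --type merge \
#   -p "$(vector_db_patch "<vector-store-id>")"
# oc delete pod -l app=sreips-agent -n sreips-agent
```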
- Get the Remediation Agent Route URL:
  oc get route sreips-remediation-agent -n sreips-agent -o jsonpath='{.spec.host}'
- Update Slack App Interactivity URL:
  - Go back to your Slack app settings at https://api.slack.com/apps
  - Navigate to Interactivity & Shortcuts
  - Update the Request URL to: https://<route-from-step-1>/remediate
  - Click Save Changes
This enables the interactive remediation buttons in Slack notifications.
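The Request URL can be composed directly from the route host. This is a minimal sketch; `interactivity_url` is a hypothetical helper, and the route name in the example wiring is left as a placeholder.

```shell
# Turns a route host into the Slack Interactivity Request URL.
interactivity_url() {
  host="${1#https://}"   # tolerate a host pasted with a scheme
  host="${host%/}"       # and with a trailing slash
  printf 'https://%s/remediate\n' "$host"
}

# Example wiring; replace the placeholder route name (commented out):
# interactivity_url "$(oc get route "<route-name>" -n sreips-agent -o jsonpath='{.spec.host}')"
```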
To test SREIPS event detection and notification, apply the sample manifests in ./test-manifests. These manifests will generate simulated issues or failures. SREIPS will detect the resulting events, send detailed notifications to your configured Slack channel and include enriched solutions based on data from your enterprise knowledge base and Red Hat KCS.
The remediation-agent provides self-healing capabilities for resource quota issues:
- AI powered analysis: Automatically analyzes quota violations using LlamaStack
- One click fixes: Interactive Slack buttons to trigger automated remediation
- Safe operations: Uses the OCP MCP server to perform auditable cluster operations
- Real time feedback: Immediate success/failure notifications back to Slack
- Secure: Request verification using Slack signing secret to ensure authenticity
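For reference, Slack's documented request signing works as follows: the receiver computes an HMAC-SHA256 over the string v0:<timestamp>:<raw body> using the signing secret, prefixes the hex digest with v0=, and compares it with the X-Slack-Signature header. The sketch below illustrates that scheme with openssl. It is not the remediation agent's actual code, and the function names are hypothetical.

```shell
# Computes the Slack v0 request signature for a timestamp and raw body.
# Usage: slack_signature <signing-secret> <timestamp> <raw-body>
slack_signature() {
  digest="$(printf 'v0:%s:%s' "$2" "$3" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}')"
  printf 'v0=%s\n' "$digest"
}

# Succeeds only if the received signature matches and the timestamp is
# within 300 seconds of "now" (guards against replayed requests).
# Usage: verify_slack_request <secret> <timestamp> <body> <signature> [now]
verify_slack_request() {
  now="${5:-$(date +%s)}"
  age=$(( now - $2 ))
  [ "$age" -ge -300 ] && [ "$age" -le 300 ] || return 1
  [ "$(slack_signature "$1" "$2" "$3")" = "$4" ]
}
```

A production service would normally do this in its own language with a constant-time comparison. Passing the secret on the openssl command line, as done here, also exposes it to other local users through the process list.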
The test-manifests/ directory contains various test scenarios to validate SREIPS detection and notification capabilities:
oc apply -f test-manifests/01-crashloop-pod.yaml
Tests detection of pods stuck in CrashLoopBackOff state. SREIPS will analyze container logs and provide troubleshooting guidance.
oc apply -f test-manifests/02-imagepull-pod.yaml
Tests detection of image pull failures. SREIPS will identify the missing or inaccessible image and suggest resolution steps.
oc apply -f test-manifests/03-oom-pod.yaml
Tests detection of OOM-killed containers. SREIPS will analyze memory usage patterns and recommend appropriate resource limits.
oc apply -f test-manifests/05-pvc-failure.yaml
Tests detection of persistent volume claim binding failures. SREIPS will analyze storage class availability and quota issues.
oc apply -f test-manifests/06-quota-exceeded-pod.yaml
Tests the automated remediation feature for resource quota violations. This will:
- Create a namespace with restrictive resource quotas
- Attempt to deploy a pod that exceeds the quota
- Trigger SREIPS to detect the quota violation
- Send a Slack notification with an interactive "Remediate" button
- Let you click the button to trigger automated quota adjustment via the remediation-agent
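To run or clean up several scenarios at once, the commands can be generated from the directory contents. This is a minimal sketch; `plan_test_manifests` is a hypothetical helper that only prints commands, so the list can be reviewed before anything touches the cluster.

```shell
# Prints one "oc <verb> -f <file>" command per YAML manifest in a directory,
# in file-name order. Nothing is executed.
# Usage: plan_test_manifests <apply|delete> <directory>
plan_test_manifests() {
  for f in "$2"/*.yaml; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when no files match
    printf 'oc %s -f %s\n' "$1" "$f"
  done
}

# Example wiring (commented out):
# plan_test_manifests apply test-manifests        # review the commands
# plan_test_manifests apply test-manifests | sh   # run them
# plan_test_manifests delete test-manifests | sh  # clean up afterwards
```

Piping to sh assumes the manifest paths contain no spaces or shell metacharacters.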