Flagseeker is a modular, plain AI agent developed to test agentic AI abilities at solving CTF cybersecurity challenges.
The goal is to test LLM capabilities at solving cybersecurity tasks and to understand how to enhance those capabilities through context engineering (i.e. prompt engineering, RAG, tooling, etc.). The current setup allows quick prototyping of various agents, environments, prompts, and RAG integrations, with fast benchmarking using containerised Docker environments and multi-threading.
You can read more about the tool itself, see some output and understand how we got here in the following article.
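To give a rough idea of the shape of such an agent, here is a purely hypothetical sketch of a multi-model step loop (separate planning and action models, commands executed inside a containerised environment). None of the names or structure below are Flagseeker's actual code; they only illustrate the idea.

```python
# Hypothetical sketch only -- not Flagseeker's implementation.
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment


def ask(model: str, history: list[dict]) -> str:
    """Single LLM call; the agent keeps separate action/thinking/planning models."""
    resp = client.chat.completions.create(model=model, messages=history)
    return resp.choices[0].message.content


def run_in_container(container: str, command: str) -> str:
    """Run a shell command inside an already-running challenge container."""
    out = subprocess.run(
        ["docker", "exec", container, "bash", "-lc", command],
        capture_output=True, text=True, timeout=60,
    )
    return out.stdout + out.stderr


def solve(task: str, container: str, action_model: str, planning_model: str,
          plan_every: int = 5, max_steps: int = 30) -> str | None:
    history = [{"role": "user", "content": task}]
    for turn in range(max_steps):
        if turn % plan_every == 0:  # periodic re-planning, cf. --plan_every below
            history.append({"role": "assistant", "content": ask(planning_model, history)})
        command = ask(action_model, history)                # next shell command to try
        observation = run_in_container(container, command)  # observation fed back in
        history.append({"role": "user", "content": observation})
        if "flag{" in observation:  # naive success check; flag format varies per CTF
            return observation
    return None
```

To set it up: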
```bash
# Building the docker container
./setup.sh
# installing python requirements
pipenv install
# for getting a shell with installed deps
pipenv shell
```

I like to use OpenRouter to switch between and test out different models, so I've made a bash wrapper that sets up OpenRouter and Langfuse (for observability), which you can run as follows:

```bash
./openrouter_runner.sh --env_image_name "kali-ctf-agent-runner" bench \
--benchmark_fp benchmarks/intercode-ctf/ic_ctf.yaml \
--log_dir logs/final \
--action_model "mistralai/magistral-small-2506" \
--thinking_model "mistralai/magistral-small-2506" \
--planning_model "mistralai/magistral-small-2506" \
--workers 8 \
--skip_tasks 1,2,3 \
--use_rag
```

Alternatively, you can use the CLI directly:

```
python3 -m src.cli bench -h
usage: cli.py bench [-h] --benchmark_fp BENCHMARK_FP --log_dir LOG_DIR [--number_of_attempts NUMBER_OF_ATTEMPTS] [--max_steps MAX_STEPS] [--skip_tasks SKIP_TASKS] [--only_tasks ONLY_TASKS]
[--plan_every PLAN_EVERY] [--plan_at_turns PLAN_AT_TURNS] [--dont_plan_at DONT_PLAN_AT] --action_model ACTION_MODEL [--thinking_model THINKING_MODEL] [--planning_model PLANNING_MODEL]
[--workers WORKERS] [--debug] [--use_rag]
options:
-h, --help show this help message and exit
--benchmark_fp BENCHMARK_FP
--log_dir LOG_DIR
--number_of_attempts NUMBER_OF_ATTEMPTS
--max_steps MAX_STEPS
--skip_tasks SKIP_TASKS
--only_tasks ONLY_TASKS
--plan_every PLAN_EVERY
--plan_at_turns PLAN_AT_TURNS
--dont_plan_at DONT_PLAN_AT
--action_model ACTION_MODEL
--thinking_model THINKING_MODEL
--planning_model PLANNING_MODEL
--workers WORKERS
--debug
--use_rag
```
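For example, a direct run restricted to a couple of tasks might look like this (the flag values are illustrative; the task list is assumed to use the same comma-separated format as `--skip_tasks` above):

```bash
python3 -m src.cli bench \
  --benchmark_fp benchmarks/intercode-ctf/ic_ctf.yaml \
  --log_dir logs/test \
  --action_model "mistralai/magistral-small-2506" \
  --only_tasks 4,7 \
  --plan_every 5 \
  --workers 4
```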
It's using the openai SDK, so you can use any OpenAI-compatible provider (e.g. OpenRouter, LiteLLM, etc.) by setting the following environment variables:

```bash
export OPENAI_API_KEY="....."
export OPENAI_BASE_URL="https://openrouter.ai/api/v1" # change this to your provider's base URL
```
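As a quick sanity check that your provider endpoint is picked up, you can use the same openai SDK directly (the model name here is only an example):

```python
from openai import OpenAI

# OpenAI() reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment,
# so the exports above are all the configuration it needs.
client = OpenAI()
resp = client.chat.completions.create(
    model="mistralai/magistral-small-2506",  # any model your provider exposes
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```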
The CLI currently doesn't support running single tasks; however, you can run custom tasks with the challenge_solver Jupyter notebook.

```bash
# starting the lab
pipenv shell
export OPENAI_API_KEY=sk-or-v1-..........
jupyter-lab
# follow the instructions in `challenge_solver.ipynb`
```

A quick way to get the benchmark results using jq:
```bash
jq -r '[.benchmark.tasks[] | select(.is_solved)] | "\(length) \(input_filename)"' logs/*.json | sort -nr
```
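Assuming the same log structure, a variant that also shows the total number of tasks per run:

```bash
jq -r '.benchmark.tasks as $t | "\([$t[] | select(.is_solved)] | length)/\($t | length) \(input_filename)"' logs/*.json | sort -nr
```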
Here's a quick and dirty way to remove failed JSON logs; I got a few of these during testing. It should rarely happen now, but here it is just in case:

```bash
find . -type f -name '*.json' -print0 |
while IFS= read -r -d '' f; do
if jq . "$f" >/dev/null 2>&1; then
: # valid JSON
else
rm -v -- "$f" # invalid -> delete
fi
done
```

Langfuse is integrated for observability.
Download and start:
```bash
# setup
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
```

Setting it up:

```bash
# environment variables
LANGFUSE_SECRET_KEY="sk-lf-..."
LANGFUSE_PUBLIC_KEY="pk-lf-..."
LANGFUSE_HOST=http://localhost:3000
```

If you're using the openrouter_runner bash script, you can set these variables in it and enable Langfuse by running `export USE_LANGFUSE=1`.
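How the traces get emitted is up to the agent code, but as a sketch of what these variables configure: Langfuse's drop-in wrapper around the openai client reads exactly these keys from the environment (assuming the `langfuse` Python package is installed):

```python
# Sketch of the Langfuse drop-in OpenAI wrapper; it picks up
# LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY and LANGFUSE_HOST from the
# environment and records every completion call as a trace.
from langfuse.openai import OpenAI

client = OpenAI()  # still honours OPENAI_API_KEY / OPENAI_BASE_URL
client.chat.completions.create(
    model="mistralai/magistral-small-2506",  # example model
    messages=[{"role": "user", "content": "hello"}],
)
# the call now shows up as a trace in the Langfuse UI at LANGFUSE_HOST
```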
Sometimes needed when testing, or if you hit Ctrl-C too many times before the cleanup gets a chance to run:

```bash
# change ancestor with your image id
docker rm $(docker stop $(docker ps -a -q --filter ancestor=0677f4e09cab --format="{{.ID}}"))
```

This tool expands on previous research by Palisade Research.
