Flagseeker is a modular, plain AI agent developed to test agentic AI abilities at solving CTF cybersecurity challenges.
The goal is to test LLM capabilities at solving cybersecurity tasks and to understand how to enhance those capabilities through context engineering (i.e. prompt engineering, RAG, tooling, etc.). The current setup allows quick prototyping of various agents, environments, prompts, and RAG integrations, with fast benchmarking using containerised Docker environments and multi-threading.
You can read more about the tool itself, see some output and understand how we got here in the following article.
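To give a rough idea of the shape of such an agent, here is a purely hypothetical sketch of a multi-model step loop (separate planning and action models, commands executed inside a containerised environment). None of the names or structure below are Flagseeker's actual code; they only illustrate the idea.

```python
# Hypothetical sketch only -- not Flagseeker's implementation.
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment


def ask(model: str, history: list[dict]) -> str:
    """Single LLM call; the agent keeps separate action/thinking/planning models."""
    resp = client.chat.completions.create(model=model, messages=history)
    return resp.choices[0].message.content


def run_in_container(container: str, command: str) -> str:
    """Run a shell command inside an already-running challenge container."""
    out = subprocess.run(
        ["docker", "exec", container, "bash", "-lc", command],
        capture_output=True, text=True, timeout=60,
    )
    return out.stdout + out.stderr


def solve(task: str, container: str, action_model: str, planning_model: str,
          plan_every: int = 5, max_steps: int = 30) -> str | None:
    history = [{"role": "user", "content": task}]
    for turn in range(max_steps):
        if turn % plan_every == 0:  # periodic re-planning, cf. --plan_every below
            history.append({"role": "assistant", "content": ask(planning_model, history)})
        command = ask(action_model, history)                # next shell command to try
        observation = run_in_container(container, command)  # observation fed back in
        history.append({"role": "user", "content": observation})
        if "flag{" in observation:  # naive success check; flag format varies per CTF
            return observation
    return None
```

To set it up: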
```bash
# Building the docker container
./setup.sh
# installing python requirements
pipenv install
# for getting a shell with installed deps
pipenv shell
```

I like to use OpenRouter to switch between and test out different models, so I've made a bash wrapper that sets up OpenRouter and Langfuse (for observability), which you can run as follows:

```bash
./openrouter_runner.sh --env_image_name "kali-ctf-agent-runner" bench \
--benchmark_fp benchmarks/intercode-ctf/ic_ctf.yaml \
--log_dir logs/final \
--action_model "mistralai/magistral-small-2506" \
--thinking_model "mistralai/magistral-small-2506" \
--planning_model "mistralai/magistral-small-2506" \
--workers 8 \
--skip_tasks 1,2,3 \
--use_rag
```

Alternatively, you can use the CLI directly:

```
python3 -m src.cli bench -h
usage: cli.py bench [-h] --benchmark_fp BENCHMARK_FP --log_dir LOG_DIR [--number_of_attempts NUMBER_OF_ATTEMPTS] [--max_steps MAX_STEPS] [--skip_tasks SKIP_TASKS] [--only_tasks ONLY_TASKS]
[--plan_every PLAN_EVERY] [--plan_at_turns PLAN_AT_TURNS] [--dont_plan_at DONT_PLAN_AT] --action_model ACTION_MODEL [--thinking_model THINKING_MODEL] [--planning_model PLANNING_MODEL]
[--workers WORKERS] [--debug] [--use_rag]
options:
-h, --help show this help message and exit
--benchmark_fp BENCHMARK_FP
--log_dir LOG_DIR
--number_of_attempts NUMBER_OF_ATTEMPTS
--max_steps MAX_STEPS
--skip_tasks SKIP_TASKS
--only_tasks ONLY_TASKS
--plan_every PLAN_EVERY
--plan_at_turns PLAN_AT_TURNS
--dont_plan_at DONT_PLAN_AT
--action_model ACTION_MODEL
--thinking_model THINKING_MODEL
--planning_model PLANNING_MODEL
--workers WORKERS
--debug
--use_rag
```
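For example, a direct run restricted to a couple of tasks might look like this (the flag values are illustrative; the task list is assumed to use the same comma-separated format as `--skip_tasks` above):

```bash
python3 -m src.cli bench \
  --benchmark_fp benchmarks/intercode-ctf/ic_ctf.yaml \
  --log_dir logs/test \
  --action_model "mistralai/magistral-small-2506" \
  --only_tasks 4,7 \
  --plan_every 5 \
  --workers 4
```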
It's using the openai SDK, so you can use any OpenAI-compatible provider (e.g. OpenRouter, LiteLLM, etc.) by setting the following environment variables:

```bash
export OPENAI_API_KEY="....."
export OPENAI_BASE_URL="https://openrouter.ai/api/v1" # change this to your provider's base URL
```
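As a quick sanity check that your provider endpoint is picked up, you can use the same openai SDK directly (the model name here is only an example):

```python
from openai import OpenAI

# OpenAI() reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment,
# so the exports above are all the configuration it needs.
client = OpenAI()
resp = client.chat.completions.create(
    model="mistralai/magistral-small-2506",  # any model your provider exposes
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```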
The CLI currently doesn't support running single tasks; however, you can run custom tasks with the challenge_solver Jupyter notebook.

```bash
# starting the lab
pipenv shell
export OPENAI_API_KEY=sk-or-v1-..........
jupyter-lab
# follow the instructions in `challenge_solver.ipynb`
```

A quick way to get the benchmark results using jq:
```bash
jq -r '[.benchmark.tasks[] | select(.is_solved)] | "\(length) \(input_filename)"' logs/*.json | sort -nr
```
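Assuming the same log structure, a variant that also shows the total number of tasks per run:

```bash
jq -r '.benchmark.tasks as $t | "\([$t[] | select(.is_solved)] | length)/\($t | length) \(input_filename)"' logs/*.json | sort -nr
```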
Here's a quick and dirty way to remove failed JSON logs; I got a few of these during testing. It should rarely happen now, but here it is just in case:

```bash
find . -type f -name '*.json' -print0 |
while IFS= read -r -d '' f; do
if jq . "$f" >/dev/null 2>&1; then
: # valid JSON
else
rm -v -- "$f" # invalid -> delete
fi
done
```

Langfuse is integrated for observability.
Download and start:
```bash
# setup
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
```

Setting it up:

```bash
# environment variables
LANGFUSE_SECRET_KEY="sk-lf-..."
LANGFUSE_PUBLIC_KEY="pk-lf-..."
LANGFUSE_HOST=http://localhost:3000
```

If you're using the openrouter_runner bash script, you can set these variables in it and enable Langfuse by running `export USE_LANGFUSE=1`.
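How the traces get emitted is up to the agent code, but as a sketch of what these variables configure: Langfuse's drop-in wrapper around the openai client reads exactly these keys from the environment (assuming the `langfuse` Python package is installed):

```python
# Sketch of the Langfuse drop-in OpenAI wrapper; it picks up
# LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY and LANGFUSE_HOST from the
# environment and records every completion call as a trace.
from langfuse.openai import OpenAI

client = OpenAI()  # still honours OPENAI_API_KEY / OPENAI_BASE_URL
client.chat.completions.create(
    model="mistralai/magistral-small-2506",  # example model
    messages=[{"role": "user", "content": "hello"}],
)
# the call now shows up as a trace in the Langfuse UI at LANGFUSE_HOST
```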
Sometimes needed when testing, or if you hit Ctrl-C too many times before the cleanup gets a chance to run:

```bash
# change ancestor with your image id
docker rm $(docker stop $(docker ps -a -q --filter ancestor=0677f4e09cab --format="{{.ID}}"))
```

This tool expands on previous research by Palisade Research.
