fyxme/flagseeker

Flagseeker is a modular, plain AI agent developed to test agentic AI abilities at solving CTF cybersecurity challenges.

The goal is to test LLM capabilities at solving cybersecurity tasks and to understand how to enhance those capabilities through context engineering (i.e. prompt engineering, RAG, tooling, etc.). The current setup allows quick prototyping of various agents, environments, prompts, and RAG integrations, with fast benchmarking using containerised Docker environments and multi-threading.

You can read more about the tool itself, see some output and understand how we got here in the following article.

Installation

# build the Docker container
./setup.sh

# install Python requirements
pipenv install

# open a shell with the installed dependencies
pipenv shell

Running benchmarks

Using the openrouter script

I like to use OpenRouter to switch between and test different models, so I've made a bash wrapper that sets up OpenRouter and Langfuse (for observability), which you can run as follows:

./openrouter_runner.sh --env_image_name "kali-ctf-agent-runner" bench \
    --benchmark_fp benchmarks/intercode-ctf/ic_ctf.yaml \
    --log_dir logs/final \
    --action_model "mistralai/magistral-small-2506" \
    --thinking_model "mistralai/magistral-small-2506" \
    --planning_model "mistralai/magistral-small-2506" \
    --workers 8 \
    --skip_tasks 1,2,3 \
    --use_rag

Using the cli directly

Alternatively, you can use the CLI directly as follows:

python3 -m src.cli bench -h
usage: cli.py bench [-h] --benchmark_fp BENCHMARK_FP --log_dir LOG_DIR [--number_of_attempts NUMBER_OF_ATTEMPTS] [--max_steps MAX_STEPS] [--skip_tasks SKIP_TASKS] [--only_tasks ONLY_TASKS]
                    [--plan_every PLAN_EVERY] [--plan_at_turns PLAN_AT_TURNS] [--dont_plan_at DONT_PLAN_AT] --action_model ACTION_MODEL [--thinking_model THINKING_MODEL] [--planning_model PLANNING_MODEL]
                    [--workers WORKERS] [--debug] [--use_rag]

options:
  -h, --help            show this help message and exit
  --benchmark_fp BENCHMARK_FP
  --log_dir LOG_DIR
  --number_of_attempts NUMBER_OF_ATTEMPTS
  --max_steps MAX_STEPS
  --skip_tasks SKIP_TASKS
  --only_tasks ONLY_TASKS
  --plan_every PLAN_EVERY
  --plan_at_turns PLAN_AT_TURNS
  --dont_plan_at DONT_PLAN_AT
  --action_model ACTION_MODEL
  --thinking_model THINKING_MODEL
  --planning_model PLANNING_MODEL
  --workers WORKERS
  --debug
  --use_rag

It uses the OpenAI SDK, so you can use any OpenAI-compatible provider (e.g. OpenRouter, LiteLLM, etc.) by setting the following environment variables:

export OPENAI_API_KEY="....."
export OPENAI_BASE_URL="https://openrouter.ai/api/v1" # change this to your provider's base URL

Running custom tasks using Jupyter notebook

The CLI doesn't currently support running single tasks; however, you can run custom tasks with the challenge_solver Jupyter notebook.

# starting the lab
pipenv shell
export OPENAI_API_KEY=sk-or-v1-..........
jupyter-lab

# follow the instructions in `challenge_solver.ipynb`

Calculating benchmark results

A quick way to get the benchmark results using jq:

jq -r '[.benchmark.tasks[] | select(.is_solved)] | "\(length) \(input_filename)"' logs/*.json | sort -nr 
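If you prefer Python over jq, the same tally can be sketched as follows. This is a minimal sketch assuming each log file follows the `.benchmark.tasks[].is_solved` schema that the jq one-liner reads, and that the logs live under `logs/`:

```python
import glob
import json


def solved_counts(pattern="logs/*.json"):
    """Count solved tasks per benchmark log, mirroring the jq one-liner."""
    counts = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            log = json.load(f)
        tasks = log["benchmark"]["tasks"]
        counts[path] = sum(1 for task in tasks if task.get("is_solved"))
    # highest solve count first, like `sort -nr`
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


for path, solved in solved_counts():
    print(solved, path)
```

Unlike the jq version, this will raise on corrupt JSON files, so run the cleanup step below first if needed.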

Cleaning up failed logs

A quick and dirty way to remove corrupt JSON logs, of which I got a few during testing. This should rarely happen now, but here it is just in case:

find . -type f -name '*.json' -print0 |
while IFS= read -r -d '' f; do
  if jq . "$f" >/dev/null 2>&1; then
    :  # valid JSON
  else
    rm -v -- "$f"   # invalid -> delete
  fi
done
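The same cleanup can be sketched in Python. This is a minimal equivalent of the shell loop above, assuming you want to recursively delete any `*.json` file under a directory that fails to parse:

```python
import json
from pathlib import Path


def remove_invalid_json(root="."):
    """Delete any *.json file under root that does not parse as JSON."""
    removed = []
    for path in Path(root).rglob("*.json"):
        try:
            json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            path.unlink()  # invalid -> delete
            removed.append(str(path))
    return removed
```

As with the shell loop, run it from (or point it at) the log directory you want to prune, since it deletes files irreversibly.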

Other

langfuse for observability

Langfuse integration for observability.

Download and start:

# setup
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up

Setting it up:

# environment variables

LANGFUSE_SECRET_KEY="sk-lf-..."
LANGFUSE_PUBLIC_KEY="pk-lf-..."
LANGFUSE_HOST=http://localhost:3000

If you're using the openrouter_runner bash script, you can set the variables in it and enable the integration by running `export USE_LANGFUSE=1`.

Cleaning up docker containers

Sometimes needed when testing, or if you Ctrl-C too many times before the tool has had time to clean up:

# replace the ancestor value with your image ID
docker rm $(docker stop $(docker ps -a -q --filter ancestor=0677f4e09cab --format="{{.ID}}"))

Credits

This tool expands on previous research by Palisade Research.