AGENTS.md is the single source of truth for AI agent behavior in this repository.
## Overview

- 4CAT is a Python-based data analysis and collection tool that allows users to import, collect, and analyse datasets through a Web interface. Its key components are:
  - Data sources: Python scripts capable of importing or collecting external data, e.g. via API requests or file uploads.
  - Processors: modular Python scripts that process datasets for the purpose of data analysis. Processors can be chained to manipulate datasets output by other processors.
  - Datasets: outputs of data sources and processors. Datasets take the shape of various files, most commonly CSVs and NDJSONs, but also ZIPs, SVGs, PNGs, or HTMLs.
- 4CAT's three main design principles are:
  - Modularity: 4CAT data sources and processors are compartmentalised to keep up with the volatile nature of social media APIs, data structures, and state-of-the-art techniques in data analysis. This also ensures that one failing feature will not break 4CAT as a whole.
  - Transparency: it should be clear how datasets are collected and processed. This is achieved through front-end GUI elements (e.g. detailed data source and processor descriptions) as well as features like links to specific GitHub pages showing historical code versions.
  - Traceability: one should be able to retrace all collection and analysis steps in 4CAT. This works together with the other two principles.
## Repository structure

- `backend/`: Python backend using Postgres. This:
  - Schedules workers via `WorkerManager` (`backend/lib/manager.py`), which polls the `JobQueue` and spawns workers as threads.
  - Parses and processes search requests.
  - Contains various helper methods and classes.
  - See `backend/database.sql` for the database definition.
  - `backend/workers/`: built-in system workers (API handler, dataset cancellation, update checker, cleanup, metrics, extension management, etc.).
  - `backend/lib/`: core backend classes, i.e. `worker.py` (`BasicWorker`), `processor.py` (`BasicProcessor`), `search.py` (`Search`), `manager.py` (`WorkerManager`), `preset.py`, `scraper.py`, `proxied_requests.py`.
- `webtool/`: Python Flask, Jinja2, and JS front-end. This:
  - Defines the Web interface components and functionality.
  - Handles API requests to the backend.
  - The Flask app is defined in `webtool/__init__.py`; the WSGI entry point is `webtool/4cat.wsgi`.
  - Views are split by concern: `views/api_tool.py`, `views/api_standalone.py`, `views/views_dataset.py`, `views/views_admin.py`, `views/views_user.py`, `views/views_explorer.py`, `views/views_extensions.py`, `views/views_misc.py`, `views/views_restart.py`.
  - Jinja2 templates live in `webtool/templates/`; static assets in `webtool/static/`.
- `common/`: important files and classes used by both the back-end daemon and the front-end web app, plus shared helper functions and assets (i.e. static files) used for analyses. Key modules:
  - `common/lib/dataset.py`: the `DataSet` class (extends `FourcatModule`).
  - `common/lib/fourcat_module.py`: `FourcatModule`, the root superclass for `DataSet` and `BasicProcessor`; contains compatibility-checking logic.
  - `common/lib/module_loader.py`: `ModuleCollector`, which scans `processors/`, `datasources/`, `backend/workers/`, and extension dirs for workers/processors at startup.
  - `common/lib/database.py`: Postgres database wrapper.
  - `common/lib/helpers.py`: shared utility functions.
  - `common/lib/user_input.py`: user input validation/sanitisation.
  - `common/lib/config_definition.py`: default config definitions.
  - `common/lib/llm.py`: LLM integration helpers.
  - `common/lib/job.py` / `common/lib/queue.py`: `Job` and `JobQueue`.
  - `common/config_manager.py`: `ConfigManager` (reads the INI file, the database `settings` table, and a memcached cache) and `ConfigWrapper` (provides user-scoped config).
- `datasources/`: data sources are defined here. These can be import definitions or fully fledged scrapers. Each datasource folder follows a standard structure:
  - `__init__.py`: must define `DATASOURCE` (internal ID) and `NAME` (display name); optionally imports `init_datasource`.
  - A search worker file (e.g. `search_bsky.py`) whose `type` must follow the `{DATASOURCE}-search` or `{DATASOURCE}-import` naming convention.
  - Optional: `DESCRIPTION.md` (shown in the UI), `database.sql` (datasource-specific tables), and Explorer CSS/HTML files.
- `processors/`: modular Python scripts that manipulate datasets in some way. Processors are diverse: they include machine learning analyses as well as 'Download images' and simple metric calculations. Organised into subdirectories by category: `audio/`, `conversion/`, `filtering/`, `machine-learning/`, `machine_learning/`, `metrics/`, `networks/`, `presets/`, `statistics/`, `text-analysis/`, `twitter/`, `visualisation/`.
- `helper-scripts/`: shared helper scripts. Importantly, this also contains migration scripts for upgrading to new 4CAT versions, which often involves database manipulation. See the Versioning and Migrations section below.
- `config/`: configuration directory. Contains `config.ini` (INI-based runtime config), `module_config.bin` (pickled module-defined config options), and `.current-version` (last-migrated version). `config/extensions/` holds installed extensions: modular add-ons to 4CAT that are loaded at startup and can be data sources, processors, or workers. Folders here may be managed externally and may be symlinks. Each extension can have its own `requirements.txt`.
## Core class hierarchy

The core inheritance chain for data processing:

```
FourcatModule (common/lib/fourcat_module.py)
├── DataSet (common/lib/dataset.py)
└── BasicWorker (backend/lib/worker.py, extends threading.Thread)
    └── BasicProcessor (backend/lib/processor.py)
        └── Search (backend/lib/search.py)
```

- `BasicWorker`: abstract thread-based worker. Key attributes: `type` (must match the job ID) and `max_workers` (parallelism limit, default 1). Workers with an `ensure_job` classmethod auto-queue recurring jobs on startup.
- `BasicProcessor`: abstract processor. Required class attributes: `type`, `category`, `title`, `description`, `extension`. Required methods: `process()` and `get_options()`. The `options` class attribute is deprecated; use `get_options()` instead.
- `Search`: abstract search worker for datasources. Its `type` should end with `-search` or `-import` (e.g. `bsky-search`).
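To make the `BasicProcessor` contract concrete, here is a minimal sketch. The class itself (`ExampleProcessor`) is hypothetical, and the helper calls (`source_dataset.iterate_items`, `dataset.update_status`, `write_csv_items_and_finish`) and the exact `get_options()` signature are assumptions based on common processor patterns; verify them against existing processors in `processors/`.

```python
# Minimal sketch of a processor, not a drop-in implementation.
# Required attributes/methods follow this document; helper calls and
# the get_options() signature are assumptions to check against
# existing processors in processors/.
from backend.lib.processor import BasicProcessor


class ExampleProcessor(BasicProcessor):
	type = "example-counter"  # must match the job ID
	category = "Metrics"
	title = "Example counter"
	description = "Counts the items in the parent dataset."
	extension = "csv"

	@classmethod
	def get_options(cls, parent_dataset=None, config=None):
		# get_options() replaces the deprecated `options` class attribute
		return {}

	def process(self):
		count = sum(1 for _ in self.source_dataset.iterate_items(self))
		self.dataset.update_status(f"Counted {count} items")
		self.write_csv_items_and_finish([{"item_count": count}])
```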
## Database schema

Key Postgres tables (defined in `backend/database.sql`):

- `settings`: key-value config store (name, value, tag).
- `jobs`: the job queue (jobtype, remote_id, timestamps, interval for recurring jobs).
- `datasets`: all datasets (key, type, key_parent for chaining, parameters as JSON, result_file, status, progress, etc.).
- `datasets_owners`: many-to-many user-dataset ownership.
- `annotations`: dataset item annotations.
- `metrics`: datasource metrics by date.
- `users`: user accounts (name, password hash, userdata JSON, tags JSONB).
- `access_tokens`: API access tokens.
- `users_favourites`, `users_notifications`: user preferences and notifications.
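For orientation, the `key_parent` chaining can be pictured with the database wrapper. This is a sketch: it assumes the wrapper in `common/lib/database.py` exposes a `fetchall()` method taking a parameterised query, which should be verified against the actual class.

```python
# Sketch: fetch the datasets chained to a parent via key_parent.
# Assumes the common/lib/database.py wrapper exposes fetchall() with
# psycopg-style parameterised queries; verify before relying on this.
def get_child_datasets(db, dataset_key):
	return db.fetchall(
		"SELECT key, type, result_file FROM datasets WHERE key_parent = %s",
		(dataset_key,)
	)
```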
## Docker setup

- `docker-compose.yml` defines four services: `db` (Postgres), `memcached`, `backend`, and `webtool`.
- All services are configured via a `.env` file (referenced by `env_file: .env`).
- The Docker image is based on `python:3.11-slim-trixie` (see `docker/Dockerfile`). Dependencies are installed via `pip install -r requirements.txt` (which just runs `pip install -e .` using `setup.py`). Gunicorn is installed separately for the frontend.
- Backend entrypoint (`docker/docker-entrypoint.sh`):
  - Waits for Postgres and memcached to be healthy.
  - Seeds the database from `backend/database.sql` if tables don't exist (fresh install).
  - Removes a stale PID lockfile.
  - Runs `helper-scripts/migrate.py -y` to apply pending migrations.
  - Runs `docker.docker_setup` to sync `.env` vars into `config/config.ini` (see the sketch after this list).
  - Starts the backend daemon via `python3 4cat-daemon.py start`.
- Frontend entrypoint (`docker/wait-for-backend.sh`): waits for the backend, runs a frontend-specific migration, then starts Gunicorn (default: 4 workers, 4 threads, `gthread` worker class, binding `0.0.0.0:5000`).
- Named volumes: `4cat_db`, `4cat_data`, `4cat_config`, `4cat_logs`.
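The `.env`-to-INI sync step can be pictured as follows. This is an illustrative sketch only, not the actual `docker_setup` code; the environment variable name and INI section are hypothetical.

```python
# Illustrative sketch of syncing a .env variable into config/config.ini.
# The variable name and INI section are hypothetical; see the real
# docker.docker_setup module for the actual mapping.
import configparser
import os

config = configparser.ConfigParser()
config.read("config/config.ini")

if not config.has_section("DATABASE"):
	config.add_section("DATABASE")
# POSTGRES_HOST is a hypothetical .env variable name
config["DATABASE"]["db_host"] = os.environ.get("POSTGRES_HOST", "db")

with open("config/config.ini", "w") as ini_file:
	config.write(ini_file)
```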
## Running locally

- Backend: `python 4cat-daemon.py start` (or `restart` / `stop`).
- Python `>= 3.11` is required (enforced in `setup.py`).
- Install dependencies with `pip install -e .` (runs `setup.py`, which unions core, processor, and extension packages).
- The `python-daemon` package is Unix-only and is excluded on Windows (`os.name == "nt"`).
## Configuration

- INI-based (`config/config.ini`): primary runtime config, read by `ConfigManager`. Docker's `docker_setup.py` syncs environment variables into this file.
- Database: the `settings` table stores runtime-configurable settings (name/value/tag). `ConfigManager` reads from both the INI file and the database, with memcached caching.
- Module config: module-defined `config` dicts (on worker classes) are collected by `ModuleCollector` at startup and cached to `config/module_config.bin`.
- Legacy: `config.py` in the repo root contains some legacy constants. Prefer the `ConfigManager`/`config.ini` patterns.
- Extensions: installed under `config/extensions/`. Each extension can include its own `requirements.txt` (auto-installed by `setup.py`). Extensions are enabled/disabled via the `extensions.enabled` setting.
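In code, this layered config is read through `ConfigManager`. The sketch below shows the assumed usage pattern; the setting name is hypothetical and the accessor's exact signature should be checked in `common/config_manager.py`.

```python
# Assumed usage pattern for the layered config (INI + settings table
# + memcached). The setting name is hypothetical; check the accessor's
# signature in common/config_manager.py.
from common.config_manager import ConfigManager

config = ConfigManager()
data_path = config.get("PATH_DATA")  # hypothetical setting name
```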
## Versioning and Migrations

- The current version is stored in the `VERSION` file (currently `1.53`). Do not edit this file casually.
- `config/.current-version` tracks the last-migrated version for the running instance.
- `helper-scripts/migrate.py` compares `VERSION` to `.current-version` and runs the appropriate `migrate-X.XX-X.XX.py` scripts from `helper-scripts/migrate/` in sequence.
- Migration scripts handle database schema changes, data transformations, and config updates between versions.
- Docker runs the migration automatically on each startup.
- For any schema or breaking change: create a new `migrate-{old}-{new}.py` script (see the sketch after this list) and bump the `VERSION` file. Never edit existing migration scripts.
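A new migration script would live at e.g. `helper-scripts/migrate/migrate-1.53-1.54.py` (hypothetical version numbers) and make idempotent changes. The connection setup below is a guess; mirror an existing script in `helper-scripts/migrate/` for the real pattern, and remember to update `backend/database.sql` for fresh installs as well.

```python
# Hypothetical migrate-1.53-1.54.py: an idempotent schema change for
# existing installs. The connection setup and INI option names are
# guesses; mirror an existing script in helper-scripts/migrate/.
import configparser

import psycopg2

ini = configparser.ConfigParser()
ini.read("config/config.ini")

connection = psycopg2.connect(
	host=ini["DATABASE"]["db_host"],
	dbname=ini["DATABASE"]["db_name"],
	user=ini["DATABASE"]["db_user"],
	password=ini["DATABASE"]["db_password"],
)

with connection, connection.cursor() as cursor:
	# IF NOT EXISTS keeps the migration safe to re-run
	cursor.execute("ALTER TABLE datasets ADD COLUMN IF NOT EXISTS notes TEXT DEFAULT ''")
```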
## Coding guidelines

- Preserve existing behavior unless a requested change intentionally modifies behavior.
- Prioritize correctness, testability, and clear error handling.
- Keep edits small, reviewable, and aligned with existing patterns and design principles.
- Follow existing project style, conventions, and naming.
- Indentation: the codebase uses tabs for Python indentation. Match this in all edits.
- Python `>= 3.11` is required. Use modern Python features (e.g. `match`, `|` for union types) when appropriate; see the snippet after this list.
- Avoid adding dependencies unless clearly necessary. If a dependency is added, add it to the appropriate set in `setup.py` (`core_packages` or `processor_packages`) with a version pin.
- Add comments only when logic is non-obvious.
- Breaking changes are allowed when needed for correctness, maintainability, or delivery speed.
- Prefer minimal, targeted changes over broad refactors unless a broader change is the best path.
- Dependencies are managed via `setup.py` (not Poetry). Install with `pip install -e .`.
- Respect environment-based behavior documented in repository docs/config.
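As a generic illustration of that style (not code from the 4CAT codebase):

```python
# Generic illustration of `match` and `|` union types; not 4CAT code.
def describe_result(result: str | int | None) -> str:
	match result:
		case None:
			return "no result"
		case int() as count:
			return f"{count} items"
		case str() as path:
			return f"written to {path}"
		case _:
			raise TypeError(f"unexpected result type: {type(result)}")
```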
## Data model and persistence changes

- Avoid destructive changes unless explicitly requested.
- Call out migration or compatibility impact clearly.
- If adding or altering database tables: update `backend/database.sql` for fresh installs and create a migration script in `helper-scripts/migrate/` for existing installs.

## Creating processors and datasources

- When creating a new processor:
  - Subclass `BasicProcessor`. Define `type`, `category`, `title`, `description`, and `extension`.
  - Implement `process()` and `get_options()`. Do not use the deprecated `options` class attribute.
  - Place it in the appropriate `processors/` subdirectory.
  - See the processor sketch in the Core class hierarchy section above.
- When creating a new datasource (see the sketch after this list):
  - Create a folder under `datasources/` with an `__init__.py` defining `DATASOURCE` and `NAME`.
  - Create a search worker extending `Search` with `type` set to `{DATASOURCE}-search` or `{DATASOURCE}-import`.
  - Optionally add `DESCRIPTION.md`, `database.sql`, and explorer files.
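Putting the datasource convention together, a minimal sketch might look like this. The folder name, worker class, and the `get_items()` hook are assumptions based on common search worker patterns; mirror an existing datasource (e.g. the one containing `search_bsky.py`) for the real interface.

```python
# datasources/example/__init__.py (hypothetical)
DATASOURCE = "example"  # internal ID
NAME = "Example datasource"  # display name
```

```python
# datasources/example/search_example.py (hypothetical). The get_items()
# hook is an assumption; mirror an existing search worker for the
# real interface expected by Search.
from backend.lib.search import Search


class SearchExample(Search):
	type = "example-search"  # must follow {DATASOURCE}-search or {DATASOURCE}-import
	category = "Search"
	title = "Example search"
	description = "Imports items from a hypothetical external source."
	extension = "ndjson"

	def get_items(self, query):
		# yield one dict per collected item
		yield {"id": 1, "body": "example item"}
```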
## Frontend

- Use existing frontend formatting.
- Define reusable Jinja2 components when patterns emerge, but avoid over-engineering for future reuse.
- Views are organized by concern in `webtool/views/`. API endpoints are in `api_tool.py` and `api_standalone.py`.
- Static assets go in `webtool/static/`; templates in `webtool/templates/`.
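A new view would follow the split-by-concern layout. The sketch below is hypothetical: the shared `app` import and decorator style are assumptions, so copy the pattern from an existing module in `webtool/views/` rather than this snippet.

```python
# Hypothetical view sketch for webtool/views/. The `app` import and
# route style are assumptions; follow existing view modules.
from flask import render_template

from webtool import app


@app.route("/example/")
def example_page():
	return render_template("example.html")
```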
## Testing and linting

- Run tests with `pytest` from the repo root. Config is in `pytest.ini`.
- The test suite is in `tests/test_modules.py`. It validates:
  - Logger initialization (`test_logger`).
  - Module loading (`test_module_collector`): ensures all workers, processors, and datasources load without errors and no modules are missing.
  - Processor validity (`test_processors`, depends on `test_module_collector`): for every processor, checks that it is a `BasicProcessor` subclass, has the required attributes (`type`, `category`, `title`, `description`, `extension`), has the required methods (`get_options`, `process`), that `get_options()` runs without error, that the deprecated `options` attribute is not present, and that the class can be instantiated.
  - Datasource validity (`test_datasources`, depends on `test_module_collector`): checks search worker naming conventions (`{DATASOURCE}-search` or `{DATASOURCE}-import`) and that each datasource has a worker (`has_worker`).
- Tests mock `Database`, `ConfigManager`, `Job`, `JobQueue`, and `DataSet` extensively via `unittest.mock`. See the fixtures in `test_modules.py` for patterns.
- After making changes: at minimum, run `pytest` to ensure all modules still load and pass validation. New processors/datasources must pass `test_processors`/`test_datasources`.
- There are no integration or end-to-end tests. For complex logic, consider adding targeted unit tests (see the sketch after this list).
- `ruff` is available as a dependency. Run `ruff check .` to lint changed files. No custom `ruff` config exists (defaults apply), but respect the tab-indentation convention.
- Check whether you are on a Unix-like system (macOS/Linux) or Windows to choose the correct commands.
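A targeted unit test can reuse the suite's mocking style. This sketch is hypothetical: the patched target mirrors the kind of mocking described above, and the imported processor module does not exist; adapt the fixtures from `tests/test_modules.py` instead of copying this verbatim.

```python
# Hypothetical targeted test in the suite's unittest.mock style.
# The patched target and the imported processor are illustrative only;
# adapt the fixtures in tests/test_modules.py.
from unittest.mock import MagicMock, patch


def test_example_processor_options():
	with patch("common.lib.database.Database", MagicMock()):
		# hypothetical processor module
		from processors.metrics.example import ExampleProcessor
		options = ExampleProcessor.get_options()
		assert isinstance(options, dict)
```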
## Safety

- Never commit secrets or real credentials.
- Avoid destructive operations by default (data deletion, schema resets, irreversible changes).
- Explicitly document risks, assumptions, and trade-offs when they matter.
## Reporting

For each meaningful change, report in a concise format covering:

- what changed
- why it changed
- how it was validated
- any suggested follow-up
## Workflow

- Read relevant files first.
- Propose a short plan for non-trivial changes.
- Implement a minimal patch.
- Run targeted validation (`pytest`, `ruff check`).
- Report outcomes with file paths.
## Precedence

- If any other instruction file conflicts with `AGENTS.md`, follow `AGENTS.md`.