Contributing to Boost Data Collector

This document describes how to contribute to the project, with emphasis on the service layer and data-write rules.

Creating a new collector

Use the startcollector management command to generate a new Django app with the usual collector layout (stub models.py, services.py, AbstractCollector + BaseCollectorCommand, tests/ package, migrations/0001_initial.py, and schedule_snippet.yaml). Run it from the repository root so the new package sits next to the other apps.

# Preview only (no files written)
python manage.py startcollector my_platform --dry-run

# Create ./my_platform/ (pick a unique snake_case name)
python manage.py startcollector my_platform

What you get

App package my_platform/ with apps.py (BigAutoField, correct name), models.py (stub run-state model), services.py (stub record_run — all writes for this app should stay in this module per Service layer below).
management/commands/run_my_platform.py — collector subclasses AbstractCollector (not the deprecated CollectorBase); command subclasses BaseCollectorCommand.
tests/test_run_my_platform_command.py — smoke test; it runs only after you register the app (next step).
schedule_snippet.yaml — commented template to paste into config/boost_collector_schedule.yaml (see Workflow.md).

What you must do manually

Add "my_platform" to INSTALLED_APPS in config/settings.py (keep alphabetical order with the other project apps).
Merge the task from schedule_snippet.yaml into config/boost_collector_schedule.yaml under the right groups.<name>.tasks entry (see Workflow.md).
Run python manage.py migrate so the new tables exist.
When the app imports other apps or defines cross-app foreign keys, update cross-app-dependencies.md (add or adjust the row for your app).
As services.py grows, run python scripts/generate_service_docs.py and commit the generated docs/service_api/ updates when you add public service functions.

Docs and contracts

Collector lifecycle, errors, and optional sync_pinecone: How_to_add_a_collector.md and Core_public_API.md.
CI: The pyright workflow runs python scripts/validate_collector_scaffold.py, which recreates a throwaway app under .test_artifacts/, then runs ruff and a scoped pyright check on that tree.

Service layer: single place for writes

Each Django app that has models provides a services.py module. This is the only place where code should create, update, or delete rows for that app’s models.

Rule

All inserts/updates/deletes for an app’s models must go through functions in that app’s services.py.
Do not call Model.objects.create(), model.save(), or model.delete() from outside services.py (e.g. from management commands, views, other apps, or tests that are not testing the service layer itself).

Why

Single place for write logic: Validation, defaults, and side effects live in one module.
Easier to change: Schema or business rules can be updated in one place.
Clear API: Contributors know where to look and what to call.

Which apps have a service layer

App	File	Notes
`cppa_user_tracker`	`cppa_user_tracker/services.py`	Identity, profiles, emails, staging.
`github_activity_tracker`	`github_activity_tracker/services.py`	Repos, languages, licenses, issues, PRs.
`boost_library_tracker`	`boost_library_tracker/services.py`	Boost libraries, versions, dependencies, categories, roles.
`boost_library_docs_tracker`	`boost_library_docs_tracker/services.py`	BoostDocContent and BoostLibraryDocumentation (doc scrape and sync status).
`boost_usage_tracker`	`boost_usage_tracker/services.py`	External repos, Boost usage, missing-header tmp.
`cppa_pinecone_sync`	`cppa_pinecone_sync/services.py`	Pinecone fail list and sync status writes.
`discord_activity_tracker`	`discord_activity_tracker/services.py`	Servers, channels, messages, reactions (Discord user profiles in cppa_user_tracker).
`cppa_youtube_script_tracker`	`cppa_youtube_script_tracker/services.py`	YouTube channels, videos, tags, transcript state, speaker links.
`clang_github_tracker`	`clang_github_tracker/services.py`	Clang/llvm GitHub issue, PR, and commit upserts; fetch watermarks.
`boost_mailing_list_tracker`	`boost_mailing_list_tracker/services.py`	Mailing list messages and names.
`cppa_slack_tracker`	`cppa_slack_tracker/services.py`	Slack teams, channels, messages, membership.
`wg21_paper_tracker`	`wg21_paper_tracker/services.py`	WG21 papers, authors, mailings.

For a full list of functions, parameter/return types, and validation (e.g. empty name raises ValueError), see docs/Service_API.md and the per-app docs in docs/service_api/ (index: docs/service_api/README.md). DTO protocols shared across trackers are documented in docs/service_api/core_protocols.md (generated from core/protocols.py).

Regenerating service API docs

Reference tables in docs/service_api/*.md are produced by scripts/generate_service_docs.py from each app’s services.py and from core/protocols.py.

Markers: Each file contains  … . The script replaces only that region. Put hand-written notes (usage, cross-app warnings, command help) below the END marker.
Regenerate locally: python scripts/generate_service_docs.py (optional: --app <django_app_label> for one module).
Check only: python scripts/generate_service_docs.py --check exits non-zero if committed markdown would change.
CI / pre-commit: The lint job runs pre-commit, which includes this check. Pull requests that change only ignored paths (**.md, docs/** per .github/workflows/actions.yml) do not run CI; any PR that touches **/services.py or core/protocols.py still runs the check—regenerate docs before pushing.

How to use

From management commands or other apps: Import and call the service functions.

from cppa_user_tracker.services import create_identity, add_email
from github_activity_tracker.services import get_or_create_language, add_pull_request_label

identity = create_identity(display_name="Jane")
add_email(identity.profiles.first(), "jane@example.com")
lang, _ = get_or_create_language("Python")
add_pull_request_label(pr, "bug")

Adding new write behavior: Add a new function in the app’s services.py (and optionally a helper in the same module). Do not add new writes by calling the model or manager directly from outside services.py.
Reading data: No restriction. Use the ORM as usual: Model.objects.filter(...), model.related_set.all(), etc.

Testing

Running tests: From the project root, install dev deps (pip install -r requirements-dev.lock or uv pip install -r requirements-dev.lock), start the test database (docker compose -f docker-compose.test.yml up -d), set DATABASE_URL (and SECRET_KEY for the process) as in README.md, then run python -m pytest. Tests always use PostgreSQL (config.test_settings); there is no SQLite fallback.
See README.md and docs/Development_guideline.md for full commands and options.
Unit tests for services.py: Call the service functions and assert on the database (or mocks) as needed.
Other tests: Prefer service functions when setting up data. If you must create models directly for tests, keep it in test code (e.g. fixtures or test helpers) and avoid doing the same in production code.

Performance benchmarks

Throughput checks live under benchmarks/ and use pytest-benchmark. They are not collected during normal pytest runs: set RUN_BENCHMARKS=1 so the root conftest.py stops ignoring that directory (see collect_ignore). Tests are marked with @pytest.mark.benchmark.

Prerequisites: Same as unit tests: PostgreSQL, DATABASE_URL, SECRET_KEY, DJANGO_SETTINGS_MODULE=config.test_settings (see README.md).

Run locally (from repo root, with Postgres up):

export RUN_BENCHMARKS=1
export DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/postgres
export SECRET_KEY=for-local-only
export DJANGO_SETTINGS_MODULE=config.test_settings
# Optional: batch size (default 50; match benchmarks/baselines.json "n")
export BENCHMARK_COMMIT_N=50

uv run pytest benchmarks/ -m benchmark --benchmark-only \
  --benchmark-json=bench.json -v \
  --benchmark-disable-gc
uv run python benchmarks/compare_to_baseline.py bench.json benchmarks/baselines.json

Baselines: benchmarks/baselines.json stores maximum acceptable median seconds per scenario (for the configured n). The compare script fails if any median exceeds baseline_median × 1.25 (more than 25% slower than the reference). After a deliberate performance change or a CI image upgrade, update median_seconds (and n if you change BENCHMARK_COMMIT_N) using stats.median from the generated JSON.

CI: The .github/workflows/benchmarks.yml workflow runs on workflow_dispatch only, uploads bench.json as an artifact, and runs the compare step on success.

Other guidelines

Branching: Create feature branches from develop. Open pull requests against develop. See docs/Development_guideline.md.
Code style: Use Python 3.11+ and follow Django and project conventions. Use the project’s logging (logging.getLogger(__name__)). Before pushing, run uv run pyright (with dev deps) for the paths covered by pyrightconfig.json, and ensure CI’s lint / pyright / test jobs would pass.
Database: Use the Django ORM and migrations. Writes only through the service layer as above.
Docs: Update this file (and app services.py docstrings) when adding new apps or changing the write rules. After changing services.py or core/protocols.py, run python scripts/generate_service_docs.py and commit the updated docs/service_api/ files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contributing to Boost Data Collector

Creating a new collector

Service layer: single place for writes

Rule

Why

Which apps have a service layer

Regenerating service API docs

How to use

Testing

Performance benchmarks

Other guidelines

Related documentation

FilesExpand file tree

CONTRIBUTING.md

Latest commit

History

CONTRIBUTING.md

File metadata and controls

Contributing to Boost Data Collector

Creating a new collector

Service layer: single place for writes

Rule

Why

Which apps have a service layer

Regenerating service API docs

How to use

Testing

Performance benchmarks

Other guidelines

Related documentation