Problem
The boost-data-collector is moving toward production deployment on GCP (staging was the Week 20 deliverable). The current Dockerfile and docker-compose configuration are optimized for development, not for a production environment that handles credentials for six platforms (GitHub, Slack, Discord, Pinecone, YouTube, WG21). Production deployment requires three things: a hardened container image, health check endpoints that reflect actual collector health (not just process liveness), and baseline monitoring that detects silent data gaps, the domain's most feared failure mode.
Acceptance Criteria
- Harden Dockerfile: non-root user, minimal base image, pinned system package versions, .dockerignore excludes dev files
- Add health check endpoint (or enhance existing) that reports: last successful collection timestamp per collector group, Celery worker status, database connectivity
- Add structured logging output (JSON format) suitable for GCP Cloud Logging / Stackdriver ingestion
- Add a docker-compose.prod.yml override (or document the production overrides) with resource limits, restart policies, and secret injection via environment
- Pair with Daniel on the GCP Cloud Run / Cloud SQL deployment configuration
- Verify the hardened image builds, passes smoke tests, and runs migrations successfully
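As one possible starting point for the structured-logging criterion, the sketch below emits one JSON object per log line using only the Python standard library. It assumes the services log to stdout and that Cloud Logging picks up the `severity`, `message`, and `time` fields from JSON lines; that should be verified against the actual Cloud Run setup. The names `JsonFormatter` and `configure_logging` and the `collector` extra field are hypothetical, not taken from the existing codebase.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Field names chosen to match Cloud Logging's structured-log
            # conventions (assumption: verify against the deployed agent).
            "severity": record.levelname,
            "message": record.getMessage(),
            "time": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "logger": record.name,
        }
        # Hypothetical extra field: which collector emitted the record.
        collector = getattr(record, "collector", None)
        if collector is not None:
            payload["collector"] = collector
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(stream=sys.stdout, level=logging.INFO) -> logging.Logger:
    """Replace the root logger's handlers with one JSON stream handler."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
    return root
```

A collector would then log with something like `logging.getLogger("collector.github").info("run finished", extra={"collector": "github"})`, which makes it possible to filter by collector in the log viewer.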
Implementation Notes
Coordinate with Daniel (@snowfox1003), who owns the GCP staging deployment. Key areas:
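(1) the Dockerfile currently runs as root: create an unprivileged user in the image and switch to it (for example USER nonroot);
(2) the gunicorn config should set an explicit worker class: --worker-class gthread for a threaded WSGI app, or --worker-class uvicorn.workers.UvicornWorker if the app is served over ASGI and needs async support;
(3) Celery worker should have --max-tasks-per-child set so that worker processes are recycled before leaked memory accumulates in long-running workers;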
(4) the HEALTHCHECK Docker instruction should hit the health endpoint. The docker-compose.ci.yml already exists as a reference for test configuration.
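The payload behind the health endpoint that HEALTHCHECK hits could be assembled roughly as follows. This is a minimal sketch, not the project's actual implementation: `build_health_report`, its parameters, and the per-group staleness thresholds are all hypothetical, and the caller is assumed to supply the last-success timestamps, the database check result, and the number of responding Celery workers.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def build_health_report(
    last_success: Dict[str, Optional[datetime]],
    max_age: Dict[str, timedelta],
    db_ok: bool,
    workers_alive: int,
    now: Optional[datetime] = None,
) -> dict:
    """Combine collector freshness, DB and worker state into one report.

    A collector group counts as stale when it has never succeeded or when
    its last success is older than the allowed age for that group.
    """
    now = now or datetime.now(timezone.utc)
    collectors = {}
    for group, limit in max_age.items():
        ts = last_success.get(group)
        stale = ts is None or (now - ts) > limit
        collectors[group] = {
            "last_success": ts.isoformat() if ts else None,
            "stale": stale,
        }
    healthy = (
        db_ok
        and workers_alive > 0
        and not any(c["stale"] for c in collectors.values())
    )
    return {
        "status": "ok" if healthy else "degraded",
        "database": db_ok,
        "celery_workers": workers_alive,
        "collectors": collectors,
    }
```

Reporting staleness per collector group, rather than a single up/down flag, is what lets the endpoint surface a silent data gap even when every process is still running. Whether a "degraded" status should fail the Docker HEALTHCHECK (and trigger a restart) or only raise an alert is a design decision left open here.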