diff --git a/README.md b/README.md
index 930ee61..8e56851 100644
--- a/README.md
+++ b/README.md
@@ -24,31 +24,33 @@ Currently works with npm, PyPI, pub.dev, and Composer, which all include publish
 
 ## Supported Registries
 
-| Registry | Language/Platform | URL Resolution | Handler | Completed |
-|----------|-------------------|:--------------:|:-------:|:---------:|
-| npm | JavaScript | Yes | Yes | ✓ |
-| Cargo | Rust | Yes | Yes | ✓ |
-| RubyGems | Ruby | Yes | Yes | ✓ |
-| Go proxy | Go | Yes | Yes | ✓ |
-| Hex | Elixir | Yes | Yes | ✓ |
-| pub.dev | Dart | Yes | Yes | ✓ |
-| PyPI | Python | Yes | Yes | ✓ |
-| Maven | Java | Yes | Yes | ✓ |
-| NuGet | .NET | Yes | Yes | ✓ |
-| Composer | PHP | Yes | Yes | ✓ |
-| Conan | C/C++ | Yes | Yes | ✓ |
-| Conda | Python/R | Yes | Yes | ✓ |
-| CRAN | R | Yes | Yes | ✓ |
-| Container | Docker/OCI | Yes | Yes | ✓ |
-| Debian | Debian/Ubuntu | Yes | Yes | ✓ |
-| RPM | RHEL/Fedora | Yes | Yes | ✓ |
-| Alpine | Alpine Linux | No | No | ✗ |
-| Arch | Arch Linux | No | No | ✗ |
-| Chef | Chef | No | No | ✗ |
-| Generic | Any | No | No | ✗ |
-| Helm | Kubernetes | No | No | ✗ |
-| Swift | Swift | No | No | ✗ |
-| Vagrant | Vagrant | No | No | ✗ |
+| Registry | Language/Platform | Cooldown | Completed |
+|----------|-------------------|:--------:|:---------:|
+| npm | JavaScript | Yes | ✓ |
+| Cargo | Rust | | ✓ |
+| RubyGems | Ruby | | ✓ |
+| Go proxy | Go | | ✓ |
+| Hex | Elixir | | ✓ |
+| pub.dev | Dart | Yes | ✓ |
+| PyPI | Python | Yes | ✓ |
+| Maven | Java | | ✓ |
+| NuGet | .NET | | ✓ |
+| Composer | PHP | Yes | ✓ |
+| Conan | C/C++ | | ✓ |
+| Conda | Python/R | | ✓ |
+| CRAN | R | | ✓ |
+| Container | Docker/OCI | | ✓ |
+| Debian | Debian/Ubuntu | | ✓ |
+| RPM | RHEL/Fedora | | ✓ |
+| Alpine | Alpine Linux | | ✗ |
+| Arch | Arch Linux | | ✗ |
+| Chef | Chef | | ✗ |
+| Generic | Any | | ✗ |
+| Helm | Kubernetes | | ✗ |
+| Swift | Swift | | ✗ |
+| Vagrant | Vagrant | | ✗ |
+
+Cooldown requires publish timestamps in metadata. Registries without a "Yes" in the cooldown column either don't expose timestamps or haven't been wired up yet.
 
 ## Quick Start
 
@@ -465,9 +467,10 @@ Recently cached:
 
 | Endpoint | Description |
 |----------|-------------|
-| `GET /` | Welcome message and endpoint list |
+| `GET /` | Dashboard (web UI) |
 | `GET /health` | Health check (returns "ok" if healthy) |
 | `GET /stats` | Cache statistics (JSON) |
+| `GET /metrics` | Prometheus metrics |
 | `GET /npm/*` | npm registry protocol |
 | `GET /cargo/*` | Cargo sparse index protocol |
 | `GET /gem/*` | RubyGems protocol |
@@ -667,6 +670,46 @@ Response:
 └─────────┘
 ```
 
+## Web Interface
+
+The proxy serves a web UI at the root URL. No separate frontend build is needed -- templates and assets are embedded in the binary.
+
+- **Dashboard** (`/`) -- cache stats, popular packages, recently cached artifacts, and vulnerability overview.
+- **Install guide** (`/install`) -- per-ecosystem configuration instructions, so you don't have to look them up here.
+- **Package browser** (`/packages`) -- browse all cached packages with filtering by ecosystem and sorting by hits, size, name, or vulnerability count.
+- **Search** (`/search?q=...`) -- search cached packages by name.
+- **Package detail** (`/package/{ecosystem}/{name}`) -- metadata, license, vulnerabilities, and version list for a package. You can select two versions to compare.
+- **Version detail** (`/package/{ecosystem}/{name}/{version}`) -- per-version metadata, integrity hash, artifact cache status, and hit counts.
+- **Source browser** (`/package/{ecosystem}/{name}/{version}/browse`) -- browse files inside cached archives with syntax highlighting for text files and image previews.
+- **Version diff** (`/package/{ecosystem}/{name}/compare/{v1}...{v2}`) -- side-by-side diff of two cached versions showing added, removed, and changed files.
+
+## Monitoring
+
+The proxy exposes Prometheus metrics at `GET /metrics`. All metric names are prefixed with `proxy_`.
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `proxy_cache_hits_total` | counter | `ecosystem` | Cache hits |
+| `proxy_cache_misses_total` | counter | `ecosystem` | Cache misses |
+| `proxy_cache_size_bytes` | gauge | | Total size of cached artifacts |
+| `proxy_cached_artifacts_total` | gauge | | Number of cached artifacts |
+| `proxy_upstream_fetch_duration_seconds` | histogram | `ecosystem` | Time spent fetching from upstream |
+| `proxy_upstream_errors_total` | counter | `ecosystem`, `error_type` | Upstream fetch failures |
+| `proxy_storage_operation_duration_seconds` | histogram | `operation` | Storage read/write latency |
+| `proxy_storage_errors_total` | counter | `operation` | Storage read/write failures |
+| `proxy_active_requests` | gauge | | In-flight requests |
+
+Cache size and artifact count are refreshed every 60 seconds. The remaining metrics update on each request.
+
+Scrape config for Prometheus:
+
+```yaml
+scrape_configs:
+  - job_name: git-pkgs-proxy
+    static_configs:
+      - targets: ["localhost:8080"]
+```
+
 ## Production Deployment
 
 ### Systemd Service
diff --git a/docs/architecture.md b/docs/architecture.md
index be9a6a6..8b207bd 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -7,29 +7,24 @@ This document describes the internal architecture of the git-pkgs proxy.
 The proxy is a caching HTTP server that sits between package manager clients and upstream registries. It intercepts requests, checks a local cache, and either serves cached content or fetches from upstream.
 
 ```
-┌─────────────────────────────────────────────────────────────────┐
-│                           HTTP Server                           │
-│  ┌─────────────────────────────────────────────────────────┐    │
-│  │                    Router (ServeMux)                    │    │
-│  │  /npm/*    -> NPMHandler                                │    │
-│  │  /cargo/*  -> CargoHandler                              │    │
-│  │  /health   -> healthHandler                             │    │
-│  │  /stats    -> statsHandler                              │    │
-│  └─────────────────────────────────────────────────────────┘    │
-│                                │                                │
-│                                ▼                                │
-│  ┌─────────────────────────────────────────────────────────┐    │
-│  │                          Proxy                          │    │
-│  │  - GetOrFetchArtifact()                                 │    │
-│  │  - Coordinates DB, Storage, Fetcher                     │    │
-│  └─────────────────────────────────────────────────────────┘    │
-│        │                   │                    │               │
-│        ▼                   ▼                    ▼               │
+┌──────────────────────────────────────────────────────────────────┐
+│                           HTTP Server                            │
+│  ┌──────────────────────────────────────────────────────────┐    │
+│  │                       Router (Chi)                       │    │
+│  │  /npm/*    -> NPMHandler    /health   -> healthHandler   │    │
+│  │  /cargo/*  -> CargoHandler  /stats    -> statsHandler    │    │
+│  │  /gem/*    -> GemHandler    /metrics  -> prometheus      │    │
+│  │  ...16 ecosystems           /api/*    -> APIHandler      │    │
+│  │                             /         -> Web UI          │    │
+│  └──────────────────────────────────────────────────────────┘    │
+│        │                   │                    │                │
+│        ▼                   ▼                    ▼                │
 │  ┌───────────┐      ┌─────────────┐      ┌─────────────┐         │
-│  │ Database  │      │   Storage   │      │  Upstream   │        │
-│  │ (SQLite)  │      │ (Filesystem)│      │  (Fetcher)  │        │
+│  │ Database  │      │   Storage   │      │  Upstream   │         │
+│  │ SQLite or │      │ Filesystem  │      │ Registries  │         │
+│  │ Postgres  │      │    or S3    │      │  (Fetcher)  │         │
 │  └───────────┘      └─────────────┘      └─────────────┘         │
-└─────────────────────────────────────────────────────────────────┘
+└──────────────────────────────────────────────────────────────────┘
 ```
 
 ## Request Flow
@@ -91,29 +86,87 @@ Metadata is not cached - always fetched fresh. This ensures clients see new vers
 
 ### `internal/database`
 
-SQLite database for cache metadata. Uses `modernc.org/sqlite` (pure Go, no CGO).
+SQLite or PostgreSQL database for cache metadata. SQLite uses `modernc.org/sqlite` (pure Go, no CGO). PostgreSQL uses `lib/pq`.
+
+The schema is compatible with [git-pkgs](https://github.com/git-pkgs) databases. The proxy adds the `artifacts` and `vulnerabilities` tables on top of the shared `packages` and `versions` tables, so both tools can point at the same database.
 
 **Tables:**
 
 ```sql
 packages (
-  id, purl, ecosystem, name, namespace, latest_version,
-  license, description, homepage, repository_url, upstream_url,
-  metadata_fetched_at, created_at, updated_at
+  id INTEGER PRIMARY KEY, -- SERIAL on Postgres
+  purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash
+  ecosystem TEXT NOT NULL,
+  name TEXT NOT NULL,
+  latest_version TEXT,
+  license TEXT,
+  description TEXT,
+  homepage TEXT,
+  repository_url TEXT,
+  registry_url TEXT,
+  supplier_name TEXT,
+  supplier_type TEXT,
+  source TEXT,
+  enriched_at DATETIME,
+  vulns_synced_at DATETIME,
+  created_at DATETIME,
+  updated_at DATETIME
 )
+-- indexes: purl (unique), (ecosystem, name)
 
 versions (
-  id, purl, package_id, version, license, integrity,
-  published_at, yanked, metadata_fetched_at, created_at, updated_at
+  id INTEGER PRIMARY KEY,
+  purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash@4.17.21
+  package_purl TEXT NOT NULL, -- FK to packages.purl
+  license TEXT,
+  published_at DATETIME,
+  integrity TEXT, -- subresource integrity hash
+  yanked INTEGER DEFAULT 0, -- BOOLEAN on Postgres
+  source TEXT,
+  enriched_at DATETIME,
+  created_at DATETIME,
+  updated_at DATETIME
 )
+-- indexes: purl (unique), package_purl
 
 artifacts (
-  id, version_id, filename, upstream_url, storage_path,
-  content_hash, size, content_type, fetched_at,
-  hit_count, last_accessed_at, created_at, updated_at
+  id INTEGER PRIMARY KEY,
+  version_purl TEXT NOT NULL,
+  filename TEXT NOT NULL,
+  upstream_url TEXT NOT NULL,
+  storage_path TEXT, -- null until cached
+  content_hash TEXT, -- SHA-256
+  size INTEGER, -- BIGINT on Postgres
+  content_type TEXT,
+  fetched_at DATETIME,
+  hit_count INTEGER DEFAULT 0, -- BIGINT on Postgres
+  last_accessed_at DATETIME,
+  created_at DATETIME,
+  updated_at DATETIME
+)
+-- indexes: (version_purl, filename) unique, storage_path, last_accessed_at
+
+vulnerabilities (
+  id INTEGER PRIMARY KEY,
+  vuln_id TEXT NOT NULL, -- e.g. CVE-2021-1234
+  ecosystem TEXT NOT NULL,
+  package_name TEXT NOT NULL,
+  severity TEXT,
+  summary TEXT,
+  fixed_version TEXT,
+  cvss_score REAL,
+  "references" TEXT, -- JSON array
+  fetched_at DATETIME,
+  created_at DATETIME,
+  updated_at DATETIME
 )
+-- indexes: (vuln_id, ecosystem, package_name) unique, (ecosystem, package_name)
 ```
 
+On PostgreSQL, `INTEGER PRIMARY KEY` becomes `SERIAL`, `DATETIME` becomes `TIMESTAMP`, `INTEGER DEFAULT 0` booleans become `BOOLEAN DEFAULT FALSE`, and size/count columns use `BIGINT`.
+
+The `MigrateSchema()` function handles backward compatibility with older git-pkgs databases by adding missing columns via `ALTER TABLE` as needed.
+
 **Key operations:**
 - `GetPackageByPURL()` - Look up package by PURL
 - `GetVersionByPURL()` - Look up version by PURL
@@ -121,6 +174,7 @@ artifacts (
 - `UpsertPackage/Version/Artifact()` - Insert or update records
 - `RecordArtifactHit()` - Increment hit counter, update access time
 - `GetLeastRecentlyUsedArtifacts()` - For cache eviction
+- `SearchPackages()` - Full-text search across cached packages
 
 ### `internal/storage`
 
@@ -201,12 +255,27 @@ HTTP protocol handlers for each registry type.
 
 ### `internal/server`
 
-HTTP server setup.
+HTTP server setup, web UI, and API handlers.
 
 - Creates and wires together all components
-- Mounts handlers at appropriate paths
-- Adds logging middleware
-- Health and stats endpoints
+- Mounts protocol handlers at ecosystem-specific paths
+- Middleware: request ID, real IP, logging, panic recovery, active request tracking
+- Web UI: dashboard, package browser, source browser, version comparison
+- Templates are embedded in the binary via `//go:embed`
+- Enrichment API for package metadata, vulnerability scanning, and outdated detection
+- Health, stats, and Prometheus metrics endpoints
+
+### `internal/metrics`
+
+Prometheus metrics for cache performance, upstream latency, storage operations, and active requests. See the Monitoring section of the README for the full metric list.
+
+### `internal/cooldown`
+
+Version age filtering for supply chain attack mitigation. Configurable at global, ecosystem, and per-package levels. Supported by npm, PyPI, pub.dev, and Composer handlers.
+
+### `internal/enrichment`
+
+Package metadata enrichment. Fetches license, description, homepage, repository URL, and vulnerability data from upstream registries. Powers the `/api/` endpoints and the web UI's package detail pages.
 
 ### `internal/config`