SPDX-FileCopyrightText: 2026-present Arthit Suriyawongkul
SPDX-FileType: DOCUMENTATION
SPDX-License-Identifier: CC0-1.0

Pitloom SBOM generator - implementation summary

Project overview

Pitloom is a complete, working prototype of an SBOM (Software Bill of Materials) generator for Python projects. It supports the Hatchling, Poetry, and setuptools build backends and produces SPDX 3.0 compliant SBOMs in JSON-LD format.

What was delivered

✅ Core functionality

  1. SPDX 3.0 data models (spdx-python-model)

    • Fully migrated to the official spdx-python-model library
    • Proper JSON-LD serialization and validation
    • Deterministic UUIDv5 SPDX document IDs (compute_doc_uuid) keyed on the project name, version, normalized dependencies, and the SHA-256 Merkle root of the wheel files
    • Per-element sequential IDs (generate_spdx_id) that are reproducible across builds
  2. Metadata extraction (src/pitloom/extract/)

    • pyproject.py -- reads pyproject.toml; supports PEP 621 [project], Poetry [tool.poetry] (fallback when [project] is absent), and merging of both when both sections are present ([project] wins field-by-field)
    • poetry.py -- extracts metadata from [tool.poetry] and [tool.poetry.dependencies]; converts Poetry version specifiers (^, ~, bare versions) to PEP 440; [tool.poetry.group.*] dev/deploy dependency groups are intentionally excluded from the SBOM
    • setuptools.py -- reads setup.cfg and setup.py for setuptools projects; detect_build_backend() auto-selects the right extractor; merge_metadata() fills gaps across sources (setup.cfg takes precedence over setup.py)
    • Extracts project metadata (name, version, description, authors, URLs)
    • Handles dynamic versions from __about__.py
    • Parses dependency specifications with version constraints
    • Returns (ProjectMetadata, PitloomConfig) tuple
  3. SPDX 3 exporter (src/pitloom/export/spdx3_json.py)

    • JSON-LD output using official bindings and SHACLObjectSet
    • Clean API for building SPDX documents and adding elements
    • Graceful component ingestion via spdx3.JSONLDDeserializer
  4. SBOM generator (src/pitloom/assemble/)

    • generate_sbom() orchestrates the full pipeline
    • Builds DocumentModel from extracted metadata
    • Passes DocumentModel to build() assembler in assemble/spdx3/
    • Merges pre-generated SBOM fragments
    • Generates copyright information from metadata
  5. Hatchling build hook (src/pitloom/plugins/hatch.py)

    • PitloomBuildHook registered via pluggy entry point ([project.entry-points."hatch"])
    • Generates SBOM in initialize(), stages to a TemporaryDirectory
    • Appends staged path to build_data["sbom_files"] -- Hatchling 1.28.0+ places it at .dist-info/sboms/<filename> (PEP 770) natively
    • finalize() cleans up the staging directory
    • Config: sbom-basename, creator-name, creator-email, fragments, enabled
  6. Command-line interface (src/pitloom/__main__.py)

    • User-friendly argparse-based CLI
    • Default output filename derived from project metadata ({name}-{version}.spdx3.json) or [tool.pitloom] sbom-basename when set
    • Creator information options
    • Clear error messages
  7. Metadata provenance tracking (src/pitloom/extract/pyproject.py, src/pitloom/loom.py)

    • Tracks source of each metadata field
    • Records extraction method (static, dynamic, or inferred)
    • Supports dynamic introspection via loom.py inspection
    • Uses SPDX 3 comment attribute
    • See docs/design/metadata-provenance.md
  8. ML tracking SDK (src/pitloom/loom.py)

    • Dual-syntax ContextDecorator (@loom.run and with loom.run)
    • Emits SPDX 3 SBOM fragments automatically during ML executions
    • Seamlessly ingested into project SBOMs using [tool.pitloom.fragments] config
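
The deterministic document ID described in item 1 can be illustrated with a minimal sketch. It assumes a simple pairwise SHA-256 Merkle tree over sorted leaf digests and a made-up namespace UUID. The names merkle_root, doc_uuid_sketch, and SKETCH_NAMESPACE are hypothetical, and this is not Pitloom's actual compute_doc_uuid implementation.

```python
import hashlib
import uuid

# Hypothetical namespace used only in this sketch; the namespace Pitloom
# actually uses is not stated in this document.
SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/sbom-sketch")


def merkle_root(leaf_digests: list[bytes]) -> bytes:
    """Fold SHA-256 leaf digests pairwise into a single root digest."""
    if not leaf_digests:
        return hashlib.sha256(b"").digest()
    level = sorted(leaf_digests)  # sorting makes the result order-independent
    while len(level) > 1:
        if len(level) % 2:  # odd number of nodes: duplicate the last one
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def doc_uuid_sketch(
    name: str, version: str, deps: list[str], files: dict[str, bytes]
) -> uuid.UUID:
    """Derive a UUIDv5 from name, version, normalized deps, and file contents."""
    leaves = [
        hashlib.sha256(path.encode() + b"\0" + data).digest()
        for path, data in files.items()
    ]
    normalized = sorted(d.strip().lower() for d in deps)
    key = "\n".join([name.lower(), version, *normalized, merkle_root(leaves).hex()])
    return uuid.uuid5(SKETCH_NAMESPACE, key)
```

Because both the dependency list and the leaf digests are sorted before hashing, the same project contents give the same ID regardless of input order, and changing any file changes the ID.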

✅ Testing (comprehensive coverage - all passing)

  1. Model & provenance tests

    • SPDX ID generation
    • CreationMetadata serialization and provenance tracking
    • spdx-python-model validation
  2. Metadata extraction tests

    • Basic metadata extraction and generic fragment paths
    • Error handling for missing files
    • Dynamic and build-time version extraction via importlib.metadata
  3. Generator integration tests

    • End-to-end SBOM generation
    • Generic fragment merging via deserialization
  4. SDK tracker tests

    • test_loom.py verifies both decorator and context-manager tracking
    • Asserts caller-inspection relative path generation
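
The dual decorator and context-manager syntax exercised by the SDK tracker tests rests on the standard library's contextlib.ContextDecorator. The sketch below shows the mechanism only: the run class here is a hypothetical stand-in that records events in a list, whereas the real loom.run emits SPDX 3 fragments, and its constructor arguments are not documented in this summary.

```python
from contextlib import ContextDecorator


class run(ContextDecorator):
    """Hypothetical stand-in for loom.run; records enter/exit events."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self.log = log

    def __enter__(self) -> "run":
        self.log.append(f"start:{self.name}")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.log.append(f"end:{self.name}")
        return False  # do not swallow exceptions raised inside the block


events: list[str] = []


@run("train", events)  # decorator form
def train() -> int:
    return 42


result = train()

with run("evaluate", events):  # context-manager form
    pass
```

Both forms go through the same __enter__ and __exit__ methods, so a single implementation covers both usages.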

✅ Quality assurance

  • Linting: pylint 10.00/10, flake8 clean, ruff clean
  • Type checking: mypy -- no issues across all source files
  • Type hints: Comprehensive type annotations throughout
  • Documentation: Inline docstrings for all public APIs

✅ Documentation

  1. README.md: Complete usage guide with examples
  2. docs/implementation/demo.md: Prototype capabilities and validation
  3. docs/implementation/demo-provenance.md: Provenance tracking demo
  4. docs/design/format-neutral-representation.md: Multi-format support plan
  5. docs/design/metadata-provenance.md: Provenance tracking specification
  6. docs/design/metadata-sources.md: Metadata sources research and integration plan
  7. docs/implementation/setuptools-support.md: Setuptools extractor design and limitations
  8. Inline documentation: Comprehensive docstrings

Validation with sentimentdemo

Pitloom generated an SPDX 3 SBOM for the sentimentdemo reference repository:

$ loom /tmp/sentimentdemo -o sbom.spdx3.json
Generating SBOM for project in: /tmp/sentimentdemo
SBOM written to: sbom.spdx3.json

Generated SBOM structure

  • Total Elements: 13
  • CreationInfo: 1 (with timestamp and creator)
  • Person: 1 (creator information)
  • SpdxDocument: 1 (root document)
  • software_Sbom: 1 (SBOM declaration)
  • software_Package: 5 (main package + 4 dependencies)
  • Relationship: 4 (dependsOn relationships)
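
A per-type tally like the one above can be reproduced with a short script. This is a sketch that assumes the JSON-LD output keeps its elements in a top-level "@graph" array and that each element carries a "type" (or "@type") key. The sample graph is a hand-built placeholder that only mirrors the counts listed above; it is not the real sentimentdemo output. The helper name count_element_types is hypothetical.

```python
from collections import Counter
from typing import Any


def count_element_types(document: dict[str, Any]) -> Counter[str]:
    """Count the elements of a JSON-LD document's @graph by their type."""
    graph = document.get("@graph", [])
    return Counter(el.get("type") or el.get("@type", "unknown") for el in graph)


# Placeholder graph mirroring the counts listed above (not real output).
sample = {
    "@graph": [
        {"type": "CreationInfo"},
        {"type": "Person"},
        {"type": "SpdxDocument"},
        {"type": "software_Sbom"},
        *[{"type": "software_Package"}] * 5,
        *[{"type": "Relationship"}] * 4,
    ]
}

counts = count_element_types(sample)
total = sum(counts.values())
```

For a real file, the dictionary passed in would come from json.load() on the generated SBOM.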

Captured information

Main package:

Dependencies (all captured correctly):

  • fasttext: 0.9.3
  • newmm-tokenizer: 0.2.2
  • numpy: 1.26.4
  • th-simple-preprocessor: 0.10.1

Technical achievements

1. Clean architecture

This tree is the canonical reference; README.md and design docs point here.

pitloom/
├── docs/
│   ├── design/
│   │   ├── architecture-overview.md
│   │   ├── format-neutral-representation.md
│   │   ├── hatchling-build-hook.md
│   │   ├── metadata-provenance.md
│   │   ├── metadata-sources.md
│   │   ├── mlflow-extractor.md
│   │   ├── model-metadata-extraction.md
│   │   ├── protobom-evaluation.md
│   │   ├── roadmap.md             # Canonical roadmap
│   │   ├── sbom-enrichment.md
│   │   └── sbom-fragments.md
│   ├── implementation/
│   │   ├── demo.md
│   │   ├── demo-provenance.md
│   │   ├── setuptools-support.md  # Setuptools extractor design and limitations
│   │   └── summary.md             # this file; canonical project structure
│   ├── mascot.png
│   └── resources.md
├── src/
│   └── pitloom/
│       ├── assemble/            # Layers 2+3 -- build DocumentModel + map to spec
│       │   ├── spdx3/           # SPDX 3 specific (future: spdx23, cyclonedx)
│       │   │   ├── ai.py        # AI model element assembly
│       │   │   ├── dataset.py   # Dataset element assembly
│       │   │   ├── deps.py      # Dependency element assembly
│       │   │   ├── document.py  # build(DocumentModel) -> Spdx3JsonExporter
│       │   │   ├── fragments.py # Fragment merging
│       │   │   └── __init__.py
│       │   └── __init__.py      # generate_sbom() orchestrator + backend routing
│       ├── core/                # Format-neutral data models (no SBOM lib deps)
│       │   ├── ai_metadata.py      # AiModelMetadata, ModelFormat
│       │   ├── config.py           # PitloomConfig ([tool.pitloom] settings)
│       │   ├── creation.py         # CreationMetadata (creator / timestamp)
│       │   ├── dataset_metadata.py # DatasetMetadata
│       │   ├── document.py         # DocumentModel (assembled, pre-serialization)
│       │   ├── models.py           # Deterministic UUIDs, Merkle root, SPDX ID generation
│       │   └── project.py          # ProjectMetadata, ProjectFile
│       ├── export/              # Layer 4 -- serialise to physical format
│       │   └── spdx3_json.py    # SPDX 3 JSON-LD serialiser
│       ├── extract/             # Layer 1 -- read from sources
│       │   ├── ai_model.py         # AI model dispatcher + format detection
│       │   ├── _croissant.py       # Croissant metadata parser
│       │   ├── _croissant_keys.py  # Croissant JSON-LD key constants
│       │   ├── _extract_utils.py   # Shared extraction utilities
│       │   ├── _fasttext.py        # fastText (.ftz, .bin)
│       │   ├── _gguf.py            # GGUF (.gguf)
│       │   ├── _hdf5.py            # HDF5 / Keras v1–v2 (.h5, .hdf5)
│       │   ├── _keras.py           # Keras v3 (.keras)
│       │   ├── _numpy.py           # NumPy (.npy, .npz)
│       │   ├── _onnx.py            # ONNX (.onnx)
│       │   ├── _pytorch.py         # PyTorch classic (.pt, .pth)
│       │   ├── _pytorch_pt2.py     # PyTorch PT2 / ExecuTorch (.pt2)
│       │   ├── _safetensors.py     # Safetensors (.safetensors)
│       │   ├── dataset.py          # Dataset metadata extraction (Croissant)
│       │   ├── poetry.py           # [tool.poetry] extractor; Poetry -> PEP 440 conversion
│       │   ├── pyproject.py        # pyproject.toml extractor ([project] + [tool.poetry] merge)
│       │   ├── scanner.py          # Heuristic scanner for AI model files
│       │   └── setuptools.py       # setup.cfg + setup.py extractor; backend detection; merge
│       ├── plugins/             # Build-system integrations
│       │   └── hatch.py         # Hatchling BuildHookInterface (PEP 770)
│       ├── __about__.py         # Package version (__version__)
│       ├── __init__.py
│       ├── __main__.py          # CLI entry point (loom / python -m pitloom)
│       ├── loom.py              # ML tracking SDK (Run context manager / decorator)
│       └── py.typed             # PEP 561 marker
├── tests/
│   ├── fixtures/
│   │   ├── croissant/           # Croissant dataset metadata fixtures
│   │   ├── fasttext/            # fastText model fixtures
│   │   ├── fragments/           # Pre-generated SPDX 3 fragment fixtures
│   │   ├── gguf/                # GGUF model fixtures
│   │   ├── hdf5/                # HDF5 / Keras model fixtures
│   │   ├── keras/               # Keras v3 model fixtures
│   │   ├── numpy/               # NumPy array fixtures
│   │   ├── onnx/                # ONNX model fixtures
│   │   ├── pytorch/             # PyTorch classic model fixtures
│   │   ├── pytorch_pt2/         # PyTorch PT2 / ExecuTorch fixtures
│   │   ├── safetensors/         # Safetensors model fixtures
│   │   ├── sampleproject-hatchling/   # Minimal Hatchling wheel-build fixture
│   │   ├── sampleproject-poetry/      # Real-world Poetry fixture (mistral-inference)
│   │   ├── sampleproject-setuptools/  # Minimal setuptools metadata fixture
│   │   ├── sentimentdemo-handcrafted.spdx3.json
│   │   └── README.md
│   ├── conftest.py
│   ├── test_dataset_metadata.py
│   ├── test_extract_ai_model.py
│   ├── test_extract_croissant.py
│   ├── test_extract_fasttext.py
│   ├── test_extract_gguf.py
│   ├── test_extract_hdf5.py
│   ├── test_extract_keras.py
│   ├── test_extract_numpy.py
│   ├── test_extract_onnx.py
│   ├── test_extract_pytorch.py
│   ├── test_extract_pytorch_pt2.py
│   ├── test_extract_safetensors.py
│   ├── test_fragments.py
│   ├── test_generator.py
│   ├── test_hatch_hook.py
│   ├── test_jcs.py
│   ├── test_loom.py
│   ├── test_main_cli.py
│   ├── test_metadata.py
│   ├── test_models.py
│   ├── test_provenance.py
│   ├── test_poetry.py
│   ├── test_setuptools.py
│   ├── test_spdx3_compliance.py
│   ├── test_spdx3_dataset.py
│   └── test_wheel_integration.py
├── AGENTS.md
├── CHANGELOG.md
├── CITATION.cff
├── LICENSE
├── README.md
├── codemeta.json
└── pyproject.toml               # Project config and Hatchling build settings
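
The Poetry-to-PEP 440 conversion attributed to extract/poetry.py in the tree above (caret, tilde, and bare versions) can be sketched as follows. This follows Poetry's documented caret and tilde semantics for plain numeric versions only; pre-release suffixes, wildcards, and multi-clause constraints are not handled. The function name poetry_to_pep440 is hypothetical and is not taken from the Pitloom source.

```python
def _upper_bound(parts: list[int], idx: int) -> str:
    """Return the exclusive upper bound obtained by bumping parts[idx]."""
    bumped = parts[:idx] + [parts[idx] + 1]
    return ".".join(str(p) for p in bumped)


def poetry_to_pep440(spec: str) -> str:
    """Convert a simple Poetry version constraint to a PEP 440 specifier."""
    spec = spec.strip()
    if spec.startswith("^"):
        version = spec[1:]
        parts = [int(p) for p in version.split(".")]
        # Caret: bump the leftmost non-zero component; if every component
        # is zero, bump the last one that was given.
        idx = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
        return f">={version},<{_upper_bound(parts, idx)}"
    if spec.startswith("~") and not spec.startswith("~="):
        version = spec[1:]
        parts = [int(p) for p in version.split(".")]
        # Tilde: bump the minor component when present, else the major one.
        idx = 1 if len(parts) >= 2 else 0
        return f">={version},<{_upper_bound(parts, idx)}"
    if spec and spec[0].isdigit():
        return f"=={spec}"  # a bare version is an exact pin
    return spec  # already PEP 440 style (e.g. ">=1.0,<2.0"); pass through
```

For example, "^0.2.3" becomes ">=0.2.3,<0.3" because the leftmost non-zero component is the minor version, while "~1.2.3" becomes ">=1.2.3,<1.3".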

2. Extensible design

  • Easy to add new extractors (PDM, Flit, etc.)
  • Easy to add new assemblers/exporters (CycloneDX, AIDOC, etc.) consuming the same DocumentModel -- no changes to extractors needed
  • Clean separation of concerns: extractors -> DocumentModel -> serializers
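
The separation between extractors, the document model, and serializers can be illustrated with a small sketch. SketchDocument is a hypothetical, much-reduced stand-in for Pitloom's DocumentModel, and the Exporter protocol and both exporter classes are invented for illustration; they do not reflect the real class names or signatures.

```python
import json
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SketchDocument:
    """Hypothetical, reduced stand-in for a format-neutral document model."""

    name: str
    version: str
    dependencies: list[str] = field(default_factory=list)


class Exporter(Protocol):
    """Anything that can serialize a SketchDocument to text."""

    def export(self, doc: SketchDocument) -> str: ...


class JsonExporter:
    def export(self, doc: SketchDocument) -> str:
        payload = {"name": doc.name, "version": doc.version, "deps": doc.dependencies}
        return json.dumps(payload, sort_keys=True)


class TextExporter:
    def export(self, doc: SketchDocument) -> str:
        lines = [f"{doc.name} {doc.version}", *(f"  - {d}" for d in doc.dependencies)]
        return "\n".join(lines)


def render(doc: SketchDocument, exporter: Exporter) -> str:
    """Serialize the same model with whichever exporter is supplied."""
    return exporter.export(doc)


doc = SketchDocument("demo", "1.0", ["numpy", "fasttext"])
as_json = render(doc, JsonExporter())
as_text = render(doc, TextExporter())
```

Adding a new output format in this arrangement means writing one more class with an export method; the code that builds the model is untouched.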

3. Best practices

  • src-layout for proper package structure
  • Type hints with Python 3.10+ compatibility
  • Comprehensive error handling
  • Runtime dependencies kept minimal and declared in pyproject.toml

Comparison with reference SBOM

Feature Reference SBOM Pitloom Generated Status
SPDX 3.0 Structure ✅ Complete
Package Metadata ✅ Complete
Dependencies ✅ Complete
Relationships ✅ Complete
File-level Details ⚠️ 🔄 Roadmap
AI/Dataset Profiles ✅ Complete
License Expressions ⚠️ 🔄 Roadmap

Legend:

  • ✅ Complete: Fully implemented
  • ⚠️ Basic: Core functionality present, enhancements planned
  • 🔄 Roadmap: Planned for future releases

Roadmap

See docs/design/roadmap.md for the canonical, up-to-date roadmap.