This guide explains how to use SRC2PURL to identify package coordinates (PURLs) from source code directories.
- Installation
- Quick Start
- Discovery Strategy
- Command Line Usage
- Python API Usage
- API Authentication
- Performance Optimization
- Troubleshooting
# Install from PyPI
pip install src2purl

# Or install from source
git clone https://github.com/SemClone/src2purl.git
cd src2purl
pip install -e .

SRC2PURL requires Python 3.8 or higher. All dependencies are installed automatically by pip.
# Identify package from source directory
src2purl /path/to/source/code
# With Software Heritage archive (slower but more comprehensive)
src2purl /path/to/source --use-swh
# High confidence matches only
src2purl /path/to/source --confidence-threshold 0.85

Package: requests
Version: 2.28.0
PURL: pkg:pypi/requests@2.28.0
License: Apache-2.0
Confidence: 95%
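The PURL line in this output follows the package URL layout `pkg:type/namespace/name@version`. As a minimal sketch of how a downstream script might split such a string, the snippet below uses only the standard library. `ParsedPurl` and `parse_purl` are hypothetical names and are not part of SRC2PURL. The parser is deliberately simplified: qualifiers and subpaths are discarded.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class ParsedPurl(NamedTuple):
    """Main components of a package URL (hypothetical helper type)."""
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str]


def parse_purl(purl: str) -> ParsedPurl:
    """Split a PURL into type, namespace, name, and version (simplified)."""
    scheme, sep, remainder = purl.partition(":")
    if scheme != "pkg" or not sep:
        raise ValueError(f"not a package URL: {purl!r}")
    # Drop the optional '#subpath' and '?qualifiers' parts, then stray slashes.
    remainder = remainder.split("#", 1)[0].split("?", 1)[0].strip("/")
    pkg_type, sep, path = remainder.partition("/")
    if not sep or not path:
        raise ValueError(f"missing package name: {purl!r}")
    version = None
    if "@" in path:
        # A literal '@' inside a namespace is percent-encoded, so the last
        # '@' always separates the version.
        path, version = path.rsplit("@", 1)
    namespace, _, name = path.rpartition("/")
    return ParsedPurl(
        type=pkg_type.lower(),
        namespace=unquote(namespace) or None,
        name=unquote(name),
        version=unquote(version) if version else None,
    )


print(parse_purl("pkg:pypi/requests@2.28.0"))
```

For production use, a dedicated PURL library is likely a better choice than this sketch, since the full specification has per-type normalization rules that are not handled here.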
SRC2PURL uses a two-phase discovery approach:
- SWHID Generation: Creates Software Heritage ID from directory contents
- Repository Search: Queries GitHub and SCANOSS APIs
- Software Heritage (optional): Deep provenance analysis with --use-swh
- Manifest Parsing: Extracts metadata from package manifests
- Cross-validation: Validates Phase 1 findings
- Metadata Enhancement: Enriches results with additional information
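To make the SWHID generation step more concrete, the sketch below computes Software Heritage identifiers with the standard library only. It relies on the fact that SWHIDs for content and directory objects use Git-compatible hashes (blob and tree objects). The function names are hypothetical, and this is not SRC2PURL's actual implementation: the tool may, for example, exclude version-control directories or treat file modes differently.

```python
import hashlib
import os
from pathlib import Path


def _git_object_hash(kind: str, payload: bytes) -> bytes:
    """SHA-1 over a Git-style header ('<kind> <size>' + NUL) plus the payload."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).digest()


def content_swhid(data: bytes) -> str:
    """SWHID of a single file's content (same hash as a Git blob)."""
    return "swh:1:cnt:" + _git_object_hash("blob", data).hex()


def _tree_hash(directory: Path) -> bytes:
    entries = []
    for child in directory.iterdir():
        name = os.fsencode(child.name)
        if child.is_symlink():
            # A symlink is hashed as a blob holding its target path.
            target = os.fsencode(os.readlink(child))
            entries.append((name, b"120000", _git_object_hash("blob", target)))
        elif child.is_dir():
            # Git sorts directories as if their name ended with '/'.
            entries.append((name + b"/", b"40000", _tree_hash(child)))
        else:
            mode = b"100755" if os.access(child, os.X_OK) else b"100644"
            digest = _git_object_hash("blob", child.read_bytes())
            entries.append((name, mode, digest))
    entries.sort(key=lambda entry: entry[0])
    payload = b"".join(
        mode + b" " + sort_name.rstrip(b"/") + b"\0" + digest
        for sort_name, mode, digest in entries
    )
    return _git_object_hash("tree", payload)


def directory_swhid(root) -> str:
    """SWHID of a directory tree (same hash as a Git tree object)."""
    return "swh:1:dir:" + _tree_hash(Path(root)).hex()


print(content_swhid(b"hello world\n"))
```

Because the identifier depends only on file names, modes, and contents, two checkouts of the same source tree yield the same `swh:1:dir:` value regardless of where they live on disk, which is what makes it usable as a lookup key.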
| Mode | Time | Accuracy | Use Case |
|---|---|---|---|
| Default (Fast) | 5-15 seconds | High | Most projects |
| With SWH | 90+ seconds | Very High | Security audits |
# Standard identification
src2purl /path/to/project
# With verbose output
src2purl /path/to/project --verbose
# JSON output format
src2purl /path/to/project --output-format json
# Save results to file
src2purl /path/to/project -o results.json

# Set confidence threshold
src2purl /path/to/project --confidence-threshold 0.85
# Detect subcomponents in monorepos
src2purl /path/to/project --detect-subcomponents
# Control scanning depth
src2purl /path/to/project --max-depth 2
# Skip license detection (faster)
src2purl /path/to/project --no-license-detection
# Clear cache
src2purl --clear-cache

# Enable Software Heritage archive checking
src2purl /path/to/project --use-swh
# With API token for better rate limits
export SWH_API_TOKEN=your_token
src2purl /path/to/project --use-swh
# Validate SWHID
src2purl-validate /path/to/directory

from src2purl import identify_package
# Basic identification
result = identify_package("/path/to/source")
print(f"Package: {result.name}@{result.version}")
print(f"PURL: {result.purl}")
print(f"License: {result.license}")
print(f"Confidence: {result.confidence:.0%}")

from src2purl import identify_package
# With options
result = identify_package(
    path="/path/to/source",
    use_swh=True,               # Enable Software Heritage
    confidence_threshold=0.85,  # High confidence only
    detect_subcomponents=True,  # Find monorepo components
    verbose=True                # Detailed logging
)
# Access detailed results
if result.subcomponents:
    for component in result.subcomponents:
        print(f"  - {component.name}: {component.purl}")
# Check discovery methods used
for method in result.discovery_methods:
    print(f"Discovery method: {method}")

from src2purl import identify_package
import json
projects = [
    "/path/to/project1",
    "/path/to/project2",
    "/path/to/project3"
]
results = []
for project_path in projects:
    result = identify_package(project_path)
    results.append({
        "path": project_path,
        "purl": result.purl,
        "confidence": result.confidence
    })
# Save results
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)

The GitHub token provides the most value and is free to obtain:
- Go to https://github.com/settings/tokens
- Generate a new token (no special permissions needed)
- Set the environment variable:
export GITHUB_TOKEN=your_github_personal_access_token

Benefits:
- Rate limit increases from 60 to 5,000 requests/hour on the GitHub REST API
- More accurate repository identification
- Better search results
export SCANOSS_API_KEY=your_scanoss_key

Register at https://www.scanoss.com for a free API key.

export SWH_API_TOKEN=your_swh_token

Register at https://archive.softwareheritage.org/api/
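Before a long run, it can help to confirm which of the three credentials are visible to the process. The sketch below checks the environment variables named in this section. `token_status` is a hypothetical helper, not something SRC2PURL provides.

```python
import os
from typing import Dict, Mapping, Optional

# Environment variables described in this guide.
TOKEN_VARIABLES = ("GITHUB_TOKEN", "SCANOSS_API_KEY", "SWH_API_TOKEN")


def token_status(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which credentials are set to a non-empty value."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name, "").strip()) for name in TOKEN_VARIABLES}


for name, present in token_status().items():
    print(f"{name}: {'set' if present else 'not set'}")
```

The check only tests for presence. It does not verify that a token is valid or unexpired.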
SRC2PURL caches API responses to improve performance:
# Default cache location: ~/.cache/src2purl
src2purl /path/to/project
# Disable cache
src2purl /path/to/project --no-cache
# Clear cache
src2purl --clear-cache

- Use GitHub token: Dramatically improves API rate limits
- Avoid --use-swh for speed: Only use when comprehensive analysis needed
- Skip license detection: Use --no-license-detection for faster scans
- Limit depth: Use --max-depth 1 for shallow scans
- Cache results: Let caching work for repeated scans
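When scans are launched from a script, the speed-related flags from these tips can be assembled in one place. The following is a minimal sketch: `build_scan_command` is a hypothetical helper, and it emits only flags documented in this guide.

```python
import shlex
from typing import List, Optional


def build_scan_command(
    path: str,
    *,
    use_swh: bool = False,
    license_detection: bool = True,
    max_depth: Optional[int] = None,
    use_cache: bool = True,
    confidence_threshold: Optional[float] = None,
) -> List[str]:
    """Build an argument list for the src2purl CLI."""
    command = ["src2purl", str(path)]
    if use_swh:
        command.append("--use-swh")
    if not license_detection:
        command.append("--no-license-detection")
    if max_depth is not None:
        command += ["--max-depth", str(max_depth)]
    if not use_cache:
        command.append("--no-cache")
    if confidence_threshold is not None:
        command += ["--confidence-threshold", str(confidence_threshold)]
    return command


# A fast, shallow scan: no license detection, depth limited to 1.
fast_scan = build_scan_command("/path/to/project", license_detection=False, max_depth=1)
print(shlex.join(fast_scan))
```

The returned list can be passed directly to `subprocess.run(fast_scan, check=True)`, which avoids shell quoting problems with unusual paths.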
| Project Size | Default Mode | With SWH |
|---|---|---|
| Small (100 files) | 5-8 seconds | 90+ seconds |
| Medium (1000 files) | 10-15 seconds | 120+ seconds |
| Large (5000+ files) | 15-25 seconds | 180+ seconds |
# Increase verbosity to see what's happening
src2purl /path/to/project --verbose
# Try with Software Heritage
src2purl /path/to/project --use-swh
# Lower confidence threshold
src2purl /path/to/project --confidence-threshold 0.3

Error: API rate limit exceeded
Solution: Add API tokens (especially GitHub token):
export GITHUB_TOKEN=your_token
src2purl /path/to/project

# Skip license detection
src2purl /path/to/project --no-license-detection
# Reduce scanning depth
src2purl /path/to/project --max-depth 1
# Ensure caching is enabled
src2purl /path/to/project  # Cache is enabled by default

For detailed debugging information:
# Maximum verbosity
src2purl /path/to/project --verbose
# With Python logging
PYTHONPATH=. python -m src2purl --verbose /path/to/project

# Show help message
src2purl --help
# Check version
src2purl --version

- name: Identify Package
  run: |
    pip install src2purl
    src2purl . --output-format json -o package-info.json
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

identify-package:
  script:
    - pip install src2purl
    - src2purl . --output-format json -o package-info.json
  artifacts:
    paths:
      - package-info.json

- Always use GitHub token: Free and provides significant benefits
- Start with default mode: Only use --use-swh when needed
- Cache API responses: Default caching improves repeat performance
- Use confidence thresholds: Filter results based on your needs
- Process in batches: Use Python API for multiple projects
- Monitor rate limits: Check API usage if processing many projects
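One way to monitor the GitHub rate limit is GitHub's own `/rate_limit` REST endpoint, which reports the remaining quota for each resource class. The sketch below is independent of SRC2PURL. The function names are hypothetical, the sample payload values are illustrative only, and which resource class (`core` or `search`) SRC2PURL actually consumes is an assumption to verify.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def fetch_rate_limit(token: Optional[str] = None) -> Dict[str, Any]:
    """Query GitHub for the current rate limit status (requires network access)."""
    request = urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "rate-limit-check",
        },
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_rate_limit(payload: Dict[str, Any], resource: str = "core") -> str:
    """Format the remaining quota for one resource class."""
    info = payload["resources"][resource]
    return f"{resource}: {info['remaining']}/{info['limit']} requests remaining"


# Illustrative payload with the same shape as the endpoint's response.
sample = {"resources": {"core": {"limit": 5000, "remaining": 4321, "reset": 0}}}
print(summarize_rate_limit(sample))
```

For a live check, call `summarize_rate_limit(fetch_rate_limit(os.environ.get("GITHUB_TOKEN")))` after importing `os`. Pausing a batch run when `remaining` approaches zero avoids the rate limit error described in the troubleshooting section.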
- API Reference - Detailed Python API documentation
- Discovery Methods - In-depth explanation of identification strategies
- Examples - More code examples and use cases