
Rookeen – spaCy-based web linguistic analysis

Rookeen is a spaCy-only pipeline and CLI that fetches web content and runs linguistic analyzers, including semantic embeddings, producing structured JSON output for ML and NLP workflows. It offers Unix pipeline composability with proper exit codes, structured output formats, and advanced data-processing tools.

Architecture Overview

flowchart TD
    %% Entry Point
    CLI[CLI Interface<br/>analyze • analyze-file • batch]
    
    %% Configuration
    Config[Configuration<br/>TOML • Environment • Defaults]
    
    %% Input Sources
    ReadLocal[Local Input<br/>stdin • files]
    Fetch[Web Scraper<br/>Async fetching • Rate limiting • Robots.txt]

    %% Core Pipeline Steps
    Detect[Language Detection<br/>Overrides: CLI • Config • Auto • Confidence]
    Load[Model Loader<br/>spaCy models • Auto-download • Preload]
    Process[Document Processor<br/>Tokenization • Parsing]
    
    %% Analysis Layer
    subgraph analysis [Analysis Layer]
        direction TB
        
        subgraph core_analyzers [Core Analyzers]
            direction LR
            Lexical[Lexical Stats<br/>Tokens • TTR • Lemmas]
            POS[POS Tagging<br/>UPOS counts • Ratios]
            NER[Named Entities<br/>Recognition • Examples]
            Readability[Readability<br/>Flesch-Kincaid • SMOG]
            Keywords[Keywords<br/>YAKE • Frequency]
            Dependency[Dependencies<br/>Syntactic relations]
        end

        subgraph optional_analyzers [Optional Analyzers]
            direction LR
            Embeddings[Embeddings<br/>Semantic vectors]
            Sentiment[Sentiment<br/>VADER • TextBlob • spaCy]
        end
    end
    
    %% Embedding Backends
    subgraph backends [Embedding Backends]
        direction LR
        MiniLM[MiniLM<br/>384-dim • Fast]
        BGE[BGE-M3<br/>1024-dim • Accurate]
        OpenAI[OpenAI TE3<br/>1536/3072-dim • API]
    end
    
    %% Results and Export
    Results[Analysis Results<br/>JSON format • Timing data]
    
    subgraph exports [Export Formats]
        direction LR
        JSON_Export[Summary JSON<br/>Machine-readable]
        SpaCyJSON_Export[spaCy JSON<br/>Token-level]
        Parquet_Export[Parquet<br/>Analytics]
        CoNLLU_Export[CoNLL-U<br/>Research]
        DocBin_Export[DocBin<br/>spaCy format]
    end
    
    %% Flow Connections
    CLI --> Config
    CLI --> Fetch
    CLI --> ReadLocal
    Config --> Fetch
    ReadLocal --> Detect
    Fetch --> Detect
    Detect --> Load
    Load --> Process
    Process --> analysis
    
    %% Analysis flows
    core_analyzers --> Results
    optional_analyzers --> Results
    
    %% Embedding backend connection
    Embeddings --> backends
    
    %% Export flows
    Results --> exports
    

Architecture Components

  • CLI Interface: Click-based command-line with analyze, analyze-file, and batch commands
  • Configuration System: Hierarchical settings precedence (CLI flags → Environment variables → TOML files → Defaults)
  • Processing Pipeline: Linear flow from web scraping through language detection, model loading, and document processing
  • Analyzer System: Plugin-based architecture with core analyzers (always available) and optional analyzers (require extra dependencies)
  • Export Formats: Multiple output formats (JSON, Parquet, CoNLL-U, DocBin) for different use cases
  • External Dependencies: spaCy language models, ML libraries (sentence-transformers, VADER), and specialized parsers (Stanza)
  • Embedding Backends: Pluggable backends for sentence embeddings (MiniLM, BGE-M3, OpenAI TE3)

The architecture follows a clean vertical flow: Input → Configuration → Processing → Analysis → Results → Export

Key features

  • Web page fetching with retries and HTML sanitization
  • Language detection and spaCy model loading with auto-download
  • Flexible analyzer selection with selective enable/disable for cost optimization
  • Plugin-based architecture with extensible analyzer registry
  • Industry-ready Unix pipeline composability with structured output and proper exit codes
  • Analyzers: lexical stats, POS, NER, readability, keywords, embeddings, sentiment
  • Semantic embeddings for similarity, clustering, and semantic search
  • Smart sentiment analysis with library prioritization (VADER → TextBlob → spaCy)
  • Performance benchmark harness for latency tracking and regression detection
  • Advanced pipeline tools for data analysis and filtering
  • Machine-readable timing and provenance for SLA monitoring and performance tracking
  • JSON Schema validation for contract-first development and data consistency
  • JSON output suitable for programmatic consumption and ML pipelines

Supported languages

  • en → en_core_web_sm (included by default)
  • de → de_core_news_sm (optional extra)
  • es → es_core_news_sm (optional extra)
  • fr → fr_core_news_sm (optional extra)

Install

Rookeen is managed with uv (dependency and environment management) and uses ruff for linting. After syncing, a console script named rookeen is available.

uv and ruff can be installed in several ways; see their documentation if you don't already have them.

# Core installation (includes English model)
uv sync

# Install with all extras (includes all language models)
uv sync --group dev --all-extras

# Optional: Add specific language models
uv sync --extra lang-de  # German
uv sync --extra lang-es  # Spanish
uv sync --extra lang-fr  # French

# Optional: Add embeddings support
uv sync --extra embeddings
# (includes sentence-transformers and OpenAI SDK for API backends)

# Optional: Add sentiment analysis support
uv sync --extra sentiment

# Optional: Add UD CoNLL-U export support
uv sync --extra ud

# Optional: Add Parquet export support
uv sync --extra parquet

# Combine multiple extras
uv sync --extra embeddings --extra sentiment --extra lang-de --extra lang-fr

Missing models are auto-downloaded by the CLI when --models-auto-download is set (the default). No extra script is required.

To install models manually instead:

# Example for German
uv pip install de-core-news-sm
# or
python -m spacy download de_core_news_sm

CLI usage

You can invoke the CLI in any of these equivalent ways (recommended first):

  • uv run rookeen [options]
  • uv run python -m rookeen [options]
  • uv run python -m rookeen.cli [options]

Analyze a URL:

uv run rookeen analyze "https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=1201524046" \
  --format json --models-auto-download --lang en --robots ignore -o results/nlp.json

Analyze a URL with embeddings:

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-embeddings --models-auto-download --robots ignore -o results/cat_analysis.json

Select an embeddings backend explicitly (MiniLM local):

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-embeddings --embeddings-backend miniLM --models-auto-download -o results/cat_minilm.json

Use the BGE-M3 backend (local, HF model id):

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-embeddings --embeddings-backend bge-m3 --embeddings-model BAAI/bge-m3 \
  --robots ignore -o results/cat_bge_m3.json

Use the OpenAI TE3 backend (API-based):

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-embeddings --embeddings-backend openai-te3 --embeddings-model text-embedding-3-small \
  --openai-api-key "$OPENAI_API_KEY" --robots ignore -o results/cat_openai_te3.json

Preload embeddings model/client at startup to avoid first-call latency:

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-embeddings --embeddings-backend bge-m3 --embeddings-model BAAI/bge-m3 \
  --embeddings-preload -o results/cat_bge_m3_preloaded.json

Analyze a URL with sentiment analysis:

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-sentiment --models-auto-download --robots ignore -o results/cat_sentiment.json

Analyze a local file:

uv run rookeen analyze-file ./samples/article.txt \
  --format json --models-auto-download --lang en -o results/article.json

Stream text from stdin (Unix pipeline integration):

echo 'Hello world' | uv run rookeen analyze --stdin --lang en --stdout | jq '.language.code'
# Output: "en"

# Language auto-detection
echo 'Bonjour le monde' | uv run rookeen analyze --stdin --stdout | jq '.language.code'
# Output: "fr"

# With analyzer selection
cat article.txt | uv run rookeen analyze --stdin --enable pos --disable keywords --stdout

Batch mode (one URL per line; # comments allowed):

uv run rookeen batch urls.txt --output-dir results --format json --models-auto-download
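
For reference, urls.txt is a plain list; lines starting with # are skipped (the entries below reuse URLs from this README):

# research articles
https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=1201524046
https://en.wikipedia.org/wiki/Cat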

Responsible crawling with rate limiting and robots.txt support:

# Respect robots.txt with custom rate limiting (2 requests/second)
uv run rookeen analyze "https://example.com/article" --rate-limit 2.0 --robots respect

# Ignore robots.txt for research purposes (use responsibly!)
uv run rookeen analyze "https://example.com/article" --robots ignore

Run 'uv run rookeen analyze --help' to see all available flags.

Common flags:

  • --lang: override language detection (e.g., en,de,es,fr) - takes highest precedence
  • --languages: preload models, comma-separated (e.g., en,de)
  • --models-auto-download/--no-models-auto-download: install missing spaCy models automatically
  • --enable-embeddings: enable sentence embeddings analysis (requires --extra embeddings)
  • --embeddings-preload/--no-embeddings-preload: preload embeddings backend/model at startup to avoid first-call latency
  • --embeddings-backend {miniLM,bge-m3,openai-te3}: choose embeddings backend
  • --embeddings-model <id>: model identifier for the selected backend (e.g., sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-m3, text-embedding-3-small)
  • --openai-api-key <key>: API key for OpenAI backend (falls back to env)
  • --enable-sentiment: enable sentiment analysis (requires --extra sentiment)
  • --enable <analyzer>: enable specific analyzers by name (can be used multiple times)
  • --disable <analyzer>: disable specific analyzers by name (can be used multiple times)
  • --rate-limit <float>: rate limit in requests per second (default: 0.5)
  • --robots <respect|ignore>: robots.txt policy (default: respect)
  • --verbose: verbose logs
  • --errors-json: force machine-readable JSON error output for automation
  • --stdin: read text from stdin instead of URL (analyze command only)
  • --stdout: stream JSON to stdout for pipeline composition

Quick Pipeline Examples

Basic Analysis Pipeline

# Analyze and extract language
uv run rookeen analyze "https://example.com" --stdout | jq '.language.code'

# Chain multiple operations
uv run rookeen analyze "https://example.com" --stdout | \
  jq '.analyzers[] | select(.name == "lexical_stats") | .results.total_tokens'

Benchmark Pipeline (Industry-Ready)

# Validate benchmark success
uv run python bench/run_bench.py --json --quiet | jq '.[0].return_code == 0'

# Get performance summary
uv run python bench/run_bench.py --json --quiet | \
  uv run python scripts/pipeline_tools.py analyze

# Filter slow benchmarks
uv run python bench/run_bench.py --json --quiet | \
  uv run python scripts/pipeline_tools.py filter --min-time 2.5

# CI/CD integration
if uv run python bench/run_bench.py --quiet --no-save; then
  echo "All benchmarks passed"
else
  echo "Some benchmarks failed"
  exit 1
fi

Usage Examples by User Perspective

1. Hobbyist: Personal Blog Analysis

Goal: Analyze personal blog posts or favorite websites to understand writing style, readability, and key topics without complex setup.

Environment:

  • Personal laptop (macOS/Linux/Windows)
  • Basic Python knowledge
  • No special infrastructure requirements
  • Single-user, occasional use

What happens:

# Quick installation
uv sync

# Analyze a blog post from URL
uv run rookeen analyze "https://myblog.com/post-about-nlp" \
  --models-auto-download --robots ignore -o my_analysis.json

# Check readability and keywords
cat my_analysis.json | jq '.analyzers[] | select(.name == "readability")'
cat my_analysis.json | jq '.analyzers[] | select(.name == "keywords") | .results.keyphrases[0:5]'

# Analyze local draft text
echo "Your draft text here..." | uv run rookeen analyze --stdin --lang en --stdout \
  | jq '{readability: .analyzers[] | select(.name == "readability") | .results.flesch_kincaid_grade}'

Why it works for hobbyist:

  • Zero configuration: Auto-downloads models, sensible defaults work out of the box
  • Human-readable output: JSON can be explored with jq or opened in any text editor
  • No infrastructure: Runs locally, no cloud services or API keys needed
  • Immediate results: Get linguistic insights in seconds without learning complex NLP concepts
  • Free and open: No licensing costs, works offline after initial setup
  • Educational: Understand your writing through metrics like readability scores and keyword extraction

2. Pro-Dev | Researcher: Advanced NLP Research Pipeline

Goal: Build production NLP features or conduct research requiring semantic embeddings, sentiment analysis, and batch processing with performance tracking.

Environment:

  • Development workstation with Python 3.10+
  • Research or production codebase integration
  • Optional: GPU for faster embeddings (BGE-M3)
  • Optional: OpenAI API key for TE3 embeddings
  • Multi-language support needed

What happens:

# Install with all extras for research
uv sync --group dev --all-extras

# Full analysis with embeddings and sentiment
uv run rookeen analyze "https://research-paper.com/article" \
  --enable-embeddings --embeddings-backend bge-m3 \
  --enable-sentiment --models-auto-download \
  --export-parquet --export-conllu --conllu-engine stanza \
  -o results/full_analysis

# Batch process research corpus
uv run rookeen batch research_urls.txt \
  --enable-embeddings --embeddings-backend miniLM \
  --output-dir results/batch --format json

# Pipeline integration for ML workflows
uv run rookeen analyze "https://dataset-sample.com" --stdout \
  --enable-embeddings --embeddings-backend bge-m3 \
  | jq '.analyzers[] | select(.name == "embeddings") | .results.vector' \
  | python ml_pipeline.py --input-format json

# Performance benchmarking for production
uv run python bench/run_bench.py --json --quiet \
  | uv run python scripts/pipeline_tools.py analyze \
  | jq '.avg_time'

# Selective analyzers for cost optimization
uv run rookeen analyze "https://api-content.com/data" \
  --enable pos --enable ner --enable-embeddings \
  --disable keywords --disable readability \
  -o results/optimized.json

Why it works for pro-dev | researcher:

  • Production-ready: Proper exit codes, error handling, and JSON schema validation for CI/CD
  • Advanced features: Semantic embeddings (3 backends), sentiment analysis, CoNLL-U export for research
  • Performance control: Selective analyzer enable/disable, preloading, benchmark harness for regression detection
  • Unix composability: Pipes seamlessly with jq, awk, and other tools for complex workflows
  • Research formats: CoNLL-U export with UD validation, Parquet for analytics, machine-readable timing
  • Scalable: Batch processing, rate limiting, robots.txt respect for ethical web scraping
  • Multi-language: Supports en/de/es/fr with auto-detection and model management

3. Team: Collaborative NLP Feature Development

Goal: Team develops NLP features together with consistent analysis, shared configurations, and version-controlled workflows.

Environment:

  • Shared codebase repository (Git)
  • Team members on different platforms
  • Shared configuration files (rookeen.toml)
  • CI/CD pipeline integration
  • Collaborative documentation

What happens:

# Shared team configuration
cat rookeen.toml
# [rookeen]
# format = "json"
# models_auto_download = true
# embeddings_backend = "miniLM"
# default_language = "en"

# Consistent analysis across team
uv run rookeen analyze "https://product-content.com/feature" \
  --config rookeen.toml --enable-embeddings -o team_results/feature.json

# CI/CD validation pipeline
uv run python scripts/validate_for_ci.py team_results/*.json

# Team benchmark suite
uv run python bench/run_bench.py --json --quiet \
  | uv run python scripts/pipeline_tools.py analyze \
  | tee benchmark_report.json

# Batch processing with team standards
uv run rookeen batch team_urls.txt \
  --output-dir team_results --format json \
  --enable pos --enable ner --enable-embeddings

# Compare team benchmark runs
uv run python scripts/pipeline_tools.py compare \
  bench/results/latest.json bench/results/previous.json

# Error handling for automation
uv run rookeen analyze "https://api.example.com" --errors-json \
  --stdout 2>/dev/null | jq -e '.language.code' || echo "Analysis failed"

Why it works for team:

  • Configuration management: TOML config files version-controlled in Git, consistent settings across team
  • Validation: JSON schema validation ensures data contracts, prevents integration issues
  • Reproducibility: Benchmark harness tracks performance, detects regressions before deployment
  • Automation: Proper exit codes, --errors-json flag, CI/CD integration ready
  • Collaboration: Standardized output formats (JSON, Parquet), easy to share and review results
  • Documentation: Self-documenting via JSON schema, clear error messages, comprehensive examples
  • Scalability: Batch processing handles team workloads, rate limiting prevents API abuse

4. Enterprise R&D: Large-Scale Production NLP Infrastructure

Goal: Deploy rookeen in enterprise R&D environments requiring SLA compliance, performance monitoring, cost optimization, and integration with existing ML pipelines.

Environment:

  • Production servers or cloud infrastructure
  • Kubernetes/Docker deployment
  • Monitoring and alerting systems (Prometheus, Grafana)
  • ML pipeline integration (Spark, Airflow, Kubeflow)
  • High-volume batch processing
  • Multi-region deployment considerations

What happens:

# Production configuration with environment variables
export ROOKEEN_EMBEDDINGS_BACKEND=bge-m3
export ROOKEEN_EMBEDDINGS_MODEL=BAAI/bge-m3
export ROOKEEN_MODELS_AUTO_DOWNLOAD=true
export ROOKEEN_RATE_LIMIT=10.0
export ROOKEEN_ROBOTS=respect

# High-volume batch processing with monitoring
uv run rookeen batch enterprise_corpus.txt \
  --output-dir /data/results --format json \
  --enable-embeddings --embeddings-preload \
  --export-parquet 2>&1 | tee processing.log

# Performance monitoring and SLA tracking
uv run rookeen analyze "https://enterprise-content.com" --stdout \
  | jq '{processing_time: .timing.total_seconds, 
         started_at: .timing.started_at,
         analyzers: [.analyzers[] | {name, processing_time}]}' \
  | send_to_monitoring.sh

# Cost-optimized selective analysis
uv run rookeen analyze "https://enterprise-api.com/data" \
  --enable pos --enable ner --enable-embeddings \
  --embeddings-backend miniLM \
  --disable keywords --disable readability --disable sentiment \
  -o /data/cost_optimized.json

# UD CoNLL-U export with Stanza engine (Level 1–2 compliance)
uv run rookeen analyze "https://enterprise-api.com/data" \
  --export-conllu --conllu-engine stanza --models-auto-download -o /data/conllu.conllu

# Integration with Spark/analytics pipelines
uv run rookeen batch enterprise_urls.txt \
  --export-parquet --output-dir /data/parquet \
  | spark-submit --class ProcessRookeenData spark_job.py

# Automated regression detection
uv run python bench/run_bench.py --json --quiet --no-save \
  | uv run python scripts/pipeline_tools.py analyze \
  | jq 'if .avg_time > 5.0 then "SLA violation" else "OK" end'

# Multi-language production setup
export ROOKEEN_LANGUAGES_PRELOAD=en,de,es,fr
uv run rookeen analyze "https://multilingual-site.com" \
  --enable-embeddings --models-auto-download \
  -o /data/multilingual.json

# Error handling and alerting
uv run rookeen analyze "$URL" --errors-json --stdout > /tmp/result.json 2>/tmp/error.json
ERROR_CODE=$?
if [ "$ERROR_CODE" -ne 0 ]; then
  ERROR_JSON=$(cat /tmp/error.json)
  send_alert.sh --error-code "$ERROR_CODE" --error "$ERROR_JSON"
  exit "$ERROR_CODE"
fi

Why it works for enterprise R&D:

  • SLA compliance: Machine-readable timing data, per-analyzer performance tracking, benchmark harness for regression detection
  • Cost optimization: Selective analyzer enable/disable, multiple embedding backends (local vs API), Parquet export for efficient storage
  • Production reliability: Proper exit codes, structured error handling, JSON schema validation, robots.txt respect
  • Scalability: Batch processing, rate limiting, preloading for consistent latency, Parquet export for big data workflows
  • Integration: Unix pipeline composability, structured JSON output, Parquet format for Spark/DuckDB, programmatic API ready
  • Monitoring: Timing and provenance data for dashboards, error codes for alerting, benchmark suite for performance tracking
  • Multi-language: Supports enterprise multilingual content with auto-detection and model management
  • Compliance: Respects robots.txt, configurable rate limiting, audit trail via timing and metadata

Language Detection:

  • Precedence: --lang CLI flag > config default > auto-detection
  • Warnings: auto-detection with confidence below 0.6 emits a warning
  • Normalization: language codes are normalized automatically (e.g., en-US → en)

Analyzer Selection and Plugin Registry

Rookeen features a flexible analyzer selection system that allows you to run only the analyzers you need, optimizing for cost, latency, and specific use cases. All analyzers are registered in a central registry and can be selectively enabled or disabled.

Available Analyzers

  • Core analyzers (always available):

    • dependency: Dependency parsing and grammatical relations
    • keywords: YAKE-based keyword and keyphrase extraction
    • lexical_stats: Token counts, sentence length, TTR, top lemmas
    • ner: Named entity recognition with entity types and counts
    • pos: Part-of-speech tagging with UPOS counts and ratios
    • readability: Readability metrics (Flesch, FK grade, SMOG, ARI, etc.)
  • Optional analyzers (require additional dependencies):

    • embeddings: Sentence embeddings using sentence-transformers
    • sentiment: Sentiment analysis with VADER/TextBlob prioritization

Selective Analyzer Control

Use --enable and --disable flags to run only specific analyzers:

# Run only POS and NER analyzers
uv run rookeen analyze "https://example.com/article" \
  --enable pos --enable ner --models-auto-download -o results/pos_ner.json

# Run all analyzers except keywords and readability
uv run rookeen analyze "https://example.com/article" \
  --disable keywords --disable readability --models-auto-download -o results/minimal.json

# Combine selective flags with optional analyzers
uv run rookeen analyze "https://example.com/article" \
  --enable pos --enable ner --enable-embeddings --enable-sentiment \
  --models-auto-download -o results/combined.json

Integration with Existing Flags

The selective analyzer flags work seamlessly with existing optional analyzer flags:

# This runs only embeddings (selective control)
uv run rookeen analyze "https://example.com/article" \
  --enable embeddings --models-auto-download -o results/embeddings_only.json

# This runs all analyzers + sentiment
uv run rookeen analyze "https://example.com/article" \
  --enable-sentiment --models-auto-download -o results/all_plus_sentiment.json

Plugin Registry Architecture

Rookeen uses a plugin-based architecture where all analyzers are registered in a central registry:

# Programmatic access to available analyzers
from rookeen.analyzers.base import available_analyzers, get_analyzer

# List all available analyzers
analyzers = available_analyzers()
print(analyzers)  # ['dependency', 'keywords', 'lexical_stats', 'ner', 'pos', 'readability']

# Get a specific analyzer class
pos_analyzer = get_analyzer('pos')

Use Cases

Performance Optimization:

# For entity extraction only
uv run rookeen analyze "https://news.example.com" \
  --enable ner --models-auto-download -o results/entities.json

Cost-Effective Analysis:

# Basic linguistic analysis without expensive computations
uv run rookeen analyze "https://blog.example.com" \
  --enable pos --enable lexical_stats --models-auto-download -o results/basic.json

Comprehensive ML Pipeline:

# Full analysis including embeddings for ML applications
uv run rookeen analyze "https://research.example.com" \
  --enable-embeddings --enable-sentiment --models-auto-download -o results/ml_ready.json

Batch Processing with Analyzer Selection

Analyzer selection works with batch processing:

# Process multiple URLs with consistent analyzer configuration
uv run rookeen batch urls.txt \
  --enable pos --enable ner --enable-embeddings \
  --output-dir results --models-auto-download

Custom Analyzer Development

The plugin registry enables adding custom analyzers:

from rookeen.analyzers.base import BaseAnalyzer, register_analyzer
from rookeen.models import AnalysisType, LinguisticAnalysisResult

@register_analyzer
class CustomAnalyzer(BaseAnalyzer):
    name = "custom"
    analysis_type = AnalysisType.CUSTOM

    async def analyze(self, doc, lang: str) -> LinguisticAnalysisResult:
        # Your custom analysis logic
        return LinguisticAnalysisResult(
            analysis_type=self.analysis_type,
            name=self.name,
            results={"custom_metric": 0.85},
            processing_time=0.1,
            confidence=0.9
        )
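
Once the defining module has been imported, the analyzer should be discoverable through the registry and selectable with --enable custom (a minimal check, assuming registration happens at import time):

from rookeen.analyzers.base import available_analyzers

# After import, the @register_analyzer decorator has run
assert "custom" in available_analyzers()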

Configuration

  • Precedence: CLI flags > environment variables (ROOKEEN_ prefix) > TOML config file > defaults.
  • Config file: pass at the root via --config PATH (flat keys or a [rookeen] table).
  • Environment variables (examples):
export ROOKEEN_FORMAT=md
export ROOKEEN_OUTPUT_DIR=results
export ROOKEEN_LANGUAGES_PRELOAD=en,de
export ROOKEEN_DEFAULT_LANGUAGE=en
export ROOKEEN_MODELS_AUTO_DOWNLOAD=false
export ROOKEEN_EMBEDDINGS_BACKEND=miniLM           # or bge-m3, openai-te3
export ROOKEEN_EMBEDDINGS_MODEL=BAAI/bge-m3        # backend-specific model id
export ROOKEEN_OPENAI_API_KEY=$OPENAI_API_KEY      # for openai-te3 backend
  • TOML example:
[rookeen]
format = "json"
output_dir = "results"
languages_preload = ["en", "de"]
default_language = "en"
models_auto_download = true
log_level = "INFO"
# Embeddings defaults (optional)
embeddings_backend = "miniLM"
embeddings_model = "sentence-transformers/all-MiniLM-L6-v2"
openai_api_key = ""
  • Using a config file:
uv run rookeen --config ./rookeen.toml analyze "https://example.org/article" -o results/out
# Or place rookeen.toml in current directory (auto-loaded)
uv run rookeen analyze "https://example.org/article" -o results/out

Exit codes:

  • 0 OK, 1 generic error, 2 usage error, 3 fetch error, 4 model error

Error Handling and Automation

Rookeen provides stable exit codes and machine-readable error output for automation and CI/CD pipelines:

  • --errors-json: Force machine-readable JSON error output instead of user-friendly text
  • Consistent exit codes: Reliable error codes for scripting and monitoring
  • JSON error format: Structured error information for programmatic handling

Examples:

Normal error output (user-friendly):

$ uv run rookeen analyze
Usage: python -m rookeen.cli analyze [OPTIONS] URL
Try 'python -m rookeen.cli analyze -h' for help.
Error: Missing argument 'URL'.
$ echo $?
2

JSON error output (machine-readable):

$ uv run rookeen --errors-json analyze
{"error": {"code": 2, "name": "USAGE_ERROR", "message": "Invalid CLI arguments"}}
$ echo $?
2

Use cases:

  • CI/CD pipelines that need to parse error conditions
  • Monitoring systems checking exit codes
  • Scripts that need structured error information
  • Automated retry logic based on error types (a sketch follows below)
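
For the last case, a retry wrapper can key off the documented exit codes. A minimal sketch, treating only fetch errors (exit code 3) as transient:

import json
import subprocess
import time

def analyze_with_retry(url: str, retries: int = 3) -> dict:
    for attempt in range(retries):
        proc = subprocess.run(
            ["uv", "run", "rookeen", "analyze", url, "--errors-json", "--stdout"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return json.loads(proc.stdout)  # full analysis result
        if proc.returncode != 3:
            # usage/model/generic errors are not transient; surface them
            raise RuntimeError(proc.stdout or proc.stderr)
        time.sleep(2 ** attempt)  # exponential backoff before retrying the fetch
    raise RuntimeError(f"fetch failed after {retries} attempts: {url}")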

JSON output schema (snippet)

{
  "tool": "rookeen",
  "version": "0.1.0",
  "source": {
    "type": "url",
    "value": "https://…",
    "fetched_at": 1757900000.0,
    "domain": "en.wikipedia.org"
  },
  "language": {
    "code": "en",
    "confidence": 0.98,
    "model": "en_core_web_sm"
  },
  "content": {
    "title": "Natural language processing — Wikipedia",
    "char_count": 12345,
    "word_count": 2345
  },
  "analyzers": [
    {
      "name": "lexical_stats",
      "processing_time": 0.12,
      "confidence": 1.0,
      "results": {
        "total_tokens": 2200,
        "unique_lemmas": 900,
        "sentences": 120,
        "avg_token_length": 4.7,
        "avg_sentence_length_tokens": 18.3,
        "type_token_ratio": 0.41,
        "top_lemmas": [["language", 50], ["model", 32]]
      },
      "metadata": {
        "language": {"code": "en", "confidence": 0.98},
        "model": "en_core_web_sm"
      }
    },
    {
      "name": "embeddings",
      "processing_time": 5.45,
      "confidence": 1.0,
      "results": {
        "supported": true,
        "backend": "bge-m3",
        "model": "BAAI/bge-m3",
        "dim": 1024,
        "normalized": true,
        "vector": [0.044, -0.063, 0.078, 0.063, 0.007, ...]
      },
      "metadata": {
        "language": {"code": "en", "confidence": 0.98},
        "model": "en_core_web_sm"
      }
    },
    {
      "name": "sentiment",
      "processing_time": 1.5,
      "confidence": 0.9999,
      "results": {
        "supported": true,
        "label": "positive",
        "score": 0.9999,
        "method": "vader",
        "scores": {
          "neg": 0.035,
          "neu": 0.909,
          "pos": 0.056,
          "compound": 0.9999
        }
      },
      "metadata": {
        "language": {"code": "en", "confidence": 0.98},
        "model": "en_core_web_sm"
      }
    }
    // Additional analyzers: "pos", "ner", "readability", "keywords"
  ],
  "timing": {
    "started_at": 1758247007.029181,
    "ended_at": 1758247007.317931,
    "total_seconds": 0.28876041690818965
  }
}

Analyzer highlights:

  • lexical_stats: token and sentence counts, TTR, top lemmas
  • pos: upos_counts, upos_ratios, top_lemmas_by_upos
  • ner: supported, counts_by_label, examples_by_label, total_entities
  • readability: Flesch, FK grade, SMOG, ARI, CLI, Linsear, Dale–Chall, text standard
  • keywords: YAKE-based keyphrases
  • embeddings: sentence vectors via pluggable backends (e.g., 384-dim with the default sentence-transformers/all-MiniLM-L6-v2)
  • sentiment: smart analysis using VADER/TextBlob/spaCy with confidence scores and detailed breakdowns

Semantic Embeddings

Rookeen supports generating sentence-level semantic embeddings for advanced NLP applications including similarity search, content clustering, and semantic analysis.

Installation for Embeddings

uv sync --extra embeddings

Usage

Generate embeddings alongside linguistic analysis:

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-embeddings --models-auto-download --robots ignore -o results/cat_analysis.json

Backends:

  • miniLM (local; default model sentence-transformers/all-MiniLM-L6-v2, dim 384)
  • bge-m3 (local; model BAAI/bge-m3, dim 1024)
  • openai-te3 (API; models text-embedding-3-small [1536], text-embedding-3-large [3072])

Preload to stabilize tail latency (optional): --embeddings-preload.

Performance, memory, and security notes:

  • BGE-M3 runs on CPU by default and automatically selects the GPU when CUDA is available. To restrict device selection, set e.g. CUDA_VISIBLE_DEVICES=0.
  • BGE‑M3 size: typical on-disk weights are ~2.1 GB (fp32) according to the model card; fp16/bfloat16 inference can use ~1.1 GB VRAM. Local HF cache size can be larger (e.g., multiple shards/backends), so you may see >2 GB on disk. Start with batch size 1 and mark tests as @pytest.mark.slow.
  • OpenAI API: never log API keys; pass via --openai-api-key or env (ROOKEEN_OPENAI_API_KEY/OPENAI_API_KEY).
  • OpenAI timeouts: set ROOKEEN_OPENAI_TIMEOUT or OPENAI_TIMEOUT (seconds).

Features

  • Pluggable backends: miniLM, BGE-M3, OpenAI TE3
  • Provenance: results include backend, model, dim, normalized
  • Normalization: L2-normalized embeddings for cosine similarity (see the sketch after this list)
  • Format: JSON-serializable float arrays ready for ML pipelines
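
Because vectors are L2-normalized, cosine similarity between two analyzed documents reduces to a dot product. A minimal sketch, assuming two result files produced with --enable-embeddings and the same backend (the second file name is hypothetical):

import json

def load_vector(path: str) -> list[float]:
    # Pull the embeddings analyzer's vector out of a Rookeen result file
    with open(path) as f:
        doc = json.load(f)
    emb = next(a for a in doc["analyzers"] if a["name"] == "embeddings")
    return emb["results"]["vector"]

a = load_vector("results/cat_analysis.json")
b = load_vector("results/dog_analysis.json")  # hypothetical second run
similarity = sum(x * y for x, y in zip(a, b))  # dot product == cosine for unit vectors
print(f"cosine similarity: {similarity:.4f}")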

Applications

  • Document Similarity: Compare semantic similarity between documents
  • Content Clustering: Group related articles by topic automatically
  • Semantic Search: Find content by meaning, not just keywords
  • ML Pipelines: Feed embeddings directly into downstream ML models

Output Structure

{
  "name": "embeddings",
  "results": {
    "supported": true,
    "backend": "bge-m3",
    "model": "BAAI/bge-m3",
    "dim": 1024,
    "normalized": true,
    "vector": [0.0439, -0.0627, 0.0780, 0.0630, 0.0069, ...]
  }
}

Performance

  • Processing time: ~1 second for typical web articles
  • File size impact: ~3KB per document (384 floats × 8 bytes)
  • Memory usage: ~23MB model (lazy-loaded, shared across analyses)

CLI Options

  • --enable-embeddings: Enable sentence embeddings generation
  • --embeddings-backend: Select backend (miniLM, bge-m3, openai-te3)
  • --embeddings-model: Backend-specific model id
  • --embeddings-preload/--no-embeddings-preload: Preload backend/model at startup
  • --openai-api-key: API key for OpenAI backend (or use env)

Smart Sentiment Analysis

Rookeen supports sentiment analysis using intelligent library prioritization for fast, accurate results without custom implementations.

Installation for Sentiment Analysis

uv sync --extra sentiment

Usage

Analyze sentiment alongside linguistic analysis:

uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
  --enable-sentiment --models-auto-download --robots ignore -o results/cat_sentiment.json

Library Prioritization

Rookeen uses smart library selection for optimal performance (a sketch of the fallback pattern follows this list):

  • VADER (Primary): Fastest, most reliable for general text (~0.0001s processing)
  • TextBlob (Fallback): Good balance of speed and accuracy
  • spaCy (Last resort): If available with sentiment extension
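
A minimal sketch of the fallback pattern, assuming the vaderSentiment and textblob packages from the sentiment extra are available; this illustrates the prioritization idea, not Rookeen's internals:

# Prefer VADER; fall back to TextBlob if it is not installed
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    def sentiment(text: str) -> dict:
        scores = SentimentIntensityAnalyzer().polarity_scores(text)
        return {"method": "vader", "score": scores["compound"], "scores": scores}
except ImportError:
    from textblob import TextBlob

    def sentiment(text: str) -> dict:
        return {"method": "textblob", "score": TextBlob(text).sentiment.polarity}

print(sentiment("Rookeen makes linguistic analysis pleasantly simple."))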

Features

  • Methods: VADER compound scoring, detailed breakdowns (positive/neutral/negative)
  • Labels: positive, negative, neutral with confidence scores
  • Performance: sub-millisecond processing with high accuracy
  • Reliability: Uses battle-tested, well-maintained libraries

Applications

  • Content Analysis: Determine overall sentiment of articles and web pages
  • Brand Monitoring: Track sentiment in customer content
  • Content Classification: Automatically categorize content by emotional tone
  • ML Pipelines: Feed sentiment scores into downstream models

Output Structure

{
  "name": "sentiment",
  "results": {
    "supported": true,
    "label": "positive",
    "score": 0.9999,
    "method": "vader",
    "scores": {
      "neg": 0.035,
      "neu": 0.909,
      "pos": 0.056,
      "compound": 0.9999
    }
  }
}

Performance

  • Processing time: ~0.0001-0.0002 seconds (sub-millisecond)
  • Confidence: compound scores up to 0.9999 on real content
  • Memory usage: Minimal (VADER has no model dependencies)

CLI Option

  • --enable-sentiment: Enable sentiment analysis (requires --extra sentiment)

CoNLL-U Export (UD)

Rookeen supports exporting linguistic analysis in CoNLL-U format (Universal Dependencies standard) with two quality levels:

Installation for UD Export

uv sync --extra ud

Usage

Export CoNLL-U alongside JSON analysis:

uv run rookeen analyze "https://example.com/article" \
  --export-conllu --conllu-engine stanza --models-auto-download -o results/article
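
The exported .conllu file can then be consumed with standard UD tooling. A minimal sketch using the third-party conllu package (an assumption; it is not a Rookeen dependency):

from conllu import parse  # pip install conllu

with open("results/article.conllu") as f:
    sentences = parse(f.read())

# Surface form, UPOS tag, and dependency relation for the first sentence
for token in sentences[0]:
    print(token["form"], token["upos"], token["deprel"])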

Engines

Stanza Engine (Recommended)

  • --conllu-engine stanza: Uses Stanford's Stanza for UD-native parsing
  • Quality: Passes UD validation Levels 1-2 (format compliance)
  • Use case: Production-grade CoNLL-U for research and analysis
  • Limitations: Level 3 validation may show content-related errors on complex web text (expected behavior)

Basic Engine (Fallback)

  • --conllu-engine basic: Heuristic conversion from spaCy parsing
  • Quality: Level 1 compliant, not UD-validated
  • Use case: Simple text analysis when Stanza is unavailable
  • Warning: Not recommended for web content or research use

Validation Results

Our implementation achieves industry-standard compliance:

Content Type                         Level 1  Level 2  Level 3
Literary prose (Pride & Prejudice)   PASSED   PASSED   errors*
Web content (Wikipedia)              PASSED   PASSED   errors*

*Level 3 errors are linguistic content issues expected when parsing web navigation and complex text structures. Level 1-2 compliance is production-grade for most NLP applications.

CLI Options

  • --export-conllu: Enable CoNLL-U export
  • --conllu-engine {auto,stanza,basic}: Choose parsing engine (default: auto)
  • --ud-auto-download: Auto-download Stanza models (default: true)
  • --allow-non-ud-conllu: Allow basic engine fallback when Stanza unavailable

Parquet Export (Analyzer Aggregates)

Rookeen supports exporting analyzer summary tables in Apache Parquet format for analytics workflows (Spark, DuckDB, pandas, etc.).

Installation

uv sync --extra parquet

Usage

Add --export-parquet to any CLI command to write a <base>.parquet file with analyzer-level aggregates:

uv run rookeen analyze "https://example.com/article" \
  --export-parquet --models-auto-download -o results/article

  • The Parquet file contains one row per analyzer, with flat scalar results (int, float, str, bool); see the pandas sketch after this list.
  • If present, metadata.model and metadata.language are included as columns for traceability and analytics.
  • Optional dependency: pyarrow>=16 (installed via --extra parquet).
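
A minimal sketch for loading the summary table, assuming pandas is installed alongside the parquet extra:

import pandas as pd

# One row per analyzer; flat scalar results become columns
df = pd.read_parquet("results/article.parquet")
print(df[["name", "processing_time", "confidence"]])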

Example output columns:

  • name, processing_time, confidence, metadata.model, metadata.language, results.total_tokens, ...

CLI Option

  • --export-parquet: Write analyzer summary table to Parquet

Unix Pipeline Tools

Rookeen provides additional CLI utilities specifically designed for Unix pipeline composition and data analysis workflows.

Pipeline Tools Features

  • Pure structured output: JSON/CSV formats for reliable pipeline processing
  • Advanced filtering: Filter results by success, time, language, analyzers
  • Statistical analysis: Comprehensive performance analysis and comparison
  • Comparison tools: Compare benchmark results across different runs
  • Exit code handling: Proper error codes for automation and CI/CD

Pipeline Tools Usage

# Analyze benchmark performance statistics
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py analyze

# Filter for slow benchmarks (> 2.5 seconds)
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --min-time 2.5

# Count successful benchmarks only
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --success-only --count

# Filter by specific analyzer
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --analyzer embeddings

# Compare two benchmark runs
uv run python scripts/pipeline_tools.py compare bench/results/latest.json bench/results/previous.json

Pipeline Tools Output Example

{
  "total": 4,
  "successful": 4,
  "failed": 0,
  "success_rate": 1.0,
  "avg_time": 2.291,
  "min_time": 2.255,
  "max_time": 2.329,
  "total_time": 9.163,
  "fastest_case": "en_wiki_full",
  "slowest_case": "en_wiki_embeddings"
}

E2E tests

The E2E suite provides regression protection across multiple languages and sources, with robust error handling and thorough assertions.

Test Coverage

  • 20 tests covering English, German, Spanish, French, and multilingual content
  • Stable sources: Wikipedia oldid pages and Project Gutenberg texts
  • Comprehensive assertions: Language detection, POS coverage, NER support, analyzer metadata validation
  • Robust execution: 5-minute timeouts with exponential backoff retries

Quick Commands

Note: For full test coverage (including embeddings, sentiment, and CoNLL-U tests), sync all optional dependencies:

uv sync --group dev --all-extras
# Run full E2E suite - takes some time
uv run pytest -q tests/e2e -n auto

# Run smoke tests only (fast CI validation - 2 tests, 18 deselected)
uv run pytest -q -m smoke tests/e2e -n auto

# Run specific test categories
uv run pytest tests/e2e/test_conllu_stanza.py -n auto  # CoNLL-U validation
uv run pytest tests/e2e/test_analyzer_selection.py -n auto  # Analyzer selection
uv run pytest tests/e2e/test_mixed_batch.py -n auto  # Batch processing
uv run pytest tests/e2e/test_testdata_integration.py -n auto  # Test data file integration

# Run backend-specific embeddings tests
uv run pytest -q tests/e2e/test_embeddings_minilm.py -n auto
uv run pytest -q tests/e2e/test_embeddings_bge_m3.py -n auto
# Requires OPENAI_API_KEY
uv run pytest -q -m external tests/e2e/test_embeddings_openai_te3.py -n auto

Note:

  • CoNLL-U validation tests (test_conllu_local.py, test_conllu_stanza.py) rely on the UD Tools validator and will be skipped unless you provide its path.
  • Obtain the validator script (validate.py) from the Universal Dependencies tools repository.
  • Configure path via env var UD_VALIDATE_SCRIPT (or ROOKEEN_UD_VALIDATE_SCRIPT), or place a copy at tools/validate.py.

Testing & Benchmarks

Get Started: See Rookeen in Action

After syncing all dependencies, run the demo test suite to see rookeen in action:

# Install all dependencies first
uv sync --group dev --all-extras

# Run the demo test suite
uv run python scripts/run_demo_tests.py

This script runs 5 real-world usage tests across different scenarios:

  • Basic web page analysis (BBC Technology)
  • Technical blog with embeddings (Python.org)
  • Sentiment analysis (file-based)
  • Selective analyzers (Wikipedia)
  • Stdin pipeline composition

All test artifacts and a comprehensive report are saved to the results/ folder.

Unit & Integration Tests

Run fast smoke tests (CI default):

uv run pytest -q -m smoke -n auto

Run the main test suite without network:

uv run pytest -q -m "not external and not slow" -n auto

Run all tests, including network tests (about 2.5 minutes on an Apple M4):

uv run pytest -q -n auto

CLI Validation Scripts

Additional validation scripts for testing CLI behavior and edge cases. These scripts are integrated into CI to ensure CLI functionality doesn't regress.

CLI Chain Validation - Tests error handling, validation, and edge cases:

bash scripts/validate_cli_chain_fixed.sh

This script tests:

  • CLI parsing errors (missing arguments, invalid options)
  • File I/O errors (non-existent files)
  • Network/fetch errors
  • Successful operations with pipeline composition
  • Flag combinations and edge cases

Note: This script intentionally tests error conditions and always exits with code 0, as it validates that errors are handled correctly.

Streaming Mode Validation - Tests stdin/stdout functionality:

bash scripts/validate_streaming_mode.sh

This script tests:

  • Basic streaming validation
  • Empty input handling
  • Unicode/multilingual support
  • Auto language detection
  • Multi-line and large input handling
  • Analyzer selection with stdin
  • Export options with stdin
  • Binary input handling
  • Error validation for conflicting arguments

Note: This script tests functional behavior and will exit with non-zero code if any test fails, indicating a real regression.

Performance Benchmarks

Benchmarks (JSON pipeline):

uv run python bench/run_bench.py --json --quiet | jq '[.[] | {case, seconds, ok: .success}]'

Compute-only benchmark (stdin):

cat bench/sample.txt | uv run rookeen analyze --stdin --lang en --stdout \
  | jq '{total_s: .timing.total_seconds, analyzers: [.analyzers[] | {name, processing_time}]}'

Performance note:

Treat the Performance Benchmarks section below as the source of truth. Avoid hardcoding machine-specific numbers in docs; instead, generate a baseline JSON on your hardware and reference that in PRs if needed.

Performance Benchmarks

Rookeen includes a comprehensive performance benchmark harness for tracking latency and detecting performance regressions.

Benchmark Coverage

  • 6 test scenarios: default analysis, embeddings-only, sentiment-only, full analysis (embeddings + sentiment), embeddings (BGE-M3), embeddings (OpenAI TE3, optional)
  • Stable input: Wikipedia article snapshot for consistent, reproducible benchmarking
  • Metrics captured: execution time, return codes, and success status
  • Results storage: CSV format for trend analysis and JSON for programmatic access

Industry-Ready Unix Pipeline Composability

The benchmark harness now supports industry-standard Unix pipeline patterns:

Pipeline-Friendly JSON Output

# Pure JSON output for reliable pipeline processing
uv run python bench/run_bench.py --json --quiet | jq '.[0].return_code == 0'
# Output: true

# Validate all benchmarks passed
uv run python bench/run_bench.py --json --quiet | jq 'all(.success)'
# Output: true

# Extract performance metrics
uv run python bench/run_bench.py --json --quiet | jq '[.[] | {case: .case, time: .seconds}]'

Multiple Output Formats

# Human-readable table format
uv run python bench/run_bench.py --format table

# CSV for data analysis and spreadsheets
uv run python bench/run_bench.py --format csv > benchmark_results.csv

# JSON for programmatic processing (default)
uv run python bench/run_bench.py --json

Quiet Mode for Automation

# Suppress progress output for CI/CD and scripts
uv run python bench/run_bench.py --quiet

# Combined with JSON for pure machine-readable output
uv run python bench/run_bench.py --json --quiet | jq '.[] | select(.success == false)'

Advanced Pipeline Examples

# Find the slowest successful benchmark
uv run python bench/run_bench.py --json --quiet | \
  jq '[.[] | select(.success == true)] | sort_by(.seconds) | reverse | .[0] | {slowest: .case, time: .seconds}'

# Generate performance report
uv run python bench/run_bench.py --json --quiet | \
  jq -r '"Benchmark Report:
Total: \(length)
Passed: \([.[] | select(.success)] | length)
Avg Time: \(([.[].seconds] | add / length * 1000 | round / 1000))s
Max Time: \(([.[].seconds] | max * 1000 | round / 1000))s"'

Running Benchmarks

# Interactive mode (default - human-friendly)
uv run python bench/run_bench.py

# Pipeline mode (machine-friendly)
uv run python bench/run_bench.py --json --quiet

# CI/CD mode (no output, check exit code)
uv run python bench/run_bench.py --quiet --no-save

# View latest results
cat bench/results/latest.json

# Advanced pipeline processing with analysis tools
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py analyze

Compute-Only Local Stdin Benchmark

# Compute-only local stdin benchmark (no network)
cat bench/sample.txt | uv run rookeen analyze --stdin --lang en --stdout \
  | jq '{total_s: .timing.total_seconds, analyzers: [.analyzers[] | {name, processing_time}]}'

Example Output

[
  {
    "timestamp": "2025-09-19T16:06:21.523567",
    "case": "en_wiki",
    "url": "https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=1201524046",
    "language": "en",
    "analyzers": "default",
    "return_code": 0,
    "seconds": 2.332,
    "success": true
  }
]

Performance Metrics (Typical)

  • Default analysis: ~3.9 seconds
  • Embeddings analysis: ~7.5 seconds
  • Sentiment analysis: ~3.7 seconds
  • Full analysis: ~7.5 seconds

Machine-readable timing and provenance

Rookeen provides comprehensive timing and provenance tracking for performance monitoring and SLA compliance. Every analysis includes machine-readable timing data and per-analyzer metadata.

Timing Information

Each JSON output includes overall processing timing:

  • started_at: Unix timestamp when analysis began
  • ended_at: Unix timestamp when analysis completed
  • total_seconds: High-precision total processing duration

Per-Analyzer Provenance

Each analyzer result includes detailed metadata (aggregated in the sketch after this list):

  • processing_time: Individual analyzer processing time
  • confidence: Analyzer confidence score
  • metadata.language: Language code and confidence used
  • metadata.model: SpaCy model used for processing
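
A minimal sketch that turns these fields into a per-analyzer timing report (using a result file like the ones produced in the examples below):

import json

with open("results/analysis.json") as f:
    doc = json.load(f)

total = doc["timing"]["total_seconds"]
# Rank analyzers by their share of total processing time
for a in sorted(doc["analyzers"], key=lambda a: a["processing_time"], reverse=True):
    share = 100 * a["processing_time"] / total
    print(f"{a['name']:<15} {a['processing_time']:.3f}s ({share:.0f}%)")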

Usage Examples

# Get timing information
uv run rookeen analyze "https://example.com" --robots ignore -o results/analysis.json
jq '.timing.total_seconds' results/analysis.json

# Monitor per-analyzer performance
uv run rookeen analyze "https://example.com" --robots ignore -o results/analysis.json
jq '.analyzers[].processing_time' results/analysis.json

# Check analyzer provenance
uv run rookeen analyze "https://example.com" --robots ignore -o results/analysis.json
jq '.analyzers[0].metadata' results/analysis.json

Applications

  • SLA Monitoring: Track processing times against service level agreements
  • Performance Optimization: Identify slow analyzers and optimize resource usage
  • Cost Analysis: Monitor analyzer usage for cost optimization
  • Debugging: Trace processing provenance for issue diagnosis
  • Analytics: Build dashboards and reports on processing performance

JSON Schema validation

Validate Rookeen JSON outputs against the official schema for data consistency and contract-first development.

# Validate existing JSON files
uv run python scripts/validate_for_ci.py results/*.json

# Process URLs directly and validate (result-agnostic)
uv run python scripts/validate_for_ci.py "https://example.com" --enable-embeddings --verbose

# Run comprehensive validation test suite
uv run python scripts/test_validation.py
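
scripts/validate_for_ci.py is the supported entry point. For in-process validation, a minimal sketch with the jsonschema library (the schema path below is an assumption; point it at wherever the repository ships its schema):

import json
from jsonschema import validate  # pip install jsonschema

# Hypothetical schema location; adjust to the repository's actual schema file
with open("schemas/rookeen.schema.json") as f:
    schema = json.load(f)

with open("results/analysis.json") as f:
    validate(instance=json.load(f), schema=schema)  # raises ValidationError on mismatch
print("valid")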

Scripts and Utilities

Rookeen provides several utility scripts for advanced usage and automation:

Pipeline Tools (scripts/pipeline_tools.py)

Advanced command-line utilities for data processing and analysis:

# Analyze benchmark results
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py analyze

# Filter results by criteria
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --success-only

# Compare benchmark runs
uv run python scripts/pipeline_tools.py compare bench/results/latest.json bench/results/previous.json

Pipeline Demo (scripts/pipeline_demo.sh)

Interactive demonstration of Unix pipeline composability patterns:

# Run comprehensive pipeline examples
./scripts/pipeline_demo.sh

# Learn advanced pipeline techniques and best practices

Validation Scripts

  • scripts/validate_for_ci.py: JSON schema validation for CI/CD pipelines
  • scripts/test_validation.py: Comprehensive validation test suite
  • scripts/pipeline_demo.sh: Interactive pipeline examples and tutorials

Industry-Ready Pipeline Features

Rookeen now provides industry-standard Unix pipeline composability with these key improvements:

Proper Output Separation

  • stdout: Pure structured data (JSON, CSV, table)
  • stderr: Human-readable messages, progress, errors
  • Exit codes: 0=success, 1=failure for automation

Multiple Output Formats

  • --format json: Machine-readable JSON (default)
  • --format csv: Spreadsheet-compatible CSV
  • --format table: Human-readable tables

Pipeline-Friendly Flags

  • --quiet: Suppress progress for automation
  • --json: Pure JSON output (implies --quiet)
  • --no-save: Skip file operations for CI/CD

Advanced Data Processing

  • scripts/pipeline_tools.py: Statistical analysis and filtering
  • scripts/pipeline_demo.sh: Interactive examples and tutorials
  • Integration with jq, awk, sed, and other Unix tools

Real-World Examples

# CI/CD integration
uv run python bench/run_bench.py --quiet --no-save || exit 1

# Data analysis pipeline
uv run python bench/run_bench.py --json --quiet | \
  uv run python scripts/pipeline_tools.py analyze | \
  jq '.avg_time'

# Performance monitoring
uv run python bench/run_bench.py --json --quiet | \
  uv run python scripts/pipeline_tools.py filter --min-time 3.0 | \
  jq length

Notes

  • The pipeline is spaCy-only.
  • Models are installed on-demand when --models-auto-download is used (default). A separate download script is not required.
