Rookeen is a spaCy-only pipeline and CLI that fetches web content and runs linguistic analyzers, including semantic embeddings, producing structured JSON output for ML and NLP workflows. It offers industry-ready Unix pipeline composability with proper exit codes, structured output formats, and advanced data-processing tools.
- Architecture Overview
- Key features
- Supported languages
- Install
- CLI usage
- Quick Pipeline Examples
- Usage Examples by User Perspective
- Language Detection
- Analyzer Selection and Plugin Registry
- Configuration
- Error Handling and Automation
- JSON output schema (snippet)
- Semantic Embeddings
- Smart Sentiment Analysis
- CoNLL-U Export (UD)
- Parquet Export (Analyzer Aggregates)
- Unix Pipeline Tools
- E2E tests
- Testing & Benchmarks
- Performance Benchmarks
- Machine-readable timing and provenance
- JSON Schema validation
- Scripts and Utilities
- Industry-Ready Pipeline Features
- Notes
flowchart TD
%% Entry Point
CLI[CLI Interface<br/>analyze • analyze-file • batch]
%% Configuration
Config[Configuration<br/>TOML • Environment • Defaults]
%% Input Sources
ReadLocal[Local Input<br/>stdin • files]
Fetch[Web Scraper<br/>Async fetching • Rate limiting • Robots.txt]
%% Core Pipeline Steps
Detect[Language Detection<br/>Overrides: CLI • Config • Auto • Confidence]
Load[Model Loader<br/>spaCy models • Auto-download • Preload]
Process[Document Processor<br/>Tokenization • Parsing]
%% Analysis Layer
subgraph analysis [Analysis Layer]
direction TB
subgraph core_analyzers [Core Analyzers]
direction LR
Lexical[Lexical Stats<br/>Tokens • TTR • Lemmas]
POS[POS Tagging<br/>UPOS counts • Ratios]
NER[Named Entities<br/>Recognition • Examples]
Readability[Readability<br/>Flesch-Kincaid • SMOG]
Keywords[Keywords<br/>YAKE • Frequency]
Dependency[Dependencies<br/>Syntactic relations]
end
subgraph optional_analyzers [Optional Analyzers]
direction LR
Embeddings[Embeddings<br/>Semantic vectors]
Sentiment[Sentiment<br/>VADER • TextBlob • spaCy]
end
end
%% Embedding Backends
subgraph backends [Embedding Backends]
direction LR
MiniLM[MiniLM<br/>384-dim • Fast]
BGE[BGE-M3<br/>1024-dim • Accurate]
OpenAI[OpenAI TE3<br/>1536/3072-dim • API]
end
%% Results and Export
Results[Analysis Results<br/>JSON format • Timing data]
subgraph exports [Export Formats]
direction LR
JSON_Export[Summary JSON<br/>Machine-readable]
SpaCyJSON_Export[spaCy JSON<br/>Token-level]
Parquet_Export[Parquet<br/>Analytics]
CoNLLU_Export[CoNLL-U<br/>Research]
DocBin_Export[DocBin<br/>spaCy format]
end
%% Flow Connections
CLI --> Config
CLI --> Fetch
CLI --> ReadLocal
Config --> Fetch
ReadLocal --> Detect
Fetch --> Detect
Detect --> Load
Load --> Process
Process --> analysis
%% Analysis flows
core_analyzers --> Results
optional_analyzers --> Results
%% Embedding backend connection
Embeddings --> backends
%% Export flows
Results --> exports
- CLI Interface: Click-based command line with `analyze`, `analyze-file`, and `batch` commands
- Configuration System: Hierarchical settings precedence (CLI flags → Environment variables → TOML files → Defaults)
- Processing Pipeline: Linear flow from web scraping through language detection, model loading, and document processing
- Analyzer System: Plugin-based architecture with core analyzers (always available) and optional analyzers (require extra dependencies)
- Export Formats: Multiple output formats (JSON, Parquet, CoNLL-U, DocBin) for different use cases
- External Dependencies: spaCy language models, ML libraries (sentence-transformers, VADER), and specialized parsers (Stanza)
- Embedding Backends: Pluggable backends for sentence embeddings (MiniLM, BGE-M3, OpenAI TE3)
The architecture follows a clean vertical flow: Input → Configuration → Processing → Analysis → Results → Export
- Web page fetching with retries and HTML sanitization
- Language detection and spaCy model loading with auto-download
- Flexible analyzer selection with selective enable/disable for cost optimization
- Plugin-based architecture with extensible analyzer registry
- Industry-ready Unix pipeline composability with structured output and proper exit codes
- Analyzers: lexical stats, POS, NER, readability, keywords, embeddings, sentiment
- Semantic embeddings for similarity, clustering, and semantic search
- Smart sentiment analysis with library prioritization (VADER → TextBlob → spaCy)
- Performance benchmark harness for latency tracking and regression detection
- Advanced pipeline tools for data analysis and filtering
- Machine-readable timing and provenance for SLA monitoring and performance tracking
- JSON Schema validation for contract-first development and data consistency
- JSON output suitable for programmatic consumption and ML pipelines
- en → `en_core_web_sm` (included by default)
- de → `de_core_news_sm` (optional extra)
- es → `es_core_news_sm` (optional extra)
- fr → `fr_core_news_sm` (optional extra)
Rookeen is managed with uv (dependency and environment management) and uses ruff for linting.
After syncing, a console script named `rookeen` is available.
uv and ruff can be installed in several ways; see their documentation if you do not already have them.
# Core installation (includes English model)
uv sync
# Install with all extras (includes all language models)
uv sync --group dev --all-extras
# Optional: Add specific language models
uv sync --extra lang-de # German
uv sync --extra lang-es # Spanish
uv sync --extra lang-fr # French
# Optional: Add embeddings support
uv sync --extra embeddings
# (includes sentence-transformers and OpenAI SDK for API backends)
# Optional: Add sentiment analysis support
uv sync --extra sentiment
# Optional: Add UD CoNLL-U export support
uv sync --extra ud
# Optional: Add Parquet export support
uv sync --extra parquet
# Combine multiple extras
uv sync --extra embeddings --extra sentiment --extra lang-de --extra lang-fr
Models are auto-downloaded by the CLI when missing if you pass --models-auto-download (enabled by default). No extra script is required.
To install models manually instead:
# Example for German
uv pip install de-core-news-sm
# or
python -m spacy download de_core_news_sm
You can invoke the CLI in any of these equivalent ways (recommended first):
- uv run rookeen [options]
- uv run python -m rookeen [options]
- uv run python -m rookeen.cli [options]
Analyze a URL:
uv run rookeen analyze "https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=1201524046" \
--format json --models-auto-download --lang en --robots ignore -o results/nlp.json
Analyze a URL with embeddings:
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-embeddings --models-auto-download --robots ignore -o results/cat_analysis.json
Select an embeddings backend explicitly (MiniLM local):
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-embeddings --embeddings-backend miniLM --models-auto-download -o results/cat_minilm.json
Use the BGE-M3 backend (local, HF model id):
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-embeddings --embeddings-backend bge-m3 --embeddings-model BAAI/bge-m3 \
--robots ignore -o results/cat_bge_m3.json
Use the OpenAI TE3 backend (API-based):
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-embeddings --embeddings-backend openai-te3 --embeddings-model text-embedding-3-small \
--openai-api-key "$OPENAI_API_KEY" --robots ignore -o results/cat_openai_te3.json
Preload embeddings model/client at startup to avoid first-call latency:
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-embeddings --embeddings-backend bge-m3 --embeddings-model BAAI/bge-m3 \
--embeddings-preload -o results/cat_bge_m3_preloaded.json
Analyze a URL with sentiment analysis:
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-sentiment --models-auto-download --robots ignore -o results/cat_sentiment.json
Analyze a local file:
uv run rookeen analyze-file ./samples/article.txt \
--format json --models-auto-download --lang en -o results/article.json
Stream text from stdin (Unix pipeline integration):
echo 'Hello world' | uv run rookeen analyze --stdin --lang en --stdout | jq '.language.code'
# Output: "en"
# Language auto-detection
echo 'Bonjour le monde' | uv run rookeen analyze --stdin --stdout | jq '.language.code'
# Output: "fr"
# With analyzer selection
cat article.txt | uv run rookeen analyze --stdin --enable pos --disable keywords --stdout
Batch mode (one URL per line; # comments allowed):
uv run rookeen batch urls.txt --output-dir results --format json --models-auto-download
Responsible crawling with rate limiting and robots.txt support:
# Respect robots.txt with custom rate limiting (2 requests/second)
uv run rookeen analyze "https://example.com/article" --rate-limit 2.0 --robots respect
# Ignore robots.txt for research purposes (use responsibly!)
uv run rookeen analyze "https://example.com/article" --robots ignoreRun 'uv run rookeen analyze --help' to see all available flags.
Common flags:
- `--lang`: override language detection (e.g., `en`, `de`, `es`, `fr`) - takes highest precedence
- `--languages`: preload models, comma-separated (e.g., `en,de`)
- `--models-auto-download`/`--no-models-auto-download`: install missing spaCy models automatically
- `--enable-embeddings`: enable sentence embeddings analysis (requires `--extra embeddings`)
- `--embeddings-preload`/`--no-embeddings-preload`: preload embeddings backend/model at startup to avoid first-call latency
- `--embeddings-backend {miniLM,bge-m3,openai-te3}`: choose embeddings backend
- `--embeddings-model <id>`: model identifier for the selected backend (e.g., `sentence-transformers/all-MiniLM-L6-v2`, `BAAI/bge-m3`, `text-embedding-3-small`)
- `--openai-api-key <key>`: API key for OpenAI backend (falls back to env)
- `--enable-sentiment`: enable sentiment analysis (requires `--extra sentiment`)
- `--enable <analyzer>`: enable specific analyzers by name (can be used multiple times)
- `--disable <analyzer>`: disable specific analyzers by name (can be used multiple times)
- `--rate-limit <float>`: rate limit in requests per second (default: 0.5)
- `--robots <respect|ignore>`: robots.txt policy (default: respect)
- `--verbose`: verbose logs
- `--errors-json`: force machine-readable JSON error output for automation
- `--stdin`: read text from stdin instead of URL (analyze command only)
- `--stdout`: stream JSON to stdout for pipeline composition
# Analyze and extract language
uv run rookeen analyze "https://example.com" --stdout | jq '.language.code'
# Chain multiple operations
uv run rookeen analyze "https://example.com" --stdout | \
jq '.analyzers[] | select(.name == "lexical_stats") | .results.total_tokens'
# Validate benchmark success
uv run python bench/run_bench.py --json --quiet | jq '.[0].return_code == 0'
# Get performance summary
uv run python bench/run_bench.py --json --quiet | \
uv run python scripts/pipeline_tools.py analyze
# Filter slow benchmarks
uv run python bench/run_bench.py --json --quiet | \
uv run python scripts/pipeline_tools.py filter --min-time 2.5
# CI/CD integration
if uv run python bench/run_bench.py --quiet --no-save; then
echo "All benchmarks passed"
else
echo "Some benchmarks failed"
exit 1
fi
Goal: Analyze personal blog posts or favorite websites to understand writing style, readability, and key topics without complex setup.
Environment:
- Personal laptop (macOS/Linux/Windows)
- Basic Python knowledge
- No special infrastructure requirements
- Single-user, occasional use
What happens:
# Quick installation
uv sync
# Analyze a blog post from URL
uv run rookeen analyze "https://myblog.com/post-about-nlp" \
--models-auto-download --robots ignore -o my_analysis.json
# Check readability and keywords
cat my_analysis.json | jq '.analyzers[] | select(.name == "readability")'
cat my_analysis.json | jq '.analyzers[] | select(.name == "keywords") | .results.keyphrases[0:5]'
# Analyze local draft text
echo "Your draft text here..." | uv run rookeen analyze --stdin --lang en --stdout \
| jq '{readability: .analyzers[] | select(.name == "readability") | .results.flesch_kincaid_grade}'
Why it works for hobbyist:
- Zero configuration: Auto-downloads models, sensible defaults work out of the box
- Human-readable output: JSON can be explored with `jq` or opened in any text editor
- No infrastructure: Runs locally, no cloud services or API keys needed
- Immediate results: Get linguistic insights in seconds without learning complex NLP concepts
- Free and open: No licensing costs, works offline after initial setup
- Educational: Understand your writing through metrics like readability scores and keyword extraction
Goal: Build production NLP features or conduct research requiring semantic embeddings, sentiment analysis, and batch processing with performance tracking.
Environment:
- Development workstation with Python 3.10+
- Research or production codebase integration
- Optional: GPU for faster embeddings (BGE-M3)
- Optional: OpenAI API key for TE3 embeddings
- Multi-language support needed
What happens:
# Install with all extras for research
uv sync --group dev --all-extras
# Full analysis with embeddings and sentiment
uv run rookeen analyze "https://research-paper.com/article" \
--enable-embeddings --embeddings-backend bge-m3 \
--enable-sentiment --models-auto-download \
--export-parquet --export-conllu --conllu-engine stanza \
-o results/full_analysis
# Batch process research corpus
cat research_urls.txt | uv run rookeen batch --stdin \
--enable-embeddings --embeddings-backend miniLM \
--output-dir results/batch --format json
# Pipeline integration for ML workflows
uv run rookeen analyze "https://dataset-sample.com" --stdout \
--enable-embeddings --embeddings-backend bge-m3 \
| jq '.analyzers[] | select(.name == "embeddings") | .results.vector' \
| python ml_pipeline.py --input-format json
# Performance benchmarking for production
uv run python bench/run_bench.py --json --quiet \
| uv run python scripts/pipeline_tools.py analyze \
| jq '.avg_time'
# Selective analyzers for cost optimization
uv run rookeen analyze "https://api-content.com/data" \
--enable pos --enable ner --enable-embeddings \
--disable keywords --disable readability \
-o results/optimized.json
Why it works for pro-dev | researcher:
- Production-ready: Proper exit codes, error handling, and JSON schema validation for CI/CD
- Advanced features: Semantic embeddings (3 backends), sentiment analysis, CoNLL-U export for research
- Performance control: Selective analyzer enable/disable, preloading, benchmark harness for regression detection
- Unix composability: Pipes seamlessly with `jq`, `awk`, and other tools for complex workflows
- Research formats: CoNLL-U export with UD validation, Parquet for analytics, machine-readable timing
- Scalable: Batch processing, rate limiting, robots.txt respect for ethical web scraping
- Multi-language: Supports en/de/es/fr with auto-detection and model management
Goal: Team develops NLP features together with consistent analysis, shared configurations, and version-controlled workflows.
Environment:
- Shared codebase repository (Git)
- Team members on different platforms
- Shared configuration files (`rookeen.toml`)
- CI/CD pipeline integration
- Collaborative documentation
What happens:
# Shared team configuration
cat rookeen.toml
# [rookeen]
# format = "json"
# models_auto_download = true
# embeddings_backend = "miniLM"
# default_language = "en"
# Consistent analysis across team
uv run rookeen analyze "https://product-content.com/feature" \
--config rookeen.toml --enable-embeddings -o team_results/feature.json
# CI/CD validation pipeline
uv run python scripts/validate_for_ci.py team_results/*.json
# Team benchmark suite
uv run python bench/run_bench.py --json --quiet \
| uv run python scripts/pipeline_tools.py analyze \
| tee benchmark_report.json
# Batch processing with team standards
uv run rookeen batch team_urls.txt \
--output-dir team_results --format json \
--enable pos --enable ner --enable-embeddings
# Compare team benchmark runs
uv run python scripts/pipeline_tools.py compare \
bench/results/latest.json bench/results/previous.json
# Error handling for automation
uv run rookeen analyze "https://api.example.com" --errors-json \
--stdout 2>/dev/null | jq -e '.language.code' || echo "Analysis failed"
Why it works for team:
- Configuration management: TOML config files version-controlled in Git, consistent settings across team
- Validation: JSON schema validation ensures data contracts, prevents integration issues
- Reproducibility: Benchmark harness tracks performance, detects regressions before deployment
- Automation: Proper exit codes, `--errors-json` flag, CI/CD integration ready
- Collaboration: Standardized output formats (JSON, Parquet), easy to share and review results
- Documentation: Self-documenting via JSON schema, clear error messages, comprehensive examples
- Scalability: Batch processing handles team workloads, rate limiting prevents API abuse
Goal: Deploy rookeen in enterprise R&D environments requiring SLA compliance, performance monitoring, cost optimization, and integration with existing ML pipelines.
Environment:
- Production servers or cloud infrastructure
- Kubernetes/Docker deployment
- Monitoring and alerting systems (Prometheus, Grafana)
- ML pipeline integration (Spark, Airflow, Kubeflow)
- High-volume batch processing
- Multi-region deployment considerations
What happens:
# Production configuration with environment variables
export ROOKEEN_EMBEDDINGS_BACKEND=bge-m3
export ROOKEEN_EMBEDDINGS_MODEL=BAAI/bge-m3
export ROOKEEN_MODELS_AUTO_DOWNLOAD=true
export ROOKEEN_RATE_LIMIT=10.0
export ROOKEEN_ROBOTS=respect
# High-volume batch processing with monitoring
uv run rookeen batch enterprise_corpus.txt \
--output-dir /data/results --format json \
--enable-embeddings --embeddings-preload \
--export-parquet 2>&1 | tee processing.log
# Performance monitoring and SLA tracking
uv run rookeen analyze "https://enterprise-content.com" --stdout \
| jq '{processing_time: .timing.total_seconds,
started_at: .timing.started_at,
analyzers: [.analyzers[] | {name, processing_time}]}' \
| send_to_monitoring.sh
# Cost-optimized selective analysis
uv run rookeen analyze "https://enterprise-api.com/data" \
--enable pos --enable ner --enable-embeddings \
--embeddings-backend miniLM \
--disable keywords --disable readability --disable sentiment \
-o /data/cost_optimized.json
# UD CoNLL-U export with Stanza engine (Level 1–2 compliance)
uv run rookeen analyze "https://enterprise-api.com/data" \
--export-conllu --conllu-engine stanza --models-auto-download -o /data/conllu.conllu
# Integration with Spark/analytics pipelines
uv run rookeen batch enterprise_urls.txt \
--export-parquet --output-dir /data/parquet \
| spark-submit --class ProcessRookeenData spark_job.py
# Automated regression detection
uv run python bench/run_bench.py --json --quiet --no-save \
| uv run python scripts/pipeline_tools.py analyze \
| jq 'if .avg_time > 5.0 then "SLA violation" else "OK" end'
# Multi-language production setup
export ROOKEEN_LANGUAGES_PRELOAD=en,de,es,fr
uv run rookeen analyze "https://multilingual-site.com" \
--enable-embeddings --models-auto-download \
-o /data/multilingual.json
# Error handling and alerting
if ! uv run rookeen analyze "$URL" --errors-json --stdout > /tmp/result.json 2>/tmp/error.json; then
ERROR_CODE=$?
ERROR_JSON=$(cat /tmp/error.json)
send_alert.sh --error-code $ERROR_CODE --error "$ERROR_JSON"
exit $ERROR_CODE
fi
Why it works for enterprise R&D:
- SLA compliance: Machine-readable timing data, per-analyzer performance tracking, benchmark harness for regression detection
- Cost optimization: Selective analyzer enable/disable, multiple embedding backends (local vs API), Parquet export for efficient storage
- Production reliability: Proper exit codes, structured error handling, JSON schema validation, robots.txt respect
- Scalability: Batch processing, rate limiting, preloading for consistent latency, Parquet export for big data workflows
- Integration: Unix pipeline composability, structured JSON output, Parquet format for Spark/DuckDB, programmatic API ready
- Monitoring: Timing and provenance data for dashboards, error codes for alerting, benchmark suite for performance tracking
- Multi-language: Supports enterprise multilingual content with auto-detection and model management
- Compliance: Respects robots.txt, configurable rate limiting, audit trail via timing and metadata
Language Detection:
- Precedence: `--lang` CLI flag > config default > auto-detection
- Warnings: Low-confidence (< 0.6) auto-detection emits warnings
- Normalization: Language codes are automatically normalized (e.g., `en-US` → `en`)
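A minimal sketch of this resolution order (illustrative only; the function and variable names are hypothetical, not Rookeen's internals):
def resolve_language(cli_lang: str | None, config_default: str | None,
                     detected: str | None, confidence: float) -> str:
    """Illustrative precedence: CLI flag > config default > auto-detection."""
    def normalize(code: str) -> str:
        return code.split("-")[0].lower()  # e.g. "en-US" -> "en"

    if cli_lang:
        return normalize(cli_lang)
    if config_default:
        return normalize(config_default)
    if detected and confidence < 0.6:
        print(f"warning: low-confidence language detection ({confidence:.2f})")
    return normalize(detected or "en")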
Rookeen features a flexible analyzer selection system that allows you to run only the analyzers you need, optimizing for cost, latency, and specific use cases. All analyzers are registered in a central registry and can be selectively enabled or disabled.
- Core analyzers (always available):
  - `dependency`: Dependency parsing and grammatical relations
  - `keywords`: YAKE-based keyword and keyphrase extraction
  - `lexical_stats`: Token counts, sentence length, TTR, top lemmas
  - `ner`: Named entity recognition with entity types and counts
  - `pos`: Part-of-speech tagging with UPOS counts and ratios
  - `readability`: Readability metrics (Flesch, FK grade, SMOG, ARI, etc.)
- Optional analyzers (require additional dependencies):
  - `embeddings`: Sentence embeddings using sentence-transformers
  - `sentiment`: Sentiment analysis with VADER/TextBlob prioritization
Use --enable and --disable flags to run only specific analyzers:
# Run only POS and NER analyzers
uv run rookeen analyze "https://example.com/article" \
--enable pos --enable ner --models-auto-download -o results/pos_ner.json
# Run all analyzers except keywords and readability
uv run rookeen analyze "https://example.com/article" \
--disable keywords --disable readability --models-auto-download -o results/minimal.json
# Combine selective flags with optional analyzers
uv run rookeen analyze "https://example.com/article" \
--enable pos --enable ner --enable-embeddings --enable-sentiment \
--models-auto-download -o results/combined.json
The selective analyzer flags work seamlessly with existing optional analyzer flags:
# This runs only embeddings (selective control)
uv run rookeen analyze "https://example.com/article" \
--enable embeddings --models-auto-download -o results/embeddings_only.json
# This runs all analyzers + sentiment
uv run rookeen analyze "https://example.com/article" \
--enable-sentiment --models-auto-download -o results/all_plus_sentiment.json
Rookeen uses a plugin-based architecture where all analyzers are registered in a central registry:
# Programmatic access to available analyzers
from rookeen.analyzers.base import available_analyzers, get_analyzer
# List all available analyzers
analyzers = available_analyzers()
print(analyzers) # ['dependency', 'keywords', 'lexical_stats', 'ner', 'pos', 'readability']
# Get a specific analyzer class
pos_analyzer = get_analyzer('pos')
Performance Optimization:
# For entity extraction only
uv run rookeen analyze "https://news.example.com" \
--enable ner --models-auto-download -o results/entities.json
Cost-Effective Analysis:
# Basic linguistic analysis without expensive computations
uv run rookeen analyze "https://blog.example.com" \
--enable pos --enable lexical_stats --models-auto-download -o results/basic.json
Comprehensive ML Pipeline:
# Full analysis including embeddings for ML applications
uv run rookeen analyze "https://research.example.com" \
--enable-embeddings --enable-sentiment --models-auto-download -o results/ml_ready.json
Analyzer selection works with batch processing:
# Process multiple URLs with consistent analyzer configuration
uv run rookeen batch urls.txt \
--enable pos --enable ner --enable-embeddings \
--output-dir results --models-auto-download
The plugin registry enables adding custom analyzers:
from rookeen.analyzers.base import BaseAnalyzer, register_analyzer
from rookeen.models import AnalysisType, LinguisticAnalysisResult
@register_analyzer
class CustomAnalyzer(BaseAnalyzer):
name = "custom"
analysis_type = AnalysisType.CUSTOM
async def analyze(self, doc, lang: str) -> LinguisticAnalysisResult:
# Your custom analysis logic
return LinguisticAnalysisResult(
analysis_type=self.analysis_type,
name=self.name,
results={"custom_metric": 0.85},
processing_time=0.1,
confidence=0.9
)
- Precedence: CLI flags > environment variables (`ROOKEEN_` prefix) > TOML config file > defaults.
- Config file: pass at the root via `--config PATH` (flat keys or a `[rookeen]` table).
- Environment variables (examples):
export ROOKEEN_FORMAT=md
export ROOKEEN_OUTPUT_DIR=results
export ROOKEEN_LANGUAGES_PRELOAD=en,de
export ROOKEEN_DEFAULT_LANGUAGE=en
export ROOKEEN_MODELS_AUTO_DOWNLOAD=false
export ROOKEEN_EMBEDDINGS_BACKEND=miniLM # or bge-m3, openai-te3
export ROOKEEN_EMBEDDINGS_MODEL=BAAI/bge-m3 # backend-specific model id
export ROOKEEN_OPENAI_API_KEY=$OPENAI_API_KEY # for openai-te3 backend
- TOML example:
[rookeen]
format = "json"
output_dir = "results"
languages_preload = ["en", "de"]
default_language = "en"
models_auto_download = true
log_level = "INFO"
# Embeddings defaults (optional)
embeddings_backend = "miniLM"
embeddings_model = "sentence-transformers/all-MiniLM-L6-v2"
openai_api_key = ""- Using a config file:
uv run rookeen --config ./rookeen.toml analyze "https://example.org/article" -o results/out
# Or place rookeen.toml in current directory (auto-loaded)
uv run rookeen analyze "https://example.org/article" -o results/outExit codes:
- 0 OK, 1 generic error, 2 usage error, 3 fetch error, 4 model error
Rookeen provides stable exit codes and machine-readable error output for automation and CI/CD pipelines:
- `--errors-json`: Force machine-readable JSON error output instead of user-friendly text
- Consistent exit codes: Reliable error codes for scripting and monitoring
- JSON error format: Structured error information for programmatic handling
Examples:
Normal error output (user-friendly):
$ uv run rookeen analyze
Usage: python -m rookeen.cli analyze [OPTIONS] URL
Try 'python -m rookeen.cli analyze -h' for help.
Error: Missing argument 'URL'.
$ echo $?
2
JSON error output (machine-readable):
$ uv run rookeen --errors-json analyze
{"error": {"code": 2, "name": "USAGE_ERROR", "message": "Invalid CLI arguments"}}
$ echo $?
2
Use cases:
- CI/CD pipelines that need to parse error conditions
- Monitoring systems checking exit codes
- Scripts that need structured error information
- Automated retry logic based on error types
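A minimal sketch of retry logic keyed to these exit codes (it assumes only the documented codes above; the helper itself is illustrative, not part of Rookeen):
import json
import subprocess
import time

FETCH_ERROR = 3  # documented exit code for fetch errors

def analyze_with_retry(url: str, retries: int = 3) -> dict:
    """Run the rookeen CLI, retrying only transient fetch errors."""
    for attempt in range(retries):
        proc = subprocess.run(
            ["uv", "run", "rookeen", "--errors-json", "analyze", url, "--stdout"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return json.loads(proc.stdout)
        if proc.returncode != FETCH_ERROR:
            raise RuntimeError(proc.stderr.strip())  # usage/model errors: fail fast
        time.sleep(2 ** attempt)  # exponential backoff before retrying the fetch
    raise RuntimeError(f"fetch failed after {retries} attempts: {url}")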
{
"tool": "rookeen",
"version": "0.1.0",
"source": {
"type": "url",
"value": "https://…",
"fetched_at": 1757900000.0,
"domain": "en.wikipedia.org"
},
"language": {
"code": "en",
"confidence": 0.98,
"model": "en_core_web_sm"
},
"content": {
"title": "Natural language processing — Wikipedia",
"char_count": 12345,
"word_count": 2345
},
"analyzers": [
{
"name": "lexical_stats",
"processing_time": 0.12,
"confidence": 1.0,
"results": {
"total_tokens": 2200,
"unique_lemmas": 900,
"sentences": 120,
"avg_token_length": 4.7,
"avg_sentence_length_tokens": 18.3,
"type_token_ratio": 0.41,
"top_lemmas": [["language", 50], ["model", 32]]
},
"metadata": {
"language": {"code": "en", "confidence": 0.98},
"model": "en_core_web_sm"
}
},
{
"name": "embeddings",
"processing_time": 5.45,
"confidence": 1.0,
"results": {
"supported": true,
"backend": "bge-m3",
"model": "BAAI/bge-m3",
"dim": 1024,
"normalized": true,
"vector": [0.044, -0.063, 0.078, 0.063, 0.007, ...]
},
"metadata": {
"language": {"code": "en", "confidence": 0.98},
"model": "en_core_web_sm"
}
},
{
"name": "sentiment",
"processing_time": 1.5,
"confidence": 0.9999,
"results": {
"supported": true,
"label": "positive",
"score": 0.9999,
"method": "vader",
"scores": {
"neg": 0.035,
"neu": 0.909,
"pos": 0.056,
"compound": 0.9999
}
},
"metadata": {
"language": {"code": "en", "confidence": 0.98},
"model": "en_core_web_sm"
}
}
// Additional analyzers: "pos", "ner", "readability", "keywords"
],
"timing": {
"started_at": 1758247007.029181,
"ended_at": 1758247007.317931,
"total_seconds": 0.28876041690818965
}
}
Analyzer highlights:
- lexical_stats: token and sentence counts, TTR, top lemmas
- pos: `upos_counts`, `upos_ratios`, `top_lemmas_by_upos`
- ner: `supported`, `counts_by_label`, `examples_by_label`, `total_entities`
- readability: Flesch, FK grade, SMOG, ARI, CLI, Linsear, Dale–Chall, text standard
- keywords: YAKE-based keyphrases
- embeddings: 384-dimensional sentence vectors using `sentence-transformers/all-MiniLM-L6-v2`
- sentiment: smart analysis using VADER/TextBlob/spaCy with confidence scores and detailed breakdowns
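To pull a specific analyzer's results out of the output JSON programmatically, a small sketch based on the schema snippet above (the file path is illustrative):
import json

with open("results/nlp.json") as fh:
    report = json.load(fh)

# Index analyzer results by name for convenient lookup
by_name = {a["name"]: a["results"] for a in report["analyzers"]}

print(report["language"]["code"])                # e.g. "en"
print(by_name["lexical_stats"]["total_tokens"])  # e.g. 2200
print(report["timing"]["total_seconds"])         # overall processing time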
Rookeen supports generating sentence-level semantic embeddings for advanced NLP applications including similarity search, content clustering, and semantic analysis.
uv sync --extra embeddings
Generate embeddings alongside linguistic analysis:
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-embeddings --models-auto-download --robots ignore -o results/cat_analysis.json
Backends:
- `miniLM` (local; default model `sentence-transformers/all-MiniLM-L6-v2`, dim 384)
- `bge-m3` (local; model `BAAI/bge-m3`, dim 1024)
- `openai-te3` (API; models `text-embedding-3-small` [1536], `text-embedding-3-large` [3072])
Preload to stabilize tail latency (optional): `--embeddings-preload`.
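For reference, here is roughly what the miniLM backend does, sketched directly with sentence-transformers (this is not Rookeen's internal code):
from sentence_transformers import SentenceTransformer

# Default miniLM model; normalization makes dot products equal cosine similarity
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vector = model.encode("Hello world", normalize_embeddings=True)
print(vector.shape)  # (384,)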
Performance, memory, and security notes:
- BGE-M3 defaults to CPU; if CUDA is available it auto-selects the GPU. To control devices: `CUDA_VISIBLE_DEVICES=0`.
- BGE-M3 size: typical on-disk weights are ~2.1 GB (fp32) according to the model card; fp16/bfloat16 inference can use ~1.1 GB VRAM. The local HF cache can be larger (e.g., multiple shards/backends), so you may see >2 GB on disk. Start with batch size 1 and mark tests as `@pytest.mark.slow`.
- OpenAI API: never log API keys; pass via `--openai-api-key` or env (`ROOKEEN_OPENAI_API_KEY`/`OPENAI_API_KEY`).
- OpenAI timeouts: set `ROOKEEN_OPENAI_TIMEOUT` or `OPENAI_TIMEOUT` (seconds).
- Pluggable backends: miniLM, BGE-M3, OpenAI TE3
- Provenance: results include `backend`, `model`, `dim`, `normalized`
- Normalization: L2-normalized embeddings for cosine similarity
- Format: JSON-serializable float arrays ready for ML pipelines
- Document Similarity: Compare semantic similarity between documents
- Content Clustering: Group related articles by topic automatically
- Semantic Search: Find content by meaning, not just keywords
- ML Pipelines: Feed embeddings directly into downstream ML models
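Because the vectors are L2-normalized, cosine similarity reduces to a dot product. A minimal sketch of the document-similarity use case above (file paths are illustrative):
import json

def load_vector(path: str) -> list[float]:
    """Extract the embeddings vector from a Rookeen result file."""
    with open(path) as fh:
        report = json.load(fh)
    emb = next(a for a in report["analyzers"] if a["name"] == "embeddings")
    return emb["results"]["vector"]

a = load_vector("results/doc_a.json")
b = load_vector("results/doc_b.json")

# L2-normalized vectors: dot product == cosine similarity
similarity = sum(x * y for x, y in zip(a, b))
print(f"cosine similarity: {similarity:.4f}")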
{
"name": "embeddings",
"results": {
"supported": true,
"backend": "bge-m3",
"model": "BAAI/bge-m3",
"dim": 1024,
"normalized": true,
"vector": [0.0439, -0.0627, 0.0780, 0.0630, 0.0069, ...]
}
}
- Processing time: ~1 second for typical web articles
- File size impact: ~3KB per document (384 floats × 8 bytes)
- Memory usage: ~23MB model (lazy-loaded, shared across analyses)
- `--enable-embeddings`: Enable sentence embeddings generation
- `--embeddings-backend`: Select backend (miniLM, bge-m3, openai-te3)
- `--embeddings-model`: Backend-specific model id
- `--embeddings-preload`/`--no-embeddings-preload`: Preload backend/model at startup
- `--openai-api-key`: API key for OpenAI backend (or use env)
Rookeen supports sentiment analysis using intelligent library prioritization for fast, accurate results without custom implementations.
uv sync --extra sentiment
Analyze sentiment alongside linguistic analysis:
uv run rookeen analyze "https://en.wikipedia.org/wiki/Cat" \
--enable-sentiment --models-auto-download --robots ignore -o results/cat_sentiment.jsonRookeen uses smart library selection for optimal performance:
- VADER (Primary): Fastest, most reliable for general text (~0.0001s processing)
- TextBlob (Fallback): Good balance of speed and accuracy
- spaCy (Last resort): If available with sentiment extension
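A minimal sketch of this kind of prioritized fallback (not Rookeen's internal implementation; it assumes the vaderSentiment and textblob packages):
def sentiment(text: str) -> dict:
    """Try VADER first, then TextBlob; return a label, score, and method."""
    try:
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        scores = SentimentIntensityAnalyzer().polarity_scores(text)
        c = scores["compound"]
        label = "positive" if c > 0.05 else "negative" if c < -0.05 else "neutral"
        return {"label": label, "score": c, "method": "vader", "scores": scores}
    except ImportError:
        pass
    try:
        from textblob import TextBlob
        p = TextBlob(text).sentiment.polarity
        label = "positive" if p > 0 else "negative" if p < 0 else "neutral"
        return {"label": label, "score": p, "method": "textblob"}
    except ImportError:
        return {"label": "neutral", "score": 0.0, "method": "unsupported"}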
- Methods: VADER compound scoring, detailed breakdowns (positive/neutral/negative)
- Labels: positive, negative, neutral with confidence scores
- Performance: Microsecond processing with high accuracy
- Reliability: Uses battle-tested, well-maintained libraries
- Content Analysis: Determine overall sentiment of articles and web pages
- Brand Monitoring: Track sentiment in customer content
- Content Classification: Automatically categorize content by emotional tone
- ML Pipelines: Feed sentiment scores into downstream models
{
"name": "sentiment",
"results": {
"supported": true,
"label": "positive",
"score": 0.9999,
"method": "vader",
"scores": {
"neg": 0.035,
"neu": 0.909,
"pos": 0.056,
"compound": 0.9999
}
}
}
- Processing time: ~0.0001-0.0002 seconds (microseconds)
- Confidence: high confidence scores (e.g., 0.9999) on real content
- Memory usage: Minimal (VADER has no model dependencies)
- `--enable-sentiment`: Enable sentiment analysis (requires `--extra sentiment`)
Rookeen supports exporting linguistic analysis in CoNLL-U format (Universal Dependencies standard) with two quality levels:
uv sync --extra ud
Export CoNLL-U alongside JSON analysis:
uv run rookeen analyze "https://example.com/article" \
--export-conllu --conllu-engine stanza --models-auto-download -o results/article
Stanza Engine (Recommended)
- `--conllu-engine stanza`: Uses Stanford's Stanza for UD-native parsing
- Quality: Passes UD validation Levels 1-2 (format compliance)
- Use case: Production-grade CoNLL-U for research and analysis
- Limitations: Level 3 validation may show content-related errors on complex web text (expected behavior)
Basic Engine (Fallback)
- `--conllu-engine basic`: Heuristic conversion from spaCy parsing
- Quality: Level 1 compliant, not UD-validated
- Use case: Simple text analysis when Stanza is unavailable
- Warning: Not recommended for web content or research use
Our implementation achieves industry-standard compliance:
| Content Type | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| Literary prose (Pride & Prejudice) | PASSED | PASSED | errors* |
| Web content (Wikipedia) | PASSED | PASSED | errors* |
*Level 3 errors are linguistic content issues expected when parsing web navigation and complex text structures. Level 1-2 compliance is production-grade for most NLP applications.
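To inspect an exported file programmatically, one option is the third-party conllu package (a sketch; the file path is illustrative):
from conllu import parse  # pip install conllu

with open("results/article.conllu") as fh:
    sentences = parse(fh.read())

# Each sentence is a TokenList; tokens expose UD fields such as upos and head
for sentence in sentences[:3]:
    for token in sentence:
        print(token["form"], token["upos"], token["head"])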
- `--export-conllu`: Enable CoNLL-U export
- `--conllu-engine {auto,stanza,basic}`: Choose parsing engine (default: auto)
- `--ud-auto-download`: Auto-download Stanza models (default: true)
- `--allow-non-ud-conllu`: Allow basic engine fallback when Stanza is unavailable
Rookeen supports exporting analyzer summary tables in Apache Parquet format for analytics workflows (Spark, DuckDB, pandas, etc.).
uv sync --extra parquet
Add `--export-parquet` to any CLI command to write a `<base>.parquet` file with analyzer-level aggregates:
uv run rookeen analyze "https://example.com/article" \
--export-parquet --models-auto-download -o results/article
- The Parquet file contains one row per analyzer, with flat scalar results (int, float, str, bool).
- If present, `metadata.model` and `metadata.language` are included as columns for traceability and analytics.
- Optional dependency: `pyarrow>=16` (installed via `--extra parquet`).
Example output columns:
`name`, `processing_time`, `confidence`, `metadata.model`, `metadata.language`, `results.total_tokens`, ...
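One way to consume the file for analytics, sketched with pandas (assumes pyarrow is installed and uses the output path from the example above):
import pandas as pd

# One row per analyzer; requires pyarrow (installed via --extra parquet)
df = pd.read_parquet("results/article.parquet")

print(df[["name", "processing_time", "confidence"]])
print("total analyzer time:", df["processing_time"].sum())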
- `--export-parquet`: Write analyzer summary table to Parquet
Rookeen provides additional CLI utilities specifically designed for Unix pipeline composition and data analysis workflows.
- Pure structured output: JSON/CSV formats for reliable pipeline processing
- Advanced filtering: Filter results by success, time, language, analyzers
- Statistical analysis: Comprehensive performance analysis and comparison
- Comparison tools: Compare benchmark results across different runs
- Exit code handling: Proper error codes for automation and CI/CD
# Analyze benchmark performance statistics
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py analyze
# Filter for slow benchmarks (> 2.5 seconds)
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --min-time 2.5
# Count successful benchmarks only
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --success-only --count
# Filter by specific analyzer
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --analyzer embeddings
# Compare two benchmark runs
uv run python scripts/pipeline_tools.py compare bench/results/latest.json bench/results/previous.json
{
"total": 4,
"successful": 4,
"failed": 0,
"success_rate": 1.0,
"avg_time": 2.291,
"min_time": 2.255,
"max_time": 2.329,
"total_time": 9.163,
"fastest_case": "en_wiki_full",
"slowest_case": "en_wiki_embeddings"
}
The E2E suite provides comprehensive regression protection across multiple languages and sources, with robust error handling and thorough assertions.
- 20 tests covering English, German, Spanish, French, and multilingual content
- Stable sources: Wikipedia `oldid` pages and Project Gutenberg texts
- Comprehensive assertions: Language detection, POS coverage, NER support, analyzer metadata validation
- Robust execution: 5-minute timeouts with exponential backoff retries
Note: For full test coverage (including embeddings, sentiment, and CoNLL-U tests), sync all optional dependencies:
uv sync --group dev --all-extras
# Run full E2E suite - takes some time
uv run pytest -q tests/e2e -v -n auto
# Run smoke tests only (fast CI validation - 2 tests, 18 deselected)
uv run pytest -q -m smoke tests/e2e -n auto
# Run specific test categories
uv run pytest tests/e2e/test_conllu_stanza.py -n auto # CoNLL-U validation
uv run pytest tests/e2e/test_analyzer_selection.py -n auto # Analyzer selection
uv run pytest tests/e2e/test_mixed_batch.py -n auto # Batch processing
uv run pytest tests/e2e/test_testdata_integration.py -n auto # Test data file integration
# Run backend-specific embeddings tests
uv run pytest -q tests/e2e/test_embeddings_minilm.py -n auto
uv run pytest -q tests/e2e/test_embeddings_bge_m3.py -n auto
# Requires OPENAI_API_KEY
uv run pytest -q -m external tests/e2e/test_embeddings_openai_te3.py -n auto
Note:
- CoNLL-U validation tests (`test_conllu_local.py`, `test_conllu_stanza.py`) rely on the UD Tools validator and will be skipped unless you provide its path.
- Obtain the validator: validate.py.
- Configure the path via the env var `UD_VALIDATE_SCRIPT` (or `ROOKEEN_UD_VALIDATE_SCRIPT`), or place a copy at `tools/validate.py`.
After syncing all dependencies, run the demo test suite to see rookeen in action:
# Install all dependencies first
uv sync --group dev --all-extras
# Run the demo test suite
uv run python scripts/run_demo_tests.py
This script runs 5 real-world usage tests across different scenarios:
- Basic web page analysis (BBC Technology)
- Technical blog with embeddings (Python.org)
- Sentiment analysis (file-based)
- Selective analyzers (Wikipedia)
- Stdin pipeline composition
All test artifacts and a comprehensive report are saved to the results/ folder.
Run fast smoke tests (CI default):
uv run pytest -q -m smoke -n auto
Run the main test suite without network:
uv run pytest -q -m "not external and not slow" -n autoRun all tests (including network) (Apple M4 ~ 2:30 minutes):
uv run pytest -q -n auto
Additional validation scripts test CLI behavior and edge cases. These scripts are integrated into CI to ensure CLI functionality doesn't regress.
CLI Chain Validation - Tests error handling, validation, and edge cases:
bash scripts/validate_cli_chain_fixed.sh
This script tests:
- CLI parsing errors (missing arguments, invalid options)
- File I/O errors (non-existent files)
- Network/fetch errors
- Successful operations with pipeline composition
- Flag combinations and edge cases
Note: This script intentionally tests error conditions and always exits with code 0, as it validates that errors are handled correctly.
Streaming Mode Validation - Tests stdin/stdout functionality:
bash scripts/validate_streaming_mode.sh
This script tests:
- Basic streaming validation
- Empty input handling
- Unicode/multilingual support
- Auto language detection
- Multi-line and large input handling
- Analyzer selection with stdin
- Export options with stdin
- Binary input handling
- Error validation for conflicting arguments
Note: This script tests functional behavior and will exit with non-zero code if any test fails, indicating a real regression.
Benchmarks (JSON pipeline):
uv run python bench/run_bench.py --json --quiet | jq '[.[] | {case, seconds, ok: .success}]'
Compute-only benchmark (stdin):
cat bench/sample.txt | uv run rookeen analyze --stdin --lang en --stdout \
| jq '{total_s: .timing.total_seconds, analyzers: [.analyzers[] | {name, processing_time}]}'
Performance note:
Use the repository's existing Performance Benchmarks section as the source of truth. Avoid hardcoding machine-specific numbers in docs; instead, generate a baseline JSON on your hardware and reference that in PRs if needed.
Rookeen includes a comprehensive performance benchmark harness for tracking latency and detecting performance regressions.
- 6 test scenarios: default analysis, embeddings-only, sentiment-only, full analysis (embeddings + sentiment), embeddings (BGE-M3), embeddings (OpenAI TE3, optional)
- Stable input: Wikipedia article snapshot for consistent, reproducible benchmarking
- Metrics captured: execution time, return codes, and success status
- Results storage: CSV format for trend analysis and JSON for programmatic access
The benchmark harness now supports industry-standard Unix pipeline patterns:
# Pure JSON output for reliable pipeline processing
uv run python bench/run_bench.py --json --quiet | jq '.[0].return_code == 0'
# Output: true
# Validate all benchmarks passed
uv run python bench/run_bench.py --json --quiet | jq 'all(.success)'
# Output: true
# Extract performance metrics
uv run python bench/run_bench.py --json --quiet | jq '[.[] | {case: .case, time: .seconds}]'
# Human-readable table format
uv run python bench/run_bench.py --format table
# CSV for data analysis and spreadsheets
uv run python bench/run_bench.py --format csv > benchmark_results.csv
# JSON for programmatic processing (default)
uv run python bench/run_bench.py --json
# Suppress progress output for CI/CD and scripts
uv run python bench/run_bench.py --quiet
# Combined with JSON for pure machine-readable output
uv run python bench/run_bench.py --json --quiet | jq '.[] | select(.success == false)'
# Performance analysis pipeline
uv run python bench/run_bench.py --json --quiet | \
jq '[.[] | select(.success == true)] | sort_by(.seconds) | .[0] | {fastest: .case, time: .seconds}'
# Generate performance report
uv run python bench/run_bench.py --json --quiet | \
jq -r '"Benchmark Report:
Total: \(length)
Passed: \([.[] | select(.success)] | length)
Avg Time: \((([.[] | .seconds] | add) / length) | .3f)s
Max Time: \([.[] | .seconds] | max | .3f)s"'# Interactive mode (default - human-friendly)
uv run python bench/run_bench.py
# Pipeline mode (machine-friendly)
uv run python bench/run_bench.py --json --quiet
# CI/CD mode (no output, check exit code)
uv run python bench/run_bench.py --quiet --no-save
# View latest results
cat bench/results/latest.json
# Advanced pipeline processing with analysis tools
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py analyze
# Compute-only local stdin benchmark (no network)
cat bench/sample.txt | uv run rookeen analyze --stdin --lang en --stdout \
| jq '{total_s: .timing.total_seconds, analyzers: [.analyzers[] | {name, processing_time}]}'
[
{
"timestamp": "2025-09-19T16:06:21.523567",
"case": "en_wiki",
"url": "https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=1201524046",
"language": "en",
"analyzers": "default",
"return_code": 0,
"seconds": 2.332,
"success": true
}
]
- Default analysis: ~3.9 seconds
- Embeddings analysis: ~7.5 seconds
- Sentiment analysis: ~3.7 seconds
- Full analysis: ~7.5 seconds
Rookeen provides comprehensive timing and provenance tracking for performance monitoring and SLA compliance. Every analysis includes machine-readable timing data and per-analyzer metadata.
Each JSON output includes overall processing timing:
- `started_at`: Unix timestamp when analysis began
- `ended_at`: Unix timestamp when analysis completed
- `total_seconds`: High-precision total processing duration
Each analyzer result includes detailed metadata:
- `processing_time`: Individual analyzer processing time
- `confidence`: Analyzer confidence score
- `metadata.language`: Language code and confidence used
- `metadata.model`: spaCy model used for processing
# Get timing information
uv run rookeen analyze "https://example.com" --robots ignore -o results/analysis.json
jq '.timing.total_seconds' results/analysis.json
# Monitor per-analyzer performance
uv run rookeen analyze "https://example.com" --robots ignore -o results/analysis.json
jq '.analyzers[].processing_time' results/analysis.json
# Check analyzer provenance
uv run rookeen analyze "https://example.com" --robots ignore -o results/analysis.json
jq '.analyzers[0].metadata' results/analysis.json
- SLA Monitoring: Track processing times against service level agreements
- Performance Optimization: Identify slow analyzers and optimize resource usage
- Cost Analysis: Monitor analyzer usage for cost optimization
- Debugging: Trace processing provenance for issue diagnosis
- Analytics: Build dashboards and reports on processing performance
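A small sketch of an SLA check built on this timing data (the threshold and file path are illustrative):
import json
import sys

SLA_SECONDS = 5.0  # illustrative per-document budget

with open("results/analysis.json") as fh:
    report = json.load(fh)

total = report["timing"]["total_seconds"]
slowest = max(report["analyzers"], key=lambda a: a["processing_time"])

print(f"total: {total:.3f}s, slowest analyzer: {slowest['name']} "
      f"({slowest['processing_time']:.3f}s)")
sys.exit(0 if total <= SLA_SECONDS else 1)  # non-zero exit flags an SLA breach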
Validate Rookeen JSON outputs against the official schema for data consistency and contract-first development.
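For validation inside Python, one option is the jsonschema package (a sketch; the schema path below is hypothetical, use wherever your copy of the schema lives):
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

with open("schemas/rookeen.schema.json") as fh:  # hypothetical schema location
    schema = json.load(fh)

with open("results/analysis.json") as fh:
    result = json.load(fh)

try:
    validate(instance=result, schema=schema)
    print("valid")
except ValidationError as exc:
    print(f"invalid: {exc.message}")
The bundled scripts below wrap the same kind of check for CI/CD: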
# Validate existing JSON files
uv run python scripts/validate_for_ci.py results/*.json
# Process URLs directly and validate (result-agnostic)
uv run python scripts/validate_for_ci.py "https://example.com" --enable-embeddings --verbose
# Run comprehensive validation test suite
uv run python scripts/test_validation.py
Rookeen provides several utility scripts for advanced usage and automation:
Advanced command-line utilities for data processing and analysis:
# Analyze benchmark results
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py analyze
# Filter results by criteria
uv run python bench/run_bench.py --json --quiet | uv run python scripts/pipeline_tools.py filter --success-only
# Compare benchmark runs
uv run python scripts/pipeline_tools.py compare bench/results/latest.json bench/results/previous.json
Interactive demonstration of Unix pipeline composability patterns:
# Run comprehensive pipeline examples
./scripts/pipeline_demo.sh
# Learn advanced pipeline techniques and best practices
- `scripts/validate_for_ci.py`: JSON schema validation for CI/CD pipelines
- `scripts/test_validation.py`: Comprehensive validation test suite
- `scripts/pipeline_demo.sh`: Interactive pipeline examples and tutorials
Rookeen now provides industry-standard Unix pipeline composability with these key improvements:
- stdout: Pure structured data (JSON, CSV, table)
- stderr: Human-readable messages, progress, errors
- Exit codes: 0=success, 1=failure for automation
- `--format json`: Machine-readable JSON (default)
- `--format csv`: Spreadsheet-compatible CSV
- `--format table`: Human-readable tables
- `--quiet`: Suppress progress for automation
- `--json`: Pure JSON output (implies `--quiet`)
- `--no-save`: Skip file operations for CI/CD
- `scripts/pipeline_tools.py`: Statistical analysis and filtering
- `scripts/pipeline_demo.sh`: Interactive examples and tutorials
- Integration with `jq`, `awk`, `sed`, and other Unix tools
# CI/CD integration
uv run python bench/run_bench.py --quiet --no-save || exit 1
# Data analysis pipeline
uv run python bench/run_bench.py --json --quiet | \
uv run python scripts/pipeline_tools.py analyze | \
jq '.avg_time'
# Performance monitoring
uv run python bench/run_bench.py --json --quiet | \
uv run python scripts/pipeline_tools.py filter --min-time 3.0 | \
jq length- The pipeline is spaCy-only.
- Models are installed on demand when `--models-auto-download` is used (the default). A separate download script is not required.