vector-embedder is a flexible, language-agnostic document ingestion and embedding pipeline. It transforms structured and unstructured content from multiple sources into vector embeddings and stores them in your vector database of choice.
It supports Git repositories, web URLs, and file types such as Markdown, PDF, and HTML, and is designed to run locally, in containers, or as OpenShift/Kubernetes jobs.
- Multi-DB support:
  - Redis (RediSearch)
  - Elasticsearch
  - PGVector (PostgreSQL)
  - SQL Server (preview)
  - Qdrant
  - Dry Run (no DB required; logs to console)
- Flexible input sources:
  - Git repositories via glob patterns (`**/*.pdf`, `*.md`, etc.)
  - Web pages via configurable URL lists
- Smart chunking with configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`
- Embeddings via `sentence-transformers`
- Parsing via LangChain + Unstructured
- UBI-compatible container, OpenShift-ready
- Fully configurable via `.env` or `-e` environment flags
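
To make the interaction between `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal character-based sketch. It is an illustration only: the pipeline parses documents via LangChain and Unstructured, and its actual splitter may respect separators or tokens rather than raw character offsets. The function name `chunk_text` is hypothetical and not part of this project.

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is an
    assumed, simplified model of overlap-based chunking.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the sample values (`CHUNK_SIZE=2048`, `CHUNK_OVERLAP=200`), each window in this sketch advances by 1848 characters, so the last 200 characters of one chunk reappear at the start of the next.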
Set your configuration in a `.env` file at the project root.

    # Temporary working directory
    TEMP_DIR=/tmp

    # Logging
    LOG_LEVEL=info

    # Sources
    REPO_SOURCES=[{"repo": "https://github.com/example/repo.git", "globs": ["docs/**/*.md"]}]
    WEB_SOURCES=["https://example.com/docs/", "https://example.com/report.pdf"]

    # Chunking
    CHUNK_SIZE=2048
    CHUNK_OVERLAP=200

    # Embeddings
    EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

    # Vector DB
    DB_TYPE=DRYRUN

`DB_TYPE=DRYRUN` logs chunks to stdout and skips database indexing, which makes it handy for development.
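
Because `REPO_SOURCES` and `WEB_SOURCES` hold JSON values, a loader has to decode them rather than read them as plain strings. The sketch below shows one way such a loader could look. It is not the project's `config.py`: the `Settings` class, the `load_settings` function, and the fallback defaults (which simply mirror the sample values above) are assumptions made for illustration.

```python
import json
import os
from dataclasses import dataclass, field


@dataclass
class Settings:
    """Hypothetical container for the variables shown in the sample .env."""
    repo_sources: list = field(default_factory=list)
    web_sources: list = field(default_factory=list)
    chunk_size: int = 2048
    chunk_overlap: int = 200
    embedding_model: str = "sentence-transformers/all-mpnet-base-v2"
    db_type: str = "DRYRUN"


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    settings = Settings(
        # JSON-encoded lists, as in the sample .env
        repo_sources=json.loads(env.get("REPO_SOURCES", "[]")),
        web_sources=json.loads(env.get("WEB_SOURCES", "[]")),
        chunk_size=int(env.get("CHUNK_SIZE", "2048")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        embedding_model=env.get(
            "EMBEDDING_MODEL", "sentence-transformers/all-mpnet-base-v2"
        ),
        db_type=env.get("DB_TYPE", "DRYRUN").upper(),
    )
    # Assumed sanity checks; the real loader may validate differently.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    for source in settings.repo_sources:
        if not isinstance(source, dict) or "repo" not in source or "globs" not in source:
            raise ValueError(f"Each REPO_SOURCES entry needs 'repo' and 'globs': {source!r}")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.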
Run the script directly:

    ./embed_documents.py

Or build and run it as a container:

    podman build -t embed-job .
    podman run --rm --env-file .env embed-job

You can also pass inline vars:

    podman run --rm \
      -e DB_TYPE=REDIS \
      -e REDIS_URL=redis://localhost:6379 \
      embed-job

Dry run skips the vector DB upload and prints chunk metadata and content to the terminal.

    DB_TYPE=DRYRUN

Run it:

    ./embed_documents.py

This project keeps two dependency files under version control:
| File | Purpose | Edited by |
|---|---|---|
| `requirements.in` | Short, human-readable list of top-level libraries (no pins) | You |
| `requirements.txt` | Fully resolved, pinned lock file, including hashes, for exact, reproducible builds | `pip-compile` |

Install `pip-tools`:

    python -m pip install --upgrade pip-tools

- Edit `requirements.in`

      - sentence-transformers
      + sentence-transformers>=4.1
      + llama-index

- Re-lock the environment

      pip-compile --upgrade

- Synchronise your virtual-env

      pip-sync

Project layout:

    .
    ├── embed_documents.py   # Main entrypoint script
    ├── config.py            # Config loader from env
    ├── loaders/             # Git, web, PDF, and text loaders
    ├── vector_db/           # Pluggable DB providers
    ├── requirements.txt     # Python dependencies
    ├── redis_schema.yaml    # Redis index schema (if used)
    └── .env                 # Default runtime config
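
The layout lists `vector_db/` as holding pluggable DB providers. As a rough sketch of what "pluggable" can mean here, the example below defines a small provider interface, a dry-run implementation that prints chunks instead of indexing them, and a lookup keyed by `DB_TYPE`. All names (`VectorDBProvider`, `DryRunProvider`, `PROVIDERS`, `get_provider`) and the chunk dictionary shape are hypothetical; the project's actual classes may be organised differently.

```python
from abc import ABC, abstractmethod


class VectorDBProvider(ABC):
    """Assumed minimal interface that every backend would implement."""

    @abstractmethod
    def add_chunks(self, chunks: list[dict]) -> int:
        """Store chunks and return how many were handled."""


class DryRunProvider(VectorDBProvider):
    """Prints chunk metadata and content instead of writing to a database."""

    def add_chunks(self, chunks: list[dict]) -> int:
        for index, chunk in enumerate(chunks):
            print(f"[dry-run] chunk {index} source={chunk.get('source')}")
            print(chunk["text"])
        return len(chunks)


# Registry keyed by the DB_TYPE value; real backends (Redis, Elasticsearch,
# PGVector, ...) would register their own classes here.
PROVIDERS = {"DRYRUN": DryRunProvider}


def get_provider(db_type: str) -> VectorDBProvider:
    """Instantiate the provider registered for db_type (case-insensitive)."""
    try:
        return PROVIDERS[db_type.upper()]()
    except KeyError:
        raise ValueError(f"Unsupported DB_TYPE: {db_type!r}") from None
```

A registry like this keeps the entrypoint independent of any single backend: adding a database means adding one class and one dictionary entry.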

Run a compatible DB locally to test full ingestion + indexing.

PGVector:

    podman run --rm -d \
      --name pgvector \
      -e POSTGRES_USER=user \
      -e POSTGRES_PASSWORD=pass \
      -e POSTGRES_DB=mydb \
      -p 5432:5432 \
      docker.io/ankane/pgvector

    DB_TYPE=PGVECTOR ./embed_documents.py

Elasticsearch:

    podman run --rm -d \
      --name elasticsearch \
      -p 9200:9200 \
      -e "discovery.type=single-node" \
      -e "xpack.security.enabled=true" \
      -e "ELASTIC_PASSWORD=changeme" \
      -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
      docker.io/elastic/elasticsearch:8.11.1

    DB_TYPE=ELASTIC ./embed_documents.py

Redis (RediSearch):

    podman run --rm -d \
      --name redis-stack \
      -p 6379:6379 \
      docker.io/redis/redis-stack-server:6.2.6-v19

    DB_TYPE=REDIS ./embed_documents.py

Qdrant:

    podman run -d \
      -p 6333:6333 \
      --name qdrant \
      docker.io/qdrant/qdrant

    DB_TYPE=QDRANT ./embed_documents.py

SQL Server:

    podman run --rm -d \
      --name mssql \
      -e ACCEPT_EULA=Y \
      -e SA_PASSWORD=StrongPassword! \
      -p 1433:1433 \
      mcr.microsoft.com/mssql/rhel/server:2025-latest

    DB_TYPE=MSSQL ./embed_documents.py

Built with: