@@ -31,7 +31,7 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
3131* ** HTML to Markdown:** Converts extracted HTML to clean Markdown using ` turndown ` , preserving code blocks and basic formatting.
3232 * ** Clean Heading Text:** Automatically removes anchor links (like ` [](#section-id) ` ) from heading text for cleaner hierarchy display.
3333* ** Intelligent Chunking:** Splits Markdown content into manageable chunks based on headings and token limits, preserving context.
34- * ** Vector Embeddings:** Generates embeddings for each chunk using OpenAI's ` text-embedding-3-large ` model .
34+ * ** Vector Embeddings:** Generates embeddings for each chunk using OpenAI or Azure OpenAI (configurable) .
3535* ** Vector Storage:** Supports storing chunks, metadata, and embeddings in:
3636 * ** SQLite:** Using ` better-sqlite3 ` and the ` sqlite-vec ` extension for efficient vector search.
3737 * ** Qdrant:** A dedicated vector database, using the ` @qdrant/js-client-rest ` .
@@ -98,7 +98,7 @@ This ensures that searches for parent topics (like "Installation") will also mat
9898* ** Node.js:** Version 18 or higher recommended (check ` .nvmrc ` if available).
9999* ** npm:** Node Package Manager (usually comes with Node.js).
100100* ** TypeScript:** As the project is written in TypeScript (` ts-node ` is used for execution via ` npm start ` ).
101- * ** OpenAI API Key:** You need an API key from OpenAI to generate embeddings.
101+ * ** OpenAI API Key or Azure OpenAI Credentials :** You need either an OpenAI API key or Azure OpenAI credentials to generate embeddings.
102102* ** GitHub Personal Access Token:** Required for accessing GitHub issues (set as ` GITHUB_PERSONAL_ACCESS_TOKEN ` in your environment).
103103* ** Zendesk API Token:** Required for accessing Zendesk tickets and articles (set as ` ZENDESK_API_TOKEN ` in your environment).
104104* ** (Optional) Qdrant Instance:** If using the ` qdrant ` database type, you need a running Qdrant instance accessible from where you run the script.
@@ -129,8 +129,20 @@ Configuration is managed through two files:
129129 ` ` ` dotenv
130130 # .env
131131
132- # Required: Your OpenAI API Key
132+ # Embedding Provider Configuration
133+ # Optional: Specify which provider to use (defaults to 'openai' if not set)
134+ # Can also be configured in config.yaml
135+ EMBEDDING_PROVIDER=" azure" # or "openai"
136+
137+ # Required: Your OpenAI API Key (if using OpenAI provider)
133138 OPENAI_API_KEY="sk-..."
139+ OPENAI_MODEL="text-embedding-3-large" # Optional, defaults to text-embedding-3-large
140+
141+ # Required: Your Azure OpenAI credentials (if using Azure provider)
142+ AZURE_OPENAI_KEY="your-azure-key"
143+ AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
144+ AZURE_OPENAI_DEPLOYMENT_NAME="text-embedding-3-large"
145+ AZURE_OPENAI_API_VERSION="2024-10-21"
134146
135147 # Required for GitHub sources
136148 GITHUB_PERSONAL_ACCESS_TOKEN="ghp_..."
@@ -206,6 +218,21 @@ Configuration is managed through two files:
206218
207219 ** Example (` config.yaml` ):**
208220 ` ` ` yaml
221+ # Optional: Configure embedding provider
222+ # Can also be set via EMBEDDING_PROVIDER environment variable
223+ # Defaults to OpenAI if not specified
224+ embedding:
225+ provider: ' openai' # or 'azure'
226+ openai:
227+ api_key: '${OPENAI_API_KEY}' # Optional, uses env var by default
228+ model: 'text-embedding-3-large' # Optional, defaults to text-embedding-3-large
229+ # For Azure OpenAI, use this instead:
230+ # azure:
231+ # api_key: '${AZURE_OPENAI_KEY}'
232+ # endpoint: '${AZURE_OPENAI_ENDPOINT}'
233+ # deployment_name: 'text-embedding-3-large'
234+ # api_version: '2024-10-21' # Optional
235+
209236 sources:
210237 # Website source example
211238 - type: 'website'
0 commit comments