When making API requests, use one of the following base URLs:

- Local development: `http://localhost:8000`
- Remote access: your ngrok URL (e.g., `https://abcd1234.ngrok.io`)

For all examples below, replace `{BASE_URL}` with your actual base URL.
```bash
# For local development
export BASE_URL=http://localhost:8000

# For remote access via ngrok
export BASE_URL=https://your-ngrok-url.ngrok.io
```

## POST /generate

Generate text using the loaded model.
Request Body:

```json
{
  "prompt": "string",
  "model_id": "string | null",
  "stream": "boolean",
  "max_length": "integer | null",
  "temperature": "float",
  "top_p": "float",
  "top_k": "integer",
  "repetition_penalty": "float",
  "do_sample": "boolean"
}
```

Response Quality Parameters:
| Parameter | Default | Description |
|---|---|---|
| `max_length` | 8192 | Maximum number of tokens in the generated response |
| `temperature` | 0.7 | Controls randomness (higher = more creative, lower = more focused) |
| `top_p` | 0.9 | Nucleus sampling parameter (higher = more diverse responses) |
| `top_k` | 80 | Limits sampling to the top K tokens (higher = more diverse vocabulary) |
| `repetition_penalty` | 1.15 | Penalizes repetition (higher = less repetition) |
| `do_sample` | true | Whether to use sampling; if false, uses greedy decoding |
Note: All parameters are optional. If not provided, the server will use the default values shown above.
Response:

```json
{
  "response": "string",
  "usage": {
    "prompt_tokens": "integer",
    "completion_tokens": "integer",
    "total_tokens": "integer"
  }
}
```

Example (curl):
```bash
# Basic generation with minimal parameters
curl -X POST "${BASE_URL}/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms"
  }'

# Generation with all parameters
curl -X POST "${BASE_URL}/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "model_id": null,
    "stream": false,
    "max_length": 8192,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 80,
    "repetition_penalty": 1.15,
    "do_sample": true
  }'

# Streaming generation
curl -X POST "${BASE_URL}/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "stream": true
  }'
```

Error Responses:

- `400 Bad Request`: Invalid parameters
- `413 Payload Too Large`: Input too long
- `429 Too Many Requests`: Rate limit exceeded
- `500 Internal Server Error`: Model error
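When scripting against the API, it helps to branch on these codes rather than treat every non-200 response the same. A minimal sketch, assuming a plain bash client; the helper name and messages are illustrative, not part of the API:

```bash
#!/usr/bin/env bash
# Map an HTTP status code from the API to a suggested action.
# Helper name and messages are illustrative, not part of the API.
describe_status() {
  case "$1" in
    200) echo "success" ;;
    400) echo "fix the request parameters" ;;
    413) echo "shorten the input" ;;
    429) echo "back off and retry" ;;
    500) echo "check the server logs" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage with curl: write the body to a file, capture only the status code,
# then branch on it.
# status=$(curl -s -o response.json -w "%{http_code}" -X POST "${BASE_URL}/generate" \
#   -H "Content-Type: application/json" -d '{"prompt": "Hello"}')
# echo "Server said: $(describe_status "$status")"
```

The `-w "%{http_code}"` trick keeps the body and the status code separate, so the same pattern works for every endpoint below.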
## POST /chat

Chat completion endpoint, similar to OpenAI's API.
Request Body:

```json
{
  "messages": [
    {
      "role": "string",
      "content": "string"
    }
  ],
  "model_id": "string | null",
  "stream": "boolean",
  "max_length": "integer | null",
  "temperature": "float",
  "top_p": "float",
  "top_k": "integer",
  "repetition_penalty": "float",
  "do_sample": "boolean"
}
```

Note: The same response quality parameters from the `/generate` endpoint apply here. All parameters are optional and use the same defaults.
Response:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "string"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": "integer",
    "completion_tokens": "integer",
    "total_tokens": "integer"
  }
}
```

Example (curl):
```bash
# Basic chat with minimal parameters
curl -X POST "${BASE_URL}/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

# Chat with all parameters
curl -X POST "${BASE_URL}/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "model_id": null,
    "stream": false,
    "max_length": 8192,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 80,
    "repetition_penalty": 1.15,
    "do_sample": true
  }'

# Streaming chat
curl -X POST "${BASE_URL}/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": true
  }'
```

## POST /generate/batch

Generate text for multiple prompts in parallel.
Request Body:

```json
{
  "prompts": ["string", "string", ...],
  "model_id": "string | null",
  "max_length": "integer | null",
  "temperature": "float",
  "top_p": "float",
  "top_k": "integer",
  "repetition_penalty": "float",
  "do_sample": "boolean"
}
```

Note: The same response quality parameters from the `/generate` endpoint apply here. All parameters are optional and use the same defaults.
Response:

```json
{
  "responses": ["string", "string", ...],
  "usage": {
    "prompt_tokens": "integer",
    "completion_tokens": "integer",
    "total_tokens": "integer"
  }
}
```

Example (curl):
```bash
# Basic batch generation with minimal parameters
curl -X POST "${BASE_URL}/generate/batch" \
  -H "Content-Type: application/json" \
  -d '{
    "prompts": [
      "Write a haiku about nature",
      "Tell a short joke",
      "Give a fun fact about space"
    ]
  }'

# Batch generation with all parameters
curl -X POST "${BASE_URL}/generate/batch" \
  -H "Content-Type: application/json" \
  -d '{
    "prompts": [
      "Write a haiku about nature",
      "Tell a short joke",
      "Give a fun fact about space"
    ],
    "model_id": null,
    "max_length": 8192,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 80,
    "repetition_penalty": 1.15,
    "do_sample": true
  }'
```

## POST /models/load

Load a specific model.
Request Body:

```json
{
  "model_id": "string"
}
```

Example (curl):
```bash
# Load a specific model
curl -X POST "${BASE_URL}/models/load" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "microsoft/phi-2"
  }'
```

## GET /models/current

Get information about the currently loaded model.
Example (curl):

```bash
# Get current model information
curl -X GET "${BASE_URL}/models/current"
```

## GET /models/available

List all available models in the registry.
Example (curl):

```bash
# List all available models
curl -X GET "${BASE_URL}/models/available"
```

## POST /models/unload

Unload the current model to free up resources.
Example (curl):

```bash
# Unload the current model
curl -X POST "${BASE_URL}/models/unload"
```

## GET /system/info

Get detailed system information.
Example (curl):

```bash
# Get system information
curl -X GET "${BASE_URL}/system/info"
```

## GET /health

Check the health status of the server.
Example (curl):

```bash
# Check server health
curl -X GET "${BASE_URL}/health"
```

## Error Handling

All endpoints return appropriate HTTP status codes:

- `200`: Success
- `400`: Bad Request
- `404`: Not Found
- `500`: Internal Server Error
Error responses include a detail message:

```json
{
  "detail": "Error message describing what went wrong"
}
```

## Rate Limiting

- 60 requests per minute
- Burst size of 10 requests
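With a 60-requests-per-minute budget and a burst size of 10, sustained clients should pace themselves and back off when they do hit a `429 Too Many Requests`. A minimal exponential-backoff sketch; the function name and 60-second cap are illustrative client-side choices, not server behavior:

```bash
#!/usr/bin/env bash
# Exponential backoff: the delay doubles with each retry attempt
# (1s, 2s, 4s, ...), capped at 60 seconds. Attempts are numbered from 0.
backoff_delay() {
  local delay=$(( 1 << $1 ))
  if [ "$delay" -gt 60 ]; then
    delay=60
  fi
  echo "$delay"
}

# Typical use around a rate-limited request:
# sleep "$(backoff_delay "$attempt")"
```

Sleeping before the retry, with the delay growing per attempt, keeps a burst of failing clients from hammering the server in lockstep.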
## Default Parameters

All generation endpoints have sensible defaults for the response quality parameters:

- `max_length`: 8192 tokens
- `temperature`: 0.7
- `top_p`: 0.9
- `top_k`: 80
- `repetition_penalty`: 1.15
- `do_sample`: true

You can omit any or all of these parameters in your requests, and the server will use these defaults.
## Parameter Tuning Tips

When experimenting with different parameter values, here's what to try:

- For more creative responses: increase `temperature` (0.8-1.0) and `top_p` (0.95-1.0)
- For more focused responses: decrease `temperature` (0.3-0.5) and `top_p` (0.5-0.7)
- For less repetition: increase `repetition_penalty` (1.2-1.5)
- For longer responses: increase `max_length` (up to 16384)
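These tips can be bundled into named presets so scripts don't repeat magic numbers. The preset names and exact values below are illustrative picks from the ranges above, not options the server defines:

```bash
#!/usr/bin/env bash
# Emit the sampling-parameter fragment of a request body for a named preset.
# Preset names and values are illustrative picks from the tuning ranges above.
preset_params() {
  case "$1" in
    creative) echo '"temperature": 0.9, "top_p": 0.95' ;;
    focused)  echo '"temperature": 0.4, "top_p": 0.6' ;;
    *)        echo '"temperature": 0.7, "top_p": 0.9' ;;  # server defaults
  esac
}

# Splice a preset into a request body:
# curl -X POST "${BASE_URL}/generate" \
#   -H "Content-Type: application/json" \
#   -d "{\"prompt\": \"Hello\", $(preset_params creative)}"
```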
## Streaming Responses

When using streaming endpoints (`"stream": true`), the response is sent as a series of Server-Sent Events (SSE). Each event starts with `data: ` followed by the token or chunk. The end of the stream is marked with `data: [DONE]`.
```bash
# Example of processing streaming responses with bash.
# -s silences the progress meter; -N disables curl's output buffering so
# tokens are printed as soon as they arrive.
curl -s -N -X POST "${BASE_URL}/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "stream": true}' | while read -r line; do
    if [[ $line == data:* ]]; then
      content=${line#data: }
      if [[ $content != "[DONE]" ]]; then
        echo -n "$content"
      fi
    fi
  done
```

---

Made with ❤️ by Utkarsh Tiwari
GitHub: UtkarshTheDev | Twitter: @UtkarshTheDev | LinkedIn: utkarshthedev