The official Python client for LocalLab, providing a simple and powerful interface to interact with your LocalLab server.
Install the client package:
```bash
pip install locallab-client
```

Note: The package name is `locallab-client`, but the import is `locallab_client` (with an underscore).
The LocalLab client provides two different client classes:
- `LocalLabClient` - an asynchronous client that uses `async`/`await` syntax
- `SyncLocalLabClient` - a synchronous client that doesn't require `async`/`await`
Choose the one that best fits your needs.
Use SyncLocalLabClient when you don't want to use async/await:
```python
from locallab_client import SyncLocalLabClient

# Initialize client
client = SyncLocalLabClient("http://localhost:8000")

try:
    # Basic text generation - no async/await needed!
    response = client.generate("Write a story about a robot")
    print(response)
except Exception as e:
    print(f"Generation failed: {str(e)}")
finally:
    # Close the client - no async/await needed!
    client.close()
```

Use `LocalLabClient` when you're working with async code:
```python
import asyncio
from locallab_client import LocalLabClient

async def main():
    # Initialize client
    client = LocalLabClient("http://localhost:8000")
    try:
        # Basic text generation
        response = await client.generate("Write a story about a robot")
        print(response)
    except Exception as e:
        print(f"Generation failed: {str(e)}")
    finally:
        await client.close()

# Run the async function
asyncio.run(main())
```

Both clients support context managers for automatic resource cleanup:
```python
# Import the clients
from locallab_client import SyncLocalLabClient, LocalLabClient

# Synchronous context manager
with SyncLocalLabClient("http://localhost:8000") as client:
    response = client.generate("Write a story about a robot")
    print(response)
# Client is automatically closed

# Asynchronous context manager (use inside an async function)
async with LocalLabClient("http://localhost:8000") as client:
    response = await client.generate("Write a story about a robot")
    print(response)
# Client is automatically closed
```

See Also: For more details, check the Async/Sync Client Guide.
```python
from locallab_client import LocalLabClient

client = LocalLabClient(
    base_url: str,
    timeout: float = 30.0,
    auto_close: bool = True  # Automatically close inactive sessions
)
```

```python
from locallab_client import SyncLocalLabClient

client = SyncLocalLabClient(
    base_url: str,
    timeout: float = 30.0
)
```

Feature: Both clients automatically close inactive sessions and provide proper resource management. The synchronous client handles all the async/await details internally.
```python
# Async usage with LocalLabClient
response = await client.generate(
    prompt: str,
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 180.0  # 3 minutes timeout for complete responses
) -> str

# Sync usage with SyncLocalLabClient
response = client.generate(
    prompt: str,
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 180.0  # 3 minutes timeout for complete responses
) -> str
```

The client now supports advanced response quality parameters:
| Parameter | Default | Description |
|---|---|---|
| `max_length` | 8192 | Maximum number of tokens in the generated response |
| `temperature` | 0.7 | Controls randomness (higher = more creative, lower = more focused) |
| `top_p` | 0.9 | Nucleus sampling parameter (higher = more diverse responses) |
| `top_k` | 80 | Limits vocabulary to top K tokens (higher = more diverse vocabulary) |
| `repetition_penalty` | 1.15 | Penalizes repetition (higher = less repetition) |
| `timeout` | 180.0 | Maximum time in seconds allowed for generation |
These parameters work together to produce high-quality, complete responses that are both relevant and diverse, with minimal repetition.
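As a rough illustration of tuning these parameters, you might keep named presets and unpack one into `generate`. The preset names and values below are arbitrary illustrative choices, not part of the `locallab-client` API:

```python
# Hypothetical parameter presets; names and values are illustrative only.
PRESETS = {
    # Lower temperature and top_p for focused, deterministic answers
    "precise": {"temperature": 0.3, "top_p": 0.8, "repetition_penalty": 1.2},
    # Higher temperature and top_k for more varied, creative output
    "creative": {"temperature": 0.9, "top_k": 100, "top_p": 0.95},
}

def generation_kwargs(preset: str, **overrides) -> dict:
    """Merge a named preset with per-call overrides."""
    kwargs = dict(PRESETS[preset])
    kwargs.update(overrides)
    return kwargs

# Usage with a client (requires a running server):
# response = client.generate("Summarize this text", **generation_kwargs("precise"))
```

Keeping presets in one place makes it easy to experiment with quality trade-offs without editing every call site.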
```python
# Async usage with LocalLabClient
async for token in client.stream_generate(
    prompt: str,
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 300.0  # 5 minutes timeout for streaming
):
    print(token, end="", flush=True)

# Sync usage with SyncLocalLabClient
for token in client.stream_generate(
    prompt: str,
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 300.0  # 5 minutes timeout for streaming
):
    print(token, end="", flush=True)
```
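If you also need the full text after streaming (for logging, say), you can accumulate tokens as they arrive. This helper is a generic sketch, not part of the client; it works with any iterable of string tokens, such as the one the sync client's `stream_generate` returns:

```python
def collect_stream(tokens, echo=True):
    """Accumulate streamed tokens into one string, optionally echoing them."""
    parts = []
    for token in tokens:
        if echo:
            # Print each token as it arrives for live output
            print(token, end="", flush=True)
        parts.append(token)
    return "".join(parts)

# With a live sync client (requires a running server):
# full_text = collect_stream(client.stream_generate("Write a story"))
```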
```python
# Async usage with LocalLabClient
response = await client.chat(
    messages: list[dict],
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 180.0  # 3 minutes timeout for complete responses
) -> dict

# Sync usage with SyncLocalLabClient
response = client.chat(
    messages: list[dict],
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 180.0  # 3 minutes timeout for complete responses
) -> dict
```
```python
# Async usage with LocalLabClient
async for chunk in client.stream_chat(
    messages: list[dict],
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 300.0  # 5 minutes timeout for streaming
):
    if "content" in chunk:
        print(chunk["content"], end="", flush=True)

# Sync usage with SyncLocalLabClient
for chunk in client.stream_chat(
    messages: list[dict],
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 300.0  # 5 minutes timeout for streaming
):
    if "content" in chunk:
        print(chunk["content"], end="", flush=True)
```

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
]

# Async usage with LocalLabClient
response = await client.chat(messages)
print(response["choices"][0]["message"]["content"])

# Sync usage with SyncLocalLabClient
response = client.chat(messages)
print(response["choices"][0]["message"]["content"])
```
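For multi-turn conversations, you append each assistant reply back onto the message list before the next call. A minimal sketch of that bookkeeping, assuming responses follow the `choices[0].message` structure shown above (`next_turn` is a hypothetical helper, not part of the client):

```python
def next_turn(messages, user_content, chat_fn):
    """Send one user turn via chat_fn and fold the reply into the history."""
    messages.append({"role": "user", "content": user_content})
    response = chat_fn(messages)
    # Keep the assistant's message in the history for the next turn
    assistant_message = response["choices"][0]["message"]
    messages.append(assistant_message)
    return assistant_message["content"]

# With the sync client (requires a running server):
# history = [{"role": "system", "content": "You are a helpful assistant."}]
# print(next_turn(history, "Hello!", client.chat))
# print(next_turn(history, "Tell me more.", client.chat))
```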
```python
# Async usage with LocalLabClient
responses = await client.batch_generate(
    prompts: list[str],
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 300.0  # 5 minutes timeout for batch operations
) -> dict

# Sync usage with SyncLocalLabClient
responses = client.batch_generate(
    prompts: list[str],
    model_id: str = None,
    temperature: float = 0.7,
    max_length: int = 8192,  # Default is 8192 tokens for complete responses
    top_p: float = 0.9,
    top_k: int = 80,  # Limits vocabulary to top K tokens
    repetition_penalty: float = 1.15,  # Reduces repetition
    timeout: float = 300.0  # 5 minutes timeout for batch operations
) -> dict
```

```python
prompts = [
    "Write a haiku",
    "Tell a joke",
    "Give a fun fact"
]

# Async usage with LocalLabClient
responses = await client.batch_generate(prompts)
for prompt, response in zip(prompts, responses["responses"]):
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response}")

# Sync usage with SyncLocalLabClient
responses = client.batch_generate(prompts)
for prompt, response in zip(prompts, responses["responses"]):
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response}")
```

All generation methods work with sensible defaults, so you can use them without specifying any parameters:
```python
# Simple generation with all default parameters
response = client.generate("Explain quantum computing")

# Simple chat with all default parameters
messages = [{"role": "user", "content": "Hello, how are you?"}]
response = client.chat(messages)

# Simple batch generation with all default parameters
responses = client.batch_generate(["Write a poem", "Tell a story"])
```

The client will automatically use the optimal parameters for high-quality responses.
If you can't connect to the server, make sure it's running and accessible:
```python
# Async usage with LocalLabClient
try:
    await client.generate("Test")
except ConnectionError as e:
    print("Server not running:", str(e))

# Sync usage with SyncLocalLabClient
try:
    client.generate("Test")
except ConnectionError as e:
    print("Server not running:", str(e))
```

For slow responses or large generations, increase the timeout:
```python
# For async client initialization
client = LocalLabClient(
    "http://localhost:8000",
    timeout=60.0  # 60 seconds timeout for connection
)

# For sync client initialization
client = SyncLocalLabClient(
    "http://localhost:8000",
    timeout=60.0  # 60 seconds timeout for connection
)

# For individual requests (recommended approach)
response = client.generate(
    "Write a detailed essay on quantum computing",
    max_length=8192,
    timeout=300.0  # 5 minutes timeout for this specific request
)

# For streaming requests
for token in client.stream_generate(
    "Write a detailed essay on quantum computing",
    max_length=8192,
    timeout=300.0  # 5 minutes timeout for streaming
):
    print(token, end="")
```

Note: The `timeout` parameter in the client initialization controls the connection timeout, while the `timeout` parameter in the generation methods controls the request timeout. For long generations, it's recommended to use a higher timeout in the generation methods.
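If long generations still fail occasionally, a small retry wrapper can help. This is a generic sketch, not part of the client; the exact exception the client raises on timeout may vary, so it catches broadly and re-raises after the final attempt:

```python
import time

def with_retries(fn, attempts=3, backoff=2.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            # Wait longer after each failed attempt: backoff, 2*backoff, ...
            time.sleep(backoff * (2 ** attempt))

# With the sync client (requires a running server):
# response = with_retries(
#     lambda: client.generate("Write a long essay", timeout=300.0)
# )
```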
For low-resource environments, unload models when not in use:
```python
# Async usage with LocalLabClient
async def memory_warning(usage: float):
    print(f"High memory usage: {usage}%")
    await client.unload_model()

client.on_memory_warning = memory_warning

# Sync usage with SyncLocalLabClient
def memory_warning(usage: float):
    print(f"High memory usage: {usage}%")
    client.unload_model()

client.on_memory_warning = memory_warning
```

Always close clients when done to prevent resource leaks:
```python
# Sessions are automatically closed when the client is garbage collected
# or when the program exits, but you should still close them explicitly:

# Async usage with LocalLabClient
await client.close()

# Sync usage with SyncLocalLabClient
client.close()

# Or better yet, use context managers for automatic closing:

# Async context manager
async with LocalLabClient("http://localhost:8000") as client:
    # Do something with the client
    pass  # Session is automatically closed when exiting the block

# Sync context manager
with SyncLocalLabClient("http://localhost:8000") as client:
    # Do something with the client
    pass  # Session is automatically closed when exiting the block
```

If you encounter import errors, make sure you've installed the correct package:
```bash
pip install locallab-client
```

And import it correctly:

```python
# The package name is 'locallab-client' but the import is 'locallab_client'
from locallab_client import LocalLabClient
from locallab_client import SyncLocalLabClient
```

- Fork the repository
- Create your feature branch
- Run tests: `pytest`
- Submit a pull request
MIT License - see LICENSE for details.