Note
OpenArc is under active development.
OpenArc is an inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS, Qwen-TTS, Qwen-ASR, embedding, and reranker models over OpenAI-compatible endpoints, powered by OpenVINO on your device. Local, private, open source AI.
Drawing on ideas from llama.cpp, vLLM, transformers, OpenVINO Model Server, Ray, Lemonade, and other projects cited below, OpenArc has been a way for me to learn about inference engines by trying to build one myself.
Along the way a Discord community has formed around this project! If you are interested in using Intel devices for AI and machine learning, feel free to stop by.
Thanks to everyone on Discord for their continued support!
Note
Documentation has been ported to a Zensical site. It's still WIP, and the site isn't live. To build and serve the docs after install:
zensical serve -a localhost:8004
- NEW! Containerization with Docker #60 by @meatposes
- NEW! Speculative decoding support for LLMs #57 by @meatposes
- NEW! Streaming cancellation support for LLMs and VLMs
- Multi-GPU pipeline parallelism
- CPU offload/Hybrid device
- NPU device support
- OpenAI-compatible endpoints:
- /v1/models
- /v1/completions: LLM only
- /v1/chat/completions
- /v1/audio/transcriptions: whisper, qwen3_asr
- /v1/audio/speech: kokoro only
- /v1/embeddings: qwen3-embedding #33 by @mwrothbe
- /v1/rerank: qwen3-reranker #39 by @mwrothbe
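As a quick sketch of what a request to the chat completions endpoint looks like, the payload below follows the OpenAI format; the model name and API key are placeholders for illustration, not values OpenArc ships with:

```python
import json

# Placeholder model name and API key, shown only to illustrate the request shape.
payload = {
    "model": "my-openvino-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}
headers = {
    "Authorization": "Bearer openarc-api-key",
    "Content-Type": "application/json",
}
# POST this body to http://localhost:8000/v1/chat/completions with the headers above.
body = json.dumps(payload)
```

The same header scheme applies to the other endpoints; only the payload changes.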
- jinja templating with AutoTokenizer
- OpenAI-compatible tool calls with streaming and parallel execution
- tool call parser currently reads "name", "argument"
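Since the parser reads the "name" and "argument" fields, a minimal sketch of that step might look like the following; the raw fragment here is simplified for illustration, not OpenArc's exact wire format:

```python
import json

# Simplified tool-call fragment as a parser might see it; note the arguments
# arrive as a JSON-encoded string that must be decoded a second time.
raw = '{"name": "get_weather", "argument": "{\\"city\\": \\"Berlin\\"}"}'

call = json.loads(raw)
tool_name = call["name"]                  # which function the model wants to invoke
tool_args = json.loads(call["argument"])  # decode the nested JSON string
```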
- Fully async multi engine, multi task architecture
- Model concurrency: load and infer multiple models at once
- Automatic unload on inference failure
- llama-bench style benchmarking for llm with automatic SQLite database
- metrics on every request:
- ttft
- prefill_throughput
- decode_throughput
- decode_duration
- tpot
- load time
- stream mode
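These per-request metrics relate to raw timings in a straightforward way; the sketch below uses illustrative field names, not OpenArc's internal API:

```python
# Illustrative derivation of the benchmark metrics from raw timestamps (seconds)
# and token counts. Field names are hypothetical.
def compute_metrics(request_start, first_token_time, end_time,
                    prompt_tokens, completion_tokens):
    ttft = first_token_time - request_start        # time to first token (s)
    decode_duration = end_time - first_token_time  # time spent decoding (s)
    prefill_throughput = prompt_tokens / ttft      # prompt tokens per second
    decode_throughput = completion_tokens / decode_duration
    tpot = decode_duration / completion_tokens     # time per output token (s)
    return {
        "ttft": ttft,
        "prefill_throughput": prefill_throughput,
        "decode_throughput": decode_throughput,
        "decode_duration": decode_duration,
        "tpot": tpot,
    }
```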
- More OpenVINO examples
- OpenVINO implementation of hexgrad/Kokoro-82M
- OpenVINO implementation of Qwen3-TTS and Qwen3-ASR
Note
Interested in contributing? Please open an issue before submitting a PR!
Linux
- OpenVINO requires device-specific drivers.
- Visit OpenVINO System Requirements for the latest information on drivers.
- Install uv from astral
- After cloning, run:
uv sync
- Activate your environment with:
source .venv/bin/activate
Build latest optimum
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
Build latest OpenVINO and OpenVINO GenAI from nightly wheels
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
- Set your API key as an environment variable:
export OPENARC_API_KEY=<api-key>
- To get started, run:
openarc --help
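To call the running server from a script, the key can be read back from the environment and sent as a Bearer token. A minimal sketch, assuming the server listens on localhost port 8000:

```python
import os

# Read the key set earlier; falls back to a placeholder if the variable is unset.
api_key = os.environ.get("OPENARC_API_KEY", "<api-key>")
base_url = "http://localhost:8000/v1"  # assumed default host/port
headers = {"Authorization": f"Bearer {api_key}"}
models_endpoint = f"{base_url}/models"  # e.g. GET this URL to list available models
```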
Windows
- OpenVINO requires device-specific drivers.
- Visit OpenVINO System Requirements for the latest information on drivers.
- Install uv from astral
- Clone OpenArc, enter the directory, and run:
uv sync
- Activate your environment with:
.venv\Scripts\activate
Build latest optimum
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
Build latest OpenVINO and OpenVINO GenAI from nightly wheels
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
- Set your API key as an environment variable:
setx OPENARC_API_KEY openarc-api-key
- To get started, run:
openarc --help
Docker
Instead of fighting with Intel's own Docker images, we built our own, which is as close to boilerplate as possible. For a primer on Docker, check out this video.
Build and run the container:
docker-compose up --build -d
Run the container:
docker run -d -p 8000:8000 openarc:latest
Enter the container:
docker exec -it openarc /bin/bash
Configure with environment variables, then rebuild:
export OPENARC_API_KEY="openarc-api-key" # default, set it to whatever you want
export OPENARC_AUTOLOAD_MODEL="model_name" # model_name to load on startup
export MODEL_PATH="/path/to/your/models" # mount your models to `/models` inside the container
docker-compose up --build -d
Take a look at the Dockerfile and docker-compose for more details.
Note
Need help installing drivers? Join our Discord or open an issue.
Note
uv has a pip interface that is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start learning uv.
OpenArc stands on the shoulders of many other projects:
@article{zhou2024survey,
title={A Survey on Efficient Inference for Large Language Models},
author={Zhou, Zixuan and Ning, Xuefei and Hong, Ke and Fu, Tianyu and Xu, Jiaming and Li, Shiyao and Lou, Yuming and Wang, Luning and Yuan, Zhihang and Li, Xiuhong and Yan, Shengen and Dai, Guohao and Zhang, Xiao-Ping and Dong, Yuhan and Wang, Yu},
journal={arXiv preprint arXiv:2404.14294},
year={2024}
}
Thanks for your work!!
