
👁️ Om AI Lab

Open Multimodal AGI Research

Website · Hugging Face · X (Twitter)

Pioneering the next generation of multimodal AI models for Spatial Intelligence and Embodied AI.


🌌 About Us

At Om AI Lab, we believe the future of AI extends far beyond pure text. We are dedicated to building the "brains" for next-generation systems by focusing on the intersection of Spatial Intelligence, Visual Reasoning, and Embodied Agents.

Our research spans across open-vocabulary perception, reinforced vision-language models, and real-time inference. We aim to bridge the critical gap between high-level logical reasoning and fine-grained visual action—building models that don't just "see" the world, but intuitively understand and interact with it.


🚀 Core Research Tracks

🧠 Reinforced & Advanced VLMs

Models that think, reason, and understand the visual world at a granular level; a minimal usage sketch follows the list.

  • 🌟 VLM-R1: Solving Visual Understanding with Reinforced VLMs. (Highly active)
  • 🔍 VLM-FO1: Bridging the gap between high-level reasoning and fine-grained perception in Vision-Language Models.
  • 🔎 ZoomEye: Enhancing Multimodal LLMs with human-like zooming capabilities through tree-based image exploration.
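
Checkpoints from this track are typically released on the lab's Hugging Face page, so a quick way to try one is the standard transformers vision-to-text path. The sketch below is a minimal, assumption-laden example: MODEL_ID is a hypothetical placeholder rather than a real repository id, and prompt/processor details differ between model families, so consult the individual repo's README before relying on it.

```python
# Minimal sketch of querying a vision-language checkpoint via Hugging Face transformers.
# MODEL_ID is a hypothetical placeholder; substitute a checkpoint actually released by
# the lab. Prompting and pre/post-processing details vary per model family.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "omlab/<released-vlm-checkpoint>"  # hypothetical placeholder id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("street_scene.jpg")
prompt = "How many pedestrians are waiting at the crosswalk?"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```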

👁️ Real-Time Perception & Open-World Detection

Foundational spatial understanding optimized for real-time performance on edge and on-premises hardware; a toy concept sketch follows the list.

  • OmDet: Real-time, highly accurate, open-vocabulary end-to-end object detection.
  • 📐 GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training.
  • 🌍 ImageRAG: Enhancing ultrahigh-resolution remote sensing imagery analysis.
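
Open-vocabulary detection means regions are scored against arbitrary text labels rather than a fixed class list. The toy sketch below illustrates only that concept using off-the-shelf CLIP similarity; it is not OmDet's architecture or API, and the region crops are hypothetical inputs.

```python
# Toy illustration of the open-vocabulary idea: score candidate region crops against
# free-form text labels with CLIP. Concept sketch only; not OmDet's pipeline.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a forklift", "a traffic cone", "a delivery robot"]      # arbitrary vocabulary
crops = [Image.open(p) for p in ["region_0.png", "region_1.png"]]  # hypothetical region crops

inputs = processor(text=labels, images=crops, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
for crop_probs in probs:
    print({label: round(float(p), 3) for label, p in zip(labels, crop_probs)})
```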

🤖 Multimodal Agents & Embodied AI

Action-oriented intelligence for physical and virtual environments; a generic agent-loop sketch follows the list.

  • 🛠️ OmAgent: A comprehensive framework to build multimodal language agents for fast prototyping and production.
  • 🎯 OpenTrackVLA: Open and reproducible research for tracking Vision-Language-Action (VLA) models.
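
Frameworks in this track coordinate a perceive-reason-act loop around vision models, an LLM planner, and tools. The sketch below is a hand-rolled, generic version of that loop to show the moving parts; it is not OmAgent's actual API, and the callables are placeholders you would supply.

```python
# Generic perceive-reason-act loop illustrating what a multimodal agent framework
# orchestrates. Hand-rolled sketch only; not OmAgent's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Observation:
    image_path: str
    instruction: str

def run_agent(obs: Observation,
              perceive: Callable[[str], str],
              reason: Callable[[str, str, list], str],
              act: Callable[[str], None],
              max_steps: int = 5) -> str:
    """Alternate perception, reasoning, and action until the reasoner declares it is done."""
    memory: list[str] = []
    for _ in range(max_steps):
        facts = perceive(obs.image_path)            # e.g. detections or a caption from a VLM
        decision = reason(obs.instruction, facts, memory)
        memory.append(decision)
        if decision.startswith("DONE"):
            return decision
        act(decision)                               # e.g. a tool call or a robot command
    return "max steps reached"
```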

📊 Benchmarks & Evaluation

Rigorous standards for the open-source multimodal community; a toy scoring sketch follows the list.

  • 📏 OVDEval: A comprehensive evaluation benchmark for Open-Vocabulary Detection.
  • 📝 VL-CheckList: Evaluating Vision & Language Pretraining Models with Objects, Attributes, and Relations.
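
Detection benchmarks of this kind match predicted boxes and labels against ground truth at an IoU threshold before aggregating metrics. The sketch below shows that matching step in its simplest form; OVDEval's real protocol is richer (e.g., hard negatives and per-aspect subsets), so treat this as an assumption-laden illustration, not the benchmark's code.

```python
# Toy IoU matching and precision/recall, illustrating the kind of scoring an
# open-vocabulary detection benchmark performs. Not OVDEval's actual protocol.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def precision_recall(preds, gts, thresh=0.5):
    """Greedily match (label, box) predictions to ground truth at an IoU threshold."""
    matched, used = 0, set()
    for label, box in preds:
        for i, (gt_label, gt_box) in enumerate(gts):
            if i not in used and label == gt_label and iou(box, gt_box) >= thresh:
                matched += 1
                used.add(i)
                break
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    return precision, recall

# Example: one correct "cat" box and one spurious "dog" box -> precision 0.5, recall 1.0
print(precision_recall([("cat", [0, 0, 10, 10]), ("dog", [50, 50, 60, 60])],
                       [("cat", [1, 1, 10, 10])]))
```
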
Building the foundational brains for the physical world.

Join us in exploring the spatial frontier.

Pinned

  1. VLM-R1 (Public)

    Solve Visual Understanding with Reinforced VLMs

    Python · 6k stars · 377 forks

  2. OmDet (Public)

    Real-time and accurate open-vocabulary end-to-end object detection

    Python · 1.4k stars · 115 forks

  3. OmAgent (Public)

    [EMNLP-2024] Build multimodal language agents for fast prototyping and production

    Python · 2.6k stars · 288 forks

  4. VLM-FO1 (Public)

    VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

    Python · 299 stars · 14 forks

  5. OpenTrackVLA (Public)

    Open & Reproducible Research for Tracking VLAs

    Python · 196 stars · 10 forks

  6. ZoomEye (Public)

    [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

    Python · 80 stars · 9 forks
