Introducing Kestrel VLM
- Hina Dixit
- Jul 31
- 3 min read
A next-gen vision–language model that leaps 30% past Google DeepMind's SigLIP-2 and unlocks a new “fusion-space” frontier for enterprise vision AI

TL;DR
- 30% stronger zero-shot accuracy vs. SigLIP-2 on real-world document- and chart-understanding tasks.
- Sub-second first-token latency and 50+ tokens/s throughput on a single A100-80 GB.
- The proprietary Kestrel Vision Encoder projects pixels into a shared representation space (“fusion-space”) where visual semantics align with language at unmatched granularity.
- Turnkey APIs with enterprise-grade observability, SGX enclave mode, and flexible INT4→FP32 power tiers.
- Ready today for doc QA, structured data extraction, math OCR, and streaming vision chatbots.
- Available in 650M, 1B, and 1.5B parameter variants.
Why we built it
Enterprise vision workloads have outgrown generic multimodal models. OCR pipelines fragment, chart parsers break, and throughput collapses when prompts exceed a few pages. We set out to:
- Fuse perception & reasoning in one model, with no brittle OCR-first hops.
- Shrink latency to human-dialogue speeds (< 1 s).
- Lower TCO by squeezing every TOPS out of Ampere- and Hopper-class Tensor Cores.
Kestrel VLM delivers on all three by re-thinking the visual backbone and the inference stack together.
Under the hood
| Layer | Innovation | Pay-off |
| --- | --- | --- |
| Kestrel Vision Encoder | Trained on >200M mixed-modality triples with contrastive + masked-patch pretext tasks; INT8-aware from day one | 30% ↑ accuracy vs. SigLIP-2 at equal model size |
| Fusion-Space Alignment | Projects vision tokens into a fusion space where geometric, textual, and layout cues are co-normalized | Fewer hallucinations; richer grounding |
| Sparsity-aware Runtime | Dynamic token & attention pruning, fused into Triton kernels | 1.33–2× speed-up across INT8 & INT4 tiers |
| Memory-Safe Arena Allocator | Zero-copy buffer reuse; optional SGX | Deterministic latency; secure multi-tenant serving |
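To make the sparsity-aware runtime concrete, here is a minimal NumPy sketch of saliency-based token pruning. The scoring rule, keep ratio, and shapes are illustrative assumptions; Kestrel's production path uses learned scores fused into Triton kernels, not this toy heuristic.

```python
import numpy as np

def prune_vision_tokens(tokens: np.ndarray, keep_ratio: float = 0.5):
    """Drop low-saliency vision tokens before cross-attention.

    tokens: (num_tokens, dim) array of encoder outputs.
    Saliency here is a simple L2-norm proxy -- an illustrative
    stand-in for the learned attention-based scores a sparsity-aware
    runtime would use.
    """
    scores = np.linalg.norm(tokens, axis=-1)       # (num_tokens,)
    k = max(1, int(len(tokens) * keep_ratio))      # how many tokens survive
    keep_idx = np.argsort(scores)[-k:]             # top-k by saliency
    keep_idx.sort()                                # preserve spatial order
    return tokens[keep_idx], keep_idx

# Toy usage: 196 patch tokens of width 768, keep half.
tokens = np.random.randn(196, 768).astype(np.float32)
pruned, idx = prune_vision_tokens(tokens, keep_ratio=0.5)
print(pruned.shape)  # (98, 768)
```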
Benchmark highlights
Across five industry-relevant benchmarks, Kestrel VLM-1.5B posts state-of-the-art scores while halving energy per query:
| Benchmark (zero-shot) | SigLIP-2 | Kestrel VLM-1.5B | Delta |
| --- | --- | --- | --- |
| DocVQA | 63.1 | 83.0 | +31.5% |
| ChartQA | 59.7 | 78.0 | +30.7% |
| TextVQA | 61.5 | 80.0 | +30.1% |
| RealWorldQA | 47.0 | 63.0 | +34.0% |
| OCRBench | 52.6 | 69.0 | +31.2% |
Hardware: 1× A100-80 GB SXM, BF16 inference, 2K context.
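The relative deltas follow directly from the two score columns; a quick sanity check:

```python
# Relative gain = (kestrel - baseline) / baseline, per benchmark.
pairs = {"DocVQA": (63.1, 83.0), "ChartQA": (59.7, 78.0),
         "TextVQA": (61.5, 80.0), "RealWorldQA": (47.0, 63.0),
         "OCRBench": (52.6, 69.0)}
for name, (baseline, kestrel) in pairs.items():
    print(f"{name}: +{(kestrel - baseline) / baseline * 100:.1f}%")
```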
What “fusion-space” means for you
Traditional vision encoders collapse spatial and semantic signals into a single vector, then rely on the language decoder to untangle them. Our fusion-space keeps the signals factorized:
- Spatial grid retains absolute XY coordinates, critical for invoices, forms, and structural layouts.
- Dynamic routing lets the decoder attend separately to where and what, boosting fidelity for data extraction and reasoning.
The result is cleaner bounding-box reasoning, higher math OCR accuracy, and more reliable multi-page document QA.
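As a rough mental model (not the actual Kestrel architecture, whose internals are proprietary), a factorized fusion-space can be pictured as carrying separate "where" and "what" channels per vision token, which the decoder can attend over independently. A minimal NumPy sketch with invented shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_sem, d_spa = 196, 704, 64   # illustrative sizes only

# "What": a semantic embedding per patch (stand-in for encoder output).
semantic = rng.standard_normal((num_patches, d_sem))

# "Where": absolute XY coordinates on a 14x14 patch grid, projected
# through a (stand-in) learned spatial embedding.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 14),
                            np.linspace(0, 1, 14)), axis=-1).reshape(-1, 2)
W_spa = rng.standard_normal((2, d_spa))
spatial = grid @ W_spa                      # (196, 64)

# A factorized token keeps both channels side by side instead of summing
# them, so position and content stay separately addressable downstream.
fusion_tokens = np.concatenate([semantic, spatial], axis=-1)
print(fusion_tokens.shape)  # (196, 768)
```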
Enterprise-ready feature matrix
| Capability | What it unlocks |
| --- | --- |
| Zero-shot doc QA | Ask questions across scanned PDFs & photos without external OCR. |
| Chart reasoning | Vectorises SVG paths and numbers via a differentiable rasteriser. |
| Math OCR branch | SWIN-UNet token-dropout head specialised for equations. |
| Streaming SSE | 60 Hz partial tokens for real-time chat & TTS. |
| Observability hooks | Native Prometheus & OpenTelemetry traces. |
| Secure enclaves | Optional SGX mode for regulated workloads. |
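To illustrate the streaming mode, here is a minimal SSE consumer using the `requests` library. The endpoint URL, payload fields, and event schema are hypothetical placeholders, not the published Kestrel API; consult the actual API docs before wiring this up.

```python
import json
import requests

# Hypothetical endpoint and payload -- placeholders, not the real API.
resp = requests.post(
    "https://api.example.com/v1/kestrel/stream",
    json={"image_url": "https://example.com/invoice.png",
          "prompt": "What is the invoice total?"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    stream=True,
)
resp.raise_for_status()

# SSE frames arrive as "data: {...}" lines; partial tokens stream in.
for line in resp.iter_lines(decode_unicode=True):
    if line and line.startswith("data: "):
        event = json.loads(line[len("data: "):])
        print(event.get("token", ""), end="", flush=True)
```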
Power modes & TCO
From 62 W INT8 Ultra-Low-Power to 250 W FP32 High-Perf, Kestrel scales to fit your latency-per-dollar envelope. A two-host A100 cluster can now serve >1M doc-QA calls per day for under $28 in electricity.
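As a back-of-envelope check on that electricity figure (the GPU count per host, power draw, and electricity price below are assumed values, not from this announcement):

```python
# Assume two hosts with 4x A100 each at the 250 W High-Perf tier,
# running 24 h/day at an assumed $0.12/kWh industrial rate.
hosts, gpus_per_host, watts_per_gpu = 2, 4, 250
kwh_per_day = hosts * gpus_per_host * watts_per_gpu * 24 / 1000  # 48 kWh
price_per_kwh = 0.12
print(f"${kwh_per_day * price_per_kwh:.2f}/day")  # $5.76, GPU draw only

# Even with host overhead and cooling inflating the bill to ~4x the raw
# GPU draw, the daily cost stays under the quoted $28 -- plausible under
# these assumptions.
```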
From PoC to production
1. Pilot on our managed sandbox; no GPU provisioning required.
2. Deploy the container image in your VPC; it auto-scales on Kubernetes with HPAs driven by tokens/s.
3. Monitor latency and throughput via the built-in dashboards.
4. Secure sensitive workloads with the SGX enclave binaries.
Our solution architects will guide you through reference modules and charts.
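For a first pilot call, a doc-QA request might look like the sketch below. The endpoint, field names, and response shape are illustrative assumptions for this post, not the documented API:

```python
import base64
import requests

# Hypothetical doc-QA request -- names and shapes are placeholders.
with open("invoice.pdf", "rb") as f:
    doc_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "https://api.example.com/v1/kestrel/doc-qa",
    json={"document": doc_b64,
          "question": "What is the total amount due?",
          "model": "kestrel-vlm-1.5b"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```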
What’s next
The roadmap includes 4K-context slide decks, structured SQL emitters, and RAG-ready fusion-space embeddings for hybrid search. Early adopters get priority access to nightly checkpoints and benchmarking labs.
Experience Kestrel VLM today
Ready to level-up your vision intelligence stack?
→ Book a demo or sign up for the free 14-day trial
Because seeing shouldn’t be believing—it should be understanding.