Introducing Kestrel VLM
- Hina Dixit
- Jul 31
- 3 min read
A next-gen vision–language model that leaps 30% past Google DeepMind's SigLIP-2 and unlocks a new “fusion-space” frontier for enterprise vision AI

TL;DR
- 30% stronger zero-shot accuracy vs. SigLIP-2 on real-world document- and chart-understanding tasks.
- Sub-second first-token latency and 50+ tokens/s throughput on a single A100-80 GB.
- The proprietary Kestrel Vision Encoder projects pixels into a shared representation space (“fusion-space”) where visual semantics align with language at unmatched granularity.
- Turnkey APIs with enterprise-grade observability, SGX enclave mode, and flexible INT4→FP32 power tiers.
- Ready today for doc QA, structured data extraction, math OCR, and streaming vision chatbots.
- Available in 650M, 1B, and 1.5B parameter variants.
Why we built it
Enterprise vision workloads have outgrown generic multimodal models. OCR pipelines fragment, chart parsers break, and throughput collapses when prompts exceed a few pages. We set out to:
- Fuse perception & reasoning in one model, with no brittle OCR-first hops.
- Shrink latency to human-dialogue speeds (< 1 s).
- Lower TCO by squeezing every TOPS out of Ampere- and Hopper-class Tensor Cores.
Kestrel VLM delivers on all three by re-thinking the visual backbone and the inference stack together.
Under the hood
| Layer | Innovation | Pay-off |
| --- | --- | --- |
| Kestrel Vision Encoder | Trained on >200M mixed-modality triples with contrastive + masked-patch pretext tasks; INT8-aware from day one | 30% ↑ accuracy vs. SigLIP-2 at equal model size |
| Fusion-Space Alignment | Projects vision tokens into a fusion space where geometric, textual, and layout cues are co-normalized | Fewer hallucinations; richer grounding |
| Sparsity-aware Runtime | Dynamic token & attention pruning, fused into Triton kernels | 1.33–2× speed-up across INT8 & INT4 tiers |
| Memory-Safe Arena Allocator | Zero-copy buffer reuse; optional SGX | Deterministic latency; secure multi-tenant serving |
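To make the sparsity-aware runtime concrete, here is a minimal NumPy sketch of saliency-based token pruning. The scoring rule, keep ratio, and shapes are illustrative assumptions; Kestrel's production path uses learned scores fused into Triton kernels, not this toy heuristic.

```python
import numpy as np

def prune_vision_tokens(tokens: np.ndarray, keep_ratio: float = 0.5):
    """Drop low-saliency vision tokens before cross-attention.

    tokens: (num_tokens, dim) array of encoder outputs.
    Saliency here is a simple L2-norm proxy -- an illustrative
    stand-in for the learned attention-based scores a sparsity-aware
    runtime would use.
    """
    scores = np.linalg.norm(tokens, axis=-1)       # (num_tokens,)
    k = max(1, int(len(tokens) * keep_ratio))      # how many tokens survive
    keep_idx = np.argsort(scores)[-k:]             # top-k by saliency
    keep_idx.sort()                                # preserve spatial order
    return tokens[keep_idx], keep_idx

# Toy usage: 196 patch tokens of width 768, keep half.
tokens = np.random.randn(196, 768).astype(np.float32)
pruned, idx = prune_vision_tokens(tokens, keep_ratio=0.5)
print(pruned.shape)  # (98, 768)
```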
Benchmark highlights
Across five industry-relevant benchmarks, Kestrel VLM-1.5B posts state-of-the-art scores while halving energy per query:
| Benchmark (zero-shot) | SigLIP-2 | Kestrel VLM-1.5B | Delta |
| --- | --- | --- | --- |
| DocVQA | 63.1 | 83.0 | +31.5% |
| ChartQA | 59.7 | 78.0 | +30.7% |
| TextVQA | 61.5 | 80.0 | +30.1% |
| RealWorldQA | 47.0 | 63.0 | +34.0% |
| OCRBench | 52.6 | 69.0 | +31.2% |
Hardware: 1× A100-80 GB SXM, BF16 inference, 2K context.
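The relative deltas follow directly from the two score columns; a quick sanity check:

```python
# Relative gain = (kestrel - baseline) / baseline, per benchmark.
pairs = {"DocVQA": (63.1, 83.0), "ChartQA": (59.7, 78.0),
         "TextVQA": (61.5, 80.0), "RealWorldQA": (47.0, 63.0),
         "OCRBench": (52.6, 69.0)}
for name, (baseline, kestrel) in pairs.items():
    print(f"{name}: +{(kestrel - baseline) / baseline * 100:.1f}%")
```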
What “fusion-space” means for you
Traditional vision encoders collapse spatial and semantic signals into a single vector, then rely on the language decoder to untangle them. Our fusion-space keeps the signals factorized:
- Spatial grid retains absolute XY coordinates, critical for invoices, forms, and structural layouts.
- Dynamic routing lets the decoder attend separately to where and what, boosting fidelity for data extraction and reasoning.
The result is cleaner bounding-box reasoning, higher math OCR accuracy, and more reliable multi-page document QA.
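As a rough mental model (not the actual Kestrel architecture, whose internals are proprietary), a factorized fusion-space can be pictured as carrying separate "where" and "what" channels per vision token, which the decoder can attend over independently. A minimal NumPy sketch with invented shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_sem, d_spa = 196, 704, 64   # illustrative sizes only

# "What": a semantic embedding per patch (stand-in for encoder output).
semantic = rng.standard_normal((num_patches, d_sem))

# "Where": absolute XY coordinates on a 14x14 patch grid, projected
# through a (stand-in) learned spatial embedding.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 14),
                            np.linspace(0, 1, 14)), axis=-1).reshape(-1, 2)
W_spa = rng.standard_normal((2, d_spa))
spatial = grid @ W_spa                      # (196, 64)

# A factorized token keeps both channels side by side instead of summing
# them, so position and content stay separately addressable downstream.
fusion_tokens = np.concatenate([semantic, spatial], axis=-1)
print(fusion_tokens.shape)  # (196, 768)
```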
Enterprise-ready feature matrix
| Capability | What it unlocks |
| --- | --- |
| Zero-shot doc QA | Ask questions across scanned PDFs & photos without external OCR. |
| Chart reasoning | Vectorises SVG paths and numbers via a differentiable rasteriser. |
| Math OCR branch | SWIN-UNet token-dropout head specialised for equations. |
| Streaming SSE | 60 Hz partial tokens for real-time chat & TTS. |
| Observability hooks | Native Prometheus & OpenTelemetry traces. |
| Secure enclaves | Optional SGX mode for regulated workloads. |
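To illustrate the streaming mode, here is a minimal SSE consumer using the `requests` library. The endpoint URL, payload fields, and event schema are hypothetical placeholders, not the published Kestrel API; consult the actual API docs before wiring this up.

```python
import json
import requests

# Hypothetical endpoint and payload -- placeholders, not the real API.
resp = requests.post(
    "https://api.example.com/v1/kestrel/stream",
    json={"image_url": "https://example.com/invoice.png",
          "prompt": "What is the invoice total?"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    stream=True,
)
resp.raise_for_status()

# SSE frames arrive as "data: {...}" lines; partial tokens stream in.
for line in resp.iter_lines(decode_unicode=True):
    if line and line.startswith("data: "):
        event = json.loads(line[len("data: "):])
        print(event.get("token", ""), end="", flush=True)
```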
Power modes & TCO
From 62 W INT8 Ultra-Low-Power to 250 W FP32 High-Perf, Kestrel scales to fit your latency-per-dollar envelope. A two-host A100 cluster can now serve >1M doc-QA calls per day for under $28 in electricity.
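As a back-of-envelope check on that electricity figure (the GPU count per host, power draw, and electricity price below are assumed values, not from this announcement):

```python
# Assume two hosts with 4x A100 each at the 250 W High-Perf tier,
# running 24 h/day at an assumed $0.12/kWh industrial rate.
hosts, gpus_per_host, watts_per_gpu = 2, 4, 250
kwh_per_day = hosts * gpus_per_host * watts_per_gpu * 24 / 1000  # 48 kWh
price_per_kwh = 0.12
print(f"${kwh_per_day * price_per_kwh:.2f}/day")  # $5.76, GPU draw only

# Even with host overhead and cooling inflating the bill to ~4x the raw
# GPU draw, the daily cost stays under the quoted $28 -- plausible under
# these assumptions.
```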
From PoC to production
1. Pilot on our managed sandbox; no GPU provisioning required.
2. Deploy the container image in your VPC; it auto-scales on Kubernetes with HPAs driven by tokens/s.
3. Monitor latency and throughput via the built-in dashboards.
4. Secure sensitive workloads with the SGX enclave binaries.
Our solution architects will guide you through reference modules and charts.
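For a first pilot call, a doc-QA request might look like the sketch below. The endpoint, field names, and response shape are illustrative assumptions for this post, not the documented API:

```python
import base64
import requests

# Hypothetical doc-QA request -- names and shapes are placeholders.
with open("invoice.pdf", "rb") as f:
    doc_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "https://api.example.com/v1/kestrel/doc-qa",
    json={"document": doc_b64,
          "question": "What is the total amount due?",
          "model": "kestrel-vlm-1.5b"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```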
What’s next
The roadmap includes 4K-context slide decks, structured SQL emitters, and RAG-ready fusion-space embeddings for hybrid search. Early adopters get priority access to nightly checkpoints and benchmarking labs.
Experience Kestrel VLM today
Ready to level-up your vision intelligence stack?
→ Book a demo or sign up for the free 14-day trial
Because seeing shouldn’t be believing—it should be understanding.