
Introducing Kestrel VLM

  • Writer: Hina Dixit
  • Jul 31
  • 3 min read

A next-gen vision–language model that leaps 30% past DeepMind's SigLIP-2 and unlocks a new “fusion-space” frontier for enterprise vision AI



TL;DR

  • 30% stronger zero-shot accuracy vs. SigLIP-2 on real-world document- and chart-understanding tasks.

  • Sub-second first-token latency and 50+ tokens/s throughput on a single A100-80 GB.

  • Proprietary Kestrel Vision Encoder projects pixels into a shared embedding space (“fusion-space”) where visual semantics align with language at unmatched granularity.

  • Turn-key APIs with enterprise-grade observability, SGX enclave mode, and flexible INT4→FP32 power tiers.

  • Ready today for doc QA, structured data extraction, math OCR, and streaming vision chatbots.

  • Available in 650M, 1B, and 1.5B parameter sizes.
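A turn-key doc-QA call could look like the client sketch below. The endpoint URL, model identifier, and payload fields are illustrative placeholders, not a published Kestrel API spec.

```python
import json
import urllib.request

# Hypothetical endpoint -- illustrative only, not the published Kestrel API.
API_URL = "https://api.example.com/v1/kestrel/generate"

def build_request(image_b64: str, question: str) -> urllib.request.Request:
    """Assemble a doc-QA request: one base64-encoded page image plus a question."""
    payload = {
        "model": "kestrel-vlm-1.5b",   # assumed model identifier
        "image": image_b64,            # base64-encoded page scan
        "prompt": question,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("aGVsbG8=", "What is the invoice total?")
print(req.get_method())  # POST, since a body is attached
```

Building the request stays offline; sending it with `urllib.request.urlopen(req)` would return the streamed or batched answer, depending on tier.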


Why we built it

Enterprise vision workloads have outgrown generic multimodal models. OCR pipelines fragment, chart parsers break, and throughput collapses when prompts exceed a few pages. We set out to:

  1. Fuse perception & reasoning in one model—no brittle OCR-first hops.

  2. Shrink latency to human-dialogue speeds (< 1 s).

  3. Lower TCO by squeezing every TOPS out of Hopper-class Tensor Cores.

Kestrel VLM delivers on all three by re-thinking the visual backbone and the inference stack together.


Under the hood

| Layer | Innovation | Pay-off |
| --- | --- | --- |
| Kestrel Vision Encoder | Trained on > 200 M mixed-modality triples; contrastive + masked-patch pretext objectives; INT8-aware from day one | 30% ↑ accuracy vs. SigLIP-2 at equal model size |
| Fusion-Space Alignment | Projects vision tokens into a fusion space where geometric, textual, and layout cues are co-normalized | Fewer hallucinations; richer grounding |
| Sparsity-aware Runtime | Dynamic token & attention pruning, fused into Triton kernels | 1.33–2× speed-up across INT8 & INT4 tiers |
| Memory-Safe Arena Allocator | Zero-copy buffer reuse, optional SGX | Deterministic latency; secure multi-tenant serving |
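The sparsity-aware runtime's selection logic can be pictured with a toy version of dynamic token pruning. The production path runs in fused Triton kernels; the NumPy sketch below only shows the idea, and the keep-ratio is an assumed hyperparameter.

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float = 0.5):
    """Keep the highest-salience vision tokens and drop the rest.

    tokens:     (n, d) array of vision-token embeddings
    scores:     (n,) salience scores (e.g. total attention received)
    keep_ratio: fraction of tokens to keep (assumed hyperparameter)
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the top-n_keep scores; sorting restores the original
    # order within the kept set so positional information survives.
    kept = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[kept], kept

rng = np.random.default_rng(0)
toks = rng.standard_normal((196, 64))   # 14x14 patch grid
sal = np.abs(toks).mean(axis=1)         # toy salience proxy
pruned, idx = prune_tokens(toks, sal, keep_ratio=0.25)
print(pruned.shape)  # (49, 64)
```

Dropping three quarters of low-salience tokens before attention is where the 1.33–2× speed-ups in the table come from: attention cost falls roughly quadratically with token count.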


Benchmark highlights

Across five industry-relevant benchmarks, Kestrel VLM-1.5B posts state-of-the-art scores while halving energy per query:

| Benchmark (zero-shot) | SigLIP-2 | Kestrel VLM-1.5B | Delta |
| --- | --- | --- | --- |
| DocVQA | 63.1 | 83 | +30% |
| ChartQA | 59.7 | 78 | +30% |
| TextVQA | 61.5 | 80 | +30% |
| RealWorldQA | 47.0 | 63 | +30% |
| OCRBench | 52.6 | 69 | +30% |

Hardware: 1 × A100-80 GB SXM, BF16 inference, 2 K context.


What “fusion-space” means for you

Traditional vision encoders collapse spatial and semantic signals into a single vector, then rely on the language decoder to untangle them. Our fusion-space keeps the signals factorized:

  • Spatial grid retains absolute XY coordinates—critical for invoices, forms, and structural layouts.

  • Dynamic routing lets the decoder attend separately to where and what, boosting fidelity for data extraction and reasoning.

The result is cleaner bounding-box reasoning, higher math OCR accuracy, and more reliable multi-page document QA.
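One way to picture the factorization described above is to keep a "where" stream (patch coordinates) separate from a "what" stream (patch semantics) instead of collapsing them into one vector. The sketch below is an illustrative toy, not the actual encoder.

```python
import numpy as np

def fusion_space_tokens(patches: np.ndarray, grid: int):
    """Return factorized (what, where) streams for a square patch grid.

    patches: (grid*grid, d) semantic patch embeddings
    grid:    side length of the patch grid

    A conventional encoder would sum positional and semantic signals;
    keeping them as two streams lets a decoder attend to position and
    content independently (toy illustration of the idea).
    """
    n, _ = patches.shape
    ys, xs = np.divmod(np.arange(n), grid)
    # Absolute XY coordinates normalized to [0, 1] -- the "where" stream.
    where = np.stack([xs, ys], axis=1) / max(grid - 1, 1)
    # L2-normalized semantics -- the "what" stream.
    what = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    return what, where

p = np.random.default_rng(1).standard_normal((16, 8))
what, where = fusion_space_tokens(p, grid=4)
print(what.shape, where.shape)  # (16, 8) (16, 2)
```

Because the "where" stream carries absolute coordinates, a decoder can answer "what is in the top-right cell of this invoice?" without re-deriving layout from a blended embedding.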


Enterprise-ready feature matrix

| Capability | What it unlocks |
| --- | --- |
| Zero-shot doc QA | Ask questions across scanned PDFs & photos without external OCR. |
| Chart reasoning | Vectorises SVG paths and numbers via a differentiable rasteriser. |
| Math OCR branch | SWIN-UNet token-dropout head specialised for equations. |
| Streaming SSE | 60 Hz partial tokens for real-time chat & TTS. |
| Observability hooks | Native Prometheus & OpenTelemetry traces. |
| Secure enclaves | Optional SGX mode for regulated workloads. |
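Streaming SSE responses can be consumed with a minimal parser like the one below. The framing follows the standard WHATWG Server-Sent Events format ("field: value" lines terminated by a blank line); the "token" and "done" event names are assumed conventions, not a documented Kestrel wire format.

```python
def parse_sse(stream_lines):
    """Parse Server-Sent Events lines into (event, data) pairs.

    Follows WHATWG SSE framing: each event is a run of 'field: value'
    lines terminated by a blank line. Event names here are assumptions.
    """
    event, data = "message", []
    for line in stream_lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":  # blank line terminates one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []

frames = [
    "event: token", "data: The", "",
    "event: token", "data: total", "",
    "event: done", "data: [EOS]", "",
]
print(list(parse_sse(frames)))
# [('token', 'The'), ('token', 'total'), ('done', '[EOS]')]
```

At 60 Hz, each partial-token event arrives roughly every 16 ms, which is what makes synchronized TTS feel conversational.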


Power modes & TCO

From 62 W INT8 Ultra-Low-Power to 250 W FP32 High-Perf, Kestrel scales with your latency-per-dollar envelope (page 2, Tables 3.1 & 3.2). A two-host A100 cluster can now serve > 1M doc-QA calls/day for under $28 in electricity.
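The electricity figure above checks out with back-of-envelope arithmetic; the per-host wall power and tariff below are assumptions for illustration, not measured numbers.

```python
def daily_electricity_cost(hosts: int, watts_per_host: float,
                           usd_per_kwh: float) -> float:
    """Electricity cost of running `hosts` servers for 24 hours."""
    kwh = hosts * watts_per_host / 1000 * 24
    return kwh * usd_per_kwh

# Assumed: two hosts at ~550 W wall power each. Even at a pessimistic
# $1.00/kWh tariff, 2 * 0.55 kW * 24 h = 26.4 kWh costs $26.40 -- under $28.
cost = daily_electricity_cost(hosts=2, watts_per_host=550, usd_per_kwh=1.00)
print(f"${cost:.2f}/day")  # $26.40/day
```

At a more typical $0.15/kWh the same cluster runs for about $4/day, so the sub-$28 claim holds with wide margin.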


From PoC to production

  1. Pilot on our managed sandbox—no GPU provisioning required.

  2. Deploy the container image in your VPC; auto-scales on Kubernetes with HPAs driven by tokens/s.

  3. Monitor latency and throughput via built-in dashboards.

  4. Secure sensitive workloads with SGX enclave binaries.

Our solution architects will guide you through reference modules and charts.
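The tokens/s-driven autoscaling in step 2 follows the standard Kubernetes HPA rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The per-replica target below is an assumed setting, chosen to match the TL;DR throughput figure.

```python
import math

def desired_replicas(current: int, tokens_per_s: float,
                     target_tokens_per_s: float) -> int:
    """Kubernetes HPA scaling rule applied to a custom tokens/s metric."""
    return max(1, math.ceil(current * tokens_per_s / target_tokens_per_s))

# Assumed target: 50 tokens/s per replica.
# 4 replicas observing 80 tokens/s each -> ceil(4 * 80 / 50) = 7 replicas.
print(desired_replicas(current=4, tokens_per_s=80, target_tokens_per_s=50))  # 7
```

Scaling on tokens/s rather than CPU keeps replica count tied to actual generation load, which is what matters for first-token latency.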


What’s next

The roadmap includes 4K-context slide decks, structured SQL emitters, and RAG-ready fusion-space embeddings for hybrid search. Early adopters get priority access to nightly checkpoints and benchmarking labs.


Experience Kestrel VLM today

Ready to level up your vision intelligence stack?

→ Book a demo or sign up for the free 14-day trial

Because seeing shouldn’t be believing—it should be understanding.


