WhySoGeek.

Mistral's OCR 4 Lets Companies Run Document AI on Their Own Servers

Mistral released OCR 4 on June 23, a self-hostable document AI model covering 170 languages with bounding boxes and confidence scores.

Sam Carter

On June 23, 2026, French AI company Mistral released OCR 4, a document-intelligence model aimed squarely at the businesses that legally cannot send their paperwork to a cloud API. The pitch is blunt: state-of-the-art document extraction you run entirely inside your own walls.

Quick answer

Mistral OCR 4, released June 23, 2026, is a "structure-aware" document AI model that extracts text, layout, and tables from scans and PDFs across 170 languages, returning paragraph-level bounding boxes, block classification, and per-field confidence scores. Its key differentiator is deployment as a single self-hostable container, so regulated industries (finance, healthcare, law, government) can process sensitive documents without the data leaving their servers. API pricing is $4 per 1,000 pages, or $2 with the 50 percent Batch-API discount.

Key takeaways

  • Mistral released OCR 4 on June 23, 2026, focused on structure-aware document extraction.
  • The model covers 170 languages and returns paragraph-level bounding boxes, block classification and inline confidence scores.
  • It deploys as a single container, so regulated organizations can keep sensitive documents on their own servers.
  • API pricing is $4 per 1,000 pages, dropping to $2 with a 50% Batch-API discount; the Document AI offering is $5 per 1,000 pages.
  • Mistral also expanded enterprise Connectors with scoped API keys, multi-account support and a debugger.

What happened

OCR stands for optical character recognition, the technology that turns images of text, such as scanned forms, PDFs and receipts, into machine-readable data. OCR 4 goes beyond plain text extraction. According to Mistral and coverage from VentureBeat, the model is "structure-aware," meaning it understands document layout. It returns paragraph-level bounding boxes that mark where text sits on the page, classifies blocks of content, and attaches confidence scores to its output so users know how reliable each extraction is.
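To make "structure-aware" concrete, here is a minimal Python sketch of how layout-aware output could be handled once you have it. The article does not describe Mistral's actual response schema, so the `Block` class, its field names, the block-type labels and the sample values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One extracted block. Hypothetical shape, not Mistral's schema."""
    kind: str  # block classification, e.g. "heading", "paragraph", "table"
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left
    confidence: float  # assumed to range from 0.0 to 1.0


def reading_order(blocks: list[Block]) -> list[Block]:
    """Sort blocks top-to-bottom, then left-to-right, using their bounding boxes."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


def blocks_of_kind(blocks: list[Block], kind: str) -> list[Block]:
    """Keep only the blocks with the given classification."""
    return [b for b in blocks if b.kind == kind]


# Made-up sample page, deliberately listed out of order.
page = [
    Block("paragraph", "Payment is due within 30 days.", (50, 200, 550, 240), 0.97),
    Block("heading", "Invoice 2026-0142", (50, 40, 400, 80), 0.99),
    Block("table", "Item | Qty | Price", (50, 100, 550, 180), 0.91),
]

print([b.kind for b in reading_order(page)])  # ['heading', 'table', 'paragraph']
print(blocks_of_kind(page, "table")[0].text)  # Item | Qty | Price
```

Bounding boxes are what make the sort possible: plain OCR text carries no position, so it cannot be reordered or filtered by where it sat on the page.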

Coverage spans 170 languages, and the model deploys as a single container. That last detail is the strategic core of the release. Many enterprises in finance, healthcare, law and government handle documents they are legally or contractually barred from sending to third-party cloud services. A self-hostable model lets them get advanced extraction without the data ever leaving their walls.

Note

"Self-hosted" means running the AI model on your own servers rather than calling someone else's cloud. For regulated industries, this keeps sensitive data, like medical records or financial statements, under the organization's own control and compliance regime.

On pricing, Mistral set the API at $4 per 1,000 pages, with a 50% Batch-API discount that cuts the cost to $2 per 1,000 pages for non-urgent workloads. A broader Document AI offering is priced at $5 per 1,000 pages. The company also rolled out richer enterprise Connectors, including scoped API keys, multi-account connectors, a debugger and connector support in its Vibe Code and Workflows tools.

Pricing and deployment options at a glance

| Option | Price | Best for |
| --- | --- | --- |
| OCR 4 API (standard) | $4 per 1,000 pages | Real-time extraction in apps |
| OCR 4 API (Batch) | $2 per 1,000 pages | Large, non-urgent backlogs |
| Document AI offering | $5 per 1,000 pages | Broader document-processing workflows |
| Self-hosted container | Your own infrastructure cost | Regulated data that cannot leave your servers |

The self-hosted route swaps a clean per-page price for the cost of running the container yourself, which makes financial sense only at high volume or where data-residency rules make the cloud a non-starter.
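A rough way to check that trade-off is to compare the API bill with what self-hosting would cost you each month. The Python sketch below uses the three per-page prices quoted above. The monthly page volume and the $3,000 infrastructure figure are hypothetical, and the article does not say what licensing terms, if any, apply to the container.

```python
import math

# Per-1,000-page prices quoted in the article (USD).
PRICE_PER_1K = {"standard": 4.00, "batch": 2.00, "document_ai": 5.00}


def api_cost(pages: int, tier: str) -> float:
    """API cost in USD for a page count at one of the quoted tiers."""
    return pages / 1000 * PRICE_PER_1K[tier]


def break_even_pages(monthly_infra_usd: float, tier: str) -> int:
    """Monthly page volume at which the API bill equals the self-hosting cost."""
    return math.ceil(monthly_infra_usd / PRICE_PER_1K[tier] * 1000)


pages = 250_000  # hypothetical monthly volume
for tier in PRICE_PER_1K:
    print(f"{tier}: ${api_cost(pages, tier):,.2f}")
# standard: $1,000.00
# batch: $500.00
# document_ai: $1,250.00

# Hypothetical: $3,000 a month for hardware, power and staff time.
print(break_even_pages(3000, "standard"))  # 750000
print(break_even_pages(3000, "batch"))     # 1500000
```

Under those assumed numbers, self-hosting would only beat the standard API on price above 750,000 pages a month, and the Batch discount doubles that threshold. Where compliance rules forbid the cloud, the comparison is moot.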

Why it matters

Document processing is one of the least glamorous but most valuable applications of AI in business. Invoices, contracts, claims forms and statements pile up in every large organization, and turning them into structured data is slow, error-prone work. Tools that extract accurately, in many languages, with confidence scores attached, can save enormous amounts of manual labor.

The self-hosting angle is what makes OCR 4 a competitive play rather than just a model update. It targets exactly the customers that cloud-only providers struggle to serve: regulated enterprises with strict data-residency rules. By shipping as a single container with confidence scoring and layout awareness, Mistral is positioning document AI as a serious enterprise product, not a feature. The emphasis on keeping data in-house also fits a broader movement toward private and on-device AI, which we covered in our look at small models running locally. For the consumer-grade end of document AI, our guide to AI for Excel and spreadsheets with Copilot shows how the same extraction-and-structuring idea shows up in everyday tools.

Why confidence scores change the workflow

Plain OCR gives you text and leaves you to trust or distrust all of it equally. Per-field confidence scores let you automate selectively.

| Confidence on a field | Recommended handling |
| --- | --- |
| High | Accept automatically into your system |
| Medium | Auto-accept but log for spot-checking |
| Low | Route to a human for review before use |

That single capability is what lets an organization automate the easy majority of a document pile while flagging only the ambiguous remainder, which is where the labor savings actually come from.
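The three-tier handling in the table above can be sketched as a small routing function in Python. The cutoffs (0.95 and 0.80), the 0-to-1 confidence scale, the field names and the sample values are all assumptions for illustration. The article does not say how OCR 4 reports confidence, and sensible thresholds would have to be tuned against a labeled sample of your own documents.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    ACCEPT_AND_LOG = "accept_and_log"
    HUMAN_REVIEW = "human_review"


# Hypothetical cutoffs, not values published by Mistral.
HIGH = 0.95
MEDIUM = 0.80


def route(confidence: float) -> Action:
    """Map one field's confidence score to a handling tier."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= HIGH:
        return Action.ACCEPT
    if confidence >= MEDIUM:
        return Action.ACCEPT_AND_LOG
    return Action.HUMAN_REVIEW


def triage(fields: dict[str, tuple[str, float]]) -> dict[Action, list[str]]:
    """Group field names by the action their confidence calls for."""
    buckets: dict[Action, list[str]] = {action: [] for action in Action}
    for name, (_value, confidence) in fields.items():
        buckets[route(confidence)].append(name)
    return buckets


# Made-up extraction result: field name -> (value, confidence).
invoice = {
    "invoice_number": ("2026-0142", 0.99),
    "total": ("1,284.00", 0.87),
    "due_date": ("O7/15/2026", 0.42),  # a letter O misread for a zero
}

for action, names in triage(invoice).items():
    print(action.value, names)
# accept ['invoice_number']
# accept_and_log ['total']
# human_review ['due_date']
```

Only the `due_date` field reaches a person here. The other two flow straight through, with the medium-confidence one logged for later spot-checks.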

What is next

The competition in document AI is heating up, and accuracy plus deployability will decide adoption. Things to watch:

    1. Extraction accuracy. How OCR 4 performs on messy, real-world documents versus clean test sets.
    2. Self-hosting uptake. Whether regulated industries adopt the on-premises container at scale.
    3. Total cost. How the per-page pricing compares once self-hosting infrastructure is factored in.
    4. Ecosystem fit. How well the new Connectors and workflow tools integrate into enterprise stacks.

Frequently asked questions

What is Mistral OCR 4?

It is a document-intelligence model released June 23, 2026, that extracts structured data from documents in 170 languages, returning bounding boxes, block classification and confidence scores, and can be self-hosted.

Why does self-hosting matter?

Many regulated organizations cannot send sensitive documents to third-party cloud services. Running OCR 4 as a single container on their own servers lets them use advanced extraction while keeping data in-house.

How much does OCR 4 cost?

Via the API it is $4 per 1,000 pages, dropping to $2 with the 50% Batch-API discount. The broader Document AI offering is priced at $5 per 1,000 pages.

What are confidence scores good for?

They tell users how reliable each extracted value is, so organizations can automatically flag low-confidence results for human review instead of trusting every output blindly.

OCR is unglamorous, but it touches nearly every business. By making top-tier document AI something companies can run privately, Mistral is betting that control over data is as valuable a feature as accuracy itself.
