OCR a Scanned PDF to Markdown for AI | FileDigest

OCR scanned PDFs into clean, AI-ready Markdown. FileDigest auto-detects scans, runs Docling OCR on warm GPUs, and outputs digest.md plus RAG chunks.

To OCR a scanned PDF to Markdown for AI, drop the file into FileDigest. It automatically detects that the pages are scanned, applies OCR through the Docling engine, and returns clean Markdown plus heading-aware RAG chunks you can paste into ChatGPT or Claude, or feed into a retrieval pipeline. There is no separate "process" button and no OCR toggle to hunt for: scanned pages are recognized and handled for you.

How OCR to Markdown works in FileDigest

Upload a scanned PDF by dropping, pasting, or choosing it, and processing starts immediately. The job routes to a live view where you can watch it run. FileDigest inspects the file, detects pages that have no reliable embedded text (photographed pages, image-only scans, older manuals), and applies OCR during conversion.

Conversion runs on the Docling engine on warm Modal L4 GPUs. The converter and its models load once per warm container, so only the first job on a fresh container pays the model-loading cost, and later jobs on that container skip it. The result is a structured digest rather than a flat text dump, with headings, tables, and layout reconstructed into Markdown.
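The load-once behavior described above is a common pattern for model-backed workers. Here is a minimal Python sketch of that pattern, not FileDigest's actual worker code: StubConverter, get_converter, and handle_job are hypothetical names, and the stub stands in for a real converter whose models are expensive to load.

```python
import functools


class StubConverter:
    """Hypothetical stand-in for a converter whose models are slow to load."""

    loads = 0  # counts how many times the "models" were loaded

    def __init__(self):
        # A real converter would load OCR and layout models here.
        StubConverter.loads += 1

    def convert(self, name):
        # A real converter would return reconstructed Markdown.
        return f"# {name}\n"


@functools.lru_cache(maxsize=1)
def get_converter():
    """Create the converter on first use, then reuse it for the process lifetime."""
    return StubConverter()


def handle_job(name):
    """Process one job with the shared, already-loaded converter."""
    return get_converter().convert(name)
```

In this sketch, calling handle_job twice constructs the converter only once, which is why only the first job on a fresh container would see the loading delay.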

What you get back

Every source produces a full set of artifacts you can view side by side with the original PDF:

A combined digest.md for pasting into an LLM context window or saving as a prompt packet.
A manifest.json recording file metadata, processing outcomes, pages, artifacts, and token estimates.
Per-source Markdown, HTML, Docling DocTags, and Docling JSON.
Heading-contextualized RAG chunks, where each chunk carries the heading path it came from, so a retrieved chunk keeps its section context instead of arriving as an orphaned fragment.

Because the OCR text lands in real Markdown structure, your model sees headings and tables in context, not a wall of recognized characters.
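To illustrate what a heading-contextualized chunk looks like, here is a minimal Python sketch that splits Markdown at headings and tags each chunk with its heading path. It is an assumption-laden illustration, not FileDigest's chunker: the function name chunk_by_heading and the heading_path and text keys are hypothetical, and it only handles ATX-style (#) headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_by_heading(markdown):
    """Split Markdown into chunks, each tagged with the heading path above it."""
    chunks = []
    path = []    # (level, title) pairs for the current heading trail
    buffer = []  # body lines collected since the last heading
    in_fence = False

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading_path": [t for _, t in path], "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside fenced code are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous chunk under the old path
            level = len(match.group(1))
            # Drop headings at the same or a deeper level than the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks
```

With this sketch, a paragraph under "## Setup" then "### Power" in a document titled "# Manual" comes back with the path Manual, Setup, Power, so a retriever can show where the text sits in the document.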

Beyond plain OCR: enrichments and a VLM tier

Scanned documents often carry more than body text. FileDigest offers optional enrichments on top of OCR: formulas converted to LaTeX, code blocks preserved, and picture descriptions generated for images. For difficult scans, dense tables, or pages where standard OCR struggles, a high-accuracy VLM (vision-language model) tier is available for higher-fidelity extraction.

OCR improves extraction, but it does not guarantee error-free text. For high-stakes work, very poor scans, or intricate tables and figures, plan on a human review pass over the output.

More than scanned PDFs

The same one-step pipeline handles PDF, DOCX, PPTX, XLSX, images, TXT, Markdown, HTML, CSV, and ZIP bundles. You can drop a mixed ZIP of scans and native files, and each source gets its own artifacts and its own entry in the manifest.

If you build agents or automations, the agentic REST API mirrors the UI: POST /v1/parse submits a job and GET /v1/jobs/{id} polls it. The API uses Bearer key authentication and idempotency keys, returns RFC 9457 problem+json errors, and publishes an OpenAPI 3.1 spec at /openapi.json and agent docs at /llms.txt.
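As a rough illustration of the API flow, here is a minimal Python sketch that builds the two requests and reads an RFC 9457 error body using only the standard library. Several details are assumptions rather than documented behavior: the base URL is a placeholder, the multipart field name "file" and the Idempotency-Key header name are guesses, and the helper names are hypothetical. Check /openapi.json for the actual request shape before relying on any of it.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://filedigest.example"  # placeholder, not the real host


def build_parse_request(api_key, file_bytes, filename, idempotency_key=None):
    """Build (but do not send) a POST /v1/parse request with one file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the upload field is named "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    req = urllib.request.Request(f"{BASE_URL}/v1/parse", data=body, method="POST")
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    # Assumption: the idempotency key travels in an Idempotency-Key header.
    req.add_header("Idempotency-Key", idempotency_key or str(uuid.uuid4()))
    return req


def build_job_request(api_key, job_id):
    """Build (but do not send) a GET /v1/jobs/{id} polling request."""
    req = urllib.request.Request(f"{BASE_URL}/v1/jobs/{job_id}", method="GET")
    req.add_header("Authorization", f"Bearer {api_key}")
    return req


def read_problem(body):
    """Extract the standard RFC 9457 members from a problem+json body."""
    problem = json.loads(body)
    return {
        "type": problem.get("type", "about:blank"),  # RFC 9457 default
        "title": problem.get("title"),
        "status": problem.get("status"),
        "detail": problem.get("detail"),
        "instance": problem.get("instance"),
    }
```

A caller would pass each request to urllib.request.urlopen, reuse the same idempotency key when retrying a failed submit so the job is not created twice, and run read_problem on the body of any error response.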

Your files stay in private per-user storage behind authenticated ownership checks, and downloads come through private signed links.

FAQ

Do I have to turn OCR on manually?

No. FileDigest detects scanned PDFs automatically and applies OCR as part of conversion. You upload the file and the engine decides what each page needs, so you do not set a flag or pick a mode.

Is the OCR output ready for ChatGPT, Claude, or a RAG pipeline?

Yes. You get a clean digest.md for direct paste into an LLM, plus heading-contextualized RAG chunks for retrieval. A manifest.json with token estimates helps you fit content into a context window before sending it.
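One way to use the token estimates is to pick which sources fit a context budget before building a prompt. The Python sketch below assumes a hypothetical manifest shape, a "sources" list whose entries have "name" and "token_estimate" keys. The real manifest.json schema may differ, so treat the field names as placeholders.

```python
def sources_within_budget(manifest, budget):
    """Greedily select sources, in manifest order, that fit a token budget.

    Assumes a hypothetical schema:
    {"sources": [{"name": str, "token_estimate": int}, ...]}
    """
    selected = []
    used = 0
    for source in manifest.get("sources", []):
        tokens = source.get("token_estimate", 0)
        if used + tokens <= budget:
            selected.append(source["name"])
            used += tokens
    return selected, used
```

Sources that would push the total over the budget are skipped rather than truncated, which keeps each included document whole.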

What if standard OCR misses tables or formulas?

Turn on enrichments to convert formulas to LaTeX, preserve code, and describe pictures, or use the high-accuracy VLM tier for difficult scans and dense tables. For critical documents, review the Markdown against the original, which you can open side by side.

Which plan includes OCR?

OCR and larger jobs are on the paid Pro and Business plans, which also raise token quotas. The Free plan lets you test the upload-to-digest workflow before committing to OCR-heavy processing.