manifest.json for Processed Documents | FileDigest
A manifest.json describing processed documents is a structured index of every source, output artifact, and job outcome. See what FileDigest writes and why.
A manifest.json describing processed documents is a structured index that records what was converted, which artifacts were produced for each source, and how the job turned out. In FileDigest, every conversion job writes a manifest.json alongside a human-readable digest.md, so software and agents can read the result without parsing prose.
What FileDigest writes into manifest.json
When you upload a file by dropping, pasting, or choosing it, processing starts automatically; there is no separate process button. FileDigest converts the file and emits a machine-readable record of the run. The manifest.json captures the job-level facts that downstream code needs:
- Job status and per-source processing outcomes.
- Source files with their file sizes and MIME types.
- Page counts where available.
- The output artifacts generated for each source.
- Token estimates for the produced text.
- Warnings and failures, so partial results are explicit rather than silent.
Where digest.md is the artifact a person reads or pastes into a model, manifest.json is the audit and automation layer that a pipeline reads to know exactly what it received.
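To make the list above concrete, here is a minimal Python sketch of how downstream code might load such a record. The source does not publish the manifest schema, so every field name (status, sources, size_bytes, mime_type, page_count, token_estimate, artifacts, warnings) and the sample values are assumptions made for illustration; check a real manifest.json or the OpenAPI spec for the actual keys.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SourceEntry:
    # Hypothetical field names; they mirror the facts listed above,
    # not a documented FileDigest schema.
    name: str
    size_bytes: int
    mime_type: str
    status: str
    page_count: Optional[int] = None      # only "where available"
    token_estimate: Optional[int] = None
    artifacts: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)


def parse_manifest(raw: str) -> Tuple[str, List[SourceEntry]]:
    """Return (job_status, sources) from a manifest-like JSON string."""
    data = json.loads(raw)
    sources = [
        SourceEntry(
            name=entry["name"],
            size_bytes=entry["size_bytes"],
            mime_type=entry["mime_type"],
            status=entry["status"],
            page_count=entry.get("page_count"),
            token_estimate=entry.get("token_estimate"),
            artifacts=list(entry.get("artifacts", [])),
            warnings=list(entry.get("warnings", [])),
        )
        for entry in data.get("sources", [])
    ]
    return data.get("status", "unknown"), sources


# Made-up sample data, shaped to match the assumed fields above.
SAMPLE = """
{
  "status": "partial",
  "sources": [
    {"name": "report.pdf", "size_bytes": 48213, "mime_type": "application/pdf",
     "status": "succeeded", "page_count": 12, "token_estimate": 5400,
     "artifacts": ["report.md", "report.html"], "warnings": []},
    {"name": "scan.png", "size_bytes": 90211, "mime_type": "image/png",
     "status": "failed", "warnings": ["OCR produced no text"]}
  ]
}
"""

if __name__ == "__main__":
    job_status, entries = parse_manifest(SAMPLE)
    for entry in entries:
        print(job_status, entry.name, entry.status, entry.page_count)
```

Keeping page_count and token_estimate optional reflects the "where available" wording: a consumer should not assume every source carries them.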
The artifacts a manifest points to
FileDigest does not produce a single flattened file. Each job generates a set of artifacts, and the manifest is the index that ties them together: a combined digest.md and the manifest.json itself at the job level, plus per-source Markdown, HTML, Docling DocTags, Docling JSON, and heading-contextualized RAG chunks. In the app you can view these side by side with the original PDF, so you can confirm the conversion matches the source before anything is indexed or sent to a model.
Behind the scenes the conversion runs on Docling using warm Modal L4 GPUs. The converter and models load once per warm container, so repeat jobs in a session are fast. Scanned PDFs are detected automatically and OCR is applied, and optional enrichments can turn formulas into LaTeX, label code, and add picture descriptions, with a high-accuracy VLM tier available for harder documents.
Why a manifest matters for RAG and agents
A folder of one-off conversions is hard to test and debug. A manifest turns a batch into something a RAG pipeline, evaluator, or agent workflow can reason about programmatically. The consumer can see which files succeeded, which failed, what artifacts exist, and roughly how many tokens each one represents before it spends a single embedding call. That makes ingestion repeatable and reviewable instead of a guessing game.
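As one illustration of reasoning about a batch before any embedding call, the sketch below picks successful sources that fit within a token budget. The keys sources, status, name, and token_estimate, and the status value "succeeded", are assumptions for this example rather than FileDigest's documented schema.

```python
from typing import List, Tuple


def plan_embedding(manifest: dict, max_tokens: int) -> Tuple[List[str], int]:
    """Greedily select succeeded sources whose estimates fit the budget.

    Field names and the "succeeded" status value are hypothetical.
    """
    selected: List[str] = []
    total = 0
    for source in manifest.get("sources", []):
        if source.get("status") != "succeeded":
            continue  # failed sources never reach the embedder
        tokens = source.get("token_estimate") or 0
        if total + tokens > max_tokens:
            continue  # skip anything that would exceed the budget
        selected.append(source["name"])
        total += tokens
    return selected, total
```

Because the estimates are approximate, a real pipeline would likely leave some headroom below the hard limit rather than fill the budget exactly.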
Because the manifest enumerates outcomes per source, it also makes human review targeted. You can route only the sources flagged with warnings or failures to a person and let the clean ones flow straight into chunking and indexing.
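The routing described above can be sketched in a few lines of Python. As before, the sources, status, warnings, and name keys and the "succeeded" value are assumed for illustration; substitute whatever the real manifest uses.

```python
from typing import List, Tuple


def split_for_review(manifest: dict) -> Tuple[List[str], List[str]]:
    """Return (clean, needs_review) source names from a manifest-like dict.

    A source needs review if it did not succeed or carries any warning.
    All key names here are hypothetical.
    """
    clean: List[str] = []
    needs_review: List[str] = []
    for source in manifest.get("sources", []):
        flagged = (
            source.get("status") != "succeeded"
            or bool(source.get("warnings"))
        )
        (needs_review if flagged else clean).append(source["name"])
    return clean, needs_review
```

Treating a missing or unrecognized status as "needs review" is a deliberately conservative choice: an unexpected value sends the source to a person instead of silently into the index.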
Accepted inputs and how to retrieve the manifest
FileDigest accepts PDF, DOCX, PPTX, XLSX, images, TXT, Markdown, HTML, CSV, and ZIP bundles. Drop any of them in and the job starts on its own; FileDigest then routes you to a live job view.
For automated workflows there is an agentic REST API: POST /v1/parse to submit work and GET /v1/jobs/{id} to retrieve the result, including the manifest. The API uses Bearer key authentication, publishes an OpenAPI 3.1 spec at /openapi.json, supports idempotency keys, and returns RFC 9457 problem+json errors. Agent-oriented documentation lives at /llms.txt. Storage is private per user, with authenticated ownership checks and private signed downloads, so a manifest and its artifacts are only reachable by their owner.
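A minimal sketch of the retrieval side, using only the Python standard library, might look like the following. The base URL is a placeholder, the job ID and key are hypothetical, and the shape of the response body (including where the manifest sits inside it) is not specified in this page, so the code returns the decoded JSON as-is. The error summary relies only on the standard RFC 9457 members title, status, and detail.

```python
import json
import urllib.error
import urllib.request

# Placeholder host, not the real API base URL.
BASE_URL = "https://api.example.invalid"


def build_job_request(job_id: str, api_key: str,
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET /v1/jobs/{id} request with Bearer authentication."""
    return urllib.request.Request(
        f"{base_url}/v1/jobs/{job_id}",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
        },
        method="GET",
    )


def describe_problem(body: bytes) -> str:
    """Summarize an RFC 9457 problem+json body from its standard members."""
    problem = json.loads(body)
    summary = f"{problem.get('status', '?')} {problem.get('title', 'Unknown error')}"
    detail = problem.get("detail")
    return f"{summary}: {detail}" if detail else summary


def fetch_job(job_id: str, api_key: str, base_url: str = BASE_URL) -> dict:
    """Fetch a job result; raise RuntimeError with the problem summary on HTTP errors."""
    request = build_job_request(job_id, api_key, base_url)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        raise RuntimeError(describe_problem(err.read())) from err
```

Submitting work with POST /v1/parse is omitted here because the request body format and the idempotency header name are not given in this page; the OpenAPI spec at /openapi.json is the place to confirm both.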
FAQ
What is a manifest.json for processed documents?
It is a structured JSON file that describes the outcome of a document-processing job: the source files, their sizes and MIME types, page counts, the output artifacts produced, token estimates, and any warnings or failures. In FileDigest it ships with every job next to the readable digest.md.
How is manifest.json different from digest.md?
digest.md is the human-readable, paste-into-a-model artifact. manifest.json is the automation layer: it is the structured record that lets pipelines, evaluators, and agents understand what was processed and what still needs review.
Can I get the manifest through an API?
Yes. Submit a job with POST /v1/parse and fetch the result, including the manifest, with GET /v1/jobs/{id}. The API uses Bearer key auth and publishes an OpenAPI 3.1 spec at /openapi.json.
Which file types produce a manifest?
Any supported input does: PDF, DOCX, PPTX, XLSX, images, TXT, Markdown, HTML, CSV, and ZIP bundles. Every job, regardless of input type, writes a manifest.json indexing the artifacts generated for each source.