Document to Markdown: Docling vs MarkItDown vs Marker

AI-Ready Markdown: Comparison of Document Converters for Generative Applications


Summary of document parsing software:

File format support: Marker mainly handles PDFs and images; Docling and MarkItDown also cover DOCX, XLSX, PPTX, HTML, etc. (MarkItDown even supports audio and YouTube).

OCR: Docling and Marker are good.

Tables in documents: Docling and Marker are good; MarkItDown loses the table format (though integration with Azure Document Intelligence may help).

Images in documents: Docling and Marker are good; MarkItDown produces mainly plain text (though integration with Azure Document Intelligence may help).
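As a rough illustration of the format-support summary, the lookup below lists which tools are candidates for a given file. The extension sets are my own shorthand for the format families named in this post (for example `.png`/`.jpg` for "images" and `.mp3`/`.wav` for "audio"), not an official support matrix, and `candidate_tools` is a hypothetical helper.

```python
from pathlib import Path

# Assumed extension sets, derived loosely from the format lists in this post.
# They are illustrative only; check each project's documentation for the real list.
IMAGES = {".png", ".jpg", ".jpeg"}
OFFICE_AND_WEB = {".docx", ".xlsx", ".pptx", ".html"}

SUPPORTED = {
    "Docling": {".pdf"} | OFFICE_AND_WEB | IMAGES,
    "MarkItDown": {".pdf"} | OFFICE_AND_WEB | IMAGES
    | {".mp3", ".wav", ".csv", ".json", ".xml", ".zip"},
    "Marker": {".pdf"} | IMAGES,
}


def candidate_tools(filename: str) -> list:
    """Return the tools whose (assumed) extension set contains this file type."""
    ext = Path(filename).suffix.lower()
    return [tool for tool, exts in SUPPORTED.items() if ext in exts]


if __name__ == "__main__":
    for name in ["report.pdf", "slides.pptx", "talk.mp3"]:
        print(name, "->", candidate_tools(name))
```

A lookup like this only answers "can the tool open the file"; the quality differences for OCR, tables, and images still decide which candidate to pick.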

Docling

LlamaIndex integration: DoclingReader and MarkdownNodeParser
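A minimal sketch of how the two LlamaIndex pieces might fit together, assuming the `llama-index-core` and `llama-index-readers-docling` packages are installed. `load_markdown_nodes` is a hypothetical helper name; the imports are done inside the function so the snippet can be loaded without those packages.

```python
from pathlib import Path


def load_markdown_nodes(file_path: str) -> list:
    """Parse a document with DoclingReader and split it into Markdown nodes."""
    source = Path(file_path)
    if not source.is_file():
        raise FileNotFoundError(source)

    # Lazy imports: both packages are optional dependencies of this sketch.
    from llama_index.core.node_parser import MarkdownNodeParser
    from llama_index.readers.docling import DoclingReader

    # DoclingReader exports Markdown by default, which MarkdownNodeParser
    # can then split along the heading structure.
    documents = DoclingReader().load_data(file_path=str(source))
    return MarkdownNodeParser().get_nodes_from_documents(documents)
```

The resulting nodes can then be passed to an index such as a vector store for retrieval.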

Introduces PdfPipelineOptions, with Tesseract OCR, embedded (base64) images, and figure export
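The options above can be sketched roughly as follows, modeled on Docling's "Force full page OCR" and "Figure export" examples listed in the appendix. Module paths and option names reflect Docling v2 as I understand it and may differ between versions; `docling_pdf_to_markdown` is a hypothetical wrapper, and `TesseractCliOcrOptions` assumes the `tesseract` command-line tool is installed.

```python
from pathlib import Path


def docling_pdf_to_markdown(pdf_path: str, force_full_page_ocr: bool = True) -> str:
    """Convert a PDF to Markdown with Docling, using Tesseract OCR."""
    source = Path(pdf_path)
    if not source.is_file():
        raise FileNotFoundError(source)

    # Lazy imports so the sketch loads even when docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling_core.types.doc import ImageRefMode

    options = PdfPipelineOptions()
    options.do_ocr = True
    options.do_table_structure = True
    options.generate_picture_images = True  # keep figure images for export
    options.ocr_options = TesseractCliOcrOptions(
        force_full_page_ocr=force_full_page_ocr
    )

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    document = converter.convert(source).document

    # EMBEDDED is assumed to inline images as base64 data URIs in the Markdown.
    return document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
```

Forcing full-page OCR is mainly useful for scanned PDFs; for digitally born PDFs it is usually slower without adding much.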

Four levels of PDF parsing difficulty: a normal PDF with a table, a scanned PDF with a table, a scanned PDF with more complex tables, and a PDF with mixed content such as text, images, and tables.

🗂️ Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
🧬 Unified, expressive DoclingDocument representation format
↪️ Various export formats and options, including Markdown, HTML, and lossless JSON
🔒 Local execution capabilities for sensitive data and air-gapped environments
🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
🔍 Extensive OCR support for scanned PDFs and images

MarkItDown

MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc). It supports:

PDF
PowerPoint
Word
Excel
Images (EXIF metadata and OCR)
Audio (EXIF metadata and speech transcription)
HTML
Text-based formats (CSV, JSON, XML)
ZIP files (iterates over contents)
… and more!
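For reference, basic usage looks roughly like this, assuming the `markitdown` package is installed. `markitdown_to_markdown` is a hypothetical wrapper around the library's `MarkItDown().convert(...)` call, and the import is lazy so the snippet loads without the package.

```python
from pathlib import Path


def markitdown_to_markdown(file_path: str) -> str:
    """Convert a supported file (PDF, DOCX, XLSX, ...) to Markdown text."""
    source = Path(file_path)
    if not source.is_file():
        raise FileNotFoundError(source)

    # Lazy import: markitdown is an optional dependency of this sketch.
    from markitdown import MarkItDown

    result = MarkItDown().convert(str(source))
    return result.text_content  # the converted Markdown as a string
```

The same call is used for every format in the list above; the library picks a converter based on the file type.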

Marker

Marker converts PDFs and images to markdown, JSON, and HTML quickly and accurately. Compared with Nougat, it is faster, more accurate, and preserves more of the original formatting.

Supports a range of documents in all languages
Formats tables, forms, equations, inline math, links, references, and code blocks
Extracts and saves images
Removes headers/footers/other artifacts
Extensible with your own formatting and logic
Optionally boost accuracy with LLMs
Works on GPU, CPU, or MPS
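A short sketch of Marker's Python API, following the usage shown in the project's README as I recall it, and assuming the `marker-pdf` package is installed. `marker_pdf_to_markdown` is a hypothetical wrapper; the first call downloads and loads several models, so it can be slow.

```python
from pathlib import Path


def marker_pdf_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown with Marker and return the text."""
    source = Path(pdf_path)
    if not source.is_file():
        raise FileNotFoundError(source)

    # Lazy imports: marker pulls in torch and model weights.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(str(source))

    # text_from_rendered also returns metadata and the extracted images;
    # only the Markdown text is kept here.
    text, _, _images = text_from_rendered(rendered)
    return text
```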

Comparison

A table comparing Docling, MarkItDown, and Marker on features, table handling, and processing time (in my experience: MarkItDown > Docling >> Marker).
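To put numbers on the processing-time comparison, a small timing harness along these lines could be used. It is a sketch only: `time_converters` and `rank_by_time` are hypothetical helpers, and each converter is assumed to be a callable that takes a file path and returns Markdown.

```python
import time
from typing import Callable, Dict, List


def time_converters(
    converters: Dict[str, Callable[[str], str]], path: str
) -> Dict[str, float]:
    """Run each converter once on the same file and record wall-clock seconds."""
    timings = {}
    for name, convert in converters.items():
        start = time.perf_counter()
        convert(path)
        timings[name] = time.perf_counter() - start
    return timings


def rank_by_time(timings: Dict[str, float]) -> List[str]:
    """Return converter names ordered from fastest to slowest."""
    return sorted(timings, key=timings.get)


if __name__ == "__main__":
    # Stand-in converters; replace them with wrappers around the real tools.
    demo = {
        "instant": lambda p: "",
        "slow": lambda p: time.sleep(0.05) or "",
    }
    print(rank_by_time(time_converters(demo, "example.pdf")))
```

A single run per tool is noisy, and model loading dominates the first call for the ML-based tools, so repeated runs on several documents give a fairer picture.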

Test

Original document, table

docling table

Preserves the table format, although the unescaped pipes in the |E| column header split the header row into extra cells.

TABLE I. Benchmark instances used in this work . Apart from the the number of vertices, edges, and edge weights, we also include the type of graph as well as its use.

| Graph    |    m |   | E | | W ij   | Type           | Use        |
|----------|------|---------|--------|----------------|------------|
| pm3-8-50 |  512 |    1536 | ± 1    | 3 D torus grid | Experiment |
| G1       |  800 |   19176 | 1      | random         | Experiment |
| G14      |  800 |    4694 | 1      | planar         | Numerics   |
| G23      | 2000 |   19990 | 1      | random         | Numerics   |
| G35      | 2000 |   11778 | 1      | planar         | Experiment |
| G60      | 7000 |   17148 | 1      | random         | Numerics   |

markitdown table

Loses table format

Graph
pm3-8-50
G1
G14
G23
G35
G60

|E| Wij
m
1536 ±1
512
1
19176
800
1
800
4694
1
2000 19990
1
2000 11778
1
7000 17148

Type

Use

3D torus grid Experiment
Experiment
Numerics
Numerics
Experiment
Numerics

random
planar
random
planar
random

TABLE I. Benchmark instances used in this work.
Apart from the the number of vertices, edges, and edge
weights, we also include the type of graph as well as its use.

marker

| Graph    | m    | E     | Wij | Type          | Use        |
|----------|------|-------|-----|---------------|------------|
| pm3-8-50 | 512  | 1536  | ±1  | 3D torus grid | Experiment |
| G1       | 800  | 19176 | 1   | random        | Experiment |
| G14      | 800  | 4694  | 1   | planar        | Numerics   |
| G23      | 2000 | 19990 | 1   | random        | Numerics   |
| G35      | 2000 | 11778 | 1   | planar        | Experiment |
| G60      | 7000 | 17148 | 1   | random        | Numerics   |
|          |      |       |     |               |            |

<span id="page-6-2"></span>TABLE I. Benchmark instances used in this work. Apart from the the number of vertices, edges, and edge weights, we also include the type of graph as well as its use.
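One quick way to check whether a converted table is structurally sound is to count the cells in each row. The sketch below does that for the first rows of the Docling and Marker outputs shown in this test; `column_counts` and `has_consistent_columns` are hypothetical helpers, and the splitter is deliberately naive (it does not handle escaped pipes).

```python
from typing import List

DOCLING_ROWS = """\
| Graph    |    m |   | E | | W ij   | Type           | Use        |
|----------|------|---------|--------|----------------|------------|
| pm3-8-50 |  512 |    1536 | ± 1    | 3 D torus grid | Experiment |"""

MARKER_ROWS = """\
| Graph    | m    | E     | Wij | Type          | Use        |
|----------|------|-------|-----|---------------|------------|
| pm3-8-50 | 512  | 1536  | ±1  | 3D torus grid | Experiment |"""


def column_counts(markdown_table: str) -> List[int]:
    """Count the cells in each row of a pipe-delimited Markdown table."""
    counts = []
    for line in markdown_table.strip().splitlines():
        # Drop the outer pipes, then split on the remaining ones.
        cells = line.strip().strip("|").split("|")
        counts.append(len(cells))
    return counts


def has_consistent_columns(markdown_table: str) -> bool:
    """True when every row has the same number of cells."""
    return len(set(column_counts(markdown_table))) == 1


if __name__ == "__main__":
    print(column_counts(DOCLING_ROWS))  # [8, 6, 6]
    print(column_counts(MARKER_ROWS))   # [6, 6, 6]
```

The Docling header comes out at eight cells against six in the other rows because the pipes in |E| are read as column separators, while the Marker output is consistent.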

The Marker log shows that several models are loaded:

layout model datalab-to/surya_layout on device cpu with dtype torch.float32
Loaded texify model datalab-to/texify on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
Loaded table recognition model datalab-to/surya_tablerec on device cpu with dtype torch.float32
Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
Loaded detection model datalab-to/inline_math_det0 on device cpu with dtype torch.float32

Appendix

Mastering Data Ingestion for RAG Pipelines: A Deep Dive into PDF Loaders in LangChain

In my last post, I explored the power of Retrieval-Augmented Generation (RAG) pipelines and how they help combat AI…

www.linkedin.com

Deep Dive into Microsoft MarkItDown

What is MarkItDown? MarkItDown is a Python package developed by Microsoft, designed to... Tagged with webdev, python…

dev.to

Pdf Parsing Techniques

Tools/Libraries for Pdf parsing

medium.com

docling, unstructured.io, llamaparse

Docling v2

Docling v2 introduces several new features: Understands and converts PDF, MS Word, MS Powerpoint, HTML and several…

ds4sd.github.io

Docling

Docling simplifies document processing, parsing diverse formats - including advanced PDF understanding - and providing…

ds4sd.github.io

Docling Technical Report

This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF…

arxiv.org

Force full page OCR

def main(): input_doc = Path("./tests/data/pdf/2206.01062.pdf") pipeline_options = PdfPipelineOptions()…

ds4sd.github.io

Figure export

def main(): logging.basicConfig(level=logging.INFO) input_doc_path = Path("./tests/data/pdf/2206.01062.pdf") output_dir…

ds4sd.github.io

Commercial

Unstructured.io (also has an open-source version)

LlamaParse


Written by Xin Cheng
