PDF Parsing

PDF parsing extracts structured text content from uploaded PDF files, converting unstructured page images and raw text streams into organized sections that can be searched, indexed, and used by downstream AI features like PICO extraction.

Why PDF parsing matters

PDF files often contain layouts that are difficult to extract accurately: columns, footnotes, captioned figures, and tables all create ambiguous reading orders. AI models trained on clean prose perform poorly on badly extracted PDF text.

ActiveSLR uses two specialized parsing engines to handle the variety of PDF formats encountered in systematic reviews:

  • Docling - An open-source document conversion toolkit that reconstructs document structure, identifies headings, tables, and figures, and outputs clean section-level text.
  • LlamaParse - A cloud-based parsing service that excels at complex multi-column layouts and scanned documents.

[Screenshot: PDF parsing result showing the original PDF on the left and the extracted structured text sections on the right]

How parsing is triggered

PDF parsing runs automatically when you upload a PDF to a study record. You do not need to manually trigger it. The parsing job is queued immediately after upload and typically completes within one to three minutes depending on document length and complexity.

A status indicator on the study card shows whether parsing is pending, in progress, or complete.

Info

Parsing runs once per PDF upload. If you replace a PDF with a new version, parsing runs again on the new file.
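
The lifecycle described above can be pictured as a small state model. This is a minimal sketch only: `StudyRecord`, `ParseStatus`, `upload_pdf`, `start_parse`, and `finish_parse` are hypothetical names chosen for illustration and are not part of any ActiveSLR API. It assumes that each upload queues exactly one job and that a replacement upload resets the status to pending.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ParseStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in progress"
    COMPLETE = "complete"
    ERROR = "error"  # shown on the study card when parsing fails


@dataclass
class StudyRecord:
    """Hypothetical study record, not the actual ActiveSLR data model."""
    study_id: str
    pdf_name: Optional[str] = None
    status: Optional[ParseStatus] = None  # None until a PDF is uploaded
    parse_runs: int = 0  # number of parsing jobs queued so far


def upload_pdf(record: StudyRecord, pdf_name: str) -> None:
    """Attach or replace the PDF; every upload queues exactly one parsing job."""
    record.pdf_name = pdf_name
    record.status = ParseStatus.PENDING
    record.parse_runs += 1


def start_parse(record: StudyRecord) -> None:
    """Move a queued job to in progress."""
    if record.status is not ParseStatus.PENDING:
        raise ValueError("no parsing job is queued for this record")
    record.status = ParseStatus.IN_PROGRESS


def finish_parse(record: StudyRecord, succeeded: bool) -> None:
    """Record the outcome of a running job."""
    if record.status is not ParseStatus.IN_PROGRESS:
        raise ValueError("no parsing job is running for this record")
    record.status = ParseStatus.COMPLETE if succeeded else ParseStatus.ERROR


record = StudyRecord(study_id="study-001")
upload_pdf(record, "trial_v1.pdf")   # first upload queues parsing
start_parse(record)
finish_parse(record, succeeded=True)
upload_pdf(record, "trial_v2.pdf")   # replacing the PDF queues parsing again
print(record.status.value, record.parse_runs)  # prints: pending 2
```

Keeping the run counter on the record makes the "once per upload" rule easy to check: the count changes only inside `upload_pdf`.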

Which engine is used

ActiveSLR selects the parsing engine automatically based on file characteristics:

PDF type                                    Engine used
Digital/born-digital with clear structure   Docling
Complex layouts, multi-column, or scanned   LlamaParse
Scanned without embedded text               LlamaParse (with OCR)

Engine selection needs no configuration on your part.
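
The routing in the table can be expressed as a short decision function. The sketch below is an assumption about how such rules might be ordered, not the actual selection logic: `PdfTraits` and `choose_engine` are hypothetical names, and the signals ActiveSLR really inspects are not documented here.

```python
from dataclasses import dataclass


@dataclass
class PdfTraits:
    """Hypothetical file characteristics used only for this illustration."""
    has_text_layer: bool       # the PDF carries embedded, selectable text
    is_scanned: bool           # pages are scanned images
    has_complex_layout: bool   # multi-column or otherwise complex layout


def choose_engine(traits: PdfTraits) -> str:
    """Map file characteristics to an engine, following the table row by row."""
    if traits.is_scanned and not traits.has_text_layer:
        # Scanned without embedded text: the text must be recovered by OCR.
        return "LlamaParse (with OCR)"
    if traits.is_scanned or traits.has_complex_layout:
        # Complex layouts, multi-column, or scanned documents.
        return "LlamaParse"
    # Born-digital with clear structure.
    return "Docling"


born_digital = PdfTraits(has_text_layer=True, is_scanned=False, has_complex_layout=False)
two_column = PdfTraits(has_text_layer=True, is_scanned=False, has_complex_layout=True)
image_only = PdfTraits(has_text_layer=False, is_scanned=True, has_complex_layout=False)

print(choose_engine(born_digital))  # Docling
print(choose_engine(two_column))    # LlamaParse
print(choose_engine(image_only))    # LlamaParse (with OCR)
```

The most specific case (scanned with no text layer) is tested first so that it is not swallowed by the broader "scanned" rule.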

Benefits of parsed text

Once a PDF is parsed, the extracted text:

  • Enables PICO extraction - The AI reads structured section text rather than raw PDF characters, improving extraction accuracy.
  • Improves in-document search - You can search within PDFs using the parsed text rather than relying on the PDF's embedded text layer.
  • Powers AI-assisted screening - Embedding generation for similarity-based screening suggestions uses parsed text.
  • Supports annotations - Annotations and highlights in the PDF viewer are anchored to parsed text positions.

Tip

For the best downstream AI results, upload high-quality PDF files. Text recovered from scanned PDFs through OCR is less accurate than text from born-digital PDFs, which can lower the quality of PICO extraction and similarity search.
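
To make the in-document search benefit concrete, here is a minimal sketch of searching over section-level text. The `Section` type and `search_sections` function are hypothetical, the sample sections are invented for illustration, and the real search feature may work quite differently (for example, with ranking or stemming).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    """Hypothetical parsed section: a heading plus its body text."""
    heading: str
    text: str


def search_sections(sections: List[Section], query: str) -> List[str]:
    """Return the headings of sections whose text contains the query (case-insensitive)."""
    needle = query.casefold()
    if not needle:
        return []
    return [s.heading for s in sections if needle in s.text.casefold()]


# Invented sample content, for illustration only.
sections = [
    Section("Methods", "Participants were randomised to intervention or control."),
    Section("Results", "The primary outcome improved in the intervention group."),
    Section("Discussion", "Limitations include a small sample size."),
]
print(search_sections(sections, "Intervention"))  # ['Methods', 'Results']
```

Because each hit carries its section heading, a result can point the reader to where in the paper the match occurs rather than to a raw character offset.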

Handling parsing failures

If parsing fails (for example, because of a corrupt PDF or an unsupported file format), the study card shows a parsing error status. You can:

  • Re-upload the PDF if the file may have been corrupted during upload
  • Try uploading a different version of the same document (for example, the publisher PDF instead of a preprint)
  • Proceed without parsed text: the PDF remains viewable in the PDF viewer, and you can screen or extract data manually