PDF Parsing
PDF parsing extracts structured text content from uploaded PDF files, converting unstructured page images and raw text streams into organized sections that can be searched, indexed, and used by downstream AI features like PICO extraction.
Why PDF parsing matters
Standard PDF files often contain text that is difficult to extract accurately - multi-column layouts, footnotes, figures with captions, and tables all create ambiguous reading order. AI models trained on clean prose struggle with poorly extracted PDF text.
ActiveSLR uses two specialized parsing engines to handle the variety of PDF formats encountered in systematic reviews:
- Docling - An open-source document understanding model that reconstructs document structure, identifies headings, tables, and figures, and outputs clean section-level text.
- LlamaParse - A cloud-based parsing service that excels at complex multi-column layouts and scanned documents.
How parsing is triggered
PDF parsing runs automatically when you upload a PDF to a study record. You do not need to trigger it manually. The parsing job is queued immediately after upload and typically completes within one to three minutes, depending on document length and complexity.
A status indicator on the study card shows whether parsing is pending, in progress, or complete.
Parsing runs once per PDF upload. If you replace a PDF with a new version, parsing runs again on the new file.
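The lifecycle above - queued on upload, re-run on replacement, with a visible status - can be sketched in a few lines. This is an illustrative model only; the status names and the `on_pdf_uploaded` function are assumptions for the sketch, not part of the ActiveSLR API.

```python
from enum import Enum

class ParseStatus(Enum):
    """Hypothetical status values mirroring the study-card indicator."""
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    ERROR = "error"

def on_pdf_uploaded(study: dict) -> dict:
    """Sketch: reset status and queue a parse job right after upload.

    Replacing the PDF calls this again, so parsing re-runs on the new file.
    """
    study["parse_status"] = ParseStatus.PENDING
    # A real system would enqueue a background parsing job here.
    return study
```

The key point the sketch captures is that parsing is tied to the upload event itself, so there is never a separate "parse" action for the user to take.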
Which engine is used
ActiveSLR selects the parsing engine automatically based on file characteristics:
| PDF type | Engine used |
|---|---|
| Digital/born-digital with clear structure | Docling |
| Complex layouts, multi-column, or scanned | LlamaParse |
| Scanned without embedded text | LlamaParse (with OCR) |
You do not need to configure the engine selection - it is handled automatically.
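The routing rules in the table above amount to a simple decision function. The sketch below is a hypothetical restatement of those rules, assuming the system can detect an embedded text layer, multi-column layout, and scanned pages; the function name and flags are illustrative, not ActiveSLR internals.

```python
def select_engine(has_text_layer: bool, is_multi_column: bool, is_scanned: bool) -> str:
    """Hedged sketch of the automatic engine-selection rules."""
    if is_scanned and not has_text_layer:
        return "llamaparse+ocr"   # scanned without embedded text
    if is_scanned or is_multi_column:
        return "llamaparse"       # complex layouts or scanned documents
    return "docling"              # born-digital with clear structure
```

Because the rules are checked from most to least restrictive, a scanned PDF that happens to carry an embedded text layer still routes to LlamaParse rather than the OCR path.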
Benefits of parsed text
Once a PDF is parsed, the extracted text:
- Enables PICO extraction - The AI reads structured section text rather than raw PDF characters, improving extraction accuracy.
- Improves in-document search - You can search within PDFs using the parsed text rather than relying on the PDF's embedded text layer.
- Powers AI-assisted screening - Embedding generation for similarity-based screening suggestions uses parsed text.
- Supports annotations - Annotations and highlights in the PDF viewer are anchored to parsed text positions.
For the best downstream AI results, upload high-quality PDF files. Scanned PDFs processed through OCR have lower text accuracy than born-digital PDFs, which affects PICO extraction and similarity search quality.
Handling parsing failures
If parsing fails - for example, due to a corrupt PDF or an unsupported file format - the study card shows a parsing error status. You can:
- Re-upload the PDF if the file may have been corrupted during upload
- Try uploading a different version of the same document (for example, the publisher PDF instead of a preprint)
- Proceed without parsed text - the PDF is still viewable in the PDF viewer, and you can screen or extract data manually
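The recovery options above fall naturally into an ordered fallback: retry the upload, try an alternate version, and only then proceed manually. A minimal sketch, assuming each option can be modeled as a callable that reports success (the strategy names and `recover_from_failure` helper are illustrative):

```python
def recover_from_failure(attempts: dict) -> str:
    """Walk the recovery options in order; fall back to manual work.

    `attempts` maps a strategy name to a zero-argument callable that
    returns True if that strategy produced a successfully parsed PDF.
    """
    for strategy in ("reupload", "alternate_version"):
        action = attempts.get(strategy)
        if action and action():
            return strategy
    # No parsed text: the PDF remains viewable, and screening or data
    # extraction can still be done manually.
    return "manual"
```

The ordering matters: re-uploading the same file is cheapest, so it is tried before hunting down a different version of the document.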