Duplicate Detection

Duplicate detection removes redundant records from your project before screening begins. When you import references from multiple databases, the same study often appears more than once, with slightly different titles, author orderings, or journal abbreviations. The Medusa algorithm identifies these matches so that you can review and remove them.

How Medusa works

Medusa uses embedding-based similarity matching rather than simple text comparison. Each study record is converted into a vector embedding that captures its semantic meaning. Medusa then computes similarity scores between all pairs of records and groups them by confidence level:

  • True duplicates - high-confidence matches where Medusa is certain the two records refer to the same study
  • Possible duplicates - lower-confidence matches that share significant similarity but require a human decision

This approach catches duplicates even when the title or author list differs between database exports, because the underlying semantic content is the same.
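The pairwise scoring and two-level grouping described above can be illustrated with a minimal sketch. It is not Medusa's implementation: `toy_embedding` is a hypothetical stand-in that counts character trigrams (a real embedding model would capture semantic meaning, which trigrams do not), and both threshold values are assumptions, since the actual cutoffs are not documented here.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical cutoffs, chosen only for illustration.
TRUE_DUPLICATE_THRESHOLD = 0.95
POSSIBLE_DUPLICATE_THRESHOLD = 0.80


def toy_embedding(text):
    """Stand-in for a real embedding model: character-trigram counts."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pairs(records):
    """Score every pair of records and group the pairs by confidence level.

    `records` maps a record ID to the text of that record.
    Pairs scoring below the lower threshold are not reported.
    """
    vectors = {rid: toy_embedding(text) for rid, text in records.items()}
    result = {"true": [], "possible": []}
    for id_a, id_b in combinations(sorted(vectors), 2):
        score = cosine_similarity(vectors[id_a], vectors[id_b])
        if score >= TRUE_DUPLICATE_THRESHOLD:
            result["true"].append((id_a, id_b, score))
        elif score >= POSSIBLE_DUPLICATE_THRESHOLD:
            result["possible"].append((id_a, id_b, score))
    return result
```

Calling `classify_pairs` with a dictionary of record texts returns the high-confidence pairs under `"true"` and the pairs needing review under `"possible"`. Because every pair is scored, the number of comparisons grows quadratically with the number of records.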

[Screenshot needed: duplicate detection dashboard showing counts for true duplicates, possible duplicates, and discarded pairs]

Running deduplication

Deduplication runs after you upload your reference files and before screening starts. From the project dashboard, navigate to Deduplication and click Run Medusa. Processing time depends on the number of records in your project.

Info: You can re-run deduplication after uploading additional reference files. Medusa processes all records in the project, including ones from previous uploads.

Duplicate counts

The deduplication dashboard shows three counts:

Count                Meaning
True duplicates      Pairs Medusa classified as definite duplicates
Possible duplicates  Pairs requiring human review
Discarded            Pairs you have dismissed as not being duplicates

After you have reviewed and acted on all pairs, the remaining unique records proceed to screening.
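To make "remaining unique records" concrete, here is a small sketch of how confirmed duplicate pairs could collapse into unique records. It assumes that one record is kept per group of linked duplicates and that dismissed pairs are simply left unmerged. The function name `unique_records` and the rule of keeping the smallest ID are hypothetical; the tool may choose the surviving record differently.

```python
def unique_records(record_ids, confirmed_pairs):
    """Return one representative ID per group of confirmed duplicates."""
    # Each record starts as its own group (union-find parent table).
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in confirmed_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so the result is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return sorted({find(rid) for rid in record_ids})


# r1, r2, and r3 are linked through two confirmed pairs; r4 has no duplicates.
survivors = unique_records(
    ["r1", "r2", "r3", "r4"],
    [("r1", "r2"), ("r2", "r3")],
)
print(survivors)  # ['r1', 'r4']
```

Grouping matters because duplicates can chain: if r1 matches r2 and r2 matches r3, all three refer to the same study even if r1 and r3 were never flagged as a pair.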