Running Deduplication

When you collect references from multiple databases, you inevitably get duplicates: the same study appearing multiple times under slightly different metadata. Deduplication removes these before screening so you do not waste time on the same paper twice. ActiveSLR uses the Medusa algorithm to detect both exact and near-duplicate references.

How Medusa works

Medusa compares studies across multiple fields (title, abstract, DOI, authors, publication year, and journal) and groups references that are likely to be the same paper. It classifies each detected pair as one of the following:

  • True duplicate - high confidence that both records refer to the same study. One record is kept, the other is marked as discarded.
  • Possible duplicate - the records are similar but not identical. These require your review before a decision is made.

Studies that are not matched to any other record remain as independent references and proceed to screening.
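
One way to picture the classification is as two cut-offs on a pair similarity score. The sketch below is illustrative only: the `classify_pair` function, the `PairLabel` names, and both threshold values are assumptions made for this example, not ActiveSLR's actual settings.

```python
from enum import Enum


class PairLabel(Enum):
    TRUE_DUPLICATE = "true duplicate"
    POSSIBLE_DUPLICATE = "possible duplicate"
    NOT_DUPLICATE = "not a duplicate"


# Hypothetical cut-offs chosen for illustration; Medusa's real
# thresholds are not documented on this page.
TRUE_THRESHOLD = 0.95
POSSIBLE_THRESHOLD = 0.80


def classify_pair(similarity: float) -> PairLabel:
    """Map a similarity score in [0, 1] to one of three outcomes."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if similarity >= TRUE_THRESHOLD:
        return PairLabel.TRUE_DUPLICATE      # handled automatically
    if similarity >= POSSIBLE_THRESHOLD:
        return PairLabel.POSSIBLE_DUPLICATE  # needs your review
    return PairLabel.NOT_DUPLICATE           # both records stay independent
```

Under these assumed cut-offs, a pair scoring 0.85 would land in the review list rather than being discarded automatically.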

Info

Medusa uses a combination of exact hash matching (for truly identical records) and fuzzy similarity scoring (for near-duplicates with minor metadata differences). This catches duplicates even when titles differ slightly between databases.
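
To make the two matching styles concrete, here is a minimal Python sketch of the general technique. It is not Medusa's implementation: the choice of fields, the normalization rules, and the helper names (`normalize`, `fingerprint`, `title_similarity`) are all assumptions, and the standard-library `difflib.SequenceMatcher` stands in for whatever similarity measure Medusa actually uses.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def fingerprint(record: dict) -> str:
    """Hash of normalized title, year, and DOI (hypothetical field choice).

    Two records with the same fingerprint count as an exact match.
    """
    key = "|".join(
        normalize(str(record.get(field) or ""))
        for field in ("title", "year", "doi")
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def title_similarity(a: dict, b: dict) -> float:
    """Fuzzy similarity of two titles, from 0.0 (unrelated) to 1.0 (identical)."""
    return SequenceMatcher(
        None, normalize(a["title"]), normalize(b["title"])
    ).ratio()
```

In this sketch, records that differ only in capitalization or punctuation share a fingerprint, while a title with an extra word produces a different fingerprint but still a high similarity score.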

When to run deduplication

Run deduplication after you have finished uploading all your reference files for the project. If you upload additional files later, you will need to run deduplication again to process the new records.

Warning

Do not start screening until deduplication is complete. Studies that are later identified as duplicates cannot be retroactively removed from screening queues.

Checking readiness

Before you run deduplication, check the readiness status on the Deduplication queue page. It indicates whether your files have been fully processed. All uploaded files must have a Processed status before you can proceed.

Screenshot needed
Deduplication queue page showing file readiness status and the Run Deduplication button
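
The readiness rule amounts to an all-files check. A minimal sketch, assuming a hypothetical `UploadedFile` record with a plain status string; the "Processing" label and the data model are assumptions for illustration, not ActiveSLR's internal representation.

```python
from dataclasses import dataclass


@dataclass
class UploadedFile:
    name: str
    status: str  # assumed values: "Processing" or "Processed"


def ready_for_deduplication(files: list[UploadedFile]) -> bool:
    """True only when at least one file exists and every file is Processed."""
    return bool(files) and all(f.status == "Processed" for f in files)
```

A single file that is still processing is enough to make the whole project not ready.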

Running deduplication

Open the Deduplication queue

In the project-level sidebar, click Deduplication queue.

Confirm readiness

Check that all uploaded files show a Processed status. If any file is still processing, wait for it to complete.

Click Run Deduplication

Click the Run Deduplication button to start the Medusa algorithm. Processing time depends on the number of studies; a set of 5,000 references typically completes within a few minutes.

Wait for results

The page updates when deduplication finishes, showing a summary of true duplicates found, possible duplicates requiring review, and the remaining unique study count.

Screenshot needed
Deduplication results summary showing counts of true duplicates, possible duplicates, and unique studies

Reviewing results

True duplicates

True duplicates are handled automatically. One record is kept as the canonical reference; the others are discarded. You can view the list of discarded duplicates at any time from the deduplication results page, but they do not require any action from you.

Possible duplicates

Possible duplicates are listed for your review. For each pair, you can see both records side by side with their metadata. You decide whether to:

  • Confirm as duplicate - discard one record
  • Keep both - treat them as distinct studies that both proceed to screening

Screenshot needed
Possible duplicates review panel showing two study records side by side with metadata highlighted

Tip

When reviewing possible duplicates, focus on the DOI and abstract. If the DOI matches exactly, the records are almost certainly duplicates even if the titles differ slightly. If there is no DOI, compare the abstract text.
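
The DOI-first approach in this tip can be written down as a small decision helper. This is a sketch of the reviewing heuristic only, not something ActiveSLR runs for you: the function names, the returned labels, and the 0.9 abstract threshold are assumptions. It treats DOIs as case-insensitive and strips a leading resolver prefix before comparing.

```python
import re
from difflib import SequenceMatcher


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def suggest_decision(a: dict, b: dict, abstract_threshold: float = 0.9) -> str:
    """Suggest a review outcome for a possible-duplicate pair."""
    doi_a = normalize_doi(a.get("doi") or "")
    doi_b = normalize_doi(b.get("doi") or "")
    # Matching DOIs are the strongest signal, even if titles differ.
    if doi_a and doi_b and doi_a == doi_b:
        return "likely duplicate"

    # Otherwise fall back to the abstract text.
    abs_a = (a.get("abstract") or "").strip().lower()
    abs_b = (b.get("abstract") or "").strip().lower()
    if not abs_a or not abs_b:
        return "needs closer look"  # nothing reliable to compare
    ratio = SequenceMatcher(None, abs_a, abs_b).ratio()
    return "likely duplicate" if ratio >= abstract_threshold else "needs closer look"
```

The helper deliberately never returns "keep both" on its own; when the evidence is weak, the decision stays with you.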

After deduplication

Once deduplication is complete and you have resolved any possible duplicates, your project is ready for the title/abstract screening phase. The screening queue will contain only unique, non-discarded studies.