BioFinderLM V2.0: Europe PMC and Semantic Ranking

This is the second post in the BioFinderLM series. For context and the full comparison across versions, see the project overview. The first version of the tool is covered in the v1.0 post.

What Changed, and Why

Two limitations motivated v2:

PubMed misses a lot. Many important biology papers, especially computational methods, preprints, and interdisciplinary work, aren’t indexed in PubMed, or show up weeks late. Europe PMC covers a significantly broader corpus: 38M+ publications including preprints from bioRxiv, medRxiv, and institutional repositories.

Classification order matters. With 500 or more papers to classify in a single run, LLM API costs add up, and I was wasting budget on low-relevance papers that happened to be recent. I needed a way to rank papers by semantic similarity before classification, so the most relevant ones get processed first.

The solution was Dense Passage Retrieval (DPR): compute an embedding for each paper’s abstract using Gemini’s text-embedding-004 model, then rank all papers by cosine similarity to the query embedding before passing them to the LLM classifier.
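
To make the ranking step concrete, here is a minimal sketch in plain Python. It assumes the embeddings have already been computed (the call to the embedding model is left out), and the names `cosine_similarity`, `rank_papers`, and the `embedding` field are hypothetical, not taken from the BioFinderLM code. The three-dimensional vectors are toy values chosen for illustration.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_papers(query_vec: Sequence[float],
                papers: list,
                top_k: Optional[int] = None) -> list:
    """Return papers sorted by similarity to the query, most similar first.

    Each paper is a dict with a precomputed "embedding" (hypothetical field name).
    A "score" key is added to a copy of each paper; the input is not mutated.
    """
    scored = [(cosine_similarity(query_vec, p["embedding"]), p) for p in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranked = [dict(p, score=s) for s, p in scored]
    return ranked if top_k is None else ranked[:top_k]


# Toy vectors standing in for real abstract embeddings.
papers = [
    {"id": "A", "embedding": [1.0, 0.0, 0.0]},
    {"id": "B", "embedding": [0.0, 1.0, 0.0]},
    {"id": "C", "embedding": [0.7, 0.7, 0.0]},
]
query = [0.9, 0.1, 0.0]

ranked = rank_papers(query, papers)
print([p["id"] for p in ranked])  # ['A', 'C', 'B']
```

Passing `top_k` is one way to cap how many papers reach the LLM classifier, so the budget goes to the highest-scoring ones first.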

Workflow

Figure: BioFinderLM v2 pipeline. DPR ranking sits between retrieval and LLM classification.

Key Improvements Over v1

| Feature | v1.0 | v2.0 |
| --- | --- | --- |
| Data source | PubMed + PMC (NCBI) | Europe PMC REST API |
| Coverage | ~38M citations | 38M+ incl. preprints |
| Paper ordering | Chronological | DPR cosine similarity |
| Articles vs. preprints | Mixed | Separated |
| Output formats | JSON | JSON + DPR-ranked CSV |
| Semantic ranking | ❌ | ✅ Gemini embeddings |
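
As a rough illustration of the data source and the articles vs. preprints comparison in the table above, the sketch below builds a Europe PMC search URL and splits a result list into the two groups. It is not the BioFinderLM implementation: the helper names are hypothetical, and it assumes that preprint records carry the source code `PPR` in the `source` field of each result, which is worth checking against the Europe PMC documentation. No network request is made here; the sample records are made up for the example.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, page_size: int = 100, cursor_mark: str = "*") -> str:
    """Build a search URL that asks for JSON with full ("core") records."""
    params = {
        "query": query,
        "format": "json",
        "resultType": "core",      # "core" records are the ones that include abstracts
        "pageSize": page_size,
        "cursorMark": cursor_mark,  # "*" requests the first page
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def split_preprints(results: list) -> tuple:
    """Separate peer-reviewed articles from preprints.

    Assumption: preprints are marked with source == "PPR".
    """
    articles = [r for r in results if r.get("source") != "PPR"]
    preprints = [r for r in results if r.get("source") == "PPR"]
    return articles, preprints


# Made-up records shaped like entries of the response's result list.
sample_results = [
    {"id": "12345", "source": "MED", "title": "A journal article"},
    {"id": "PPR0001", "source": "PPR", "title": "A preprint"},
]

url = build_search_url("single-cell RNA-seq", page_size=25)
articles, preprints = split_preprints(sample_results)
print(url)
print(len(articles), len(preprints))
```

Keeping the two groups apart makes it possible to rank and report preprints separately from peer-reviewed articles.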

→ Continue to BioFinderLM v2.5, which extends the pipeline with weekly automation and an email digest.