BioFinderLM V2.0: Europe PMC and Semantic Ranking
This is the second post in the BioFinderLM series. For context and the full comparison across versions, see the project overview. The first version of the tool is covered here.
What Changed, and Why
Two limitations motivated v2:
PubMed misses a lot. Many important biology papers, especially computational methods, preprints, and interdisciplinary work, aren’t indexed in PubMed, or show up weeks late. Europe PMC covers a significantly broader corpus: 38M+ publications including preprints from bioRxiv, medRxiv, and institutional repositories.
Classification order matters. With 500 or more papers to classify per run, LLM API costs add up, and I was spending budget on low-relevance papers simply because they were recent. I needed to rank papers by semantic similarity to the query before classification, so that the most relevant ones get processed first.
The solution was Dense Passage Retrieval (DPR): compute an embedding for each paper’s abstract and for the query using Gemini’s text-embedding-004 model, then rank all papers by the cosine similarity between their abstract embedding and the query embedding before passing them to the LLM classifier.
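The ranking step can be sketched roughly as follows. This is a minimal illustration, not the actual BioFinderLM code: it assumes the embeddings have already been computed (the Gemini call is left out), and the names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    papers: Sequence[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Return (paper_id, score) pairs sorted from most to least similar.

    `papers` is a sequence of (paper_id, abstract_embedding) pairs; the
    embeddings are assumed to come from the same model as `query_vec`.
    """
    scored = [(pid, cosine_similarity(query_vec, emb)) for pid, emb in papers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_by_similarity(query_vec, papers)` and passing only the head of the returned list to the classifier is one way to spend the LLM budget on the most relevant papers first. Real embeddings have hundreds of dimensions, so a vectorized library would be the natural choice at scale; plain Python keeps the sketch dependency-free.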
Workflow
BioFinderLM v2 pipeline. DPR ranking sits between retrieval and LLM classification.
Key Improvements Over v1
| Feature | v1.0 | v2.0 |
|---|---|---|
| Data source | PubMed + PMC (NCBI) | Europe PMC REST API |
| Coverage | ~38M citations | 38M+ incl. preprints |
| Paper ordering | Chronological | DPR cosine similarity |
| Articles vs. preprints | Mixed | Separated |
| Output formats | JSON | JSON + DPR-ranked CSV |
| Semantic ranking | ❌ | ✅ Gemini embeddings |
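To make the "Data source" and "Articles vs. preprints" rows concrete, here is a hedged sketch of how a Europe PMC search URL might be built and how returned records could be split. It assumes the public REST search endpoint with its `query`, `format`, `resultType`, `pageSize`, and `cursorMark` parameters, and assumes preprints carry the source code `PPR` in each record's `source` field. The function names are hypothetical and the real pipeline may do this differently.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, page_size: int = 100, cursor_mark: str = "*") -> str:
    """Build a search URL; resultType=core asks for full records incl. abstracts."""
    params = {
        "query": query,
        "format": "json",
        "resultType": "core",
        "pageSize": page_size,
        "cursorMark": cursor_mark,  # "*" requests the first page
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def split_preprints(records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Separate records into (articles, preprints).

    Assumption: a record is a preprint when its "source" field is "PPR".
    """
    articles: List[Dict] = []
    preprints: List[Dict] = []
    for record in records:
        if record.get("source") == "PPR":
            preprints.append(record)
        else:
            articles.append(record)
    return articles, preprints
```

In this sketch the records would come from the `resultList.result` array of the JSON response; fetching and paging are omitted so the example stays runnable offline.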
→ Continue to BioFinderLM v2.5, which extends the pipeline with weekly automation and an email digest.