BioFinderLM V2.0: Europe PMC and Semantic Ranking

Project LLM Bioinformatics NLP Embeddings

Published on Nov 10, 2025


This is the second post in the BioFinderLM series. For context and the full comparison across versions, see the project overview. The first version of the tool is covered here.

What Changed, and Why

Two limitations motivated v2:

PubMed misses a lot. Many important biology papers, especially computational methods, preprints, and interdisciplinary work, aren’t indexed in PubMed, or show up weeks late. Europe PMC covers a significantly broader corpus: 38M+ publications including preprints from bioRxiv, medRxiv, and institutional repositories.

Classification order matters. v1 classified papers in chronological order. With 500 or more papers to classify per run and LLM API costs that add up, I was wasting budget on low-relevance papers that happened to be recent. I needed a way to rank papers by semantic similarity before classification, so the most relevant ones get processed first.

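The fix for the first limitation was switching the data source to the Europe PMC REST API. The fix for the second was Dense Passage Retrieval (DPR): compute an embedding for each paper’s abstract using Gemini’s text-embedding-004 model, embed the search query the same way, then rank all papers by cosine similarity to the query embedding before passing them to the LLM classifier.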
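
To make the ranking step concrete, here is a minimal sketch of sorting papers by cosine similarity to a query vector. It is not the BioFinderLM code: the function names and the "id" and "embedding" keys are hypothetical, and toy 3-dimensional vectors stand in for the real Gemini embeddings, so no API call is made.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, papers):
    """Return copies of the papers, most similar to the query first.

    `papers` is a list of dicts with hypothetical keys "id" and "embedding".
    """
    scored = [
        {**paper, "score": cosine_similarity(query_vec, paper["embedding"])}
        for paper in papers
    ]
    return sorted(scored, key=lambda paper: paper["score"], reverse=True)


# Toy vectors standing in for real abstract and query embeddings.
query = [1.0, 0.0, 0.0]
papers = [
    {"id": "A", "embedding": [0.0, 1.0, 0.0]},
    {"id": "B", "embedding": [0.9, 0.1, 0.0]},
    {"id": "C", "embedding": [0.5, 0.5, 0.0]},
]

ranked = rank_by_similarity(query, papers)
print([paper["id"] for paper in ranked])  # ['B', 'C', 'A']
```

With a ranked list like this, the classifier can work from the top down and stop once the budget for a run is used up, which is the point of ranking before classifying.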

Workflow

Key Improvements Over v1

Feature	v1.0	v2.0
Data source	PubMed + PMC (NCBI)	Europe PMC REST API
Coverage	~38M citations	38M+ incl. preprints
Paper ordering	Chronological	DPR cosine similarity
Articles vs. preprints	Mixed	Separated
Output formats	JSON	JSON + DPR-ranked CSV
Semantic ranking	❌	✅ Gemini embeddings
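
The "Data source" and "Articles vs. preprints" rows can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the helper names are hypothetical, and the response fields used here (a "source" value of "PPR" marking preprints, records under "resultList" and "result") reflect my reading of the Europe PMC search response and should be checked against the current API documentation. The sample records are made up, and the sketch only builds the request URL rather than sending it.

```python
from urllib.parse import urlencode

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, page_size=100, cursor_mark="*"):
    """Build a Europe PMC search URL that asks for JSON results."""
    params = {
        "query": query,
        "format": "json",
        "resultType": "core",  # "core" records carry the abstract text
        "pageSize": page_size,
        "cursorMark": cursor_mark,  # "*" requests the first page
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def split_preprints(results):
    """Separate records into (articles, preprints) by their source code.

    Assumes preprints are tagged with source "PPR".
    """
    articles, preprints = [], []
    for record in results:
        if record.get("source") == "PPR":
            preprints.append(record)
        else:
            articles.append(record)
    return articles, preprints


# Made-up payload shaped like a parsed search response.
sample_response = {
    "resultList": {
        "result": [
            {"id": "111", "source": "MED", "title": "A journal article"},
            {"id": "PPR222", "source": "PPR", "title": "A preprint"},
        ]
    }
}

url = build_search_url("CRISPR screen", page_size=50)
articles, preprints = split_preprints(sample_response["resultList"]["result"])
print(url)
print(len(articles), "article(s),", len(preprints), "preprint(s)")
```

Keeping the split as a pure function over parsed records means it can be tested without network access, and the two lists can then be ranked and exported separately.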

→ Continue to BioFinderLM v2.5, which extends the pipeline with weekly automation and an email digest.
