Scientific Literature Skill

Multi-source scientific literature search, ingestion, and semantic analysis.

Purpose

The scientific-literature skill lets the agent search four major literature sources (three biomedical, plus the cross-disciplinary OpenAlex), ingest papers into the knowledge graph, and run embedding-based semantic search and thematic clustering over the ingested papers. It’s the foundation for building research corpora and tracing hypothesis evolution (see Skills: Literature Trends).

Sources

Source	Type	Coverage
Europe PMC	Full-text search	Life sciences; open-access full text available
PubMed (NCBI)	Citation and abstract search	Biomedical; authoritative MEDLINE index
OpenAlex	Full-text search	Cross-disciplinary; open access; abstracts are reconstructed from an inverted index
bioRxiv / medRxiv	Preprints	Biology and medicine preprints; real-time feed

Prerequisites

TypeDB running (make db-start)
uv installed

Optional (needed only for the semantic commands: embed, search-semantic, cluster):

Qdrant running (make qdrant-start)
VOYAGE_API_KEY environment variable set (from dash.voyageai.com)
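
The prerequisites above can be checked before running any command. The following is a minimal preflight sketch, not part of the skill itself: the function names are hypothetical, and the ports are assumptions based on the usual defaults (1729 for TypeDB, 6333 for Qdrant's HTTP API), which may differ in your setup.

```python
import os
import shutil
import socket


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_prerequisites(env=os.environ, which=shutil.which, probe=port_open):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if which("uv") is None:
        problems.append("uv not found on PATH")
    # Assumed default ports: 1729 (TypeDB) and 6333 (Qdrant HTTP).
    if not probe("localhost", 1729):
        problems.append("TypeDB not reachable (try: make db-start)")
    if not probe("localhost", 6333):
        problems.append("Qdrant not reachable, semantic commands only (try: make qdrant-start)")
    if not env.get("VOYAGE_API_KEY"):
        problems.append("VOYAGE_API_KEY not set, semantic commands only")
    return problems
```

Calling `check_prerequisites()` with no arguments probes the local machine. The injectable `env`, `which`, and `probe` parameters exist only so the check can be exercised without real services.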

Commands

uv run python .claude/skills/scientific-literature/scientific_literature.py <command> [args]

Command	What it does
`count`	Count papers matching a query (without ingesting)
`search`	Search and display results from one or more sources
`ingest`	Search and ingest matching papers into TypeDB
`list`	List papers already ingested into TypeDB
`embed`	Generate Voyage AI embeddings for ingested papers
`search-semantic`	Find papers by semantic similarity to a query
`cluster`	Cluster ingested papers by embedding similarity (HDBSCAN + UMAP)
`export-corpus`	Export a collection of papers to JSON or CSV

Typical Workflow

Build a research corpus:

You: Search PubMed for papers about CRISPR base editing published since 2022
You: How many results are there?
You: Ingest the top 50 papers into my knowledge graph
You: Now embed those papers
You: Cluster them thematically and summarize what you find
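
When the same paper comes back from more than one source during the ingest step, the records have to be collapsed; the Notes section says this is keyed on DOI and PMID. A minimal sketch of that idea follows. It is an illustration under assumptions, not the skill's actual code: the dict-based record shape, the function names, and the DOI normalization rules are all hypothetical.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes; None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def deduplicate(papers):
    """Keep the first record seen for each DOI or PMID (hypothetical record shape)."""
    seen_dois, seen_pmids = set(), set()
    unique = []
    for paper in papers:
        doi = normalize_doi(paper.get("doi"))
        pmid = str(paper["pmid"]).strip() if paper.get("pmid") else None
        is_duplicate = (doi is not None and doi in seen_dois) or (
            pmid is not None and pmid in seen_pmids
        )
        # Register identifiers even for duplicates, so a later record that
        # shares only the other identifier is still recognized.
        if doi:
            seen_dois.add(doi)
        if pmid:
            seen_pmids.add(pmid)
        if not is_duplicate:
            unique.append(paper)
    return unique


papers = [
    {"title": "A", "doi": "10.1000/XYZ"},
    {"title": "A (second source)", "doi": "https://doi.org/10.1000/xyz", "pmid": "123"},
    {"title": "A (third source)", "pmid": 123},
    {"title": "B", "doi": "10.1000/abc"},
    {"title": "C"},  # no identifiers, so it cannot be matched and is kept
]
unique_titles = [p["title"] for p in deduplicate(papers)]
```

Records without a DOI or PMID are kept as-is in this sketch; whether the skill falls back to title matching for such records is not stated in this document.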

Find semantically related papers:

You: I have a paper about prime editing using PE3max.
     Find other papers in my corpus that are semantically similar.
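
Behind a request like this, semantic search compares an embedding of the query against the embeddings of ingested papers and returns the closest ones. The sketch below illustrates the ranking step only, using hand-written three-dimensional vectors and cosine similarity in plain Python. In the skill the vectors come from Voyage AI and the lookup is done by Qdrant; the distance metric, the paper ids, and the function names here are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus, top_k=3):
    """Return up to top_k (paper_id, score) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vec, vec), paper_id)
        for paper_id, vec in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(paper_id, score) for score, paper_id in scored[:top_k]]


# Toy embeddings; real ones have hundreds or thousands of dimensions.
corpus = {
    "paper-prime-editing": [0.9, 0.1, 0.0],
    "paper-base-editing": [0.7, 0.3, 0.1],
    "paper-protein-folding": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
ranked = rank_by_similarity(query, corpus, top_k=2)
```

With these toy vectors the two gene-editing papers rank above the protein-folding paper, which is the behavior the workflow above relies on.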

Schema

Papers are stored as scilit-paper entities (a domain-thing subtype) with attributes:

id (key), title, abstract, doi, pmid, year, journal
content (full text, when available)
cache-path (local file reference for large content)

Papers can be tagged, added to collections (scilit-corpus), and annotated with notes.
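
For code that handles papers outside TypeDB, the attribute list above can be mirrored in a plain Python record. This is only a sketch: the class name, the field types, and which fields are optional are assumptions, since the document lists attribute names but not their value types.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ScilitPaper:
    """Hypothetical in-memory mirror of a scilit-paper entity."""

    id: str                           # key attribute
    title: str
    abstract: Optional[str] = None
    doi: Optional[str] = None
    pmid: Optional[str] = None
    year: Optional[int] = None
    journal: Optional[str] = None
    content: Optional[str] = None     # full text, when available
    cache_path: Optional[str] = None  # maps to the cache-path attribute


paper = ScilitPaper(id="example-1", title="An example paper", year=2023)
record = asdict(paper)
```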

Notes

Deduplication: Papers are deduplicated by DOI and PMID across sources.
Semantic search requires Qdrant: Start it with make qdrant-start before running embed, search-semantic, or cluster. These commands also need VOYAGE_API_KEY to be set.
OpenAlex abstracts: OpenAlex returns abstracts as an inverted index; the skill reconstructs them as plain text automatically.
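
To make the OpenAlex note concrete: the API's abstract_inverted_index field maps each word to the list of positions where it occurs, so rebuilding the text means sorting words by position. The function below is a minimal sketch of that reconstruction; its name is hypothetical and it is not necessarily how the skill implements the step.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex-style {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()  # order by position in the original abstract
    return " ".join(word for _, word in positioned)


example = {
    "Base": [0],
    "editing": [1, 5],
    "enables": [2],
    "precise": [3],
    "genome": [4],
}
text = reconstruct_abstract(example)
```

Here `text` is "Base editing enables precise genome editing". A word that occurs several times, like "editing", appears once as a key with several positions.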