Scientific Literature Skill

Multi-source scientific literature search, ingestion, and semantic analysis.

Purpose

The scientific-literature skill lets the agent search four literature sources (Europe PMC, PubMed, OpenAlex, and bioRxiv/medRxiv), ingest papers into the knowledge graph, and run embedding-based semantic search and thematic clustering over the ingested papers. It is the foundation for building research corpora and tracing hypothesis evolution (see Skills: Literature Trends).

Sources

Source            | Type             | Coverage
------------------|------------------|---------
Europe PMC        | Full-text search | Life sciences; open-access full text available
PubMed (NCBI)     | Full-text search | Biomedical; authoritative MEDLINE index
OpenAlex          | Full-text search | Cross-disciplinary; open access; includes abstracts reconstructed from inverted index
bioRxiv / medRxiv | Preprints        | Biology and medicine preprints; real-time feed

Prerequisites

  • TypeDB running (make db-start)
  • uv installed

Optional (for semantic commands):

  • Qdrant running (make qdrant-start)
  • VOYAGE_API_KEY environment variable set (from dash.voyageai.com)
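
The tool and environment prerequisites can be checked before running any command. The following is a minimal sketch, not part of the skill: check_prerequisites is a hypothetical helper that only tests whether uv is on the PATH and whether VOYAGE_API_KEY is set. It does not probe TypeDB or Qdrant, which still have to be started with the make targets listed above.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of the skill. It does not check
    whether TypeDB or Qdrant are running.
    """
    env = os.environ if env is None else env
    problems = []
    if which("uv") is None:
        problems.append("uv not found on PATH")
    if not env.get("VOYAGE_API_KEY"):
        problems.append("VOYAGE_API_KEY is not set (needed for semantic commands)")
    return problems


for problem in check_prerequisites():
    print("missing:", problem)
```

The env and which parameters exist only so the check can be exercised without touching the real environment.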

Commands

uv run python .claude/skills/scientific-literature/scientific_literature.py <command> [args]

Command         | What it does
----------------|-------------
count           | Count papers matching a query (without ingesting)
search          | Search and display results from one or more sources
ingest          | Search and ingest matching papers into TypeDB
list            | List papers already ingested into TypeDB
embed           | Generate Voyage AI embeddings for ingested papers
search-semantic | Find papers by semantic similarity to a query
cluster         | Cluster ingested papers by embedding similarity (HDBSCAN + UMAP)
export-corpus   | Export a collection of papers to JSON or CSV
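
When the commands are driven from a script rather than from the agent, the invocation pattern above can be assembled programmatically. The sketch below assumes nothing beyond the script path and the eight subcommand names listed above. build_command is a hypothetical helper, and because the arguments each subcommand accepts are not documented here, it passes them through untouched.

```python
import shlex

SCRIPT = ".claude/skills/scientific-literature/scientific_literature.py"

# Subcommand names taken from the command list above.
COMMANDS = {
    "count", "search", "ingest", "list",
    "embed", "search-semantic", "cluster", "export-corpus",
}


def build_command(command, *args):
    """Build the argv list for one skill invocation.

    Extra args are passed through unchanged; this sketch does not
    know which arguments each subcommand accepts.
    """
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return ["uv", "run", "python", SCRIPT, command, *args]


# Print the shell form of the invocation for the `list` subcommand.
print(shlex.join(build_command("list")))
```

The returned list can be handed to subprocess.run as is; rejecting unknown subcommand names early avoids starting uv for a typo.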

Typical Workflow

Build a research corpus:

You: Search PubMed for papers about CRISPR base editing published since 2022
You: How many results are there?
You: Ingest the top 50 papers into my knowledge graph
You: Now embed those papers
You: Cluster them thematically and summarize what you find

Find semantically related papers:

You: I have a paper about prime editing using PE3max.
     Find other papers in my corpus that are semantically similar.
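
Behind a request like this, semantic similarity usually means comparing embedding vectors. The sketch below illustrates the idea with cosine similarity over toy three-dimensional vectors. It is an assumption-laden illustration, not the skill's code: the skill stores Voyage AI embeddings in Qdrant, the similarity metric it uses is not stated here, and the paper ids and vector values are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus, top_k=3):
    """Return (paper_id, score) pairs, most similar first."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (made-up values).
corpus = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.0, 1.0, 0.0],
    "paper-c": [0.9, 0.1, 0.0],
}
ranked = rank_by_similarity([1.0, 0.0, 0.0], corpus, top_k=2)
print([pid for pid, _ in ranked])
```

With these toy values the query vector matches paper-a exactly and paper-c closely, so those two rank first. A vector database performs the same ranking with an index instead of a full scan.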

Schema

Papers are stored as scilit-paper entities (a domain-thing subtype) with attributes:

  • id (key), title, abstract, doi, pmid, year, journal
  • content (full text, when available)
  • cache-path (local file reference for large content)

Papers can be tagged, added to collections (scilit-corpus), and annotated with notes.
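
For code that handles exported or intermediate records, the attribute list above can be mirrored as a small Python data structure. This is only an illustrative sketch: the ScilitPaper class name, the Python types, and which fields are optional are all assumptions, and the TypeDB schema remains the authoritative definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScilitPaper:
    """In-memory mirror of the scilit-paper attributes listed above.

    Illustrative only: types and optionality are assumptions, and the
    hyphenated attribute cache-path becomes cache_path in Python.
    """
    id: str                            # key attribute
    title: str
    abstract: Optional[str] = None
    doi: Optional[str] = None
    pmid: Optional[str] = None
    year: Optional[int] = None
    journal: Optional[str] = None
    content: Optional[str] = None      # full text, when available
    cache_path: Optional[str] = None   # local file reference for large content


# Made-up record for illustration.
paper = ScilitPaper(id="example-1", title="An example title", year=2023)
print(paper.title, paper.year)
```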

Notes

  • Deduplication: Papers are deduplicated by DOI and PMID across sources.
  • Semantic search requires Qdrant: Start it with make qdrant-start before running embed. The VOYAGE_API_KEY environment variable must also be set.
  • OpenAlex abstracts: OpenAlex returns abstracts as an inverted index; the skill reconstructs them as plain text automatically.
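
The first and last notes can be sketched in a few lines of Python. OpenAlex represents an abstract as a mapping from each word to the positions where it occurs, so reconstruction amounts to placing every word back at its positions. The deduplication part rests on assumptions: the skill's actual matching rules are not shown here, so the DOI normalization, the first-record-wins policy, and the function names are all hypothetical.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex-style inverted index
    (word -> list of positions where that word occurs)."""
    if not inverted_index:
        return ""
    positions = {}
    for word, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver prefix (assumed rule)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def deduplicate(papers):
    """Keep the first record seen for each DOI or PMID.

    Records with neither identifier are always kept.
    """
    seen = set()
    unique = []
    for paper in papers:
        keys = set()
        if paper.get("doi"):
            keys.add(("doi", normalize_doi(paper["doi"])))
        if paper.get("pmid"):
            keys.add(("pmid", str(paper["pmid"])))
        if keys & seen:
            continue  # already ingested under one of these identifiers
        seen |= keys
        unique.append(paper)
    return unique


index = {"Base": [0], "editing": [1, 4], "enables": [2], "precise": [3]}
print(reconstruct_abstract(index))  # Base editing enables precise editing

# Made-up identifiers for illustration only.
records = [
    {"source": "pubmed", "doi": "10.1000/example", "pmid": "123"},
    {"source": "openalex", "doi": "https://doi.org/10.1000/EXAMPLE"},
    {"source": "europepmc", "pmid": 123},
    {"source": "biorxiv", "doi": "10.1000/other"},
]
print([r["source"] for r in deduplicate(records)])  # ['pubmed', 'biorxiv']
```

Tracking DOI and PMID as separate keys lets a record that carries only one of the two identifiers still match an earlier record that carried both.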