Design Concepts

The architecture and design principles behind Skillful Alhazen.

Core Philosophy: Curation Over Collection

The system exists to help you make sense of material, not just store it:

  • Collection = passively accumulating information
  • Curation = actively interrogating, extracting meaning, building structured understanding

Every component serves the curation mission. We embody Alhazen’s philosophy: be an enemy of all you read.


The 6-Phase System Design

Skills are designed around a 6-phase framework that traces the full lifecycle from system purpose to reporting. The curation-skill-builder skill tracks design decisions through all six phases using TypeDB.

Phase 1: GOAL           → Define what the system is for and how success is measured
Phase 2: ENTITY-SCHEMA  → TypeDB entity/relation/attribute types for the domain
Phase 3: SOURCE-SCHEMA  → External data sources and how artifacts are captured
Phase 4: DERIVATION     → Ingestion functions (artifact → structured entities)
Phase 5: ANALYSIS       → Query and sensemaking functions (entities → insight)
Phase 6: REPORTING      → Dashboard views and synthesized outputs

In practice, skill development moves through these phases iteratively. Schema gaps discovered during Phase 4 or 5 drive revisions to Phase 2. See Gap Architecture.
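The iterative movement between phases can be sketched as a small state function. This is only an illustration of the loop described above, not code from the curation-skill-builder skill; the names `Phase` and `next_phase` and the `schema_gap_found` flag are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six design phases, numbered as in the list above."""
    GOAL = 1
    ENTITY_SCHEMA = 2
    SOURCE_SCHEMA = 3
    DERIVATION = 4
    ANALYSIS = 5
    REPORTING = 6


def next_phase(current: Phase, schema_gap_found: bool = False) -> Phase:
    """Return the phase to work on next (hypothetical helper).

    A schema gap found during DERIVATION or ANALYSIS sends the work
    back to ENTITY_SCHEMA; otherwise the phases advance in order.
    """
    if schema_gap_found and current in (Phase.DERIVATION, Phase.ANALYSIS):
        return Phase.ENTITY_SCHEMA
    if current is Phase.REPORTING:
        return Phase.REPORTING  # last phase: nothing further to advance to
    return Phase(current + 1)
```

Modelling the loop back to Phase 2 as an ordinary transition, rather than as an error path, matches the idea that schema revision is a normal part of skill development.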

The curation workflow (Phases 3–6 in action)

FORAGING → INGESTION → SENSEMAKING → ANALYSIS → REPORTING
   │             │            │           │          │
Discover      Capture      Extract      Reason     Present
sources       raw          meaning      over       actionable
              content                  knowledge  insight
   ↓             ↓            ↓           ↓          ↓
URLs,         Artifacts,   Fragments,  Synthesis  Dashboards,
APIs,         provenance,  notes,      notes,     answers,
feeds         timestamps   relations   trends     recommendations

The TypeDB Schema Hierarchy

The core schema defines three branches from an abstract root:

identifiable-entity (abstract)         — id, name, description, provenance
├── domain-thing                        — real-world objects
│   ├── scilit-paper                    (scientific-literature namespace)
│   ├── apt-disease                     (alg-precision-therapeutics namespace)
│   ├── jobhunt-position                (jobhunt namespace)
│   └── ...                            (one or more types per skill)
├── collection                          — typed sets of domain objects
│   ├── scilit-corpus
│   ├── jobhunt-search
│   └── ...
└── information-content-entity (abstract) — content-bearing entities
    ├── artifact                        — raw captured content (PDF, HTML, API response)
    ├── fragment                        — extracted piece of an artifact
    └── note                            — agent analysis or annotation

A gene or job posting is not information content. Only artifacts, fragments, and notes carry content (content, cache-path, format attributes). Domain objects are what you reason about; ICEs are what you reason with.

The aboutness relation links notes to any identifiable-entity (the subject role). This is how the agent attaches its analysis to a specific paper, disease, or job posting.
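To make the distinction between domain objects and ICEs concrete, here is a rough Python mirror of the hierarchy. It is an analogy only: the real types are defined in the TypeDB schema, and the class names, the default values, and the `note` role name below are assumptions made for illustration (the source names only the subject role).

```python
from dataclasses import dataclass


@dataclass
class IdentifiableEntity:
    """Abstract root: every entity has these four attributes."""
    id: str
    name: str
    description: str = ""
    provenance: str = ""


@dataclass
class DomainThing(IdentifiableEntity):
    """Real-world object; carries no content attributes."""


@dataclass
class ScilitPaper(DomainThing):
    """Example domain type from the scientific-literature namespace."""


@dataclass
class InformationContentEntity(IdentifiableEntity):
    """Content-bearing entity: artifacts, fragments, and notes."""
    content: str = ""
    cache_path: str = ""
    format: str = ""


@dataclass
class Note(InformationContentEntity):
    """Agent analysis or annotation."""


@dataclass
class Aboutness:
    """Links a note to any identifiable entity (the subject role)."""
    note: Note  # role name assumed for this sketch
    subject: IdentifiableEntity


paper = ScilitPaper(id="paper-1", name="Example paper")
note = Note(id="note-1", name="Reading note", content="Claims need checking.")
link = Aboutness(note=note, subject=paper)
```

Because `subject` is typed against the root, the same relation can point at a paper, a disease, or a job posting without any per-skill variants.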


Separation of Concerns

Layer               What it knows                       What it doesn’t know
Schema              Types, attributes, relations        What data is stored
CLI script          How to read/write TypeDB            What the agent is trying to accomplish
SKILL.md/USAGE.md   What the agent should do and why    Implementation details
Dashboard           How to display data                 How it was produced

This separation means skills can be developed and tested at each layer independently. Schema changes don’t require rewriting the agent instructions; agent instruction updates don’t require schema migrations.


Artifact Cache

Large content artifacts (PDFs, HTML pages, images) are stored on disk rather than inline in TypeDB:

  • Content < 50 KB → stored inline in the TypeDB content attribute
  • Content ≥ 50 KB → stored in ~/.alhazen/cache/<type>/, referenced via cache-path attribute

Cache directories: html/, pdf/, image/, json/, text/, github/

The cache is shared across all skills. A PDF ingested by jobhunt (a resume) uses the same pdf/ directory as papers ingested by scientific-literature.
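The size rule above can be sketched as a single decision function. This is a minimal sketch, not the project's actual cache code: `storage_plan` and its `filename` parameter are hypothetical, and it assumes that 50 KB means 50 × 1024 bytes. It only computes where content would go and does not write anything to disk.

```python
from pathlib import Path

INLINE_LIMIT_BYTES = 50 * 1024  # assumption: 1 KB = 1024 bytes
CACHE_ROOT = Path.home() / ".alhazen" / "cache"
CACHE_TYPES = {"html", "pdf", "image", "json", "text", "github"}


def storage_plan(content: bytes, cache_type: str, filename: str) -> dict:
    """Decide whether content is stored inline or in the on-disk cache.

    Returns a dict with 'inline' (bool) and 'cache_path' (Path or None).
    """
    if cache_type not in CACHE_TYPES:
        raise ValueError(f"unknown cache type: {cache_type!r}")
    if len(content) < INLINE_LIMIT_BYTES:
        # Small content goes into the TypeDB content attribute.
        return {"inline": True, "cache_path": None}
    # Large content goes to ~/.alhazen/cache/<type>/ and is referenced
    # through the cache-path attribute.
    return {"inline": False, "cache_path": CACHE_ROOT / cache_type / filename}
```

Since the directory depends only on the content type, a resume from jobhunt and a paper from scientific-literature would both resolve under the same pdf/ directory.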


Gap-Driven Schema Evolution

When the agent tries to represent a concept that has no place in the current schema, it encounters a schema gap. Such gaps are not failures; they are signals:

A schema gap means the knowledge work has outgrown the model.

The skilllog system detects gaps automatically via a PostToolUse hook that scans for TypeDB error codes ([SYR1], [TYR01], [FEX1], etc.). Gaps are filed as structured GitHub issues, fixed locally against the running TypeDB instance, and merged via PR with human review.
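The scanning step might look roughly like the following. This is a sketch under stated assumptions, not the skilllog implementation: the function name is hypothetical, the regular expression simply matches bracketed codes shaped like the examples above (three capital letters followed by digits), and the sample message in the usage line is made up rather than real TypeDB output.

```python
import re

# Assumption: error codes look like [SYR1] or [TYR01].
ERROR_CODE = re.compile(r"\[([A-Z]{3}\d+)\]")


def find_schema_gap_codes(tool_output: str) -> list[str]:
    """Return the distinct bracketed error codes, in order of appearance."""
    seen: list[str] = []
    for code in ERROR_CODE.findall(tool_output):
        if code not in seen:
            seen.append(code)
    return seen


# Made-up tool output for illustration only.
codes = find_schema_gap_codes("write failed: [SYR1] ... caused by [TYR01] ...")
```

A real hook would presumably also record the failing query and context so that the resulting GitHub issue is actionable; that part is not shown here.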

This gives the knowledge graph a mechanism for organic growth driven by actual use, not by top-down schema planning. See Gap Architecture for the full workflow.


TypeDB vs. Other Databases

TypeDB is the ontological foundation — not just a store. Key properties:

  • Schema-first: types, relations, and constraints are defined before data is inserted. TypeDB will reject writes that violate the schema, making gaps detectable at insertion time.
  • Pattern matching: queries express structural patterns across the graph, not just key lookups. The agent can ask “find all diseases whose mechanisms involve gene X and are associated with HPO phenotype Y” in a single TypeQL query.
  • Cross-skill queries: because all skills share the same database and schema hierarchy, the agent can reason across skill boundaries. A job posting that mentions a disease can be connected to the agent’s disease mechanism research.
  • Ontological hierarchy: the identifiable-entity → domain-thing → namespace type hierarchy means generic operations (tagging, noting, collecting) work uniformly across all domain types.