jml 9bfa626203 docs: reorganize documentation structure for clarity

Major documentation restructure to improve discoverability and reduce duplication.

## Changes

**Deleted (Archived/Consolidated)**:
- Removed duplicate getting started guides
- Archived outdated planning documents
- Consolidated corpus and configuration docs
- Removed obsolete vision/spec files (superseded by vision.md)
- Cleaned up scrapyard and old PDFs

**New Structure**:
- docs/about/ - Project overview and introduction
- docs/guides/ - User guides (moved from root)
- docs/specs/ - Technical specifications
- docs/sdk/ - SDK documentation (Go)
- docs/references/ - API references
- docs/archive/ - Archived historical docs
- applications/aphoria/docs/advanced/ - Advanced topics
- applications/aphoria/docs/reference/ - CLI reference
- applications/aphoria/docs/archive/ - Archived aphoria docs

**Updated**:
- README.md - New root README with clear navigation
- CONTRIBUTING.md - Contribution guidelines
- CLAUDE.md - Updated paths to new structure
- roadmap.md - Added recent completions

## Files Changed
- 57 files changed
- 1,977 insertions(+)
- 961 deletions(-)

**Net change**: +1,016 lines (added CONTRIBUTING.md, README.md, reorganized content)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-11 07:33:40 +00:00


Ontology Layer + Medical Vertical Implementation Plan

Goal: Build the ontology layer that defines how claims are structured, extracted, and resolved — using Pharma/Medical as the first vertical to pressure-test every decision.

Philosophy: Build medical end-to-end first, extract the reusable ontology layer as patterns emerge. Don't abstract prematurely.

Outcome: A working claim extraction pipeline that can ingest FDA labels, clinical trial data, and eventually social media — producing properly structured assertions that conflict-detect correctly.

Context: What Exists Today

Component	Status	Notes / Gap
Core StemeDB	✅ Complete	Storage, lenses, conflict detection work
SkepticLens	✅ Complete	Can detect conflicts if subjects/predicates match
`latent/ingest-fda`	🟡 Prototype	Fetches FDA labels, outputs flat JSONL (not integrated)
Aphoria extractors	✅ Complete	Pattern-based extraction for code (14 extractors)
Disputed LLM extraction	🟡 Early	Generic SPO extraction, no domain schema
Ontology definitions	❌ Missing	No formal way to define subject patterns, predicate schemas

The gap: We can store claims. We cannot yet structure claims for a domain in a way that guarantees conflicting claims land on the same subject and predicate, which is what conflict detection needs.

The Core Insight

When Journal A says "Semaglutide reduced HbA1c by 1.5%" and Journal B says "Semaglutide reduced HbA1c by 0.8%", we need:

Subject:   Semaglutide:Type2Diabetes   # Drug:Indication compound key
Predicate: hba1c_change_percent        # Reusable across drug:indication pairs
Object:    -1.5                         # Journal A's value; Journal B's claim carries -0.8

The subject granularity depends on the predicate type:

Efficacy predicates → Subject is {Drug}:{Indication}
Safety predicates → Subject is {Drug} (indication-agnostic)
Mechanism predicates → Subject is {Drug} or {Drug}:{Pathway}

This is what the ontology layer needs to express.
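
As a minimal sketch of why the shared key matters, the snippet below groups claims by (subject, predicate) and reports keys that carry more than one distinct value. The record fields and the `find_collisions` helper are assumptions made for illustration; they are not StemeDB's actual assertion schema or SkepticLens logic.

```python
from collections import defaultdict

# Hypothetical claim records; field names are illustrative only.
claims = [
    {"source": "Journal A", "subject": "Semaglutide:Type2Diabetes",
     "predicate": "hba1c_change_percent", "object": -1.5},
    {"source": "Journal B", "subject": "Semaglutide:Type2Diabetes",
     "predicate": "hba1c_change_percent", "object": -0.8},
    {"source": "FDA label", "subject": "Semaglutide",
     "predicate": "has_boxed_warning", "object": True},
]


def find_collisions(claims):
    """Group claims by (subject, predicate); keep keys with >1 distinct object."""
    by_key = defaultdict(list)
    for claim in claims:
        by_key[(claim["subject"], claim["predicate"])].append(claim)
    return {
        key: group
        for key, group in by_key.items()
        if len({c["object"] for c in group}) > 1
    }


collisions = find_collisions(claims)
for (subject, predicate), group in collisions.items():
    print(subject, predicate, [c["object"] for c in group])
```

If the two journals had been stored under different subjects (say, "Ozempic" and "Semaglutide:Type2Diabetes"), the grouping would find nothing, which is the failure mode the ontology layer is meant to prevent.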

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Ontology Layer (New)                        │
├─────────────────────────────────────────────────────────────────┤
│  Domain Definition (YAML/TOML)                                  │
│  ├── Entity Types (Drug, Condition, Biomarker...)               │
│  ├── Predicate Schemas (subject pattern → predicates)           │
│  ├── Source Hierarchy (Tier 0-5)                                │
│  └── Default Lens per predicate type                            │
├─────────────────────────────────────────────────────────────────┤
│  Extraction Pipeline                                            │
│  ├── Source Adapters (FDA API, PubMed, Reddit...)               │
│  ├── Claim Extractor (LLM-based, schema-guided)                 │
│  ├── Normalizer (maps raw → ontology subjects/predicates)       │
│  └── Validator (checks claim conforms to schema)                │
├─────────────────────────────────────────────────────────────────┤
│  StemeDB Core (Existing)                                        │
│  ├── Assertion Storage                                          │
│  ├── Conflict Detection (SkepticLens)                           │
│  └── Query/Resolution                                           │
└─────────────────────────────────────────────────────────────────┘

Week-by-Week Implementation Plan

Week 1: Domain Definition Schema

Goals:

Define how domains express their ontology
Create the pharma domain definition as the first instance
No extraction yet — just the schema that extraction will target

Tasks:

Design domain definition format (applications/ontology/)
- Choose format: YAML or TOML (TOML aligns with Rust ecosystem)
- Define schema for entity types, predicate schemas, source tiers

Create pharma domain definition (applications/ontology/domains/pharma.toml)

[domain]
name = "pharma"
version = "0.1.0"

[entity_types]
Drug = { aliases = ["medication", "compound", "molecule"] }
Indication = { aliases = ["condition", "disease", "disorder"] }
Biomarker = { aliases = ["endpoint", "measure"] }
Population = { aliases = ["cohort", "patient_group"] }

[predicate_schemas.efficacy]
subject_pattern = "{Drug}:{Indication}"
predicates = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
default_lens = "Skeptic"

[predicate_schemas.safety]
subject_pattern = "{Drug}"
predicates = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]
default_lens = "Authority"

[predicate_schemas.mechanism]
subject_pattern = "{Drug}"
predicates = ["target_receptor", "half_life_hours", "bioavailability_percent"]
default_lens = "Recency"

[source_hierarchy]
tier0 = ["FDA_Label", "EMA_Approval"]
tier1 = ["Phase3_RCT", "Meta_Analysis"]
tier2 = ["Observational_Study", "Real_World_Evidence"]
tier3 = ["Case_Report", "Expert_Opinion"]
tier4 = ["Patient_Forum", "Social_Media"]

Implement domain parser (applications/ontology/src/domain.rs)
- Parse TOML into Rust structs
- Validate schema consistency (no circular refs, valid patterns)
- Unit tests for parsing
Subject builder utility
- Given entity values + predicate schema, build correct subject string
- build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"}) → "Semaglutide:Type2Diabetes"
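
The subject builder could look roughly like the sketch below. The plan places the real utility in the Rust ontology crate; Python is used here only to illustrate the behavior. The `SUBJECT_PATTERNS` table is copied by hand from the pharma.toml draft above rather than parsed, and the rejection of missing or extra entities is an assumed policy, not a settled design.

```python
import re

# Subject patterns copied from the pharma.toml draft; in the planned crate
# these would come from the parsed domain definition instead.
SUBJECT_PATTERNS = {
    "efficacy": "{Drug}:{Indication}",
    "safety": "{Drug}",
    "mechanism": "{Drug}",
}

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def build_subject(schema: str, entities: dict) -> str:
    """Fill a schema's subject pattern, rejecting missing or extra entities."""
    if schema not in SUBJECT_PATTERNS:
        raise KeyError(f"unknown predicate schema: {schema}")
    pattern = SUBJECT_PATTERNS[schema]
    required = PLACEHOLDER.findall(pattern)
    missing = [name for name in required if name not in entities]
    extra = [name for name in entities if name not in required]
    if missing or extra:
        raise ValueError(f"schema {schema!r}: missing={missing}, extra={extra}")
    # Replace each {Entity} placeholder with the supplied value.
    return PLACEHOLDER.sub(lambda m: entities[m.group(1)], pattern)


subject = build_subject(
    "efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"}
)
print(subject)
```

Rejecting extra entities means a safety claim cannot silently pick up an indication, which keeps safety subjects indication-agnostic as described in the core insight.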

Deliverables:

applications/ontology/ crate with domain definition parsing
domains/pharma.toml as the reference implementation
Subject builder that enforces schema compliance

Foundation this enables:

Extractors know what shape claims should have
Validators can check claims against schema
Future domains just add a new .toml file

Week 2: FDA Label Extraction (Tier 0)

Goals:

Upgrade latent/ingest-fda to produce schema-compliant assertions
Extract structured claims, not raw text blobs
Write directly to StemeDB (not JSONL files)

Tasks:

Refactor FDA ingestor (latent/ingest-fda/)
- Load pharma domain definition
- Use LLM (Claude) to extract structured claims from label sections
- Map extracted claims to ontology predicates

LLM extraction prompt for FDA labels

Given this FDA label section for {drug_name}, extract claims as structured data.

For ADVERSE REACTIONS sections, extract:
- Predicate: {symptom}_incidence
- Object: decimal (0.0-1.0)
- Quote: exact text supporting this

For BOXED WARNINGS, extract:
- Predicate: has_boxed_warning
- Object: boolean (true)
- Quote: warning text

Return JSON array of claims.

Normalizer module (latent/ingest-fda/normalizer.py)
- Map drug names to canonical form (Ozempic → Semaglutide)
- Map symptom names to canonical predicates (nausea and vomiting stay distinct predicates)
- Use drug synonym database (RxNorm or similar)
StemeDB client integration
- Replace JSONL output with HTTP calls to StemeDB API
- Sign assertions with ingestor's Ed25519 key
- Set source_class: Regulatory (Tier 0)
Integration test
- Ingest semaglutide FDA label
- Query StemeDB for Semaglutide safety predicates
- Verify structured claims exist with correct schema
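
The normalizer step above might be sketched as follows. Everything here is an assumption for illustration: the small synonym table stands in for RxNorm, the `symptom` field in the LLM output is a hypothetical shape, and the sample input is made up rather than taken from a real label.

```python
import json

# Illustrative synonym table; the plan expects RxNorm or a similar database.
DRUG_SYNONYMS = {
    "ozempic": "Semaglutide",
    "wegovy": "Semaglutide",
    "mounjaro": "Tirzepatide",
    "victoza": "Liraglutide",
}


def normalize_drug(name: str) -> str:
    """Map a brand or generic name to the canonical subject form."""
    key = name.strip().lower()
    return DRUG_SYNONYMS.get(key, key.capitalize())


def normalize_predicate(symptom: str) -> str:
    """Turn a symptom phrase into a '{symptom}_incidence' predicate name."""
    return "_".join(symptom.strip().lower().split()) + "_incidence"


def to_assertions(drug_name: str, llm_output: str) -> list:
    """Convert the LLM's JSON array of adverse-reaction claims to assertions."""
    subject = normalize_drug(drug_name)  # safety subjects are just {Drug}
    assertions = []
    for claim in json.loads(llm_output):
        incidence = float(claim["object"])
        if not 0.0 <= incidence <= 1.0:
            raise ValueError(f"incidence out of range: {incidence}")
        assertions.append({
            "subject": subject,
            "predicate": normalize_predicate(claim["symptom"]),
            "object": incidence,
            "quote": claim["quote"],
            "source_class": "Regulatory",  # Tier 0
        })
    return assertions


# Made-up LLM output for illustration; values are not from a real label.
sample = json.dumps([
    {"symptom": "Nausea", "object": 0.2, "quote": "placeholder quote A"},
    {"symptom": "Vomiting", "object": 0.1, "quote": "placeholder quote B"},
])
result = to_assertions("Ozempic", sample)
print([a["predicate"] for a in result])
```

Signing and the HTTP write to StemeDB are left out on purpose, since the plan does not specify the client API.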

Deliverables:

FDA ingestor writes schema-compliant assertions to StemeDB
At least 3 drugs (semaglutide, tirzepatide, liraglutide) ingested
Integration test proving round-trip

Foundation this enables:

Tier 0 (regulatory) baseline established
Lower-tier sources can now conflict with FDA data
Pattern for other structured source adapters

Week 3: Clinical Trial Extraction (Tier 1)

Goals:

Extract efficacy claims from clinical trial publications
These will conflict with each other (the whole point of Episteme)
Demonstrate SkepticLens showing real disagreement

Tasks:

PubMed/PMC source adapter (latent/ingest-pubmed/)
- Fetch abstracts + full text for GLP-1 trials
- Filter by clinical trial registration (NCT numbers)
- Extract study metadata (sample size, design, journal)

LLM extraction prompt for efficacy claims

Given this clinical trial abstract/results for {drug_name} in {indication}:

Extract efficacy claims:
- Subject: {drug}:{indication}
- Predicate: {biomarker}_change_percent | remission_rate | etc.
- Object: numeric value
- Confidence: based on sample size and study design
- Quote: exact text

Include comparator arm if mentioned (vs placebo, vs {other_drug}).

Confidence scoring heuristics
- Phase 3 RCT: base confidence 0.9
- Phase 2: base confidence 0.7
- Observational: base confidence 0.5
- Adjust by sample size, blinding, journal impact factor
Conflict demonstration
- Ingest 5+ trials with varying HbA1c results for semaglutide
- Query with SkepticLens
- Show conflict_score > 0 and multiple competing claims
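
One way the confidence heuristics could be combined is sketched below. The base values come from the list above; the sample-size and blinding adjustments are invented placeholders to show the shape of the calculation, and journal impact factor is omitted because the plan gives no rule for it.

```python
import math

# Base confidence per study design, taken from the heuristics above.
BASE_CONFIDENCE = {
    "phase3_rct": 0.9,
    "phase2": 0.7,
    "observational": 0.5,
}


def confidence(design: str, sample_size: int, blinded: bool) -> float:
    """Scale the design's base confidence by sample size and blinding."""
    base = BASE_CONFIDENCE[design]
    # Assumed rule: studies with n >= 1000 keep the full base value;
    # smaller studies are scaled down by log10(n) / 3.
    size_factor = min(1.0, math.log10(max(sample_size, 1)) / 3)
    # Assumed rule: unblinded studies lose 10%.
    blinding_factor = 1.0 if blinded else 0.9
    return round(base * size_factor * blinding_factor, 3)


large_rct = confidence("phase3_rct", 1847, blinded=True)
small_phase2 = confidence("phase2", 100, blinded=True)
open_observational = confidence("observational", 12000, blinded=False)
print(large_rct, small_phase2, open_observational)
```

Whatever the final adjustments are, keeping them multiplicative and capped at 1.0 ensures no study can exceed the base confidence of its design.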

Deliverables:

PubMed ingestor producing Tier 1 assertions
At least 10 clinical trial papers ingested
SkepticLens query showing real conflict in GLP-1 efficacy data

Foundation this enables:

Proves the ontology enables conflict detection
Real "Trust but Verify" data for demos
Pattern for other publication sources (bioRxiv, medRxiv)

Week 4: Ontology Validation & Query Patterns

Goals:

Validate that extracted claims conform to ontology
Build query helpers that leverage domain knowledge
CLI tool for exploring the pharma knowledge graph

Tasks:

Claim validator (applications/ontology/src/validator.rs)
- Check subject matches predicate schema's subject_pattern
- Check predicate is defined for its schema
- Check object type is valid for predicate
- Return validation errors or Ok
Query builder with domain awareness
- query_efficacy("Semaglutide", "Type2Diabetes") → correct subject + all efficacy predicates
- query_safety("Semaglutide") → correct subject + all safety predicates
- Uses domain definition to know which predicates to include

CLI exploration tool (applications/ontology/src/bin/steme-pharma.rs)

steme-pharma efficacy semaglutide type2-diabetes
# Shows all efficacy claims with conflict scores

steme-pharma safety semaglutide
# Shows all safety claims by tier

steme-pharma compare semaglutide tirzepatide --indication type2-diabetes
# Side-by-side comparison

Domain-aware API endpoints (optional, if time)
- GET /v1/pharma/efficacy?drug=semaglutide&indication=type2_diabetes
- GET /v1/pharma/safety?drug=semaglutide
- Thin wrapper over existing query API with domain knowledge
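
The three validator checks listed in the tasks above can be sketched like this. The planned validator is a Rust module; Python is used here only to show the logic. The schema table is copied from the pharma.toml draft, while the `BOOLEAN_PREDICATES` set is an assumption, since the draft does not yet declare object types per predicate.

```python
import re

# Copied from the pharma.toml draft (efficacy and safety only).
SCHEMAS = {
    "efficacy": {
        "subject_pattern": "{Drug}:{Indication}",
        "predicates": ["hba1c_change_percent", "weight_change_percent",
                       "remission_rate"],
    },
    "safety": {
        "subject_pattern": "{Drug}",
        "predicates": ["nausea_incidence", "discontinuation_rate",
                       "has_boxed_warning"],
    },
}

# Assumption: object types are not in the draft schema, so they are
# hard-coded here. Everything not listed is treated as numeric.
BOOLEAN_PREDICATES = {"has_boxed_warning"}


def validate_claim(schema: str, subject: str, predicate: str, obj) -> list:
    """Return a list of validation errors; an empty list means the claim is ok."""
    spec = SCHEMAS.get(schema)
    if spec is None:
        return [f"unknown schema {schema!r}"]
    errors = []
    # 1. Subject must have one non-empty segment per {Entity} placeholder.
    expected = len(re.findall(r"\{\w+\}", spec["subject_pattern"]))
    segments = subject.split(":")
    if len(segments) != expected or not all(segments):
        errors.append(f"subject {subject!r} does not match "
                      f"{spec['subject_pattern']!r}")
    # 2. Predicate must be declared for this schema.
    if predicate not in spec["predicates"]:
        errors.append(f"predicate {predicate!r} not defined for {schema!r}")
    # 3. Object type must fit the predicate (bool is excluded from numbers).
    is_bool = isinstance(obj, bool)
    is_number = isinstance(obj, (int, float)) and not is_bool
    if predicate in BOOLEAN_PREDICATES and not is_bool:
        errors.append(f"{predicate!r} expects a boolean object")
    elif predicate not in BOOLEAN_PREDICATES and not is_number:
        errors.append(f"{predicate!r} expects a numeric object")
    return errors


ok = validate_claim("efficacy", "Semaglutide:Type2Diabetes",
                    "hba1c_change_percent", -1.5)
bad = validate_claim("safety", "Semaglutide:Type2Diabetes",
                     "nausea_incidence", 0.44)
print(ok, bad)
```

Returning a list of errors rather than failing on the first one lets an ingestor log every problem with a claim in a single pass.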

Deliverables:

Validator catches schema-violating claims before ingestion
CLI tool for exploring pharma data
Query helpers that make domain queries easy

Foundation this enables:

Quality gate for extraction pipelines
Developer experience for exploring data
Pattern for domain-specific APIs

Week 5: Social Signal Extraction (Tier 4-5)

Goals:

Extract anecdotal claims from Reddit/social media
These are low-confidence but high-volume
Demonstrate the full tier spectrum working together

Tasks:

Reddit source adapter (latent/ingest-reddit/)
- Fetch posts from r/Ozempic, r/Semaglutide, r/diabetes
- Extract post text, upvotes, comment count
- Rate limit appropriately

LLM extraction for anecdotal claims

Given this Reddit post about {drug_name}:

Extract personal experience claims:
- Predicate: reported_{symptom} | reported_effectiveness
- Object: Text (the claim) or Boolean
- Confidence: 0.2-0.4 based on detail and specificity
- Quote: relevant excerpt

Skip if: asking questions, discussing others' experiences, promotional.

Anecdote clustering (stretch goal)
- Group similar anecdotal claims
- Escalate if cluster size exceeds threshold
- "50+ users reporting gastroparesis" becomes a signal

Full-tier query demonstration

steme-pharma safety semaglutide --all-tiers
# Shows:
# Tier 0 (FDA): has_boxed_warning = true (thyroid)
# Tier 1 (RCTs): nausea_incidence = 0.44
# Tier 4 (Reddit): 127 reports of "gastroparesis" (pre-label)
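
The anecdote clustering stretch goal could start as simply as counting reports per (subject, predicate) and flagging clusters at or above a threshold. The sketch below assumes exact-match grouping and a threshold of 50 (from the "50+ users" example); the record fields are hypothetical and the reports are synthetic.

```python
from collections import Counter


def escalated_signals(reports, threshold=50):
    """Count anecdotal reports per (subject, predicate); keep large clusters."""
    counts = Counter((r["subject"], r["predicate"]) for r in reports)
    return {key: n for key, n in counts.items() if n >= threshold}


# Synthetic reports for illustration only.
reports = (
    [{"subject": "Semaglutide", "predicate": "reported_gastroparesis"}] * 50
    + [{"subject": "Semaglutide", "predicate": "reported_nausea"}] * 3
)

signals = escalated_signals(reports)
print(signals)
```

Real posts will describe the same symptom in many different words, so this only works after the normalizer has mapped them to a shared predicate; grouping semantically similar claims is the harder part of the stretch goal.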

Deliverables:

Reddit ingestor producing Tier 4-5 assertions
At least 100 anecdotal claims ingested
Query showing full tier spectrum for a safety topic

Foundation this enables:

"Early signal detection": social media can surface issues before they appear on an FDA label
Real consumer health use case data
Validates source_class hierarchy is working

Week 6: Ontology Layer Extraction & Documentation

Goals:

Extract reusable patterns from pharma implementation
Document how to create a new domain
Prepare for second vertical (SEC/financial or code/security)

Tasks:

Generalize domain definition schema
- Review what was pharma-specific vs reusable
- Document the schema formally
- Create domains/template.toml with comments
Extraction pipeline abstraction
- Common interface for source adapters
- Common LLM prompt patterns
- Common normalization utilities
"Add a Domain" guide (docs/guides/adding-a-domain.md)
- Step-by-step: define entities, predicates, sources
- Example: how pharma was built
- Checklist: what you need before first extraction
Second domain sketch (no implementation)
- Draft domains/sec.toml or domains/security.toml
- Identify where patterns differ from pharma
- Document what would need to change
Integration with CLAUDE.md
- Add ontology layer to "Find Your Guide" table
- Document new crates and tools
- Update specialized agents if needed

Deliverables:

Reusable ontology crate extracted from pharma-specific code
Documentation for adding new domains
Draft of second domain showing generalization

Foundation this enables:

Other teams can add domains without core changes
Clear separation of domain logic from storage logic
Path to "platform" if that's the direction

Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
LLM extraction quality varies	High	Medium	Start with structured sources (FDA), add human review loop
Ontology schema needs revision	High	Low	Keep Week 1 scope small, iterate based on Week 2-3 learnings
Drug name normalization is hard	Medium	Medium	Use existing resources (RxNorm), accept some manual mapping
Reddit rate limits	Medium	Low	Cache aggressively, use Reddit API properly
Scope creep into full consumer app	Medium	High	Stay focused on extraction + storage, not UX

Success Criteria

Week 3 checkpoint: Can we ingest FDA labels + clinical trials and see real conflicts via SkepticLens?

Week 6 checkpoint: Is the ontology layer reusable enough that sketching a second domain feels like "filling in a template" rather than "starting over"?

End state:

# This should work and show meaningful data:
steme-pharma efficacy semaglutide type2-diabetes

# Output:
Subject: Semaglutide:Type2Diabetes
Predicate: hba1c_change_percent
Status: Contested (conflict_score: 0.42)

Claims:
  -1.5% (45% weight) — NEJM Phase 3, n=1847
  -1.2% (35% weight) — Lancet Phase 3, n=892
  -0.8% (20% weight) — JAMA Observational, n=12000

Sources span Tier 1-2. 47 anecdotal reports (Tier 4) mention efficacy.

Open Questions

Where does the ontology crate live?
- Option A: applications/ontology/ (domain-agnostic utility)
- Option B: crates/stemedb-ontology/ (core crate)
- Recommendation: Start in applications/, promote to crates/ if it proves domain-agnostic
Python vs Rust for extractors?
- Python: faster iteration, better LLM libraries, existing latent/ code
- Rust: type safety, integration with core
- Recommendation: Python for extractors (they're ETL jobs), Rust for ontology validation
How do we handle ontology versioning?
- Predicates may change as we learn
- Old assertions might not conform to new schema
- Recommendation: Version in domain definition, allow "legacy" predicates with deprecation flags
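
To make the versioning recommendation concrete, here is one possible shape for deprecation flags. The field names (`deprecated_in`, `replaced_by`) and the legacy predicate `hba1c_delta` are invented for this sketch; nothing in the current draft defines them.

```python
# Hypothetical predicate table; "hba1c_delta" is an invented legacy name.
PREDICATES = {
    "hba1c_change_percent": {"since": "0.1.0"},
    "hba1c_delta": {
        "since": "0.1.0",
        "deprecated_in": "0.2.0",
        "replaced_by": "hba1c_change_percent",
    },
}


def resolve_predicate(name: str):
    """Return (canonical_name, is_legacy); raise if never defined."""
    if name not in PREDICATES:
        raise KeyError(f"unknown predicate: {name}")
    entry = PREDICATES[name]
    if "deprecated_in" in entry:
        # Legacy predicates remain readable but map to their replacement.
        return entry["replaced_by"], True
    return name, False


print(resolve_predicate("hba1c_delta"))
print(resolve_predicate("hba1c_change_percent"))
```

With a mapping like this, old assertions stay queryable under the new predicate name without being rewritten in storage.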

Next Steps

Review this plan and adjust scope/timeline
Decide on Question 1 (crate location)
Start Week 1: domain definition schema
