stemedb/latent/architecture.md
jordan b3e8a9a058 feat: Multi-application expansion with chaos testing and community UI
2026-02-04 01:24:14 -07:00


Latent: System Architecture

Latent is an intelligence layer built on top of StemeDB. It transforms raw, unstructured health data into a knowledge graph of conflicting safety assertions.

1. High-Level Architecture

  [ EXTERNAL SOURCES ]
          │
          ▼
  ┌──────────────────┐      ┌──────────────────┐
  │  Ingestion Pods  │      │  Extraction      │
  │  (The Sensors)   │─────►│  (LLM Pipeline)  │
  └──────────────────┘      └────────┬─────────┘
                                     │ (Signed Assertions)
                                     ▼
  ┌──────────────────┐      ┌──────────────────┐
  │  StemeDB Spine   │◄─────┤  Assertion       │
  │  (Storage/WAL)   │      │  Manager         │
  └────────┬─────────┘      └──────────────────┘
           │
           ▼
  ┌──────────────────┐      ┌──────────────────┐
  │  Lens Engine     │      │  Divergence      │
  │  (Resolution)    │◄─────┤  Analyzer        │
  └────────┬─────────┘      └────────┬─────────┘
           │                         │
           ▼                         ▼
  ┌──────────────────┐      ┌──────────────────┐
  │  Web Dashboard   │      │  Alerting        │
  │  (Next.js)       │      │  (Slack/Email)   │
  └──────────────────┘      └──────────────────┘

2. Component Breakdown

2.1. Ingestion Pods (Sensors)

Distributed workers that pull data from the sources cataloged in latent/sources.md.

  • Regulatory Sensor: Polls OpenFDA and DailyMed for label changes (SPL XML/JSON).
  • Clinical Sensor: Tracks ClinicalTrials.gov for trial completions and PubMed for case reports.
  • Social Sensor: Uses headless browsers or API bridges (Apify) to monitor Reddit/Twitter clusters.
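
One way to picture the sensors is as implementations of a single polling interface. The sketch below is illustrative only: `RawDocument`, `Sensor`, `QueueSensor`, and `poll_all` are hypothetical names, not StemeDB or Latent APIs, and the in-memory queue stands in for real HTTP polling or scraping.

```rust
use std::collections::VecDeque;

/// A raw document pulled from an external source, before extraction.
/// Field names are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct RawDocument {
    pub source: String, // e.g. "openfda" or "reddit" (illustrative labels)
    pub tier: u8,       // 0 = regulatory, 5 = social, as named in section 2.4
    pub body: String,
}

/// Hypothetical interface that every ingestion pod would implement.
pub trait Sensor {
    fn name(&self) -> &str;
    /// Returns the documents that arrived since the previous poll.
    fn poll(&mut self) -> Vec<RawDocument>;
}

/// In-memory stand-in for a real sensor; a production pod would
/// call an external API or drive a headless browser here instead.
pub struct QueueSensor {
    name: String,
    tier: u8,
    pending: VecDeque<String>,
}

impl QueueSensor {
    pub fn new(name: &str, tier: u8, bodies: &[&str]) -> Self {
        QueueSensor {
            name: name.to_string(),
            tier,
            pending: bodies.iter().map(|b| b.to_string()).collect(),
        }
    }
}

impl Sensor for QueueSensor {
    fn name(&self) -> &str {
        &self.name
    }

    fn poll(&mut self) -> Vec<RawDocument> {
        let source = self.name.clone();
        let tier = self.tier;
        // Drain everything queued so a second poll returns nothing new.
        self.pending
            .drain(..)
            .map(|body| RawDocument { source: source.clone(), tier, body })
            .collect()
    }
}

/// Polls every registered sensor once and concatenates the results.
pub fn poll_all(sensors: &mut [Box<dyn Sensor>]) -> Vec<RawDocument> {
    sensors.iter_mut().flat_map(|s| s.poll()).collect()
}
```

Keeping the interface this narrow means a new source only has to provide `name` and `poll`; scheduling, retries, and rate limiting are left out of the sketch.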

2.2. Extraction Pipeline (The Brain)

Converts raw text/PDFs into structured Assertions.

  • Model: GPT-4o-mini (cloud), or Llama-3-70B (local) on PII-sensitive paths.
  • Process:
    1. Entity Recognition (Molecules, Symptoms).
    2. Relation Extraction (Mechanism of action, Adverse event).
    3. Sentiment/Magnitude normalization.
  • Output: A StemeDB-compatible Assertion object.
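
The three process steps can be sketched as three small functions feeding one output struct. This is a toy illustration under stated assumptions: keyword matching stands in for the LLM, the `Assertion` fields are hypothetical rather than the real StemeDB schema, and the magnitude scale (mild, moderate, severe) is invented for the example.

```rust
/// Hypothetical shape of an extracted assertion; the real
/// StemeDB Assertion object may carry different fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub molecule: String,
    pub relation: String, // e.g. "adverse_event"
    pub symptom: String,
    pub magnitude: f64, // normalized to 0.0..=1.0 (assumed scale)
}

/// Step 1, entity recognition: a lexicon lookup standing in for the model.
fn find_term<'a>(text: &str, lexicon: &[&'a str]) -> Option<&'a str> {
    let lower = text.to_lowercase();
    lexicon.iter().copied().find(|term| lower.contains(*term))
}

/// Step 2, relation extraction: a keyword rule standing in for the model.
fn classify_relation(text: &str) -> &'static str {
    let lower = text.to_lowercase();
    if lower.contains("caused") || lower.contains("after taking") {
        "adverse_event"
    } else {
        "unclassified"
    }
}

/// Step 3, magnitude normalization: maps severity words onto 0.0..=1.0.
fn normalize_magnitude(text: &str) -> f64 {
    let lower = text.to_lowercase();
    if lower.contains("severe") {
        1.0
    } else if lower.contains("moderate") {
        0.5
    } else if lower.contains("mild") {
        0.25
    } else {
        0.5 // no severity cue: fall back to the midpoint (an assumption)
    }
}

/// Runs the three steps in order; returns None when either entity is missing.
pub fn extract(text: &str, molecules: &[&str], symptoms: &[&str]) -> Option<Assertion> {
    let molecule = find_term(text, molecules)?;
    let symptom = find_term(text, symptoms)?;
    Some(Assertion {
        molecule: molecule.to_string(),
        relation: classify_relation(text).to_string(),
        symptom: symptom.to_string(),
        magnitude: normalize_magnitude(text),
    })
}
```

The lexicon entries are expected in lowercase, since the input text is lowercased before matching.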

2.3. Assertion Manager

The gatekeeper for the knowledge graph.

  • Signing: Every assertion extracted by the pipeline is cryptographically signed by the Latent Extraction Agent.
  • Deduplication: Uses content-addressing (StemeDB hashes) to ensure the same Reddit post isn't ingested twice.
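
A minimal sketch of the deduplication gate, assuming that the content address is a hash of the assertion payload. `AssertionManager::admit` and `SignedAssertion` are hypothetical names. The standard library's `DefaultHasher` is used only as a stand-in for StemeDB's hashes and is not cryptographic. The signature field is a labeled placeholder, because the source does not name a signature scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Content address of a payload. Stand-in only: a real deployment
/// would use StemeDB's own (presumably cryptographic) hash.
fn content_hash(payload: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    payload.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug, Clone, PartialEq)]
pub struct SignedAssertion {
    pub id: u64, // content address
    pub payload: String,
    pub signer: String,
    pub signature: String, // placeholder, NOT a real signature
}

pub struct AssertionManager {
    signer: String,
    seen: HashSet<u64>,
}

impl AssertionManager {
    pub fn new(signer: &str) -> Self {
        AssertionManager { signer: signer.to_string(), seen: HashSet::new() }
    }

    /// Admits a payload once; returns None when the same content
    /// has already been admitted.
    pub fn admit(&mut self, payload: &str) -> Option<SignedAssertion> {
        let id = content_hash(payload);
        // HashSet::insert returns false when the value was already present.
        if !self.seen.insert(id) {
            return None;
        }
        Some(SignedAssertion {
            id,
            payload: payload.to_string(),
            signer: self.signer.clone(),
            // A real implementation would sign `id` with the agent's key.
            signature: format!("placeholder-sig:{:016x}", id),
        })
    }
}
```

Because the identifier is derived from the content, re-ingesting the same post yields the same identifier and is rejected without comparing full payloads.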

2.4. Divergence Analyzer

A specialized background service that queries StemeDB using the Skeptic Lens.

  • Logic: Compares Tier 0 (Regulatory) against Tier 5 (Social).
  • Score Calculation: Divergence = (SocialMagnitude * SocialConfidence) / RegulatorySilence
  • Indexing: Updates materialized views in StemeDB for O(1) molecule status lookups.
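
The score and the lookup index can be sketched as follows. The formula is taken as written above. The source does not define how RegulatorySilence is measured or scaled, so the guard against a zero or negative denominator is an assumption of this sketch. `DivergenceIndex` is a hypothetical in-memory stand-in for a StemeDB materialized view.

```rust
use std::collections::HashMap;

/// Divergence = (SocialMagnitude * SocialConfidence) / RegulatorySilence.
/// Returns None when the denominator is not a positive finite number
/// or the result is not finite (a guard added for this sketch).
pub fn divergence(
    social_magnitude: f64,
    social_confidence: f64,
    regulatory_silence: f64,
) -> Option<f64> {
    if !regulatory_silence.is_finite() || regulatory_silence <= 0.0 {
        return None;
    }
    let score = (social_magnitude * social_confidence) / regulatory_silence;
    if score.is_finite() {
        Some(score)
    } else {
        None
    }
}

/// In-memory stand-in for the materialized view: one score per molecule,
/// readable in expected constant time.
pub struct DivergenceIndex {
    by_molecule: HashMap<String, f64>,
}

impl DivergenceIndex {
    pub fn new() -> Self {
        DivergenceIndex { by_molecule: HashMap::new() }
    }

    /// Overwrites the stored score for a molecule.
    pub fn update(&mut self, molecule: &str, score: f64) {
        self.by_molecule.insert(molecule.to_string(), score);
    }

    /// Looks up the latest score without recomputing it.
    pub fn status(&self, molecule: &str) -> Option<f64> {
        self.by_molecule.get(molecule).copied()
    }
}
```

Precomputing the score in the background service keeps the read path to a single hash lookup, which is what makes the O(1) status lookup possible.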

3. Data Privacy & Compliance

  • De-identification: All Social data (Tier 5) is stripped of usernames and PII before being written to the permanent StemeDB ledger.
  • Auditability: Every divergence alert carries a "Lineage Hash" back to the raw source snippet.
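
The two guarantees can be illustrated together: redact handles before writing, and keep a hash that points back to the raw snippet. This is a deliberately simple sketch under stated assumptions. The token rules (`u/name`, `/u/name`, `@handle`) cover only obvious username patterns and are not a complete PII filter. The source does not say how the Lineage Hash is computed, so hashing the raw snippet with the non-cryptographic `DefaultHasher` is an assumption. All function and struct names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Replaces whitespace-separated tokens that look like Reddit usernames
/// or @handles. Whitespace runs collapse to single spaces, and punctuation
/// attached to a handle is redacted along with it.
pub fn strip_handles(text: &str) -> String {
    text.split_whitespace()
        .map(|tok| {
            let is_handle = tok.starts_with("u/")
                || tok.starts_with("/u/")
                || (tok.starts_with('@') && tok.len() > 1);
            if is_handle { "[redacted]" } else { tok }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Hash of the raw snippet. Stand-in only; not a cryptographic hash.
pub fn lineage_hash(raw: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    raw.hash(&mut hasher);
    hasher.finish()
}

/// What would be written to the permanent ledger in this sketch:
/// de-identified text plus a pointer back to the raw source.
#[derive(Debug, Clone, PartialEq)]
pub struct LedgerRecord {
    pub text: String,
    pub lineage_hash: u64,
}

pub fn prepare_for_ledger(raw: &str) -> LedgerRecord {
    LedgerRecord {
        // The hash is taken over the raw snippet, before redaction,
        // so an auditor holding the raw source can verify the link.
        lineage_hash: lineage_hash(raw),
        text: strip_handles(raw),
    }
}
```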

4. Scalability Strategy

  • Rust Core: The Extraction Pipeline and the Assertion Manager are written in Rust for high-concurrency ingestion.
  • Vector Search: Uses StemeDB's vector index to find "Semantic Clusters" of side effects across different vocabularies and languages (e.g., the lay term "stomach paralysis" vs. the clinical term "gastric stasis").
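
To make the idea of a semantic cluster concrete, the sketch below groups phrases by cosine similarity of their embeddings. It is a brute-force illustration, not StemeDB's vector index: `cosine_similarity` and `neighbors` are hypothetical helpers, and any embedding vectors passed in would come from a model that the source does not specify.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None for mismatched lengths, empty input, or a zero vector.
pub fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Linear scan returning every phrase whose embedding is at least
/// `threshold` similar to the query. A real vector index would avoid
/// scanning the whole corpus.
pub fn neighbors<'a>(
    query: &[f64],
    corpus: &'a [(&'a str, Vec<f64>)],
    threshold: f64,
) -> Vec<&'a str> {
    corpus
        .iter()
        .filter(|entry| {
            cosine_similarity(query, &entry.1).map_or(false, |s| s >= threshold)
        })
        .map(|entry| entry.0)
        .collect()
}
```

With embeddings that place "stomach paralysis" and "gastric stasis" close together, a query near either phrase returns both, which is the clustering behavior described above.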