tidaldb/applications/forage/architecture.md
2026-02-23 22:41:16 -07:00


Forage — Architecture

Overview

Forage has two layers: a reusable engine and a demo server. The engine is the thing that transfers to other applications. The server is the demo that proves it.

plan.md is the canonical build spec when details conflict.

applications/forage/
├── engine/     ← library crate — tidalDB wrapper + MAB + signal schema
└── server/     ← binary crate — Axum HTTP server + feed page (depends on engine)

Any application that wants a foraging loop embeds forage-engine directly. The Axum server and the feed page are one instantiation of that engine, not the thing itself.

Runtime default for the demo server:

  • Persistent state at ~/.forage/data
  • Optional --ephemeral mode for throwaway sessions

System Diagram

┌──────────────────────────────────────────────┐
│  Feed Page (browser, localhost:4242)          │
│                                              │
│  User (or Claude) clicks, skips, saves       │
│  JS posts signals directly via fetch()        │
│  Page polls /feed every 5s, re-renders       │
└──────────────────┬───────────────────────────┘
                   │  HTTP (localhost:4242)
                   │  POST /signal  (from page JS)
                   │  GET  /feed
                   ▼
┌──────────────────────────────────────────────┐
│  forage-server (Axum, thin)                  │
│  routes → handlers → ForageEngine            │
└──────────────────┬───────────────────────────┘
                   │  Rust function calls
                   ▼
┌──────────────────────────────────────────────┐
│  forage-engine  (library crate)              │
│                                              │
│  ForageEngine { db: TidalDb }               │
│  fn signal(user, item, type) -> Result<()>  │
│  fn feed(user, limit) -> Result<Vec<Item>>  │
│  fn seed(corpus) -> Result<()>              │
│                                              │
│  MAB layer (epsilon-greedy, labels)          │
│  Signal schema (view/dwell/save/skip/share)  │
│  Ranking profiles (default/explore/converge) │
└──────────────────┬───────────────────────────┘
                   │  embedded
                   ▼
┌──────────────────────────────────────────────┐
│  tidalDB                                     │
│  Entities · Signals · Profiles · HNSW · BM25│
└──────────────────────────────────────────────┘

Chrome Extension Role

In P0, the Chrome extension is a light observer, not a driver. The feed page handles its own signal posting via plain JS fetch() — no MCP tools needed for every click. Claude uses the extension to check in occasionally:

  • Once at session start: navigate to the feed page
  • Periodically: read_page to snapshot the current feed state (one call, not per-interaction)
  • At the end: compare snapshots, report what shifted

This keeps token usage low. The interesting loop (signal → re-rank → new feed) runs entirely in the browser and server without any Claude involvement. In P0, Claude is an observer and reporter, not a puppeteer.


Data Flow

Write Path (Signal)

User clicks an item card on the feed page
  → POST /signal { user_id: 1, item_id: 42, signal_type: "view" }
  → forage-server receives request
  → forage-engine::signal(user, item, SignalType::View)
  → db.signal("view", EntityId::new(42), 1.0, Timestamp::now())   // value derived in engine
  → tidalDB writes to hot-tier SignalLedger (in-memory DashMap)
  → tidalDB records the signal in the user's open session (the PreferenceVector EMA blend runs at session close; see Preference Evolution)
  → tidalDB persists WAL entry (fjall, durability)
  → HTTP 200
  ← total: < 5ms

Read Path (Feed)

Feed page requests feed
  → GET /feed?user=1&limit=7
  → forage-server calls engine.feed(user_id, 7)
  → forage-engine MAB layer:
      exploit_pool = db.retrieve(
        Retrieve::builder()
          .for_user(user_id)
          .using_profile("forage_default")
          .filter(FilterExpr::unseen(user_id))
          .diversity(max_per_category: 2)
          .limit(20)
          .build()
      )
      explore_candidates = items where category_signal_count(user, cat) < EXPLORE_THRESHOLD
      final_7 = interleave(exploit_pool[0..6], explore_candidates[0..1], label each)
  → serialize to JSON with label, score, why_reason
  → HTTP 200  { items: [...] }
  ← total: < 50ms

Preference Evolution

tidalDB's apply_session_preference_update is called on session close, not per-signal. Forage uses a periodic flush pattern: a background task closes and reopens each user's session every 60 seconds, triggering the EMA blend of signaled item embeddings into the preference vector.

// forage-engine background task (spawned at startup)
every 60s:
  for each active user:
    db.close_session(user_id, session_id)  → triggers apply_session_preference_update
    db.open_session(user_id)               → fresh session for next window

Effective learning rates per signal type (via update_with_custom_rate):

"view"  → lr=0.05  (mild positive)
"dwell" → lr=0.10  (stronger — reading time is intent)
"save"  → lr=0.20  (strong intent)
"skip"  → lr=-0.02 (mild negative)
"share" → lr=0.30  (strongest positive)

The 60s flush means preference vectors lag signals by up to 60s — acceptable for a foraging engine where the feed refreshes every 5s but deep preference shifts evolve over sessions, not seconds. The adaptive learning rate (tidalDB M6p6: alpha = base / (1 + ln(n+1))) means early signals have more influence; later signals refine without overcorrecting.
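
The adaptive rate and the blend can be sketched in a few lines. This is a minimal illustration, not tidalDB's implementation: it assumes that `base` is the per-signal learning rate listed above, that `n` counts the signals already absorbed, and that the blend is a plain EMA. The function names `adaptive_alpha` and `ema_blend` are hypothetical.

```rust
/// Adaptive learning rate: alpha = base / (1 + ln(n + 1)).
/// `n` is assumed to be the number of signals already absorbed.
fn adaptive_alpha(base: f32, n: u64) -> f32 {
    base / (1.0 + ((n + 1) as f32).ln())
}

/// Assumed EMA form: pref <- (1 - alpha) * pref + alpha * item.
/// A negative alpha (the "skip" case) moves pref away from the item.
fn ema_blend(pref: &mut [f32], item: &[f32], alpha: f32) {
    for (p, x) in pref.iter_mut().zip(item) {
        *p = (1.0 - alpha) * *p + alpha * *x;
    }
}
```

With `n = 0` the logarithm vanishes and the first signal is applied at the full base rate; every later signal is applied at a smaller rate, which is the "refine without overcorrecting" behaviour described above.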


Signal Schema

// Declared on startup in forage-engine/src/schema.rs

let schema = SchemaBuilder::new()
    .signal("view",  EntityKind::Item, DecaySpec::Exponential { half_life: days(7) })
        .windows(&[Window::TwentyFourHours, Window::AllTime])
        .velocity(true)
        .add()
    .signal("dwell", EntityKind::Item, DecaySpec::Exponential { half_life: days(3) })
        .windows(&[Window::TwentyFourHours, Window::AllTime])
        .velocity(false)
        .add()
    .signal("save",  EntityKind::Item, DecaySpec::Exponential { half_life: days(30) })
        .windows(&[Window::AllTime])
        .velocity(false)
        .add()
    .signal("skip",  EntityKind::Item, DecaySpec::Exponential { half_life: days(1) })
        .windows(&[Window::TwentyFourHours])
        .velocity(false)
        .add()
    .signal("share", EntityKind::Item, DecaySpec::Exponential { half_life: days(14) })
        .windows(&[Window::AllTime])
        .velocity(false)
        .add()
    .build()?;

Signal semantics:

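| Signal | Half-life | Meaning                                   | Learning rate |
|--------|-----------|-------------------------------------------|---------------|
| view   | 7 days    | User opened the item                      | 0.05          |
| dwell  | 3 days    | User read for ≥30s (proxy for completion) | 0.10          |
| save   | 30 days   | User explicitly bookmarked it             | 0.20          |
| skip   | 1 day     | User dismissed it                         | -0.02         |
| share  | 14 days   | User sent it to someone                   | 0.30          |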
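
To make the half-lives concrete, here is a small sketch of exponential decay. It assumes `DecaySpec::Exponential` follows the standard half-life rule (the weight halves every `half_life`); the helper `decayed` and the constant `DAY` are hypothetical names, not part of tidalDB.

```rust
/// Seconds in one day.
const DAY: f64 = 86_400.0;

/// Weight remaining after `age_secs`, assuming standard half-life decay:
/// value * 0.5^(age / half_life).
fn decayed(value: f64, age_secs: f64, half_life_secs: f64) -> f64 {
    value * 0.5f64.powf(age_secs / half_life_secs)
}
```

Under this rule a skip (1-day half-life) retains 12.5% of its weight after three days, while a save (30-day half-life) still retains half of its weight after a month.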

Ranking Profiles

Three profiles covering the exploration/exploitation spectrum:

forage_default (primary)

  • Personalized blend: preference_match 0.5, signal_recency 0.3, quality 0.2
  • Exploration budget: 14% (roughly 1 in 7 items)
  • Diversity: max_per_category: 2
  • Unseen filter: always on
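
As a worked example of the blend, the sketch below computes a weighted sum with the forage_default weights. It assumes each component is already normalized to [0, 1]; `BlendWeights` and `blend_score` are hypothetical names used only for illustration, and the real profile is evaluated inside tidalDB.

```rust
/// Weights of a personalized blend (hypothetical struct for illustration).
struct BlendWeights {
    preference_match: f32,
    signal_recency: f32,
    quality: f32,
}

/// The forage_default weights listed above.
const FORAGE_DEFAULT: BlendWeights = BlendWeights {
    preference_match: 0.5,
    signal_recency: 0.3,
    quality: 0.2,
};

/// Weighted sum of the three components; inputs are assumed to lie in [0, 1].
fn blend_score(w: &BlendWeights, preference_match: f32, signal_recency: f32, quality: f32) -> f32 {
    w.preference_match * preference_match
        + w.signal_recency * signal_recency
        + w.quality * quality
}
```

For an item with preference match 0.8, recency 0.5, and quality 0.9, the blend is 0.4 + 0.15 + 0.18 = 0.73.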

forage_explore (cold start / adventurous users)

  • Exploration budget: 35%
  • Boosts hidden_gems profile weighting (high quality, low view count)
  • Wider diversity: max_per_category: 1

forage_converge (power users with strong preferences)

  • Exploration budget: 5%
  • Pure preference match + recency
  • Looser diversity cap: max_per_category: 3 (allows depth in known interests)

MAB Layer

The epsilon-greedy MAB lives in forage-engine/src/mab.rs. It wraps tidalDB queries — it does not replace them.

pub struct MabConfig {
    pub exploration_ratio: f32,       // default 0.14
    pub explore_threshold: u64,       // categories with < N user signals = exploration eligible
}

pub fn rank(db: &TidalDb, user_id: u64, limit: usize, cfg: &MabConfig)
    -> Result<Vec<ForageItem>>
{
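    // Step 1: Split the slots, then fetch an exploit pool of 2× limit for headroom.
    // Round the explore share to the nearest slot: with ratio 0.14 and limit 7,
    // taking ceil() of the exploit share (6.02) would yield 7 and leave no explore slot.
    let explore_count = ((cfg.exploration_ratio * limit as f32).round() as usize).min(limit);
    let exploit_count = limit - explore_count;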

    let exploit_pool = db.retrieve(
        Retrieve::builder()
            .for_user(EntityId::new(user_id))
            .using_profile("forage_default")
            .filter(FilterExpr::unseen_by(user_id))
            .diversity(DiversityConstraints { max_per_category: Some(2), ..Default::default() })
            .limit(limit * 2)
            .build()?
    )?;

    // Step 2: Find exploration candidates (categories with < threshold signals)
    let explore_pool = exploit_pool.iter()
        .filter(|item| category_signal_count(db, user_id, item.category()) < cfg.explore_threshold)
        .take(explore_count * 3)  // more candidates = better exploration variety
        .collect::<Vec<_>>();

    // Step 3: Interleave, label, return
    let mut result = Vec::with_capacity(limit);
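    // Exploit items are those whose category already has enough signals (the complement of Step 2).
    let mut exploit_iter = exploit_pool.iter()
        .filter(|item| category_signal_count(db, user_id, item.category()) >= cfg.explore_threshold);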
    let mut explore_iter = explore_pool.iter();

    for i in 0..limit {
        let is_explore_slot = (i + 1) % (limit / explore_count.max(1)) == 0;
        if is_explore_slot {
            if let Some(item) = explore_iter.next() {
                result.push(label(item, ItemLabel::Exploring));
                continue;
            }
        }
        if let Some(item) = exploit_iter.next() {
            result.push(label(item, determine_label(item, user_id, db)));
        }
    }

    Ok(result)
}
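
The slot arithmetic is easy to check in isolation. The sketch below is a standalone illustration with hypothetical helper names (`split_slots`, `explore_positions`); it assumes the explore share is rounded to the nearest whole slot and reuses the `(i + 1) % stride == 0` rule from the interleave loop.

```rust
/// Split `limit` feed slots into (exploit, explore) counts,
/// rounding the explore share to the nearest whole slot.
fn split_slots(limit: usize, exploration_ratio: f32) -> (usize, usize) {
    let explore = ((exploration_ratio * limit as f32).round() as usize).min(limit);
    (limit - explore, explore)
}

/// 0-based feed positions reserved for exploration, spaced evenly
/// with the rule (i + 1) % stride == 0.
fn explore_positions(limit: usize, explore: usize) -> Vec<usize> {
    if limit == 0 || explore == 0 {
        return Vec::new();
    }
    let stride = limit / explore.min(limit); // at least 1
    (0..limit)
        .filter(|i| (i + 1) % stride == 0)
        .take(explore)
        .collect()
}
```

With the default ratio of 0.14 and a 7-item feed this yields 6 exploit slots and 1 explore slot in the last position. Rounding the exploit share up instead would leave no explore slot at these settings, since (1 - 0.14) × 7 = 6.02.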

Labels assigned at ranking time, returned in the feed response:

  • "match" — cosine similarity to preference vector above threshold
  • "exploring" — from underexplored category bucket
  • "trending" — high velocity regardless of personalization
  • "resurfaced" — prior low engagement, being re-evaluated after decay

HTTP API

POST /signal

{
  "user_id": 1,
  "item_id": 42,
  "signal_type": "view",
  "duration_ms": null
}

Response: 200 OK { "ok": true }

For dwell signals, duration_ms is used internally to scale signal strength: value = min(duration_ms / 30000.0, 3.0).
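
A minimal sketch of that scaling follows. The function name `signal_value` is hypothetical, and the fallback of 1.0 for non-dwell signals and for a dwell without `duration_ms` is an assumption based on the fixed value shown in the write path.

```rust
/// Signal strength handed to the database.
/// Dwell: min(duration_ms / 30000.0, 3.0). Everything else: 1.0 (assumed default).
fn signal_value(signal_type: &str, duration_ms: Option<u64>) -> f64 {
    match (signal_type, duration_ms) {
        ("dwell", Some(ms)) => (ms as f64 / 30_000.0).min(3.0),
        _ => 1.0,
    }
}
```

A 30-second dwell therefore counts as 1.0, a 15-second dwell as 0.5, and anything from 90 seconds upward is capped at 3.0.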

GET /feed?user=X&limit=7

{
  "user_id": 1,
  "items": [
    {
      "id": 42,
      "title": "Toward a Theory of Generative Systems",
      "source": "mitpress.mit.edu",
      "category": "science",
      "reading_time_min": 8,
      "description": "You have engaged with complexity theory and emergent systems. This paper bridges those interests with formal generative grammar.",
      "label": "match",
      "score": 0.847,
      "url": "https://..."
    }
  ],
  "generated_at_ms": 1708720000000
}

GET /items

Returns all seed items. Used by the feed page for initial render and by Claude for browsing context.

GET /

Serves static/index.html — the feed page.


Seed Data

100 items, 8 categories, reproducible via seeded RNG (seed = 42).

| Category   | Count | Sample titles |
|------------|-------|---------------|
| tech       | 15    | "Consistent Hashing and Load Distribution", "CRDT Primer: Convergent Data Structures", "Why Your Database Lies About Durability" |
| music      | 10    | "Brian Eno's Oblique Strategies", "Sidechaining as Musical Grammar", "Why Lo-Fi Works" |
| jazz       | 15    | "Coltrane Changes: Why They Work", "West African Rhythm and American Jazz", "The Harmony of Ornette Coleman" |
| cooking    | 12    | "The Chemistry of Sourdough", "Miso in Three Steps", "Lacto-Fermentation Without Fear" |
| fitness    | 10    | "Loaded Carries and Their Underuse", "Joint Mobility vs. Flexibility", "Walking Is Enough" |
| travel     | 10    | "Night Trains Through Central Europe", "Walking Cities by Sound", "Markets, Routes, and Street Cartography" |
| science    | 15    | "Emergence: From Cells to Consciousness", "Small Worlds and Scale-Free Networks", "Power Laws in Nature" |
| literature | 13    | "Joan Didion on Self-Respect", "Montaigne's Recursive Method", "David Foster Wallace on Attention" |

Items include realistic metadata: created_at, reading_time, word_count, source_domain, author.

P0 Embeddings Strategy

P0 uses category-axis vectors, so no embedding service is required. Each category is assigned a basis vector in 8-dimensional space (one dimension per category). Items within the same category get similar vectors; items in different categories get nearly orthogonal ones (exactly orthogonal before noise is added). A small random offset (seeded, deterministic) gives intra-category variation.

// forage-engine/src/seed.rs
fn category_vector(category: &str, item_offset: u64) -> Vec<f32> {
    let mut v = vec![0.0f32; 8];
    let dim = category_index(category);  // 0..=7
    v[dim] = 0.9;
    // small deterministic noise seeded by item_offset
    add_seeded_noise(&mut v, item_offset, 0.1);
    l2_normalize(&mut v);  // normalizes in place
    v
}

This makes semantic similarity actually demonstrate something in P0: items in the same category cluster together, cross-category exploration is genuinely "far" from the user's centroid. When preference vectors form, they point toward the user's engaged categories and similar_to queries return items from those categories.
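
The clustering claim can be checked with a self-contained sketch. Everything here is illustrative: SplitMix64 stands in for whatever seeded RNG the engine actually uses, the category is passed as an index rather than a name, and `cosine` is a hypothetical helper.

```rust
const DIMS: usize = 8;

/// SplitMix64 step: a tiny deterministic generator (stand-in for the engine's seeded RNG).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Category-axis vector: 0.9 on the category's axis, noise in [-0.1, 0.1)
/// on every axis, then L2-normalized. `dim` must be below DIMS.
fn category_vector(dim: usize, item_seed: u64) -> Vec<f32> {
    let mut v = vec![0.0f32; DIMS];
    v[dim] = 0.9;
    let mut state = item_seed;
    for x in v.iter_mut() {
        // Map the top 53 bits to a uniform value in [0, 1).
        let unit = (splitmix64(&mut state) >> 11) as f64 / (1u64 << 53) as f64;
        *x += ((unit * 2.0 - 1.0) * 0.1) as f32;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter_mut().for_each(|x| *x /= norm);
    v
}

/// Cosine similarity of two unit-length vectors (their dot product).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Two items on the same axis always score well above two items on different axes: with noise bounded by 0.1 per dimension, the same-category cosine cannot drop below roughly 0.53, and the cross-category cosine cannot exceed roughly 0.41.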

P2 replaces these with real embeddings from an external service. The seed corpus entries and the vector shape in tidalDB are identical — only the values change.


Feed Page

Minimal. Static HTML, no framework. Under 200 lines.

┌──────────────────────────────────────────────────────────────┐
│  ◦ forage     [user: 1 ▾]   [7 items]   last updated: 2s ago │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ [match]     │  │ [exploring] │  │ [match]     │         │
│  │             │  │             │  │             │         │
│  │ Title       │  │ Title       │  │ Title       │         │
│  │ source · 8m │  │ source · 4m │  │ source · 12m│         │
│  │             │  │             │  │             │         │
│  │ Description │  │ Description │  │ Description │         │
│  │ paragraph   │  │ paragraph   │  │ paragraph   │         │
│  │             │  │             │  │             │         │
│  │ [skip]  [▸] │  │ [skip]  [▸] │  │ [skip]  [▸] │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ [trending]  │  │ [match]     │  │ [match]     │         │
│  │  ...        │  │  ...        │  │  ...        │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                              │
│  ┌─────────────┐                                            │
│  │ [exploring] │                                            │
│  │  ...        │                                            │
│  └─────────────┘                                            │
└──────────────────────────────────────────────────────────────┘

Interactions:

  • Click card → POST /signal view + open URL in new tab
  • Hover ≥3s → POST /signal dwell (JS timer, fires on mouseleave if threshold met)
  • [skip] → POST /signal skip + animate card out + pull next item
  • [▸] (save) → POST /signal save + animate bookmark indicator
  • Auto-refresh → polls /feed every 5s, diffs result, animates re-ordering

Project Layout

applications/forage/
├── vision.md                   # What it is and why
├── plan.md                     # Phased build plan
├── architecture.md             # This file
├── readme.md                   # How to run it
│
├── engine/                     # Library crate — the reusable core
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs              # ForageEngine public API
│       ├── schema.rs           # tidalDB schema declaration
│       ├── seed.rs             # Deterministic seed corpus builder
│       ├── mab.rs              # Epsilon-greedy MAB wrapper
│       └── labels.rs           # Label assignment logic
│
└── server/                     # Binary crate — the demo
    ├── Cargo.toml
    └── src/
        ├── main.rs             # Axum startup
        ├── handlers.rs         # HTTP handlers (signal, feed, items)
        └── static/
            └── index.html      # Feed page (plain HTML/JS, ~150 lines)

Crate dependencies

forage-engine/Cargo.toml:

[lib]
name = "forage_engine"

[dependencies]
tidaldb = { path = "../../../tidal" }
serde = { version = "1", features = ["derive"] }

forage-server/Cargo.toml:

[[bin]]
name = "forage-server"

[dependencies]
forage-engine = { path = "../engine" }
axum = "0.7"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
tower-http = { version = "0.5", features = ["cors", "fs"] }

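CORS headers matter only when the feed page is loaded from a different origin than the API, for example when index.html is opened from disk or served by a separate dev server. When the page is served by GET / on the same host and port, its fetch() calls to /signal and /feed are same-origin and need no CORS headers; the tower-http cors feature covers the cross-origin case.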

Embedding in another application

Any Rust application that wants the foraging loop:

[dependencies]
forage-engine = { path = "path/to/forage/engine" }

use forage_engine::{ForageEngine, SignalType};
use std::path::Path;

let engine = ForageEngine::persistent(Path::new("/home/you/.forage/data"))?;
engine.seed_default_corpus()?;

// Write a signal
engine.signal(user_id, item_id, SignalType::View)?;

// Get a ranked feed with MAB labels
let feed = engine.feed(user_id, 7)?;

The Axum server is optional. The engine is the thing that transfers.


What tidalDB Handles (Nothing to Reimplement)

  • Preference vector maintenance and EMA updates
  • Signal decay, velocity, windowed aggregation
  • HNSW vector index (semantic similarity)
  • BM25 full-text index (keyword search)
  • Diversity constraints (max per category, max per creator)
  • Cold-start exploration budget (items with no signals)
  • Session persistence and WAL durability
  • Filter evaluation (unseen, category, signal threshold)
  • The hidden_gems profile (high quality, low reach)

The MAB layer is the only thing Forage adds on top of tidalDB. Everything else is a query.