# Forage — Architecture

## Overview

Forage has two layers: a reusable **engine** and a demo **server**. The engine is the thing that transfers to other applications. The server is the demo that proves it.

`plan.md` is the canonical build spec when details conflict.

```
applications/forage/
├── engine/    ← library crate — tidalDB wrapper + MAB + signal schema
└── server/    ← binary crate — Axum HTTP server + feed page (depends on engine)
```

Any application that wants a foraging loop embeds `forage-engine` directly. The Axum server and the feed page are one instantiation of that engine, not the thing itself.

Runtime default for the demo server:

- Persistent state at `~/.forage/data`
- Optional `--ephemeral` mode for throwaway sessions

### System Diagram

```
┌──────────────────────────────────────────────┐
│  Feed Page (browser, localhost:4242)         │
│                                              │
│  User (or Claude) clicks, skips, saves       │
│  JS posts signals directly via fetch()       │
│  Page polls /feed every 5s, re-renders       │
└──────────────────┬───────────────────────────┘
                   │  HTTP (localhost:4242)
                   │  POST /signal (from page JS)
                   │  GET /feed
                   ▼
┌──────────────────────────────────────────────┐
│  forage-server (Axum, thin)                  │
│  routes → handlers → ForageEngine            │
└──────────────────┬───────────────────────────┘
                   │  Rust function calls
                   ▼
┌──────────────────────────────────────────────┐
│  forage-engine (library crate)               │
│                                              │
│  ForageEngine { db: TidalDb }                │
│    fn signal(user, item, type) -> Result<()> │
│    fn feed(user, limit) -> Result<Vec<Item>> │
│    fn seed(corpus) -> Result<()>             │
│                                              │
│  MAB layer (epsilon-greedy, labels)          │
│  Signal schema (view/dwell/save/skip/share)  │
│  Ranking profiles (default/explore/converge) │
└──────────────────┬───────────────────────────┘
                   │  embedded
                   ▼
┌──────────────────────────────────────────────┐
│  tidalDB                                     │
│  Entities · Signals · Profiles · HNSW · BM25 │
└──────────────────────────────────────────────┘
```

### Chrome Extension Role

In **P0**, the Chrome extension is a **light observer**, not a driver. The feed page handles its own signal posting via plain JS `fetch()` — no MCP tools needed for every click. Claude uses the extension to check in occasionally:

- **Once at session start**: `navigate` to the feed page
- **Periodically**: `read_page` to snapshot the current feed state (one call, not per-interaction)
- **At the end**: compare snapshots, report what shifted

This keeps token usage low. The interesting loop — signal → re-rank → new feed — runs entirely in the browser and server without any Claude involvement. Claude's role is observer and reporter, not puppeteer in P0.

---

## Data Flow

### Write Path (Signal)

```
User clicks an item card on the feed page
  → POST /signal { user_id: 1, item_id: 42, signal_type: "view" }
  → forage-server receives request
  → forage-engine::signal(user, item, SignalType::View)
  → db.signal("view", EntityId::new(42), 1.0, Timestamp::now())  // value derived in engine
  → tidalDB writes to hot-tier SignalLedger (in-memory DashMap)
  → tidalDB updates user PreferenceVector (EMA blend toward item embedding; applied at session close, see Preference Evolution)
  → tidalDB persists WAL entry (fjall, durability)
  → HTTP 200
  ← total: < 5ms
```

### Read Path (Feed)

```
Feed page requests feed
  → GET /feed?user=1&limit=7
  → forage-server calls engine.feed(user_id, 7)
  → forage-engine MAB layer:
      exploit_pool = db.retrieve(
        Retrieve::builder()
          .for_user(user_id)
          .using_profile("forage_default")
          .filter(FilterExpr::unseen(user_id))
          .diversity(max_per_category: 2)
          .limit(20)
          .build()
      )
      explore_candidates = items where category_signal_count(user, cat) < EXPLORE_THRESHOLD
      final_7 = interleave(exploit_pool[0..6], explore_candidates[0..1], label each)
  → serialize to JSON with label, score, why_reason
  → HTTP 200 { items: [...] }
  ← total: < 50ms
```

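
The `max_per_category` constraint in the query above is evaluated inside tidalDB, so Forage never implements it. Purely as an illustration of what such a cap does to a ranked list, here is a minimal sketch in plain Rust. The function name `cap_per_category` and the `(id, category)` tuple shape are hypothetical and are not part of the tidalDB API.

```rust
use std::collections::HashMap;

/// Walk a ranked list in order and admit at most `max_per_category`
/// items from each category, stopping once `limit` items are kept.
/// Hypothetical helper for illustration only.
pub fn cap_per_category<'a>(
    ranked: &[(u64, &'a str)],
    max_per_category: usize,
    limit: usize,
) -> Vec<(u64, &'a str)> {
    let mut seen: HashMap<&'a str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(limit);
    for &(id, category) in ranked {
        if out.len() == limit {
            break;
        }
        // Count how many items of this category were already admitted.
        let count = seen.entry(category).or_insert(0);
        if *count < max_per_category {
            *count += 1;
            out.push((id, category));
        }
    }
    out
}
```
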
### Preference Evolution

tidalDB's `apply_session_preference_update` is called on session close, not per-signal. Forage uses a **periodic flush** pattern: a background task closes and reopens each user's session every 60 seconds, triggering the EMA blend of signaled item embeddings into the preference vector.

```
// forage-engine background task (spawned at startup)
every 60s:
  for each active user:
    db.close_session(user_id, session_id)   → triggers apply_session_preference_update
    db.open_session(user_id)                → fresh session for next window
```

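
One pass of that flush loop can be sketched without tidalDB by hiding the two session calls behind a trait. Everything here is an assumption made for illustration: the `SessionStore` trait, the `u64` session ids, and the `MockStore` are hypothetical, and the real `close_session` / `open_session` signatures may differ. A server would call `flush_once` from its 60-second timer.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the two session calls named in the
/// pseudocode; not the real tidalDB interface.
pub trait SessionStore {
    fn close_session(&mut self, user_id: u64, session_id: u64);
    fn open_session(&mut self, user_id: u64) -> u64;
}

/// One flush pass: close every active session, then open a fresh one
/// for the same user. Returns the number of sessions rotated.
pub fn flush_once<S: SessionStore>(store: &mut S, active: &mut HashMap<u64, u64>) -> usize {
    for (user_id, session_id) in active.iter_mut() {
        store.close_session(*user_id, *session_id);
        *session_id = store.open_session(*user_id);
    }
    active.len()
}

/// In-memory mock that records closes and hands out increasing ids.
#[derive(Default)]
pub struct MockStore {
    pub closed: Vec<(u64, u64)>,
    pub next_id: u64,
}

impl SessionStore for MockStore {
    fn close_session(&mut self, user_id: u64, session_id: u64) {
        self.closed.push((user_id, session_id));
    }

    fn open_session(&mut self, _user_id: u64) -> u64 {
        self.next_id += 1;
        self.next_id
    }
}
```
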
Effective learning rates per signal type (via `update_with_custom_rate`):

```
"view"   → lr=0.05   (mild positive)
"dwell"  → lr=0.10   (stronger — reading time is intent)
"save"   → lr=0.20   (strong intent)
"skip"   → lr=-0.02  (mild negative)
"share"  → lr=0.30   (strongest positive)
```

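
To make the rates concrete, the sketch below combines a per-signal base rate with the adaptive schedule `alpha = base / (1 + ln(n+1))` that this section attributes to tidalDB, and applies it as a plain EMA step. The function names are hypothetical, and treating a negative rate as a push away from the item is an assumption about how `skip` behaves, not confirmed tidalDB behavior.

```rust
/// Adaptive rate: alpha = base / (1 + ln(n + 1)), where `n` is the
/// number of updates already applied. With n = 0 this equals `base`.
pub fn adaptive_alpha(base: f32, n: u32) -> f32 {
    base / (1.0 + (n as f32 + 1.0).ln())
}

/// One EMA step: move `pref` toward `item` by a fraction `alpha`.
/// A negative alpha (assumed here for "skip") moves it away instead.
pub fn ema_blend(pref: &mut [f32], item: &[f32], alpha: f32) {
    for (p, x) in pref.iter_mut().zip(item) {
        *p += alpha * (*x - *p);
    }
}
```
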
The 60s flush means preference vectors lag signals by up to 60s — acceptable for a foraging engine where the feed refreshes every 5s but deep preference shifts evolve over sessions, not seconds. The adaptive learning rate (tidalDB M6p6: `alpha = base / (1 + ln(n+1))`) means early signals have more influence; later signals refine without overcorrecting.

---

## Signal Schema

```rust
// Declared on startup in forage-engine/src/schema.rs

let schema = SchemaBuilder::new()
    .signal("view", EntityKind::Item, DecaySpec::Exponential { half_life: days(7) })
        .windows(&[Window::TwentyFourHours, Window::AllTime])
        .velocity(true)
        .add()
    .signal("dwell", EntityKind::Item, DecaySpec::Exponential { half_life: days(3) })
        .windows(&[Window::TwentyFourHours, Window::AllTime])
        .velocity(false)
        .add()
    .signal("save", EntityKind::Item, DecaySpec::Exponential { half_life: days(30) })
        .windows(&[Window::AllTime])
        .velocity(false)
        .add()
    .signal("skip", EntityKind::Item, DecaySpec::Exponential { half_life: days(1) })
        .windows(&[Window::TwentyFourHours])
        .velocity(false)
        .add()
    .signal("share", EntityKind::Item, DecaySpec::Exponential { half_life: days(14) })
        .windows(&[Window::AllTime])
        .velocity(false)
        .add()
    .build()?;
```

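
For intuition about what the half-lives declared above imply, the sketch below uses the standard half-life formula `weight = 0.5^(age / half_life)`. This is an assumption about how `DecaySpec::Exponential` behaves rather than a description of tidalDB internals, and both function names are hypothetical.

```rust
/// Fraction of a signal's weight remaining after `age_days`, using the
/// standard half-life formula: weight = 0.5 ^ (age / half_life).
pub fn decay_weight(age_days: f64, half_life_days: f64) -> f64 {
    0.5f64.powf(age_days / half_life_days)
}

/// Half-lives in days, mirroring the schema in this section.
pub fn half_life_days(signal: &str) -> Option<f64> {
    match signal {
        "view" => Some(7.0),
        "dwell" => Some(3.0),
        "save" => Some(30.0),
        "skip" => Some(1.0),
        "share" => Some(14.0),
        _ => None,
    }
}
```
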
Signal semantics:

| Signal | Half-life | Meaning | Learning Rate |
|--------|-----------|---------|---------------|
| `view` | 7 days | User opened the item | 0.05 |
| `dwell` | 3 days | User read for ≥30s (proxy for completion) | 0.10 |
| `save` | 30 days | User explicitly bookmarked it | 0.20 |
| `skip` | 1 day | User dismissed it | −0.02 |
| `share` | 14 days | User sent it to someone | 0.30 |

---

## Ranking Profiles

Three profiles covering the exploration/exploitation spectrum:

### `forage_default` (primary)

- Personalized blend: preference_match 0.5, signal_recency 0.3, quality 0.2
- Exploration budget: 14% (roughly 1 in 7 items)
- Diversity: `max_per_category: 2`
- Unseen filter: always on

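
The `forage_default` weights can be read as a linear blend of three component scores. The sketch below assumes each component is already normalized to the range 0 to 1 and that the blend is a plain weighted sum. The struct and function names are hypothetical, and tidalDB's actual scoring may combine the components differently.

```rust
/// Weights for the three components named in the `forage_default` profile.
pub struct BlendWeights {
    pub preference_match: f32,
    pub signal_recency: f32,
    pub quality: f32,
}

pub const FORAGE_DEFAULT: BlendWeights = BlendWeights {
    preference_match: 0.5,
    signal_recency: 0.3,
    quality: 0.2,
};

/// Weighted sum of component scores, each assumed to lie in [0, 1].
pub fn blend_score(w: &BlendWeights, preference_match: f32, signal_recency: f32, quality: f32) -> f32 {
    w.preference_match * preference_match
        + w.signal_recency * signal_recency
        + w.quality * quality
}
```
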
### `forage_explore` (cold start / adventurous users)

- Exploration budget: 35%
- Boosts `hidden_gems` profile weighting (high quality, low view count)
- Wider diversity: `max_per_category: 1`

### `forage_converge` (power users with strong preferences)

- Exploration budget: 5%
- Pure preference match + recency
- Narrower diversity: `max_per_category: 3` (a looser cap that allows depth in known interests)

---

## MAB Layer

The epsilon-greedy MAB lives in `forage-engine/src/mab.rs`. It wraps tidalDB queries — it does not replace them.

```rust
pub struct MabConfig {
    pub exploration_ratio: f32,  // default 0.14
    pub explore_threshold: u64,  // categories with < N user signals = exploration eligible
}

pub fn rank(db: &TidalDb, user_id: u64, limit: usize, cfg: &MabConfig)
    -> Result<Vec<ForageItem>>
{
    // Step 1: Split the slots, then get the exploit pool (2× limit so we have headroom).
    // Round the exploration share so 0.14 × 7 = 0.98 yields one exploration slot;
    // taking ceil of the exploit share (ceil(6.02) = 7) would leave none.
    let explore_count = ((cfg.exploration_ratio * limit as f32).round() as usize).min(limit);

    let exploit_pool = db.retrieve(
        Retrieve::builder()
            .for_user(EntityId::new(user_id))
            .using_profile("forage_default")
            .filter(FilterExpr::unseen_by(user_id))
            .diversity(DiversityConstraints { max_per_category: Some(2), ..Default::default() })
            .limit(limit * 2)
            .build()?
    )?;

    // Step 2: Find exploration candidates (categories with < threshold signals)
    let explore_pool = exploit_pool.iter()
        .filter(|item| category_signal_count(db, user_id, item.category()) < cfg.explore_threshold)
        .take(explore_count)
        .collect::<Vec<_>>();

    // Step 3: Interleave, label, return
    let mut result = Vec::with_capacity(limit);
    // Exploit items are the complement of the exploration predicate above.
    let mut exploit_iter = exploit_pool.iter()
        .filter(|item| category_signal_count(db, user_id, item.category()) >= cfg.explore_threshold);
    let mut explore_iter = explore_pool.into_iter();
    let stride = limit / explore_count.max(1);
    let mut explored = 0;

    for i in 0..limit {
        // Never hand out more exploration slots than the budget allows.
        let is_explore_slot = explored < explore_count && (i + 1) % stride == 0;
        if is_explore_slot {
            if let Some(item) = explore_iter.next() {
                result.push(label(item, ItemLabel::Exploring));
                explored += 1;
                continue;
            }
        }
        if let Some(item) = exploit_iter.next() {
            result.push(label(item, determine_label(item, user_id, db)));
        }
    }

    Ok(result)
}
```

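
The slot arithmetic is easy to get wrong, so here it is isolated as a sketch. With `limit = 7` and `exploration_ratio = 0.14`, rounding the exploration share (0.98) gives one exploration slot, which matches the "roughly 1 in 7" budget. Taking the ceiling of the exploit share instead (ceil(6.02) = 7) would leave no exploration slot at all. The helper names `split_slots` and `explore_positions` are hypothetical and do not appear in `mab.rs`.

```rust
/// Split `limit` feed slots into (exploit, explore) counts by rounding
/// the exploration share to the nearest whole slot.
pub fn split_slots(limit: usize, exploration_ratio: f32) -> (usize, usize) {
    let explore = ((exploration_ratio * limit as f32).round() as usize).min(limit);
    (limit - explore, explore)
}

/// 0-based feed positions reserved for exploration, spaced evenly and
/// never more than `explore` of them.
pub fn explore_positions(limit: usize, explore: usize) -> Vec<usize> {
    let explore = explore.min(limit);
    if explore == 0 {
        return Vec::new();
    }
    let stride = limit / explore;
    (1..=explore).map(|k| k * stride - 1).collect()
}
```
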
Labels assigned at ranking time, returned in the feed response:

- `"match"` — cosine similarity to preference vector above threshold
- `"exploring"` — from underexplored category bucket
- `"trending"` — high velocity regardless of personalization
- `"resurfaced"` — prior low engagement, being re-evaluated after decay

---

## HTTP API

### `POST /signal`

```json
{
  "user_id": 1,
  "item_id": 42,
  "signal_type": "view",
  "duration_ms": null
}
```

Response: `200 OK { "ok": true }`

For `dwell` signals, `duration_ms` is used internally to scale signal strength: `value = min(duration_ms / 30000.0, 3.0)`.

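
A minimal sketch of that scaling follows. The `dwell_value` body is the formula as stated. The `signal_value` wrapper is hypothetical: it assumes every non-dwell signal, and a dwell with no duration, uses the flat `1.0` shown in the write path.

```rust
/// Dwell strength from duration: min(duration_ms / 30000.0, 3.0).
/// 30 s of dwell maps to 1.0, and anything past 90 s is capped at 3.0.
pub fn dwell_value(duration_ms: u64) -> f64 {
    (duration_ms as f64 / 30_000.0).min(3.0)
}

/// Hypothetical wrapper: picks the value passed on to the database.
pub fn signal_value(signal_type: &str, duration_ms: Option<u64>) -> f64 {
    match (signal_type, duration_ms) {
        ("dwell", Some(ms)) => dwell_value(ms),
        // Assumption: all other cases use the flat 1.0 from the write path.
        _ => 1.0,
    }
}
```
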
### `GET /feed?user=X&limit=7`

```json
{
  "user_id": 1,
  "items": [
    {
      "id": 42,
      "title": "Toward a Theory of Generative Systems",
      "source": "mitpress.mit.edu",
      "category": "science",
      "reading_time_min": 8,
      "description": "You have engaged with complexity theory and emergent systems. This paper bridges those interests with formal generative grammar.",
      "label": "match",
      "score": 0.847,
      "url": "https://..."
    }
  ],
  "generated_at_ms": 1708720000000
}
```

### `GET /items`

Returns all seed items. Used by the feed page for initial render and by Claude for browsing context.

### `GET /`

Serves `static/index.html` — the feed page.

---

## Seed Data

100 items, 8 categories, reproducible via seeded RNG (`seed = 42`).

| Category | Count | Sample titles |
|----------|-------|---------------|
| `tech` | 15 | "Consistent Hashing and Load Distribution", "CRDT Primer: Convergent Data Structures", "Why Your Database Lies About Durability" |
| `music` | 10 | "Brian Eno's Oblique Strategies", "Sidechaining as Musical Grammar", "Why Lo-Fi Works" |
| `jazz` | 15 | "Coltrane Changes: Why They Work", "West African Rhythm and American Jazz", "The Harmony of Ornette Coleman" |
| `cooking` | 12 | "The Chemistry of Sourdough", "Miso in Three Steps", "Lacto-Fermentation Without Fear" |
| `fitness` | 10 | "Loaded Carries and Their Underuse", "Joint Mobility vs. Flexibility", "Walking Is Enough" |
| `travel` | 10 | "Night Trains Through Central Europe", "Walking Cities by Sound", "Markets, Routes, and Street Cartography" |
| `science` | 15 | "Emergence: From Cells to Consciousness", "Small Worlds and Scale-Free Networks", "Power Laws in Nature" |
| `literature` | 13 | "Joan Didion on Self-Respect", "Montaigne's Recursive Method", "David Foster Wallace on Attention" |

Items include realistic metadata: `created_at`, `reading_time`, `word_count`, `source_domain`, `author`.

### P0 Embeddings Strategy

P0 uses **category-axis vectors** — no embedding service required. Each category is assigned a basis vector in 8-dimensional space (one dimension per category). Items within the same category get similar vectors; items in different categories get nearly orthogonal ones. A small random offset (seeded, deterministic) gives intra-category variation.

```rust
// forage-engine/src/seed.rs
fn category_vector(category: &str, item_offset: u64) -> Vec<f32> {
    let mut v = vec![0.0f32; 8];
    let dim = category_index(category); // 0..=7
    v[dim] = 0.9;
    // small deterministic noise seeded by item_offset
    add_seeded_noise(&mut v, item_offset, 0.1);
    // normalize in place, then return the vector
    l2_normalize(&mut v);
    v
}
```

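
The helpers that `category_vector` relies on are not shown in this document, so the following is a self-contained sketch of one way they could work. The category order, the SplitMix64-style noise generator, the `Option` return for unknown categories, and the `cosine` helper are all assumptions made for illustration. The real `seed.rs` may differ on each point.

```rust
/// Hypothetical category order; the real `category_index` may differ.
const CATEGORIES: [&str; 8] = [
    "tech", "music", "jazz", "cooking", "fitness", "travel", "science", "literature",
];

fn category_index(category: &str) -> Option<usize> {
    CATEGORIES.iter().position(|c| *c == category)
}

/// Add deterministic noise in [-amplitude, amplitude) to every dimension,
/// derived from `seed` with a SplitMix64-style mixer (an assumption).
fn add_seeded_noise(v: &mut [f32], seed: u64, amplitude: f32) {
    let mut state = seed;
    for x in v.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        // Top 24 bits mapped to a float in [0, 1).
        let unit = (z >> 40) as f32 / (1u64 << 24) as f32;
        *x += (unit * 2.0 - 1.0) * amplitude;
    }
}

/// Scale the vector to unit length in place.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Category-axis vector with seeded noise; `None` for unknown categories.
pub fn category_vector(category: &str, item_offset: u64) -> Option<Vec<f32>> {
    let dim = category_index(category)?;
    let mut v = vec![0.0f32; 8];
    v[dim] = 0.9;
    add_seeded_noise(&mut v, item_offset, 0.1);
    l2_normalize(&mut v);
    Some(v)
}

/// Cosine similarity for unit-length vectors (a plain dot product).
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```
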
This makes semantic similarity actually demonstrate something in P0: items in the same category cluster together, and cross-category exploration is genuinely "far" from the user's centroid. When preference vectors form, they point toward the user's engaged categories, and `similar_to` queries return items from those categories.

P2 replaces these with real embeddings from an external service. The seed corpus entries and the vector shape in tidalDB are identical — only the values change.

---

## Feed Page

Minimal. Static HTML, no framework. Under 200 lines.

```
┌──────────────────────────────────────────────────────────────┐
│  ◦ forage    [user: 1 ▾]  [7 items]    last updated: 2s ago  │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │ [match]     │  │ [exploring] │  │ [match]     │           │
│  │             │  │             │  │             │           │
│  │ Title       │  │ Title       │  │ Title       │           │
│  │ source · 8m │  │ source · 4m │  │ source · 12m│           │
│  │             │  │             │  │             │           │
│  │ Description │  │ Description │  │ Description │           │
│  │ paragraph   │  │ paragraph   │  │ paragraph   │           │
│  │             │  │             │  │             │           │
│  │ [skip]  [▸] │  │ [skip]  [▸] │  │ [skip]  [▸] │           │
│  └─────────────┘  └─────────────┘  └─────────────┘           │
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │ [trending]  │  │ [match]     │  │ [match]     │           │
│  │ ...         │  │ ...         │  │ ...         │           │
│  └─────────────┘  └─────────────┘  └─────────────┘           │
│                                                              │
│  ┌─────────────┐                                             │
│  │ [exploring] │                                             │
│  │ ...         │                                             │
│  └─────────────┘                                             │
└──────────────────────────────────────────────────────────────┘
```

Interactions:

- **Click card** → `POST /signal view` + open URL in new tab
- **Hover ≥3s** → `POST /signal dwell` (JS timer, fires on mouseleave if threshold met)
- **[skip]** → `POST /signal skip` + animate card out + pull next item
- **[▸]** (save) → `POST /signal save` + animate bookmark indicator
- **Auto-refresh** → polls `/feed` every 5s, diffs result, animates re-ordering

---

## Project Layout

```
applications/forage/
├── vision.md                # What it is and why
├── plan.md                  # Phased build plan
├── architecture.md          # This file
├── readme.md                # How to run it
│
├── engine/                  # Library crate — the reusable core
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs           # ForageEngine public API
│       ├── schema.rs        # tidalDB schema declaration
│       ├── seed.rs          # Deterministic seed corpus builder
│       ├── mab.rs           # Epsilon-greedy MAB wrapper
│       └── labels.rs        # Label assignment logic
│
└── server/                  # Binary crate — the demo
    ├── Cargo.toml
    └── src/
        ├── main.rs          # Axum startup
        ├── handlers.rs      # HTTP handlers (signal, feed, items)
        └── static/
            └── index.html   # Feed page (plain HTML/JS, ~150 lines)
```

### Crate dependencies

`forage-engine/Cargo.toml`:
```toml
[lib]
name = "forage_engine"

[dependencies]
tidaldb = { path = "../../../tidal" }
serde = { version = "1", features = ["derive"] }
```

`forage-server/Cargo.toml`:
```toml
[[bin]]
name = "forage-server"

[dependencies]
forage-engine = { path = "../engine" }
axum = "0.7"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
tower-http = { version = "0.5", features = ["cors", "fs"] }
```

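CORS headers on the Axum server matter only when the feed page is loaded from an origin other than the server itself (for example, opened from disk or served by a separate dev server). The page served at `GET /` is same-origin, so its `fetch()` calls to `/signal` and `/feed` work without them.
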
### Embedding in another application

Any Rust application that wants the foraging loop:

```toml
[dependencies]
forage-engine = { path = "path/to/forage/engine" }
```

```rust
// Assumes SignalType is exported at the crate root alongside ForageEngine.
use forage_engine::{ForageEngine, SignalType};
use std::path::Path;

// Assumes the engine's error type converts into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = ForageEngine::persistent(Path::new("/home/you/.forage/data"))?;
    engine.seed_default_corpus()?;

    // Example ids; any seeded user and item work.
    let (user_id, item_id) = (1, 42);

    // Write a signal
    engine.signal(user_id, item_id, SignalType::View)?;

    // Get a ranked feed with MAB labels
    let feed = engine.feed(user_id, 7)?;
    println!("feed returned {} items", feed.len());
    Ok(())
}
```

The Axum server is optional. The engine is the thing that transfers.

---

## What tidalDB Handles (Nothing to Reimplement)

- Preference vector maintenance and EMA updates
- Signal decay, velocity, windowed aggregation
- HNSW vector index (semantic similarity)
- BM25 full-text index (keyword search)
- Diversity constraints (max per category, max per creator)
- Cold-start exploration budget (items with no signals)
- Session persistence and WAL durability
- Filter evaluation (unseen, category, signal threshold)
- The `hidden_gems` profile (high quality, low reach)

The MAB layer is the only thing Forage adds on top of tidalDB. Everything else is a query.