jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs

Milestone 8 (phases 1-4):
- Shard-aware WAL segment naming, BatchHeader v2, ShardRouter
- Transport trait, InProcessTransport, WalShipper, FollowerDb
- HLC, PNCounter, LWWRegister, CrdtSignalState, ReconciliationEngine
- Session replication bridge with SeqNo/HWM, idempotency store

Forage application:
- Multi-source discovery engine with MAB exploration
- Embedding-based label system, server handlers, UI refresh

Other:
- QUICKSTART.md, README.md, milestone-8 planning docs
- Hard negative union semantics, RLHF export enhancements
- Recovery benchmark and visibility test expansions
- Split 8 oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-24 13:17:19 -07:00

12 KiB

iknowyou — Vision

The Problem

Every system that talks to people talks to all of them the same way.

Chatbots, assistants, notification systems, CRMs, onboarding flows — they generate language aimed at a statistical median. They don't know that Jordan prefers direct questions over explanations. They don't know that Sarah goes quiet after 10pm and resents being pinged. They don't know that Marcus engages deeply with technical specifics but shuts down when you get abstract.

The current state of "personalization" in communication is prompt stuffing — a static bio paragraph, maybe a few preference flags, injected into context and hoped for the best. It doesn't learn. It doesn't decay. It doesn't notice that someone's interests shifted last week or that they respond to humor on Fridays but not Mondays.

Real personalization requires a system that observes, remembers, forgets, and adapts — continuously, per-person, across every dimension of how a human communicates.

The tools to do this exist but they're scattered across six systems: a vector database for style embeddings, a feature store for behavioral signals, a time-series store for temporal patterns, a key-value store for preference state, an event bus for real-time observation, and application code that tries to glue it all together. The seams between these systems are where the learning breaks down.

The Thesis

Communication is a personalized ranking problem.

"What should I say to this person, in what way, at what time?" is structurally identical to "What content should this user see, in what order?" The same primitives that solve content discovery — signals with decay, preference vectors with adaptive learning, temporal windowing, cohort priors, exploration/exploitation — solve communication personalization when pointed at a different surface.

iknowyou is a communication learning engine built on tidalDB. It doesn't generate language — it learns how language lands, and tells the generator what it knows.

What It Is

A closed-loop system that sits between a language model and the people it talks to. Every message sent is an experiment. Every response (or silence) is a measurement. The system observes, extracts structured signals, writes them into tidalDB's signal ledger, and watches preference vectors converge on how each person actually communicates.

Before the LM generates its next message, iknowyou assembles a communication brief — a structured profile of everything the system has learned about this person, weighted by recency, confidence, and context.

First-Class Primitives

Messages are items. Every message the system generates is stored with metadata (topic, tone, length, structure, time sent) and an embedding. The person's response is a signal on that item. tidalDB's preference vectors automatically evolve toward "the kind of message this person engages with."

Observations are items. Natural-language statements about a person's communication patterns, stored with embeddings and confidence signals that decay over time. Retrieved semantically before each generation. "Jordan redirects away from process topics within 1-2 messages" is an observation. It has a 30-day half-life. If it stops being true, it fades.
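One way to picture retrieval of observations is a score that combines semantic similarity with decayed confidence. The sketch below is an assumption, not tidalDB's actual scoring: the function names are hypothetical, the 30-day half-life comes from the example above, and multiplying similarity by confidence is just one plausible way to combine them.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def observation_score(
    query_embedding: Sequence[float],
    observation_embedding: Sequence[float],
    confidence: float,
    age_days: float,
    half_life_days: float = 30.0,  # half-life quoted for the example observation
) -> float:
    """Hypothetical ranking score: similarity times confidence, halved every half-life."""
    decay = 0.5 ** (age_days / half_life_days)
    return cosine(query_embedding, observation_embedding) * confidence * decay
```

Under these assumptions, a perfectly matching observation that was last confirmed 30 days ago scores half of what it did when fresh, so stale observations drop out of the retrieved set without anyone deleting them.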

Persons are users. Each has a preference vector (learned from message engagement), a signal ledger (all interaction history, decayed), metadata (timezone, role, context), and cohort memberships.

Conversations are sessions. Each has a start and end, a policy, an audit trail, and a set of signals that aggregate into the person's global profile on close.

The Signal Schema

Communication produces a richer signal surface than content consumption. A person doesn't just "view" a message — they respond to it, and how they respond encodes multiple dimensions:

Signal	What it measures	Half-life
`replied`	They responded at all	7d
`replied_fast`	Latency < 2 min	3d
`replied_substantively`	Word count, depth, engagement	7d
`positive_sentiment`	Affirmative, enthusiastic, building-on	14d
`negative_sentiment`	Dismissive, frustrated, redirecting	3d
`topic_engaged`	Stayed on or deepened a topic	14d
`topic_dropped`	Changed subject or went brief	3d
`initiated`	They brought this up unprompted	30d
`went_silent`	No response after timeout	1d
`explicit_feedback`	Direct correction or praise	60d

Short half-lives on negative signals: the system forgets your bad days quickly. Long half-lives on explicit feedback: when someone tells you something directly, remember it.
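To make the half-lives concrete, here is a minimal sketch of exponential half-life decay over the table above. The half-life values are copied from the table; the function name and the exact exponential form are assumptions, since tidalDB's own decay formula is not shown in this document.

```python
# Half-lives in days, copied from the signal table above.
HALF_LIFE_DAYS = {
    "replied": 7,
    "replied_fast": 3,
    "replied_substantively": 7,
    "positive_sentiment": 14,
    "negative_sentiment": 3,
    "topic_engaged": 14,
    "topic_dropped": 3,
    "initiated": 30,
    "went_silent": 1,
    "explicit_feedback": 60,
}


def decayed_weight(signal: str, age_days: float, initial: float = 1.0) -> float:
    """Weight of a signal after age_days, assuming it halves once per half-life."""
    half_life = HALF_LIFE_DAYS[signal]
    return initial * 0.5 ** (age_days / half_life)
```

With these numbers, a `negative_sentiment` signal is down to a quarter of its weight after 6 days, while an `explicit_feedback` signal still carries half its weight after 60 days.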

The Closed Loop

Conversation
  → Person responds (or doesn't)
    → Observer extracts structured signals
      → Signals written to tidalDB (decay, window, velocity — automatic)
        → Preference vectors update (EMA blend — automatic)
          → Communication brief assembled (query tidalDB)
            → LM generates next message, conditioned on brief
              → Conversation continues

No batch jobs. No retraining. No feature pipelines. The loop is continuous and the learning is incremental — every single exchange makes the system slightly better at talking to this person.
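The "EMA blend" step in the loop can be sketched as an exponential moving average that pulls the preference vector toward the embedding of a message the person engaged with. This is an illustrative assumption about the update rule: `ema_update`, `alpha`, and `signal_weight` are hypothetical names, and handling of negative signals is left out.

```python
from typing import Sequence


def ema_update(
    preference: Sequence[float],
    embedding: Sequence[float],
    signal_weight: float,
    alpha: float = 0.1,  # hypothetical base learning rate
) -> list[float]:
    """Move the preference vector toward a message embedding.

    The effective rate is alpha scaled by the strength of the signal,
    clamped to [0, 1] so the result stays between the two vectors.
    """
    if len(preference) != len(embedding):
        raise ValueError("vectors must have the same dimension")
    rate = min(max(alpha * signal_weight, 0.0), 1.0)
    return [(1.0 - rate) * p + rate * e for p, e in zip(preference, embedding)]
```

Each exchange costs one vector blend, which is why this kind of update needs no batch job: a zero-weight signal leaves the vector untouched, and a strong signal moves it by at most `alpha` of the distance.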

The Observer

A small, fast LM call that extracts structured data from each exchange. Not the conversation model — a dedicated analyst. It produces:

Engagement metrics: did they reply, how fast, how much
Style cues: formality, emoji usage, sentence structure, jargon level
Topic extraction: what the conversation is about, at what specificity
Conversation dynamics: who's leading, did they redirect, did they ask or answer
Temporal context: time of day, day of week, response latency pattern

This is the classifier. It's not a separate ML model; it's a structured-output LM call: one inference per exchange, returning a fixed schema.
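As a rough illustration, the observer's structured output could be a small record that is then mapped onto signal names from the schema. The field names and the 40-word threshold below are hypothetical; only the signal names and the 2-minute latency cutoff come from the signal table. Silence is not handled here, since `went_silent` fires from a timeout rather than from an observed reply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical shape of one observer result for a single exchange."""
    replied: bool
    latency_seconds: Optional[float]  # None when there was no reply
    word_count: int
    sentiment: str   # "positive", "negative", or "neutral"
    topic: str
    redirected: bool  # True if the person changed the subject


def extract_signals(obs: Observation, substantive_words: int = 40) -> list[str]:
    """Map an observation onto signal names from the schema."""
    if not obs.replied:
        return []  # silence is detected by a timeout, not by the observer
    signals = ["replied"]
    if obs.latency_seconds is not None and obs.latency_seconds < 120:
        signals.append("replied_fast")  # table: latency < 2 min
    if obs.word_count >= substantive_words:  # assumed threshold
        signals.append("replied_substantively")
    if obs.sentiment == "positive":
        signals.append("positive_sentiment")
    elif obs.sentiment == "negative":
        signals.append("negative_sentiment")
    signals.append("topic_dropped" if obs.redirected else "topic_engaged")
    return signals
```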

The Brief

Before generating any message, the system queries tidalDB and assembles:

Top decayed topics — what this person cares about right now (velocity separates "always liked Rust" from "suddenly interested in replication")
Style preference — formality, length, structure preferences, weighted by recency
Timing patterns — windowed counts over hours-of-day reveal when they're active, responsive, and receptive
What works — messages with high positive-response signals, retrieved by preference vector similarity
What doesn't — patterns that correlate with silence or negative sentiment
Relevant observations — semantic retrieval of natural-language observations matching the current context
Cohort priors — for dimensions where individual data is sparse, fall back to what works for people like them

The brief is structured JSON. The LM reads it as a system prompt. It never touches the database directly.
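For a sense of what such a brief might look like, the sketch below assembles one as a JSON string. Every key name and value is an illustrative assumption (the document does not specify the brief's schema); the sample observations reuse the examples given earlier in this document.

```python
import json


def build_brief(person_id, topics, style, active_hours,
                works, avoid, observations, cohort_priors) -> str:
    """Assemble a communication brief as JSON for the LM's system prompt.

    The key names are hypothetical; they mirror the components listed above.
    """
    brief = {
        "person_id": person_id,
        "top_topics": topics,
        "style": style,
        "timing": {"active_hours": active_hours},
        "what_works": works,
        "what_doesnt": avoid,
        "observations": observations,
        "cohort_priors": cohort_priors,
    }
    return json.dumps(brief, indent=2)


# Illustrative values only, not real data.
example = build_brief(
    person_id="jordan",
    topics=[{"topic": "replication", "trend": "rising"}],
    style={"formality": "low", "length": "short"},
    active_hours=[9, 10, 14],
    works=["direct questions"],
    avoid=["long explanations"],
    observations=["Jordan redirects away from process topics within 1-2 messages"],
    cohort_priors={"developers": {"tone": "direct, concise, technical"}},
)
```

Because the brief is plain JSON, it can be logged, diffed, and edited by hand before it reaches the generator.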

Cohorts

Cohorts solve three problems:

Cold start. A new person has no signal history. But if you know they're a developer in Pacific time who came from a technical community, the `developers` and `us_pacific` cohort signal ledgers already contain aggregate patterns. The system starts with reasonable defaults instead of random guessing.

Cross-pollination. When 50 developers all respond well to direct, concise, technical messages — that learning propagates to the next developer automatically through the cohort ledger. Individual learning is still primary, but cohort signal is the prior.

Drift detection. When a person's individual signals diverge sharply from their cohort, that's itself a signal. An engineer who prefers casual non-technical conversation is interesting precisely because they're atypical for their cohort. The delta between individual and cohort signals is information.
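One simple way to express "cohort as prior, individual as primary" is evidence-weighted shrinkage, with drift measured as the gap between the two estimates. This is a sketch under stated assumptions: the function names and `prior_strength` are hypothetical, and the document does not say how tidalDB actually combines the two ledgers.

```python
def blended_estimate(
    individual_mean: float,
    n_individual: int,
    cohort_mean: float,
    prior_strength: float = 5.0,  # hypothetical: observations needed to match the prior
) -> float:
    """Shrink toward the cohort when individual data is sparse.

    With no individual observations the cohort value is returned as-is;
    as observations accumulate, the individual value dominates.
    """
    weight = n_individual / (n_individual + prior_strength)
    return weight * individual_mean + (1.0 - weight) * cohort_mean


def drift(individual_mean: float, cohort_mean: float) -> float:
    """Signed gap between a person and their cohort on one dimension."""
    return individual_mean - cohort_mean
```

A small `prior_strength` lets direct evidence take over after only a few exchanges; a large absolute `drift` flags a person who is atypical for their cohort.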

Cohorts are defined by predicates over person metadata:

"developers":     role == "engineer"
"us_pacific":     timezone == "America/Los_Angeles"
"morning_active": peak_hour in [6, 11]
"formal_pref":    observed_formality == "high"

Predicates are evaluated at signal-write time. A person can belong to multiple cohorts. Cohort membership can change as metadata evolves.
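The predicates above can be sketched as plain functions over a person's metadata, evaluated whenever a signal is written. The dictionary-of-lambdas representation is an assumption for illustration, and `peak_hour in [6, 11]` is read here as an inclusive range from 6 to 11, which the original notation leaves ambiguous.

```python
from typing import Callable

# Cohort name -> predicate over person metadata (representation is hypothetical).
COHORTS: dict[str, Callable[[dict], bool]] = {
    "developers": lambda p: p.get("role") == "engineer",
    "us_pacific": lambda p: p.get("timezone") == "America/Los_Angeles",
    # Assumes [6, 11] means an inclusive hour range, not a two-element set.
    "morning_active": lambda p: 6 <= p.get("peak_hour", -1) <= 11,
    "formal_pref": lambda p: p.get("observed_formality") == "high",
}


def cohorts_for(person: dict) -> list[str]:
    """Return every cohort whose predicate matches the person's current metadata."""
    return [name for name, matches in COHORTS.items() if matches(person)]
```

Because membership is recomputed from current metadata on each call, a person whose `peak_hour` shifts from 9 to 22 simply stops matching `morning_active` on the next signal write.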

What It Is NOT

Not a chatbot. iknowyou doesn't generate language. It learns how language lands and produces structured briefs for a generator that does.
Not a CRM. It doesn't store contact records, deal pipelines, or business relationships. It stores communication patterns.
Not a sentiment analysis tool. Sentiment extraction is one input signal among many. The system learns multidimensional communication preferences, not a happiness score.
Not a profile page. The communication brief is optimized for LM consumption, not human reading. (Though an inspection UI is valuable for trust and debugging.)
Not a replacement for the LM's own capabilities. A good LM already adapts within a conversation. iknowyou provides the cross-conversation memory that context windows can't.

Design Principles

The response is the ground truth. Don't ask people what they prefer — watch what they do. A fast, substantive reply is a stronger signal than any preference checkbox. Silence is data.

Decay is not optional. People change. A preference observed six months ago is not the same as one observed yesterday. Every signal has a half-life. Nothing is permanent: even explicit, direct corrections fade, only more slowly.

Learn fast, stabilize late. Early interactions should have outsized influence — the system should feel like it's paying attention from the first exchange. As confidence builds, the learning rate drops. New observations refine rather than overwrite.
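A common way to get this behavior is a learning rate that starts high and decays with the number of observations, down to a floor so the system never stops adapting. The schedule below (`1 / (n + 1)` with a floor of 0.05) is an assumed example, not the rule this system is known to use.

```python
def learning_rate(n_observations: int, floor: float = 0.05) -> float:
    """Rate that starts at 1.0 and shrinks as evidence accumulates.

    The floor (a hypothetical value) keeps late observations from being ignored.
    """
    return max(floor, 1.0 / (n_observations + 1))


def update(estimate: float, observation: float, n_observations: int,
           floor: float = 0.05) -> float:
    """Move the estimate toward a new observation by the current rate."""
    rate = learning_rate(n_observations, floor)
    return estimate + rate * (observation - estimate)
```

Under this schedule the first observation fully sets the estimate, the fourth moves it by a quarter of the gap, and anything after the twentieth moves it by 5 percent, so new observations refine rather than overwrite.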

Observe, don't interrogate. Never ask "do you prefer formal or casual language?" Infer it from how they write. The best personalization is invisible — the person just notices that conversations feel easier over time.

Cohorts are priors, not destiny. Use what you know about similar people to bootstrap. Overwrite it with direct evidence immediately. Never let group patterns override individual signals.

The brief is the interface. The communication model doesn't talk to tidalDB. It reads a brief. This keeps the LM stateless, the learning layer independent, and the whole system testable — you can inspect and modify the brief at any point in the loop.

Negative signals decay fast. Everyone has bad days. A short, dismissive reply on a Tuesday night shouldn't poison the model for weeks. Short half-lives on negative signals; long half-lives on positive ones. The system is forgiving by default.

Silence is a signal, not an absence. When someone doesn't respond, that's information. After a configurable timeout, `went_silent` fires as a negative signal on the sent message. But its half-life is short (1 day), because maybe they were just busy.
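The timeout check can be sketched as a pure function over timestamps. The 24-hour default is a placeholder for the configurable timeout, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def silence_signal(
    sent_at: datetime,
    replied_at: Optional[datetime],
    now: datetime,
    timeout: timedelta = timedelta(hours=24),  # placeholder for the configurable timeout
) -> Optional[str]:
    """Return "went_silent" once the timeout passes with no reply, else None."""
    if replied_at is not None:
        return None  # a reply arrived, so silence does not apply
    if now - sent_at >= timeout:
        return "went_silent"
    return None  # still inside the waiting window
```

Keeping the check free of clock reads (the caller passes `now`) makes it easy to test against fixed timestamps.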

Who This Is For

Any system that talks to people repeatedly and wants to get better at it:

AI assistants that communicate with the same users across sessions
Notification systems that want to reach people at the right time, in the right tone, about the right things
Onboarding flows that adapt to how each person learns
Customer communication that remembers how someone prefers to be addressed
Collaborative tools that adjust their language to match the team's communication culture

The common thread: repeated interaction with the same person, where the quality of communication compounds over time.

The Name

iknowyou. Because the goal isn't to talk at people — it's to know them well enough that the conversation feels natural. Not surveillance. Not profiling. Just the kind of knowing that comes from paying attention.
