jordan 413b712c0a chore: initialize tidalDB repository with schema foundation and standards

- Schema phase 1 (tasks 01-02): EntityId, EntityKind, Timestamp, Score, SignalTypeDef, DecayModel, Window, WindowSet — all with property tests and benchmarks scaffolding
- Stub modules for storage, signals, query, ranking
- Full documentation suite: VISION, USE_CASES, SEQUENCE, API, CODING_GUIDELINES, ai-lookup, research docs, specs, roadmap, planning docs
- Marketing site (Next.js) with blog infrastructure
- .claude/ agents and skills for the tidalDB development workflow
- Foundation standards enforced: thiserror + tracing declared as dependencies, clippy::unwrap_used = deny added to lint config
- .gitignore hardened: .next/, node_modules/, .env, secrets, logs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 12:52:20 -07:00


Coding Guidelines

Engineering standards for tidalDB. Derived from the research in docs/research/, the architectural patterns in thoughts.md, and the roadmap's dependency chain.

These are not aspirational. They are load-bearing constraints. Violating them creates bugs that are expensive to find and painful to fix in a ranking system.

1. Memory Layout and Performance

Cache-line alignment on hot-path structs

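Any struct touched during candidate scoring must be #[repr(C, align(64))] and sized to fit exactly one 64-byte L1 cache line. The attribute aligns the struct and rounds its size up to a multiple of 64 bytes; it does not stop the struct from growing to two lines, so field sizes still have to be budgeted. This prevents false sharing under concurrent access and keeps scoring loops cache-friendly.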

Hot-path structs include: per-entity signal state, entity metadata summaries, user preference vectors, relationship weights.

#[repr(C, align(64))]
struct EntitySignalState {
    entity_id: u64,
    decay_scores: [f64; 3],      // one per decay rate
    last_update_ns: u64,
    window_counts: BucketedCounter,
    // ... align(64) already pads the size up to a multiple of 64; keep total field size <= 64 bytes
}
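A minimal sketch of how the one-cache-line rule could be checked at compile time. The `BucketedCounter` layout here is a hypothetical stand-in (four `u32` buckets), chosen only so the example is self-contained; the real type may differ.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for `BucketedCounter`: four 32-bit buckets (16 bytes).
#[allow(dead_code)]
#[repr(C)]
struct BucketedCounter {
    buckets: [u32; 4],
}

/// Fields total 8 + 24 + 8 + 16 = 56 bytes; align(64) pads the size to 64.
#[allow(dead_code)]
#[repr(C, align(64))]
struct EntitySignalState {
    entity_id: u64,
    decay_scores: [f64; 3],
    last_update_ns: u64,
    window_counts: BucketedCounter,
}

// Compile-time guards: the build fails if the struct outgrows one cache line.
const _: () = assert!(align_of::<EntitySignalState>() == 64);
const _: () = assert!(size_of::<EntitySignalState>() == 64);

/// Returns (size, alignment) so the layout can also be inspected at runtime.
fn layout() -> (usize, usize) {
    (size_of::<EntitySignalState>(), align_of::<EntitySignalState>())
}
```

Adding a field that pushes the total past 64 bytes would make the size assertion fail at build time instead of silently doubling the footprint.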

Lock-free on the hot path

Signal counters, decay scores, and windowed aggregates must use atomic operations — never mutexes. A like event increments an atomic counter. A ranking query reads it without blocking writers.

AtomicU64 with Relaxed ordering for counters
AtomicF64 (via AtomicU64 + f64::from_bits) with CAS loops for decay scores
Acquire/Release ordering only at synchronization boundaries (checkpoint, flush)
DashMap or sharded maps for concurrent entity state access

Mutexes are acceptable for cold-path operations: schema changes, profile definitions, background compaction coordination.
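The standard library has no `AtomicF64`, so the wrapper named above has to be built by hand. The following is one possible sketch, assuming `Relaxed` ordering is sufficient for the score itself; the type and method names are illustrative, not an existing tidalDB API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical `AtomicF64`: an f64 stored as its bit pattern in an AtomicU64.
pub struct AtomicF64(AtomicU64);

impl AtomicF64 {
    pub fn new(value: f64) -> Self {
        Self(AtomicU64::new(value.to_bits()))
    }

    /// Reads the current value without blocking writers.
    pub fn load(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Relaxed))
    }

    /// Applies `f` to the current value in a CAS loop and returns the new value.
    /// `f` may run more than once if another thread wins the race.
    pub fn update(&self, f: impl Fn(f64) -> f64) -> f64 {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            let next = f(f64::from_bits(current)).to_bits();
            match self.0.compare_exchange_weak(
                current,
                next,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return f64::from_bits(next),
                // Another writer changed the value; retry from what it stored.
                Err(observed) => current = observed,
            }
        }
    }
}
```

Because the closure can be re-run on contention, it should be a pure function of the current value. A score and its timestamp kept in two separate atomics are not updated together by this wrapper; that pairing needs its own design.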

Allocation discipline

Pre-allocate result buffers. Ranking queries should not allocate per-candidate.
Reuse Vec capacity across query executions where possible.
Avoid String in hot-path structs — use interned IDs or u64 hashes.
Embedding vectors are &[f32] slices backed by mmap or arena, never Vec<f32> copies.
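As an illustration of the buffer-reuse rule, here is a small sketch of a scratch buffer that is allocated once and cleared between queries. `ScoreScratch` and its methods are hypothetical names introduced for this example.

```rust
/// Hypothetical scratch space owned by a worker and reused across queries.
pub struct ScoreScratch {
    scores: Vec<(u64, f32)>,
}

impl ScoreScratch {
    /// Allocates once, up front, for the largest expected candidate set.
    pub fn with_capacity(max_candidates: usize) -> Self {
        Self { scores: Vec::with_capacity(max_candidates) }
    }

    /// Scores candidates into the reused buffer.
    /// `clear` drops the old entries but keeps the allocation.
    pub fn score(
        &mut self,
        candidates: &[u64],
        score_fn: impl Fn(u64) -> f32,
    ) -> &[(u64, f32)] {
        self.scores.clear();
        self.scores
            .extend(candidates.iter().map(|&id| (id, score_fn(id))));
        &self.scores
    }

    /// Current capacity, exposed so tests can confirm no reallocation happened.
    pub fn capacity(&self) -> usize {
        self.scores.capacity()
    }
}
```

As long as the candidate count stays within the initial capacity, repeated calls to `score` perform no heap allocation.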

2. Storage Architecture

WAL is the source of truth

Every write — entity, signal, relationship — goes through the Write-Ahead Log before any processing. The entity store, signal aggregates, and search index are derived state. If they are lost, they can be rebuilt from the WAL.

Signal events are durably logged (fsync'd) before aggregation occurs
The aggregation system can crash, restart, and replay from the WAL
Content-addressed events (BLAKE3 hash of payload) for automatic deduplication of retries

Trait-abstract the storage backend

The storage engine (fjall initially, potentially RocksDB later) must sit behind a trait boundary. No storage engine types should leak into the signal, query, or ranking modules.

pub trait EntityStore: Send + Sync {
    fn get(&self, id: &EntityId) -> Result<Option<Entity>>;
    fn put(&self, entity: &Entity) -> Result<()>;
    fn scan_prefix(&self, prefix: &[u8]) -> Result<Box<dyn Iterator<Item = Entity>>>;
}
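To show what the trait boundary buys, here is a sketch of an in-memory backend that could stand in for fjall in tests. Everything except the `EntityStore` method names is an assumption made for the example: `EntityId`, `Entity`, `StoreError`, the `StoreResult` alias (used in place of the crate's `Result` alias), and `MemoryStore` are all hypothetical.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Hypothetical minimal types so the example is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EntityId(pub u64);

#[derive(Debug, Clone, PartialEq)]
pub struct Entity {
    pub id: EntityId,
    pub payload: Vec<u8>,
}

#[derive(Debug)]
pub struct StoreError(pub String);
pub type StoreResult<T> = std::result::Result<T, StoreError>;

pub trait EntityStore: Send + Sync {
    fn get(&self, id: &EntityId) -> StoreResult<Option<Entity>>;
    fn put(&self, entity: &Entity) -> StoreResult<()>;
    fn scan_prefix(&self, prefix: &[u8]) -> StoreResult<Box<dyn Iterator<Item = Entity>>>;
}

/// In-memory backend keyed by the big-endian bytes of the entity id.
#[derive(Default)]
pub struct MemoryStore {
    inner: RwLock<BTreeMap<[u8; 8], Entity>>,
}

impl EntityStore for MemoryStore {
    fn get(&self, id: &EntityId) -> StoreResult<Option<Entity>> {
        let map = self.inner.read().map_err(|e| StoreError(e.to_string()))?;
        Ok(map.get(&id.0.to_be_bytes()).cloned())
    }

    fn put(&self, entity: &Entity) -> StoreResult<()> {
        let mut map = self.inner.write().map_err(|e| StoreError(e.to_string()))?;
        map.insert(entity.id.0.to_be_bytes(), entity.clone());
        Ok(())
    }

    fn scan_prefix(&self, prefix: &[u8]) -> StoreResult<Box<dyn Iterator<Item = Entity>>> {
        let map = self.inner.read().map_err(|e| StoreError(e.to_string()))?;
        // Copy matches out so the returned iterator does not hold the lock.
        let hits: Vec<Entity> = map
            .iter()
            .filter(|(key, _)| key.starts_with(prefix))
            .map(|(_, entity)| entity.clone())
            .collect();
        Ok(Box::new(hits.into_iter()))
    }
}
```

Callers hold a `&dyn EntityStore` (or a generic bound) and never name `MemoryStore` or an engine type, which is the property the rule above is after. A mutex-style lock is used here only because this is a test double, not a hot-path structure.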

Per-entity-type storage isolation

Item signal ledgers, user preference vectors, and creator profiles live in separate storage namespaces (column families or keyspaces). A burst of signal events for a viral item must not slow down user profile reads.

Key encoding

Follow the subject-prefix pattern: {entity_id}\x00{TAG}:{suffix}. All data for one entity is co-located. Big-endian encoding so byte-lexicographic ordering matches numeric ordering.

[entity_id: u64 BE][0x00][SIG:view:24h]  → windowed aggregate
[entity_id: u64 BE][0x00][META]           → entity metadata
[entity_id: u64 BE][0x00][REL:follows]    → relationship edge
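The layout above can be sketched as a small encoder. The function names are hypothetical; the byte layout follows the three examples shown.

```rust
/// Builds `[entity_id: u64 BE][0x00][tag]`, e.g. tag = "SIG:view:24h".
pub fn encode_key(entity_id: u64, tag: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 1 + tag.len());
    // Big-endian keeps byte-lexicographic order equal to numeric order.
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0x00);
    key.extend_from_slice(tag.as_bytes());
    key
}

/// Prefix shared by every key of one entity: the id bytes plus the 0x00 separator.
pub fn entity_prefix(entity_id: u64) -> [u8; 9] {
    let mut prefix = [0u8; 9];
    prefix[..8].copy_from_slice(&entity_id.to_be_bytes());
    prefix
}
```

With little-endian encoding, id 255 would sort after id 256 byte-wise; big-endian avoids that, so a prefix scan over `entity_prefix(id)` visits exactly one entity's keys in one contiguous range.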

3. Signal System

Decay is a type, not a formula you call

The application never computes trending_score = views_24h / (age_hours + 2)^1.8. That logic lives in a named ranking profile. The application writes SIGNAL view and queries USING PROFILE trending.

Running decay scores — O(1) update, O(1) read

Use the forward-decay formula. It is mathematically exact, not an approximation.

Write: S(t) = S(t_prev) * exp(-lambda * dt) + weight
Read: current = stored * exp(-lambda * dt_since_last)

Cost: 3 exp() calls per write, one per decay rate (~36ns), and 1 exp() per read per entity per lambda (~15ns). For 200 candidates read at one decay rate each, that's 200 × ~15ns, or ~3-4 microseconds total.

Do not scan raw events to compute decay at read time. That path costs 160+ microseconds at 50 events/entity and breaks the budget at 500+.

Out-of-order events are handled correctly

When t_event < last_update, pre-decay the weight: score += weight * exp(-lambda * (last_update - t_event)). Do not update last_update — it already reflects a more recent time.
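The write, read, and out-of-order rules above can be sketched together as follows. This is a single-threaded illustration with hypothetical names, using `f64` seconds for time rather than the nanosecond integers used elsewhere in this document.

```rust
/// Hypothetical running decay score for one (entity, lambda) pair.
#[derive(Debug, Clone, Copy)]
pub struct DecayScore {
    score: f64,
    last_update: f64, // seconds; an assumption made for readability
}

impl DecayScore {
    pub fn new(t: f64) -> Self {
        Self { score: 0.0, last_update: t }
    }

    /// O(1) write: decay the stored score forward, or pre-decay a late event.
    pub fn record(&mut self, t_event: f64, weight: f64, lambda: f64) {
        if t_event >= self.last_update {
            let dt = t_event - self.last_update;
            self.score = self.score * (-lambda * dt).exp() + weight;
            self.last_update = t_event;
        } else {
            // Late event: shrink its weight to what it would be worth at
            // `last_update`, and leave `last_update` untouched.
            let lateness = self.last_update - t_event;
            self.score += weight * (-lambda * lateness).exp();
        }
    }

    /// O(1) read: decay the stored score to `now` without mutating it.
    pub fn read(&self, now: f64, lambda: f64) -> f64 {
        self.score * (-lambda * (now - self.last_update)).exp()
    }
}
```

Recording two events in either order yields the same value at read time, which is the property the out-of-order rule is meant to guarantee. Reads are expected to use `now >= last_update`.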

Immutable events, mutable aggregates

Signal events (a user liked an item at time T) are immutable facts. Signal aggregates (this item has 1,247 likes in the last 24h) are mutable derived state. Keep these layers distinct. Aggregates can always be recomputed from events.

4. Vector Index

USearch is the HNSW engine

Do not build HNSW from scratch. USearch provides 126K+ QPS, predicate callbacks during traversal, mmap persistence, and quantization. The FFI boundary via CXX is thin.

f16 quantization as default

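10M vectors at 1536D: ~31.5 GB (f16) vs ~60 GB (float32). For reference, the raw vector payload alone is 10M × 1536 × 2 bytes = 30.72 GB at f16 and 10M × 1536 × 4 bytes = 61.44 GB at float32. Less than 1% recall loss. Use float32 only when benchmarks prove f16 is insufficient for a specific embedding model.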

Normalize embeddings at insertion time

For cosine similarity, normalize vectors to unit length and use L2 distance. For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity, so both produce the same ranking, and L2 is more SIMD-friendly. Store normalized vectors; never re-normalize at query time.
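A short sketch of the relationship between L2 distance and cosine similarity on unit vectors. The helper names are hypothetical and the code is scalar for clarity; it is not how USearch computes distances.

```rust
/// Scales `v` to unit length. A zero vector is returned unchanged.
pub fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; equals cosine similarity when both inputs are unit length.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Squared Euclidean distance.
pub fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

For unit vectors, `l2_squared(a, b)` equals `2.0 - 2.0 * dot(a, b)` up to floating-point error, so sorting by ascending L2 distance gives the same order as sorting by descending cosine similarity.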

Adaptive filtered search

Never hardcode a single filtering strategy. Estimate selectivity, then branch:

<2% selectivity: Pre-filter (roaring bitmap intersection) then brute-force L2
2-100% selectivity: filtered_search with predicate callback (in-graph filtering)
Fallback: Widen ef_search or degrade to pre-filter + brute-force
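The branch above could be sketched like this. The enum and function names are hypothetical, and the matching count is assumed to come from a cheap estimate such as a bitmap cardinality.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FilterStrategy {
    /// Intersect filter bitmaps first, then brute-force L2 over the survivors.
    PreFilterBruteForce,
    /// Traverse the graph and test the predicate on each visited node.
    InGraphFilter,
}

/// Picks a strategy from the estimated number of entities passing the filter.
pub fn choose_strategy(matching: u64, total: u64) -> FilterStrategy {
    if total == 0 {
        return FilterStrategy::PreFilterBruteForce;
    }
    // matching / total < 2%  <=>  matching * 100 < total * 2 (exact in integers).
    if (matching as u128) * 100 < (total as u128) * 2 {
        FilterStrategy::PreFilterBruteForce
    } else {
        FilterStrategy::InGraphFilter
    }
}

/// The fallback applies when in-graph filtering returns fewer than `k` hits.
pub fn needs_fallback(found: usize, k: usize) -> bool {
    found < k
}
```

Integer arithmetic avoids a floating-point edge case exactly at the 2% boundary, where the in-graph strategy is chosen.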

5. Text Search

Tantivy is a derived index, not a source of truth

The entity store is the source of truth. Tantivy is a materialized view. If the Tantivy index is corrupted or lost, it can be rebuilt from the entity store.

Consistency pattern:

Write to entity store (within transaction / WAL)
Background indexer reads outbox and feeds Tantivy
On each Tantivy commit, store last-processed sequence number in commit payload
On crash recovery, replay from that sequence number

Hybrid fusion starts with RRF

RRF(d) = 1/(60 + rank_bm25) + 1/(60 + rank_ann). Rank-based, no score normalization needed, robust across query types. Graduate to tuned linear combination only after relevance labels exist to tune alpha.
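A minimal sketch of the RRF formula applied to two ranked lists of document ids. The function name is hypothetical, ranks are 1-based, and a document missing from one list is assumed to receive no contribution from that list.

```rust
use std::collections::HashMap;

const RRF_K: f64 = 60.0;

/// Fuses two ranked id lists (best first) into one list sorted by RRF score.
pub fn rrf_fuse(bm25: &[u64], ann: &[u64]) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in [bm25, ann] {
        for (index, &doc) in list.iter().enumerate() {
            let rank = (index + 1) as f64; // 1-based rank
            *scores.entry(doc).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

For `bm25 = [1, 2, 3]` and `ann = [3, 1, 4]`, document 1 scores 1/61 + 1/62 and document 3 scores 1/63 + 1/61, so document 1 ranks first even though it tops only one list.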

6. Query and Ranking

Ranking profiles are data, not code

Profiles are schema-level declarations — parsed, validated, versioned, stored in the database. They are not Rust functions compiled into the binary. The query optimizer reasons about profile structure to plan execution.

A profile change should never require recompiling or redeploying.

Diversity is a post-scoring pass

After candidates are scored, apply diversity constraints as a separate reordering pass. Diversity does not reduce result count — it reorders to enforce constraints (max_per_creator, format_mix) while maintaining the target count.
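One way to picture the post-scoring pass is a greedy selection with a per-creator cap. This sketch assumes the scored pool is larger than the target count so the cap can be met by skipping ahead; the `Scored` type and `diversify` function are hypothetical, and only max_per_creator is modeled.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub item: u64,
    pub creator: u64,
    pub score: f64,
}

/// Walks `scored` (already sorted best-first) and keeps up to `target` items,
/// skipping any item whose creator already has `max_per_creator` picks.
pub fn diversify(scored: &[Scored], target: usize, max_per_creator: usize) -> Vec<Scored> {
    let mut per_creator: HashMap<u64, usize> = HashMap::new();
    let mut picked = Vec::with_capacity(target);
    for candidate in scored {
        if picked.len() == target {
            break;
        }
        let count = per_creator.entry(candidate.creator).or_insert(0);
        if *count < max_per_creator {
            *count += 1;
            picked.push(candidate.clone());
        }
    }
    picked
}
```

Scores are never modified; only the selection order changes. If the pool is too homogeneous to fill `target` under the cap, this sketch returns fewer items, and how that case should be resolved is left open here.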

Negative signals are structurally equal to positive signals

Skips, hides, blocks, mutes, downvotes are not the absence of engagement. They are data. They carry the same weight, precision, and update immediacy as likes. A hide creates a permanent hard-negative. A skip within 3 seconds is a strong quality signal. The ranking function treats these as first-class inputs.

Graceful degradation, never failure

Under load, return slightly less precise rankings — not errors. Degrade in this order:

Reduce candidate set size (top_k: 500 -> 200)
Use coarser signal aggregates (skip velocity, use windowed counts)
Skip diversity enforcement
Return results from materialized cache

Never return an empty result set or an error for a well-formed query.
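The degradation order can be modeled as an ordered enum where each level includes all earlier steps. The names below are hypothetical; the 500 and 200 values come from the list above.

```rust
/// Degradation levels in the order they are applied. Ordering is derived from
/// declaration order, so `>=` means "at least this degraded".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Degradation {
    Full,
    ReducedCandidates,
    CoarseSignals,
    NoDiversity,
    CachedResults,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ExecutionBudget {
    pub top_k: usize,
    pub use_velocity: bool,
    pub enforce_diversity: bool,
    pub serve_from_cache: bool,
}

/// Maps a level to the knobs the executor should honor. Steps are cumulative.
pub fn budget_for(level: Degradation) -> ExecutionBudget {
    ExecutionBudget {
        top_k: if level >= Degradation::ReducedCandidates { 200 } else { 500 },
        use_velocity: level < Degradation::CoarseSignals,
        enforce_diversity: level < Degradation::NoDiversity,
        serve_from_cache: level >= Degradation::CachedResults,
    }
}
```

Keeping the ladder in one function makes the order auditable: a later step can never be active while an earlier one is skipped.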

7. Error Handling

Result<T> everywhere, unwrap() nowhere

Every fallible operation returns Result. No unwrap(), no expect() outside of tests and initialization. A panic in a database can leave state corrupted.

Errors are typed and actionable

pub enum TidalError {
    /// Storage engine failure — retry may succeed.
    Storage(StorageError),
    /// Entity not found — caller should handle.
    NotFound { entity: EntityId },
    /// Schema violation — caller's fault, fix the input.
    Schema(SchemaError),
    /// Signal write failed durability check — retry required.
    Durability(DurabilityError),
    /// Query malformed — parse error with position.
    Query(QueryError),
    /// Internal invariant violated — this is a bug, log and degrade.
    Internal(String),
}

Internal errors trigger graceful degradation, not crashes. Log them loudly. Return approximate results if possible.

8. Testing

Property tests for invariants

Use proptest for properties that must hold regardless of input:

Decay scores monotonically decrease when no new events arrive
Windowed aggregates equal the sum of events within the window
Diversity constraints hold in every result set
WAL replay produces identical state to uninterrupted execution
Filter composition is commutative (order of filters doesn't change results)
Blocked/hidden items never appear in query results

Crash recovery tests

Simulate crashes at every point in the write path:

Mid-WAL-write
After WAL commit, before entity store update
After entity store, before signal aggregation
After signal aggregation, before Tantivy index
During background materialization

Verify: the system recovers to a consistent state. No lost events. No phantom state.

Benchmark from day one

Use criterion for micro-benchmarks. Track these numbers continuously:

Signal write latency (target: <100 microseconds, including the amortized cost of WAL fsync)
Decay score read per candidate (target: ~15ns)
200-candidate scoring pass (target: <5 microseconds)
ANN retrieval at 1M vectors (target: <10ms p99)
BM25 query at 1M documents (target: <10ms)
End-to-end RETRIEVE query (target: <50ms)

Regressions in these numbers are bugs. Treat them like test failures.

9. Code Organization

Module boundaries match the dependency chain

storage/     → knows nothing about signals, queries, or ranking
signals/     → depends on storage, knows nothing about queries or ranking
query/       → depends on storage + signals, knows nothing about ranking internals
ranking/     → depends on signals, invoked by query executor
schema/      → standalone, depended on by everything

Circular dependencies between these modules are architectural bugs. If ranking needs to call into storage directly, that call goes through a trait the query executor provides.

Public API is minimal

Expose the smallest possible surface. Internal types stay internal. The public API is:

TidalDB::open(), TidalDB::shutdown()
define_entity(), define_signal(), define_profile()
write_item(), write_user(), write_creator()
write_relationship()
signal()
retrieve(), search(), suggest()

Everything else is pub(crate) or module-private.

One concern per file

A file that handles both signal ingestion and signal aggregation will grow into a 2000-line mess. Split early: signals/ingest.rs, signals/decay.rs, signals/aggregation.rs, signals/materialization.rs.

10. Dependencies

Minimal, intentional, auditable

Every dependency must justify its existence against "could we write this in 200 lines?"

Approved dependencies (from research):

fjall — storage engine (pure Rust, embeddable)
usearch — HNSW vector index (C++ FFI via cxx)
tantivy — full-text search / BM25
blake3 — content-addressed hashing
roaring — bitmap indexes for filtered search
thiserror — derive Display and From for typed error enums; eliminates boilerplate without hiding structure
tracing — structured spans for query execution, WAL writes, and signal ingestion; embedders choose their own subscriber
criterion — benchmarking
proptest — property testing
serde / serde_json — serialization (at API boundaries only, not in hot paths)
chrono or time — timestamp handling
dashmap — concurrent hash map for hot-path entity state

Do not add dependencies for things the standard library or a 50-line util handles: argument parsing, builder pattern macros, derive-everything crates.

No unsafe without a comment explaining why

Every unsafe block must have a // SAFETY: comment explaining:

What invariant the compiler can't verify
Why this specific usage is sound
What would make it unsound (for future maintainers)

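Prefer #![forbid(unsafe_code)] at the crate level where possible. Because forbid cannot be overridden by an inner #[allow], a crate that contains one of the exceptions below needs #![deny(unsafe_code)] with #[allow(unsafe_code)] scoped to that module instead. The storage engine and FFI boundaries (USearch) are the only modules that should need unsafe.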
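For illustration, a SAFETY comment that answers the three questions above might look like this. The function is a contrived, hypothetical example; a safe iterator would normally be preferred for this particular job.

```rust
/// Sums the first `n` values of `values` without per-element bounds checks.
pub fn sum_prefix(values: &[f32], n: usize) -> f32 {
    assert!(n <= values.len(), "n exceeds slice length");
    let mut total = 0.0;
    for i in 0..n {
        // SAFETY:
        // - Invariant the compiler cannot verify: `i < values.len()`.
        // - Why this is sound: the assert above guarantees `n <= values.len()`,
        //   and the loop only yields `i < n`.
        // - What would make it unsound: removing the assert, or changing the
        //   loop range so that `i` can reach `n` or beyond.
        total += unsafe { *values.get_unchecked(i) };
    }
    total
}
```

The comment sits directly on the unsafe expression, so a reviewer can check the claim against the surrounding three lines without searching the file.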

11. Observability

tracing spans on every public operation

Every public function that crosses a subsystem boundary gets a #[tracing::instrument] attribute. This is non-negotiable — it is how query latency, signal write throughput, and WAL sync times are measured in production without any additional instrumentation work later.

#[tracing::instrument(skip(self), fields(entity_id = %id))]
pub fn get_entity(&self, id: EntityId) -> Result<Option<Entity>> {
    // ...
}

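By default, #[tracing::instrument] records every function argument as a span field using its Debug implementation. The skip attribute excludes large or sensitive arguments from that recording. Add fields(...) to surface the key identifiers that make traces navigable.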

Instrument at subsystem entry points, not every helper

Instrument the public API and the major internal stage boundaries:

EntityStore::{get, put, scan_prefix}
SignalLedger::{record, decay_score}
QueryExecutor::execute
RankingEngine::score
Wal::{append, flush}

Do not add spans to private helpers called within a single instrumented function. The overhead accumulates.

tidalDB is a library — embedders choose their subscriber

Do not initialize a tracing subscriber anywhere in this crate. The subscriber is the embedder's responsibility. Depend on tracing = "0.1" only; never add tracing-subscriber as a dependency of the main crate.

Error events

Use tracing::error! for TidalError::Internal (a bug occurred), tracing::warn! for recoverable degradation, tracing::debug! for query planning decisions, tracing::trace! for per-candidate scoring.

Never use println! or eprintln! in production code.

12. Commit and Review Standards

Commits are atomic and purposeful

One logical change per commit. "Add signal decay scoring" is a commit. "Add decay scoring and also fix a typo and refactor entity store" is three commits.

Every PR must include

What changed and why (not how — the diff shows how)
Benchmark results if touching hot-path code
Property test or crash recovery test if touching write path or state management
No regressions in existing benchmarks

No TODO without an issue

// TODO: comments are allowed only with a link to a tracking issue. Orphan TODOs rot. If it's worth noting, it's worth tracking.
