# 13 -- Concurrency Specification

**Status:** Draft
**Authors:** tidalDB Engineering
**Date:** 2026-02-20
**Depends on:** [Storage Engine](01-storage-engine.md), [Signal System](03-signal-system.md), [Feedback Loop](10-feedback-loop.md)
**References:** [CODING_GUIDELINES.md](../../CODING_GUIDELINES.md), [thoughts.md](../../thoughts.md), [Text Retrieval](06-text-retrieval.md), [Vector Retrieval](07-vector-retrieval.md)

---

## Table of Contents

1. [Overview](#1-overview)
2. [Thread Model](#2-thread-model)
3. [Lock-Free Signal Updates](#3-lock-free-signal-updates)
4. [Group Commit](#4-group-commit)
5. [Read-Write Isolation](#5-read-write-isolation)
6. [Deadlock Prevention](#6-deadlock-prevention)
7. [Graceful Degradation Ladder](#7-graceful-degradation-ladder)
8. [Background Task Scheduling](#8-background-task-scheduling)
9. [Shutdown Protocol](#9-shutdown-protocol)
10. [Memory Management](#10-memory-management)
11. [Invariants and Property Tests](#11-invariants-and-property-tests)

---

## 1. Overview

tidalDB is a single-process, multi-threaded Rust database. It must handle hundreds of thousands of signal writes per second concurrent with ranking queries that complete in under 50ms. The concurrency model is the mechanism that makes both workloads coexist in the same address space without interference.

The fundamental tension: signal writers must update shared state (decay scores, windowed counters, relationship weights) that ranking queries must read simultaneously. Mutexes on the hot path are not an option. At sustained signal write rates and 10K ranking queries/sec, a mutex on any shared counter would serialize the system to the throughput of a single core.

The solution, validated by Engram's spreading activation engine, Citadel's per-tenant quota tracking, and StemeDB's concurrent vote counting, is a layered concurrency model:

1. **Atomics and CAS loops** for all per-entity signal state on the hot path.
2. **Epoch-based reclamation** for concurrent data structure mutations (entity metadata updates, relationship graph changes).
3. **Channel-serialized writes** for the WAL (one writer, many producers).
4. **Lock-free reads everywhere** -- no ranking query ever acquires a lock on the scoring path.

### Design Principles

**Writers never block readers. Readers never block writers.** This is not an aspiration. It is a structural invariant enforced by the choice of data structures and memory ordering.

**Correctness over throughput.** A lock-free counter that silently loses updates is worse than a mutex that is slow. Every atomic operation in this specification has a correctness proof: the memory ordering is sufficient to prevent torn reads, and the CAS retry loop guarantees no lost updates.

**The compiler is the first concurrency reviewer.** Rust's ownership system prevents data races at compile time. `Send` and `Sync` bounds on thread-shared types are not annotations -- they are proof obligations. If a type does not implement `Send + Sync`, it cannot cross thread boundaries, period.

---

## 2. Thread Model

### 2.1 Thread Architecture

tidalDB uses a fixed set of thread pools, each dedicated to a workload class. Threads within a pool are interchangeable. Threads across pools interact only through atomic state and channels.
``` Thread Architecture +------------------------------------------------------------------+ | Application | | db.signal() db.retrieve() db.search() db.define_*() | +------+---------------+----------------+----------------+----------+ | | | | v v v v +------+------+ +------+-------+ +------+-------+ +-----+--------+ | Signal | | Query | | Query | | Schema | | Writer Pool | | Executor Pool| | Executor Pool| | (main thread)| | N threads | | M threads | | (shared) | | | +------+------+ +------+-------+ +------+-------+ +-----+--------+ | | | | | (channel) | (atomics) | (atomics) | (mutex, cold) v v v v +------+------+ +------+---------------------------------------+---+ | WAL Commit | | Shared State | | Thread (1) | | | +------+------+ | DashMap (atomics) | | | DashMap<(EntityId,Sig), WarmState> (atomics) | | | Entity Metadata (epoch/COW) | | | Relationship Graph (append-only) | | | HNSW Vector Index (node locks) | | | Tantivy Segments (immutable) | | +---+----------+----------+----------+-------+-----+ | | | | | | v v v v v v +------+------+ +---+---+ +---+---+ +----+----+ +--+--+ +--+---+ | WAL on disk | |fjall | |redb | |Tantivy | |USearch| |Bloom | | | |(LSM) | |(B-tree)| |(text) | |(HNSW) | |filter| +-------------+ +-------+ +-------+ +---------+ +------+ +------+ Background Threads (not shown above for clarity): +------------------+ +--------------------+ +------------------+ | Materializer | | Index Maintenance | | Tier Migration | | Pool (B threads) | | Pool (I threads) | | Thread (1) | | - bucket rotate | | - HNSW insert | | - hot/cold evict | | - rollup compute | | - Tantivy merge | | - promote on | | - checkpoint | | - segment flush | | access | | - segment recomp | | | | | +------------------+ +--------------------+ +------------------+ ``` ### 2.2 Thread Pool Definitions | Pool | Purpose | Default Size | Scaling Rule | |------|---------|-------------|--------------| | **Signal Writers** | Accept signal events, hash for dedup, update hot-tier atomics, 
enqueue WAL records | `min(4, cores / 4)` | Scale with signal ingestion rate. Each thread sustains ~250K signals/sec. | | **WAL Commit** | Single thread. Drains the WAL batch queue, issues `writev()` + `fdatasync()`, notifies waiters. | 1 (always) | Never more than 1. The WAL is a sequential write stream. Parallelizing it would require synchronization that negates the benefit. | | **Query Executors** | Execute RETRIEVE/SEARCH/SUGGEST queries. Read from hot tier, vector index, text index. Score candidates. Enforce diversity. | `min(cores / 2, 16)` | Scale with query concurrency. Each executor handles one query at a time. The pool size bounds concurrent queries. | | **Materializers** | Background aggregation: bucket rotation, rollup computation, checkpointing, behavioral segment recomputation. | `min(2, cores / 8)` | Rarely needs more than 2. The materializer is I/O-bound on disk writes, not CPU-bound. | | **Index Maintenance** | HNSW vector insertions, Tantivy segment merges, Tantivy document indexing. | `min(2, cores / 8)` | HNSW insertion is CPU-bound (graph traversal). Tantivy segment merge is I/O-bound. 2 threads covers both. | | **Tier Migration** | Evict cold entities from hot tier, promote on access. | 1 | Single thread is sufficient. Migration is periodic and low-volume. | ### 2.3 Thread Pool Sizing For a reference deployment on a 16-core machine: ``` 16 cores allocation: Signal Writers: 4 threads (sustains ~1M signals/sec aggregate) WAL Commit: 1 thread (sequential writes, one fdatasync at a time) Query Executors: 8 threads (8 concurrent ranking queries) Materializers: 2 threads (bucket rotation + rollup generation) Index Maintenance: 2 threads (HNSW inserts + Tantivy merges) Tier Migration: 1 thread (periodic eviction/promotion) -- Total: 18 threads (slight oversubscription is intentional) ``` The slight oversubscription (18 threads on 16 cores) is deliberate. 
The WAL commit thread and materializer threads are often blocked on I/O (`fdatasync`, disk reads for rollups), so their cores are available for query executors. Under sustained load, the OS scheduler handles the overlap. Under burst load, signal writers and query executors compete for cores -- the graceful degradation system (Section 7) sheds load before this becomes a problem.

For smaller machines (4 cores):

```
4 cores allocation:
  Signal Writers:     1 thread   (sustains ~250K signals/sec)
  WAL Commit:         1 thread
  Query Executors:    2 threads
  Materializers:      1 thread
  Index Maintenance:  1 thread   (shared: HNSW + Tantivy alternate)
  Tier Migration:     0          (runs on materializer thread)
  --
  Total:              6 threads
```

### 2.4 CPU Affinity Considerations

tidalDB does not pin threads to cores by default. OS scheduler placement is sufficient for most deployments. However, two patterns benefit from affinity when available:

1. **WAL commit thread.** Pinning to a core near the NVMe controller's NUMA node reduces fdatasync latency by avoiding cross-NUMA memory access. On NUMA systems, measure fdatasync latency before and after pinning.
2. **Signal writer threads.** These access the same `DashMap` shards repeatedly. Pinning writers to adjacent cores on the same NUMA node reduces cache-coherency traffic (MESI invalidations) for the DashMap's internal `RwLock`-per-shard.

Affinity is configured via `ThreadConfig`, not hardcoded. The default is `None` (OS-scheduled).

```rust
pub struct ThreadConfig {
    pub signal_writers: usize,
    pub query_executors: usize,
    pub materializers: usize,
    pub index_maintenance: usize,
    /// Optional CPU affinity for the WAL commit thread.
    /// Set to a core ID near the NVMe NUMA node for best fdatasync latency.
    pub wal_commit_affinity: Option<usize>,
    /// Optional CPU set for signal writer threads.
    /// Adjacent cores on the same NUMA node reduce coherency traffic.
    pub signal_writer_affinity: Option<Vec<usize>>,
}

impl Default for ThreadConfig {
    fn default() -> Self {
        let cores = num_cpus::get();
        Self {
            // Each pool has a floor of 1 thread and the cap from Section 2.2.
            signal_writers: (cores / 4).max(1).min(4),
            query_executors: (cores / 2).max(1).min(16),
            materializers: (cores / 8).max(1).min(2),
            index_maintenance: (cores / 8).max(1).min(2),
            wal_commit_affinity: None,
            signal_writer_affinity: None,
        }
    }
}
```

---

## 3. Lock-Free Signal Updates

Signal counters are the hottest shared state in the system. Every signal write updates them. Every ranking query reads them. They must be lock-free with carefully chosen memory ordering.

### 3.1 AtomicF64: Bit-Pattern Encoding

Rust's standard library provides `AtomicU64` but not `AtomicF64`. tidalDB encodes floating-point values in `AtomicU64` using bit transmutation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// AtomicF64 via bit-pattern encoding in AtomicU64.
///
/// f64::to_bits() and f64::from_bits() are lossless round-trip
/// conversions. The bit pattern is not meaningful as an integer --
/// it is only used for atomic load/store/CAS operations.
///
/// This is the same technique used by Engram for activation levels
/// and StemeDB for aggregate weights.
pub struct AtomicF64(AtomicU64);

impl AtomicF64 {
    pub fn new(val: f64) -> Self {
        Self(AtomicU64::new(val.to_bits()))
    }

    pub fn load(&self, order: Ordering) -> f64 {
        f64::from_bits(self.0.load(order))
    }

    pub fn store(&self, val: f64, order: Ordering) {
        self.0.store(val.to_bits(), order);
    }

    /// Compare-and-swap on the bit pattern.
    /// Returns Ok(previous_f64) on success, Err(actual_f64) on failure.
    /// The comparison is bitwise, so 0.0 and -0.0 are distinct values here.
    pub fn compare_exchange_weak(
        &self,
        current: f64,
        new: f64,
        success: Ordering,
        failure: Ordering,
    ) -> Result<f64, f64> {
        match self.0.compare_exchange_weak(
            current.to_bits(),
            new.to_bits(),
            success,
            failure,
        ) {
            Ok(bits) => Ok(f64::from_bits(bits)),
            Err(bits) => Err(f64::from_bits(bits)),
        }
    }
}
```

### 3.2 Memory Ordering Table

Every atomic operation in tidalDB uses the minimum ordering sufficient for correctness. This table is the authoritative reference.
Any atomic operation not listed here is a bug.

| Operation | Type | Ordering | Justification |
|-----------|------|----------|---------------|
| **Counters** | | | |
| `view_count.fetch_add(1)` | `AtomicU64` | `Relaxed` | Pure accumulator. No other operation depends on seeing this specific increment. Ranking queries read a recent-enough value. |
| `minute_bucket[i].fetch_add(1)` | `AtomicU32` | `Relaxed` | Bucket increments are independent. Bucket rotation uses Acquire/Release to synchronize the bucket pointer. |
| `all_time_count.fetch_add(1)` | `AtomicU64` | `Relaxed` | Same as view_count. |
| **Decay Scores** | | | |
| `decay_scores[i].load()` (writer) | `AtomicU64` (f64 bits) | `Acquire` | Reads the score that seeds the CAS. A stale value cannot cause a lost update: the CAS compares against the latest value, fails, and the loop retries. Acquire pairs with the AcqRel CAS of the previous writer. |
| `decay_scores[i].compare_exchange_weak()` | `AtomicU64` (f64 bits) | `AcqRel` / `Acquire` | AcqRel on success: the new score is visible to subsequent Acquire loads. Acquire on failure: reload the latest value for retry. |
| `decay_scores[i].load()` (reader) | `AtomicU64` (f64 bits) | `Acquire` | Pairs with the AcqRel CAS that published the score. Consistency with `last_update_ns` comes from the reader's preceding Acquire load of the timestamp, which pairs with the Release on `last_update_ns.store()`. |
| **Timestamps** | | | |
| `last_update_ns.load()` (writer) | `AtomicU64` | `Acquire` | Pairs with the Release store. Writer must see the most recent timestamp to correctly compute `dt`. |
| `last_update_ns.store()` (writer) | `AtomicU64` | `Release` | Makes the updated timestamp (and all preceding score updates) visible to readers that load with Acquire. This is the synchronization point between writers and readers. |
| `last_update_ns.load()` (reader) | `AtomicU64` | `Acquire` | Establishes a happens-before with the writer's Release store. After this load, all score updates that preceded the writer's timestamp store are visible. |
| **Bucket Pointers** | | | |
| `current_minute.store()` | `AtomicU8` | `Release` | After zeroing the new bucket and storing the rotated pointer, Release ensures readers see the zeroed bucket and the new pointer consistently. |
| `current_minute.load()` | `AtomicU8` | `Acquire` | Reader must see the pointer consistent with the bucket contents. Pairs with the materializer's Release store. |
| **State Transitions** | | | |
| `entity_tier.compare_exchange()` | `AtomicU8` | `AcqRel` / `Acquire` | Tier transitions (cold->warm->hot) must be atomic and visible before tier-specific data is accessed. |
| `entity_status.store()` | `AtomicU8` | `Release` | Status transitions (live, archived, deleted) gate query inclusion. Pairs with Acquire loads in query executors. |
| **Shutdown / Control** | | | |
| `shutdown_flag.store(true)` | `AtomicBool` | `Release` | All pending writes must be visible before threads observe the shutdown flag. |
| `shutdown_flag.load()` | `AtomicBool` | `Acquire` | Threads must see all state updates that preceded the shutdown signal. |

**Why SeqCst is never used.** Sequential consistency (`SeqCst`) establishes a single total order across all atomic operations on all variables. This is unnecessary for tidalDB because no operation requires global ordering -- each synchronization point involves at most two variables (a timestamp and a score, or a pointer and a bucket). The Acquire/Release pairs provide sufficient ordering at lower cost. On x86-64, Acquire loads and Release stores compile to plain loads and stores (TSO provides these for free). On ARM64, they compile to `ldar`/`stlr` instructions, which are cheaper than the full barriers required for SeqCst.

### 3.3 CAS Loop Pattern

Compound updates (decay score = f(old_score, new_event)) use a compare-and-swap loop:

```rust
use std::sync::atomic::Ordering;

/// Update a running decay score atomically.
///
/// Correctness argument:
/// - The CAS loop retries until the compare succeeds.
/// - Each retry reloads the current value, so no concurrent update is lost.
/// - The loop terminates because: (a) only signal writer threads execute this
///   code, (b) the number of writer threads is bounded, and (c) CAS on x86-64
///   uses a hardware lock prefix that guarantees forward progress (no livelock).
/// - On ARM64, compare_exchange_weak may spuriously fail, but the retry loop
///   handles this -- weak is preferred over strong because it avoids the
///   load-linked/store-conditional retry penalty on ARM.
fn update_decay_score(
    score: &AtomicF64,
    dt_seconds: f64,
    lambda: f64,
    weight: f64,
) {
    loop {
        let prev = score.load(Ordering::Acquire);
        let decayed = prev * (-lambda * dt_seconds).exp();
        let new_val = decayed + weight;
        match score.compare_exchange_weak(
            prev,
            new_val,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => break,
            Err(_) => continue, // Another writer updated (or spurious failure); retry with new value.
        }
    }
}
```

**Retry bound.** A CAS fails only when another writer's CAS on the same score succeeded in between (or, on ARM64, spuriously). If each of the other N-1 signal writer threads is applying a single update, the maximum number of retries for one update is N-1 (each concurrent writer succeeds once). At N=4 writers, the worst case is 3 retries on top of the initial attempt, each attempt costing ~15 ns (load + exp + CAS). Total worst case: ~60 ns.

**ABA prevention.** ABA is not a concern for decay scores. The CAS compares values, not pointers: if writer A reads 5.0, writer B changes the score to 5.7, and some third operation changes it back to 5.0, writer A's CAS succeeds against a state that is indistinguishable from one that never changed, and A's update is applied to the current score. In practice this scenario does not arise, because every update writes `old_score * decay_factor + weight`, and the bit pattern of a decayed-and-incremented score is astronomically unlikely to match any previous bit pattern.

### 3.4 Contention Analysis

CAS contention occurs when multiple signal events for the same entity arrive on different writer threads simultaneously. The probability depends on the entity's signal rate and the number of writer threads.
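Contention on a single score can also be observed empirically. The following is a minimal, std-only sketch (not tidalDB code): the helper names `add_weight` and `hammer` are hypothetical, decay is omitted (`lambda = 0`) and every event has weight 1.0 so that the expected final score is an exact integer. It counts failed CAS attempts while several threads update one bit-encoded `f64`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Adds `weight` to an f64 stored as bits in an AtomicU64 (hypothetical helper).
/// Returns the number of failed CAS attempts, i.e. retries.
fn add_weight(score: &AtomicU64, weight: f64) -> u64 {
    let mut retries = 0;
    let mut prev = score.load(Ordering::Acquire);
    loop {
        let new_val = f64::from_bits(prev) + weight;
        match score.compare_exchange_weak(
            prev,
            new_val.to_bits(),
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return retries,
            // The failure value is the latest bit pattern; retry from it.
            Err(actual) => {
                prev = actual;
                retries += 1;
            }
        }
    }
}

/// Hammers one score from `writers` threads.
/// Returns (final score, total retries across all threads).
fn hammer(writers: usize, events_per_writer: usize) -> (f64, u64) {
    let score = Arc::new(AtomicU64::new(0f64.to_bits()));
    let handles: Vec<_> = (0..writers)
        .map(|_| {
            let score = Arc::clone(&score);
            thread::spawn(move || {
                (0..events_per_writer)
                    .map(|_| add_weight(&score, 1.0))
                    .sum::<u64>()
            })
        })
        .collect();
    let retries: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (f64::from_bits(score.load(Ordering::Acquire)), retries)
}

fn main() {
    let (total, retries) = hammer(4, 100_000);
    // `total` is always writers * events_per_writer; `retries` varies per run.
    println!("final score = {total}, CAS retries = {retries}");
}
```

The final score is deterministic because no update is lost; the retry count depends on the machine and scheduler, and this tight loop is far more adversarial than production traffic spread across many entities.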

| Entity Activity | Events/sec | P(contention per CAS) | Expected Retries |
|----------------|-----------|----------------------|-----------------|
| Average item (50 events/day) | 0.0006/sec | ~0.000000002% | 0 |
| Active item (5K events/day) | 0.058/sec | ~0.00002% | 0 |
| Viral item (500K events/day) | 5.8/sec | ~0.002% | 0 |
| Extreme burst (50K events/sec) | 50K/sec | ~20% | 0.6 |

Even under extreme burst conditions on a single entity, CAS retries remain bounded by the writer count (max 3 retries at 4 writers) and cost ~60 ns total -- negligible.

### 3.5 False Sharing Prevention

False sharing occurs when two threads write to different fields that share a cache line (64 bytes on x86-64 and ARM64). tidalDB prevents false sharing by aligning every per-entity signal struct to a 64-byte cache line boundary:

```rust
use std::sync::atomic::AtomicU64;

/// One entity's hot-tier signal state for one signal type.
/// Exactly one L1 cache line. Never shares a cache line with another entity.
///
/// Layout verified by static assertion.
#[repr(C, align(64))]
pub struct HotSignalState {
    entity_id: u64,               //  8 bytes [0..8]
    last_update_ns: AtomicU64,    //  8 bytes [8..16]
    signal_type_id: u16,          //  2 bytes [16..18]
    flags: u16,                   //  2 bytes [18..20]
    _pad0: [u8; 4],               //  4 bytes [20..24]
    decay_scores: [AtomicU64; 3], // 24 bytes [24..48] (f64 via bits)
    _pad1: [u8; 16],              // 16 bytes [48..64]
}

const _: () = assert!(
    core::mem::size_of::<HotSignalState>() == 64,
    "HotSignalState must be exactly one cache line"
);
const _: () = assert!(
    core::mem::align_of::<HotSignalState>() == 64,
    "HotSignalState must be cache-line aligned"
);
```

---

## 4. Group Commit

### 4.1 Architecture

The WAL uses a single-writer architecture with group commit to amortize `fdatasync()` cost across multiple concurrent producers.
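The amortization effect can be modeled without any WAL at all. The sketch below is a std-only miniature under stated assumptions: it uses `std::sync::mpsc::sync_channel` in place of whatever channel the real implementation uses, records are plain `u64` values, the "sync" is only counted (no I/O is performed), and the names `commit_loop` and `run` are hypothetical. It shows many producers feeding one committer that performs one simulated sync per batch.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;
use std::time::{Duration, Instant};

/// Drains `rx` in batches. Returns (records committed, simulated syncs).
fn commit_loop(rx: Receiver<u64>, max_batch: usize, max_delay: Duration) -> (usize, usize) {
    let (mut records, mut syncs) = (0usize, 0usize);
    // Block for the first record of each batch; exit once all producers are gone.
    while let Ok(first) = rx.recv() {
        let mut batch = vec![first];
        let deadline = Instant::now() + max_delay;
        while batch.len() < max_batch {
            let remaining = deadline.saturating_duration_since(Instant::now());
            match rx.recv_timeout(remaining) {
                Ok(rec) => batch.push(rec),
                // Timeout or disconnect: flush what we have.
                Err(_) => break,
            }
        }
        // A real commit thread would issue writev() + fdatasync() here.
        records += batch.len();
        syncs += 1;
    }
    (records, syncs)
}

/// Runs `producers` threads that each send `per_producer` records.
fn run(producers: usize, per_producer: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<u64>(8192);
    let committer =
        thread::spawn(move || commit_loop(rx, 256, Duration::from_millis(10)));
    let handles: Vec<_> = (0..producers)
        .map(|p| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..per_producer {
                    tx.send((p * per_producer + i) as u64).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // The committer exits when the last sender is dropped.
    for h in handles {
        h.join().unwrap();
    }
    committer.join().unwrap()
}

fn main() {
    let (records, syncs) = run(4, 50_000);
    println!("{records} records in {syncs} simulated syncs");
}
```

Every record is committed exactly once, and the number of simulated syncs is at least `records / 256`; how close it gets to that floor depends on how quickly producers refill the channel.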
```
Group Commit Architecture

Signal Writer 1 ---+
Signal Writer 2 ---+---> [bounded MPSC channel] ---> WAL Commit Thread
Signal Writer 3 ---+          capacity: 8192                 |
Signal Writer 4 ---+                                         |
Entity Writer -----+                                         v
Relationship ------+                                    drain batch
                                              (up to max_batch_size or max_delay)
                                                             |
                                                             v
                                                      writev() syscall
                                        (single scatter-gather write for all records)
                                                             |
                                                             v
                                                        fdatasync()
                                                (one fsync for entire batch)
                                                             |
                                                             v
                                                     notify all waiters
                                                     (oneshot channels)
```

**Why single-writer.** WAL writes must be sequential (records are ordered by `seqno`). Parallelizing the WAL would require either: (a) per-thread WAL segments with a merge step (added complexity, slower recovery), or (b) a mutex around the write call (serial anyway, plus lock overhead). A single writer with a channel is simpler, equally fast (the bottleneck is fdatasync, not CPU), and provides natural batching.

### 4.2 Channel and Notification Design

```rust
use crossbeam::channel::{bounded, Sender, Receiver};
use std::time::Duration;

pub struct WalChannel {
    sender: Sender<WalEntry>,
    receiver: Receiver<WalEntry>, // Owned by the WAL commit thread
}

pub struct WalEntry {
    record: WalRecord,
    durability: DurabilityLevel,
    /// Notifier for the caller to await durability confirmation.
    /// None for Eventual durability (caller does not wait).
    /// The payload type (sequence number or error) is reconstructed from
    /// usage in Section 4.3; the names `u64` and `WalError` are assumed.
    notifier: Option<oneshot::Sender<Result<u64, WalError>>>,
}

pub struct GroupCommitConfig {
    /// Maximum records per group commit batch.
    /// Higher values amortize fsync better but increase tail latency
    /// for early arrivals in the batch.
    pub max_batch_size: usize, // default: 256

    /// Maximum time before a batch is flushed, even if not full.
    /// This bounds the worst-case latency for the first record in a batch.
    pub max_delay: Duration, // default: 10 ms
}
```

| Parameter | Default | Rationale |
|-----------|---------|-----------|
| Channel capacity | 8192 entries | At 150K signals/sec (4 writers) and ~256 signals/batch, the commit thread drains ~600 batches/sec. 8192 provides ~50ms of buffer before backpressure kicks in. |
| `max_batch_size` | 256 | Amortizes one fdatasync (~200us NVMe) across 256 records = ~0.8us/record. |
| `max_delay` | 10 ms | Bounds worst-case write latency. At steady state the batch fills before the delay expires. |

### 4.3 Commit Thread Loop

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// WAL commit thread main loop.
///
/// This is the only thread that writes to the WAL file.
/// It batches records from the MPSC channel and issues a single
/// writev() + fdatasync() per batch.
fn wal_commit_loop(
    receiver: Receiver<WalEntry>,
    wal: &mut WalWriter,
    config: &GroupCommitConfig,
    shutdown_flag: &AtomicBool,
) {
    let mut batch = Vec::with_capacity(config.max_batch_size);
    // Notifier payload type as assumed in Section 4.2.
    let mut notifiers: Vec<oneshot::Sender<Result<u64, WalError>>> =
        Vec::with_capacity(config.max_batch_size);

    loop {
        // Block until at least one entry arrives (or timeout for shutdown check).
        match receiver.recv_timeout(Duration::from_millis(100)) {
            Ok(entry) => {
                if let Some(n) = entry.notifier {
                    notifiers.push(n);
                }
                batch.push(entry.record);

                // Drain up to max_batch_size or until max_delay expires.
                let deadline = Instant::now() + config.max_delay;
                while batch.len() < config.max_batch_size {
                    match receiver.recv_deadline(deadline) {
                        Ok(entry) => {
                            if let Some(n) = entry.notifier {
                                notifiers.push(n);
                            }
                            batch.push(entry.record);
                        }
                        Err(_) => break, // Deadline reached or channel disconnected.
                    }
                }

                // Write the batch: single writev() syscall.
                let seqno_range = wal.write_batch(&batch);

                // Durable: one fdatasync() for the entire batch.
                wal.fdatasync();

                // Notify all waiters that their records are durable.
                for notifier in notifiers.drain(..) {
                    let _ = notifier.send(Ok(seqno_range.start));
                }
                batch.clear();
            }
            Err(err) => {
                // Exit on shutdown, and also when every sender has been dropped;
                // otherwise a disconnected channel would make this loop spin.
                if shutdown_flag.load(Ordering::Acquire) || err.is_disconnected() {
                    // Drain remaining entries before exiting.
                    while let Ok(entry) = receiver.try_recv() {
                        if let Some(n) = entry.notifier {
                            notifiers.push(n);
                        }
                        batch.push(entry.record);
                    }
                    if !batch.is_empty() {
                        let seqno_range = wal.write_batch(&batch);
                        wal.fdatasync();
                        for notifier in notifiers.drain(..) {
                            let _ = notifier.send(Ok(seqno_range.start));
                        }
                    }
                    break;
                }
            }
        }
    }
}
```

### 4.4 Latency-Throughput Tradeoff

```
Group Commit Latency vs Throughput

Throughput (signals/sec)
         |
200K ----+                          ***************
         |                     *****
150K ----+                 ****
         |              ***
100K ----+           ***
         |         **
 50K ----+       **   <-- Batched (max_batch=256, max_delay=10ms)
         |     **
   0 ----+--*---------+----------+----------+---
         0           0.2        1.0        5.0       10.0
                      Write Latency p50 (ms)

Immediate durability:  ~200us per write (fdatasync each), ~5K writes/sec
Batched (256, 10ms):   ~50us per write (amortized), ~150K writes/sec (4 writers)
Eventual:              ~1us per write (no fsync wait), ~500K writes/sec (4 writers)
```

### 4.5 Benchmark Targets

| Metric | Target | Conditions |
|--------|--------|------------|
| Single-writer throughput (Immediate) | > 5,000 signals/sec | 1 writer, fsync per write, NVMe SSD |
| Single-writer throughput (Batched) | > 50,000 signals/sec | 1 writer, batch 256 / 10ms |
| Multi-writer throughput (Batched, 4 writers) | > 150,000 signals/sec | 4 writers, batch 256 / 10ms |
| Write latency p50 (Batched) | < 100 us | Under concurrent query load |
| Write latency p99 (Batched) | < 500 us | Under concurrent query load |
| Write latency p999 (Batched) | < 2 ms | Includes worst-case batch fill time |
| fdatasync amortization ratio | > 100:1 | Records per fdatasync at sustained load |

---

## 5. Read-Write Isolation

### 5.1 Core Guarantee

Writers never block readers. Readers never block writers.
This is achieved through a mechanism matched to each data structure being accessed:

| Data Structure | Write Mechanism | Read Mechanism | Isolation Strategy |
|----------------|----------------|----------------|-------------------|
| Decay scores (HotSignalState) | Atomic CAS loop | Atomic load + lazy decay | Lock-free atomics |
| Windowed counters (WarmSignalState) | Atomic fetch_add | Atomic load + sum | Lock-free atomics |
| Entity metadata | Allocate new struct, swap pointer | Read current pointer | ArcSwap (wait-free reads) |
| Relationship graph edges | Append-only list with atomic length | Read up to atomic length | Append-only + atomic fence |
| HNSW vector index | Per-node locks (short-held) | Lock-free graph traversal | Fine-grained locking |
| Tantivy text index | Immutable segments + mutable buffer | Read committed segments | Segment immutability |
| Dedup bloom filter | Atomic bit-set | Atomic bit-test | Lock-free bit operations |

### 5.2 Signal State: Pure Atomics

A signal writer:

1. Loads `last_update_ns` (Acquire).
2. Computes `dt`.
3. CAS-updates each `decay_scores[i]` (AcqRel/Acquire).
4. Stores `last_update_ns` (Release) only if the event is newer.
5. `fetch_add(1)` on the current minute bucket (Relaxed).
6. `fetch_add(1)` on `all_time_count` (Relaxed).

A ranking reader:

1. Loads `last_update_ns` (Acquire).
2. Loads `decay_scores[i]` (Acquire).
3. Computes `score * exp(-lambda * dt)`.

The Acquire on `last_update_ns` in step 1 of the reader synchronizes with the Release in step 4 of the writer. This guarantees: if the reader sees timestamp T, it sees all score updates that were stored before T was stored.

```
Memory Ordering Relationships

Signal Writer Thread              Ranking Query Thread
=====================             =====================

1. load last_update_ns            1. load last_update_ns
   |  (Acquire)                      |  (Acquire)
   v                                 v
2. CAS decay_scores[0]            2. load decay_scores[0]
   |  (AcqRel)                       |  (Acquire)
   v                                 v
3. CAS decay_scores[1]            3. compute: score * exp(-lambda*dt)
   |  (AcqRel)                       |
   v                                 v
4. CAS decay_scores[2]            4. return score
   |  (AcqRel)
   v
5. store last_update_ns
   |  (Release)
   v
[synchronization point]
   |
   +-- The Release in step 5 pairs with the Acquire in the reader's step 1.
       If the reader sees the timestamp stored in step 5, it is guaranteed
       to see all score updates from steps 2-4.

Note: the reader may also see the OLD timestamp (before step 5). In that
case it applies decay from the old timestamp to whichever scores it
observes, which may already include updates from steps 2-4; the result is
stale for that one read, never torn. There is no window where the reader
sees a new timestamp with old scores.
```

**Stale read analysis.** The maximum staleness of a decay score read is the time between a writer's CAS (step 3) and the reader's load (step 2). In practice this is nanoseconds. The ranking impact is zero -- the difference between `exp(-lambda * dt)` and `exp(-lambda * (dt + 10ns))` is less than `1e-15` relative error.

### 5.3 Entity Metadata: Copy-on-Write with ArcSwap

Entity metadata (title, format, tags, embedding pointer) changes infrequently but is read on every query. Updates use copy-on-write:

```rust
use arc_swap::{ArcSwap, Guard};
use dashmap::DashMap;
use std::sync::Arc;

/// Entity metadata, immutable once published.
/// Readers get a snapshot via ArcSwap::load().
/// Writers replace the entire struct atomically.
pub struct EntityMetadataStore {
    entries: DashMap<EntityId, ArcSwap<EntityMetadata>>,
}

impl EntityMetadataStore {
    /// Read metadata for an entity. The ArcSwap load is lock-free; the
    /// DashMap lookup around it briefly takes a shard read lock.
    /// Returns a guard that keeps the snapshot alive for the query's duration.
    pub fn get(&self, id: &EntityId) -> Option<Guard<Arc<EntityMetadata>>> {
        self.entries.get(id).map(|entry| entry.value().load())
    }

    /// Update metadata. Allocates a new struct, atomically swaps the pointer.
    /// Old struct is dropped when all readers release their Arc.
    /// Does nothing if the entity has no entry yet.
    pub fn update(&self, id: &EntityId, new_meta: EntityMetadata) {
        if let Some(entry) = self.entries.get(id) {
            entry.value().store(Arc::new(new_meta));
        }
    }
}
```

**Why `ArcSwap` and not `RwLock`.** `ArcSwap::load()` is lock-free, and on its fast path it does not touch the shared reference count. `RwLock::read()` involves at least one atomic increment (reader count) and one atomic decrement on drop, plus potential contention on the writer side. For a read-heavy workload (10K reads/sec, <1 write/sec per entity), `ArcSwap` eliminates all reader-side contention.

### 5.4 Relationship Graph: Append-Only Adjacency Lists

Relationship edges are modeled as append-only adjacency lists. New edges are appended; readers iterate from the beginning up to the current atomic length.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicU32, Ordering};

pub struct AdjacencyList {
    /// Edge slots, pre-allocated to capacity. UnsafeCell permits writes via &self.
    edges: Box<[UnsafeCell<MaybeUninit<RelationshipEdge>>]>,
    /// Next slot to hand out to a writer. Never exceeds capacity.
    reserved: AtomicU32,
    /// Number of fully written edges. Readers only look below this.
    len: AtomicU32,
    capacity: u32,
}

// SAFETY: each slot is written by exactly one thread before it is published
// through `len`, and is never written again afterwards.
unsafe impl Sync for AdjacencyList {}

impl AdjacencyList {
    /// Append an edge. The slot claim never blocks; publication waits only
    /// for earlier in-flight pushes so that `len` never exposes an unwritten slot.
    pub fn push(&self, edge: RelationshipEdge) -> Result<(), CapacityExceeded> {
        // Claim an index with a CAS loop so `reserved` never passes capacity.
        let idx = self
            .reserved
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                (n < self.capacity).then_some(n + 1)
            })
            .map_err(|_| CapacityExceeded)?;
        // SAFETY: idx < capacity, and the CAS hands each index to one thread only.
        // Readers ignore this slot until `len` is advanced past it.
        unsafe {
            (*self.edges[idx as usize].get()).write(edge);
        }
        // Publish in index order; Release makes the slot contents visible.
        while self
            .len
            .compare_exchange_weak(idx, idx + 1, Ordering::Release, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        Ok(())
    }

    /// Iterate over all published edges. Lock-free.
    pub fn iter(&self) -> impl Iterator<Item = &RelationshipEdge> {
        let len = self.len.load(Ordering::Acquire) as usize;
        // SAFETY: every slot below `len` was written before `len` was advanced.
        self.edges[..len]
            .iter()
            .map(|slot| unsafe { (*slot.get()).assume_init_ref() })
    }
}
```

### 5.5 HNSW Vector Index (USearch)

**Concurrent reads.** Multiple query threads traverse the HNSW graph simultaneously.
Traversal is read-only: greedy nearest-neighbor navigation. No locks are acquired during traversal. **Writes.** New vectors are inserted with per-node-level locking: ``` HNSW Insert Concurrency 1. Assign the new node a random max-layer L. 2. Traverse layers L_max down to L+1 (greedy search, no locks). 3. For each layer l from L down to 0: a. Find the M nearest neighbors at layer l (read-only). b. Acquire write locks on the M neighbors. c. Add bidirectional edges. d. Prune if any neighbor exceeds max_connections. e. Release all locks for layer l. 4. If L == L_max, atomically update the entry point. ``` The lock granularity is per-node, held for nanoseconds. Two concurrent inserts contend only if they modify the same node's neighbor list. **Deletions.** Lazy tombstoning: mark the node as deleted (atomic flag), skip during search. Background compaction rebuilds affected graph regions. ### 5.6 Inverted Index (Tantivy) Tantivy's segment-based architecture provides natural concurrency through immutability: ``` Tantivy Segment Concurrency +------------------+ | Mutable Buffer | <-- Index maintenance thread | (in-memory) | adds docs via IndexWriter +--------+---------+ (serialized, not hot path) | flush (background, 100ms cadence) | v +------------------+------------------+ | Segment A | Segment B | <-- Immutable on disk | (committed) | (committed) | Query threads read +------------------+------------------+ all committed segments | Segment C | Segment D | via Searcher snapshot | (committed) | (merging...) | (lock-free) +------------------+------------------+ ``` Key properties: 1. **Committed segments are immutable.** Query threads read without synchronization. 2. **The mutable buffer is serialized.** Tantivy's `IndexWriter` holds an internal lock. Acceptable because document indexing runs on the index maintenance thread, not the signal write hot path. 3. **Segment merges are invisible to readers.** A merge creates a new segment and atomically swaps the segment list. 
   Readers that started before the swap continue reading old segments.
4. **Searcher snapshots.** `IndexReader::searcher()` returns a `Searcher` with a point-in-time snapshot of the segment list.

---

## 6. Deadlock Prevention

### 6.1 Lock Ordering Hierarchy

tidalDB uses very few locks, but where locks exist, they follow a strict ordering to prevent deadlock.

```
Lock Ordering Hierarchy (acquire top-to-bottom, never bottom-to-top)

  Level 0 (highest): Schema Lock
                     (RwLock, acquired for DDL operations)
        |
  Level 1:           WAL Commit
                     (implicit: channel serialization, not a lock)
        |
  Level 2:           Entity Metadata
                     (ArcSwap, not a lock -- listed for ordering clarity)
        |
  Level 3:           HNSW Node Locks
                     (parking_lot::RwLock per node, short-held)
        |
  Level 4:           Tantivy IndexWriter
                     (Tantivy-internal mutex, serializes doc adds)
        |
  Level 5:           Materializer Coordination
                     (Mutex, protects rollup schedule state)
        |
  Level 6 (lowest):  Storage Backend Transactions
                     (redb write transactions, fjall batch writes)
```

**Rule: a thread that holds a lock at level N may only acquire locks at a numerically greater level (N+1, N+2, ...), that is, further down the hierarchy.** Acquiring a lock at the same level or at a numerically smaller level while holding a lock at level N is a deadlock risk and is prohibited.

### 6.2 Resource Acquisition Order Proof

**Claim: tidalDB is deadlock-free.**

**Proof.** A deadlock requires a cycle in the lock wait graph. In the simplest case, thread A holds lock L1 and waits for L2, while thread B holds L2 and waits for L1.

The lock ordering hierarchy assigns every lock a level, and every thread acquires locks in strictly increasing level order. If thread A holds a lock at level N and attempts to acquire a lock at level M, then M > N (by the rule). If thread B holds that lock at level M and attempts to acquire another lock, it must be at a level greater than M, and therefore greater than N. B never waits for a lock at level N. The same argument extends to longer chains: following the wait edges around any cycle would produce a strictly increasing sequence of levels that returns to its starting level, which is impossible. No cycle can form. QED.

### 6.3 Why Most Operations Need No Locks

| Operation | Lock Required? | Why Not |
|-----------|---------------|---------|
| Signal write (hot-tier update) | No | Atomic CAS loops, no locks. |
| Signal write (WAL enqueue) | No | Channel send, not a lock. |
| Ranking query (decay score read) | No | Atomic loads, no locks. |
| Ranking query (entity metadata) | No | ArcSwap load (wait-free). |
| Ranking query (text search) | No | Tantivy Searcher snapshot (lock-free). |
| Ranking query (vector search) | No | HNSW traversal (read-only, no locks). |
| Materializer (bucket rotation) | No | Atomic stores for bucket pointers. |

The only operations that acquire locks:

| Operation | Lock Level | Duration | Frequency |
|-----------|-----------|----------|-----------|
| Schema change (DEFINE SIGNAL, etc.) | 0 | Milliseconds | Rare (deployment-time) |
| HNSW vector insert | 3 | Nanoseconds per node | Per new entity |
| Tantivy document add | 4 | Microseconds per batch | Per new/updated entity |
| Materializer schedule update | 5 | Microseconds | Once per minute |
| Rollup persistence | 6 | Milliseconds | Once per hour |

### 6.4 Timeout on Lock Acquisitions

All lock acquisitions use `parking_lot`'s timed acquisition methods (`try_lock_for` on `Mutex`, `try_read_for`/`try_write_for` on `RwLock`):

```rust
use parking_lot::{RwLock, RwLockWriteGuard};
use std::time::Duration;

const LOCK_TIMEOUT: Duration = Duration::from_secs(5);

// `T` stands in for the schema type guarded by the lock.
fn acquire_schema_lock<T>(
    lock: &RwLock<T>,
) -> Result<RwLockWriteGuard<'_, T>, TidalError> {
    // `try_write_for` returns `None` if the lock is not acquired in time.
    lock.try_write_for(LOCK_TIMEOUT)
        .ok_or_else(|| TidalError::Internal(
            "Schema lock acquisition timed out after 5s. Possible deadlock.".into()
        ))
}
```

A lock timeout is treated as an internal error, logged loudly, and triggers graceful degradation. It does not crash the process.
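The ordering rule from Section 6.1 can also be checked mechanically alongside the timeout. The following is a minimal sketch, assuming a thread-local record of held levels; `LockLevel`, `LevelGuard`, and `enter` are hypothetical names introduced here for illustration and are not part of tidalDB's API. A real integration would call `enter` immediately before each lock acquisition and keep the guard alive for as long as the lock is held.

```rust
use std::cell::RefCell;

/// Levels that correspond to real locks in the Section 6.1 hierarchy.
/// (Hypothetical enum; levels 1 and 2 are omitted because they are not locks.)
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LockLevel {
    Schema = 0,
    HnswNode = 3,
    TantivyWriter = 4,
    Materializer = 5,
    StorageTxn = 6,
}

thread_local! {
    /// Levels currently held by this thread, kept in strictly increasing order.
    static HELD: RefCell<Vec<LockLevel>> = RefCell::new(Vec::new());
}

/// Removes its level from the thread-local record when dropped.
pub struct LevelGuard {
    level: LockLevel,
}

/// Records the intent to acquire a lock at `level`.
/// Fails if this thread already holds a lock at the same or a later level.
pub fn enter(level: LockLevel) -> Result<LevelGuard, String> {
    HELD.with(|held| {
        let mut held = held.borrow_mut();
        // The last element is the greatest held level because order is enforced on entry.
        if let Some(&top) = held.last() {
            if level <= top {
                return Err(format!(
                    "lock order violation: {:?} requested while holding {:?}",
                    level, top
                ));
            }
        }
        held.push(level);
        Ok(LevelGuard { level })
    })
}

impl Drop for LevelGuard {
    fn drop(&mut self) {
        HELD.with(|held| {
            let mut held = held.borrow_mut();
            // Guards may be dropped out of order; removal keeps the vector sorted.
            if let Some(pos) = held.iter().rposition(|l| *l == self.level) {
                held.remove(pos);
            }
        });
    }
}
```

Because the record is thread-local, the check adds no cross-thread synchronization and could be compiled out of release builds in the same way as the deadlock detector.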
### 6.5 Deadlock Detection in Debug Builds

```rust
use std::time::Duration;

// Requires parking_lot's `deadlock_detection` Cargo feature.
#[cfg(debug_assertions)]
fn enable_deadlock_detection() {
    std::thread::spawn(move || {
        loop {
            std::thread::sleep(Duration::from_secs(10));
            let deadlocks = parking_lot::deadlock::check_deadlock();
            if !deadlocks.is_empty() {
                for (i, threads) in deadlocks.iter().enumerate() {
                    eprintln!("Deadlock #{i}");
                    for t in threads {
                        eprintln!("  Thread {:?}: {:?}", t.thread_id(), t.backtrace());
                    }
                }
                panic!("Deadlock detected in debug build");
            }
        }
    });
}
```

This is disabled in release builds (zero runtime cost).

---

## 7. Graceful Degradation Ladder

When the system is under pressure, tidalDB sheds load in a controlled, prioritized manner. The priority order is absolute:

```
Priority (highest to lowest):

  1. SIGNAL DURABILITY      -- Never lose an acknowledged signal event.
  2. QUERY LATENCY          -- Return results within timeout, even if approximate.
  3. MATERIALIZER FRESHNESS -- Tolerate stale aggregates before stale queries.
  4. INDEX FRESHNESS        -- Tolerate stale text/vector indexes before stale aggregates.
```

### 7.1 Degradation State Machine

```
Degradation Ladder

  +----------+
  |  NORMAL  |
  +-----+----+
        |
        |  WAL queue > 50% capacity
        |  OR query p99 > 40ms
        |  OR heap usage > 80% of memory_budget
        v
  +-------+--------+
  | ELEVATED_LOAD  |
  +-------+--------+
        |
        |  WAL queue > 80% capacity
        |  OR query p99 > 80ms
        |  OR heap usage > 90% of memory_budget
        v
  +-------+-------+
  |   DEGRADED    |
  +-------+-------+
        |
        |  WAL queue full (backpressure active)
        |  OR query p99 > 200ms
        |  OR heap usage > 95% of memory_budget
        v
  +-------+-------+
  |   CRITICAL    |
  +-------+-------+

  Recovery: state transitions DOWN require all triggering conditions
  to be below 50% of their threshold for 10 seconds (hysteresis
  prevents oscillation between states).
```

### 7.2 Degradation Actions by State

| State | Signal Write Path | Query Path | Background Work |
|-------|------------------|------------|-----------------|
| **NORMAL** | Full processing: dedup + WAL + hot + warm + pref + rel + cohort | Full pipeline: candidates=500, all signals, velocity, diversity | All schedules active |
| **ELEVATED_LOAD** | `Eventual` signals skip WAL queue (hot-tier only, WAL catch-up later) | Reduce candidates: 500 -> 300. Skip EWMA velocity. | Delay non-critical rollups. Reduce Tantivy commit frequency (100ms -> 500ms). |
| **DEGRADED** | Skip preference vector update. Skip relationship weight update. WAL + hot tier only. | Reduce candidates: 300 -> 100. Skip diversity enforcement. Primary decay score only. | Suspend hourly rollups. Checkpoint only. HNSW inserts queued. Tantivy indexing suspended. |
| **CRITICAL** | WAL + hot tier only. All derived updates deferred. Block senders if WAL queue full. | Candidates: 100 -> 50. Hot tier cache only. Skip text search. 10ms hard timeout. | Checkpoint only. All other work suspended. |

### 7.3 Query Timeout and Partial Results

Every query has a timeout budget. The query executor tracks elapsed time at each stage and can return partial results if the budget is exhausted.

```rust
use std::time::Duration;

pub struct QueryBudget {
    pub total: Duration,         // Default: 50ms
    pub retrieval: Duration,     // Default: 20ms
    pub scoring: Duration,       // Default: 15ms
    pub diversity: Duration,     // Default: 10ms
    pub serialization: Duration, // Default: 5ms
}

pub struct QueryMetadata {
    pub completeness: Completeness,
    // `Stage` is a placeholder for the stage identifier type.
    pub stages_completed: Vec<Stage>,
    pub execution_time: Duration,
    pub system_state: DegradationState,
}

pub enum Completeness {
    Full,
    Partial { reason: &'static str },
}
```

If a stage exceeds its budget:

1. **Retrieval timeout.** Return candidates found so far.
2. **Scoring timeout.** Return candidates scored so far, sorted by partial score.
3. **Diversity timeout.** Return scored candidates without diversity enforcement.
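The scoring-timeout rule above can be sketched as a loop that checks a deadline before each candidate. This is an illustrative sketch only: `StageClock` and `score_with_budget` are hypothetical names, candidates are reduced to bare `u64` ids, and `Completeness` is redeclared locally so the example compiles on its own.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum Completeness {
    Full,
    Partial { reason: &'static str },
}

/// Tracks how much of a stage budget is left (hypothetical helper).
pub struct StageClock {
    started: Instant,
    budget: Duration,
}

impl StageClock {
    pub fn new(budget: Duration) -> Self {
        Self { started: Instant::now(), budget }
    }

    /// Time left in the budget, clamped at zero.
    pub fn remaining(&self) -> Duration {
        self.budget.saturating_sub(self.started.elapsed())
    }

    pub fn exhausted(&self) -> bool {
        self.remaining().is_zero()
    }
}

/// Scores candidates until the budget runs out, then returns whatever
/// has been scored so far, sorted by descending score.
pub fn score_with_budget<F>(
    candidates: &[u64],
    clock: &StageClock,
    mut score: F,
) -> (Vec<(u64, f64)>, Completeness)
where
    F: FnMut(u64) -> f64,
{
    let mut scored = Vec::with_capacity(candidates.len());
    let mut completeness = Completeness::Full;
    for &id in candidates {
        if clock.exhausted() {
            completeness = Completeness::Partial { reason: "scoring timeout" };
            break;
        }
        scored.push((id, score(id)));
    }
    // total_cmp gives a total order over f64, so sorting cannot panic on NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    (scored, completeness)
}
```

Checking the clock once per candidate keeps the overshoot bounded by the cost of scoring a single candidate. A real executor might check less often if reading the clock turns out to be measurable.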
### 7.4 Signal Backpressure

When the WAL commit thread cannot keep up, the bounded channel provides natural backpressure:

1. Channel fills to capacity (8192 entries).
2. Signal writer threads block on `sender.send()`.
3. `db.signal()` blocks until the WAL commit thread drains space.
4. The application sees increased signal write latency.

This is the correct behavior: signal durability is the highest priority. Blocking the producer is better than dropping signals.

For `Eventual` durability signals in ELEVATED_LOAD and above: the signal is written directly to the hot tier (atomics, non-blocking) and a WAL record is enqueued without a notifier. If lost due to crash, the hot-tier update is also lost (hot tier is rebuilt from WAL on recovery). Acceptable for `Eventual` signals by definition.

---

## 8. Background Task Scheduling

### 8.1 Task Priority System

Background tasks compete for CPU and I/O bandwidth. A priority scheduler ensures that time-sensitive tasks run before best-effort work.

```
Background Task Priorities

  Priority 0 (highest): Checkpoint
    - Must complete within max_checkpoint_staleness (2 min)
    - Bounds crash recovery time
    - Runs every 30-60 seconds

  Priority 1: Bucket Rotation
    - Must complete within 1 minute (minute buckets)
    - Windowed aggregation accuracy depends on timely rotation
    - Runs every 60 seconds

  Priority 2: HNSW Insertions
    - New entities become ANN-discoverable
    - Latency: minutes acceptable, hours not
    - Batched from insert queue

  Priority 3: Tantivy Commit
    - New entities become text-searchable
    - Latency: 100ms-500ms depending on degradation state
    - Batched from document queue

  Priority 4: Hourly Rollups
    - Materializes windowed aggregates for 24h+ windows
    - Staleness up to 5 minutes tolerated (queries fall back to warm tier)
    - Runs every hour

  Priority 5 (lowest): Segment Recomputation / Daily Rollups / Tier Migration
    - Behavioral segment refresh, daily aggregates, hot/cold eviction
    - Staleness up to hours tolerated
    - Runs on schedule or when idle
```

### 8.2 I/O Bandwidth Allocation

Background tasks must not starve the query read path of I/O bandwidth. tidalDB uses a token-bucket rate limiter to bound background write I/O:

```rust
pub struct BackgroundIoConfig {
    /// Maximum sustained background write rate.
    /// Background tasks (compaction, rollups, checkpoint) share this budget.
    /// Default: 100 MB/s (reserves remaining SSD bandwidth for reads + WAL).
    pub max_background_write_rate: u64,

    /// Maximum burst size for background writes.
    /// Allows short bursts (e.g., checkpoint flush) to exceed sustained rate.
    /// Default: 50 MB
    pub burst_budget: u64,

    /// When degradation state >= DEGRADED, reduce background I/O to this fraction.
    /// Default: 0.25 (25% of normal budget)
    pub degraded_fraction: f64,
}
```

**Allocation under normal operation (100 MB/s budget):**

| Task | Allocation | Rationale |
|------|-----------|-----------|
| Checkpoint flush | 40 MB/s peak, burst | Checkpoint is bursty (flush dirty entities), then idle for 30s. |
| Tantivy segment merge | 20 MB/s sustained | Segment merges are I/O-bound. Throttling prevents read latency spikes. |
| fjall compaction | 20 MB/s sustained | LSM compaction is the largest sustained background write. |
| Rollup persistence | 10 MB/s burst | Hourly rollups write in a burst, then idle. |
| HNSW delta journal | 10 MB/s burst | Incremental persistence writes are small and periodic. |

### 8.3 Compaction Throttling Under Query Load

fjall and Tantivy both perform background compaction/merging that competes with query reads for SSD bandwidth. tidalDB monitors query latency and throttles compaction when queries are affected:

```rust
use std::time::Duration;

/// Called by the compaction scheduler before starting a compaction job.
fn should_throttle_compaction(metrics: &SystemMetrics) -> ThrottleDecision {
    let query_p99 = metrics.query_latency_p99();
    let degradation = metrics.degradation_state();

    match degradation {
        DegradationState::Normal if query_p99 < Duration::from_millis(30) => {
            ThrottleDecision::Proceed // Plenty of headroom
        }
        DegradationState::Normal => {
            ThrottleDecision::ReduceRate(0.5) // Halve compaction I/O
        }
        DegradationState::ElevatedLoad => {
            ThrottleDecision::ReduceRate(0.25) // Quarter compaction I/O
        }
        DegradationState::Degraded | DegradationState::Critical => {
            ThrottleDecision::Defer // Suspend compaction entirely
        }
    }
}
```

Deferred compaction accumulates a backlog. When the system returns to NORMAL, compaction catches up with increased priority. The fjall LSM tree is configured with FIFO compaction for the event log (no urgency -- old SSTs are simply dropped by TTL) and leveled compaction for the signal ledger (moderate urgency -- read amplification increases with L0 file count).

---

## 9. Shutdown Protocol

Shutdown must be orderly. No acknowledged signal event may be lost. No query may return a partial error mid-execution. All durable state must be flushed to disk.

### 9.1 Shutdown Sequence

```
Shutdown Sequence (ordered, each step completes before the next begins)

  Step 1: STOP ACCEPTING NEW SIGNALS                timeout: 1s
    - Set shutdown_flag = true (Release ordering).
    - Close the signal writer channel sender.
    - Signal writer threads observe the closed channel and exit.
    - Remaining signals in the channel are still drained by step 2.

  Step 2: DRAIN WAL BATCH QUEUE                     timeout: 10s
    - The WAL commit thread continues draining until the channel
      is empty AND all senders are dropped.
    - Final batch: write + fdatasync. All pending signals are durable.
    - WAL commit thread exits.

  Step 3: STOP ACCEPTING NEW QUERIES                timeout: 5s
    - Close the query submission interface.
    - In-flight queries are allowed to complete (grace period).
    - After grace period, in-flight queries receive ShuttingDown error.

  Step 4: FINAL MATERIALIZER CYCLE                  timeout: 30s
    - Trigger a synchronous materializer flush:
      a. Rotate all minute buckets.
      b. Compute and write hourly rollups for the current partial hour.
      c. Checkpoint all hot-tier state to disk.
    - Materializer threads exit.

  Step 5: PERSIST INDEXES                           timeout: 30s
    - Commit Tantivy's mutable buffer (final segment flush).
    - Save HNSW index to disk (USearch save() + delta journal flush).
    - Index maintenance threads exit.

  Step 6: CLOSE STORAGE BACKENDS                    timeout: 10s
    - Flush fjall (force memtable to disk).
    - Close redb (COW B-tree, flush implicit on close).
    - Close WAL (final segment sealed but not deleted).

  Step 7: RELEASE LOCK FILE                         timeout: instant
    - Release the flock on {data_dir}/meta/LOCK.
    - Process may now exit.
```

### 9.2 Shutdown Timeouts and Escalation

| Step | Timeout | Escalation on Timeout |
|------|---------|----------------------|
| 1. Stop signals | 1 second | Force-close channels (drop senders) |
| 2. Drain WAL | 10 seconds | Log warning, proceed (unacked Eventual signals may be lost) |
| 3. Stop queries | 5 seconds | Cancel in-flight queries with ShuttingDown error |
| 4. Materializer | 30 seconds | Skip hourly rollup, do checkpoint only |
| 5. Persist indexes | 30 seconds | Skip HNSW save (rebuilt from entity store on next startup) |
| 6. Close storage | 10 seconds | Abandon (OS will flush on process exit) |
| 7. Release lock | Instant | flock release is instant |

Total worst-case shutdown time: 86 seconds (1 + 10 + 5 + 30 + 30 + 10). Typical shutdown time: 2-5 seconds.

### 9.3 Crash vs. Clean Shutdown

| Aspect | Clean Shutdown | Crash |
|--------|---------------|-------|
| Acknowledged signals | All durable (WAL flushed) | All durable (WAL flushed at write time) |
| Hot-tier state | Checkpointed to disk | Restored from last checkpoint + WAL replay |
| Tantivy index | Committed | Rebuilt from entity store |
| HNSW index | Saved to disk | Rebuilt from entity store embeddings |
| Recovery time | 0 (immediate restart) | ~15 seconds (WAL replay + index rebuild) |
| Data loss | None | None (Immediate/Batched). Up to `max_delay` for Eventual. |

---

## 10. Memory Management

### 10.1 Memory Budget Architecture

tidalDB operates within a configurable memory budget. The budget is divided among competing subsystems, each with a guaranteed minimum and an elastic maximum.

```rust
pub struct MemoryConfig {
    /// Total memory budget for the tidalDB instance.
    /// Default: 4 GB. Must be at least 512 MB.
    pub total_budget: usize,

    /// Fraction of budget allocated to the hot tier (DashMap of HotSignalState).
    /// Default: 0.30 (30%). At 64 bytes/entry, 30% of 4 GB (1,200 MB) = ~18.7M entries.
    pub hot_tier_fraction: f64,

    /// Fraction allocated to warm tier (bucketed counters for active entities).
    /// Default: 0.25 (25%).
    pub warm_tier_fraction: f64,

    /// Fraction allocated to entity metadata (ArcSwap snapshots).
    /// Default: 0.15 (15%).
    pub metadata_fraction: f64,

    /// Fraction allocated to HNSW index (USearch in-memory graph).
    /// Default: 0.15 (15%).
    pub hnsw_fraction: f64,

    /// Fraction allocated to Tantivy (segment caches, searcher buffers).
    /// Default: 0.10 (10%).
    pub tantivy_fraction: f64,

    /// Fraction reserved for operational headroom (WAL buffers, channels,
    /// query execution scratch space, serialization buffers).
    /// Default: 0.05 (5%).
    pub headroom_fraction: f64,
}
```

**Reference allocation at 4 GB total budget:**

```
Memory Budget Allocation (4 GB)

+----------------------------------------------------------+
| Hot Tier:        1,200 MB (30%)                          |
|   ~18.7M HotSignalState entries at 64 bytes each         |
+----------------------------------------------------------+
| Warm Tier:       1,000 MB (25%)                          |
|   ~550K active entities with 6 signal types              |
+----------------------------------------------------------+
| Entity Metadata:   600 MB (15%)                          |
|   ~3M entities at ~200 bytes each                        |
+----------------------------------------------------------+
| HNSW Index:        600 MB (15%)                          |
|   ~190K vectors at 1536D f16 (~3 KB each + graph)        |
+----------------------------------------------------------+
| Tantivy:           400 MB (10%)                          |
|   Segment caches, term dictionaries                      |
+----------------------------------------------------------+
| Headroom:          200 MB (5%)                           |
|   WAL buffers, channels, query scratch                   |
+----------------------------------------------------------+
```

### 10.2 Memory Pressure Detection

tidalDB monitors its own memory usage and triggers defensive actions before the OS OOM killer intervenes.

```rust
use std::sync::atomic::AtomicU64;

pub struct MemoryPressureMonitor {
    /// Current allocated bytes (tracked via a custom allocator wrapper
    /// or periodic jemalloc stats query).
    allocated: AtomicU64,

    /// Total budget from config.
    budget: u64,

    /// Thresholds for defensive actions.
    thresholds: MemoryThresholds,
}

pub struct MemoryThresholds {
    /// Begin evicting cold entities from hot tier.
    /// Default: 80% of budget.
    pub eviction_start: f64,

    /// Aggressively evict: reduce hot tier to minimum, drop warm tier caches.
    /// Default: 90% of budget.
    pub aggressive_eviction: f64,

    /// Emergency: reject new entity insertions, return errors for
    /// operations that would allocate. Signal writes to existing
    /// entities still succeed (they update atomics in-place, no allocation).
    /// Default: 95% of budget.
    pub emergency: f64,
}
```

**Pressure response actions:**

| Pressure Level | Trigger | Actions |
|---------------|---------|---------|
| **Normal** (< 80%) | -- | All allocations permitted. Full hot/warm tier capacity. |
| **Eviction** (80-90%) | `allocated > budget * 0.80` | Evict cold entities from hot tier (LRU by `last_access_ns`). Reduce DashMap shard capacity. Trigger tier migration sweep. Trigger degradation state ELEVATED_LOAD (Section 7.1) if not already there. |
| **Aggressive** (90-95%) | `allocated > budget * 0.90` | Drop warm-tier state for entities with no signals in 24h. Shrink Tantivy cache. Reduce HNSW search cache. Trigger degradation state DEGRADED (Section 7.1) if not already there. |
| **Emergency** (> 95%) | `allocated > budget * 0.95` | Reject entity creation. Reject embedding insertion. Signal writes to existing entities still succeed (in-place atomic updates). Trigger degradation state CRITICAL (Section 7.1). Log loud warnings. |

### 10.3 OOM Prevention Strategy

The goal is to never reach the OS OOM killer. The strategy is defense in depth:

1. **Budget enforcement.** Every subsystem tracks its allocation against its budget fraction. The DashMap capacity for the hot tier is computed from `budget * hot_tier_fraction / 64`. Exceeding capacity triggers eviction, not unbounded growth.
2. **Bounded channels.** The WAL channel (8192 entries), Tantivy document queue, and HNSW insert queue are all bounded. Full channels provide backpressure (blocking senders) rather than unbounded memory growth.
3. **Pre-allocated structures.** `HotSignalState` entries are 64-byte cache-line-aligned structs in a pre-sized DashMap. `WarmSignalState` entries are allocated on insertion and freed on eviction. There are no unbounded `Vec` growths on the hot path.
4. **Periodic jemalloc stats.** Every 5 seconds, the memory monitor queries jemalloc statistics (`jemalloc_ctl::stats::allocated`) and updates the `allocated` counter. This is cheaper than tracking individual allocations (which would add overhead to every `Box::new`).
5. **Graceful degradation integration.** Memory pressure feeds directly into the degradation state machine (Section 7). High memory usage triggers load shedding before OOM.

### 10.4 Per-Query Memory Bound

Each query executor is allocated a bounded scratch buffer for candidate scoring and diversity enforcement:

```rust
const MAX_CANDIDATES_PER_QUERY: usize = 500;
const CANDIDATE_SCORE_SIZE: usize = 48; // EntityId + f64 score + metadata

/// Maximum memory a single query can allocate for its scratch space.
/// 500 candidates * 48 bytes = 24 KB per query.
/// At 8 concurrent queries: 192 KB total. Negligible.
const QUERY_SCRATCH_BUDGET: usize = MAX_CANDIDATES_PER_QUERY * CANDIDATE_SCORE_SIZE;
```

Queries that attempt to exceed this budget (e.g., a user-supplied `LIMIT 100000`) are clamped to `MAX_CANDIDATES_PER_QUERY` with a warning in the response metadata.

---

## 11. Invariants and Property Tests

### 11.1 Concurrency Safety Invariants

**INV-CON-1: No data races.** All shared mutable state is accessed through atomic operations or synchronization primitives. The Rust type system enforces this at compile time via `Send` and `Sync` bounds. Thread Sanitizer (TSAN) must report zero data races in nightly builds.

**INV-CON-2: No lost signal updates.** If `db.signal()` returns `Ok(())`, the signal's effect on all counters (decay scores, windowed counts, all-time count) is reflected in the final state. Under concurrent writes, the CAS retry loop ensures no update is silently dropped. Verified by: loom model checking + stress test total-count assertion.

**INV-CON-3: No torn reads.** A ranking query never observes a partially-updated `HotSignalState`. It sees either the state before a concurrent write or the state after, never a mix. Verified by: loom model checking + stress test (readers never see NaN, negative scores, or inconsistent timestamp/score pairs).
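The retry pattern that INV-CON-2 relies on can be illustrated with a minimal sketch. It is not the actual `HotSignalState` update (which also handles decay and timestamps); `add_weight` is a hypothetical helper that only accumulates an `f64` stored as bits in an `AtomicU64`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Adds `weight` to an f64 stored as raw bits in `score_bits`.
/// Returns the value written by this call.
pub fn add_weight(score_bits: &AtomicU64, weight: f64) -> f64 {
    let mut current = score_bits.load(Ordering::Acquire);
    loop {
        let next = (f64::from_bits(current) + weight).to_bits();
        // compare_exchange_weak may fail spuriously on LL/SC architectures,
        // which is harmless inside a retry loop.
        match score_bits.compare_exchange_weak(
            current,
            next,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return f64::from_bits(next),
            // Another writer won the race: recompute from the value it wrote.
            Err(observed) => current = observed,
        }
    }
}
```

Each failed exchange hands back the value that was actually observed, so the retry recomputes from fresh state instead of overwriting a concurrent writer's contribution. That is why every successful call is reflected in the final total.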
**INV-CON-4: Lock-free query scoring path.** No mutex, RwLock, or other blocking synchronization primitive is acquired during the execution of a ranking query's scoring phase. DashMap shard read locks are the only locks on the full query path, held for nanoseconds. Verified by: code audit + instrumented lock tracking in debug builds.

**INV-CON-5: Bounded CAS retries.** When N writer threads each apply one update to the same entity concurrently, a given CAS loop fails at most N-1 times, because every failure means another writer's CAS succeeded. With 4 writers, worst-case retries per round = 3. Spurious `compare_exchange_weak` failures on LL/SC architectures (Appendix B) are not counted toward this bound. Verified by: instrumented CAS retry counter in stress tests.

**INV-CON-6: Shutdown completeness.** After `db.shutdown()` returns, all acknowledged signals have been flushed to the WAL and all hot-tier state has been checkpointed. Verified by: shutdown test that reopens the database and asserts state equality.

**INV-CON-7: No deadlocks.** The lock ordering hierarchy (Section 6.1) is never violated. No thread holds two locks at the same level simultaneously. Verified by: parking_lot deadlock detection in debug builds + loom tests for atomic protocols.

### 11.2 WAL Durability Invariants

**INV-WAL-1: Acknowledged implies durable.** If a signal writer receives a `SeqNo` from the WAL commit thread's oneshot notifier, the record has been made durable on disk with fdatasync. The commit thread never notifies before fdatasync completes.

**INV-WAL-2: Total ordering.** WAL records are assigned monotonically increasing sequence numbers by the single commit thread. No two records share a sequence number. No sequence number is skipped (except during crash recovery, where partially-written records are discarded).

**INV-WAL-3: Channel backpressure, not data loss.** When the WAL channel is full, senders block. They do not drop records. The bounded channel provides flow control, not data loss.

### 11.3 Memory Ordering Invariants

**INV-MO-1: Acquire/Release pairing.** Every `Release` store has a corresponding `Acquire` load that synchronizes with it. The pairing is documented in Section 3.2.
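As an illustration of what such a pairing guarantees, here is a minimal sketch of the publish pattern. `Published` is a hypothetical type, not the `HotSignalState` layout. The payload accesses are deliberately `Relaxed` to show that visibility comes entirely from the `Release`/`Acquire` pair on `version`; the specification's own policy for state fields is stricter than this sketch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// A payload guarded by a version counter (hypothetical illustration).
pub struct Published {
    payload: AtomicU64,
    version: AtomicU64,
}

impl Published {
    pub const fn new() -> Self {
        Self {
            payload: AtomicU64::new(0),
            version: AtomicU64::new(0),
        }
    }

    /// Writer: store the payload first, then publish the version with Release.
    /// Everything written before the Release store becomes visible to a
    /// thread whose Acquire load observes this version.
    pub fn publish(&self, value: u64, version: u64) {
        self.payload.store(value, Ordering::Relaxed);
        self.version.store(version, Ordering::Release);
    }

    /// Reader: load the version with Acquire, then the payload.
    /// If the returned version is v, the payload is at least as new as
    /// the one written by the publish call that stored v.
    pub fn read(&self) -> (u64, u64) {
        let version = self.version.load(Ordering::Acquire);
        let value = self.payload.load(Ordering::Relaxed);
        (version, value)
    }
}
```

The guarantee is one-directional: a reader that sees version `v` cannot see a payload older than `v`, but with repeated publishes it may see a newer one. Snapshot consistency across several fields needs the CAS-based protocol rather than this pairing alone.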
**INV-MO-2: No Relaxed on synchronization boundaries.** `Relaxed` ordering is used only for pure counters where no other operation depends on seeing the specific increment. State transitions, timestamps, and bucket pointers always use Acquire/Release.

**INV-MO-3: SeqCst absence.** No atomic operation in tidalDB uses `SeqCst`. If a future change requires `SeqCst`, it must be justified with a proof that Acquire/Release is insufficient, reviewed, and documented.

### 11.4 Graceful Degradation Invariants

**INV-GD-1: Priority ordering.** In any degradation state, signal durability is never sacrificed for query latency. If the WAL queue is full, signal writers block (preserving durability) rather than dropping records (improving latency).

**INV-GD-2: Hysteresis.** State transitions downward (e.g., DEGRADED -> ELEVATED_LOAD) require all triggering conditions to be below 50% of their threshold for at least 10 seconds. This prevents oscillation.

**INV-GD-3: Partial results are annotated.** A query that returns under degradation or timeout always includes `Completeness::Partial` in its metadata. The application is never silently given incomplete results.

### 11.5 Loom Model Checking

```rust
#[cfg(loom)]
mod loom_tests {
    use loom::sync::atomic::{AtomicU64, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    /// Verify that concurrent decay score updates never lose an event.
    /// Loom explores all possible interleavings of two writer threads;
    /// the final state is read after both have been joined.
    #[test]
    fn decay_score_no_lost_updates() {
        loom::model(|| {
            let score = Arc::new(AtomicU64::new(0.0f64.to_bits()));
            let last_update = Arc::new(AtomicU64::new(0));

            let s1 = score.clone();
            let t1 = last_update.clone();
            let w1 = thread::spawn(move || {
                cas_update(&s1, &t1, 1.0, 100, 1e-6);
            });

            let s2 = score.clone();
            let t2 = last_update.clone();
            let w2 = thread::spawn(move || {
                cas_update(&s2, &t2, 2.0, 200, 1e-6);
            });

            w1.join().unwrap();
            w2.join().unwrap();

            let final_score = f64::from_bits(score.load(Ordering::Acquire));
            let final_time = last_update.load(Ordering::Acquire);

            // Score must reflect both events: the second event alone
            // contributes 2.0, so anything at or below 2.0 means the
            // first event was lost.
            assert!(final_score > 2.0, "Lost update: score={}", final_score);
            assert_eq!(final_time, 200);
        });
    }
}
```

### 11.6 Stress Tests

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

#[test]
fn stress_concurrent_signal_writes_and_reads() {
    let db = TestDb::open_with_config(Config {
        thread_config: ThreadConfig {
            signal_writers: 4,
            query_executors: 4,
            ..Default::default()
        },
        ..Default::default()
    });

    let entities = create_test_entities(&db, 1000);
    let signals_per_writer = 100_000;
    let expected_total = 4 * signals_per_writer;

    // Spawn 4 writer threads, each writing 100K signals.
    let writers: Vec<_> = (0..4).map(|_| {
        let db = db.clone();
        let entities = entities.clone();
        thread::spawn(move || {
            for i in 0..signals_per_writer {
                let entity = &entities[i % entities.len()];
                db.signal(Signal {
                    kind: "view",
                    item: entity.id(),
                    user: "test_user",
                    weight: 1.0,
                    ..Default::default()
                }).expect("signal write failed");
            }
        })
    }).collect();

    // Spawn 4 reader threads, querying continuously.
    let stop = Arc::new(AtomicBool::new(false));
    let read_errors = Arc::new(AtomicU64::new(0));
    let readers: Vec<_> = (0..4).map(|_| {
        let db = db.clone();
        let stop = stop.clone();
        let errors = read_errors.clone();
        thread::spawn(move || {
            // Acquire pairs with the Release store that stops the readers.
            while !stop.load(Ordering::Acquire) {
                match db.retrieve(RetrieveQuery { /* ... */ }) {
                    Ok(results) => {
                        for r in &results {
                            // NaN compares false against 0.0, so test it explicitly.
                            if r.score.is_nan() || r.score < 0.0 {
                                errors.fetch_add(1, Ordering::Relaxed);
                            }
                        }
                    }
                    Err(_) => {
                        errors.fetch_add(1, Ordering::Relaxed);
                    }
                }
            }
        })
    }).collect();

    for w in writers { w.join().unwrap(); }
    stop.store(true, Ordering::Release);
    for r in readers { r.join().unwrap(); }

    assert_eq!(read_errors.load(Ordering::Relaxed), 0, "Reader saw invalid state");

    let total: u64 = entities.iter()
        .map(|e| db.signal_count(e.id(), "view", Window::AllTime))
        .sum();
    assert_eq!(total, expected_total as u64, "Lost signals");
}
```

### 11.7 Test Matrix

| Test Category | Tool | What It Proves | Frequency |
|--------------|------|----------------|-----------|
| CAS protocol correctness | loom | No lost updates, no torn reads under all interleavings | Pre-commit |
| Counter linearizability | stress test | Total written == total counted | Pre-commit |
| Concurrent read correctness | stress test | Readers never see negative scores, NaN, or invalid state | Pre-commit |
| Crash recovery (concurrent) | crash harness | No lost acked signals, no phantom state | Nightly |
| Performance under contention | criterion | Signal write throughput does not degrade >10% at 4 writers vs 1 | Pre-commit |
| Deadlock absence | parking_lot detection | No cycles in lock wait graph | Debug builds (continuous) |
| Memory ordering soundness | TSAN | No data races detected | Nightly (requires nightly Rust) |
| Memory pressure handling | stress test | OOM never reached; eviction triggers correctly | Nightly |
| Graceful degradation | load test | State transitions occur at documented thresholds | Nightly |

### 11.8 Performance Targets

| Metric | Target | Conditions |
|--------|--------|------------|
| Multi-writer throughput (4 threads, Batched) | > 150,000 signals/sec | 4 writer threads, 100K entities |
| Multi-writer throughput (4 threads, contended) | > 100,000 signals/sec | 4 writer threads, 100 entities (high contention) |
| Write latency p50 (Batched) | < 100 us | Under concurrent query load |
| Write latency p99 (Batched) | < 500 us | Under concurrent query load |
| RETRIEVE p50 | < 30 ms | 8 concurrent queries, normal signal load |
| RETRIEVE p99 | < 50 ms | 8 concurrent queries, normal signal load |
| Decay score read (per entity) | < 20 ns | Under concurrent signal writes |
| Windowed count (1h) | < 300 ns | Under concurrent bucket rotation |
| Shutdown time (typical) | < 5 seconds | Normal operation |
| Crash recovery time | < 15 seconds | WAL replay + index rebuild |

---

## Appendix A: Dependency Inventory

| Crate | Purpose | Concurrency Feature Used |
|-------|---------|------------------------|
| `dashmap` | Concurrent hash maps for entity state lookup | Sharded RwLock internally. Provides concurrent read/write access. |
| `crossbeam` | Channels (MPMC, used here as MPSC) for WAL queue and task distribution | Bounded and unbounded channels. Epoch-based reclamation if needed. |
| `parking_lot` | Faster mutexes/RwLocks where locks are necessary | Compact locks (`Mutex` is one byte, `RwLock` one word). Deadlock detection. `try_lock_for` with timeout. |
| `arc-swap` | Atomic pointer swap for entity metadata COW | `ArcSwap::load()` gives readers a consistent snapshot without taking a lock. |

## Appendix B: Platform-Specific Behavior

| Behavior | x86-64 | ARM64 (aarch64) |
|----------|--------|-----------------|
| Acquire load | Plain `mov` (TSO provides Acquire for free) | `ldar` (load-acquire instruction) |
| Release store | Plain `mov` (TSO provides Release for free) | `stlr` (store-release instruction) |
| CAS | `lock cmpxchg` (hardware lock on cache line) | `ldxr`/`stxr` (load-exclusive/store-exclusive) |
| `compare_exchange_weak` | Same as strong on x86-64 (no spurious failure) | May spuriously fail (LL/SC). Preferred in loops. |
| False sharing granularity | 64-byte cache line | 64-byte cache line (some cores use 128, but 64 is safe) |
| Memory model | TSO (Total Store Order) -- stronger than Acquire/Release | Weakly ordered -- Acquire/Release are essential |

tidalDB's memory ordering choices are correct on both architectures. The Acquire/Release pairs are necessary for ARM64 and free (no overhead) on x86-64.

## Appendix C: Anti-Patterns

| Anti-Pattern | Why It Is Wrong | What To Do Instead |
|-------------|-----------------|-------------------|
| `Arc<Mutex<T>>` around shared signal state | Serializes all readers. At 10K queries/sec, this is a bottleneck. | Atomic fields within `HotSignalState`. CAS loops for compound updates. |
| `Relaxed` on `last_update_ns` | Reader could see new timestamp with old decay score, producing an incorrectly decayed result. | `Release` on writer store, `Acquire` on reader load. |
| `SeqCst` everywhere "to be safe" | Forces global total order, requiring full memory barriers on ARM64. Measurable overhead for no correctness benefit. | Use minimum ordering per Section 3.2. |
| Global lock for HNSW writes | Serializes all vector insertions. | Per-node locks held for nanoseconds. |
| Unbounded channel for WAL queue | OOM if commit thread falls behind. | Bounded channel. Senders block (backpressure). |
| `thread::sleep` for coordination | Polling wakes the thread needlessly and adds up to one sleep interval of latency. | Channel notification or condition variables. |
| Spin locks | Burn CPU, starve other threads. | `parking_lot::Mutex` (spins briefly, then parks). |

---

## References

- [Storage Engine Specification](01-storage-engine.md) -- WAL design, group commit, hybrid backend, checkpoint procedure
- [Signal System Specification](03-signal-system.md) -- HotSignalState layout, atomic access patterns, CAS loops for decay scores
- [Feedback Loop Specification](10-feedback-loop.md) -- 7-step signal ingestion pipeline, atomic multi-update semantics
- [Text Retrieval Specification](06-text-retrieval.md) -- Tantivy segment management, commit cadence
- [Vector Retrieval Specification](07-vector-retrieval.md) -- USearch concurrent access model, lazy deletion
- [thoughts.md](../../thoughts.md) -- Lock-free patterns from Engram (AtomicF32, DashMap), Citadel (AtomicU64, group commit), StemeDB (CAS vote counting)
- [CODING_GUIDELINES.md](../../CODING_GUIDELINES.md) -- Lock-free hot path requirement, cache-line alignment
- Herlihy, M., Shavit, N. "The Art of Multiprocessor Programming." Morgan Kaufmann, 2008
- McKenney, P.E. "Is Parallel Programming Hard, And, If So, What Can You Do About It?" kernel.org, 2023
- Tokio/Loom documentation -- Model-checked concurrency testing for Rust atomics