tidaldb/docs/ops/capacity-planning.md
2026-02-23 22:41:16 -07:00


# Capacity Planning
This document provides RAM, disk, and startup time estimates for tidalDB deployments. Use these tables to provision hardware before going to production.
All estimates assume a single-node deployment with default configuration (30-second checkpoint interval, f16 vector quantization, DashMap-based hot tier).
---
## RAM Capacity
tidalDB is an in-memory-first database. USearch HNSW indexes, the signal ledger hot tier, and Tantivy reader segments all reside in RAM during operation. There is no swap tolerance for USearch -- if the process is swapped, ANN query latency degrades from microseconds to seconds.
| Items | Embedding Dims | USearch RAM | Signal Ledger RAM (10 signals) | Tantivy RAM | Total Estimate |
|------:|---------------:|------------:|-------------------------------:|------------:|---------------:|
| 100K | 128D | ~26 MB | ~110 MB | ~50 MB | ~200 MB |
| 100K | 768D | ~154 MB | ~110 MB | ~50 MB | ~320 MB |
| 100K | 1536D | ~307 MB | ~110 MB | ~50 MB | ~470 MB |
| 1M | 128D | ~256 MB | ~1.1 GB | ~200 MB | ~1.6 GB |
| 1M | 768D | ~1.5 GB | ~1.1 GB | ~200 MB | ~2.8 GB |
| 1M | 1536D | ~3.1 GB | ~1.1 GB | ~200 MB | ~4.4 GB |
| 10M | 128D | ~2.6 GB | ~11 GB | ~500 MB | ~14 GB |
| 10M | 768D | ~15 GB | ~11 GB | ~500 MB | ~27 GB |
| 10M | 1536D | ~31 GB | ~11 GB | ~500 MB | ~43 GB |
### Formulas
**USearch HNSW index:**
```
items * dims * 2 bytes (f16 quantization) * 1.2 (HNSW graph overhead)
```
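The 20% graph overhead accounts for HNSW neighbor lists (M=16 default, two layers). Actual overhead varies with the `M` and `ef_construction` parameters. The USearch column in the table above corresponds to the raw f16 vector size (`items * dims * 2 bytes`), so add the 20% graph overhead on top when sizing from the table.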
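
The formula above can be turned into a small calculator. This is a minimal sketch, not part of tidalDB: the function names are hypothetical, and the defaults (2 bytes per dimension, 1.2 overhead factor) are taken from the text.

```python
def usearch_raw_vector_bytes(items: int, dims: int, bytes_per_dim: int = 2) -> int:
    """Raw vector storage: items * dims * bytes_per_dim (2 bytes for f16)."""
    return items * dims * bytes_per_dim


def usearch_ram_bytes(items: int, dims: int, bytes_per_dim: int = 2,
                      graph_overhead: float = 1.2) -> float:
    """Raw vector storage scaled by the HNSW graph overhead factor."""
    return usearch_raw_vector_bytes(items, dims, bytes_per_dim) * graph_overhead


if __name__ == "__main__":
    # Print raw and overhead-adjusted estimates in decimal megabytes.
    for items, dims in [(100_000, 128), (1_000_000, 768), (10_000_000, 1536)]:
        raw_mb = usearch_raw_vector_bytes(items, dims) / 1e6
        total_mb = usearch_ram_bytes(items, dims) / 1e6
        print(f"{items:>10,} x {dims:>4}D  raw={raw_mb:,.0f} MB  with graph={total_mb:,.0f} MB")
```

Keeping the raw size and the overhead factor separate makes it easy to substitute a measured overhead if your `M` or `ef_construction` settings differ from the defaults.
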
**Signal ledger hot tier:**
```
items * signal_count * ~1,088 bytes/entry
```
Each `(entity_id, signal_type_id)` entry in the DashMap holds the running decay score, windowed counters (BucketedCounter with minute and hour buckets), velocity state, and the DashMap per-shard overhead. The 1,088 bytes/entry figure was measured in the m7p3 scale benchmarks.
The signal ledger has a memory budget of 5M entries (`DEFAULT_MAX_SIGNAL_ENTRIES`). When exceeded, the checkpoint thread evicts cold entries (oldest `last_update` timestamp). If your workload has more than 5M active `(entity, signal_type)` pairs, cold entries will be served from fjall checkpoints (slower, but correct).
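
As a rough sketch, the hot-tier formula and the entry budget can be combined as follows. The function names are hypothetical, and the code assumes the hot tier settles at no more than `DEFAULT_MAX_SIGNAL_ENTRIES` resident entries once eviction has run; between checkpoints the real figure may briefly be higher.

```python
DEFAULT_MAX_SIGNAL_ENTRIES = 5_000_000  # entry budget quoted in the text
BYTES_PER_ENTRY = 1_088                 # measured per-entry figure quoted in the text


def signal_entries(items: int, signal_count: int) -> int:
    """Number of (entity_id, signal_type_id) pairs if every item has every signal."""
    return items * signal_count


def signal_ledger_ram_bytes(items: int, signal_count: int,
                            max_entries: int = DEFAULT_MAX_SIGNAL_ENTRIES,
                            bytes_per_entry: int = BYTES_PER_ENTRY) -> int:
    """Hot-tier RAM, assuming eviction holds resident entries at max_entries."""
    resident = min(signal_entries(items, signal_count), max_entries)
    return resident * bytes_per_entry


if __name__ == "__main__":
    for items in (100_000, 1_000_000, 10_000_000):
        gb = signal_ledger_ram_bytes(items, signal_count=10) / 1e9
        print(f"{items:>10,} items, 10 signals: ~{gb:.2f} GB hot tier")
```

With 10 signal types on every item, the budget is reached at 500K items, so larger deployments are bounded by the budget rather than by item count under this assumption.
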
**Tantivy text index:**
Tantivy's RAM usage depends on the number of indexed documents, average document length, and the number of open reader segments. The estimates above assume short metadata fields (title + description, ~200 bytes average). Long-form content indexing will increase RAM proportionally.
### Notes
- Signal ledger RAM is for the in-memory hot tier only. The WAL and fjall checkpoints add disk usage, not RAM.
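- The "10 signals" column assumes 10 distinct signal types per entity. Scale linearly for more signal types. The column values work out to roughly `items * 1,088 bytes`; if every entity actively holds all 10 signal types, the formula above gives about 10x more, until the 5M-entry budget triggers eviction (5M * 1,088 bytes is about 5.4 GB).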
- USearch RAM is the dominant cost at high dimensionality. If you use 1536D embeddings (e.g., OpenAI text-embedding-3-large), plan for USearch to consume 70%+ of total RAM at 10M items.
---
## Disk Capacity
Disk usage comes from three sources: fjall LSM-tree storage (metadata, relationships, signal checkpoints), WAL segments (append-only signal event log), and Tantivy/USearch index files.
| Items | Metadata Size | Signal Events/Day | Disk/Day (WAL) | Fjall (90 days) | Total (90 days) |
|------:|:----------------|------------------:|----------------:|----------------:|----------------:|
| 100K | small (256B avg) | 50K | ~2 MB | ~1 GB | ~1.2 GB |
| 1M | small | 500K | ~20 MB | ~10 GB | ~11.8 GB |
| 10M | small | 5M | ~200 MB | ~100 GB | ~118 GB |
### Formulas
**WAL daily growth:**
```
signal_events_per_day * ~40 bytes/event
```
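Each WAL entry contains: 4-byte magic, 8-byte sequence, 1-byte event type, 8-byte entity ID, 2-byte signal type ID, 8-byte timestamp, 8-byte weight (f64), 32-byte BLAKE3 checksum. These fields total 39 bytes before the checksum, which matches the ~40 bytes/event figure, and 71 bytes with it; if the checksum is stored per entry, budget closer to 71 bytes/event. WAL segments are compacted after each successful checkpoint (every 30 seconds), so the daily figure measures write volume, while WAL disk usage at any moment is only the uncompacted tail, not cumulative growth.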
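
The field sizes listed above can be checked with Python's `struct` module. This is only an illustration: the packed little-endian layout, the field order, and the unsigned integer types are assumptions, not tidalDB's actual on-disk format.

```python
import struct

# Hypothetical packed layout, fields in the order the text lists them:
# magic(4s) sequence(Q) event_type(B) entity_id(Q) signal_type_id(H)
# timestamp(Q) weight(d, f64) checksum(32s)
WAL_ENTRY_FORMAT = "<4sQBQHQd32s"
WAL_ENTRY_NO_CHECKSUM_FORMAT = "<4sQBQHQd"

entry_size = struct.calcsize(WAL_ENTRY_FORMAT)                # includes checksum
payload_size = struct.calcsize(WAL_ENTRY_NO_CHECKSUM_FORMAT)  # excludes checksum


def wal_daily_bytes(events_per_day: int, bytes_per_event: int = 40) -> int:
    """Daily WAL write volume: signal_events_per_day * bytes_per_event."""
    return events_per_day * bytes_per_event


if __name__ == "__main__":
    print(f"entry with checksum: {entry_size} bytes, without: {payload_size} bytes")
    print(f"50K events/day at 40 B: {wal_daily_bytes(50_000) / 1e6:.1f} MB")
```
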
**Fjall storage:**
```
items * metadata_avg_bytes * 1.5 (LSM write amplification)
```
The 1.5x amplification factor accounts for LSM-tree space amplification (multiple sorted runs before compaction merges them). Actual amplification depends on the compaction strategy and write pattern. Signal checkpoints are also stored in fjall -- add ~100 bytes per active `(entity, signal_type)` pair for the serialized checkpoint data.
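
A minimal sketch of the fjall estimate, combining the metadata formula with the per-pair checkpoint cost. The function name is hypothetical, and the code assumes the 1.5x factor applies to metadata only; the text does not say whether checkpoint data is amplified as well.

```python
def fjall_bytes(items: int, metadata_avg_bytes: int, active_pairs: int = 0,
                amplification: float = 1.5,
                checkpoint_bytes_per_pair: int = 100) -> float:
    """Point-in-time fjall disk estimate: amplified metadata plus signal checkpoints."""
    metadata = items * metadata_avg_bytes * amplification
    checkpoints = active_pairs * checkpoint_bytes_per_pair
    return metadata + checkpoints


if __name__ == "__main__":
    # 100K items with 256-byte metadata and 1M active (entity, signal_type) pairs.
    estimate = fjall_bytes(100_000, 256, active_pairs=1_000_000)
    print(f"~{estimate / 1e6:.1f} MB")
```

This evaluates the formula at a single point in time. The 90-day column in the table is larger, and the text does not break down what else it includes, so treat the output as a lower bound.
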
**Tantivy and USearch on disk:**
- Tantivy: roughly 1.5-2x the raw text size after indexing (inverted index + postings + term dictionary).
- USearch: saved index files are approximately the same size as the in-memory representation (items * dims * 2 bytes + graph metadata).
### WAL Compaction
WAL segments older than the last successful checkpoint are automatically deleted by the checkpoint thread (every 30 seconds). Under normal operation, WAL disk usage stays bounded at roughly `signal_rate * 40 bytes * 30 seconds`, where `signal_rate` is in events per second. Monitor `tidaldb_wal_lag_bytes`: if it grows without bound, checkpointing may be failing (check `tidaldb_checkpoint_failures_total`).
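
The bound above can be sketched as a helper that also flags a suspicious lag value. The function names and the 10x alert factor are hypothetical; pick a threshold that suits your own write pattern.

```python
def wal_tail_bytes(signals_per_second: float, bytes_per_event: int = 40,
                   checkpoint_interval_s: int = 30) -> float:
    """Expected uncompacted WAL tail: rate * bytes per event * checkpoint interval."""
    return signals_per_second * bytes_per_event * checkpoint_interval_s


def wal_lag_looks_stuck(observed_lag_bytes: float, signals_per_second: float,
                        factor: float = 10.0) -> bool:
    """True if the observed lag exceeds the expected tail by an assumed factor."""
    return observed_lag_bytes > factor * wal_tail_bytes(signals_per_second)


if __name__ == "__main__":
    rate = 1_000  # events per second
    print(f"expected tail: ~{wal_tail_bytes(rate) / 1e6:.1f} MB")
    print("stuck?", wal_lag_looks_stuck(50_000_000, rate))
```

Feed `observed_lag_bytes` from the `tidaldb_wal_lag_bytes` metric; a sustained `True` suggests checking `tidaldb_checkpoint_failures_total`.
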
---
## Startup Time
Startup involves: opening fjall keyspaces, restoring the signal ledger from checkpoint, replaying WAL events since the last checkpoint, rebuilding in-memory indexes (bitmap, range, universe, creator-items, collections, suggestions), and loading USearch vector indexes.
| Items | Vectors | Typical Startup |
|------:|--------:|:----------------|
| 100K | 100K | ~2-5 sec |
| 1M | 1M | ~15-45 sec |
| 10M | 10M | ~3-8 min |
### Dominant Costs
1. **USearch index load** is the dominant startup cost at 1M+ vectors. USearch rebuilds the HNSW graph from its serialized format. Progress is logged every 10K vectors.
2. **Signal ledger restore** reads the checkpoint from fjall (a single prefix scan of `Tag::Sig` keys), then replays any WAL events with sequence numbers higher than the checkpoint's `wal_sequence`. Time is proportional to the number of active signal entries + unreplayed WAL events.
3. **Entity state rebuild** scans the items and users keyspaces to reconstruct creator-items bitmaps, relationship indexes (follows, blocks, hides), and interaction weights. Progress is logged every 10K items.
4. **Suggestion index rebuild** scans all item metadata for "title" fields and indexes terms for autocomplete. This is a sequential scan -- fast for 100K items, noticeable at 10M.
5. **Collection index rebuild** reconstructs collection membership bitmaps from fjall.
### Notes
- Startup time is I/O-bound, not CPU-bound. Fast NVMe storage reduces startup time significantly compared to spinning disk.
- WAL replay time depends on how many signals were written since the last checkpoint (at most ~30 seconds of writes under normal operation).
- Tantivy indexes are opened directly from disk (memory-mapped) and do not require a rebuild step.
---
## Recommended Provisioning
**General rule:** provision at least 2x the estimated RAM for headroom. The recommendations below round the doubled estimate up to a common memory size.
| Scale | Recommended RAM | Recommended Disk | CPU Cores |
|:---------|:----------------|:-----------------|:----------|
| 100K items, 128D | 512 MB | 5 GB SSD | 2 |
| 100K items, 768D | 1 GB | 5 GB SSD | 2 |
| 1M items, 128D | 4 GB | 25 GB SSD | 4 |
| 1M items, 768D | 8 GB | 25 GB SSD | 4 |
| 10M items, 128D | 32 GB | 250 GB NVMe | 8 |
| 10M items, 768D | 64 GB | 250 GB NVMe | 8 |
| 10M items, 1536D | 96 GB | 250 GB NVMe | 16 |
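
The 2x rule can be applied to the per-component estimates from the RAM table. This is a hedged sketch with a hypothetical function name; it only doubles the sum and leaves rounding up to an available machine size to the reader.

```python
def recommended_ram_gb(usearch_gb: float, ledger_gb: float, tantivy_gb: float,
                       headroom: float = 2.0) -> float:
    """Sum the component RAM estimates and multiply by the headroom factor."""
    return (usearch_gb + ledger_gb + tantivy_gb) * headroom


if __name__ == "__main__":
    # Component estimates for the 1M items / 768D row of the RAM table.
    minimum = recommended_ram_gb(1.5, 1.1, 0.2)
    print(f"provision at least ~{minimum:.1f} GB, then round up to the next available size")
```
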
### Why 2x headroom?
- Signal ledger entries grow as new `(entity, signal_type)` pairs are written. The hot tier can hold up to 5M entries before the checkpoint thread starts evicting cold entries.
- Tantivy segment merges temporarily double the index size during merge operations.
- USearch does not support incremental resize -- if you approach capacity, you need enough free RAM to hold both the old and new index during a potential rebuild.
- The Rust allocator (jemalloc or system) has its own fragmentation overhead.
### Swap
Do not configure swap for production tidalDB instances. USearch HNSW traversal accesses memory in a random-access pattern that defeats page-level caching. A single swapped-out page in the HNSW graph can turn a 50-microsecond ANN query into one that stalls on a 50-millisecond disk seek.
### Disk Type
SSD is strongly recommended for all deployments. NVMe is recommended at 10M+ items. The WAL uses synchronous `fsync` on every segment rotation, and fjall's journal uses `persist(SyncAll)` during checkpoint. Spinning disk latency on these operations directly impacts signal write throughput.