stemedb/docs/consistency-model.md
jordan d3a88585fe feat: Phase 6 UAT - Admission control, HLC recency, cluster coordination
2026-02-03 00:43:37 -07:00

# StemeDB Consistency Model
This document describes the distributed consistency guarantees provided by StemeDB, the mechanisms that enforce them, and what is explicitly **not** guaranteed.
## Six Core Properties
| Property | Guarantee | Mechanism | Test Evidence |
|----------|-----------|-----------|---------------|
| **Eventual Convergence** | All replicas converge to identical state | CRDT merge + Anti-entropy sync | `stemedb-sync/tests/convergence.rs` |
| **Causal Ordering** | Operations respect happens-before | HLC timestamps + `HlcRecencyLens` | `stemedb-lens/src/hlc_recency.rs` |
| **Partition Tolerance** | Writes succeed during network partitions | Leaderless replication | `stemedb-cluster/tests/partition_tolerance.rs` |
| **Availability** | Reads/writes succeed if any replica is up | Any-replica acceptance | `stemedb-cluster/tests/availability.rs` |
| **Durability** | Committed writes survive crashes | WAL with fsync | `stemedb-wal/src/lib.rs` |
| **Conflict Resolution** | Deterministic winner selection | Lens-based resolution | `stemedb-lens/src/*.rs` |
## What IS Guaranteed
### 1. Eventual Convergence
All nodes eventually contain the same set of assertions. After network partitions heal and anti-entropy sync completes, every replica has identical data.

**Mechanism:**
- CRDT (Conflict-free Replicated Data Type) stores for assertions and votes
- Merkle tree-based diff detection for efficient sync
- Anti-entropy worker periodically syncs with peers

**Timing:**
- Convergence typically occurs within seconds of partition healing
- Configurable `anti_entropy_interval` (default: 5 seconds)
- Metrics available via `AntiEntropyWorker::avg_convergence_duration_ms()`
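As a rough illustration of why CRDT merge leads to convergence, the sketch below models an assertion store as a grow-only set whose merge is a set union. The names `AssertionSet`, `merge`, and `replica` are hypothetical and are not taken from the StemeDB codebase; the real stores also carry votes and richer assertion data.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for an assertion store: a grow-only set of assertion IDs.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct AssertionSet {
    ids: BTreeSet<u64>,
}

impl AssertionSet {
    /// Records a new assertion ID. Nothing is ever removed.
    pub fn insert(&mut self, id: u64) {
        self.ids.insert(id);
    }

    /// CRDT merge as set union: commutative, associative, and idempotent,
    /// so replicas end up equal regardless of the order in which they sync.
    pub fn merge(&mut self, other: &AssertionSet) {
        self.ids.extend(other.ids.iter().copied());
    }
}

/// Helper for examples: builds a replica holding the given IDs.
pub fn replica(ids: &[u64]) -> AssertionSet {
    let mut set = AssertionSet::default();
    for &id in ids {
        set.insert(id);
    }
    set
}
```

Merging `replica(&[1, 2])` into `replica(&[2, 3])`, or the other way round, leaves both holding the same three IDs, and merging a replica with itself changes nothing.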
### 2. Causal Ordering
Operations that happen-before other operations are ordered correctly. If assertion A causally precedes assertion B, any node that has B also has A.

**Mechanism:**
- Hybrid Logical Clock (HLC) timestamps on every assertion
- HLC propagates through anti-entropy sync
- `HlcRecencyLens` resolves "most recent" deterministically using HLC, not wall clock

**Key insight:** Wall clocks can drift between nodes. HLC combines physical time with logical ordering to provide a total order even when clocks disagree.
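The sketch below shows the commonly published HLC update rule, to make the "even when clocks disagree" claim concrete. It is not StemeDB's implementation: `Hlc`, `HlcClock`, `now`, and `observe` are hypothetical names, and the wall-clock reading is passed in as a parameter so the behaviour is deterministic.

```rust
/// Hypothetical HLC timestamp. The derived ordering compares `physical`
/// first and `logical` second (field declaration order).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub physical: u64,
    pub logical: u32,
}

/// Hypothetical per-node clock state.
#[derive(Debug, Default)]
pub struct HlcClock {
    last: Hlc,
}

impl HlcClock {
    /// Timestamp for a local event (for example, a new assertion).
    pub fn now(&mut self, wall: u64) -> Hlc {
        let prev = self.last;
        let physical = prev.physical.max(wall);
        // If the wall clock did not advance past our last timestamp,
        // the logical counter keeps the result strictly increasing.
        let logical = if physical == prev.physical { prev.logical + 1 } else { 0 };
        self.last = Hlc { physical, logical };
        self.last
    }

    /// Advances the local clock after seeing a remote timestamp (for example, during sync).
    pub fn observe(&mut self, remote: Hlc, wall: u64) -> Hlc {
        let prev = self.last;
        let physical = prev.physical.max(remote.physical).max(wall);
        let logical = if physical == prev.physical && physical == remote.physical {
            prev.logical.max(remote.logical) + 1
        } else if physical == prev.physical {
            prev.logical + 1
        } else if physical == remote.physical {
            remote.logical + 1
        } else {
            0
        };
        self.last = Hlc { physical, logical };
        self.last
    }
}
```

With this rule, a node whose wall clock lags by hundreds of milliseconds still stamps its next event later than any timestamp it has already observed.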
### 3. Partition Tolerance
Writes continue on both sides of a network partition. No data is lost - both partitions' writes survive and merge after healing.

**Mechanism:**
- Leaderless replication: any replica accepts writes
- Append-only storage: concurrent writes coexist instead of conflicting
- Lens resolution at read time, not write time
### 4. High Availability
If any replica for a shard is reachable, reads and writes succeed. There is no single point of failure.

**Mechanism:**
- Multiple replicas per shard (configurable replication factor)
- Writes accepted by any replica
- Reads served by any replica with current data
### 5. Durability
Once a write is acknowledged, it survives process crashes and restarts.

**Mechanism:**
- Write-ahead log (WAL) with fsync
- Assertion data written to durable storage before acknowledgment
- Crash recovery replays uncommitted WAL entries
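A minimal sketch of the "fsync before acknowledgment" rule, using only the Rust standard library. The function name `append_durable` and the length-prefixed record layout are assumptions made for illustration; the actual format in `stemedb-wal` is not described here.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends one length-prefixed record to the log and returns only after
/// the data has been flushed to stable storage. A caller would acknowledge
/// the write to the client after this function returns `Ok`.
pub fn append_durable(path: &Path, payload: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    let len = payload.len() as u32;
    file.write_all(&len.to_le_bytes())?; // 4-byte little-endian length prefix
    file.write_all(payload)?;
    file.sync_all()?; // fsync: file data and metadata reach the disk
    Ok(())
}
```

Opening the file on every append keeps the sketch short; a long-lived file handle and batched syncs are the usual refinements.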
### 6. Deterministic Conflict Resolution
When multiple assertions exist for the same subject+predicate, all nodes resolve to the same winner.

**Mechanism:**
- Lenses provide resolution strategies:
  - `HlcRecencyLens`: Latest HLC timestamp wins (total order)
  - `ConsensusLens`: Most common value wins
  - `ConfidenceLens`: Highest confidence wins
  - `TrustAwareAuthorityLens`: Weighted by source reputation
- Tiebreaker: `source_hash` provides deterministic ordering when primary criteria match
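To illustrate how a recency-style lens can be deterministic, the sketch below picks the assertion with the latest HLC and falls back to `source_hash` on a tie. The `Assertion` struct, its field types, and the choice that the larger `source_hash` wins are assumptions made for illustration; the real `HlcRecencyLens` may differ.

```rust
/// Hypothetical, simplified assertion record.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Assertion {
    pub value: String,
    /// (physical, logical) HLC components, compared in that order.
    pub hlc: (u64, u32),
    /// Stand-in for a hash identifying the source of the assertion.
    pub source_hash: u64,
}

/// Latest HLC wins; equal HLCs are ordered by `source_hash`.
/// The result does not depend on the order of `candidates`, provided no
/// two candidates share both `hlc` and `source_hash`.
pub fn resolve_by_recency(candidates: &[Assertion]) -> Option<&Assertion> {
    candidates.iter().max_by(|a, b| {
        a.hlc
            .cmp(&b.hlc)
            .then_with(|| a.source_hash.cmp(&b.source_hash))
    })
}
```

Because the comparison uses only fields stored with each assertion, two nodes holding the same candidates select the same winner without coordinating.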
## What is NOT Guaranteed
### 1. Linearizability
StemeDB is **not** linearizable. A write on node A is not immediately visible on node B.

**Why:** Linearizability requires replicas to coordinate on every operation, which cannot be combined with availability during a network partition.

**Workaround:** Use HLC timestamps to establish order. If your use case requires seeing your own writes immediately, read from the node you wrote to.
### 2. Read-Your-Writes (Cross-Node)
After writing to node A, a read from node B may not see the write immediately.

**Why:** Anti-entropy sync is asynchronous to optimize for availability.

**Workaround:**
- Sticky sessions (always read from the node you wrote to)
- Wait for anti-entropy sync to complete (typically <10 seconds)
- Use gossip for faster propagation of new writes
### 3. Snapshot Isolation
Concurrent reads may see different subsets of data.

**Why:** There is no global transaction coordinator.

**Workaround:** For consistent snapshots, use epoch-aware lenses that filter to a specific epoch.
### 4. Strong Consistency
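There is no guarantee that all nodes see operations in the same order at the same time.

**Why:** This would require nodes to coordinate on every operation. Under the CAP theorem, a system that keeps accepting reads and writes during a partition cannot also provide strong consistency.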
## Clock Skew Handling
### HLC Design
HLC timestamps combine:
- **Physical time:** NTP64 format (a 64-bit fixed-point value: 32 bits of whole seconds since the Unix epoch and 32 bits of fractional second)
- **Logical counter:** Disambiguates events with same physical time
- **Node ID:** Breaks ties when counter and time match
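The comparison order implied by these three components can be written out explicitly. The sketch below is an assumption about how such a timestamp might be compared; `HlcTimestamp` and its field widths are hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical three-part HLC timestamp.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HlcTimestamp {
    pub physical: u64,
    pub counter: u32,
    pub node_id: u16,
}

impl Ord for HlcTimestamp {
    /// Physical time first, then the logical counter, then the node ID.
    /// Two timestamps compare equal only if all three components match.
    fn cmp(&self, other: &Self) -> Ordering {
        self.physical
            .cmp(&other.physical)
            .then(self.counter.cmp(&other.counter))
            .then(self.node_id.cmp(&other.node_id))
    }
}

impl PartialOrd for HlcTimestamp {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```

Sorting a list of such timestamps yields the same sequence on every node, which is what makes "latest wins" well defined.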
### Skew Detection
The system detects clock skew exceeding 500ms:
- `detect_clock_skew()` compares local and remote HLC timestamps
- `clock_skew_events` metric tracks skew occurrences
- Warning logged when skew exceeds threshold
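The check described above might look roughly like the following. This is a standalone sketch, not the signature of the real `detect_clock_skew()`: `SkewMonitor`, `observe`, and the millisecond parameters are hypothetical, and only the physical component of each timestamp is compared.

```rust
use std::time::Duration;

/// Threshold stated in the text above.
pub const SKEW_THRESHOLD: Duration = Duration::from_millis(500);

/// Hypothetical monitor; the counter mirrors the `clock_skew_events` metric.
#[derive(Debug, Default)]
pub struct SkewMonitor {
    pub clock_skew_events: u64,
}

impl SkewMonitor {
    /// Compares local and remote physical time (both in milliseconds).
    /// Returns the skew only when it exceeds the threshold.
    pub fn observe(&mut self, local_ms: u64, remote_ms: u64) -> Option<Duration> {
        let skew = Duration::from_millis(local_ms.abs_diff(remote_ms));
        if skew > SKEW_THRESHOLD {
            self.clock_skew_events += 1;
            eprintln!("warning: clock skew of {:?} exceeds {:?}", skew, SKEW_THRESHOLD);
            Some(skew)
        } else {
            None
        }
    }
}
```

A skew of exactly 500ms is not reported in this sketch, since the text says the threshold must be exceeded.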
### Recommendations
1. **Use NTP:** All nodes should synchronize clocks via NTP
2. **Monitor skew:** Track `clock_skew_events` metric
3. **Tolerate drift:** HLC handles moderate skew (up to roughly a second) gracefully
4. **Investigate large skew:** Skew > 1 second may indicate NTP misconfiguration
## Recovery Scenarios
### Partition Heal
1. Anti-entropy detects divergent Merkle roots
2. Diff computed to find missing assertions
3. Missing assertions fetched and merged via CRDT
4. Local HLC updated from remote timestamps
5. Convergence achieved when roots match

**Metric:** `avg_convergence_duration_ms()` tracks time from divergence detection to convergence.
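The partition-heal steps can be sketched in a simplified form. In the code below a single digest over the sorted assertion IDs stands in for the Merkle root, and the HLC update step is omitted. A real Merkle tree would also narrow the diff to the subtrees that differ. `root_digest` and `sync_from_peer` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Stand-in for a Merkle root: one digest over the assertion IDs in sorted order.
pub fn root_digest(ids: &BTreeSet<u64>) -> u64 {
    let mut hasher = DefaultHasher::new();
    for id in ids {
        id.hash(&mut hasher);
    }
    hasher.finish()
}

/// One sync round: if the roots differ, fetch what the peer has and we lack,
/// then merge it. Returns the number of assertions merged.
pub fn sync_from_peer(local: &mut BTreeSet<u64>, peer: &BTreeSet<u64>) -> usize {
    if root_digest(&*local) == root_digest(peer) {
        return 0; // roots match: already converged
    }
    let missing: Vec<u64> = peer.difference(&*local).copied().collect();
    local.extend(missing.iter().copied());
    missing.len()
}
```

Running one round in each direction between two diverged replicas leaves both with the same set and the same digest, after which further rounds are no-ops.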
### Node Crash
1. On restart, WAL is replayed
2. Uncommitted entries are re-applied
3. Merkle tree rebuilt from stored assertions
4. Anti-entropy resumes syncing with peers
### Corrupt WAL
1. Corrupted entries detected via checksum
2. Valid entries up to corruption point recovered
3. Node syncs missing data from peers via anti-entropy
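Steps 1 and 2 can be illustrated with a checksummed record format. The layout below (length, checksum, payload) and the use of FNV-1a as the checksum are assumptions chosen to keep the sketch dependency-free; they are not taken from `stemedb-wal`.

```rust
/// 32-bit FNV-1a, used here only as a simple stand-in checksum.
pub fn checksum(data: &[u8]) -> u32 {
    data.iter()
        .fold(0x811c_9dc5u32, |h, &b| (h ^ b as u32).wrapping_mul(0x0100_0193))
}

/// Encodes one record as: 4-byte length, 4-byte checksum, payload (little-endian).
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns every record up to (not including) the first truncated or
/// corrupted one. Later records are left for anti-entropy to restore.
pub fn recover_valid_prefix(log: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut pos = 0usize;
    while let Some(header) = log.get(pos..pos + 8) {
        let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
        let expected = u32::from_le_bytes([header[4], header[5], header[6], header[7]]);
        let Some(payload) = log.get(pos + 8..pos + 8 + len) else {
            break; // truncated record at the tail
        };
        if checksum(payload) != expected {
            break; // corruption detected: stop at the last valid record
        }
        records.push(payload.to_vec());
        pos += 8 + len;
    }
    records
}
```

Stopping at the first bad record is the conservative choice: everything before it is known to be intact, and the node relies on its peers for the rest.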
## Testing Evidence
All consistency properties are verified by automated tests:
| Test File | Property Tested |
|-----------|-----------------|
| `crates/stemedb-sync/tests/convergence.rs` | Two-node convergence, overlapping data, lens determinism, merge commutativity |
| `crates/stemedb-cluster/tests/partition_tolerance.rs` | Write success during partition, post-partition convergence, concurrent writes |
| `crates/stemedb-cluster/tests/availability.rs` | Read/write on any replica, node failure isolation, quorum availability |
| `crates/stemedb-lens/src/hlc_recency.rs` | HLC ordering, clock skew scenarios, deterministic tiebreakers |

Run all consistency tests:
```bash
cargo test -p stemedb-sync --test convergence
cargo test -p stemedb-cluster --test partition_tolerance
cargo test -p stemedb-cluster --test availability
cargo test -p stemedb-lens -- hlc_recency
```
## Metrics Reference
| Metric | Location | Description |
|--------|----------|-------------|
| `sync_cycles` | `AntiEntropyWorker` | Completed sync cycles |
| `sync_failures` | `AntiEntropyWorker` | Failed sync attempts |
| `assertions_synced` | `AntiEntropyWorker` | Total assertions merged |
| `hlc_updates` | `AntiEntropyWorker` | Times local HLC advanced from remote |
| `clock_skew_events` | `AntiEntropyWorker` | Times skew exceeded 500ms |
| `convergence_count()` | `AntiEntropyWorker` | Number of convergence events |
| `avg_convergence_duration_ms()` | `AntiEntropyWorker` | Average time to converge |
## See Also
- [Architecture Overview](../architecture.md)
- [Distributed Write Path](research/distributed-write-path.md)
- [Data Structures](data-structures.md)
- [Roadmap](../roadmap.md)