# StemeDB Consistency Model

This document describes the distributed consistency guarantees provided by StemeDB, the mechanisms that enforce them, and what is explicitly **not** guaranteed.

## Six Core Properties

| Property | Guarantee | Mechanism | Test Evidence |
|----------|-----------|-----------|---------------|
| **Eventual Convergence** | All replicas converge to identical state | CRDT merge + Anti-entropy sync | `stemedb-sync/tests/convergence.rs` |
| **Causal Ordering** | Operations respect happens-before | HLC timestamps + `HlcRecencyLens` | `stemedb-lens/src/hlc_recency.rs` |
| **Partition Tolerance** | Writes succeed during network partitions | Leaderless replication | `stemedb-cluster/tests/partition_tolerance.rs` |
| **High Availability** | Reads/writes succeed if any replica of the shard is reachable | Any-replica acceptance | `stemedb-cluster/tests/availability.rs` |
| **Durability** | Committed writes survive crashes | WAL with fsync | `stemedb-wal/src/lib.rs` |
| **Conflict Resolution** | Deterministic winner selection | Lens-based resolution | `stemedb-lens/src/*.rs` |

## What IS Guaranteed

### 1. Eventual Convergence

All nodes eventually contain the same set of assertions. After network partitions heal and anti-entropy sync completes, every replica has identical data.

**Mechanism:**
- CRDT (Conflict-free Replicated Data Type) stores for assertions and votes
- Merkle tree-based diff detection for efficient sync
- Anti-entropy worker periodically syncs with peers
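
To make the merge property concrete, here is a minimal sketch of a grow-only set CRDT. The `AssertionSet` type, the tuple layout of an assertion, and the `assertion` helper are assumptions made for illustration, not StemeDB's actual types. The point is only that a union-based merge is commutative and idempotent, which is what lets replicas converge regardless of sync order.

```rust
use std::collections::BTreeSet;

/// Hypothetical assertion identity: (subject, predicate, value, source_hash).
type Assertion = (String, String, String, u64);

/// Grow-only set CRDT: the only mutation is insertion, and merge is set union.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct AssertionSet {
    items: BTreeSet<Assertion>,
}

impl AssertionSet {
    pub fn insert(&mut self, a: Assertion) {
        self.items.insert(a);
    }

    /// Union is commutative, associative, and idempotent, so replicas that
    /// exchange state in any order end up with the same set.
    pub fn merge(&mut self, other: &AssertionSet) {
        self.items.extend(other.items.iter().cloned());
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}

/// Convenience constructor for the illustrative tuple type.
pub fn assertion(subject: &str, predicate: &str, value: &str, source_hash: u64) -> Assertion {
    (subject.to_string(), predicate.to_string(), value.to_string(), source_hash)
}
```

Merging A into B and B into A yields equal sets, and merging the same peer twice changes nothing.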

**Timing:**
- Convergence typically occurs within seconds of partition healing
- Configurable `anti_entropy_interval` (default: 5 seconds)
- Metrics available via `AntiEntropyWorker::avg_convergence_duration_ms()`

### 2. Causal Ordering

If operation A happens-before operation B, A is ordered before B on every node. If assertion A causally precedes assertion B, any node that has B also has A.

**Mechanism:**
- Hybrid Logical Clock (HLC) timestamps on every assertion
- HLC propagates through anti-entropy sync
- `HlcRecencyLens` resolves "most recent" deterministically using HLC, not wall clock

**Key insight:** Wall clocks can drift between nodes. HLC combines physical time with logical ordering to provide a total order even when clocks disagree.
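
The following sketch shows the standard HLC update rules for local events and for observing a remote timestamp. The `Hlc` and `HlcClock` types, the millisecond resolution, and the field widths are assumptions chosen for readability; they are not taken from the StemeDB source.

```rust
/// Hypothetical HLC timestamp. Field order matters: the derived `Ord`
/// compares physical time first, then the logical counter, then the node ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub physical_ms: u64,
    pub logical: u32,
    pub node_id: u16,
}

pub struct HlcClock {
    last: Hlc,
}

impl HlcClock {
    pub fn new(node_id: u16) -> Self {
        HlcClock { last: Hlc { physical_ms: 0, logical: 0, node_id } }
    }

    /// Timestamp a local event. `wall_ms` is this node's wall-clock reading.
    pub fn tick(&mut self, wall_ms: u64) -> Hlc {
        if wall_ms > self.last.physical_ms {
            self.last.physical_ms = wall_ms;
            self.last.logical = 0;
        } else {
            // Wall clock stalled or went backwards: advance the counter instead.
            self.last.logical += 1;
        }
        self.last
    }

    /// Fold in a remote timestamp so every later local event orders after it.
    pub fn observe(&mut self, remote: Hlc, wall_ms: u64) -> Hlc {
        let max_phys = wall_ms.max(self.last.physical_ms).max(remote.physical_ms);
        let logical = if max_phys == self.last.physical_ms && max_phys == remote.physical_ms {
            self.last.logical.max(remote.logical) + 1
        } else if max_phys == self.last.physical_ms {
            self.last.logical + 1
        } else if max_phys == remote.physical_ms {
            remote.logical + 1
        } else {
            0
        };
        self.last.physical_ms = max_phys;
        self.last.logical = logical;
        self.last
    }
}
```

With this rule, a node whose wall clock runs three seconds behind still issues timestamps greater than any timestamp it has already received.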

### 3. Partition Tolerance

Writes continue on both sides of a network partition. No data is lost: writes from both sides survive and are merged after the partition heals.

**Mechanism:**
- Leaderless replication: any replica accepts writes
- Append-only storage: writes never overwrite each other, so conflicting assertions coexist
- Lens resolution at read time, not write time
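
A minimal sketch of the append-only idea, assuming a hypothetical `AppendOnlyStore` keyed by subject and predicate and an HLC simplified to a single integer. Writes only ever push; picking a winner is deferred to whoever reads.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct Assertion {
    pub value: String,
    pub hlc: u64,         // simplified: a real HLC has several components
    pub source_hash: u64, // deterministic tiebreaker
}

/// Hypothetical store: every write is appended, nothing is overwritten.
#[derive(Default)]
pub struct AppendOnlyStore {
    by_key: HashMap<(String, String), Vec<Assertion>>,
}

impl AppendOnlyStore {
    /// Accepting a write never inspects or replaces existing assertions.
    pub fn write(&mut self, subject: &str, predicate: &str, a: Assertion) {
        self.by_key
            .entry((subject.to_string(), predicate.to_string()))
            .or_default()
            .push(a);
    }

    /// All coexisting assertions for a key, in arrival order.
    pub fn candidates(&self, subject: &str, predicate: &str) -> &[Assertion] {
        let key = (subject.to_string(), predicate.to_string());
        match self.by_key.get(&key) {
            Some(v) => v.as_slice(),
            None => &[],
        }
    }
}

/// One possible read-time resolution: highest HLC wins, `source_hash` breaks ties.
pub fn latest_wins(candidates: &[Assertion]) -> Option<&Assertion> {
    candidates.iter().max_by_key(|a| (a.hlc, a.source_hash))
}
```

Because resolution happens on read, two sides of a partition can both accept a write for the same key without coordinating.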

### 4. High Availability

If any replica for a shard is reachable, reads and writes succeed. There is no single point of failure.

**Mechanism:**
- Multiple replicas per shard (configurable replication factor)
- Writes accepted by any replica
- Reads served by any replica from the data it currently holds (which may lag other replicas)

### 5. Durability

Once a write is acknowledged, it survives process crashes and restarts.

**Mechanism:**
- Write-ahead log (WAL) with fsync
- Assertion data written to durable storage before acknowledgment
- Crash recovery replays WAL entries that were logged but not yet applied to the store
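
As an illustration of "fsync before acknowledgment", here is a small sketch built on the standard library. The `Wal` type, the length-prefixed record format, and `read_all` are assumptions for this example; the real `stemedb-wal` format is not described here.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical minimal WAL: length-prefixed records in one append-only file.
pub struct Wal {
    file: File,
}

impl Wal {
    pub fn open(path: &Path) -> io::Result<Wal> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Wal { file })
    }

    /// Append one record and flush it to stable storage before returning.
    /// Only after this returns `Ok` may the caller acknowledge the write.
    pub fn append(&mut self, payload: &[u8]) -> io::Result<()> {
        let len = payload.len() as u32;
        self.file.write_all(&len.to_le_bytes())?;
        self.file.write_all(payload)?;
        self.file.sync_all()?; // fsync on Unix-like systems
        Ok(())
    }
}

/// Read back every complete record; a torn record at the tail is ignored.
pub fn read_all(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let bytes = std::fs::read(path)?;
    let mut records = Vec::new();
    let mut pos = 0;
    while pos + 4 <= bytes.len() {
        let len = u32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]])
            as usize;
        pos += 4;
        if pos + len > bytes.len() {
            break; // incomplete record left by a crash mid-write
        }
        records.push(bytes[pos..pos + len].to_vec());
        pos += len;
    }
    Ok(records)
}
```

Calling `sync_all` on every append trades throughput for the guarantee; batching several records per fsync is a common alternative when acknowledgments can be grouped.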

### 6. Deterministic Conflict Resolution

When multiple assertions exist for the same subject+predicate, all nodes resolve to the same winner.

**Mechanism:**
- Lenses provide resolution strategies:
  - `HlcRecencyLens`: Latest HLC timestamp wins (total order)
  - `ConsensusLens`: Most common value wins
  - `ConfidenceLens`: Highest confidence wins
  - `TrustAwareAuthorityLens`: Weighted by source reputation
- Tiebreaker: `source_hash` provides deterministic ordering when primary criteria match
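
The sketch below illustrates why a lens with a total tiebreaker is deterministic: the winner depends only on the set of candidates, not on the order in which a node received them. The `Lens` trait, the two lens types, and the `Assertion` fields are hypothetical stand-ins, not the signatures used in `stemedb-lens`.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Assertion {
    pub value: String,
    pub hlc: u64,         // simplified HLC
    pub confidence: f64,
    pub source_hash: u64, // assumed unique per source
}

/// Hypothetical lens interface: pick one winner from coexisting assertions.
pub trait Lens {
    fn resolve<'a>(&self, candidates: &'a [Assertion]) -> Option<&'a Assertion>;
}

pub struct RecencyLens;
pub struct HighestConfidenceLens;

impl Lens for RecencyLens {
    /// Latest HLC wins; equal HLCs fall back to `source_hash`.
    fn resolve<'a>(&self, candidates: &'a [Assertion]) -> Option<&'a Assertion> {
        candidates.iter().max_by_key(|a| (a.hlc, a.source_hash))
    }
}

impl Lens for HighestConfidenceLens {
    /// Highest confidence wins; equal confidence falls back to `source_hash`.
    /// `total_cmp` gives floats a total order, so NaN cannot break determinism.
    fn resolve<'a>(&self, candidates: &'a [Assertion]) -> Option<&'a Assertion> {
        candidates.iter().max_by(|a, b| {
            a.confidence
                .total_cmp(&b.confidence)
                .then(a.source_hash.cmp(&b.source_hash))
        })
    }
}
```

Two replicas holding the same assertions in different arrival orders resolve to the same winner under either lens.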

## What is NOT Guaranteed

### 1. Linearizability

StemeDB is **not** linearizable. A write on node A is not immediately visible on node B.

**Why:** Linearizability requires synchronous replication, which conflicts with partition tolerance and availability.

**Workaround:** Use HLC timestamps to establish order. If your use case requires seeing your own writes immediately, read from the node you wrote to.

### 2. Read-Your-Writes (Cross-Node)

After writing to node A, a read from node B may not see the write immediately.

**Why:** Anti-entropy sync is asynchronous to optimize for availability.

**Workaround:**
- Sticky sessions (always read from the node you wrote to)
- Wait for anti-entropy sync to complete (typically <10 seconds)
- Use gossip for faster propagation of new writes
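
One way to implement the sticky-session workaround on the client side is sketched here. `StickyRouter` and its round-robin assignment are assumptions for illustration; StemeDB may or may not ship anything similar.

```rust
use std::collections::HashMap;

/// Hypothetical client-side router that pins each session to one node,
/// so a session's reads go to the node that took its writes.
pub struct StickyRouter {
    nodes: Vec<String>,
    pinned: HashMap<String, usize>,
    next: usize,
}

impl StickyRouter {
    pub fn new(nodes: Vec<String>) -> Self {
        assert!(!nodes.is_empty(), "at least one node is required");
        StickyRouter { nodes, pinned: HashMap::new(), next: 0 }
    }

    /// Node for this session. The first call assigns a node round-robin;
    /// every later call for the same session returns the same node.
    pub fn node_for(&mut self, session: &str) -> &str {
        let existing = self.pinned.get(session).copied();
        let idx = match existing {
            Some(i) => i,
            None => {
                let i = self.next % self.nodes.len();
                self.next += 1;
                self.pinned.insert(session.to_string(), i);
                i
            }
        };
        &self.nodes[idx]
    }
}
```

This sketch does not handle the pinned node becoming unreachable; re-pinning to another node restores availability but gives up read-your-writes until anti-entropy catches up.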

### 3. Snapshot Isolation

Concurrent reads may see different subsets of data.

**Why:** There is no global transaction coordinator.

**Workaround:** For consistent snapshots, use epoch-aware lenses that filter to a specific epoch.

### 4. Strong Consistency

There is no guarantee that all nodes see operations in the same order at the same time.

**Why:** This would require coordination between replicas, and by the CAP theorem a system that coordinates for consistency must give up availability during a network partition.

## Clock Skew Handling

### HLC Design

HLC timestamps combine:
- **Physical time:** NTP64 format (64-bit fixed point: 32 bits of whole seconds and 32 bits of fractional second, measured from the Unix epoch)
- **Logical counter:** Disambiguates events with same physical time
- **Node ID:** Breaks ties when counter and time match
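
A small sketch of how the three components could be laid out and compared. It assumes NTP64 means the usual 64-bit fixed-point layout (upper 32 bits whole seconds, lower 32 bits fraction of a second); `to_ntp64`, `HlcTimestamp`, and the field widths are hypothetical names for this example.

```rust
use std::time::Duration;

/// Pack a duration since the Unix epoch into a 64-bit fixed-point value:
/// upper 32 bits are whole seconds, lower 32 bits are the fraction of a second.
pub fn to_ntp64(since_epoch: Duration) -> u64 {
    let secs = since_epoch.as_secs() & 0xFFFF_FFFF;
    let frac = ((since_epoch.subsec_nanos() as u64) << 32) / 1_000_000_000;
    (secs << 32) | frac
}

/// Field order drives the derived `Ord`: time, then counter, then node ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct HlcTimestamp {
    pub time: u64,    // NTP64-style physical time
    pub counter: u32, // disambiguates events with equal physical time
    pub node_id: u16, // breaks ties when time and counter both match
}
```

Because the comparison is lexicographic over all three fields, two distinct events never compare as equal unless they come from the same node with the same time and counter.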

### Skew Detection

The system detects clock skew exceeding 500ms:
- `detect_clock_skew()` compares local and remote HLC timestamps
- `clock_skew_events` metric tracks skew occurrences
- Warning logged when skew exceeds threshold
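
The check itself is simple. The sketch below assumes the physical components are available in milliseconds; `SkewMonitor` and its `check` method are illustrative and do not reproduce the real `detect_clock_skew()` signature.

```rust
/// Threshold from the text above: skew beyond 500 ms counts as an event.
pub const SKEW_THRESHOLD_MS: u64 = 500;

/// Hypothetical monitor holding the `clock_skew_events` counter.
#[derive(Default)]
pub struct SkewMonitor {
    pub clock_skew_events: u64,
}

impl SkewMonitor {
    /// Compare the physical parts of a local and a remote timestamp.
    /// Returns the absolute skew when it exceeds the threshold.
    pub fn check(&mut self, local_physical_ms: u64, remote_physical_ms: u64) -> Option<u64> {
        let skew = local_physical_ms.abs_diff(remote_physical_ms);
        if skew > SKEW_THRESHOLD_MS {
            self.clock_skew_events += 1;
            eprintln!(
                "warning: clock skew of {} ms exceeds {} ms threshold",
                skew, SKEW_THRESHOLD_MS
            );
            Some(skew)
        } else {
            None
        }
    }
}
```

The comparison is symmetric, so a remote clock that is ahead and one that is behind are reported the same way.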

### Recommendations

1. **Use NTP:** All nodes should synchronize clocks via NTP
2. **Monitor skew:** Track `clock_skew_events` metric
3. **Tolerate drift:** HLC handles moderate (sub-second) skew gracefully
4. **Investigate large skew:** Skew > 1 second may indicate NTP misconfiguration

## Recovery Scenarios

### Partition Heal

1. Anti-entropy detects divergent Merkle roots
2. Diff computed to find missing assertions
3. Missing assertions fetched and merged via CRDT
4. Local HLC updated from remote timestamps
5. Convergence achieved when roots match

**Metric:** `avg_convergence_duration_ms()` tracks time from divergence detection to convergence.
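
The heal steps can be sketched as a single anti-entropy round between two replicas. This is a simplification under stated assumptions: a flat FNV-1a digest over the sorted contents stands in for a Merkle root, the diff is a plain set difference rather than a tree walk, and the HLC update in step 4 is omitted. `Replica` and `sync` are hypothetical names.

```rust
use std::collections::BTreeSet;

fn mix(h: u64, byte: u8) -> u64 {
    (h ^ byte as u64).wrapping_mul(0x0000_0100_0000_01b3) // FNV-1a 64-bit prime
}

#[derive(Clone, Debug, Default)]
pub struct Replica {
    pub assertions: BTreeSet<String>,
}

impl Replica {
    /// Digest over the sorted contents. A real Merkle tree would also allow
    /// locating the differing ranges without comparing the full sets.
    pub fn root(&self) -> u64 {
        let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
        for a in &self.assertions {
            for &b in a.as_bytes() {
                h = mix(h, b);
            }
            h = mix(h, 0); // separator so ["ab"] and ["a", "b"] differ
        }
        h
    }
}

/// One round. Returns how many assertions `a` and `b` each pulled in.
pub fn sync(a: &mut Replica, b: &mut Replica) -> (usize, usize) {
    // Step 1: matching roots mean there is nothing to do.
    if a.root() == b.root() {
        return (0, 0);
    }
    // Step 2: compute what each side is missing.
    let missing_on_a: Vec<String> = b.assertions.difference(&a.assertions).cloned().collect();
    let missing_on_b: Vec<String> = a.assertions.difference(&b.assertions).cloned().collect();
    // Step 3: merge by union; nothing is ever removed.
    a.assertions.extend(missing_on_a.iter().cloned());
    b.assertions.extend(missing_on_b.iter().cloned());
    (missing_on_a.len(), missing_on_b.len())
}
```

After one round the roots match, and a second round is a no-op, which corresponds to step 5.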

### Node Crash

1. On restart, WAL is replayed
2. Entries that were logged but not yet applied to the store are re-applied
3. Merkle tree rebuilt from stored assertions
4. Anti-entropy resumes syncing with peers

### Corrupt WAL

1. Corrupted entries detected via checksum
2. Valid entries up to corruption point recovered
3. Node syncs missing data from peers via anti-entropy
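
A sketch of checksum-guarded recovery of the valid prefix. The frame layout (`[len][checksum][payload]`) is an assumption, and FNV-1a is used only as a dependency-free stand-in for whatever checksum the real WAL uses; a CRC would be more typical.

```rust
/// FNV-1a (32-bit), used here only as a stand-in checksum.
fn checksum(data: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for &b in data {
        h ^= b as u32;
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

fn read_u32(buf: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]])
}

/// Hypothetical frame: [len: u32 LE][checksum: u32 LE][payload].
pub fn encode(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Return every record up to the first truncated or corrupt frame.
/// Anything after that point is left for anti-entropy to refill from peers.
pub fn recover(log: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut pos = 0;
    while pos + 8 <= log.len() {
        let len = read_u32(log, pos) as usize;
        let expected = read_u32(log, pos + 4);
        let start = pos + 8;
        let end = match start.checked_add(len) {
            Some(e) if e <= log.len() => e,
            _ => break, // truncated frame
        };
        if checksum(&log[start..end]) != expected {
            break; // corrupt payload
        }
        records.push(log[start..end].to_vec());
        pos = end;
    }
    records
}
```

Stopping at the first bad frame is deliberately conservative: once a length field may be wrong, later frame boundaries cannot be trusted.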

## Testing Evidence

All consistency properties are verified by automated tests:

| Test File | Property Tested |
|-----------|-----------------|
| `crates/stemedb-sync/tests/convergence.rs` | Two-node convergence, overlapping data, lens determinism, merge commutativity |
| `crates/stemedb-cluster/tests/partition_tolerance.rs` | Write success during partition, post-partition convergence, concurrent writes |
| `crates/stemedb-cluster/tests/availability.rs` | Read/write on any replica, node failure isolation, quorum availability |
| `crates/stemedb-lens/src/hlc_recency.rs` | HLC ordering, clock skew scenarios, deterministic tiebreakers |

Run all consistency tests:

```bash
cargo test -p stemedb-sync --test convergence
cargo test -p stemedb-cluster --test partition_tolerance
cargo test -p stemedb-cluster --test availability
cargo test -p stemedb-lens -- hlc_recency
```

## Metrics Reference

| Metric | Location | Description |
|--------|----------|-------------|
| `sync_cycles` | `AntiEntropyWorker` | Completed sync cycles |
| `sync_failures` | `AntiEntropyWorker` | Failed sync attempts |
| `assertions_synced` | `AntiEntropyWorker` | Total assertions merged |
| `hlc_updates` | `AntiEntropyWorker` | Times local HLC advanced from remote |
| `clock_skew_events` | `AntiEntropyWorker` | Times skew exceeded 500ms |
| `convergence_count()` | `AntiEntropyWorker` | Number of convergence events |
| `avg_convergence_duration_ms()` | `AntiEntropyWorker` | Average time to converge |
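
For readers wiring these into a dashboard, the sketch below shows one plausible way the two convergence metrics could relate. `ConvergenceMetrics`, its fields, the `record` method, and the `u64` return types are assumptions; only the two accessor names come from the table above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counters behind the two convergence accessors.
#[derive(Default)]
pub struct ConvergenceMetrics {
    count: AtomicU64,
    total_ms: AtomicU64,
}

impl ConvergenceMetrics {
    /// Record one divergence-to-convergence interval.
    pub fn record(&self, duration_ms: u64) {
        self.count.fetch_add(1, Ordering::Relaxed);
        self.total_ms.fetch_add(duration_ms, Ordering::Relaxed);
    }

    pub fn convergence_count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }

    /// Mean duration, or 0 before any convergence event has been recorded.
    pub fn avg_convergence_duration_ms(&self) -> u64 {
        let n = self.convergence_count();
        if n == 0 {
            0
        } else {
            self.total_ms.load(Ordering::Relaxed) / n
        }
    }
}
```

A lifetime average like this hides regressions over long uptimes, so alerting on a windowed value may be preferable.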

## See Also

- [Architecture Overview](../architecture.md)
- [Distributed Write Path](research/distributed-write-path.md)
- [Data Structures](data-structures.md)
- [Roadmap](../roadmap.md)