# StemeDB Consistency Model

This document describes the distributed consistency guarantees provided by StemeDB, the mechanisms that enforce them, and what is explicitly **not** guaranteed.

## Six Core Properties

| Property | Guarantee | Mechanism | Test Evidence |
|----------|-----------|-----------|---------------|
| **Eventual Convergence** | All replicas converge to identical state | CRDT merge + Anti-entropy sync | `stemedb-sync/tests/convergence.rs` |
| **Causal Ordering** | Operations respect happens-before | HLC timestamps + `HlcRecencyLens` | `stemedb-lens/src/hlc_recency.rs` |
| **Partition Tolerance** | Writes succeed during network partitions | Leaderless replication | `stemedb-cluster/tests/partition_tolerance.rs` |
| **Availability** | Reads/writes succeed if any replica is up | Any-replica acceptance | `stemedb-cluster/tests/availability.rs` |
| **Durability** | Committed writes survive crashes | WAL with fsync | `stemedb-wal/src/lib.rs` |
| **Conflict Resolution** | Deterministic winner selection | Lens-based resolution | `stemedb-lens/src/*.rs` |

## What IS Guaranteed

### 1. Eventual Convergence

All nodes eventually contain the same set of assertions. After network partitions heal and anti-entropy sync completes, every replica has identical data.

**Mechanism:**
- CRDT (Conflict-free Replicated Data Type) stores for assertions and votes
- Merkle tree-based diff detection for efficient sync
- Anti-entropy worker periodically syncs with peers

**Timing:**
- Convergence typically occurs within seconds of partition healing
- Configurable `anti_entropy_interval` (default: 5 seconds)
- Metrics available via `AntiEntropyWorker::avg_convergence_duration_ms()`
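Convergence rests on the merge operation being commutative and idempotent. As a minimal sketch, assuming assertions can be modelled as a grow-only set of IDs (the `AssertionSet` type and `AssertionId` alias below are hypothetical and not taken from the StemeDB codebase), a state-based CRDT merge might look like this:

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for an assertion's content hash.
pub type AssertionId = u64;

/// A grow-only set (G-Set), the simplest state-based CRDT.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct AssertionSet {
    ids: BTreeSet<AssertionId>,
}

impl AssertionSet {
    /// Records a local write. Entries are never removed.
    pub fn insert(&mut self, id: AssertionId) {
        self.ids.insert(id);
    }

    /// Merges a peer's state into this one by set union.
    /// Union is commutative, associative, and idempotent, so replicas
    /// reach the same state regardless of sync order or repeated syncs.
    pub fn merge(&mut self, other: &AssertionSet) {
        self.ids.extend(other.ids.iter().copied());
    }
}
```

Merging A into B and B into A yields equal sets, and merging the same peer state twice changes nothing, which is why repeated anti-entropy cycles are safe.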

### 2. Causal Ordering

Operations that happen-before other operations are ordered correctly. If assertion A causally precedes assertion B, any node that has B also has A.

**Mechanism:**
- Hybrid Logical Clock (HLC) timestamps on every assertion
- HLC propagates through anti-entropy sync
- `HlcRecencyLens` resolves "most recent" deterministically using HLC, not wall clock

**Key insight:** Wall clocks can drift between nodes. HLC combines physical time with logical ordering to provide a total order even when clocks disagree.

### 3. Partition Tolerance

Writes continue on both sides of a network partition. No data is lost: writes from both sides survive and merge after healing.

**Mechanism:**
- Leaderless replication: any replica accepts writes
- Append-only storage: concurrent writes coexist instead of conflicting
- Lens resolution at read time, not write time

### 4. High Availability

If any replica for a shard is reachable, reads and writes succeed. There is no single point of failure.

**Mechanism:**
- Multiple replicas per shard (configurable replication factor)
- Writes accepted by any replica
- Reads served by any replica with current data

### 5. Durability

Once a write is acknowledged, it survives process crashes and restarts.

**Mechanism:**
- Write-ahead log (WAL) with fsync
- Assertion data written to durable storage before acknowledgment
- Crash recovery replays uncommitted WAL entries
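The "fsync before acknowledgment" rule can be sketched with the standard library alone. The `Wal` type and its length-prefixed record layout below are assumptions made for illustration; they are not the `stemedb-wal` format.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical minimal WAL: length-prefixed records appended to one file.
pub struct Wal {
    file: File,
}

impl Wal {
    /// Opens the log for appending, creating it if it does not exist.
    pub fn open(path: &Path) -> io::Result<Wal> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Wal { file })
    }

    /// Appends one record and returns only after the OS reports the data
    /// as flushed. The caller acknowledges the write after this returns Ok.
    pub fn append(&mut self, payload: &[u8]) -> io::Result<()> {
        if payload.len() > u32::MAX as usize {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "record too large"));
        }
        let len = payload.len() as u32;
        self.file.write_all(&len.to_le_bytes())?;
        self.file.write_all(payload)?;
        // fsync: flush file contents and metadata to the storage device.
        self.file.sync_all()?;
        Ok(())
    }
}
```

Syncing on every append trades throughput for the guarantee; batching several records per `sync_all` call is a common variation, at the cost of delaying acknowledgments.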

### 6. Deterministic Conflict Resolution

When multiple assertions exist for the same subject+predicate, all nodes resolve to the same winner.

**Mechanism:**
- Lenses provide resolution strategies:
  - `HlcRecencyLens`: Latest HLC timestamp wins (total order)
  - `ConsensusLens`: Most common value wins
  - `ConfidenceLens`: Highest confidence wins
  - `TrustAwareAuthorityLens`: Weighted by source reputation
- Tiebreaker: `source_hash` provides deterministic ordering when primary criteria match
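To illustrate why a recency lens with a tiebreaker is deterministic, here is a minimal sketch. The `Assertion` struct, its fields, and `resolve_by_recency` are hypothetical; in particular, treating the larger `source_hash` as the winner is an assumption, since the text above only states that `source_hash` gives a deterministic order.

```rust
/// Hypothetical, simplified assertion; field names are illustrative only.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Assertion {
    pub value: String,
    /// HLC timestamp flattened to one comparable integer for brevity.
    pub hlc: u64,
    pub source_hash: [u8; 4],
}

/// Picks the assertion with the greatest (hlc, source_hash) pair.
/// Assuming that pair is unique per assertion, the result does not depend
/// on the order of `candidates`, so every replica holding the same set
/// selects the same winner.
pub fn resolve_by_recency(candidates: &[Assertion]) -> Option<&Assertion> {
    candidates.iter().max_by_key(|a| (a.hlc, a.source_hash))
}
```

Because resolution is a pure function of the candidate set, it can run at read time on any replica without coordination.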

## What is NOT Guaranteed

### 1. Linearizability

StemeDB is **not** linearizable. A write on node A is not immediately visible on node B.

**Why:** Linearizability requires synchronous replication, which conflicts with partition tolerance and availability.

**Workaround:** Use HLC timestamps to establish order. If your use case requires seeing your own writes immediately, read from the node you wrote to.

### 2. Read-Your-Writes (Cross-Node)

After writing to node A, a read from node B may not see the write immediately.

**Why:** Anti-entropy sync is asynchronous to optimize for availability.

**Workaround:**
- Sticky sessions (always read from the node you wrote to)
- Wait for anti-entropy sync to complete (typically <10 seconds)
- Use gossip for faster propagation of new writes

### 3. Snapshot Isolation

Concurrent reads may see different subsets of data.

**Why:** There is no global transaction coordinator.

**Workaround:** For consistent snapshots, use epoch-aware lenses that filter to a specific epoch.

### 4. Strong Consistency

There is no guarantee that all nodes see operations in the same order at the same time.

**Why:** This would require cross-node coordination on every operation, which by the CAP theorem means giving up availability during a partition.

## Clock Skew Handling

### HLC Design

HLC timestamps combine:
- **Physical time:** NTP64 format (nanoseconds since Unix epoch)
- **Logical counter:** Disambiguates events with same physical time
- **Node ID:** Breaks ties when counter and time match
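The three components can be sketched as a struct whose derived ordering compares them in that sequence. The `Hlc` type, its field widths, and the `tick` and `observe` method names are assumptions for illustration; the update rules follow the standard hybrid logical clock algorithm, not necessarily StemeDB's exact implementation.

```rust
/// Illustrative HLC timestamp. Field order matters: the derived `Ord`
/// compares physical time first, then the counter, then the node ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub physical: u64, // wall-clock reading, in whatever unit the clock uses
    pub logical: u32,
    pub node_id: u16,
}

impl Hlc {
    /// Timestamp for a local event, given the current wall-clock reading.
    pub fn tick(&mut self, wall: u64) -> Hlc {
        if wall > self.physical {
            self.physical = wall;
            self.logical = 0;
        } else {
            // Wall clock stalled or went backwards: advance the counter instead.
            self.logical += 1;
        }
        *self
    }

    /// Advances the local clock past a timestamp received from a peer.
    pub fn observe(&mut self, remote: Hlc, wall: u64) -> Hlc {
        let max_physical = wall.max(self.physical).max(remote.physical);
        self.logical = if max_physical == self.physical && max_physical == remote.physical {
            self.logical.max(remote.logical) + 1
        } else if max_physical == self.physical {
            self.logical + 1
        } else if max_physical == remote.physical {
            remote.logical + 1
        } else {
            0
        };
        self.physical = max_physical;
        *self
    }
}
```

After `observe`, every later local timestamp compares greater than the remote one, even if the local wall clock is behind, which is how ordering survives clock skew.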

### Skew Detection

The system detects clock skew exceeding 500 ms:
- `detect_clock_skew()` compares local and remote HLC timestamps
- `clock_skew_events` metric tracks skew occurrences
- Warning logged when skew exceeds threshold
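The threshold check itself is small. The sketch below is not the actual `detect_clock_skew()` function, whose signature is not shown here; `skew_exceeding_threshold` is a hypothetical helper that assumes both physical components are available in milliseconds.

```rust
/// Threshold stated in this document: skew above 500 ms is reported.
pub const SKEW_THRESHOLD_MS: u64 = 500;

/// Returns the observed skew in milliseconds when it exceeds the threshold.
/// Both arguments are the physical components of HLC timestamps in ms
/// (an assumption; the real `detect_clock_skew()` may take other inputs).
pub fn skew_exceeding_threshold(local_ms: u64, remote_ms: u64) -> Option<u64> {
    // abs_diff avoids underflow whichever clock is ahead.
    let skew = local_ms.abs_diff(remote_ms);
    if skew > SKEW_THRESHOLD_MS {
        Some(skew)
    } else {
        None
    }
}
```

A caller would increment a counter such as `clock_skew_events` and log a warning whenever this returns `Some`.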

### Recommendations

1. **Use NTP:** All nodes should synchronize clocks via NTP
2. **Monitor skew:** Track `clock_skew_events` metric
3. **Tolerate drift:** HLC handles moderate skew (up to about a second) gracefully
4. **Investigate large skew:** Skew > 1 second may indicate NTP misconfiguration

## Recovery Scenarios

### Partition Heal

1. Anti-entropy detects divergent Merkle roots
2. Diff computed to find missing assertions
3. Missing assertions fetched and merged via CRDT
4. Local HLC updated from remote timestamps
5. Convergence achieved when roots match

**Metric:** `avg_convergence_duration_ms()` tracks time from divergence detection to convergence.
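Steps 1 to 3 and 5 can be sketched as one pull-based sync round. This is a simplification under stated assumptions: a single digest over the sorted IDs stands in for the Merkle root, assertions are plain `u64` IDs, and `root_digest` and `sync_from_peer` are hypothetical names. The HLC update of step 4 is omitted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Stand-in for a Merkle root: one digest over the sorted assertion IDs.
/// A real Merkle tree would also narrow the diff to the subtrees that differ.
pub fn root_digest(ids: &BTreeSet<u64>) -> u64 {
    let mut hasher = DefaultHasher::new();
    for id in ids {
        id.hash(&mut hasher);
    }
    hasher.finish()
}

/// One pull-based sync round. Returns how many assertions were fetched.
pub fn sync_from_peer(local: &mut BTreeSet<u64>, peer: &BTreeSet<u64>) -> usize {
    // Step 1: matching roots mean there is nothing to do.
    if root_digest(local) == root_digest(peer) {
        return 0;
    }
    // Step 2: find assertions the peer has and this node lacks.
    let missing: Vec<u64> = peer.difference(local).copied().collect();
    // Step 3: merge them; set insertion is idempotent.
    local.extend(missing.iter().copied());
    missing.len()
}
```

Running the round in both directions brings two diverged replicas to the same set, at which point their digests match (step 5) and later rounds return 0.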

### Node Crash

1. On restart, WAL is replayed
2. Uncommitted entries are re-applied
3. Merkle tree rebuilt from stored assertions
4. Anti-entropy resumes syncing with peers

### Corrupt WAL

1. Corrupted entries detected via checksum
2. Valid entries up to corruption point recovered
3. Node syncs missing data from peers via anti-entropy
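Steps 1 and 2 can be sketched as a scan that stops at the first record whose checksum does not match. The record layout (length, checksum, payload) and the toy `checksum` function are assumptions for illustration; a production WAL would more likely use CRC32 or a similar code.

```rust
/// Toy checksum used only for illustration.
fn checksum(payload: &[u8]) -> u32 {
    payload
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

/// Encodes one record as: length (u32 LE) | checksum (u32 LE) | payload.
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns every valid record before the first truncated or corrupt one.
pub fn recover(log: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= 8 {
        let len = u32::from_le_bytes([log[pos], log[pos + 1], log[pos + 2], log[pos + 3]]) as usize;
        let sum = u32::from_le_bytes([log[pos + 4], log[pos + 5], log[pos + 6], log[pos + 7]]);
        let start = pos + 8;
        if log.len() - start < len {
            break; // truncated tail
        }
        let payload = &log[start..start + len];
        if checksum(payload) != sum {
            break; // corruption point: nothing after it is trusted
        }
        records.push(payload.to_vec());
        pos = start + len;
    }
    records
}
```

Anything after the corruption point is discarded locally and, per step 3, is expected to come back from peers through anti-entropy.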

## Testing Evidence

All consistency properties are verified by automated tests:

| Test File | Property Tested |
|-----------|-----------------|
| `crates/stemedb-sync/tests/convergence.rs` | Two-node convergence, overlapping data, lens determinism, merge commutativity |
| `crates/stemedb-cluster/tests/partition_tolerance.rs` | Write success during partition, post-partition convergence, concurrent writes |
| `crates/stemedb-cluster/tests/availability.rs` | Read/write on any replica, node failure isolation, quorum availability |
| `crates/stemedb-lens/src/hlc_recency.rs` | HLC ordering, clock skew scenarios, deterministic tiebreakers |

Run all consistency tests:

```bash
cargo test -p stemedb-sync --test convergence
cargo test -p stemedb-cluster --test partition_tolerance
cargo test -p stemedb-cluster --test availability
cargo test -p stemedb-lens -- hlc_recency
```

## Metrics Reference

| Metric | Location | Description |
|--------|----------|-------------|
| `sync_cycles` | `AntiEntropyWorker` | Completed sync cycles |
| `sync_failures` | `AntiEntropyWorker` | Failed sync attempts |
| `assertions_synced` | `AntiEntropyWorker` | Total assertions merged |
| `hlc_updates` | `AntiEntropyWorker` | Times local HLC advanced from remote |
| `clock_skew_events` | `AntiEntropyWorker` | Times skew exceeded 500ms |
| `convergence_count()` | `AntiEntropyWorker` | Number of convergence events |
| `avg_convergence_duration_ms()` | `AntiEntropyWorker` | Average time to converge |

## See Also

- [Architecture Overview](../architecture.md)
- [Distributed Write Path](research/distributed-write-path.md)
- [Data Structures](data-structures.md)
- [Roadmap](../roadmap.md)