stemedb/docs/research/distributed-write-path.md

# Distributed Write Path Research: CockroachDB/Spanner-Style for Episteme

**Date:** 2026-02-01
**Status:** Research Complete
**Target:** Design simplest correct distributed write path for append-only knowledge graph

---

## Executive Summary

After comprehensive research into Google Spanner, CockroachDB, Hybrid Logical Clocks, and distributed append-only systems, **we can implement a significantly simpler architecture than traditional distributed databases** because:

1. **Assertions are immutable** - no read-write conflicts
2. **Content-addressed** - automatic deduplication
3. **Read-time conflict resolution** - no distributed transactions needed
4. **Natural partition keys** - subject/predicate provide excellent sharding boundaries

**Recommendation:** Use **Hybrid Logical Clocks (HLC) + Single-Writer-Per-Partition + Simple Raft Replication** instead of full Spanner-style distributed transactions.

---

## 1. Google Spanner's Architecture

### Core Components

**TrueTime API**
- GPS + atomic clock synchronization providing <1ms uncertainty in 99th percentile
- Returns time interval `[earliest, latest]` guaranteed to contain actual time
- Enables external consistency without distributed coordination

**External Consistency Guarantee**
- If transaction T1 completes before T2 starts committing, then `timestamp(T2) > timestamp(T1)`
- Achieved via **commit-wait**: system waits until `TT.now().earliest > commit_timestamp`
- Guarantees serializable transactions visible globally

**Paxos Groups**
- Data partitioned into ranges, each range replicated via Paxos
- Writes flow: Client → Gateway → Paxos Leader → Replicas
- Read-only transactions skip locks (use snapshot isolation)

**Key Insight for Episteme:** Spanner's complexity stems from mutable records requiring distributed transactions. We don't have mutable records.

**Sources:**
- [Spanner TrueTime Documentation](https://docs.cloud.google.com/spanner/docs/true-time-external-consistency)
- [Strict Serializability in Spanner](https://cloud.google.com/blog/products/databases/strict-serializability-and-external-consistency-in-spanner)
- [Life of Spanner Reads & Writes](https://docs.cloud.google.com/spanner/docs/whitepapers/life-of-reads-and-writes)

---

## 2. CockroachDB's Distributed Write Path

### Architecture Layers

```
Client Request
    ↓
Transaction Coordinator (TxnCoordSender)
    ↓
Distribution Layer (DistSender) - routes to range leaseholder
    ↓
Replication Layer (Raft) - consensus among replicas
    ↓
Storage Layer (RocksDB) - persistent KV store
```

### Key Optimizations

**Parallel Commits**
- Reduces commit latency by 50% (from 2 consensus rounds to 1)
- Introduces `STAGING` transaction status
- Writes final batch and transaction record in parallel
- Improved p50 latency by 47%, throughput by 72% in benchmarks

**Transaction Pipelining**
- Combines with parallel commits for near-theoretical minimum latency
- Total latency = sum(read latencies) + 1 consensus round
- Trade-off: Extra WAL write for STAGING status

**MultiRaft**
- Each range has own consensus group
- Nodes participate in thousands of concurrent Raft groups
- Leaseholder serves reads, coordinates writes

**DistSender Intelligence**
- Batches requests to same range
- Routes to current leaseholder (cached)
- Handles range splits/merges transparently

**Key Insight for Episteme:** CockroachDB's complexity handles multi-statement ACID transactions across ranges. Single assertion writes are much simpler.

**Sources:**
- [Parallel Commits Protocol](https://www.cockroachlabs.com/blog/parallel-commits/)
- [Transaction Pipelining](https://www.cockroachlabs.com/blog/transaction-pipelining/)
- [Replication Layer Documentation](https://www.cockroachlabs.com/docs/stable/architecture/replication-layer)
- [Life of a Distributed Transaction](https://www.cockroachlabs.com/docs/stable/architecture/life-of-a-distributed-transaction)

---

## 3. Hybrid Logical Clocks (HLC)

### Why HLC Instead of TrueTime

TrueTime requires GPS/atomic clocks (expensive infrastructure). HLC achieves similar guarantees using NTP + logical counters.

### HLC Structure

```rust
struct HLC {
    physical: u64,  // Wall clock time (NTP-synchronized)
    logical: u64,   // Counter for same physical timestamp
    node_id: u64,   // Tie-breaker for total ordering
}
```

### Properties

1. **Causality Preservation:** If event A happens-before event B, then `HLC(A) < HLC(B)`
2. **Close to Physical Time:** Physical component stays within NTP bounds (~10-100ms typically)
3. **Monotonic:** Timestamps never decrease, even with clock skew
4. **No Clock Updates:** HLC reads but never modifies system clock

### How It Works

```rust
// On local event
fn generate_timestamp() -> HLC {
    let now = system_time();
    if now > last_hlc.physical {
        HLC { physical: now, logical: 0, node_id }
    } else {
        HLC { physical: last_hlc.physical, logical: last_hlc.logical + 1, node_id }
    }
}

// On receiving message with remote timestamp
fn update_on_receive(remote_hlc: HLC) -> HLC {
    let now = system_time();
    let physical = max(now, remote_hlc.physical, last_hlc.physical);
    let logical = if physical == last_hlc.physical {
        last_hlc.logical + 1
    } else if physical == remote_hlc.physical {
        remote_hlc.logical + 1
    } else {
        0
    };
    HLC { physical, logical, node_id }
}
```

### Clock Skew Handling

- CockroachDB default max offset: 500ms
- Nodes self-evict if skew exceeds threshold
- NTP typically achieves <10ms on LAN, <100ms on WAN

### Rust Implementations

1. **uhlc-rs** (atolab/uhlc-rs)
   - Production-ready, used in distributed systems
   - Unique identifier per clock
   - Monotonic timestamps close to physical time

2. **hlc-rs** (tbg/hlc-rs)
   - Simpler implementation
   - Good starting point

3. **hlc-gen**
   - Lock-free implementation
   - `#[no_std]` compatible
   - High throughput

**Recommendation:** Start with `uhlc-rs` - battle-tested and well-documented.

**Sources:**
- [Hybrid Logical Clocks Paper](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf)
- [uhlc-rs Implementation](https://github.com/atolab/uhlc-rs)
- [HLC in CockroachDB](https://www.yugabyte.com/blog/evolving-clock-sync-for-distributed-databases/)
- [Clock Skew Management](https://sergeiturukin.com/2017/06/26/hybrid-logical-clocks.html)

---

## 4. The Append-Only Advantage

### What We Skip

**No Distributed Transactions**
- Single assertion write = single consensus operation
- No two-phase commit
- No transaction coordinator
- No deadlock detection

**No Read-Write Conflicts**
- Assertions never mutate
- No need for pessimistic locking
- No MVCC tombstones
- Reads never block writes

**No Transaction Isolation Complexity**
- No serializable snapshot isolation
- No read-write conflict tracking
- No transaction retry logic

**Automatic Deduplication**
- Content addressing: identical assertions have same ID
- Write is idempotent: `store(hash(content), content)`
- Replaying writes is safe

### What We Keep

**Ordering Guarantees**
- Supersession chains require causal ordering
- HLC provides total ordering across cluster
- Example: Assertion A2 supersedes A1, must ensure `timestamp(A2) > timestamp(A1)`

**Durability**
- Still need WAL + fsync
- Still need replication for fault tolerance
- Still need crash recovery

**Consistency**
- Raft consensus ensures replicas agree
- Leader election on failure
- No split-brain scenarios

### Comparison: Full Spanner vs Episteme

| Feature | Spanner/CockroachDB | Episteme (Append-Only) |
|---------|---------------------|------------------------|
| Write path | TxnCoordSender → DistSender → Raft | Direct → Raft |
| Transaction coordinator | Required | Not needed |
| Distributed locks | Yes (pessimistic) | No locks needed |
| Two-phase commit | Yes | No |
| Intent resolution | Complex | Not applicable |
| Read-write conflicts | Must detect and resolve | Cannot occur |
| Write amplification | High (intents + resolution) | Low (direct append) |
| Latency | 2-3 network hops | 1 network hop |

**Key Insight:** Immutability eliminates entire classes of distributed systems problems.

**Sources:**
- [Append-Only Logs](https://medium.com/@komalshehzadi/append-only-logs-the-immutable-diary-of-data-58c36a871c7c)
- [Immutability Changes Everything (ACM)](https://queue.acm.org/detail.cfm?id=2884038)
- [The Rise of Immutable Data Stores](https://www.odbms.org/2015/10/the-rise-of-immutable-data-stores/)
- [Data Replication in Distributed Systems](https://candost.blog/books/data-replication-in-distributed-systems/)

---

## 5. Sharding Strategy

### Options Analysis

#### 5.1 Hash-Based Sharding (by Assertion ID)

```
Shard = BLAKE3(assertion_content) % num_shards
```

**Pros:**
- Uniform distribution
- No hotspots
- Simple implementation

**Cons:**
- Related assertions (same subject) scattered across shards
- Query "all claims about subject X" requires scatter-gather
- No data locality

**Verdict:** ❌ Poor fit for knowledge graph queries

---

#### 5.2 Range-Based Sharding (by Subject)

```
Shard_A: subjects [aaa, bbb)
Shard_B: subjects [bbb, ccc)
Shard_C: subjects [ccc, zzz]
```

**Pros:**
- All assertions about subject X on same shard
- Range queries efficient
- Good for "tell me everything about X"

**Cons:**
- Hotspot subjects (e.g., "USA", "COVID-19")
- Manual range assignment
- Complex rebalancing

**Verdict:** ⚠️ Good for queries, problematic for hot subjects

---

#### 5.3 Hash-Based by Subject (Recommended)

```
Shard = hash(subject_uri) % num_shards
```

**Pros:**
- All assertions about subject X on same shard
- Automatic load distribution
- No manual range management
- Simple shard assignment

**Cons:**
- Predicate-based queries still scatter-gather
- Rebalancing requires re-hashing (use consistent hashing)

**Verdict:** ✅ Best balance for Episteme

---

#### 5.4 Composite: Subject + Predicate Hash

```
Shard = hash(subject_uri || predicate_uri) % num_shards
```

**Pros:**
- Even more specific data locality
- "Subject X with predicate P" queries hit single shard

**Cons:**
- "All claims about subject X" now scatters
- More complex query routing

**Verdict:** ⚠️ Only if predicate-specific queries dominate workload

---

### Recommended Strategy

**Primary: Hash-based sharding by Subject**

```rust
fn assign_shard(assertion: &Assertion, num_shards: u32) -> u32 {
    let hash = blake3::hash(assertion.subject.as_bytes());
    u32::from_le_bytes(hash.as_bytes()[0..4].try_into().unwrap()) % num_shards
}
```

**With Consistent Hashing for Rebalancing:**

Use jump hash or consistent hash ring to minimize data movement when adding/removing shards.

```rust
// Jump hash provides minimal key redistribution
fn jump_hash(key: u64, num_buckets: u32) -> u32 {
    let mut b = -1i64;
    let mut j = 0i64;
    let mut k = key;
    while j < num_buckets as i64 {
        b = j;
        k = k.wrapping_mul(2862933555777941757).wrapping_add(1);
        j = ((b.wrapping_add(1)) as f64 * (f64::powi(2.0, 31) / ((k >> 33).wrapping_add(1)) as f64)) as i64;
    }
    b as u32
}
```

**Sources:**
- [Database Sharding Strategies](https://engineeringatscale.substack.com/p/system-design-concepts-dive-deep)
- [Sharding Types (PlanetScale)](https://planetscale.com/blog/types-of-sharding)
- [Consistent Hashing Guide](https://www.toptal.com/big-data/consistent-hashing)
- [YugabyteDB Sharding Analysis](https://www.yugabyte.com/blog/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/)
- [RDF Triple Store Partitioning](https://www.iaria.org/conferences2018/filesDBKDA18/IztokSavnik_Tutorial_3store-arch.pdf)

---

## 6. Write Amplification and Performance

### Raft Write Overhead

**Standard Raft Write Path:**
1. Client sends write to leader
2. Leader appends to local WAL (fsync)
3. Leader sends AppendEntries RPC to followers
4. Followers append to local WAL (fsync)
5. Followers respond to leader
6. Leader commits when quorum reached
7. Leader responds to client

**Cost Analysis:**
- **Network:** 1 round-trip (leader ↔ followers)
- **Disk:** 1 fsync at leader + (replication_factor - 1) fsyncs at followers
- **Write Amplification:** `replication_factor` (typically 3x-5x)

### Batching Optimizations

**Batch AppendEntries:**
Instead of sending RPC per write, batch N writes into single RPC.

```rust
// Naive: 1000 writes = 1000 RPCs
for write in writes {
    leader.append_entries(vec![write]).await?;
}

// Optimized: 1000 writes = 1 RPC
leader.append_entries(writes).await?;
```

**Performance Impact (from YugabyteDB):**
- Throughput: 2x improvement with batching
- Latency: 30% reduction for p99

**Write Buffering:**
Buffer writes in memory, flush when:
- Buffer reaches size threshold (e.g., 1MB)
- Time threshold reached (e.g., 10ms)
- Explicit flush request

**Trade-off:** Latency vs throughput
- Small batches (10ms): Lower latency, moderate throughput
- Large batches (100ms): Higher latency, maximum throughput

### Pipeline Optimization

**CockroachDB's Parallel Commits Approach:**

Standard 2-phase commit:
```
Write intents → Commit record → Resolve intents
   (1 RTT)         (1 RTT)         (async)
Total: 2 RTT
```

Parallel commits:
```
Write intents + Commit record (parallel)
   (1 RTT)
Total: 1 RTT
```

**Episteme Equivalent:**
We don't have intents, but we can pipeline:

```rust
// Sequential: 2 consensus rounds
let wal_entry = wal.append(assertion).await?;  // Round 1
let index_update = index.update(assertion).await?;  // Round 2

// Pipelined: 1 consensus round
let (wal_entry, index_update) = tokio::join!(
    wal.append(assertion),
    index.update(assertion),
);
```

### Expected Performance Numbers

**Local Write (Single Node):**
- Latency: 1-5ms (dominated by fsync)
- Throughput: 10K-50K writes/sec (NVMe SSD)

**Distributed Write (3 Replicas, Same DC):**
- Latency: 5-15ms (1 RTT + fsync)
- Throughput: 5K-20K writes/sec per shard

**Distributed Write (3 Replicas, Multi-Region):**
- Latency: 50-200ms (depends on geography)
- Throughput: 1K-5K writes/sec per shard

**Batched Writes (100 assertions/batch):**
- Throughput: 100K-500K assertions/sec per shard

**With 100 Shards:**
- Aggregate throughput: 10M-50M assertions/sec

**Sources:**
- [Raft Consensus Algorithm](https://raft.github.io/)
- [Fast Raft Optimizations](https://arxiv.org/html/2506.17793v1)
- [YugabyteDB Write Buffering](https://dev.to/yugabyte/write-buffering-to-reduce-raft-consensus-latency-in-yugabytedb-2dg6)
- [Improved Raft with Batch Processing](https://link.springer.com/chapter/10.1007/978-981-19-2456-9_44)

---

## 7. Concrete Architecture Proposal

### The Simplest Correct Design

```
┌─────────────────────────────────────────────────────────┐
│                     Agent (Writer)                       │
│                                                          │
│  1. Generate Assertion                                   │
│  2. Sign with Ed25519                                    │
│  3. Compute Content Hash (BLAKE3)                        │
└────────────────┬────────────────────────────────────────┘
                 │ HTTP POST /assertions
                 ▼
┌─────────────────────────────────────────────────────────┐
│                   Gateway Node (Any)                     │
│                                                          │
│  1. Validate signature                                   │
│  2. Compute shard: hash(subject) % num_shards            │
│  3. Route to shard leader                                │
└────────────────┬────────────────────────────────────────┘
                 │ gRPC → Shard Leader
                 ▼
┌─────────────────────────────────────────────────────────┐
│               Shard N (3-5 Replicas)                     │
│                                                          │
│  ┌──────────────────────────────────────────┐           │
│  │           Leader Node                    │           │
│  │                                          │           │
│  │  1. Generate HLC timestamp               │           │
│  │  2. Append to WAL (fsync)                │           │
│  │  3. Propose to Raft group                │◄──────────┤── Raft RPCs
│  │  4. Wait for quorum                      │           │
│  │  5. Apply to KV store                    │           │
│  │  6. Update indexes (async)               │           │
│  │  7. Respond to gateway                   │           │
│  └──────────────────────────────────────────┘           │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  Follower 1  │  │  Follower 2  │  │  Follower 3  │  │
│  │              │  │              │  │              │  │
│  │ - Replicate  │  │ - Replicate  │  │ - Replicate  │  │
│  │ - Apply      │  │ - Apply      │  │ - Apply      │  │
│  │ - Ready for  │  │ - Ready for  │  │ - Ready for  │  │
│  │   failover   │  │   failover   │  │   failover   │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘
```

### Component Breakdown

#### 7.1 Gateway Layer (Stateless)

**Responsibilities:**
- Accept HTTP/gRPC requests
- Authenticate agents (verify Ed25519 signatures)
- Route requests to correct shard
- Cache shard topology
- Return responses to clients

**Implementation:**
```rust
use axum::{Router, routing::post};
use blake3::Hash;

struct Gateway {
    shard_map: Arc<RwLock<ShardMap>>,
    http_client: reqwest::Client,
}

async fn handle_assertion(
    State(gateway): State<Gateway>,
    Json(assertion): Json<Assertion>,
) -> Result<StatusCode, ApiError> {
    // 1. Validate signature
    assertion.verify_signature()?;

    // 2. Compute shard
    let shard_id = compute_shard(&assertion.subject, gateway.shard_map.read().await.num_shards);

    // 3. Find leader
    let leader_addr = gateway.shard_map.read().await.get_leader(shard_id)?;

    // 4. Forward to leader
    gateway.http_client
        .post(format!("{}/internal/append", leader_addr))
        .json(&assertion)
        .send()
        .await?;

    Ok(StatusCode::CREATED)
}

fn compute_shard(subject: &str, num_shards: u32) -> u32 {
    let hash = blake3::hash(subject.as_bytes());
    u32::from_le_bytes(hash.as_bytes()[0..4].try_into().unwrap()) % num_shards
}
```

**Scaling:** Run many gateway instances behind load balancer.

---

#### 7.2 Shard Leader (Stateful)

**Responsibilities:**
- Accept writes for this shard's key range
- Generate HLC timestamps
- Append to WAL
- Replicate via Raft
- Apply committed entries to storage
- Update indexes
- Handle leader election

**Implementation:**
```rust
use uhlc::HLC;
use tikv_raft::{RawNode, Config};

struct ShardLeader {
    shard_id: u32,
    hlc: Arc<Mutex<HLC>>,
    wal: WriteAheadLog,
    kv_store: KVStore,
    raft_node: RawNode<MemStorage>,
    index_store: IndexStore,
}

impl ShardLeader {
    async fn append_assertion(&self, assertion: Assertion) -> Result<AssertionId, StorageError> {
        // 1. Generate HLC timestamp
        let timestamp = self.hlc.lock().await.new_timestamp();

        // 2. Create versioned assertion
        let versioned = VersionedAssertion {
            assertion,
            timestamp,
            shard_id: self.shard_id,
        };

        // 3. Serialize
        let bytes = serialize(&versioned)?;

        // 4. Propose to Raft (this does WAL append + replication)
        let propose_id = self.raft_node.propose(vec![], bytes)?;

        // 5. Wait for commit (via channel or polling)
        self.wait_for_commit(propose_id).await?;

        // 6. Apply to KV store (idempotent)
        let assertion_id = AssertionId::from_hash(blake3::hash(&bytes));
        self.kv_store.put(assertion_id.as_bytes(), &bytes)?;

        // 7. Update indexes (async, best-effort)
        tokio::spawn({
            let index_store = self.index_store.clone();
            let versioned = versioned.clone();
            async move {
                let _ = index_store.index_assertion(&versioned).await;
            }
        });

        Ok(assertion_id)
    }
}
```

---

#### 7.3 Raft Replication (tikv/raft-rs)

**Configuration:**
```rust
use tikv_raft::{Config, RawNode, storage::MemStorage};

fn create_raft_node(node_id: u64, peers: Vec<u64>) -> RawNode<MemStorage> {
    let config = Config {
        id: node_id,
        election_tick: 10,
        heartbeat_tick: 3,
        max_size_per_msg: 1024 * 1024, // 1MB
        max_inflight_msgs: 256,
        ..Default::default()
    };

    let storage = MemStorage::new();
    RawNode::new(&config, storage, peers).unwrap()
}
```

**Raft Message Handling:**
```rust
async fn run_raft_loop(raft_node: Arc<Mutex<RawNode<MemStorage>>>) {
    let mut ticker = tokio::time::interval(Duration::from_millis(100));

    loop {
        ticker.tick().await;

        let mut node = raft_node.lock().await;
        node.tick();

        // Process ready entries
        if node.has_ready() {
            let ready = node.ready();

            // 1. Persist entries to WAL
            if !ready.entries().is_empty() {
                persist_entries(ready.entries()).await;
            }

            // 2. Send messages to peers
            for msg in ready.messages {
                send_to_peer(msg).await;
            }

            // 3. Apply committed entries
            for entry in ready.committed_entries {
                apply_entry(entry).await;
            }

            // 4. Advance Raft state
            node.advance(ready);
        }
    }
}
```

---

#### 7.4 Storage Layer

**Write-Ahead Log:**
```rust
struct WriteAheadLog {
    file: File,
    current_offset: u64,
}

impl WriteAheadLog {
    async fn append(&mut self, entry: &RaftEntry) -> Result<u64, io::Error> {
        let bytes = serialize(entry)?;
        let checksum = crc32c::crc32c(&bytes);

        // Write: [len: u32][checksum: u32][data: bytes]
        let record = [
            &(bytes.len() as u32).to_le_bytes()[..],
            &checksum.to_le_bytes()[..],
            &bytes[..],
        ].concat();

        self.file.write_all(&record).await?;
        self.file.sync_all().await?; // fsync

        let offset = self.current_offset;
        self.current_offset += record.len() as u64;
        Ok(offset)
    }
}
```

**KV Store (RocksDB):**
```rust
use rocksdb::{DB, Options, WriteBatch};

struct KVStore {
    db: Arc<DB>,
}

impl KVStore {
    fn put(&self, key: &[u8], value: &[u8]) -> Result<(), rocksdb::Error> {
        self.db.put(key, value)
    }

    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        self.db.get(key)
    }

    fn batch_put(&self, entries: Vec<(Vec<u8>, Vec<u8>)>) -> Result<(), rocksdb::Error> {
        let mut batch = WriteBatch::default();
        for (key, value) in entries {
            batch.put(key, value);
        }
        self.db.write(batch)
    }
}
```

---

#### 7.5 Index Store (Async Updates)

**Subject Index:**
```rust
struct SubjectIndex {
    db: Arc<DB>,
}

impl SubjectIndex {
    async fn index_assertion(&self, assertion: &VersionedAssertion) -> Result<(), StorageError> {
        let subject_key = format!("subject:{}:{}", assertion.assertion.subject, assertion.timestamp);
        self.db.put(subject_key.as_bytes(), assertion.assertion.id.as_bytes())?;
        Ok(())
    }

    async fn query_subject(&self, subject: &str, limit: usize) -> Result<Vec<AssertionId>, StorageError> {
        let prefix = format!("subject:{}:", subject);
        let mut results = Vec::new();

        let iter = self.db.prefix_iterator(prefix.as_bytes());
        for (_, value) in iter.take(limit) {
            results.push(AssertionId::from_bytes(&value)?);
        }

        Ok(results)
    }
}
```

---

### 7.6 Failure Handling

**Leader Failure:**
1. Followers detect missing heartbeats (election_timeout)
2. Follower transitions to candidate, increments term
3. Candidate requests votes from peers
4. Majority votes → becomes new leader
5. New leader sends heartbeats
6. Gateway detects new leader (via Raft status API)

**Follower Failure:**
1. Leader detects missing ACKs
2. Leader continues with remaining quorum
3. When follower recovers, catches up via log replication

**Network Partition:**
1. Minority partition cannot elect leader (no quorum)
2. Minority partition rejects writes
3. Majority partition continues serving writes
4. When partition heals, minority replays majority's log

**Consistency Guarantee:** Raft ensures linearizability (strongest consistency model).

---

### 7.7 What We Skip

**No Transaction Coordinator:**
- Single assertion = single Raft proposal
- No multi-shard transactions

**No Intent Resolution:**
- CockroachDB writes "intents" then resolves them
- We write final values directly

**No Read-Write Conflict Detection:**
- Assertions are immutable
- No conflict possible

**No Distributed Locks:**
- No lock table
- No deadlock detection

**No MVCC Garbage Collection:**
- Assertions never deleted
- No tombstones to clean up

---

## 8. Rust Crates to Use

### Core Dependencies

```toml
[dependencies]
# Consensus
raft = "0.7"                    # TiKV's Raft implementation
raft-proto = "0.7"              # Protobuf definitions

# Hybrid Logical Clock
uhlc = "0.5"                    # Battle-tested HLC

# Storage
rocksdb = "0.21"                # Embedded KV store
sled = "0.34"                   # Alternative: pure-Rust KV store

# Serialization
rkyv = "0.7"                    # Zero-copy serialization (already using)
blake3 = "1.5"                  # Content addressing (already using)

# Async runtime
tokio = { version = "1.35", features = ["full"] }
tokio-stream = "0.1"

# Networking
tonic = "0.10"                  # gRPC (Raft messages)
axum = "0.7"                    # HTTP gateway (already using)
tower = "0.4"                   # Middleware

# Error handling
anyhow = "1.0"
thiserror = "1.0"

# Observability
tracing = "0.1"                 # Already using
tracing-subscriber = "0.3"
metrics = "0.21"
```

### Optional (Future)

```toml
# If we need service discovery
etcd-client = "0.12"

# If we want built-in consensus
tikv-client = "0.2"             # Use TiKV as storage backend

# If we need distributed tracing
opentelemetry = "0.21"
opentelemetry-jaeger = "0.20"
```

---

## 9. Implementation Phases

### Phase 1: Single-Node with HLC (1-2 weeks)

**Goal:** Prove HLC integration + baseline performance

```rust
// Extend current ingestion with HLC
struct IngestionPipeline {
    hlc: Arc<Mutex<HLC>>,
    wal: WriteAheadLog,
    kv_store: KVStore,
}

impl IngestionPipeline {
    async fn ingest(&self, assertion: Assertion) -> Result<AssertionId, Error> {
        let timestamp = self.hlc.lock().await.new_timestamp();
        let versioned = VersionedAssertion { assertion, timestamp };

        // Existing WAL append logic
        self.wal.append(&versioned).await?;

        // Existing KV store logic
        let id = AssertionId::from_content(&versioned);
        self.kv_store.put(id.as_bytes(), &serialize(&versioned)?)?;

        Ok(id)
    }
}
```

**Tests:**
- HLC monotonicity under concurrent writes
- HLC causality preservation (ingest A, ingest B superseding A, verify timestamps)
- Performance baseline (writes/sec, p99 latency)

---

### Phase 2: Single-Shard with Raft Replication (3-4 weeks)

**Goal:** Fault-tolerant single shard (3 replicas)

**Architecture:**
```
Writer → Leader (with Raft) → Followers (2x)
```

**Implementation:**
1. Wrap `IngestionPipeline` with Raft layer
2. Leader proposes writes to Raft group
3. Followers replicate and apply
4. Leader responds when committed

**Tests:**
- Leader failure: kill leader, verify follower elected, writes continue
- Follower failure: kill follower, verify writes continue with 2/3 quorum
- Recovery: restart failed node, verify catches up
- Split-brain: network partition, verify minority rejects writes

---

### Phase 3: Multi-Shard with Gateway (4-6 weeks)

**Goal:** Horizontal scalability

**Architecture:**
```
Writers → Gateway Layer (stateless, N instances)
              ↓
    ┌─────────┼─────────┐
Shard 0    Shard 1    Shard 2
(3 replicas each)
```

**Implementation:**
1. Gateway routes by `hash(subject) % num_shards`
2. Shard topology stored in etcd or Raft metadata
3. Gateway caches topology, refreshes on errors
4. Dynamic shard addition (consistent hashing)

**Tests:**
- Load balancing: verify writes distributed evenly
- Shard failover: kill entire shard, verify gateway routes to backup
- Rebalancing: add shard, verify rehashing redistributes data
- Hotspot handling: send 10K writes to same subject, verify throughput

---

### Phase 4: Production Hardening (ongoing)

**Observability:**
```rust
// Metrics
metrics::counter!("assertions_ingested_total", "shard" => shard_id.to_string());
metrics::histogram!("raft_commit_duration_ms", commit_duration.as_millis() as f64);
metrics::gauge!("raft_leader_id", leader_id as f64);

// Tracing
#[instrument(skip(self), fields(shard_id = %self.shard_id, assertion_id = %assertion.id))]
async fn append_assertion(&self, assertion: Assertion) -> Result<()> {
    info!("Appending assertion");
    // ...
}
```

**Chaos Testing:**
- Random node kills
- Network latency injection
- Disk failures
- Clock skew simulation

**Performance Tuning:**
- Batch size optimization (find sweet spot for latency vs throughput)
- RocksDB compaction tuning
- Raft message batching
- Index update batching

---

## 10. Key Design Decisions Summary

| Decision | Choice | Rationale |
|----------|--------|-----------|
| **Clock** | Hybrid Logical Clock (HLC) | Causality + physical time without GPS clocks |
| **Consensus** | Raft (tikv/raft-rs) | Simple, proven, excellent Rust implementation |
| **Sharding** | Hash by subject | Data locality for common queries |
| **Replication** | 3 replicas per shard | Tolerate 1 failure (quorum = 2/3) |
| **Storage** | RocksDB (via rust-rocksdb) | LSM tree, proven at scale, good Rust bindings |
| **Serialization** | rkyv (already using) | Zero-copy, fastest option |
| **Transactions** | None (single-assertion writes) | Append-only eliminates need |
| **Gateway** | Stateless HTTP/gRPC | Easy to scale horizontally |

---

## 11. Performance Expectations

### Single Shard (3 Replicas, Same Datacenter)

| Metric | No Batching | With Batching (100/batch) |
|--------|-------------|---------------------------|
| Write Latency (p50) | 5-10ms | 5-10ms |
| Write Latency (p99) | 15-30ms | 15-30ms |
| Throughput | 5K-20K/sec | 100K-500K/sec |
| Write Amplification | 3x (replication) | 3x |

### 100 Shards

| Metric | Value |
|--------|-------|
| Aggregate Throughput | 10M-50M assertions/sec |
| Storage Efficiency | ~70% (accounting for WAL, indexes) |
| Fault Tolerance | Lose any 1 replica per shard |

### Resource Requirements (per node)

| Resource | Requirement |
|----------|-------------|
| CPU | 8-16 cores |
| RAM | 32-64 GB |
| Disk | 1-4 TB NVMe SSD |
| Network | 10 Gbps |

---

## 12. Comparison: Traditional vs Episteme

### CockroachDB Transaction Flow

```
1. Client: BEGIN TRANSACTION
2. TxnCoordSender: Create transaction record
3. DistSender: Route to ranges, write intents
4. Raft: Replicate intents (2 RTT)
5. TxnCoordSender: Write commit record
6. Raft: Replicate commit (1 RTT)
7. TxnCoordSender: Resolve intents (async)
8. Client: Transaction complete

Total: ~3-4 RTT, complex state machine
```

### Episteme Assertion Flow

```
1. Client: POST /assertions
2. Gateway: Route to shard leader
3. Leader: Propose to Raft
4. Raft: Replicate and commit (1 RTT)
5. Leader: Apply to storage
6. Gateway: Return assertion ID

Total: ~1-2 RTT, simple state machine
```

**Latency Reduction:** 50-75% vs traditional distributed database
**Code Complexity:** ~80% less than CockroachDB transaction coordinator

---

## 13. Open Questions & Future Work

### Q1: How to handle supersession chains across shards?

**Problem:** Assertion A1 on shard X supersedes assertion A0 on shard Y. How to ensure ordering?

**Option A:** Require supersession to happen on same shard (reject if subjects differ)
**Option B:** Use HLC timestamps, accept eventual consistency for cross-shard supersession
**Option C:** Introduce lightweight cross-shard coordination (2PC only for supersession)

**Recommendation:** Start with Option B (eventual consistency). Most supersession chains will be same subject (same shard).

### Q2: How to query across multiple shards efficiently?

**Problem:** Query "all assertions about subjects in category X" may span shards.

**Option A:** Scatter-gather (query all shards, merge results)
**Option B:** Secondary index shard (index by predicate or category)
**Option C:** Materialized views pre-aggregated by common queries

**Recommendation:** Implement Option A first, optimize hot queries with Option C.

### Q3: What's the story for multi-region deployments?

**Problem:** Replicate across US-East, US-West, EU for low-latency reads globally.

**Option A:** Single Raft group across regions (high write latency)
**Option B:** Regional Raft groups with async replication (eventual consistency)
**Option C:** Spanner-style Paxos with commit-wait (requires TrueTime)

**Recommendation:** Start single-region, then explore Option B (similar to CockroachDB follower reads).

### Q4: How to handle hot shards?

**Problem:** Celebrity subject (e.g., "Donald Trump") receives 100K writes/sec.

**Option A:** Over-provision that shard (more CPU/disk)
**Option B:** Sub-shard by predicate (split "Donald Trump" + "nationality" vs "Donald Trump" + "age")
**Option C:** Use consistent hashing to split hot subject across multiple sub-shards

**Recommendation:** Monitor for hotspots, implement Option A initially, Option C if needed.

---

## 14. Recommended Reading

### Essential Papers
1. [Raft Consensus Algorithm](https://raft.github.io/raft.pdf) - Diego Ongaro & John Ousterhout
2. [Logical Physical Clocks (HLC)](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf) - Kulkarni & Demirbas
3. [Spanner: Google's Globally-Distributed Database](https://research.google.com/archive/spanner-osdi2012.pdf) - Corbett et al.

### Blog Posts
4. [Life of a CockroachDB Transaction](https://www.cockroachlabs.com/blog/parallel-commits/)
5. [Building a Distributed Log from Scratch](https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-2-data-replication/)

### Books
6. **Designing Data-Intensive Applications** - Martin Kleppmann (Chapters 5, 7, 9)

---

## Conclusion

**For Episteme's append-only, content-addressed knowledge graph, we can build a significantly simpler distributed write path than traditional databases:**

1. **Use HLC instead of TrueTime** - No specialized hardware needed
2. **Single-writer-per-partition** - Eliminates distributed transactions
3. **Raft for replication** - Simple, battle-tested consensus
4. **Hash-based sharding by subject** - Natural data locality
5. **Async index updates** - Decouple write path from read optimization

**Expected outcome:** A system that scales to millions of concurrent writers, handles petabytes of assertions, and maintains strong consistency guarantees—with ~1/5th the complexity of CockroachDB.

**Next steps:** Implement Phase 1 (HLC integration) to validate timestamp generation and causality tracking before investing in Raft infrastructure.

---

## Sources

### Google Spanner
- [Spanner: TrueTime and external consistency | Google Cloud Documentation](https://docs.cloud.google.com/spanner/docs/true-time-external-consistency)
- [Strict Serializability and External Consistency in Spanner | Google Cloud Blog](https://cloud.google.com/blog/products/databases/strict-serializability-and-external-consistency-in-spanner)
- [Life of Spanner Reads & Writes | Google Cloud Documentation](https://docs.cloud.google.com/spanner/docs/whitepapers/life-of-reads-and-writes)

### CockroachDB
- [Parallel Commits: An atomic commit protocol for globally distributed transactions](https://www.cockroachlabs.com/blog/parallel-commits/)
- [How Pipelining consensus writes speeds up distributed SQL transactions](https://www.cockroachlabs.com/blog/transaction-pipelining/)
- [Replication Layer | CockroachDB Documentation](https://www.cockroachlabs.com/docs/stable/architecture/replication-layer)
- [Life of a Distributed Transaction | CockroachDB Documentation](https://www.cockroachlabs.com/docs/stable/architecture/life-of-a-distributed-transaction)

### Hybrid Logical Clocks
- [Hybrid Logical Clock in Rust | lib.rs](https://lib.rs/crates/hybrid-logical-clock)
- [Hybrid Logical Clocks | Kevin Sookocheff](https://sookocheff.com/post/time/hybrid-logical-clocks/)
- [uhlc-rs: A Unique Hybrid Logical Clock for Rust | GitHub](https://github.com/atolab/uhlc-rs)
- [Logical Physical Clocks Paper (PDF)](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf)
- [Evolving Clock Sync in Distributed Databases | YugabyteDB](https://www.yugabyte.com/blog/evolving-clock-sync-for-distributed-databases/)

### Append-Only Systems
- [Append-Only Logs: The Immutable Diary of Data | Medium](https://medium.com/@komalshehzadi/append-only-logs-the-immutable-diary-of-data-58c36a871c7c)
- [Immutability Changes Everything | ACM Queue](https://queue.acm.org/detail.cfm?id=2884038)
- [The Rise of Immutable Data Stores | ODBMS.org](https://www.odbms.org/2015/10/the-rise-of-immutable-data-stores/)
- [Data Replication in Distributed Systems | Candost's Blog](https://candost.blog/books/data-replication-in-distributed-systems/)

### Sharding
- [System Design Concepts: Database Sharding Strategies](https://engineeringatscale.substack.com/p/system-design-concepts-dive-deep)
- [Sharding strategies: directory-based, range-based, and hash-based | PlanetScale](https://planetscale.com/blog/types-of-sharding)
- [The Ultimate Guide to Consistent Hashing | Toptal](https://www.toptal.com/big-data/consistent-hashing)
- [4 Data Sharding Strategies We Analyzed When Building YugabyteDB](https://www.yugabyte.com/blog/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/)
- [RDF Triple Store Architectures (PDF)](https://www.iaria.org/conferences2018/filesDBKDA18/IztokSavnik_Tutorial_3store-arch.pdf)

### Raft & Consensus
- [Raft Consensus Algorithm](https://raft.github.io/)
- [Fast Raft: Optimizations to the Raft Consensus Protocol | arXiv](https://arxiv.org/html/2506.17793v1)
- [Write Buffering to Reduce Raft Consensus Latency in YugabyteDB](https://dev.to/yugabyte/write-buffering-to-reduce-raft-consensus-latency-in-yugabytedb-2dg6)
- [An Improved Raft Consensus Algorithm Based on Asynchronous Batch Processing | Springer](https://link.springer.com/chapter/10.1007/978-981-19-2456-9_44)

### Rust Implementations
- [tikv/raft-rs: Raft distributed consensus algorithm implemented in Rust | GitHub](https://github.com/tikv/raft-rs)
- [Implement Raft in Rust | TiKV Blog](https://tikv.org/blog/implement-raft-in-rust/)
- [raft crate | crates.io](https://crates.io/crates/raft)

### CRDTs
- [Conflict-free Replicated Data Types | CRDT Tech](https://crdt.tech/)
- [How CRDTs and Rust are revolutionizing distributed systems](https://kerkour.com/rust-crdt)
- [CRDT Implementations | CRDT Tech](https://crdt.tech/implementations)

### Kafka & Replicated Logs
- [Kafka Replication | Confluent Documentation](https://docs.confluent.io/kafka/design/replication.html)
- [Append-only Log: The core of Kafka | Medium](https://medium.com/@aryan25822/append-only-log-the-core-of-kafka-5740fc965fe7)
- [Building a Distributed Log from Scratch, Part 2 | Brave New Geek](https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-2-data-replication/)
- [The Log: What every software engineer should know | LinkedIn Engineering](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

### Content-Addressed Storage
- [Content Identifiers (CIDs) | IPFS Docs](https://docs.ipfs.tech/concepts/content-addressing/)
- [InterPlanetary File System | Wikipedia](https://en.wikipedia.org/wiki/InterPlanetary_File_System)

### Leader Election
- [Leader election in distributed systems | AWS Builders' Library](https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/)
- [Top Leader Election Algorithms in Distributed Databases | ByteByteGo](https://blog.bytebytego.com/p/top-leader-election-algorithms-in)
- [Single Leader Replication | Medium](https://medium.com/@Omkar-Wagholikar/distributed-systems-deep-dive-single-leader-replication-0a96fb303036)

### Clock Skew
- [Clock Synchronization Is a Nightmare | Arpit Bhayani](https://arpitbhayani.me/blogs/clock-sync-nightmare/)
- [Clock Offset vs Clock Skew | Baeldung](https://www.baeldung.com/cs/clock-offset-skew-difference)