tidaldb/docs/planning/milestone-1/phase-1/OVERVIEW.md
jordan 29400d48db feat: implement Milestone 1 phases 1-3 — schema, WAL, and storage layer
Implements the foundation of tidalDB's data pipeline:

**Phase 1 – Schema primitives**
- EntityId newtype (u64, big-endian ordering)
- SignalTypeDefinition with pre-computed decay λ, deduped/sorted windows
- SchemaBuilder with full constraint validation (duplicates, identifiers,
  half-life, windows, velocity)
- LumenError wrapping all subsystems with required From impls

**Phase 2 – Write-Ahead Log**
- Length-prefixed, BLAKE3-protected entry format
- Group-commit writer (batch up to 100 events / 10 ms)
- Double-buffered content-hash deduplication
- Checkpoint, truncation, and crash-recovery with full replay
- Integration, property, and UAT tests (incl. 5,500-event deterministic UAT)
- Proptest coverage scaled to 10 000 events/run (was ≤500) to meet
  acceptance criterion; cases reduced 100→10 to keep runtime comparable

**Phase 3 – Storage engine**
- StorageEngine trait (get/put/delete/scan/batch/flush)
- Key encoding: [EntityId][0x00][Tag][suffix] with ordering/prefix helpers
- InMemoryBackend (BTreeMap + RwLock)
- FjallStorage with three isolated keyspaces and atomic batch helper
- Property tests for key ordering and round-trip correctness

Also adds planning docs for phases 4-5, research docs, architecture
overview, and roadmap updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 16:43:24 -07:00


# Milestone 1, Phase 1: Core Type System and Schema
## Phase Deliverable
This phase delivers the foundational type system -- entity IDs, signal type definitions, decay rate declarations, window specifications, and the error types that every subsequent module depends on -- together with the schema module that validates and stores signal/entity definitions.
## Acceptance Criteria
- [ ] `EntityId` is a u64 newtype with `Display`, `Hash`, `Eq`, `Ord`
- [ ] `SignalTypeDef` declaration captures: name, decay model (exponential/linear/permanent), half-life duration, enabled windows (1h/24h/7d/30d/all_time), velocity enabled flag
- [ ] `DecayModel::Exponential` stores pre-computed lambda derived from half-life: `lambda = ln(2) / half_life_seconds`
- [ ] `LumenError` enum covers Storage, NotFound, Schema, Durability, Query, Internal variants per CODING_GUIDELINES.md
- [ ] Schema validation rejects: duplicate signal names, zero/negative half-life, empty window list on non-permanent signals, velocity without windows
- [ ] All hot-path numeric types use the precision specified in research (f64 for decay scores, u64 for timestamps in nanoseconds)
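
The lambda criterion above can be sketched as follows. This is a minimal illustration, not the shipped type: the field layout, the `exponential` constructor, and the `decay_factor` helper are assumptions, and the `Linear` variant is omitted because this document does not define its parameters.

```rust
use std::time::Duration;

/// Sketch of the decay model. Variant names follow the acceptance criteria;
/// the field layout is an assumption made for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DecayModel {
    /// Holds lambda = ln(2) / half_life_seconds, computed once at declaration time.
    Exponential { lambda: f64 },
    /// Never decays.
    Permanent,
}

impl DecayModel {
    /// Hypothetical constructor: returns `None` for a zero half-life.
    /// `Duration` cannot be negative, so that case is excluded by the type.
    pub fn exponential(half_life: Duration) -> Option<Self> {
        let secs = half_life.as_secs_f64();
        if secs > 0.0 {
            Some(DecayModel::Exponential {
                lambda: std::f64::consts::LN_2 / secs,
            })
        } else {
            None
        }
    }

    /// Multiplier applied to a score after `elapsed_secs` seconds (hypothetical helper).
    pub fn decay_factor(&self, elapsed_secs: f64) -> f64 {
        match *self {
            DecayModel::Exponential { lambda } => (-lambda * elapsed_secs).exp(),
            DecayModel::Permanent => 1.0,
        }
    }
}
```

Pre-computing lambda keeps the division out of the hot path: each decay evaluation is then one multiplication and one `exp` call on `f64` values.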
## Dependencies
- **Requires:** Nothing -- this is the root of the dependency DAG
- **Blocks:** m1p2 (WAL), m1p3 (Storage/fjall), and transitively all subsequent phases
## Research References
- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- decay formula, EntityState struct, running-score approach
- [docs/research/phase1_1_type_system.md](../../../research/phase1_1_type_system.md) -- newtype patterns, Duration handling, error hierarchy, schema validation, f64 precision analysis, Window enum design
- [CODING_GUIDELINES.md](../../../../CODING_GUIDELINES.md) -- error handling (section 7), module boundaries (section 9), dependencies (section 10)
- [thoughts.md](../../../../thoughts.md) -- Part V.12 (subject-prefix keys), Part II.1 (WAL convergence)
## Spec References
- [docs/specs/03-signal-system.md](../../../specs/03-signal-system.md) -- signal type declaration, decay types and lambda precomputation, window definitions, signal ledger architecture
- [docs/specs/11-schema.md](../../../specs/11-schema.md) -- schema definition API, type system, validation rules, schema versioning
- [docs/specs/02-entity-model.md](../../../specs/02-entity-model.md) -- EntityKind (Item/User/Creator), entity ID encoding, storage representation
- [docs/specs/01-storage-engine.md](../../../specs/01-storage-engine.md) -- key encoding scheme using big-endian EntityId and Timestamp
- [docs/specs/00-architecture-overview.md](../../../specs/00-architecture-overview.md) -- system architecture, code module map showing schema/ layout
## Task Index
| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Core Identity and Temporal Types | `EntityId`, `EntityKind`, `Timestamp`, `Score` | None | S |
| 02 | Signal Type Definitions | `SignalTypeDef`, `DecayModel`, `DecaySpec`, `Window`, `WindowSet` | Task 01 | S |
| 03 | Error Types and Schema Validation | `LumenError`, `SchemaError`, `Schema`, `SchemaBuilder` | Task 01, Task 02 | S |
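
As a rough sketch of what Task 01 might produce for `EntityId`, the newtype below derives the traits listed in the acceptance criteria and adds big-endian byte helpers for key encoding. The method names (`new`, `get`, `to_be_bytes`, `from_be_bytes`) are assumptions for illustration, not the actual API.

```rust
use std::fmt;

/// Internal entity identifier: a u64 newtype.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct EntityId(u64);

impl EntityId {
    pub fn new(raw: u64) -> Self {
        EntityId(raw)
    }

    pub fn get(self) -> u64 {
        self.0
    }

    /// Big-endian bytes compare lexicographically in the same order as the
    /// underlying integers, so byte-sorted keys stay in numeric ID order.
    pub fn to_be_bytes(self) -> [u8; 8] {
        self.0.to_be_bytes()
    }

    pub fn from_be_bytes(bytes: [u8; 8]) -> Self {
        EntityId(u64::from_be_bytes(bytes))
    }
}

impl fmt::Display for EntityId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}
```

Little-endian encoding would not preserve this property: `1` would sort after `256` in a byte-ordered key space.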
## Task Dependency DAG
```
Task 01: Core Identity Types
|
v
Task 02: Signal Type Definitions (uses EntityKind from Task 01)
|
v
Task 03: Error Types + Schema Validation (uses EntityId, SignalTypeDef, DecayModel, Window)
```
Tasks 01 and 02 are technically parallelizable if `EntityKind` is extracted first, but at complexity S each, sequential execution is fine.
## File Layout
```
tidal/src/
  lib.rs              -- pub mod declarations, Result<T> alias, re-exports
  schema/
    mod.rs            -- pub use re-exports from submodules
    entity.rs         -- Task 01: EntityId, EntityKind
    timestamp.rs      -- Task 01: Timestamp newtype
    score.rs          -- Task 01: Score newtype (finite f64 with Ord)
    signal.rs         -- Task 02: SignalTypeDef, DecayModel, Window, WindowSet
    error.rs          -- Task 03: LumenError, SchemaError, sub-error stubs
    validation.rs     -- Task 03: Schema, SchemaBuilder, DecaySpec, SignalBuilder
  signals/mod.rs      -- empty (m1p4)
  storage/mod.rs      -- empty (m1p3)
  query/mod.rs        -- empty (Milestone 2)
  ranking/mod.rs      -- empty (Milestone 2)
```
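
The `score.rs` entry describes `Score` as a finite `f64` with `Ord`. One way this could work is sketched below: reject NaN and infinities at construction, after which a total order is safe to implement. The constructor and accessor names are assumptions.

```rust
use std::cmp::Ordering;

/// A score that is guaranteed finite, which makes a total order well-defined.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Score(f64);

impl Score {
    /// Hypothetical constructor: rejects NaN and infinities.
    pub fn new(value: f64) -> Option<Self> {
        if value.is_finite() {
            Some(Score(value))
        } else {
            None
        }
    }

    pub fn get(self) -> f64 {
        self.0
    }
}

// Sound because NaN, the only f64 value not equal to itself, is excluded.
impl Eq for Score {}

impl PartialOrd for Score {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Score {
    fn cmp(&self, other: &Self) -> Ordering {
        // Finite values always compare, so the fallback is never taken.
        self.0.partial_cmp(&other.0).unwrap_or(Ordering::Equal)
    }
}
```

With `Ord` in place, scores can be sorted directly or used as keys in ordered collections without per-call NaN handling.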
## Open Questions
1. **String vs u64 entity IDs in public API** -- API.md uses string IDs (`"item_abc"`), internal types use `u64`. Resolution: `EntityId` is `u64` internally. String-to-u64 mapping is an m1p5 concern, to be handled when the public `Lumen` API is built. m1p1 defines only the internal type.
2. **EntityId uniqueness scope** -- globally unique or per-EntityKind? Resolution: entity IDs are scoped per-EntityKind by storage namespace, with different column families isolating the namespaces. Signal names, by contrast, are globally unique (no `item.view` vs `user.view`).
3. **Custom windows** -- `Window::Custom(Duration)` deferred. The five fixed variants cover every sort mode and ranking profile in the spec. Adding custom windows would require dynamic bucket allocation. Revisit if M5 benchmarks demand it.
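
To make the fixed-window decision concrete alongside the validation rules from the acceptance criteria, here is a minimal sketch. `SignalDecl`, `validate`, and the `SchemaError` variants are hypothetical stand-ins for `SignalTypeDef`, `SchemaBuilder`, and the real error type. Half-life checks are left out to keep it short.

```rust
use std::collections::BTreeSet;

/// The five fixed windows. Declaration order doubles as sort order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum Window {
    OneHour,
    OneDay,
    SevenDays,
    ThirtyDays,
    AllTime,
}

/// Hypothetical error variants for the rules sketched here.
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    DuplicateSignal(String),
    EmptyWindows(String),
    VelocityWithoutWindows(String),
}

/// Hypothetical, trimmed-down signal declaration.
pub struct SignalDecl {
    pub name: String,
    pub permanent: bool,
    pub windows: Vec<Window>,
    pub velocity: bool,
}

/// Validates declarations and returns each signal's deduplicated, sorted windows.
pub fn validate(signals: &[SignalDecl]) -> Result<Vec<(String, Vec<Window>)>, SchemaError> {
    let mut seen = BTreeSet::new();
    let mut out = Vec::new();
    for s in signals {
        // Signal names are globally unique.
        if !seen.insert(s.name.as_str()) {
            return Err(SchemaError::DuplicateSignal(s.name.clone()));
        }
        // A BTreeSet both deduplicates and sorts the requested windows.
        let windows: BTreeSet<Window> = s.windows.iter().copied().collect();
        if s.velocity && windows.is_empty() {
            return Err(SchemaError::VelocityWithoutWindows(s.name.clone()));
        }
        if !s.permanent && windows.is_empty() {
            return Err(SchemaError::EmptyWindows(s.name.clone()));
        }
        out.push((s.name.clone(), windows.into_iter().collect()));
    }
    Ok(out)
}
```

Because the variants are a closed enum, a window set has at most five members known at compile time, which is what lets the design avoid dynamic bucket allocation.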