# Milestone 1, Phase 1: Core Type System and Schema

## Phase Deliverable

This phase delivers the foundational type system -- entity IDs, signal type definitions, decay rate declarations, window specifications, and the error types that every subsequent module depends on -- together with the schema module that validates and stores signal and entity definitions.

## Acceptance Criteria

- [ ] `EntityId` is a u64 newtype with `Display`, `Hash`, `Eq`, `Ord`
- [ ] `SignalTypeDef` declaration captures: name, decay model (exponential/linear/permanent), half-life duration, enabled windows (1h/24h/7d/30d/all_time), velocity enabled flag
- [ ] `DecayModel::Exponential` stores pre-computed lambda derived from half-life: `lambda = ln(2) / half_life_seconds`
- [ ] `LumenError` enum covers Storage, NotFound, Schema, Durability, Query, Internal variants per CODING_GUIDELINES.md
- [ ] Schema validation rejects: duplicate signal names, zero/negative half-life, empty window list on non-permanent signals, velocity without windows
- [ ] All hot-path numeric types use the precision specified in research (f64 for decay scores, u64 for timestamps in nanoseconds)

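The lambda precomputation and the validation rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the phase's actual API: `SignalSpec`, `lambda_from_half_life`, `validate`, and the `SchemaError` variant names are hypothetical, windows are represented as plain labels, and the linear decay model is left out because its parameters are not described here.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Decay model with lambda computed once at declaration time.
/// (Linear decay is omitted from this sketch.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DecayModel {
    Exponential { lambda: f64 },
    Permanent,
}

/// Hypothetical declaration input; `half_life: None` means a permanent signal.
#[derive(Debug, Clone, PartialEq)]
pub struct SignalSpec {
    pub name: String,
    pub half_life: Option<Duration>,
    pub windows: Vec<&'static str>, // e.g. "1h", "24h"
    pub velocity: bool,
}

/// Hypothetical error variants mirroring the four rejection rules.
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    DuplicateSignal(String),
    NonPositiveHalfLife(String),
    EmptyWindows(String),
    VelocityWithoutWindows(String),
}

/// lambda = ln(2) / half_life_seconds; `None` for a zero half-life.
pub fn lambda_from_half_life(half_life: Duration) -> Option<f64> {
    let secs = half_life.as_secs_f64();
    (secs > 0.0).then(|| std::f64::consts::LN_2 / secs)
}

/// Validates every spec and returns each signal name with its decay model.
pub fn validate(specs: &[SignalSpec]) -> Result<Vec<(String, DecayModel)>, SchemaError> {
    let mut seen = HashSet::new();
    let mut out = Vec::with_capacity(specs.len());
    for spec in specs {
        // Rule: signal names must be unique.
        if !seen.insert(spec.name.as_str()) {
            return Err(SchemaError::DuplicateSignal(spec.name.clone()));
        }
        // Rule: half-life must be strictly positive when present.
        let model = match spec.half_life {
            Some(half_life) => {
                let lambda = lambda_from_half_life(half_life)
                    .ok_or_else(|| SchemaError::NonPositiveHalfLife(spec.name.clone()))?;
                DecayModel::Exponential { lambda }
            }
            None => DecayModel::Permanent,
        };
        if spec.windows.is_empty() {
            // Rule: velocity needs at least one window.
            if spec.velocity {
                return Err(SchemaError::VelocityWithoutWindows(spec.name.clone()));
            }
            // Rule: only permanent signals may declare no windows.
            if !matches!(model, DecayModel::Permanent) {
                return Err(SchemaError::EmptyWindows(spec.name.clone()));
            }
        }
        out.push((spec.name.clone(), model));
    }
    Ok(out)
}
```

Because `Duration` cannot be negative, the zero check covers the "zero/negative" rule in this sketch; an API that accepts raw numeric half-lives would need an explicit sign check as well.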
## Dependencies

- **Requires:** Nothing -- this is the root of the dependency DAG
- **Blocks:** m1p2 (WAL), m1p3 (Storage/fjall), and transitively all subsequent phases

## Research References

- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- decay formula, EntityState struct, running-score approach
- [docs/research/phase1_1_type_system.md](../../../research/phase1_1_type_system.md) -- newtype patterns, Duration handling, error hierarchy, schema validation, f64 precision analysis, Window enum design
- [CODING_GUIDELINES.md](../../../../CODING_GUIDELINES.md) -- error handling (section 7), module boundaries (section 9), dependencies (section 10)
- [thoughts.md](../../../../thoughts.md) -- Part V.12 (subject-prefix keys), Part II.1 (WAL convergence)

## Spec References

- [docs/specs/03-signal-system.md](../../../specs/03-signal-system.md) -- signal type declaration, decay types and lambda precomputation, window definitions, signal ledger architecture
- [docs/specs/11-schema.md](../../../specs/11-schema.md) -- schema definition API, type system, validation rules, schema versioning
- [docs/specs/02-entity-model.md](../../../specs/02-entity-model.md) -- EntityKind (Item/User/Creator), entity ID encoding, storage representation
- [docs/specs/01-storage-engine.md](../../../specs/01-storage-engine.md) -- key encoding scheme using big-endian EntityId and Timestamp
- [docs/specs/00-architecture-overview.md](../../../specs/00-architecture-overview.md) -- system architecture, code module map showing schema/ layout

## Task Index

| # | Task | Delivers | Depends On | Complexity |
|---|------|----------|------------|------------|
| 01 | Core Identity and Temporal Types | `EntityId`, `EntityKind`, `Timestamp`, `Score` | None | S |
| 02 | Signal Type Definitions | `SignalTypeDef`, `DecayModel`, `DecaySpec`, `Window`, `WindowSet` | Task 01 | S |
| 03 | Error Types and Schema Validation | `LumenError`, `SchemaError`, `Schema`, `SchemaBuilder` | Task 01, Task 02 | S |

## Task Dependency DAG

```
Task 01: Core Identity Types
    |
    v
Task 02: Signal Type Definitions (uses EntityKind from Task 01)
    |
    v
Task 03: Error Types + Schema Validation (uses EntityId, SignalTypeDef, DecayModel, Window)
```

Tasks 01 and 02 are technically parallelizable if `EntityKind` is extracted first, but at complexity S each, sequential execution is fine.

## File Layout

```
tidal/src/
  lib.rs              -- pub mod declarations, Result<T> alias, re-exports
  schema/
    mod.rs            -- pub use re-exports from submodules
    entity.rs         -- Task 01: EntityId, EntityKind
    timestamp.rs      -- Task 01: Timestamp newtype
    score.rs          -- Task 01: Score newtype (finite f64 with Ord)
    signal.rs         -- Task 02: SignalTypeDef, DecayModel, Window, WindowSet
    error.rs          -- Task 03: LumenError, SchemaError, sub-error stubs
    validation.rs     -- Task 03: Schema, SchemaBuilder, DecaySpec, SignalBuilder
  signals/mod.rs      -- empty (m1p4)
  storage/mod.rs      -- empty (m1p3)
  query/mod.rs        -- empty (Milestone 2)
  ranking/mod.rs      -- empty (Milestone 2)
```

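As a rough illustration of what the Task 01 files might contain, the sketch below shows the three newtypes under stated assumptions. The method names (`to_be_bytes`, `seconds_since`, `new`, `get`) and the choice to expose the inner `u64` are hypothetical; only the newtype shapes, the trait list, the nanosecond timestamp, and the "finite f64 with Ord" requirement come from this document.

```rust
use std::cmp::Ordering;
use std::fmt;

/// Internal entity identifier (u64 newtype).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct EntityId(pub u64);

impl EntityId {
    /// Big-endian bytes compare lexicographically in the same order as the
    /// integers themselves, so byte-sorted keys stay in numeric ID order.
    pub fn to_be_bytes(self) -> [u8; 8] {
        self.0.to_be_bytes()
    }
}

impl fmt::Display for EntityId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Timestamp in nanoseconds (the epoch is left to the caller in this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Timestamp(pub u64);

impl Timestamp {
    /// Seconds elapsed since `earlier`, saturating at zero if `earlier` is later.
    pub fn seconds_since(self, earlier: Timestamp) -> f64 {
        self.0.saturating_sub(earlier.0) as f64 / 1e9
    }
}

/// Score that is guaranteed finite, which makes a total order sound.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Score(f64);

impl Score {
    /// Rejects NaN and infinities at construction time.
    pub fn new(value: f64) -> Option<Self> {
        if value.is_finite() { Some(Score(value)) } else { None }
    }

    pub fn get(self) -> f64 {
        self.0
    }
}

// Sound because NaN can never be stored, so equality is reflexive.
impl Eq for Score {}

impl PartialOrd for Score {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Score {
    fn cmp(&self, other: &Self) -> Ordering {
        // Finite floats always compare, so this never panics.
        self.0.partial_cmp(&other.0).expect("Score is always finite")
    }
}
```

Validating finiteness in the constructor keeps the check out of the hot path: once a `Score` exists, sorting and comparison need no further NaN handling.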
## Open Questions

1. **String vs u64 entity IDs in public API** -- API.md uses string IDs (`"item_abc"`), while internal types use `u64`. Resolution: `EntityId` is `u64` internally. String-to-u64 mapping is an m1p5 concern, to be addressed when the public `Lumen` API is built. m1p1 defines only the internal type.

2. **EntityId uniqueness scope** -- globally unique or per-EntityKind? Resolution: signal names are globally unique (no `item.view` vs `user.view`). Entity IDs are scoped per-EntityKind by storage namespace. Different column families isolate the namespaces.

3. **Custom windows** -- `Window::Custom(Duration)` is deferred. The five fixed variants cover every sort mode and ranking profile in the spec. Adding custom windows would require dynamic bucket allocation. Revisit if M5 benchmarks demand it.

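For reference, the five fixed windows could be modelled along these lines. This is a sketch under assumptions: the variant names, the `duration` helper, and the `BTreeSet`-backed `WindowSet` are illustrative choices, not the definitions in `signal.rs`.

```rust
use std::collections::BTreeSet;
use std::time::Duration;

/// The five fixed windows (1h/24h/7d/30d/all_time). Variant names are
/// hypothetical; declaration order doubles as the sort order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum Window {
    OneHour,
    OneDay,
    SevenDays,
    ThirtyDays,
    AllTime,
}

impl Window {
    /// Length of the window, or `None` for the unbounded all-time window.
    pub fn duration(self) -> Option<Duration> {
        const HOUR: u64 = 3_600;
        const DAY: u64 = 24 * HOUR;
        match self {
            Window::OneHour => Some(Duration::from_secs(HOUR)),
            Window::OneDay => Some(Duration::from_secs(DAY)),
            Window::SevenDays => Some(Duration::from_secs(7 * DAY)),
            Window::ThirtyDays => Some(Duration::from_secs(30 * DAY)),
            Window::AllTime => None,
        }
    }
}

/// Set of enabled windows; a `BTreeSet` removes duplicates and keeps
/// the windows sorted from shortest to longest.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct WindowSet(BTreeSet<Window>);

impl WindowSet {
    pub fn new(windows: impl IntoIterator<Item = Window>) -> Self {
        WindowSet(windows.into_iter().collect())
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }

    /// Iterates the enabled windows in ascending order.
    pub fn iter(&self) -> impl Iterator<Item = Window> + '_ {
        self.0.iter().copied()
    }
}
```

With a closed enum like this, the number of buckets per signal is known at compile time, which is the property a `Custom(Duration)` variant would give up.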