# Research: Enterprise Readiness Risks -- fjall Backup API and Schema Fingerprinting

## Risk 1: fjall v3 Backup/Snapshot API

### Question

Does fjall 3.x expose a safe backup/snapshot API that tidalDB can use to implement `TidalDb::create_backup(dest: &Path) -> Result<BackupInfo>` while the database is live?

### TidalDB Context

tidalDB uses fjall 3.0.2 (`fjall = "3"` in Cargo.toml) as its durable storage engine. The `FjallStorage` struct (at `/tidal/src/storage/fjall.rs`) owns a single `fjall::Database` with three keyspaces: items, users, creators. A backup must also capture:

- **WAL segments** (`{data_dir}/wal/`)
- **Tantivy text indexes** (`{data_dir}/text_index/`, `{data_dir}/creator_text_index/`)
- **USearch vector indexes** (stored as `.idx` files via `VectorIndex::save()`)
- **Signal ledger checkpoints** (serialized into fjall under `Tag::Sig = 0x02`)
- **Co-engagement, cohort, collection, session data** (all in fjall under their respective tags)

The backup must be consistent: a restored backup should produce the same query results as the source at the point in time the backup was taken.

### Survey of fjall 3.0.2 API Surface

**`fjall::Database` public methods (complete list from docs.rs):**

| Method | Purpose | Backup Relevance |
|--------|---------|------------------|
| `snapshot()` | Cross-keyspace MVCC read snapshot | Read consistency only; does NOT produce files |
| `persist(PersistMode)` | Flushes active journal to disk | Required pre-backup for durability |
| `batch()` | Atomic cross-keyspace write batch | Not relevant |
| `keyspace(name, opts)` | Open/create a keyspace | Not relevant |
| `disk_space()` | Total bytes on disk | Informational only |
| `journal_count()` | Number of journal files | Informational only |
| `list_keyspace_names()` | Enumerate keyspaces | Useful for backup enumeration |

**`fjall::Keyspace` relevant methods:**

| Method | Purpose | Backup Relevance |
|--------|---------|------------------|
| `path()` | Returns the LSM-tree's filesystem path | Needed to locate files to copy |
| `rotate_memtable_and_wait()` | Flushes memtable to SST, blocks until done | Critical pre-backup step |
| `disk_space()` | Keyspace bytes on disk | Informational |

**`fjall::Snapshot`:**

- Implements the `Readable` trait (get, iter, range, prefix, etc.)
- This is a logical MVCC snapshot for consistent reads -- it does NOT produce a physical file-level snapshot
- Cannot be used for file-level backup

### Has snapshot/backup API: NO

fjall 3.0.2 does **not** expose a `backup_to()`, `checkpoint()`, or `export()` method. This is tracked as [GitHub issue #52: "Backup using Checkpointing"](https://github.com/fjall-rs/fjall/issues/52), which remains **open and blocked** as of December 2024.

The planned API (not yet implemented):

```rust
Database::backup_to<P: AsRef<Path>>(&self, path: P) -> crate::Result<()>
TxDatabase::backup_to<P: AsRef<Path>>(&self, path: P) -> crate::Result<()>
```

The blocker is [issue #70](https://github.com/fjall-rs/fjall/issues/70) -- an "unopened keyspace locking" mechanism needed for safe online backup.

### Comparison with Other Embedded Databases

| Database | Backup API | Online? | Hard Links? | Notes |
|----------|-----------|---------|-------------|-------|
| **RocksDB** | `Checkpoint::CreateCheckpoint()` | Yes | Yes (same FS) | Hard-links SSTs, copies MANIFEST. Consistent across column families. Production-proven at scale. |
| **SQLite** | `sqlite3_backup_init/step/finish` | Yes | No (page copy) | Incremental page-by-page copy while source remains writable. |
| **LMDB** | `mdb_env_copy2()` | Yes | No (page copy) | Copy-on-write B-tree makes consistent snapshots trivial. |
| **DuckDB** | `EXPORT DATABASE` / `COPY` | Semi | No | SQL-level export; not a byte-level checkpoint. |
| **fjall 3.0.2** | None | N/A | N/A | Issue #52 open. Maintainer recommends `cp -R` offline. |

### Safe Backup Procedure for fjall 3.0.2

Given the absence of a backup API, three candidate approaches were evaluated. Only Approach A is viable today:

#### Approach A: Quiesce + File Copy (Recommended)

This is the approach the fjall maintainer explicitly recommends for offline backup. Adapted for tidalDB's multi-engine architecture:

```
1. Pause writes (set an AtomicBool flag that makes signal/entity writes return Err(Backpressure))
2. Flush all in-flight data:
   a. Flush text syncers (item + creator) via flush_tx channel -- blocks until Tantivy commits
   b. Checkpoint signal ledger + cohort ledger + co-engagement to fjall
   c. For each keyspace: call rotate_memtable_and_wait() to flush memtables to SSTs
   d. Call db.persist(PersistMode::SyncAll) to fsync all journal data
   e. Write WAL checkpoint marker
3. Copy the entire data_dir recursively to dest:
   a. {data_dir}/items/              -> {dest}/items/              (fjall SSTs + journals)
   b. {data_dir}/users/              -> {dest}/users/              (fjall SSTs + journals)
   c. {data_dir}/creators/           -> {dest}/creators/           (fjall SSTs + journals)
   d. {data_dir}/wal/                -> {dest}/wal/                (tidalDB WAL segments)
   e. {data_dir}/text_index/         -> {dest}/text_index/
   f. {data_dir}/creator_text_index/ -> {dest}/creator_text_index/
   g. {data_dir}/cache/              -> {dest}/cache/              (if present)
4. Resume writes (clear the AtomicBool flag)
5. Return BackupInfo { path, size_bytes, timestamp, wal_sequence }
```

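Steps 1 and 4 of the procedure above could be sketched as a small gate around an `AtomicBool`. This is a minimal sketch, not tidalDB's actual code: `WriteGate`, `PauseGuard`, `check_write`, and the `Backpressure` type are hypothetical names. The guard clears the flag on drop, so writes resume even if the flush or copy step returns early. Note that a flag alone does not wait for writes that passed the check just before the pause; a real implementation would also need to drain those.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Error handed to writers while a backup holds the gate (hypothetical name).
#[derive(Debug, PartialEq, Eq)]
pub struct Backpressure;

/// Shared flag that every write path consults before accepting a write.
#[derive(Default)]
pub struct WriteGate {
    paused: AtomicBool,
}

/// RAII guard: dropping it clears the flag, so writes resume on every exit path.
pub struct PauseGuard<'a> {
    gate: &'a WriteGate,
}

impl WriteGate {
    /// Pause writes. Returns `None` if another backup already holds the gate.
    pub fn pause(&self) -> Option<PauseGuard<'_>> {
        self.paused
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| PauseGuard { gate: self })
    }

    /// Called at the top of each signal/entity write path.
    pub fn check_write(&self) -> Result<(), Backpressure> {
        if self.paused.load(Ordering::Acquire) {
            Err(Backpressure)
        } else {
            Ok(())
        }
    }
}

impl Drop for PauseGuard<'_> {
    fn drop(&mut self) {
        self.gate.paused.store(false, Ordering::Release);
    }
}
```
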
**Write pause duration estimate:** The flush operations (steps 2a-2d) are I/O-bound. For a database with 10M entities and active signal writes:

- Text syncer flush: ~100ms (channel round-trip + Tantivy commit)
- Signal checkpoint: ~50ms (serialize DashMap entries to fjall)
- rotate_memtable_and_wait per keyspace: ~50ms each (3 keyspaces = ~150ms)
- persist(SyncAll): ~10ms (fsync)
- File copy: proportional to data size; 1GB at 500MB/s = ~2s

**Total estimated write pause: 300ms flush + copy time.** For a 1GB database, roughly 2-3 seconds.

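The arithmetic behind that estimate can be written out as below. The struct and function names are illustrative only, and the inputs are the rough figures quoted above rather than measurements.

```rust
/// Flush-phase costs in milliseconds, taken from the rough estimates above.
pub struct FlushEstimate {
    pub text_syncer_ms: u64,
    pub signal_checkpoint_ms: u64,
    pub rotate_per_keyspace_ms: u64,
    pub keyspaces: u64,
    pub persist_ms: u64,
}

impl FlushEstimate {
    /// Sum of all flush steps; memtable rotation is paid once per keyspace.
    pub fn total_ms(&self) -> u64 {
        self.text_syncer_ms
            + self.signal_checkpoint_ms
            + self.rotate_per_keyspace_ms * self.keyspaces
            + self.persist_ms
    }
}

/// Copy time in milliseconds for `bytes` at a sustained `bytes_per_sec`.
pub fn copy_ms(bytes: u64, bytes_per_sec: u64) -> u64 {
    // Widen to u128 so very large databases cannot overflow the multiply.
    (u128::from(bytes) * 1000 / u128::from(bytes_per_sec)) as u64
}
```

With the figures above (100, 50, 50 per keyspace across 3 keyspaces, and 10), the flush total comes to 310 ms, and 1 GB at 500 MB/s adds 2000 ms of copy time, consistent with the 2-3 second range.
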
#### Approach B: Snapshot-Consistent Logical Export

Use `Database::snapshot()` for a consistent logical view, then iterate and write to a new fjall database:

```
1. Take snapshot = db.snapshot()
2. For each keyspace, iterate snapshot and write to a new Database at dest
3. Separately copy WAL, Tantivy indexes, vector indexes
```

**Problems with this approach:**

- Does not capture WAL/Tantivy/vector files consistently with the fjall snapshot
- Much slower than file copy (must deserialize/reserialize every KV pair)
- No way to snapshot Tantivy or USearch indexes concurrently with the fjall snapshot
- The logical export would need to reconstruct the exact on-disk format fjall expects

**Verdict: Approach B is not viable.** The cross-engine consistency problem (fjall + Tantivy + USearch) makes logical export impractical.

#### Approach C: Hard-Link Optimization (Same Filesystem)

A refinement of Approach A for same-filesystem backups:

```
1. Quiesce + flush (same as Approach A steps 1-2)
2. For fjall SST files: hard-link instead of copy (SSTs are immutable after flush)
3. For journal files, WAL, Tantivy, USearch: copy (these are mutable)
4. Resume writes
```

This mirrors RocksDB's Checkpoint approach. However, it requires:

- Enumerating fjall's internal file structure (SSTs vs journals vs metadata)
- Understanding which files are immutable after `rotate_memtable_and_wait()`

Both are fragile without fjall's cooperation, because the internal layout may change between versions.

**Verdict: Too fragile without fjall API support.** Wait for issue #52 resolution, then adopt hard-link optimization.

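For reference, the per-file mechanism in steps 2-3 is simple with the standard library; the fragile part is deciding which files are immutable, which this sketch deliberately leaves to the caller. `link_or_copy` and `Placement` are hypothetical names, and falling back to a byte copy when linking fails (for example, across filesystems) is an assumption about the desired behavior.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// How a file ended up in the backup directory.
#[derive(Debug, PartialEq, Eq)]
pub enum Placement {
    HardLinked,
    Copied,
}

/// Hard-link `src` to `dst` when the caller marks it immutable and the link
/// succeeds; otherwise fall back to copying the bytes.
pub fn link_or_copy(src: &Path, dst: &Path, immutable: bool) -> io::Result<Placement> {
    if immutable && fs::hard_link(src, dst).is_ok() {
        return Ok(Placement::HardLinked);
    }
    fs::copy(src, dst)?;
    Ok(Placement::Copied)
}
```

A hard link shares storage with the original, so it is only safe for files that are never modified in place after the flush.
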
### Recommendation for `create_backup()` Implementation

**Use Approach A: Quiesce + File Copy.**

```rust
pub fn create_backup(&self, dest: &Path) -> Result<BackupInfo> {
    // 1. Pause writes via AtomicBool
    // 2. Flush all engines (text, signal, fjall, WAL)
    // 3. fs_extra::dir::copy(data_dir, dest, &CopyOptions::new())
    // 4. Resume writes
    // 5. Return metadata
    todo!("outline only; each step above still needs an implementation")
}
```

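Step 3 of the outline names `fs_extra`; if adding that dependency is undesirable, a recursive copy can also be sketched with the standard library alone. `copy_dir_recursive` is a hypothetical helper, not part of tidalDB, and it assumes the data directory contains only regular files and directories (a symlink to a directory would make `fs::copy` fail).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copy `src` into `dst`, creating `dst` if needed.
/// Returns the total number of file bytes copied.
pub fn copy_dir_recursive(src: &Path, dst: &Path) -> io::Result<u64> {
    fs::create_dir_all(dst)?;
    let mut bytes = 0;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            // Descend into subdirectories such as wal/ or text_index/.
            bytes += copy_dir_recursive(&entry.path(), &target)?;
        } else {
            // fs::copy returns the number of bytes written for this file.
            bytes += fs::copy(entry.path(), &target)?;
        }
    }
    Ok(bytes)
}
```
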
Key implementation notes:

- `rotate_memtable_and_wait()` is public but annotated "NOTE: Used in tests" in fjall source. It is the correct pre-backup call -- it ensures all in-memory data is flushed to SSTs. The annotation reflects that most users do not need to call it directly, not that it is unsafe.
- `persist(PersistMode::SyncAll)` must follow to ensure journal data reaches disk.
- The flush portion of the write pause is bounded by I/O latency. The copy portion scales with data size, but it runs at raw I/O throughput because a file copy involves no serialization.
- Future: when fjall ships issue #52 (`Database::backup_to()`), replace the file copy with the native API for hard-link support and reduced pause duration.

### Open Questions

1. **rotate_memtable_and_wait() stability:** This method is public in fjall 3.0.2 but undocumented on docs.rs. It appears in the keyspace source as `pub fn rotate_memtable_and_wait`. tidalDB already calls it in `FjallBackend::flush()`. Risk: it could be renamed or removed in a minor fjall release. Mitigation: pin the fjall version exactly (the current `fjall = "3"` requirement accepts any 3.x release); the method is already in tidalDB's dependency surface.

2. **Tantivy backup safety:** Tantivy indexes are append-only segment files plus a `meta.json`. Copying after a `commit()` (via flush_tx) should be safe, but this needs a test that verifies a copied Tantivy index opens correctly.

3. **USearch backup safety:** USearch `.idx` files are written by `VectorIndex::save()`, which is expected to be atomic. If that guarantee does not hold, a backup that races with a save could copy a truncated file. The quiesce step prevents the race, but we should still add file size/checksum validation on the backup side.

4. **Incremental backup:** File copy is O(data_size) every time. For large databases, incremental backup (only copy changed SSTs) would reduce pause duration. This requires tracking file checksums or modification times. Defer to post-MVP.

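The size/checksum validation mentioned in question 3 could look roughly like the sketch below. `copy_matches` and `length_and_checksum` are hypothetical helpers, and FNV-1a is used only because it needs no external crate; a production check would more likely use a cryptographic or CRC hash.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Returns (length in bytes, FNV-1a 64-bit hash) for a file, read in 64 KiB chunks.
fn length_and_checksum(path: &Path) -> io::Result<(u64, u64)> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 64 * 1024];
    // FNV-1a 64-bit offset basis.
    let (mut len, mut hash) = (0u64, 0xcbf29ce484222325u64);
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        len += n as u64;
        for &b in &buf[..n] {
            hash ^= u64::from(b);
            // FNV-1a 64-bit prime.
            hash = hash.wrapping_mul(0x100000001b3);
        }
    }
    Ok((len, hash))
}

/// True when the source file and its backup copy agree on length and checksum.
pub fn copy_matches(src: &Path, dst: &Path) -> io::Result<bool> {
    Ok(length_and_checksum(src)? == length_and_checksum(dst)?)
}
```
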
---

## Risk 2: Schema Fingerprint Migration Risk

### Question

Can tidalDB safely add schema fingerprint persistence at `open()` time without breaking existing databases that were opened before the feature existed?

### TidalDB Context

The `Schema` struct (`/tidal/src/schema/validation/mod.rs`) contains:

- `signals: HashMap<String, SignalTypeDef>` -- signal names, decay params, windows, velocity config
- `embedding_slots: Vec<EmbeddingSlotDef>` -- vector dimension config
- `text_fields: Vec<TextFieldDef>` -- BM25 field config
- `creator_text_fields: Vec<TextFieldDef>` -- creator search fields
- `policies: HashMap<String, AgentPolicy>` -- session rate limiting

The fingerprint would hash signal names + decay parameters (the fields that affect storage layout and signal score interpretation). If an application opens a database with a different schema than was used to create it, signal scores become meaningless (wrong decay rates applied to stored data) and WAL replay produces incorrect results.

Currently there is no guard against this. `open_with_schema()` at `/tidal/src/db/open.rs` accepts any schema and proceeds.

### Proposed Behavior

```
open() time:
  1. Compute fingerprint = hash(sorted signal names + decay params)
  2. Read stored fingerprint from fjall (e.g., well-known key in items keyspace)
  3. Match:
     a. No stored fingerprint -> bootstrap: write fingerprint, succeed
     b. Stored fingerprint == computed -> succeed
     c. Stored fingerprint != computed -> return TidalError::SchemaMismatch
```

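The three-way match in step 3 can be expressed as a pure function, which keeps it testable without a database. This is a sketch under assumptions: `FingerprintAction`, `check_fingerprint`, and the standalone `SchemaMismatch` struct are hypothetical stand-ins for whatever `open_with_schema()` and `TidalError` would actually use, and the 32-byte fingerprint width is taken from the hash sizes discussed later in this document.

```rust
/// What `open()` should do after comparing fingerprints.
#[derive(Debug, PartialEq, Eq)]
pub enum FingerprintAction {
    /// No fingerprint stored yet: persist the computed one (bootstrap).
    Bootstrap,
    /// Stored fingerprint matches: nothing to write.
    Proceed,
}

/// Stand-in for a `TidalError::SchemaMismatch` variant, carrying hex strings.
#[derive(Debug, PartialEq, Eq)]
pub struct SchemaMismatch {
    pub stored: String,
    pub computed: String,
}

/// Lowercase hex encoding, two characters per byte.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Decide between bootstrap, proceed, and mismatch without touching storage.
pub fn check_fingerprint(
    stored: Option<&[u8; 32]>,
    computed: &[u8; 32],
) -> Result<FingerprintAction, SchemaMismatch> {
    match stored {
        None => Ok(FingerprintAction::Bootstrap),
        Some(s) if s == computed => Ok(FingerprintAction::Proceed),
        Some(s) => Err(SchemaMismatch {
            stored: hex(s),
            computed: hex(computed),
        }),
    }
}
```

Keeping the decision separate from the fjall read and write means the caller only performs I/O for the `Bootstrap` case.
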
### Analysis of Bootstrap Logic

#### Case 1: Brand-new database (first open ever)

No stored fingerprint. Write it. Succeed. This is correct -- there is no prior data to conflict with.

#### Case 2: Existing database, first open after feature addition

This is the migration risk. The database has data written with schema S1. The application opens with schema S2 (which may or may not equal S1). No stored fingerprint exists.

**If S1 == S2 (common case):** Bootstrap writes the fingerprint. All subsequent opens validate correctly. No problem.

**If S1 != S2 (the dangerous case the feature is supposed to prevent):** Bootstrap writes the WRONG fingerprint (S2's, not S1's). The data was written with S1, but the fingerprint now says S2. The database is silently corrupted -- not by the fingerprint feature, but by the schema mismatch that already existed before the feature was added.

**Verdict:** The bootstrap case cannot distinguish "first open with this schema" from "schema was changed." This is inherent -- without a stored fingerprint, there is no ground truth to compare against. The bootstrap behavior is **correct and safe** because:

1. If the schema matches, writing the fingerprint is harmless and enables future protection.
2. If the schema does not match, the data was already corrupted before this feature existed. The fingerprint does not make it worse -- it just fails to detect the pre-existing problem.
3. The alternative (refusing to open when no fingerprint exists) would break every existing database on the first upgrade. That is worse.

#### Case 3: Subsequent opens with matching schema

Stored fingerprint matches computed fingerprint. Succeed. This is the steady-state happy path.

#### Case 4: Subsequent opens with mismatched schema

Stored fingerprint does not match. Return `TidalError::SchemaMismatch`. This is the feature's purpose -- preventing silent corruption.

### Edge Cases

#### Edge Case 1: Schema additions (adding new signal types)

Adding a new signal type (e.g., `"share"`) changes the fingerprint. The open would fail with `SchemaMismatch`. This is **correct behavior** -- the application must decide whether the existing data is compatible with the new schema. Options:

- **Force open:** A builder method like `.allow_schema_migration()` could skip the check and overwrite the fingerprint. The application takes responsibility.
- **Migration tool:** A CLI command that validates compatibility and updates the fingerprint.

For tidalDB's workload, adding a signal type is backward-compatible (existing data is unaffected; the new signal starts empty). But removing or changing a signal type is NOT backward-compatible (existing scores become meaningless). The fingerprint feature intentionally blocks both; the migration tool should validate the specific change.

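The distinction a migration tool would need to draw, additive versus breaking, can be sketched as a comparison of the old and new signal maps. `SignalDef` here is a hypothetical reduction of `SignalTypeDef` to a single half-life field, and `classify` and `SchemaChange` are illustrative names rather than an existing API.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical reduced signal definition: only the decay half-life.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SignalDef {
    pub half_life: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SchemaChange {
    Identical,
    /// Only new signals were added (names sorted); existing data is unaffected.
    Additive(Vec<String>),
    /// Existing signals were removed or redefined (names sorted).
    Breaking(Vec<String>),
}

pub fn classify(
    old: &HashMap<String, SignalDef>,
    new: &HashMap<String, SignalDef>,
) -> SchemaChange {
    // Any old signal that is missing or has different parameters is breaking.
    let mut broken: Vec<String> = old
        .iter()
        .filter(|(name, def)| new.get(*name) != Some(*def))
        .map(|(name, _)| name.clone())
        .collect();
    if !broken.is_empty() {
        broken.sort();
        return SchemaChange::Breaking(broken);
    }
    // All old signals survived unchanged; whatever remains is purely new.
    let mut added: Vec<String> = new
        .keys()
        .filter(|name| !old.contains_key(*name))
        .cloned()
        .collect();
    if added.is_empty() {
        SchemaChange::Identical
    } else {
        added.sort();
        SchemaChange::Additive(added)
    }
}
```

Under this sketch, a tool would update the stored fingerprint only for `Additive` results and refuse `Breaking` ones.
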
#### Edge Case 2: HashMap iteration order

`Schema.signals` is a `HashMap<String, SignalTypeDef>`. HashMap iteration order is non-deterministic. The fingerprint hash MUST sort signals by name before hashing, or the same schema will produce different fingerprints across runs.

**Implementation requirement:** Sort signal names alphabetically, then hash `(name, decay_type, half_life_nanos, windows_sorted, velocity_enabled)` tuples in order.

#### Edge Case 3: Floating-point decay parameters

Decay lambda is computed from `half_life` as `ln(2) / half_life_secs`. Floating-point equality is not reflexive for NaN, but lambda is always a valid positive f64. However, hashing `f64` directly is problematic (`f64` does not implement `Hash`).

**Solution:** Hash the `half_life` duration in nanoseconds (a `u128`), not the computed lambda. This avoids floating-point comparison issues entirely and hashes the user's declared intent, not a derived value.

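Both requirements, sorted iteration and integer half-life hashing, are illustrated in the sketch below. It makes several assumptions: `SignalDef` is a hypothetical reduction of `SignalTypeDef` (the decay type field is omitted), the result is a 64-bit value rather than the 32-byte digest proposed for production, and FNV-1a is a std-only stand-in for BLAKE3 or SHA-256. `std::hash::DefaultHasher` is avoided because its algorithm is not guaranteed to stay stable across Rust releases, which matters for a persisted value.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical reduced signal definition; the real `SignalTypeDef` has more fields.
pub struct SignalDef {
    pub half_life: Duration,
    pub windows: Vec<Duration>,
    pub velocity_enabled: bool,
}

/// FNV-1a 64-bit, used only as a dependency-free stand-in for BLAKE3 or SHA-256.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325)
    }

    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
}

pub fn fingerprint(signals: &HashMap<String, SignalDef>) -> u64 {
    // Sort by name so HashMap iteration order cannot affect the result.
    let mut names: Vec<&String> = signals.keys().collect();
    names.sort();

    let mut h = Fnv1a::new();
    for name in names {
        let def = &signals[name];
        // Length-prefix the name so adjacent fields cannot run together.
        h.write(&(name.len() as u64).to_le_bytes());
        h.write(name.as_bytes());
        // Hash the declared half-life as integer nanoseconds, not the derived f64 lambda.
        h.write(&def.half_life.as_nanos().to_le_bytes());
        // Sort windows so declaration order does not change the fingerprint.
        let mut windows: Vec<u128> = def.windows.iter().map(Duration::as_nanos).collect();
        windows.sort_unstable();
        h.write(&(windows.len() as u64).to_le_bytes());
        for w in windows {
            h.write(&w.to_le_bytes());
        }
        h.write(&[u8::from(def.velocity_enabled)]);
    }
    h.0
}
```

Fixed little-endian encoding is used throughout so the same schema yields the same bytes on every platform.
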
#### Edge Case 4: Ephemeral mode

Ephemeral databases have no durable storage. Fingerprint persistence is meaningless. Skip the check entirely for `StorageMode::Ephemeral`.

#### Edge Case 5: Concurrent opens

If two processes open the same data directory simultaneously (which tidalDB does not currently support, but fjall does not prevent), they could race on the fingerprint write. This is not a new problem -- concurrent opens without coordination are already unsafe.

#### Edge Case 6: Schema fingerprint storage location

The fingerprint should be stored at a well-known key in the items keyspace, using a dedicated tag or a sentinel entity ID. Options:

- **Option A: Sentinel entity ID 0 with Tag::Meta** -- `[0x00..00][0x00][0x03]["schema_fingerprint"]`
  - Pro: Uses existing key encoding; entity ID 0 is reserved (real entities start at 1+)
  - Con: Occupies the entity ID 0 namespace

- **Option B: New Tag::SchemaFingerprint = 0x0D** -- `[0x00..00][0x00][0x0D]`
  - Pro: Clean separation; easy to locate via prefix scan
  - Con: New tag value (minor, well-understood extension)

**Recommendation:** Option B. A dedicated tag is cleaner and avoids ambiguity about entity ID 0.

### Production Precedent

| System | Schema Versioning Approach | Bootstrap Behavior |
|--------|---------------------------|-------------------|
| **DuckDB** | Storage format version in file header | Refuses to open if version mismatch; provides `EXPORT DATABASE` migration path |
| **SQLite** | `user_version` pragma (application-managed) | Application sets version; no built-in schema hash |
| **RocksDB** | No schema concept (KV store) | N/A |
| **MongoDB** | `schemaVersion` field in documents | Application-level; "Schema Versioning Pattern" adds version per document |
| **Flyway/Liquibase** | Migration history table | First run creates history table (bootstrap); subsequent runs compare |

The "first run writes, subsequent runs compare" pattern is standard across migration frameworks. The bootstrap-then-validate approach is well-established.

### Recommendation

**Implement the bootstrap logic as proposed.** It is safe and follows production precedent.

Implementation checklist:

1. Add `Tag::SchemaFingerprint = 0x0D` to `/tidal/src/storage/keys.rs`
2. Implement `Schema::fingerprint() -> [u8; 32]` that:
   - Sorts signal names alphabetically
   - For each signal: hashes `(name, decay_type, half_life_nanos, windows_sorted, velocity_enabled)`
   - Uses BLAKE3 or SHA-256 (BLAKE3 preferred for speed; already in the Rust ecosystem)
3. In `open_with_schema()` (persistent mode only):
   - Read key `[0x00..00][0x00][0x0D]` from items keyspace
   - If absent: write fingerprint, log "schema fingerprint initialized", succeed
   - If present and matches: succeed
   - If present and mismatches: return `TidalError::SchemaMismatch { stored: hex, computed: hex }`
4. Add `SchemaMismatch` variant to `TidalError`
5. Skip entirely for `StorageMode::Ephemeral`

### Open Questions

1. **What fields to include in the fingerprint?** Signal names + decay params are critical because they affect score interpretation. Should embedding slot dimensions and text field definitions also be included? Adding a new text field is backward-compatible, but changing dimensions is not. Recommendation: include signal fields + embedding dimensions. Exclude text fields and policies (additive changes to these are safe).

2. **Force-open escape hatch:** Should `TidalDbBuilder` expose `.allow_schema_migration()` from day one? This is useful for development but dangerous in production. Recommendation: defer it until the first user needs it. When it is added, log a WARN-level message every time it is used.

3. **Migration tool:** A future `tidalctl schema migrate` command should compare old and new schemas, validate that the change is backward-compatible (additions only, no decay parameter changes), and update the fingerprint. This is post-MVP.

---

## Summary

### fjall backup: Use quiesce + file copy

fjall 3.0.2 has **no backup API** ([issue #52](https://github.com/fjall-rs/fjall/issues/52) open and blocked). The safe procedure is: pause writes, flush all engines (`rotate_memtable_and_wait` + `persist(SyncAll)` + Tantivy flush + WAL checkpoint), copy the entire data directory, resume writes. Estimated write pause: 300ms + file copy time. When fjall ships its backup API, switch to it for hard-link support.

### Schema fingerprint: Safe to implement with bootstrap logic

The "no fingerprint -> write and succeed" bootstrap is correct and follows production precedent (Flyway and Liquibase for bootstrap-then-compare; DuckDB for refusing mismatched opens). It cannot detect schema mismatches that predate the feature, but this is inherent -- the feature prevents future mismatches, not past ones. Key implementation details: sort signals before hashing, hash half_life nanos (not lambda), use a dedicated `Tag::SchemaFingerprint`, skip for ephemeral mode.

## Sources

- [fjall docs.rs -- Database struct](https://docs.rs/fjall/latest/fjall/struct.Database.html)
- [fjall docs.rs -- Keyspace struct](https://docs.rs/fjall/latest/fjall/struct.Keyspace.html)
- [fjall docs.rs -- Snapshot struct](https://docs.rs/fjall/latest/fjall/struct.Snapshot.html)
- [fjall docs.rs -- PersistMode enum](https://docs.rs/fjall/latest/fjall/enum.PersistMode.html)
- [fjall GitHub issue #52: Backup using Checkpointing](https://github.com/fjall-rs/fjall/issues/52) -- open, blocked
- [fjall keyspace source: rotate_memtable_and_wait](https://github.com/fjall-rs/fjall/blob/main/src/keyspace/mod.rs) -- public, annotated "NOTE: Used in tests"
- [fjall 3.0 release blog post](https://fjall-rs.github.io/post/fjall-3/) -- confirms checkpoint is "looking ahead," not shipped
- [RocksDB Checkpoints wiki](https://github.com/facebook/rocksdb/wiki/Checkpoints) -- hard-link SSTs, copy MANIFEST, consistent cross-CF
- [RocksDB Checkpoint blog post, 2015](https://rocksdb.org/blog/2015/11/10/use-checkpoints-for-efficient-snapshots.html)
- [SQLite Online Backup API](https://sqlite.org/backup.html) -- sqlite3_backup_init/step/finish
- [DuckDB Storage Versions](https://duckdb.org/docs/stable/internals/storage) -- version in file header, refuses mismatched opens
- [MongoDB Schema Versioning Pattern](https://www.mongodb.com/blog/post/building-with-patterns-the-schema-versioning-pattern)
- tidalDB source: `/tidal/src/storage/fjall.rs` -- FjallStorage, FjallBackend, flush_all()
- tidalDB source: `/tidal/src/db/mod.rs` -- TidalDb struct, shutdown_inner(), data surface
- tidalDB source: `/tidal/src/db/open.rs` -- open_with_schema(), the integration point for fingerprint check
- tidalDB source: `/tidal/src/db/paths.rs` -- directory layout: wal, items, users, creators, cache
- tidalDB source: `/tidal/src/schema/validation/mod.rs` -- Schema struct, signals HashMap
- tidalDB source: `/tidal/src/storage/keys.rs` -- Tag enum, key encoding format