# Recovery Guide

This document covers error scenarios, their causes, data at risk, and step-by-step recovery procedures for tidalDB.

---

## Error Scenarios

### 1. `StorageError::Corruption` on open

**Error message:** `storage error: data corruption: <details>`

**Cause:** fjall data files are corrupted. This can happen due to bit rot, partial writes from a hard power loss (no UPS), or physical disk failure. fjall uses checksums on its SST (sorted string table) files, so corruption is detected rather than silently returning wrong data.

**Data at risk:** All item metadata and relationships stored in the corrupt keyspace since the last backup.

**Recovery steps:**

1. **Identify which engine is corrupt.** tidalDB uses three fjall keyspaces: `items/`, `users/`, `creators/`. The error message will typically include the path or keyspace name. Check the log output for the specific directory.

2. **If you have a backup:**
   - Stop the process.
   - Replace the corrupt engine directory (e.g., `{data_dir}/items/`) with the corresponding directory from the backup.
   - Reopen tidalDB. The WAL replay (`recover()`) will restore signal history from the backup's checkpoint timestamp forward. Any signals between the backup and the corruption event are preserved in WAL segments.
   - Verify data integrity by running queries against known entities.

3. **If you have no backup:**
   - Stop the process.
   - Delete the corrupt engine directory (e.g., `rm -rf {data_dir}/items/`).
   - Reopen tidalDB. It will start fresh for that engine -- all metadata and relationships in that keyspace are lost.
   - Signal state in the WAL is independent of fjall and is recoverable. WAL replay will restore signal scores.
   - You will need to re-ingest item metadata and embeddings from your upstream data source.

4. **If corruption is in the `items/` keyspace specifically:**
   - Signal checkpoints (`Tag::Sig`) are stored in the items keyspace. Losing this keyspace means the signal ledger falls back to full WAL replay (slower startup but no data loss for signals).
   - Collection definitions (`Tag::Collection`), cohort definitions (`Tag::CohortDef`), and co-engagement data are also in items. These will be lost.

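
The "replace the corrupt engine directory" step above can be sketched as a small script. This is a minimal illustration, not part of tidalDB: the helper name `replace_engine_dir` and the quarantine location are assumptions, and it moves the corrupt directory aside instead of deleting it so the files remain available for inspection. Run something like this only after the process has stopped.

```python
import shutil
from pathlib import Path

# The three fjall keyspaces named in this guide.
ENGINES = ("items", "users", "creators")


def replace_engine_dir(data_dir: Path, backup_dir: Path, engine: str) -> Path:
    """Swap one corrupt engine directory for its copy from a backup.

    Hypothetical helper. Returns the path where the corrupt directory
    was quarantined (a sibling of data_dir, so it is not inside it).
    """
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    source = backup_dir / engine
    if not source.is_dir():
        raise FileNotFoundError(f"backup has no {engine}/ directory: {source}")

    quarantine = data_dir.with_name(f"{data_dir.name}-{engine}.corrupt")
    if quarantine.exists():
        raise FileExistsError(f"earlier quarantine still present: {quarantine}")

    target = data_dir / engine
    if target.exists():
        # Move rather than delete: the corrupt files may help diagnosis.
        target.rename(quarantine)
    shutil.copytree(source, target)
    return quarantine
```

After the swap, reopening tidalDB and querying known entities (the last two bullets of step 2) remain manual checks.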
### 2. `WalError::Corruption` on open

**Error message:** `WAL corruption: <details>` (surfaced as `TidalError::Durability`)

**Cause:** A WAL segment was partially written when the process crashed (e.g., SIGKILL during a signal write). The BLAKE3 checksum on the partial entry does not match, so the WAL reader detects corruption.

**Data at risk:** Signals written after the last successful WAL entry in the corrupt segment. In practice, this is at most one signal event (the one that was mid-write when the crash occurred).

**Automatic recovery:** tidalDB's crash recovery (`recover()`) automatically truncates the corrupt tail of the WAL and continues. The truncated entry is logged at WARN level. No manual action is needed unless `open()` continues to fail after the automatic recovery attempt.

**Manual recovery (if automatic recovery fails):**

1. List WAL segment files: `ls -la {data_dir}/wal/`
2. Identify the segment with the highest sequence number in its filename (e.g., `segment_000042.wal`).
3. Delete that single file: `rm {data_dir}/wal/segment_000042.wal`
4. Reopen tidalDB. It will replay from the remaining segments up to the last complete entry.
5. The signals in the deleted segment that were written after the last checkpoint are lost. Signals before the checkpoint are already materialized in fjall and are safe.

**Prevention:** tidalDB uses `fsync` on WAL segment rotation and BLAKE3 checksums on every entry. The only scenario where corruption occurs is a hard crash (SIGKILL, power loss) during the write of a single entry. Clean shutdowns (`db.close()`) always leave the WAL in a consistent state.

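
Steps 1 to 3 of the manual recovery can be scripted so that the newest segment is chosen by number, not by eye. The sketch below assumes segment files follow the `segment_<digits>.wal` pattern from the example filename; the helper names are hypothetical. It moves the file to a quarantine directory instead of deleting it, which is a more cautious variation on step 3.

```python
import re
from pathlib import Path
from typing import Optional

# Pattern inferred from the example filename segment_000042.wal (an assumption).
SEGMENT_RE = re.compile(r"^segment_(\d+)\.wal$")


def newest_wal_segment(wal_dir: Path) -> Optional[Path]:
    """Return the segment with the highest sequence number, or None."""
    best_seq, best_path = -1, None
    for path in wal_dir.iterdir():
        match = SEGMENT_RE.match(path.name)
        # Compare as integers so differing digit widths still sort correctly.
        if match and int(match.group(1)) > best_seq:
            best_seq, best_path = int(match.group(1)), path
    return best_path


def quarantine_newest_segment(wal_dir: Path, quarantine_dir: Path) -> Optional[Path]:
    """Move the newest segment out of the WAL directory; return its new path."""
    newest = newest_wal_segment(wal_dir)
    if newest is None:
        return None
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    destination = quarantine_dir / newest.name
    newest.rename(destination)
    return destination
```

Keeping the quarantined file until the database has reopened cleanly leaves a way back if the wrong segment was removed.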
### 3. `TidalError::Config(DataDirLocked)` on open

**Error message:** `config error: data directory is already open by another process: <path>`

**Cause:** Another process has acquired the advisory lock on `{data_dir}/tidaldb.lock`. tidalDB uses file locking to prevent two processes from opening the same data directory simultaneously, which would cause data corruption.

**Data at risk:** None. This error is protective -- no data is accessed or modified.

**Recovery:**

1. Find the other process: `ps aux | grep <your_binary_name>` or `lsof {data_dir}/tidaldb.lock`
2. If the other process is a legitimate tidalDB instance, stop it cleanly (send SIGTERM and wait for graceful shutdown).
3. If the other process crashed and left a stale lock file:
   - Verify no process is actually using the directory: `lsof +D {data_dir}`
   - Delete the lock file: `rm {data_dir}/tidaldb.lock`
   - Reopen tidalDB.

**Note:** The lock on `tidaldb.lock` is advisory: it only works if every process respects it. Deleting the file while another process is still running removes that protection and can lead to corruption. Always verify no process is running before deleting.

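
To see why deleting a held lock file is dangerous, the following sketch reproduces the effect with POSIX `flock`. It assumes tidalDB's lock behaves like an `flock`-style advisory lock, which this guide does not confirm, and the function names are illustrative only. It requires a Unix-like system (Python's `fcntl` module).

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking flock on path.

    Returns the open file (keep it open to hold the lock), or None if
    the lock is already held through another open file.
    """
    handle = open(path, "a+")
    try:
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def demo_lock_deletion():
    """Show that removing a held lock file lets a second opener in."""
    with tempfile.TemporaryDirectory() as data_dir:
        lock_path = os.path.join(data_dir, "tidaldb.lock")
        first = try_lock(lock_path)     # first opener holds the lock
        blocked = try_lock(lock_path)   # same file: refused
        os.remove(lock_path)            # lock file deleted while still held
        second = try_lock(lock_path)    # new file, separate lock: succeeds
        outcome = (first is not None, blocked is None, second is not None)
        for handle in (first, second):
            if handle is not None:
                handle.close()
        return outcome
```

In the demo, the second opener succeeds even though the first still holds its lock, because the two locks now sit on different files. That is the situation the note above warns about.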
### 4. `TidalError::Schema(UnknownSignalType)` or schema fingerprint mismatch on open

**Error message:** `unknown signal type: '<name>'` or schema fingerprint mismatch

**Cause:** The application's schema definition has changed since the database was created. Signal decay parameters, signal names, or embedding slot dimensions differ from what was used when the data was written.

**Data at risk:** None. This is a protective error. No data is modified.

**Recovery options:**

1. **Revert the schema** to match the one used when the database was created. This is the safest option if the schema change was unintentional.

2. **Add the new signal type alongside the old one.** If you are adding a new signal (e.g., `"share"`), keep all existing signal definitions unchanged and add the new one. Existing signal data is unaffected. Note: this changes the schema fingerprint, so you may need to use a migration path when fingerprint validation is enforced.

3. **Start fresh.** Delete `{data_dir}` entirely and reopen with the new schema. All existing data is lost. Re-ingest from your upstream data source.

**What you must NOT do:** Change decay parameters (`half_life`) on an existing signal type and force the database open. The existing decay scores were computed with the old `half_life`. Applying a different decay rate to historical scores produces mathematically incorrect results. If you need a different decay rate, define a new signal type with the new parameters and let the old signal data age out naturally.

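
A small worked example shows why mixing half-lives gives wrong scores. It assumes plain exponential decay, where a score halves every `half_life` hours; tidalDB's exact decay formula is not given in this guide, so treat the function below as an illustration of the principle and not the engine's implementation.

```python
def decayed(score: float, elapsed_hours: float, half_life_hours: float) -> float:
    """Exponential decay: the score halves every half_life_hours."""
    return score * 0.5 ** (elapsed_hours / half_life_hours)


# One event of weight 1.0 at t = 0, checkpointed at t = 24 h
# under the original 24 h half-life.
stored_at_checkpoint = decayed(1.0, 24, 24)

# The score at t = 48 h under three treatments:
consistent = decayed(stored_at_checkpoint, 24, 24)  # old half-life throughout
mixed = decayed(stored_at_checkpoint, 24, 12)       # stored score, new half-life
fresh = decayed(1.0, 48, 12)                        # new half-life from the start
```

The three results differ (0.25, 0.125, and 0.0625 here). The `mixed` value matches neither the old definition nor the new one, which is why a changed decay rate belongs on a new signal type.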
### 5. Disk full during operation

**Symptoms:** Signal writes return `TidalError::Durability(...)`. Metadata writes return `TidalError::Storage(StorageError::Io(...))`. The checkpoint thread logs errors, and `tidaldb_checkpoint_failures_total` increments.

**Data at risk:** Signals written after the last successful checkpoint. In-memory state remains correct and readable -- queries continue to work against the hot tier.

**Recovery:**

1. Free disk space. Priorities:
   - Delete old WAL segments that predate the last checkpoint. Check `{data_dir}/wal/` for segments with low sequence numbers. The checkpoint thread compacts these automatically, but if checkpointing itself failed due to disk pressure, compaction may be stuck.
   - If you have a recent backup, you can delete all WAL segments as a last resort and let the database start from the fjall checkpoint on next open. Signals written after the last successful checkpoint are lost in that case.
   - Clear temporary files, logs, or other non-tidalDB data from the volume.

2. Once disk space is available, signal writes resume automatically. The WAL writer thread retries on the next signal event. No restart is needed.

3. Verify recovery: check that `tidaldb_checkpoint_failures_total` stops incrementing and `tidaldb_checkpoint_age_seconds` drops back below 60 seconds.

**Prevention:** Monitor `tidaldb_wal_lag_bytes`, and alert when the volume reaches 80% of its capacity. The WAL is the fastest-growing component. At 5M signals/day, the WAL grows ~200 MB/day before compaction.

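
The growth figure above implies roughly 40 bytes of WAL per signal (200 MB / 5M signals, taking 1 MB as 10^6 bytes). The sketch below turns that into a rough runway estimate for capacity planning. The function names and the example volume size are illustrative assumptions, and the estimate ignores compaction and every other consumer of disk space.

```python
SIGNALS_PER_DAY = 5_000_000          # figure quoted in this guide
WAL_BYTES_PER_DAY = 200 * 1_000_000  # ~200 MB/day, with 1 MB = 10^6 bytes


def bytes_per_signal(wal_bytes_per_day: float, signals_per_day: float) -> float:
    """Average WAL bytes written per signal event."""
    return wal_bytes_per_day / signals_per_day


def days_until_threshold(capacity_bytes: float, used_bytes: float,
                         wal_bytes_per_day: float, threshold: float = 0.8) -> float:
    """Days of uncompacted WAL growth before usage crosses the alert threshold."""
    headroom = capacity_bytes * threshold - used_bytes
    return max(headroom, 0.0) / wal_bytes_per_day


# Hypothetical volume: 100 GB total, 60 GB already used.
per_signal = bytes_per_signal(WAL_BYTES_PER_DAY, SIGNALS_PER_DAY)
runway_days = days_until_threshold(100e9, 60e9, WAL_BYTES_PER_DAY)
```

With these example numbers the 80% alert would fire after about 100 days if the WAL were never compacted. A stuck checkpoint thread is the case where that worst-case figure becomes relevant.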
### 6. `TidalError::Backpressure` during signal writes

**Error message:** `backpressure: WAL queue full, retry after <N>ms`

**Cause:** The WAL writer thread's channel is full. This means signal writes are arriving faster than the WAL can persist them to disk. This is NOT a data loss event -- the signal was never enqueued, so it can be safely retried.

**Recovery:** Retry the signal write after the suggested delay (`retry_after_ms`). If backpressure is sustained:

- Check disk I/O latency (WAL writes are `fsync`-bound).
- Check if another process is competing for disk bandwidth.
- Consider faster storage (NVMe).

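
A caller-side retry loop that honors the suggested delay might look like the sketch below. `RetryableWriteError` is a hypothetical stand-in for an error carrying a `retry_after_ms` hint, and `write_signal` is any callable that performs the write; neither is tidalDB's actual client API. The attempt cap is an added assumption so that sustained backpressure surfaces as an error instead of an endless loop.

```python
import time


class RetryableWriteError(Exception):
    """Hypothetical stand-in for an error that carries a retry_after_ms hint."""

    def __init__(self, retry_after_ms: int):
        super().__init__(f"retry after {retry_after_ms}ms")
        self.retry_after_ms = retry_after_ms


def write_with_retry(write_signal, max_attempts: int = 5, sleep=time.sleep):
    """Call write_signal(), waiting for the suggested delay after each rejection.

    Re-raises the last error once max_attempts is exhausted so the caller
    can shed load or alert.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_signal()
        except RetryableWriteError as err:
            if attempt == max_attempts:
                raise
            # The rejected signal was never enqueued, so retrying is safe.
            sleep(err.retry_after_ms / 1000.0)
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.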
### 7. `TidalError::RateLimited` during session signal writes

**Error message:** `rate limited: agent '<id>' at <limit> signals/sec, retry after <N>ms`

**Cause:** An agent has exceeded its configured rate limit for the current session. The signal was NOT written.

**Recovery:** Back off and retry after `retry_after_ms`. If rate limiting is too aggressive, adjust the agent's rate limit in the schema's `AgentPolicy` configuration.

---

## Safe Files to Delete

| File/Directory | Safe to delete? | Notes |
|:---------------|:----------------|:------|
| `tidaldb.lock` | Yes, if no process is running | Advisory lock file. Auto-recreated on next open. Verify with `lsof` first. |
| `wal/segment_*.wal` | Only segments with sequence numbers below the last checkpoint | Never delete the segment with the highest sequence number. To find the checkpoint sequence, check the last `tidaldb-checkpoint` log entry. |
| `items/` | **NO**, not without a backup | Primary fjall keyspace. Contains item metadata, signal checkpoints, collections, cohort definitions, co-engagement data. |
| `users/` | **NO**, not without a backup | Contains user relationship edges (follows, blocks, hides, interaction weights). |
| `creators/` | **NO**, not without a backup | Contains creator metadata and embeddings. |
| `text_index/` | Yes | Tantivy item text index. Rebuilt automatically from item metadata on next open. Rebuild cost is proportional to item count. |
| `creator_text_index/` | Yes | Tantivy creator text index. Same as above but for creators. |
| `cache/` | Yes | Temporary cache directory. Safe to delete at any time. |

---

## Backup and Restore

tidalDB's underlying storage engine (fjall 3.x) does not yet expose a native backup API ([fjall issue #52](https://github.com/fjall-rs/fjall/issues/52)). Until that ships, the recommended backup procedure is quiesce-and-copy.

### Creating a Backup

1. **Stop writes** to the database (either shut down the process or use an application-level write pause).
2. **Flush all state**, using one of the following:
   - Call `db.close()` for a clean shutdown, which checkpoints the signal ledger, flushes fjall, and writes a WAL checkpoint marker.
   - If keeping the process running: flush text indexes, then wait for the next checkpoint cycle (30 seconds).
3. **Copy the entire data directory** recursively:
   ```
   cp -R {data_dir} {backup_dest}
   ```
   This copies: `items/`, `users/`, `creators/`, `wal/`, `text_index/`, `creator_text_index/`, and `tidaldb.lock`.
4. **Resume writes** (restart the process or unpause).

**Do NOT copy while the database is actively writing** without a quiesce step. Partial fjall SST files or WAL segments will produce corruption on restore.

### Restoring from Backup

1. Stop the current tidalDB process if running.
2. Delete or move the current `{data_dir}`.
3. Copy the backup into place: `cp -R {backup_source} {data_dir}`
4. Remove the stale lock file: `rm {data_dir}/tidaldb.lock`
5. Open tidalDB with the same schema. WAL replay will recover any signal events written between the backup's checkpoint and the end of the backup's WAL.

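
The copy and restore steps above can be wrapped in two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, quiescing writes and stopping the process are left to the caller, and the restore helper moves the current directory to a `.old` sibling (the "move" variant of step 2) instead of deleting it.

```python
import shutil
from pathlib import Path

LOCK_FILE = "tidaldb.lock"  # lock file name used throughout this guide


def create_backup(data_dir: Path, backup_dest: Path) -> None:
    """Copy the whole data directory. Call only after writes are quiesced."""
    # copytree raises if backup_dest already exists, which avoids merging
    # a new backup into an older one.
    shutil.copytree(data_dir, backup_dest)


def restore_backup(backup_source: Path, data_dir: Path) -> None:
    """Put a backup in place of data_dir, keeping the old directory aside."""
    if data_dir.exists():
        old = data_dir.with_name(data_dir.name + ".old")
        if old.exists():
            raise FileExistsError(f"remove the previous copy first: {old}")
        data_dir.rename(old)
    shutil.copytree(backup_source, data_dir)
    # The copied lock file is stale: it came from the backed-up instance.
    (data_dir / LOCK_FILE).unlink(missing_ok=True)
```

Unlike `cp -R` into an existing destination, `shutil.copytree` refuses to write into a directory that already exists, so each backup needs a fresh destination path.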
### What Survives a Crash Without Backup

| Component | Survives unclean shutdown? | Recovery mechanism |
|:----------|:--------------------------|:-------------------|
| Item metadata | Yes | Stored in fjall (durable) |
| Relationships | Yes | Stored in fjall (durable) |
| Signal decay scores | Yes (up to last checkpoint + WAL tail) | Checkpoint + WAL replay |
| Windowed counts | Approximately | Checkpoint stores state; WAL replay re-applies events |
| Active sessions | Yes | Session open/signal events in WAL are replayed; sessions restored as active |
| Bitmap/range indexes | Rebuilt on open | Scanned from fjall metadata |
| USearch vectors | Yes | Loaded from saved `.idx` files |
| Tantivy text index | Yes | Opened from on-disk segments |
| Collections | Yes | Rebuilt from fjall on open |
| Suggestions | Rebuilt on open | Scanned from fjall item metadata |