jordan 213b8efcca feat: complete M6-M7 + Enterprise Readiness milestones; split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-23 22:41:16 -07:00


Recovery Guide

This document covers error scenarios, their causes, data at risk, and step-by-step recovery procedures for tidalDB.

Error Scenarios

1. `StorageError::Corruption` on open

Error message: storage error: data corruption: <details>

Cause: fjall data files are corrupted. This can happen due to bit rot, partial writes from a hard power loss (no UPS), or physical disk failure. fjall uses checksums on its SST (sorted string table) files, so corruption is detected rather than silently returning wrong data.

Data at risk: All item metadata and relationships stored in the corrupt keyspace since the last backup.

Recovery steps:

1. Identify which engine is corrupt. tidalDB uses three fjall keyspaces: items/, users/, creators/. The error message will typically include the path or keyspace name. Check the log output for the specific directory.
2. If you have a backup:
   - Stop the process.
   - Replace the corrupt engine directory (e.g., {data_dir}/items/) with the corresponding directory from the backup.
   - Reopen tidalDB. The WAL replay (recover()) will restore signal history from the backup's checkpoint timestamp forward. Any signals between the backup and the corruption event are preserved in WAL segments.
   - Verify data integrity by running queries against known entities.
3. If you have no backup:
   - Stop the process.
   - Delete the corrupt engine directory (e.g., rm -rf {data_dir}/items/).
   - Reopen tidalDB. It will start fresh for that engine -- all metadata and relationships in that keyspace are lost.
   - Signal state in the WAL is independent of fjall and is recoverable. WAL replay will restore signal scores.
   - You will need to re-ingest item metadata and embeddings from your upstream data source.
If corruption is in the items/ keyspace specifically:
- Signal checkpoints (Tag::Sig) are stored in the items keyspace. Losing this keyspace means the signal ledger falls back to full WAL replay (slower startup but no data loss for signals).
- Collection definitions (Tag::Collection), cohort definitions (Tag::CohortDef), and co-engagement data are also in items. These will be lost.

2. `WalError::Corruption` on open

Error message: WAL corruption: <details> (surfaced as TidalError::Durability)

Cause: A WAL segment was partially written when the process crashed (e.g., SIGKILL during a signal write). The BLAKE3 checksum on the partial entry does not match, so the WAL reader detects corruption.

Data at risk: Signals written after the last successful WAL entry in the corrupt segment. In practice, this is at most one signal event (the one that was mid-write when the crash occurred).

Automatic recovery: tidalDB's crash recovery (recover()) automatically truncates the corrupt tail of the WAL and continues. The truncated entry is logged at WARN level. No manual action is needed unless open() continues to fail after the automatic recovery attempt.

Manual recovery (if automatic recovery fails):

1. List WAL segment files: ls -la {data_dir}/wal/
2. Identify the segment with the highest sequence number in its filename (e.g., segment_000042.wal).
3. Delete that single file: rm {data_dir}/wal/segment_000042.wal
4. Reopen tidalDB. It will replay from the remaining segments up to the last complete entry.
The signals in the deleted segment that were written after the last checkpoint are lost. Signals before the checkpoint are already materialized in fjall and are safe.
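
Picking the segment with the highest sequence number by eye is error-prone when the directory holds many files. The following is a minimal sketch of that selection step, assuming segment files are named exactly like the example above (`segment_` followed by a decimal sequence number and `.wal`). The helper names `segment_sequence` and `highest_segment` are hypothetical and are not part of tidalDB.

```rust
/// Parse the sequence number from a WAL segment file name such as
/// `segment_000042.wal`. Returns `None` for any name that does not match
/// the assumed `segment_<digits>.wal` pattern.
fn segment_sequence(file_name: &str) -> Option<u64> {
    file_name
        .strip_prefix("segment_")?
        .strip_suffix(".wal")?
        .parse::<u64>()
        .ok()
}

/// Return the file name carrying the highest sequence number, ignoring
/// names that are not WAL segments. Returns `None` if no segment is present.
fn highest_segment<'a>(file_names: &[&'a str]) -> Option<&'a str> {
    file_names
        .iter()
        .copied()
        .filter_map(|name| segment_sequence(name).map(|seq| (seq, name)))
        .max_by_key(|(seq, _)| *seq)
        .map(|(_, name)| name)
}
```

Feeding the function the names returned by a directory listing of {data_dir}/wal/ yields the single file that step 3 deletes. Comparing parsed numbers rather than strings keeps the result correct even if the zero padding ever changes width.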

Prevention: tidalDB uses fsync on WAL segment rotation and BLAKE3 checksums on every entry. The only scenario where corruption occurs is a hard crash (SIGKILL, power loss) during the write of a single entry. Clean shutdowns (db.close()) always leave the WAL in a consistent state.

3. `TidalError::Config(DataDirLocked)` on open

Error message: config error: data directory is already open by another process: <path>

Cause: Another process has acquired the advisory lock on {data_dir}/tidaldb.lock. tidalDB uses file locking to prevent two processes from opening the same data directory simultaneously, which would cause data corruption.

Data at risk: None. This error is protective -- no data is accessed or modified.

Recovery:

1. Find the other process: ps aux | grep <your_binary_name> or lsof {data_dir}/tidaldb.lock
2. If the other process is a legitimate tidalDB instance, stop it cleanly (send SIGTERM and wait for graceful shutdown).
3. If the other process crashed and left a stale lock file:
   - Verify no process is actually using the directory: lsof +D {data_dir}
   - Delete the lock file: rm {data_dir}/tidaldb.lock
   - Reopen tidalDB.

Note: The lock file (tidaldb.lock) is an advisory lock, so it only protects the data directory if every process respects it. Deleting it while another process is still running removes that protection and can lead to corruption. Always verify no process is running before deleting.

4. `TidalError::Schema(UnknownSignalType)` or schema fingerprint mismatch on open

Error message: unknown signal type: '<name>' or schema fingerprint mismatch

Cause: The application's schema definition has changed since the database was created. Signal decay parameters, signal names, or embedding slot dimensions differ from what was used when the data was written.

Data at risk: None. This is a protective error. No data is modified.

Recovery options:

1. Revert the schema to match the one used when the database was created. This is the safest option if the schema change was unintentional.
2. Add the new signal type alongside the old one. If you are adding a new signal (e.g., "share"), keep all existing signal definitions unchanged and add the new one. Existing signal data is unaffected. Note: this changes the schema fingerprint, so you may need to use a migration path when fingerprint validation is enforced.
3. Start fresh. Delete {data_dir} entirely and reopen with the new schema. All existing data is lost. Re-ingest from your upstream data source.

What you must NOT do: Change decay parameters (half_life) on an existing signal type and force the database open. The existing decay scores were computed with the old half_life. Applying a different decay rate to historical scores produces mathematically incorrect results. If you need a different decay rate, define a new signal type with the new parameters and let the old signal data age out naturally.
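
A small worked example makes the problem concrete. The sketch below assumes plain exponential decay, where a score halves once per half-life. The exact formula tidalDB uses is not stated here, so treat `decayed` as an illustrative stand-in rather than the engine's implementation.

```rust
/// Exponential decay under the assumed model: the score halves every
/// `half_life_secs` seconds.
fn decayed(score: f64, elapsed_secs: f64, half_life_secs: f64) -> f64 {
    score * 0.5_f64.powf(elapsed_secs / half_life_secs)
}

/// Score obtained when a value is decayed with `old_half_life` up to a
/// checkpoint and then with `new_half_life` afterwards, which is what
/// forcing the database open with a changed half_life amounts to.
fn mixed_half_life_score(
    initial: f64,
    secs_before_checkpoint: f64,
    secs_after_checkpoint: f64,
    old_half_life: f64,
    new_half_life: f64,
) -> f64 {
    let stored = decayed(initial, secs_before_checkpoint, old_half_life);
    decayed(stored, secs_after_checkpoint, new_half_life)
}
```

Take a signal of weight 1.0 that is 14 days old, checkpointed at day 7. With a 7-day half-life throughout, the score is 0.25. With a 14-day half-life throughout, it is 0.5. Switching from 7 to 14 days at the checkpoint gives about 0.354, which matches neither definition.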

5. Disk full during operation

Symptoms: Signal writes return TidalError::Durability(...). Metadata writes return TidalError::Storage(StorageError::Io(...)). Checkpoint thread logs errors. tidaldb_checkpoint_failures_total increments.

Data at risk: Signals written after the last successful checkpoint. In-memory state remains correct and readable -- queries continue to work against the hot tier.

Recovery:

1. Free disk space. Priorities:
   - Delete old WAL segments that predate the last checkpoint. Check {data_dir}/wal/ for segments with low sequence numbers. The checkpoint thread compacts these automatically, but if checkpointing itself failed due to disk pressure, compaction may be stuck.
   - If you have a recent backup, you can delete all WAL segments and let the database start from the fjall checkpoint on next open. Signals written after that checkpoint will not be replayed.
   - Clear temporary files, logs, or other non-tidalDB data from the volume.
2. Once disk space is available, signal writes resume automatically. The WAL writer thread retries on the next signal event. No restart is needed.
3. Verify recovery: check that tidaldb_checkpoint_failures_total stops incrementing and tidaldb_checkpoint_age_seconds returns to < 60 seconds.

Prevention: Monitor disk usage on the data volume and alert when it reaches 80% of capacity. Also watch tidaldb_wal_lag_bytes, because the WAL is the fastest-growing component. At 5M signals/day, the WAL grows ~200 MB/day before compaction.
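
The growth figure above works out to roughly 40 bytes of WAL per signal (200 MB / 5M signals, taking 1 MB as 10^6 bytes). The sketch below turns that arithmetic into a rough headroom estimate. It assumes constant growth and no compaction, so it is a pessimistic bound rather than a forecast, and both function names are hypothetical.

```rust
/// Average WAL bytes per signal implied by a daily growth figure.
/// Returns `None` if `signals_per_day` is zero.
fn wal_bytes_per_signal(daily_growth_bytes: u64, signals_per_day: u64) -> Option<u64> {
    daily_growth_bytes.checked_div(signals_per_day)
}

/// Whole days until usage reaches `threshold_pct` percent of capacity,
/// assuming constant growth and no compaction. Returns `Some(0)` if the
/// threshold is already crossed and `None` if growth is zero.
fn days_until_threshold(
    capacity_bytes: u64,
    used_bytes: u64,
    daily_growth_bytes: u64,
    threshold_pct: u64,
) -> Option<u64> {
    let threshold = capacity_bytes / 100 * threshold_pct;
    threshold
        .saturating_sub(used_bytes)
        .checked_div(daily_growth_bytes)
}
```

For example, a 100 GB volume with 60 GB already used and 200 MB/day of growth reaches the 80% mark in 100 days under these assumptions.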

6. `TidalError::Backpressure` during signal writes

Error message: backpressure: WAL queue full, retry after <N>ms

Cause: The WAL writer thread's channel is full. This means signal writes are arriving faster than the WAL can persist them to disk. This is NOT a data loss event -- the signal was never enqueued, so it can be safely retried.

Recovery: Retry the signal write after the suggested delay (retry_after_ms). If backpressure is sustained:

Check disk I/O latency (WAL writes are fsync-bound).
Check if another process is competing for disk bandwidth.
Consider faster storage (NVMe).

7. `TidalError::RateLimited` during session signal writes

Error message: rate limited: agent '<id>' at <limit> signals/sec, retry after <N>ms

Cause: An agent has exceeded its configured rate limit for the current session. The signal was NOT written.

Recovery: Back off and retry after retry_after_ms. If rate limiting is too aggressive, adjust the agent's rate limit in the schema's AgentPolicy configuration.
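
Scenarios 6 and 7 share the same client-side handling: the signal was not written, and the error carries a suggested delay. The following is a minimal retry sketch under that assumption. `WriteError` is a hypothetical stand-in for the relevant `TidalError` variants, whose real field layout is not shown in this guide, and `write_with_retry` is not a tidalDB API.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the retryable errors described above.
#[derive(Debug, PartialEq)]
enum WriteError {
    Backpressure { retry_after_ms: u64 },
    RateLimited { retry_after_ms: u64 },
    Other(String),
}

/// Call `write` until it succeeds, a non-retryable error is returned, or
/// `max_attempts` calls have been made. Sleeps for the delay suggested by
/// the error between attempts. Returns the number of attempts on success.
fn write_with_retry<F>(mut write: F, max_attempts: u32) -> Result<u32, WriteError>
where
    F: FnMut() -> Result<(), WriteError>,
{
    let mut attempt = 0;
    loop {
        attempt += 1;
        match write() {
            Ok(()) => return Ok(attempt),
            // Retryable: the signal was never enqueued, so resending is safe.
            Err(WriteError::Backpressure { retry_after_ms })
            | Err(WriteError::RateLimited { retry_after_ms })
                if attempt < max_attempts =>
            {
                sleep(Duration::from_millis(retry_after_ms));
            }
            // Out of attempts, or an error that retrying will not fix.
            Err(e) => return Err(e),
        }
    }
}
```

Capping the number of attempts matters here: sustained backpressure points at a disk or rate-limit configuration problem, and an unbounded retry loop would hide it.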

Safe Files to Delete

| File/Directory | Safe to delete? | Notes |
|---|---|---|
| `tidaldb.lock` | Yes, if no process is running | Advisory lock file. Auto-recreated on next open. Verify with `lsof` first. |
| `wal/segment_*.wal` | Only segments with sequence numbers below the last checkpoint | Never delete the segment with the highest sequence number. To find the checkpoint sequence, check the last `tidaldb-checkpoint` log entry. |
| `items/` | NO, not without a backup | Primary fjall keyspace. Contains item metadata, signal checkpoints, collections, cohort definitions, co-engagement data. |
| `users/` | NO, not without a backup | Contains user relationship edges (follows, blocks, hides, interaction weights). |
| `creators/` | NO, not without a backup | Contains creator metadata and embeddings. |
| `text_index/` | Yes | Tantivy item text index. Rebuilt automatically from item metadata on next open. Rebuild cost is proportional to item count. |
| `creator_text_index/` | Yes | Tantivy creator text index. Same as above but for creators. |
| `cache/` | Yes | Temporary cache directory. Safe to delete at any time. |
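
The rule for `wal/segment_*.wal` combines two conditions, and it is easy to apply only one of them. The sketch below encodes both, assuming "below the last checkpoint" means a sequence number strictly less than the checkpoint sequence read from the log. The function name is hypothetical.

```rust
/// Given the sequence numbers of the segments present in `wal/` and the
/// sequence number of the last checkpoint, return the segments that may be
/// deleted: strictly below the checkpoint, and never the highest segment.
fn deletable_segments(present: &[u64], checkpoint_seq: u64) -> Vec<u64> {
    let highest = present.iter().copied().max();
    let mut deletable: Vec<u64> = present
        .iter()
        .copied()
        .filter(|&seq| seq < checkpoint_seq && Some(seq) != highest)
        .collect();
    deletable.sort_unstable();
    deletable
}
```

With segments 40 through 43 on disk and a checkpoint at 42, only 40 and 41 qualify. If the checkpoint sequence is above every segment on disk, the highest segment is still kept.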

Backup and Restore

tidalDB's underlying storage engine (fjall 3.x) does not yet expose a native backup API (fjall issue #52). Until that ships, the recommended backup procedure is quiesce-and-copy.

Creating a Backup

1. Stop writes to the database (either shut down the process or use an application-level write pause).
2. Flush all state:
   - Call db.close() for a clean shutdown, which checkpoints the signal ledger, flushes fjall, and writes a WAL checkpoint marker. OR:
   - If keeping the process running: flush text indexes, then wait for the next checkpoint cycle (30 seconds).
3. Copy the entire data directory recursively:
   ```
   cp -R {data_dir} {backup_dest}
   ```
   This copies: items/, users/, creators/, wal/, text_index/, creator_text_index/, and tidaldb.lock.
4. Resume writes (restart the process or unpause).

Do NOT copy while the database is actively writing without a quiesce step. Partial fjall SST files or WAL segments will produce corruption on restore.
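
Where the backup is driven from application code instead of a shell, the copy step can be sketched with the Rust standard library alone. This is an illustration of the quiesce-and-copy approach described above, not a tidalDB API. `copy_dir_recursive` is a hypothetical helper, it must only be called after writes are stopped and state is flushed, and it skips symlinks for simplicity.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copy every regular file under `src` into `dst`, creating
/// directories as needed. Returns the number of files copied.
/// Symlinks and other special files are skipped.
fn copy_dir_recursive(src: &Path, dst: &Path) -> io::Result<u64> {
    fs::create_dir_all(dst)?;
    let mut copied = 0;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            // Descend into keyspace, WAL, and index subdirectories.
            copied += copy_dir_recursive(&entry.path(), &target)?;
        } else if file_type.is_file() {
            fs::copy(entry.path(), &target)?;
            copied += 1;
        }
    }
    Ok(copied)
}
```

Like `cp -R`, this copies tidaldb.lock along with everything else, so the restored directory still needs its stale lock file removed before it is opened.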

Restoring from Backup

1. Stop the current tidalDB process if running.
2. Delete or move the current {data_dir}.
3. Copy the backup into place: cp -R {backup_source} {data_dir}
4. Remove the stale lock file: rm {data_dir}/tidaldb.lock
5. Open tidalDB with the same schema. WAL replay will recover any signal events written between the backup's checkpoint and the end of the backup's WAL.

What Survives a Crash Without Backup

| Component | Survives unclean shutdown? | Recovery mechanism |
|---|---|---|
| Item metadata | Yes | Stored in fjall (durable) |
| Relationships | Yes | Stored in fjall (durable) |
| Signal decay scores | Yes (up to last checkpoint + WAL tail) | Checkpoint + WAL replay |
| Windowed counts | Approximately | Checkpoint stores state; WAL replay re-applies events |
| Active sessions | Yes | Session open/signal events in WAL are replayed; sessions restored as active |
| Bitmap/range indexes | Rebuilt on open | Scanned from fjall metadata |
| USearch vectors | Yes | Loaded from saved `.idx` files |
| Tantivy text index | Yes | Opened from on-disk segments |
| Collections | Yes | Rebuilt from fjall on open |
| Suggestions | Rebuilt on open | Scanned from fjall item metadata |
