tidaldb/docs/ops/recovery.md
2026-02-23 22:41:16 -07:00

Recovery Guide

This document covers error scenarios, their causes, data at risk, and step-by-step recovery procedures for tidalDB.


Error Scenarios

1. StorageError::Corruption on open

Error message: storage error: data corruption: <details>

Cause: fjall data files are corrupted. This can happen due to bit rot, partial writes from a hard power loss (no UPS), or physical disk failure. fjall uses checksums on its SST (sorted string table) files, so corruption is detected rather than silently returning wrong data.

Data at risk: All item metadata and relationships stored in the corrupt keyspace. If you have a backup, only changes made since that backup are lost.

Recovery steps:

  1. Identify which engine is corrupt. tidalDB uses three fjall keyspaces: items/, users/, creators/. The error message will typically include the path or keyspace name. Check the log output for the specific directory.

  2. If you have a backup:

    • Stop the process.
    • Replace the corrupt engine directory (e.g., {data_dir}/items/) with the corresponding directory from the backup.
    • Reopen tidalDB. The WAL replay (recover()) will restore signal history from the backup's checkpoint timestamp forward. Any signals between the backup and the corruption event are preserved in WAL segments.
    • Verify data integrity by running queries against known entities.
  3. If you have no backup:

    • Stop the process.
    • Delete the corrupt engine directory (e.g., rm -rf {data_dir}/items/).
    • Reopen tidalDB. It will start fresh for that engine -- all metadata and relationships in that keyspace are lost.
    • Signal state in the WAL is independent of fjall and is recoverable. WAL replay will restore signal scores.
    • You will need to re-ingest item metadata and embeddings from your upstream data source.
  4. If corruption is in the items/ keyspace specifically:

    • Signal checkpoints (Tag::Sig) are stored in the items keyspace. Losing this keyspace means the signal ledger falls back to full WAL replay (slower startup but no data loss for signals).
    • Collection definitions (Tag::Collection), cohort definitions (Tag::CohortDef), and co-engagement data are also in items. These will be lost.

2. WalError::Corruption on open

Error message: WAL corruption: <details> (surfaced as TidalError::Durability)

Cause: A WAL segment was partially written when the process crashed (e.g., SIGKILL during a signal write). The BLAKE3 checksum on the partial entry does not match, so the WAL reader detects corruption.

Data at risk: Signals written after the last successful WAL entry in the corrupt segment. In practice, this is at most one signal event (the one that was mid-write when the crash occurred).

Automatic recovery: tidalDB's crash recovery (recover()) automatically truncates the corrupt tail of the WAL and continues. The truncated entry is logged at WARN level. No manual action is needed unless open() continues to fail after the automatic recovery attempt.

Manual recovery (if automatic recovery fails):

  1. List WAL segment files: ls -la {data_dir}/wal/
  2. Identify the segment with the highest sequence number in its filename (e.g., segment_000042.wal).
  3. Delete that single file: rm {data_dir}/wal/segment_000042.wal
  4. Reopen tidalDB. It will replay from the remaining segments up to the last complete entry.
  5. The signals in the deleted segment that were written after the last checkpoint are lost. Signals before the checkpoint are already materialized in fjall and are safe.
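Steps 1 to 3 can be scripted. The following is a minimal Rust sketch, not part of tidalDB. It assumes segment files are named segment_<digits>.wal, as in the example above, and the helper names (segment_seq, newest_segment, quarantine_newest) are hypothetical. It moves the newest segment into a separate directory instead of running rm, so the step can be undone if the wrong file was picked. Run it only while tidalDB is stopped.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Parse the sequence number from a name like `segment_000042.wal`.
/// The `segment_<digits>.wal` pattern is an assumption taken from the example above.
fn segment_seq(name: &str) -> Option<u64> {
    let digits = name.strip_prefix("segment_")?.strip_suffix(".wal")?;
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Return the sequence number and path of the newest segment in `wal_dir`, if any.
fn newest_segment(wal_dir: &Path) -> io::Result<Option<(u64, PathBuf)>> {
    let mut newest: Option<(u64, PathBuf)> = None;
    for entry in fs::read_dir(wal_dir)? {
        let entry = entry?;
        let name = entry.file_name();
        let Some(seq) = name.to_str().and_then(segment_seq) else {
            continue; // not a WAL segment, ignore it
        };
        if newest.as_ref().map_or(true, |(best, _)| seq > *best) {
            newest = Some((seq, entry.path()));
        }
    }
    Ok(newest)
}

/// Move the newest segment out of the WAL directory instead of deleting it.
/// Returns the new location, or `None` if the directory holds no segments.
fn quarantine_newest(wal_dir: &Path, quarantine_dir: &Path) -> io::Result<Option<PathBuf>> {
    let Some((_, path)) = newest_segment(wal_dir)? else {
        return Ok(None);
    };
    fs::create_dir_all(quarantine_dir)?;
    let dest = quarantine_dir.join(path.file_name().expect("segment has a file name"));
    fs::rename(&path, &dest)?;
    Ok(Some(dest))
}
```

fs::rename does not work across mount points, so the quarantine directory should sit on the same volume as {data_dir}. Once tidalDB opens cleanly, the quarantined file can be removed.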

Prevention: tidalDB uses fsync on WAL segment rotation and BLAKE3 checksums on every entry. The only scenario where corruption occurs is a hard crash (SIGKILL, power loss) during the write of a single entry. Clean shutdowns (db.close()) always leave the WAL in a consistent state.

3. TidalError::Config(DataDirLocked) on open

Error message: config error: data directory is already open by another process: <path>

Cause: Another process has acquired the advisory lock on {data_dir}/tidaldb.lock. tidalDB uses file locking to prevent two processes from opening the same data directory simultaneously, which would cause data corruption.

Data at risk: None. This error is protective -- no data is accessed or modified.

Recovery:

  1. Find the other process: ps aux | grep <your_binary_name> or lsof {data_dir}/tidaldb.lock
  2. If the other process is a legitimate tidalDB instance, stop it cleanly (send SIGTERM and wait for graceful shutdown).
  3. If the other process crashed and left a stale lock file:
    • Verify no process is actually using the directory: lsof +D {data_dir}
    • Delete the lock file: rm {data_dir}/tidaldb.lock
    • Reopen tidalDB.

Note: The lock file (tidaldb.lock) is an advisory lock. It protects the data directory only as long as every process honors it. Deleting it while another process still has the directory open removes that protection and can lead to corruption. Always verify no process is running before deleting.

4. TidalError::Schema(UnknownSignalType) or schema fingerprint mismatch on open

Error message: unknown signal type: '<name>' or schema fingerprint mismatch

Cause: The application's schema definition has changed since the database was created. Signal decay parameters, signal names, or embedding slot dimensions differ from what was used when the data was written.

Data at risk: None. This is a protective error. No data is modified.

Recovery options:

  1. Revert the schema to match the one used when the database was created. This is the safest option if the schema change was unintentional.

  2. Add the new signal type alongside the old one. If you are adding a new signal (e.g., "share"), keep all existing signal definitions unchanged and add the new one. Existing signal data is unaffected. Note: this changes the schema fingerprint, so you may need to use a migration path when fingerprint validation is enforced.

  3. Start fresh. Delete {data_dir} entirely and reopen with the new schema. All existing data is lost. Re-ingest from your upstream data source.

What you must NOT do: Change decay parameters (half_life) on an existing signal type and force the database open. The existing decay scores were computed with the old half_life. Applying a different decay rate to historical scores produces mathematically incorrect results. If you need a different decay rate, define a new signal type with the new parameters and let the old signal data age out naturally.
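A small worked example shows why. This is a sketch under one assumption: plain exponential decay in which a score halves every half_life time units. tidalDB's exact formula is not given in this guide, and the function names are illustrative only. One event of weight 1.0 is recorded at t = 0 and read at t = 20, and the half_life is changed from 10 to 5 at t = 10.

```rust
/// Exponential decay: the score halves every `half_life` time units.
/// Assumed formula for illustration; tidalDB's implementation may differ.
fn decayed(score: f64, elapsed: f64, half_life: f64) -> f64 {
    score * 0.5_f64.powf(elapsed / half_life)
}

/// Returns (old half-life throughout, new half-life throughout, switched at t = 10).
fn half_life_switch_example() -> (f64, f64, f64) {
    let (old_hl, new_hl) = (10.0, 5.0);
    let pure_old = decayed(1.0, 20.0, old_hl); // 0.25
    let pure_new = decayed(1.0, 20.0, new_hl); // 0.0625
    // Score stored at t = 10 under the old half-life, then decayed with the new one.
    let stored = decayed(1.0, 10.0, old_hl); // 0.5
    let switched = decayed(stored, 10.0, new_hl); // 0.125
    (pure_old, pure_new, switched)
}
```

The switched score (0.125) matches neither the old definition (0.25) nor the new one (0.0625), so it no longer corresponds to any single half_life.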

5. Disk full during operation

Symptoms: Signal writes return TidalError::Durability(...). Metadata writes return TidalError::Storage(StorageError::Io(...)). Checkpoint thread logs errors. tidaldb_checkpoint_failures_total increments.

Data at risk: Signals written after the last successful checkpoint. In-memory state remains correct and readable -- queries continue to work against the hot tier.

Recovery:

  1. Free disk space. Priorities:

    • Delete old WAL segments that predate the last checkpoint. Check {data_dir}/wal/ for segments with low sequence numbers. The checkpoint thread compacts these automatically, but if checkpointing itself failed due to disk pressure, compaction may be stuck.
    • If you have a recent backup, you can delete all WAL segments and let the database start from the fjall checkpoint on next open. Signals written after that checkpoint are lost.
    • Clear temporary files, logs, or other non-tidalDB data from the volume.
  2. Once disk space is available, signal writes resume automatically. The WAL writer thread retries on the next signal event. No restart is needed.

  3. Verify recovery: check that tidaldb_checkpoint_failures_total stops incrementing and tidaldb_checkpoint_age_seconds drops back below 60 seconds.

Prevention: Monitor tidaldb_wal_lag_bytes, and alert when the volume reaches 80% of its capacity. The WAL is the fastest-growing component. At 5M signals/day, the WAL grows ~200 MB/day before compaction.
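The growth figure can be turned into a rough headroom estimate. The sketch below is illustrative arithmetic only. It assumes decimal megabytes (1 MB = 1,000,000 bytes), no compaction at all (a worst case), and hypothetical function names.

```rust
/// Average WAL bytes per signal implied by a daily growth figure.
/// Returns `None` if `signals_per_day` is zero.
fn bytes_per_signal(wal_bytes_per_day: u64, signals_per_day: u64) -> Option<u64> {
    wal_bytes_per_day.checked_div(signals_per_day)
}

/// Whole days until disk usage reaches `alert_percent` of `capacity`,
/// assuming the WAL is the only thing growing and is never compacted.
/// Returns `None` if the growth rate is zero.
fn days_until_alert(
    capacity: u64,
    used: u64,
    wal_bytes_per_day: u64,
    alert_percent: u64,
) -> Option<u64> {
    let threshold = capacity / 100 * alert_percent;
    threshold.saturating_sub(used).checked_div(wal_bytes_per_day)
}
```

With the numbers above, 200,000,000 bytes over 5,000,000 signals is about 40 bytes per signal. On a 100 GB volume that is 60 GB full, the 80% alert line is 20 GB away, which is roughly 100 days of uncompacted WAL growth.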

6. TidalError::Backpressure during signal writes

Error message: backpressure: WAL queue full, retry after <N>ms

Cause: The WAL writer thread's channel is full. This means signal writes are arriving faster than the WAL can persist them to disk. This is NOT a data loss event -- the signal was never enqueued, so it can be safely retried.

Recovery: Retry the signal write after the suggested delay (retry_after_ms). If backpressure is sustained:

  • Check disk I/O latency (WAL writes are fsync-bound).
  • Check if another process is competing for disk bandwidth.
  • Consider faster storage (NVMe).

7. TidalError::RateLimited during session signal writes

Error message: rate limited: agent '<id>' at <limit> signals/sec, retry after <N>ms

Cause: An agent has exceeded its configured rate limit for the current session. The signal was NOT written.

Recovery: Back off and retry after retry_after_ms. If rate limiting is too aggressive, adjust the agent's rate limit in the schema's AgentPolicy configuration.
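Both Backpressure and RateLimited carry a retry_after_ms hint and guarantee that the signal was not written, so the same retry loop can serve both. The sketch below is a minimal illustration: WriteError is a hypothetical stand-in, not the real TidalError, and write_with_retry is not a tidalDB API.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the retryable errors described above.
/// The real `TidalError` may have a different shape.
#[derive(Debug, PartialEq)]
enum WriteError {
    Backpressure { retry_after_ms: u64 },
    RateLimited { retry_after_ms: u64 },
    Other(String),
}

/// Call `write` until it succeeds, returns a non-retryable error,
/// or `max_attempts` is used up. Sleeps for the hinted delay between attempts.
/// At least one attempt is always made.
fn write_with_retry<T>(
    max_attempts: u32,
    mut write: impl FnMut() -> Result<T, WriteError>,
) -> Result<T, WriteError> {
    let mut attempt = 1;
    loop {
        match write() {
            Ok(value) => return Ok(value),
            Err(WriteError::Backpressure { retry_after_ms })
            | Err(WriteError::RateLimited { retry_after_ms })
                if attempt < max_attempts =>
            {
                thread::sleep(Duration::from_millis(retry_after_ms));
                attempt += 1;
            }
            // Non-retryable error, or the attempt budget is exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

Capping the number of attempts matters: under sustained backpressure an unbounded loop only hides the underlying disk problem from the caller.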


Safe Files to Delete

| File/Directory | Safe to delete? | Notes |
|---|---|---|
| tidaldb.lock | Yes, if no process is running | Advisory lock file. Auto-recreated on next open. Verify with lsof first. |
| wal/segment_*.wal | Only segments with sequence numbers below the last checkpoint | Never delete the segment with the highest sequence number. To find the checkpoint sequence, check the last tidaldb-checkpoint log entry. |
| items/ | NO, not without a backup | Primary fjall keyspace. Contains item metadata, signal checkpoints, collections, cohort definitions, co-engagement data. |
| users/ | NO, not without a backup | Contains user relationship edges (follows, blocks, hides, interaction weights). |
| creators/ | NO, not without a backup | Contains creator metadata and embeddings. |
| text_index/ | Yes | Tantivy item text index. Rebuilt automatically from item metadata on next open. Rebuild cost is proportional to item count. |
| creator_text_index/ | Yes | Tantivy creator text index. Same as above but for creators. |
| cache/ | Yes | Temporary cache directory. Safe to delete at any time. |
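The rule for WAL segments can be written as a small filter. This is a sketch with a hypothetical function name. It assumes the checkpoint sequence read from the log is directly comparable to the sequence numbers in the segment filenames.

```rust
/// Sequence numbers that are safe to delete: strictly below the checkpoint
/// sequence, and never the highest segment present.
fn deletable_segments(segment_seqs: &[u64], checkpoint_seq: u64) -> Vec<u64> {
    let Some(&highest) = segment_seqs.iter().max() else {
        return Vec::new(); // no segments at all
    };
    let mut out: Vec<u64> = segment_seqs
        .iter()
        .copied()
        .filter(|&seq| seq < checkpoint_seq && seq != highest)
        .collect();
    out.sort_unstable();
    out
}
```

For segments 40 to 43 and a checkpoint at 42, only 40 and 41 qualify. If every segment is below the checkpoint, the highest one is still kept.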

Backup and Restore

tidalDB's underlying storage engine (fjall 3.x) does not yet expose a native backup API (fjall issue #52). Until that ships, the recommended backup procedure is quiesce-and-copy.

Creating a Backup

  1. Stop writes to the database (either shut down the process or use an application-level write pause).
  2. Flush all state:
    • Call db.close() for a clean shutdown, which checkpoints the signal ledger, flushes fjall, and writes a WAL checkpoint marker. OR:
    • If keeping the process running: flush text indexes, then wait for the next checkpoint cycle (30 seconds).
  3. Copy the entire data directory recursively:
    cp -R {data_dir} {backup_dest}
    
    This copies: items/, users/, creators/, wal/, text_index/, creator_text_index/, and tidaldb.lock.
  4. Resume writes (restart the process or unpause).
  5. Do NOT copy while the database is actively writing without a quiesce step. Partial fjall SST files or WAL segments will produce corruption on restore.
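If the copy in step 3 has to run from application code rather than a shell, it could look like the following. This is a minimal sketch with a hypothetical function name; it mirrors cp -R for regular files and directories only and is not a tidalDB API. It must run after the flush in step 2, and the destination must lie outside the data directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copy `src` into `dst`.
/// Returns (number of files copied, total bytes copied).
fn copy_data_dir(src: &Path, dst: &Path) -> io::Result<(u64, u64)> {
    fs::create_dir_all(dst)?;
    let (mut files, mut bytes) = (0u64, 0u64);
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let from = entry.path();
        let to = dst.join(entry.file_name());
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            let (f, b) = copy_data_dir(&from, &to)?;
            files += f;
            bytes += b;
        } else if file_type.is_file() {
            bytes += fs::copy(&from, &to)?;
            files += 1;
        }
        // Symlinks and special files are skipped in this sketch.
    }
    Ok((files, bytes))
}
```

The returned counts give a cheap sanity check: compare them with the file count and size of the source directory before resuming writes.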

Restoring from Backup

  1. Stop the current tidalDB process if running.
  2. Delete or move the current {data_dir}.
  3. Copy the backup into place: cp -R {backup_source} {data_dir}
  4. Remove the stale lock file: rm {data_dir}/tidaldb.lock
  5. Open tidalDB with the same schema. WAL replay will recover any signal events written between the backup's checkpoint and the end of the backup's WAL.

What Survives a Crash Without Backup

| Component | Survives unclean shutdown? | Recovery mechanism |
|---|---|---|
| Item metadata | Yes | Stored in fjall (durable) |
| Relationships | Yes | Stored in fjall (durable) |
| Signal decay scores | Yes (up to last checkpoint + WAL tail) | Checkpoint + WAL replay |
| Windowed counts | Approximately | Checkpoint stores state; WAL replay re-applies events |
| Active sessions | Yes | Session open/signal events in WAL are replayed; sessions restored as active |
| Bitmap/range indexes | Rebuilt on open | Scanned from fjall metadata |
| USearch vectors | Yes | Loaded from saved .idx files |
| Tantivy text index | Yes | Opened from on-disk segments |
| Collections | Yes | Rebuilt from fjall on open |
| Suggestions | Rebuilt on open | Scanned from fjall item metadata |