tidaldb/docs/planning/milestone-7/phase-4/task-05-tidalctl-diagnostics.md
2026-02-23 22:41:16 -07:00

# Task 05: `tidalctl diagnostics` Command
## Delivers
A `diagnostics` subcommand for `tidalctl` that reads the database's metrics state and persistent storage to print a human-readable health summary. Operators use this to triage production issues without attaching a debugger or parsing Prometheus output.
## Complexity: M
## Dependencies
- task-02 complete (signal + WAL metrics must be wired)
- task-03 complete (index health metrics must be wired)
- task-04 complete (session + cohort + degradation metrics must be wired)
- Existing `tidalctl` binary with `status` and `paths` subcommands (m0p2)
- `tidal/src/db/metrics.rs` -- `MetricsState` with all m7p4 metrics
## Technical Design
### 1. Add `diagnostics` subcommand to tidalctl
In the `tidalctl` binary (manual arg parsing), add a new match arm:
```rust
"diagnostics" => {
    let path = parse_path_flag(&args)?;
    run_diagnostics(&path, pretty)?;
}
```
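The match arm above relies on a path parser and a `pretty` flag. A minimal sketch of how manual parsing might look, assuming `args` is a slice of `String`s; the signature of `parse_path_flag` and the `parse_pretty_flag` helper are hypothetical and may differ from what the existing `status` subcommand already uses:

```rust
use std::path::PathBuf;

/// Hypothetical signature: returns the value that follows `--path`,
/// or a message describing what is missing.
fn parse_path_flag(args: &[String]) -> Result<PathBuf, String> {
    let idx = args
        .iter()
        .position(|a| a == "--path")
        .ok_or_else(|| "missing required flag: --path <dir>".to_string())?;
    args.get(idx + 1)
        .map(|value| PathBuf::from(value.as_str()))
        .ok_or_else(|| "--path requires a value".to_string())
}

/// Hypothetical helper: true when `--pretty` appears anywhere in the arguments.
fn parse_pretty_flag(args: &[String]) -> bool {
    args.iter().any(|a| a == "--pretty")
}
```

If the binary already has equivalents for `status` and `paths`, reuse those rather than adding new helpers.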
### 2. Diagnostics data collection
The diagnostics command opens the database in read-only inspection mode. It does NOT start a full `TidalDb` instance. Instead, it reads:
1. **Config**: from `{data_dir}/config.json` (existing `tidalctl status` path)
2. **WAL state**: scan `{wal_dir}/` for segment files, compute total size and count
3. **Checkpoint age**: read `{wal_dir}/checkpoint` file, parse `CheckpointMeta`, compute age from `checkpoint_time_ns`
4. **Signal ledger size**: estimate the entry count from the checkpoint file size (approximate; each entity-signal entry is ~983 bytes in the m1p4 format)
5. **Tantivy index**: if `{data_dir}/text_index/` exists, open read-only, count segments and docs
6. **USearch index**: if `{data_dir}/vectors/` exists, report directory size
7. **Session count**: count entries in session journal (`{wal_dir}/session_journal.bin`)
8. **Collection count**: scan `{data_dir}/items/` for `Tag::Collection` keys
9. **Cohort count**: scan `{data_dir}/items/` for cohort-related keys
For items 5-9, if the directory or file does not exist, report "not available" rather than erroring.
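The WAL scan (item 2) and the "not available" rule could be sketched with the standard library alone. This is an illustration under assumptions: the `.wal` segment extension, the `WalSummary` type, and both function names are hypothetical, and the directory size only counts top-level files.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq)]
struct WalSummary {
    segments: u64,
    total_bytes: u64,
}

/// Counts segment files in `wal_dir` and sums their sizes.
/// Assumption: segments carry a `.wal` extension, which keeps the
/// `checkpoint` and `session_journal.bin` files out of the totals.
fn scan_wal_dir(wal_dir: &Path) -> io::Result<WalSummary> {
    let mut summary = WalSummary { segments: 0, total_bytes: 0 };
    for entry in fs::read_dir(wal_dir)? {
        let entry = entry?;
        let path = entry.path();
        let is_segment = path.extension().map_or(false, |ext| ext == "wal");
        let meta = entry.metadata()?;
        if meta.is_file() && is_segment {
            summary.segments += 1;
            summary.total_bytes += meta.len();
        }
    }
    Ok(summary)
}

/// Returns `Ok(None)` when the directory is absent so the caller can print
/// "not available" instead of failing. Only top-level files are counted.
fn dir_size_if_present(dir: &Path) -> io::Result<Option<u64>> {
    if !dir.is_dir() {
        return Ok(None);
    }
    let mut total = 0u64;
    for entry in fs::read_dir(dir)? {
        let meta = entry?.metadata()?;
        if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(Some(total))
}
```

Modeling optional subsystems as `Option` keeps the "missing" case separate from real I/O errors, which still surface through `io::Result`.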
### 3. Diagnostics output format
```
tidalDB Diagnostics
===================
Version: 0.7.0 (build: abc123)
Data dir: /var/lib/tidaldb/data
Storage mode: durable

WAL
---
Segments: 12
Total size: 48.3 MB
Lag (uncompacted): 12.1 MB

Checkpoint
----------
Last checkpoint: 2026-02-23 14:30:12 UTC (47s ago)
WAL sequence: 148293

Signal Ledger
-------------
Estimated entries: ~152,000

Text Index (Tantivy)
--------------------
Segments: 4
Indexed docs: 98,412

Vector Index (USearch)
----------------------
Directory size: 256.7 MB

Sessions
--------
Active: 3
Closed (total): 1,247
Auto-closed: 12

Degradation
-----------
Level: 0 (healthy)

Collections: 8
Cohorts: 3
```
When `--pretty` is NOT set, output machine-readable JSON:
```json
{
  "version": "0.7.0",
  "build_hash": "abc123",
  "wal_segments": 12,
  "wal_total_bytes": 50659328,
  "wal_lag_bytes": 12689408,
  "checkpoint_age_seconds": 47,
  "checkpoint_wal_sequence": 148293,
  "signal_estimated_entries": 152000,
  "tantivy_segments": 4,
  "tantivy_indexed_docs": 98412,
  "usearch_directory_bytes": 269156352,
  "sessions_active": 3,
  "sessions_closed_total": 1247,
  "sessions_auto_closed_total": 12,
  "degradation_level": 0,
  "collection_count": 8,
  "cohort_count": 3
}
```
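The two sample outputs agree only if "MB" means 1,048,576 bytes (50659328 bytes corresponds to 48.3 MB). A small sketch of the formatting helpers the pretty printer might use; the function names are hypothetical, and the binary-unit interpretation is inferred from the sample values rather than stated in this task:

```rust
/// Formats a byte count as binary megabytes with one decimal place,
/// matching the sample output (50659328 bytes -> "48.3 MB").
fn format_mb(bytes: u64) -> String {
    format!("{:.1} MB", bytes as f64 / (1024.0 * 1024.0))
}

/// Inserts comma thousands separators, as in "98,412".
fn group_thousands(n: u64) -> String {
    let digits = n.to_string();
    let mut out = String::with_capacity(digits.len() + digits.len() / 3);
    for (i, ch) in digits.chars().enumerate() {
        // A separator goes before every group of three remaining digits.
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(ch);
    }
    out
}

/// Whole seconds elapsed since the checkpoint; saturates at zero if the
/// clock reads earlier than the recorded `checkpoint_time_ns`.
fn checkpoint_age_seconds(now_ns: u64, checkpoint_time_ns: u64) -> u64 {
    now_ns.saturating_sub(checkpoint_time_ns) / 1_000_000_000
}
```

The JSON output keeps raw bytes and seconds, so these helpers would only be needed on the `--pretty` path.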
### 4. Exit codes
| Code | Meaning |
|---|---|
| 0 | Diagnostics completed successfully |
| 1 | Data directory does not exist or is not readable |
| 2 | WAL directory missing or corrupt (partial output still printed) |
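One way to keep the table and the binary in sync is a single enum that owns the mapping. A sketch, assuming the binary can return `std::process::ExitCode` from `main`; the `DiagnosticsOutcome` type and its variant names are hypothetical:

```rust
use std::process::ExitCode;

/// Hypothetical outcome type mirroring the exit code table.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DiagnosticsOutcome {
    /// Diagnostics completed successfully.
    Success,
    /// Data directory does not exist or is not readable.
    DataDirUnreadable,
    /// WAL directory missing or corrupt; partial output was still printed.
    WalIssue,
}

impl DiagnosticsOutcome {
    /// Numeric code as listed in the table.
    fn code(self) -> u8 {
        match self {
            DiagnosticsOutcome::Success => 0,
            DiagnosticsOutcome::DataDirUnreadable => 1,
            DiagnosticsOutcome::WalIssue => 2,
        }
    }

    /// Converts the outcome into a value `main` can return.
    fn to_exit_code(self) -> ExitCode {
        ExitCode::from(self.code())
    }
}
```

Because a WAL problem still produces partial output, the command would print what it collected first and only then return `WalIssue`.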
### 5. No TidalDb instance required
The diagnostics command reads files directly and does NOT call `TidalDb::builder().open()`. It can therefore run against a database that another process currently has open (it needs only read access to the files) or against a database that failed to start, which helps when debugging startup failures.
If a running `TidalDb` has the metrics HTTP server enabled, `tidalctl diagnostics` could instead fetch `/metrics` and format that output. Implement the file-based approach as the primary path; the HTTP-based approach is a future enhancement.
## Acceptance Criteria
- [ ] `tidalctl diagnostics --path <dir> --pretty` prints human-readable health summary
- [ ] `tidalctl diagnostics --path <dir>` (without `--pretty`) prints machine-readable JSON
- [ ] Output includes: WAL segment count, WAL total size, WAL lag, checkpoint age, checkpoint sequence, estimated signal entries, Tantivy segment count, Tantivy indexed docs, USearch directory size, active sessions, closed sessions, auto-closed sessions, degradation level, collection count, cohort count
- [ ] Missing subsystems (no text index, no vectors) show "not available" rather than error
- [ ] Works against a database currently open by another process (read-only access)
- [ ] Exit code 0 on success, 1 on missing data dir, 2 on WAL issues
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass
## Test Strategy
```rust
// CLI integration tests (each runs the binary as a subprocess).
// `make_test_db_with_items` and `tidalctl_binary_path` are test helpers
// assumed to exist in the test support module.
use std::process::Command;

#[test]
fn diagnostics_json_output_valid() {
    let db = make_test_db_with_items(10);
    let data_dir = db.paths().data_dir().to_path_buf();
    db.close().unwrap();

    // Without `--pretty`, stdout must be a single JSON document.
    let output = Command::new(tidalctl_binary_path())
        .args(["diagnostics", "--path", data_dir.to_str().unwrap()])
        .output()
        .unwrap();
    assert!(output.status.success());

    let json: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap();
    assert!(json["version"].is_string());
    assert!(json["wal_segments"].is_number());
    assert!(json["checkpoint_age_seconds"].is_number());
}

#[test]
fn diagnostics_pretty_output_readable() {
    let db = make_test_db_with_items(10);
    let data_dir = db.paths().data_dir().to_path_buf();
    db.close().unwrap();

    let output = Command::new(tidalctl_binary_path())
        .args(["diagnostics", "--path", data_dir.to_str().unwrap(), "--pretty"])
        .output()
        .unwrap();
    assert!(output.status.success());

    // The human-readable form is checked by its section headings.
    let stdout = String::from_utf8_lossy(&output.stdout);
    assert!(stdout.contains("tidalDB Diagnostics"));
    assert!(stdout.contains("WAL"));
    assert!(stdout.contains("Checkpoint"));
}

#[test]
fn diagnostics_missing_dir_exits_1() {
    let output = Command::new(tidalctl_binary_path())
        .args(["diagnostics", "--path", "/nonexistent/path"])
        .output()
        .unwrap();
    assert_eq!(output.status.code(), Some(1));
}
```