tidaldb/docs/research/tidaldb_tooling_and_diagnostics.md
jordan 4f076c927d feat: M0p1 runtime skeleton, M0p2 tooling & diagnostics, m1p4 signal ledger
## M0p1 — Embeddable Runtime Skeleton (329 tests)
- TidalDb with builder(), health_check(), close(), and Drop-based cleanup
- TidalDbBuilder fluent API: ephemeral(), with_data_dir(), wal_dir(), cache_dir()
- Config, StorageMode, ConfigError types; Config(ConfigError) variant on LumenError
- Paths: single source of truth for directory layout (wal, items, users, creators, cache)
- TempTidalHome: test isolation helper gated behind #[cfg(test)] / test-utils feature
- 8 integration tests: tests/sandboxed_storage.rs

## M0p2 — Tooling & Diagnostics (349 tests)
- Workspace root Cargo.toml (members: ["tidal", "tidalctl"])
- tidal/build.rs: BUILD_HASH from GIT_HASH with option_env!() fallback to "dev"
- MetricsState: always-compiled Arc-shared atomics (uptime, health_ok)
- MetricsHandle (metrics feature): hand-rolled TcpListener HTTP, zero new deps
  - GET /healthz → {"status":"ok","uptime_secs":N}
  - GET /metrics → Prometheus text (tidaldb_uptime_seconds, health_ok, info)
- TidalDbBuilder.enable_metrics(addr) starts background metrics thread
- tidalctl binary: status + paths commands, manual std::env::args() parsing
- 7 metrics integration tests, 9 tidalctl CLI tests

## m1p4 Signal Ledger (in-progress)
- SignalLedger: DashMap<(EntityId, SignalTypeId), EntitySignalEntry>, WAL-first writes
- HotSignalState: #[repr(C, align(64))], lock-free CAS decay, out-of-order handling
- BucketedCounter: 60 per-minute + 168 per-hour circular buffers, trigger-based rotation
- CheckpointMeta + serialize/restore: 983-byte fixed records, atomic WriteBatch
- Property tests: running score matches analytical to 1e-6, decay monotonic, non-negative
- Proptest regression: signals/warm.txt

## Documentation and planning
- ROADMAP: m0p1 COMPLETE (329), m0p2 COMPLETE (349), product track milestones
- PRODUCT_ROADMAP: P0-P4 product milestone track (personal briefing beachhead)
- Milestone planning docs: milestone-0 (phases 1-3), milestone-p (phases 1-5)
- docs/research/tidaldb_tooling_and_diagnostics.md
- ARCHITECTURE.md, CLAUDE.md, VISION.md updates

## Site
- Blog: every-platform-builds-the-same-6-systems.mdx (new)
- Blog: why-tidaldb.mdx (updated)
- next.config.ts, layout.tsx, blog/page.tsx updates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 20:32:00 -07:00


Research: CLI Framework and Embedded HTTP for m0p2 Tooling & Diagnostics

Question

What is the minimum-viable set of dependencies and design patterns for:

  1. A tidalctl CLI binary (2 subcommands, 1 required arg, 1 optional flag, JSON output)
  2. An optional embedded HTTP endpoint (/healthz JSON, /metrics Prometheus text format)
  3. Prometheus text format output for 5-10 counters/gauges
  4. Config serialization for CLI-to-library communication

TidalDB Context

tidalDB is an embeddable, single-node-first Rust database. The dependency philosophy from CODING_GUIDELINES.md is explicit: "Every dependency must justify its existence against 'could we write this in 200 lines?'" The library crate has #![forbid(unsafe_code)] at crate level. MSRV is 1.91 (Rust 2024 edition).

m0p2 scope is narrow:

  • tidalctl status --path <dir> and tidalctl paths --path <dir> -- two subcommands, one required flag (--path), one optional flag (--pretty), JSON output
  • /healthz returning JSON health status
  • /metrics returning Prometheus text format with ~5-10 metrics (uptime, WAL sequence, queue depth, build hash)
  • The HTTP endpoint is feature-gated (metrics feature), disabled by default
  • Expected concurrent connections to the metrics endpoint: <10 (dev/ops tooling only)

Existing dependency context (from Cargo.lock): criterion (dev-dependency) already pulls in clap 4.5.60, serde 1.0.228, serde_json 1.0.149, and serde_derive 1.0.228. These are compiled in every cargo test and cargo bench invocation today. serde/serde_json are also listed as approved dependencies in CODING_GUIDELINES.md (line 296).


Question 1: CLI Argument Parsing for tidalctl

Approaches Surveyed

Approach 1: clap 4.x (derive API)

How it works: Declarative derive macros on structs generate a full argument parser with help text, error messages, completions, and subcommand routing. The derive API maps directly from struct fields to CLI flags.

Used by: TiKV (tikv-ctl), Meilisearch, SurrealDB, Vector, Nushell, bat, fd. The dominant choice in the Rust CLI ecosystem. Criterion (already a tidalDB dev-dep) uses clap 4 internally.

Evidence:

  • argparse-rosetta-rs benchmarks (2024): 3s full debug build, 392ms incremental. 654 KiB release binary overhead (full features) or 427 KiB (minimal features).
  • MSRV: 1.74. Compatible with tidalDB's 1.91.
  • Rain's Rust CLI Recommendations: "use clap unless you have a really simple application."

Strengths:

  • Auto-generated --help with subcommand tree, argument descriptions, and defaults.
  • Compile-time validation of argument structure via derive macros.
  • Shell completions via clap_complete.
  • Already in Cargo.lock via criterion -- zero additional compile-time cost in dev builds.

Weaknesses:

  • 654 KiB binary overhead (full) / 427 KiB (minimal) added to the tidalctl release binary.
  • Proc-macro dependency chain (syn, quote, proc-macro2) -- though these are already compiled for criterion.
  • Overkill for 2 subcommands.

Approach 2: argh 0.1.13 (Google's derive parser)

How it works: Derive-based parser optimized for code size, designed for Google Fuchsia's CLI conventions. Similar derive API to clap but with a smaller binary footprint.

Used by: Google Fuchsia tooling. Limited adoption outside Google's ecosystem.

Evidence:

  • argparse-rosetta-rs benchmarks: 3s full debug build (same as clap due to proc-macro overhead), 203ms incremental. 38 KiB binary overhead.
  • MSRV: not explicitly declared. Uses 2018 edition. Last release ~12 months ago.
  • License: BSD-3-Clause. "This is not an officially supported Google product."

Strengths:

  • Much smaller binary overhead than clap (38 KiB vs 427-654 KiB).
  • Derive-based API similar to clap.

Weaknesses:

  • Not in Cargo.lock -- adds a new dependency tree.
  • Fuchsia-specific conventions (not standard Unix --flag=value in all cases).
  • Lower community adoption; maintenance uncertain (not officially supported by Google).
  • No shell completions.
  • 3s initial compile (proc-macro overhead same as clap).

Approach 3: pico-args 0.5.0

How it works: Manual argument extraction via method calls. No derive, no proc-macros, no help generation. Parse arguments by calling opt_value_from_str("--path"), contains("--pretty"), and subcommand().

Used by: RazrFalcon's suite of tools (resvg, usvg, svgcleaner). Popular in the "small tool" Rust ecosystem. 11M+ total downloads on crates.io.

Evidence:

  • argparse-rosetta-rs benchmarks: 384ms full debug build, 185ms incremental. 23 KiB binary overhead.
  • Zero dependencies. Zero proc-macros. 666 lines of code.
  • MSRV: 1.32. Compatible with any Rust version.
  • License: MIT.
  • No unsafe code (#![forbid(unsafe_code)]).

Strengths:

  • Negligible compile-time and binary size impact.
  • Zero dependencies -- no transitive risk.
  • API is simple enough for 2 subcommands.
  • Matches tidalDB's dependency philosophy perfectly.

Weaknesses:

  • No auto-generated --help. Must be hand-written (10-15 lines for this CLI).
  • No derive -- argument parsing is imperative code.
  • Subcommand routing is manual string matching.
  • Error messages are less polished than clap.

Approach 4: lexopt 0.3.1

How it works: Low-level lexer that yields tokens (Short, Long, Value). The application matches on tokens in a loop. One file, zero dependencies, zero macros.

Used by: cargo (as clap_lex which is derived from lexopt's design), uutils.

Evidence:

  • argparse-rosetta-rs benchmarks: 385ms full debug build, 184ms incremental. 34 KiB binary overhead.
  • Zero dependencies. MSRV 1.31. License: MIT/Apache-2.0.

Strengths:

  • Handles OsString correctly (important for path arguments).
  • Slightly more structured than raw std::env::args().

Weaknesses:

  • More boilerplate than pico-args for the same result.
  • No subcommand abstraction -- everything is a token loop.
  • Slightly larger binary overhead than pico-args for less ergonomic API.

Approach 5: Manual (std::env::args())

How it works: Read std::env::args() into a Vec<String>, match on the first positional argument for the subcommand, iterate remaining args for flags.

Used by: Many internal tools. SQLite's CLI is hand-rolled in C (not using getopt). DuckDB's CLI is based on SQLite's hand-rolled parser.

Evidence:

  • Zero dependencies, zero binary overhead, zero compile time addition.
  • For 2 subcommands + 2 flags, this is approximately 50-80 lines of Rust.

Strengths:

  • Absolute minimum footprint.
  • No dependency to maintain, audit, or version-pin.
  • Complete control over error messages.

Weaknesses:

  • Must handle edge cases manually: --path=<dir> vs --path <dir>, -- separator, unknown flags.
  • No help generation.
  • More code to maintain than pico-args for equivalent behavior.
  • Easy to introduce subtle parsing bugs (e.g., --path at end of args without value).

Comparison

| Criterion | clap 4.x | argh 0.1.13 | pico-args 0.5.0 | lexopt 0.3.1 | Manual |
|---|---|---|---|---|---|
| Full debug build | 3s | 3s | 384ms | 385ms | 0ms |
| Incremental build | 392ms | 203ms | 185ms | 184ms | 0ms |
| Binary overhead (release) | 427-654 KiB | 38 KiB | 23 KiB | 34 KiB | 0 KiB |
| Dependencies | ~10 transitive | ~3 (proc-macro) | 0 | 0 | 0 |
| Auto --help | Yes | Yes | No | No | No |
| Subcommand support | Native | Native | Manual matching | Manual matching | Manual matching |
| Proc-macros | Yes (derive) | Yes (derive) | No | No | No |
| #![forbid(unsafe_code)] | No (clap uses unsafe) | Unknown | Yes | Yes | Yes |
| MSRV | 1.74 | ~1.56 (2018 ed.) | 1.32 | 1.31 | N/A |
| Already in Cargo.lock | Yes (via criterion) | No | No | No | N/A |
| License | MIT/Apache-2.0 | BSD-3-Clause | MIT | MIT/Apache-2.0 | N/A |
| Lines of code (user-side) | ~25 (derive struct) | ~25 (derive struct) | ~40 (imperative) | ~50 (token loop) | ~60-80 |

Recommendation: Manual std::env::args() for tidalctl

The case is clear when you look at the actual scope. tidalctl has 2 subcommands, 1 required flag, and 1 optional flag. This is a 60-line match statement, not a parser configuration problem.

The key arguments:

  1. The CODING_GUIDELINES.md test: "Could we write this in 200 lines?" -- Yes, in about 60 lines, including help text and error messages. No dependency passes this bar for this scope.

  2. tidalctl is a separate binary crate, not the library. It will have its own Cargo.toml. Even though clap is in the workspace Cargo.lock via criterion, tidalctl's release build would need to compile clap into the binary, adding 427+ KiB. The CLI binary should be small -- the status command reads a config file and prints JSON; it should not be a 1+ MiB binary.

  3. The "escape hatch" argument favors manual. If tidalctl grows to 5+ subcommands (e.g., tidalctl compact, tidalctl backup, tidalctl schema), switching from manual to pico-args or clap is a straightforward refactor. The reverse migration (clap to manual) is harder because derive macros become load-bearing.

  4. Production precedent: SQLite and DuckDB both use hand-rolled CLI parsers. For embedded database tooling with few commands, this is the norm, not the exception.

If the team prefers a library: pico-args 0.5.0 is the right choice. Zero dependencies, 23 KiB overhead, #![forbid(unsafe_code)], and the API is natural for this use case. Pin to pico-args = "0.5".

Do not use clap for tidalctl at this scope. It is the right tool for a CLI with 10+ subcommands and complex argument validation. It is overkill for 2 subcommands and would add 427 KiB to a binary that should be 100-200 KiB total.
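
To make the "60-line match statement" concrete, here is a minimal sketch of what a manual parser for this scope could look like. The names Command, Cli, and parse_args are hypothetical, not taken from the tidalctl source, and the sketch works on String arguments for brevity; a real binary that must accept non-UTF-8 paths would iterate std::env::args_os() instead.

```rust
use std::path::PathBuf;

/// The two subcommands in m0p2 scope.
#[derive(Debug, PartialEq)]
pub enum Command {
    Status,
    Paths,
}

/// Parsed invocation (hypothetical shape, for illustration only).
#[derive(Debug, PartialEq)]
pub struct Cli {
    pub command: Command,
    pub path: PathBuf,
    pub pretty: bool,
}

/// Parses everything after the program name.
/// Accepts both `--path <dir>` and `--path=<dir>`.
pub fn parse_args<I>(args: I) -> Result<Cli, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter();
    let command = match args.next().as_deref() {
        Some("status") => Command::Status,
        Some("paths") => Command::Paths,
        Some(other) => return Err(format!("unknown subcommand: {other}")),
        None => return Err("missing subcommand (expected `status` or `paths`)".to_string()),
    };

    let mut path = None;
    let mut pretty = false;
    while let Some(arg) = args.next() {
        if arg == "--pretty" {
            pretty = true;
        } else if arg == "--path" {
            // A trailing `--path` with no value is an error, not a silent default.
            let value = args.next().ok_or("--path requires a value")?;
            path = Some(PathBuf::from(value));
        } else if let Some(value) = arg.strip_prefix("--path=") {
            path = Some(PathBuf::from(value));
        } else {
            return Err(format!("unknown argument: {arg}"));
        }
    }

    let path = path.ok_or("missing required flag: --path <dir>")?;
    Ok(Cli { command, path, pretty })
}
```

A binary would call parse_args(std::env::args().skip(1)) and print the error plus a hand-written usage string on failure. Keeping the parser a pure function over an iterator means the edge cases listed under Approach 5 can be covered by plain unit tests.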


Question 2: Sync Embedded HTTP for Metrics Endpoint

Design Tension

The m0p2 task document says: "Endpoint can run on the same Tokio runtime as host service (returns Future implementor)." But the research question notes: "Needs to work without Tokio as a hard dependency." These are in tension.

Resolution: The metrics endpoint should be designed as a synchronous server running on a background std::thread. A host application that runs Tokio can start it from async code without involving the runtime, because the server owns its OS thread; tokio::task::spawn_blocking is only needed around calls that block, such as joining that thread at shutdown. The API should return a handle wrapping std::thread::JoinHandle<()>, not a Future. This is simpler, avoids a Tokio dependency, and is compatible with both async and sync host applications.

A future metrics-tokio feature flag could add a Future-returning wrapper, but m0p2 does not need it.

Approaches Surveyed

Approach 1: tiny_http 0.12.0

How it works: Synchronous HTTP server using std::net::TcpListener internally with a thread pool. Handles HTTP/1.1 parsing, keep-alive, chunked transfer, content encoding. You call server.recv() in a loop and respond synchronously.

Used by: devserver, nickel (legacy), numerous internal tools. 1.1K GitHub stars, 395 downstream crates.

Evidence:

  • Version 0.12.0, released October 2022. Edition 2018. MSRV 1.57.
  • Core dependencies: ascii, chunked_transfer, httpdate -- minimal tree (~5 crates without TLS).
  • Size: 120 KB crate, ~2.5K source lines.
  • License: MIT/Apache-2.0.
  • No TLS needed for localhost metrics (disable all ssl-* features).
  • Uses some unsafe internally (HTTP parsing optimizations).

Strengths:

  • Fully synchronous -- no Tokio dependency.
  • Handles HTTP edge cases (keep-alive, chunked, pipelining) correctly.
  • Mature, battle-tested for low-traffic use cases.
  • Simple API: server.recv() -> Request -> request.respond(Response).

Weaknesses:

  • Last release October 2022 -- 3+ years old. Active maintenance is uncertain.
  • Internal thread pool adds complexity tidalDB does not need for 2 endpoints.
  • Pulls in ascii and chunked_transfer crates -- small but nonzero dependency surface.
  • Uses unsafe internally, which cannot be audited as easily as a hand-rolled solution.
  • MSRV 1.57 is fine, but edition 2018 is dated.

Approach 2: rouille 0.6.2

How it works: Macro-based synchronous web framework built on top of tiny_http. Adds routing macros, form parsing, and session handling.

Used by: Small Rust web projects. 1.1K GitHub stars.

Evidence:

  • Built on tiny_http -- inherits its HTTP handling.
  • Adds significant API surface (routing macros, sessions, forms) that tidalDB does not need.
  • Last commit activity has slowed.
  • License: MIT/Apache-2.0.

Strengths:

  • Routing macros reduce boilerplate for multi-endpoint servers.

Weaknesses:

  • Wrapper around tiny_http -- adds dependency on top of dependency.
  • Routing macros are unnecessary for 2 endpoints.
  • Maintenance status unclear.
  • Fails the "200 lines" test -- we are adding a framework when we need 2 if branches.

Approach 3: Hand-rolled (std::net::TcpListener)

How it works: Bind a TcpListener, accept connections in a loop on a background thread, parse the HTTP request line (just the method and path), write a raw HTTP response. For 2 endpoints with static-ish content, this is ~80-120 lines.

Used by: The Rust Book's web server tutorial uses this exact pattern. Prometheus client libraries in other languages often use minimal HTTP for the /metrics endpoint. SQLite does not embed an HTTP server, but the pattern is standard for database diagnostics (e.g., RocksDB statistics are often exposed via a hand-rolled HTTP endpoint in embedding applications).

Evidence:

  • Zero dependencies. Zero binary overhead.
  • The Rust standard library's TcpListener + BufReader handles everything needed for HTTP/1.1 request parsing at this scale.
  • For /healthz and /metrics with <10 concurrent connections, HTTP keep-alive and chunked transfer are unnecessary -- Connection: close on every response is acceptable.

Strengths:

  • Zero dependencies -- maximally embeddable.
  • Audit surface is 80-120 lines of code that the team wrote and understands.
  • No unsafe (stays within #![forbid(unsafe_code)]).
  • Thread model is explicit: one std::thread::spawn with a loop, one TcpListener.
  • Trivially testable: connect with std::net::TcpStream in integration tests.

Weaknesses:

  • Must handle HTTP parsing manually. But for this scope: read the first line, split on spaces, match path. Malformed requests get a 400 response. This is ~20 lines.
  • No keep-alive, no chunked transfer, no content encoding. Acceptable for dev/ops metrics endpoint at <10 connections.
  • If requirements grow (TLS, WebSocket, many endpoints), must migrate to a real server. But m0p2 has 2 endpoints.

Approach 4: axum + Tokio (async)

How it works: Full async web framework built on hyper and tokio. Tower middleware ecosystem, type-safe extractors, Router-based routing.

Used by: Most production Rust web services. The ecosystem standard for async HTTP.

Evidence:

  • Pulls in tokio, hyper, tower, http, and dozens of transitive dependencies.
  • Binary size impact: 1-3 MiB.
  • Compile time: 10-20s for a clean build.

Strengths:

  • Production-grade HTTP handling.
  • Seamless integration if the host application already runs Tokio.

Weaknesses:

  • Poor fit for tidalDB's embeddable philosophy. As a required dependency, Tokio would be linked by every embedder, even those that never enable metrics. Feature-gating avoids that, but the metrics feature would still pull in the entire async runtime.
  • Massive dependency tree for 2 endpoints.
  • Does not pass the "200 lines" test by orders of magnitude.

Approach 5: warp (async, Tokio-based)

Same category as axum: it pulls in Tokio and is disqualified for the same reasons.

Comparison

| Criterion | tiny_http 0.12 | rouille 0.6 | Hand-rolled | axum + Tokio |
|---|---|---|---|---|
| Async? | No (sync) | No (sync) | No (sync) | Yes |
| Dependencies | ~5 crates | ~8 crates (via tiny_http) | 0 | ~50+ crates |
| Binary size impact | ~50-80 KiB | ~80-120 KiB | 0 KiB | 1-3 MiB |
| Compile time impact | ~1-2s | ~2-3s | 0s | 10-20s |
| HTTP correctness | Full HTTP/1.1 | Full HTTP/1.1 | Minimal (sufficient) | Full HTTP/1.1 + HTTP/2 |
| #![forbid(unsafe_code)] | No (internal unsafe) | No | Yes | No |
| MSRV | 1.57 | Unknown | N/A (std only) | ~1.70+ |
| Maintenance | Last release Oct 2022 | Uncertain | N/A (owned code) | Active |
| License | MIT/Apache-2.0 | MIT/Apache-2.0 | N/A | MIT |
| Shutdown coordination | server.unblock() | server.unblock() | AtomicBool flag | tokio::sync::oneshot |
| Concurrent connections | Thread pool | Thread pool | Sequential (acceptable) | Async (unlimited) |

Recommendation: Hand-rolled std::net::TcpListener

For 2 endpoints serving <10 concurrent connections in a dev/ops context, a hand-rolled HTTP listener is the correct choice.

The arguments:

  1. The "200 lines" test is decisive. The entire metrics HTTP server -- binding, accept loop, request parsing, routing, response formatting, graceful shutdown -- fits in ~100-120 lines of safe Rust. No dependency justifies its existence here.

  2. Zero dependency cost. The metrics feature flag should add only tidalDB's own code, not a third-party HTTP server. An embedder who enables metrics should not be surprised by new transitive dependencies.

  3. #![forbid(unsafe_code)] compatibility. tiny_http uses unsafe internally. A hand-rolled solution stays within tidalDB's safety guarantees.

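  4. Shutdown is trivial with an AtomicBool. The background thread checks shutdown.load(Ordering::Acquire) on each accept iteration. Because TcpListener::accept() blocks by default, the loop needs a way to wake up: either TcpListener::set_nonblocking(true) with a short poll interval (on the order of 100 ms), or a blocking accept that the handle unblocks by connecting to its own address after setting the flag. std::net::TcpListener has no accept timeout, so a timeout-based variant would need platform-specific socket options.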

  5. The "escape hatch" works both directions. If m0p2 grows beyond 2 endpoints or needs TLS, migrating to tiny_http or axum is straightforward -- the endpoint handler functions remain the same, only the server harness changes.

API design:

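use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Start the metrics HTTP server on a background thread.
///
/// Returns a handle that stops the server when dropped.
// Signature sketch only: the body would bind the listener and spawn the accept loop.
pub fn start_metrics_server(addr: std::net::SocketAddr, db: Arc<TidalDb>) -> MetricsHandle;

pub struct MetricsHandle {
    /// Set to true to ask the accept loop to exit.
    shutdown: Arc<AtomicBool>,
    /// Taken in Drop so the thread is joined exactly once.
    thread: Option<std::thread::JoinHandle<()>>,
}

impl Drop for MetricsHandle {
    fn drop(&mut self) {
        // Signal the accept loop, then wait for it to exit.
        self.shutdown.store(true, Ordering::Release);
        if let Some(handle) = self.thread.take() {
            // A panicked server thread is ignored; Drop must not panic.
            let _ = handle.join();
        }
    }
}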

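Tokio compatibility: start_metrics_server spawns its own OS thread and returns immediately, so an embedder running Tokio can call it directly from async code. Dropping MetricsHandle joins that thread and can block for up to one poll interval, so an async host may prefer to drop it inside tokio::task::spawn_blocking. No tidalDB code needs to know about Tokio.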
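
The pieces described above (non-blocking accept loop, AtomicBool shutdown, first-line request parsing, Connection: close) can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the tidalDB implementation: it omits the db: Arc<TidalDb> parameter and reports time since server start as a stand-in for real metrics, it returns io::Result so bind failures surface to the caller, and the local_addr field and accessor are hypothetical additions that make port-0 binding usable in tests.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

pub struct MetricsHandle {
    shutdown: Arc<AtomicBool>,
    thread: Option<JoinHandle<()>>,
    local_addr: SocketAddr, // hypothetical addition, handy when binding port 0
}

impl MetricsHandle {
    pub fn local_addr(&self) -> SocketAddr {
        self.local_addr
    }
}

impl Drop for MetricsHandle {
    fn drop(&mut self) {
        self.shutdown.store(true, Ordering::Release);
        if let Some(handle) = self.thread.take() {
            let _ = handle.join();
        }
    }
}

/// Binds `addr` and serves /healthz and /metrics until the handle is dropped.
pub fn start_metrics_server(addr: SocketAddr) -> std::io::Result<MetricsHandle> {
    let listener = TcpListener::bind(addr)?;
    listener.set_nonblocking(true)?; // accept() now returns WouldBlock instead of blocking
    let local_addr = listener.local_addr()?;
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutdown);
    let started = Instant::now();

    let thread = thread::spawn(move || {
        while !flag.load(Ordering::Acquire) {
            match listener.accept() {
                // One connection at a time; a failed client must not stop the loop.
                Ok((stream, _peer)) => {
                    let _ = handle_connection(stream, started);
                }
                // No pending client (or a transient error): sleep, then re-check the flag.
                Err(_) => thread::sleep(Duration::from_millis(100)),
            }
        }
    });
    Ok(MetricsHandle { shutdown, thread: Some(thread), local_addr })
}

fn handle_connection(mut stream: TcpStream, started: Instant) -> std::io::Result<()> {
    // Accepted sockets can inherit non-blocking mode on some platforms.
    stream.set_nonblocking(false)?;
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;

    let mut request_line = String::new();
    {
        let mut reader = BufReader::new(&stream);
        reader.read_line(&mut request_line)?;
        // Drain the headers up to the blank line; their contents are ignored.
        let mut header = String::new();
        loop {
            header.clear();
            if reader.read_line(&mut header)? == 0 || header.trim_end().is_empty() {
                break;
            }
        }
    }

    let uptime = started.elapsed().as_secs();
    let mut parts = request_line.split_whitespace();
    let (status, content_type, body) = match (parts.next(), parts.next()) {
        (Some("GET"), Some("/healthz")) => (
            "200 OK",
            "application/json",
            format!("{{\"status\":\"ok\",\"uptime_secs\":{uptime}}}"),
        ),
        (Some("GET"), Some("/metrics")) => (
            "200 OK",
            "text/plain; version=0.0.4; charset=utf-8",
            format!("# TYPE tidaldb_uptime_seconds gauge\ntidaldb_uptime_seconds {uptime}\n"),
        ),
        (Some(_), Some(_)) => ("404 Not Found", "text/plain", "not found\n".to_string()),
        _ => ("400 Bad Request", "text/plain", "bad request\n".to_string()),
    };
    write!(
        stream,
        "HTTP/1.1 {status}\r\nContent-Type: {content_type}\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    )?;
    stream.flush()
}
```

An integration test can bind to 127.0.0.1:0, read the port back through local_addr(), and talk to the server with a plain std::net::TcpStream. Connections are handled sequentially, which matches the stated budget of fewer than 10 concurrent clients.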


Question 3: Prometheus Text Format

Format Specification

The Prometheus text exposition format (version 0.0.4) is line-oriented, UTF-8 encoded, with \n line endings:

# HELP <metric_name> <docstring>
# TYPE <metric_name> <counter|gauge|histogram|summary|untyped>
<metric_name>{<label_name>="<label_value>",...} <value> [<timestamp>]

Rules:

  • # HELP and # TYPE must appear before the first sample for a metric.
  • Only one # HELP and one # TYPE per metric name.
  • If # TYPE is omitted, metric defaults to untyped.
  • Label values must escape a backslash as \\, a double quote as \", and a line feed as \n (backslash followed by n).
  • Values are Go ParseFloat format: integers, floats, NaN, +Inf, -Inf.
  • Timestamp is optional (milliseconds since epoch). Prometheus will use scrape time if omitted.
  • Content-Type: text/plain; version=0.0.4; charset=utf-8.
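
The label-escaping rule is the only part of the format that needs real logic. A minimal sketch follows; escape_label_value and format_sample are hypothetical helper names used for illustration, and the value is written with Rust's default f64 formatting.

```rust
/// Escapes a label value for the Prometheus text exposition format:
/// backslash, double quote, and line feed each get a leading backslash form.
pub fn escape_label_value(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Formats one sample line without a timestamp: name{k="v",...} value
pub fn format_sample(name: &str, labels: &[(&str, &str)], value: f64) -> String {
    let rendered: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{}\"", escape_label_value(v)))
        .collect();
    if rendered.is_empty() {
        // No labels: the braces are omitted entirely.
        format!("{name} {value}")
    } else {
        format!("{name}{{{}}} {value}", rendered.join(","))
    }
}
```

For example, format_sample("tidaldb_open_segments", &[("partition_id", "0")], 3.0) yields tidaldb_open_segments{partition_id="0"} 3.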

Example for tidalDB's metrics

# HELP tidaldb_uptime_seconds Seconds since the database was opened.
# TYPE tidaldb_uptime_seconds gauge
tidaldb_uptime_seconds{partition_id="0"} 3723.5

# HELP tidaldb_wal_sequence Current WAL sequence number.
# TYPE tidaldb_wal_sequence counter
tidaldb_wal_sequence{partition_id="0"} 148293

# HELP tidaldb_wal_queue_depth Number of WAL entries pending flush.
# TYPE tidaldb_wal_queue_depth gauge
tidaldb_wal_queue_depth{partition_id="0"} 12

# HELP tidaldb_build_info Build metadata. Value is always 1.
# TYPE tidaldb_build_info gauge
tidaldb_build_info{version="0.1.0",build_hash="abc123",partition_id="0"} 1

# HELP tidaldb_open_segments Number of open WAL segments.
# TYPE tidaldb_open_segments gauge
tidaldb_open_segments{partition_id="0"} 3

Approaches Surveyed

Approach 1: prometheus crate (tikv/rust-prometheus) 0.13.x

How it works: Registry-based. Create Counter, Gauge, Histogram objects, register them with a Registry, call TextEncoder::encode() to produce the exposition format.

Used by: TiKV, Linkerd, numerous Rust services. The de facto standard.

Evidence:

  • Well-maintained (tikv organization). License: Apache-2.0.
  • Pulls in protobuf (for optional protobuf format), lazy_static, parking_lot, memchr.
  • Forces string allocations during metric collection (Collector trait limitation).
  • Binary size: ~100-200 KiB.
  • MSRV: 1.56.

Strengths:

  • Battle-tested encoding. Guaranteed format correctness.
  • Histogram and summary support built-in.

Weaknesses:

  • Significant dependency tree for 5 counters/gauges.
  • protobuf dependency is unnecessary for text-only exposition.
  • Allocation-heavy collector API (documented ~40% slower than prometheus-client).
  • Overkill: we need writeln! for 5 metrics, not a registry system.

Approach 2: prometheus-client crate 0.22.x

How it works: OpenMetrics-compatible. Type-safe labels via Rust type system (not string pairs). Visitor-based encoding (no allocations).

Used by: Official Prometheus Rust client. Recommended for new projects.

Evidence:

  • Prometheus organization maintained. License: Apache-2.0.
  • No unsafe code.
  • ~40% faster encoding than tikv/rust-prometheus due to visitor pattern.
  • Smaller dependency footprint than tikv version.

Strengths:

  • Type-safe labels catch errors at compile time.
  • No allocation during encoding.
  • Official Prometheus project.

Weaknesses:

  • Still a registry-based abstraction layer for 5 metrics.
  • Adds dependency tree that is not justified for the scope.

Approach 3: Hand-written format

How it works: Use write! / writeln! to a String or Vec<u8>, following the format spec directly. For 5 counters/gauges with static names and 1-2 labels, this is a function that reads metric values and formats them.

Evidence:

  • The format is trivially simple for counters and gauges. The complete formatting logic for 5 metrics is ~30-40 lines.
  • No histograms or summaries needed at m0p2 scope.
  • Validation: the output must match # HELP, # TYPE, then metric lines. A unit test can assert the format parses correctly (or simply check line structure).

Strengths:

  • Zero dependencies.
  • Complete control over output format.
  • Trivially auditable -- the format spec is 1 page.
  • No registry overhead, no trait objects, no allocations beyond the output buffer.

Weaknesses:

  • Must follow the spec precisely. If a label value contains " or \n, it must be escaped. For tidalDB's labels (partition_id="0", version="0.1.0"), these are compile-time string literals -- no escaping needed.
  • If tidalDB grows to 50+ metrics with histograms, a library becomes justified. But at 5-10 counters/gauges, it is not.
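
The "simply check line structure" validation mentioned above could be a small helper used from unit tests. The sketch below is one hypothetical way to do it (check_exposition is not an existing tidalDB function). It assumes counters and gauges only and samples without trailing timestamps, which matches the m0p2 scope.

```rust
use std::collections::HashSet;

/// Checks the line structure of a counters-and-gauges exposition:
/// each metric has at most one HELP and one TYPE line, both appear before
/// its first sample, and every sample value parses as a float.
/// Assumes samples carry no trailing timestamp.
pub fn check_exposition(text: &str) -> Result<(), String> {
    let mut helped: HashSet<&str> = HashSet::new();
    let mut typed: HashSet<&str> = HashSet::new();

    for (idx, line) in text.lines().enumerate() {
        let n = idx + 1;
        if line.is_empty() {
            continue;
        }
        if let Some(rest) = line.strip_prefix("# HELP ") {
            let name = rest.split(' ').next().unwrap_or("");
            if !helped.insert(name) {
                return Err(format!("line {n}: duplicate HELP for {name}"));
            }
        } else if let Some(rest) = line.strip_prefix("# TYPE ") {
            let mut parts = rest.split(' ');
            let name = parts.next().unwrap_or("");
            let kind = parts.next().unwrap_or("");
            if !matches!(kind, "counter" | "gauge" | "histogram" | "summary" | "untyped") {
                return Err(format!("line {n}: unknown type {kind:?}"));
            }
            if !typed.insert(name) {
                return Err(format!("line {n}: duplicate TYPE for {name}"));
            }
        } else if line.starts_with('#') {
            continue; // any other comment line is ignored
        } else {
            // A sample: the metric name ends at '{' or at the first space.
            let end = line
                .find(|c: char| c == '{' || c == ' ')
                .ok_or_else(|| format!("line {n}: sample has no value"))?;
            let name = &line[..end];
            if !helped.contains(name) || !typed.contains(name) {
                return Err(format!("line {n}: sample for {name} precedes HELP/TYPE"));
            }
            // With no timestamp, the value is the last space-separated token.
            let value = line.rsplit(' ').next().unwrap_or("");
            if value.parse::<f64>().is_err() {
                return Err(format!("line {n}: bad value {value:?}"));
            }
        }
    }
    Ok(())
}
```

This is stricter than the format itself, which allows HELP and TYPE to be omitted; requiring both keeps the hand-written renderer honest. It is not a substitute for scraping the endpoint with a real Prometheus instance at least once.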

Comparison

| Criterion | prometheus (tikv) | prometheus-client | Hand-written |
|---|---|---|---|
| Dependencies | ~8 (incl. protobuf) | ~3 | 0 |
| Binary size | ~100-200 KiB | ~50-100 KiB | 0 KiB |
| Histogram support | Yes | Yes | No (not needed) |
| Allocation during encode | Yes (Collector trait) | No (visitor pattern) | No (write! to buffer) |
| Format correctness | Guaranteed | Guaranteed | Unit-tested |
| Lines of code (user-side) | ~30 (register + encode) | ~30 (register + encode) | ~40 (format directly) |
| #![forbid(unsafe_code)] | Unknown | Yes | Yes |

Recommendation: Hand-written Prometheus text format

For 5-10 counters and gauges with known-safe label values, hand-writing the exposition format is the clear choice. The implementation is approximately 40 lines:

use std::fmt::Write;

/// Point-in-time metric values. The field set is illustrative;
/// this document does not define MetricsSnapshot.
pub struct MetricsSnapshot {
    pub uptime_secs: f64,
    pub wal_sequence: u64,
}

pub fn render_prometheus_metrics(metrics: &MetricsSnapshot) -> String {
    let mut out = String::with_capacity(1024);

    write_gauge(&mut out, "tidaldb_uptime_seconds",
        "Seconds since the database was opened",
        &[("partition_id", "0")], metrics.uptime_secs);

    // Samples are written as f64; u64 values above 2^53 would lose precision.
    write_counter(&mut out, "tidaldb_wal_sequence",
        "Current WAL sequence number",
        &[("partition_id", "0")], metrics.wal_sequence as f64);

    // ... more metrics
    out
}

// Writing to a String cannot fail, so every fmt::Result below is discarded.

fn write_gauge(out: &mut String, name: &str, help: &str,
               labels: &[(&str, &str)], value: f64) {
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} gauge");
    write_sample(out, name, labels, value);
}

fn write_counter(out: &mut String, name: &str, help: &str,
                 labels: &[(&str, &str)], value: f64) {
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} counter");
    write_sample(out, name, labels, value);
}

/// Writes `name{k="v",...} value`. Label values are emitted verbatim,
/// so callers must pass values that need no escaping.
fn write_sample(out: &mut String, name: &str,
                labels: &[(&str, &str)], value: f64) {
    let _ = write!(out, "{name}{{");
    for (i, (k, v)) in labels.iter().enumerate() {
        if i > 0 { let _ = write!(out, ","); }
        let _ = write!(out, "{k}=\"{v}\"");
    }
    let _ = writeln!(out, "}} {value}");
}

When to migrate: If tidalDB needs histograms (e.g., query latency distributions) or 50+ metrics, adopt prometheus-client (the official Prometheus crate, not tikv's). Pin to prometheus-client = "0.22". But that is a post-m0p2 decision.


Question 4: Serde for Config Serialization

Current State

Config is a 4-field struct (mode: StorageMode, data_dir: Option<PathBuf>, wal_dir: Option<PathBuf>, cache_dir: Option<PathBuf>). It currently has no serialization support. The CLI needs to read a serialized config snapshot from disk.

Approaches Surveyed

Approach 1: serde + serde_json (feature-gated on library crate)

How it works: Add #[derive(Serialize, Deserialize)] to Config and StorageMode behind a serde feature flag. The CLI binary depends on the library with the serde feature enabled. serde_json handles the JSON encoding.

Evidence:

  • serde (1.0.228) and serde_json (1.0.149) are already in Cargo.lock via criterion.
  • CODING_GUIDELINES.md line 296 explicitly approves serde/serde_json: "serialization (at API boundaries only, not in hot paths)."
  • Best practice from Rust API Guidelines and community consensus: library crates should feature-gate serde behind an optional serde feature.
  • Binary size: serde_json adds ~70-100 KiB to release binaries. serde_derive's proc-macro adds ~5-10s to initial compile, but is already compiled for criterion.
  • fjall (tidalDB's storage engine) does not use serde -- adding it to tidalDB does not create a circular dependency or conflict.

Strengths:

  • Industry standard. Every Rust developer knows serde.
  • Already approved in CODING_GUIDELINES.md.
  • Already compiled in dev builds (via criterion).
  • Feature-gated: embedders who do not need serialization pay zero cost.
  • Config is at an API boundary (CLI reads library's config), exactly where serde belongs.

Weaknesses:

  • serde_derive adds proc-macro compile time. Mitigated by: already compiled for criterion.
  • Monomorphization can bloat binary. Mitigated by: Config is a small struct with 4 fields; the generated code is minimal.

Approach 2: miniserde

How it works: Lightweight alternative to serde that uses trait objects instead of monomorphization. ~12x less code than serde + serde_derive + serde_json combined.

Evidence:

  • JSON-only. No format plugins.
  • No error messages on deserialization failure.
  • Does not support enums with data (only C-style enums). StorageMode is C-style, so this works.
  • Does not support #[serde(rename)] or most serde attributes.
  • Limited type support (no tuple structs, no enums with variant data).

Strengths:

  • Smaller binary size than serde.
  • Faster compile time (no proc-macro overhead comparable to serde_derive).

Weaknesses:

  • serde is already compiled in the workspace. miniserde adds a new dependency tree rather than reusing what exists.
  • No error messages -- if the CLI reads a corrupt config file, it gets an opaque error with no indication of what went wrong.
  • Would become a migration tax later when tidalDB needs serde for other types (e.g., schema definitions, ranking profiles).

Approach 3: Hand-written JSON serialization

How it works: Implement Display for Config that writes JSON manually, and a from_json_str function that parses it. For a 4-field struct, this is ~50-80 lines.

Evidence:

  • Zero dependencies.
  • But: manual JSON parsing is error-prone. Escaping, nested objects, null handling, and whitespace tolerance all need implementation.
  • tidalDB will need JSON serialization in multiple places beyond Config (API responses, query results, schema export). Implementing a JSON parser from scratch to avoid an already-approved dependency is false economy.

Strengths:

  • Zero dependency cost.

Weaknesses:

  • JSON parsing is not a 200-line problem if done correctly. Escaping, unicode, nested structures, error reporting -- this is exactly what serde_json solves.
  • Creates maintenance burden that serde eliminates.
  • CODING_GUIDELINES.md already approved serde for this exact use case.

Comparison

| Criterion | serde + serde_json | miniserde | Hand-written |
|---|---|---|---|
| Already in Cargo.lock | Yes (via criterion) | No | N/A |
| Approved in CODING_GUIDELINES | Yes (explicitly) | No | N/A |
| Error messages on parse failure | Yes (detailed) | None | Custom |
| Enum support | Full | C-style only | Custom |
| Future reuse in tidalDB | High (schema, API, query results) | Low | Low |
| Binary size overhead | ~70-100 KiB | ~30-50 KiB | 0 KiB |
| Compile time overhead | 0s (already compiled) | New compilation | 0s |
| Correctness risk | None (battle-tested) | Low | Medium (hand-rolled parser) |

Recommendation: serde + serde_json, feature-gated

This is the one dependency question where the answer is unambiguously "use the library."

  1. Already approved. CODING_GUIDELINES.md says: "serde / serde_json -- serialization (at API boundaries only, not in hot paths)." Config serialization for CLI communication is the textbook API boundary use case.

  2. Already compiled. Both crates are in Cargo.lock via criterion. Adding them as optional dependencies of the main crate adds zero compile time for developers who are already running tests and benchmarks.

  3. Future-proof. tidalDB will need JSON serialization for: config export, schema definitions, query result formatting, API responses, ranking profile serialization. Every one of these will use serde. Starting with Config establishes the pattern.

  4. Feature-gate it. The library crate adds:

[dependencies]
serde = { version = "1", features = ["derive"], optional = true }
serde_json = { version = "1", optional = true }

[features]
serde = ["dep:serde", "dep:serde_json"]

And on the struct:

use std::path::PathBuf;

// StorageMode needs the same cfg_attr derives, or the derives on Config will not compile.
#[derive(Debug, Clone)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct Config {
    pub mode: StorageMode,
    pub data_dir: Option<PathBuf>,
    pub wal_dir: Option<PathBuf>,
    pub cache_dir: Option<PathBuf>,
}

Embedders who do not need serialization pay nothing. The tidalctl binary crate depends on tidaldb = { path = "../tidal", features = ["serde"] }.
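
For tidalctl to read the serialized Config, the running database has to put it on disk without the CLI ever observing a half-written file. The following is a minimal sketch of one way to do that (write to a temporary file, sync, then rename), using only the standard library. The helper names config_file and write_config_atomically, the config.json.tmp name, and the placeholder JSON are assumptions for illustration. The sketch also assumes the temporary file and the target live on the same filesystem, which is what makes the rename atomic on POSIX systems.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: location of the config dump inside the data directory.
fn config_file(data_dir: &Path) -> PathBuf {
    data_dir.join("config.json")
}

/// Write `json` so that readers see either the previous file or the new one,
/// never a partially written file.
fn write_config_atomically(data_dir: &Path, json: &str) -> io::Result<PathBuf> {
    let target = config_file(data_dir);
    let tmp = data_dir.join("config.json.tmp");

    let mut file = File::create(&tmp)?;
    file.write_all(json.as_bytes())?;
    // Flush file contents to disk before the rename makes them visible.
    file.sync_all()?;
    drop(file);

    // Replaces any existing config.json in a single step.
    fs::rename(&tmp, &target)?;
    Ok(target)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("tidal-config-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;

    // Placeholder payload; the real dump would come from serde_json.
    let path = write_config_atomically(&dir, r#"{"mode":"ephemeral"}"#)?;
    println!("{}", fs::read_to_string(&path)?);

    fs::remove_dir_all(&dir)
}
```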


Open Questions

  1. Config file format and location. m0p2 task-01 says the CLI reads a "Config dump." Where does the running database write this? Likely {data_dir}/config.json written atomically during TidalDb::open(). The exact path should be a Paths method (e.g., paths.config_file()). This is an implementation decision for the engineer, not a research question.

  2. Metrics collection mechanism. The hand-rolled metrics HTTP server needs to read metrics from the database. What is the interface? Options: (a) TidalDb exposes a pub fn metrics_snapshot(&self) -> MetricsSnapshot method; (b) a shared Arc<AtomicU64> counter registry. Option (a) is simpler and keeps the metrics code behind the public API. The engineer should decide based on what metrics are available at m0p2 (uptime and build info are trivial; WAL sequence requires WAL to be wired up).

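  3. Graceful shutdown of the HTTP listener. std::net::TcpListener::accept() blocks, so the listener thread needs some way to notice a shutdown request. Three options: (a) set_nonblocking(true) with a polling loop (simple, slight CPU waste); (b) connect-to-self to unblock accept (clever, no CPU waste); (c) use SO_REUSEADDR + shutdown on a cloned socket (std::net::TcpListener has no shutdown method, so this needs platform-specific code). Option (a) with a 200ms sleep is the simplest and sufficient for a diagnostics endpoint. The cost is roughly five wakeups per second while idle, plus up to 200ms of added latency on accept and on shutdown. Benchmark the CPU overhead if concerned; it should be negligible at a 200ms poll interval.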

  4. When to add clap. If tidalctl grows beyond 5 subcommands or needs dynamic completions, switch to clap. The migration from manual to clap is a single-commit refactor: define a derive struct matching the existing match arms. Document this as the escape hatch in the tidalctl crate README.

  5. When to add prometheus-client. If tidalDB needs histograms (query latency distributions, signal write latency distributions) or exceeds 20 metrics, adopt prometheus-client = "0.22". The hand-written format functions then become metric registrations on a prometheus-client Registry (with Family for labeled metrics). Document the threshold.

  6. Integration testing the HTTP endpoint. The test should call start_metrics_server on an ephemeral port (bind to port 0 and read back the assigned address), send GET /metrics over std::net::TcpStream, and assert that the response contains the expected metric lines. This is straightforward with the hand-rolled approach and does not require an HTTP client library; raw TCP plus string matching is sufficient.
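
Questions 3 and 6 can be explored together. The sketch below assumes option (a) from question 3: a non-blocking listener polled every 200ms, stopped through a shared AtomicBool. It is exercised the way question 6 suggests, with a raw TcpStream and string matching. The names spawn_listener, respond, and http_get, and the placeholder metric line, are illustrative and not existing tidalDB APIs.

```rust
use std::io::{self, ErrorKind, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

const POLL_INTERVAL: Duration = Duration::from_millis(200);

/// Bind `addr`, then accept connections on a background thread until `stop` is set.
fn spawn_listener(addr: &str, stop: Arc<AtomicBool>) -> io::Result<(SocketAddr, JoinHandle<()>)> {
    let listener = TcpListener::bind(addr)?;
    listener.set_nonblocking(true)?;
    let local = listener.local_addr()?;
    let handle = thread::spawn(move || {
        while !stop.load(Ordering::Acquire) {
            match listener.accept() {
                // A failed request only affects that one connection.
                Ok((stream, _peer)) => {
                    let _ = respond(stream);
                }
                // Nothing pending: sleep, then re-check the stop flag.
                Err(e) if e.kind() == ErrorKind::WouldBlock => thread::sleep(POLL_INTERVAL),
                Err(_) => break,
            }
        }
    });
    Ok((local, handle))
}

/// Read the request headers and answer with a fixed placeholder body.
fn respond(mut stream: TcpStream) -> io::Result<()> {
    // Accepted sockets can inherit non-blocking mode on some platforms.
    stream.set_nonblocking(false)?;
    stream.set_read_timeout(Some(Duration::from_secs(1)))?;
    let mut request = Vec::new();
    let mut buf = [0u8; 512];
    while request.len() < 8192 && !request.windows(4).any(|w| w == b"\r\n\r\n") {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            break;
        }
        request.extend_from_slice(&buf[..n]);
    }
    let (status, body) = if request.starts_with(b"GET /metrics ") {
        ("200 OK", "tidaldb_uptime_seconds 0\n") // placeholder metric line
    } else {
        ("404 Not Found", "not found\n")
    };
    let response = format!(
        "HTTP/1.1 {status}\r\nContent-Type: text/plain\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    );
    stream.write_all(response.as_bytes())
}

/// Raw-TCP GET: no HTTP client library, just a request string and the full reply.
fn http_get(addr: SocketAddr, path: &str) -> io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    let request = format!("GET {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

fn main() -> io::Result<()> {
    let stop = Arc::new(AtomicBool::new(false));
    // Port 0 asks the OS for an ephemeral port.
    let (addr, handle) = spawn_listener("127.0.0.1:0", Arc::clone(&stop))?;
    let response = http_get(addr, "/metrics")?;
    assert!(response.starts_with("HTTP/1.1 200 OK"));
    // Shutdown takes effect within one poll interval.
    stop.store(true, Ordering::Release);
    handle.join().expect("listener thread panicked");
    Ok(())
}
```

Both sides build the full message in a String and send it with one write_all call. Writing through write! directly on the socket can split the message into several small writes, which makes the read-until-blank-line loop on the server matter more.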


Summary of Recommendations

| Component | Recommendation | Justification |
|---|---|---|
| CLI argument parsing | Manual std::env::args() | 2 subcommands, 60 lines. "200 lines" test passes. Upgrade path to pico-args/clap exists. |
| HTTP metrics server | Hand-rolled std::net::TcpListener | 2 endpoints, <10 connections. ~100 lines of safe Rust. Zero dependencies. |
| Prometheus text format | Hand-written write! formatting | 5-10 counters/gauges. ~40 lines. Format spec is trivial for this scope. |
| Config serialization | serde + serde_json, feature-gated | Already approved, already compiled, future-proof. Feature-gate as serde. |

Total new dependencies for m0p2: One optional dependency pair (serde + serde_json) that is already in Cargo.lock and already approved. Everything else is standard library code.
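
As an illustration of the manual std::env::args() recommendation, here is a minimal sketch of a two-subcommand parser. The subcommand names status and paths, the Command enum, and parse_args are assumptions made for the example. The match arms map one-to-one onto the variants a clap derive enum would declare, which is what keeps the later migration small.

```rust
/// Hypothetical subcommands for tidalctl.
#[derive(Debug, PartialEq, Eq)]
enum Command {
    Status,
    Paths,
    Help,
}

/// Parse an argv-style iterator (program name first) into a `Command`.
fn parse_args<I>(args: I) -> Result<Command, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter().skip(1); // skip the program name
    let command = match args.next().as_deref() {
        None | Some("-h") | Some("--help") | Some("help") => Command::Help,
        Some("status") => Command::Status,
        Some("paths") => Command::Paths,
        Some(other) => return Err(format!("unknown command: {other}")),
    };
    // Neither subcommand takes arguments in this sketch.
    if let Some(extra) = args.next() {
        return Err(format!("unexpected argument: {extra}"));
    }
    Ok(command)
}

fn main() {
    match parse_args(std::env::args()) {
        Ok(command) => println!("{command:?}"),
        Err(message) => {
            eprintln!("error: {message}");
            std::process::exit(2);
        }
    }
}
```

Taking an iterator instead of calling std::env::args() inside the parser keeps it unit-testable without spawning the binary.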

Estimated code footprint for m0p2 tooling:

  • tidalctl binary: ~150-200 lines (arg parsing + config reading + JSON output)
  • Metrics HTTP server: ~100-120 lines (listener + routing + response)
  • Prometheus formatter: ~40-50 lines (metric rendering)
  • Config serde derives: ~5 lines (derive attributes + feature gate)
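
To give a sense of scale for the Prometheus formatter, the sketch below pairs a snapshot-style interface (option (a) in open question 2) with hand-written write! rendering. The type names MetricsState and MetricsSnapshot, the metric names, and the build_hash label are assumptions for illustration. Label values are written unescaped, which is only safe for values such as a hex hash or "dev" that contain no quotes, backslashes, or newlines.

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical shared state, updated by the database as it runs.
#[derive(Default)]
struct MetricsState {
    uptime_secs: AtomicU64,
    health_ok: AtomicBool,
}

/// Plain-value copy taken at scrape time, so rendering never touches atomics.
struct MetricsSnapshot {
    uptime_secs: u64,
    health_ok: bool,
    build_hash: &'static str,
}

impl MetricsState {
    fn snapshot(&self, build_hash: &'static str) -> MetricsSnapshot {
        MetricsSnapshot {
            uptime_secs: self.uptime_secs.load(Ordering::Relaxed),
            health_ok: self.health_ok.load(Ordering::Relaxed),
            build_hash,
        }
    }
}

/// Render the snapshot in the Prometheus text exposition format.
fn render_prometheus(s: &MetricsSnapshot) -> String {
    let mut out = String::new();
    // Writing into a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP tidaldb_uptime_seconds Seconds since the database was opened.");
    let _ = writeln!(out, "# TYPE tidaldb_uptime_seconds gauge");
    let _ = writeln!(out, "tidaldb_uptime_seconds {}", s.uptime_secs);

    let _ = writeln!(out, "# HELP tidaldb_health_ok 1 if the last health check passed, else 0.");
    let _ = writeln!(out, "# TYPE tidaldb_health_ok gauge");
    let _ = writeln!(out, "tidaldb_health_ok {}", u8::from(s.health_ok));

    let _ = writeln!(out, "# HELP tidaldb_info Build information.");
    let _ = writeln!(out, "# TYPE tidaldb_info gauge");
    // Literal braces are doubled inside a format string.
    let _ = writeln!(out, "tidaldb_info{{build_hash=\"{}\"}} 1", s.build_hash);
    out
}

fn main() {
    let state = MetricsState::default();
    state.uptime_secs.store(42, Ordering::Relaxed);
    state.health_ok.store(true, Ordering::Relaxed);
    print!("{}", render_prometheus(&state.snapshot("dev")));
}
```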
