# Task 03: Embedding Lifecycle + Slot Registry

## Context

**Milestone:** 2 -- Ranked Retrieval
**Phase:** m2p1 -- Vector Index Integration (USearch)
**Depends On:** Task 01 (VectorIndex trait, VectorError, VectorIndexConfig, QuantizationLevel)
**Blocks:** m2p5 (RETRIEVE executor -- needs embedding insert path for write_item with embeddings)
**Complexity:** M

## Objective

Deliver the embedding lifecycle operations (`l2_normalize`, insert, update, delete) and the `EmbeddingSlotRegistry` that maps named embedding slots to their HNSW indexes. This is the layer between the entity write API (`write_item()` with an embedding) and the raw `VectorIndex` trait.

When an application writes an item with an embedding, the lifecycle layer:
1. Validates that the dimensions match the slot definition.
2. L2-normalizes the vector to unit length (so that ranking by L2 distance is equivalent to ranking by cosine similarity).
3. Stores the full-precision (f32) normalized vector in the entity store as the source of truth.
4. Inserts the vector into the HNSW index (which quantizes to f16/int8 internally).
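
The equivalence behind step 2 can be checked numerically. The following is a minimal sketch, independent of tidalDB's code; the helper names `normalize`, `dot`, and `l2_sq` are illustrative only. For unit vectors, squared L2 distance equals `2 - 2 * cos`, and since that expression is strictly decreasing in the cosine, the nearest neighbours by L2 distance are also the neighbours with the highest cosine similarity.

```rust
/// Scale `v` to unit length. Illustrative helper; assumes a non-zero vector.
fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; for two unit vectors this is their cosine similarity.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Squared Euclidean (L2) distance.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```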

The `EmbeddingSlotRegistry` is the central authority for embedding slot configuration. It maps `(EntityKind, slot_name)` to `EmbeddingSlotState` which contains the HNSW index, dimensions, quantization level, and HNSW parameters. The registry is constructed from the schema at `TidalDB::open()` time.

Embeddings in the entity store use the key format `encode_key(entity_id, Tag::Meta, b"EMB:slot_name")`. This co-locates embedding data with entity metadata under the same entity prefix, enabling efficient prefix scans for entity-level operations.
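
To illustrate why co-locating keys under the entity prefix helps, here is a sketch of a prefix scan over an ordered map. `encode_key_sketch` is a hypothetical stand-in for `encode_key` (big-endian id, NUL separator, one tag byte, suffix); the real layout from m1p3 may differ, and `BTreeMap` stands in for the storage engine.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for tidalDB's `encode_key`: big-endian entity id,
/// a NUL separator, a one-byte tag, then the suffix. The real layout may differ.
fn encode_key_sketch(entity_id: u64, tag: u8, suffix: &[u8]) -> Vec<u8> {
    let mut key = entity_id.to_be_bytes().to_vec();
    key.push(0);
    key.push(tag);
    key.extend_from_slice(suffix);
    key
}

/// Collect the embedding slot names stored for one entity by scanning the
/// ordered key space starting at the shared `...EMB:` prefix.
fn embedding_slots(store: &BTreeMap<Vec<u8>, Vec<u8>>, entity_id: u64, tag: u8) -> Vec<String> {
    let prefix = encode_key_sketch(entity_id, tag, b"EMB:");
    store
        .range(prefix.clone()..)
        // Keys are sorted, so the scan can stop at the first non-matching key.
        .take_while(|(k, _)| k.starts_with(&prefix))
        .map(|(k, _)| String::from_utf8_lossy(&k[prefix.len()..]).into_owned())
        .collect()
}
```

Because all keys for one entity sort together, the scan touches only that entity's embedding keys and never visits other entities.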

## Requirements

- `l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError>` normalizes to unit length
- `l2_normalize` fails with `VectorError::ZeroNormVector` on zero-norm input
- `l2_normalize` verifies the result (via `debug_assert!`, so in debug builds only): `|1.0 - ||result||| < 1e-5`
- `EmbeddingSlotRegistry` maps `(EntityKind, String)` to `EmbeddingSlotState`
- `EmbeddingSlotState` holds: `Box<dyn VectorIndex>`, `dimensions`, `quantization`, `source`, `params`
- `EmbeddingSource` enum: `External` (provided by application), `DatabaseManaged` (computed by tidalDB)
- Insert path: validate dims, normalize, store in entity store, insert into HNSW
- Update path: validate dims, normalize, update entity store, tombstone old in HNSW, insert new
- Delete path: tombstone in HNSW, optionally remove from entity store
- Entity store key format: `encode_key(entity_id, Tag::Meta, b"EMB:slot_name")`
- Entity store value format: `[dimensions: 4 bytes LE][vector: dimensions * 4 bytes, f32 LE]`
- No `unsafe` code
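
The value format can be worked through by hand for a small vector. The sketch below re-implements the layout independently; the names `encode` and `decode` are illustrative and not part of the API. For `[1.0, 0.5]` the encoding is 12 bytes: the header `02 00 00 00`, then `00 00 80 3F` (1.0), then `00 00 00 3F` (0.5).

```rust
/// Encode as `[dimensions: u32 LE][f32 LE ...]`, mirroring the value format above.
fn encode(v: &[f32]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + v.len() * 4);
    buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
    for x in v {
        buf.extend_from_slice(&x.to_le_bytes());
    }
    buf
}

/// Decode, returning `None` when the header is missing or the payload length
/// does not match the declared dimension count.
fn decode(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() < 4 {
        return None;
    }
    let (header, payload) = bytes.split_at(4);
    let dim = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    if payload.len() != dim.checked_mul(4)? {
        return None;
    }
    Some(
        payload
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```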

## Technical Design

### Module Structure

```
tidal/src/storage/vector/
  lifecycle.rs    -- l2_normalize, EmbeddingOps (insert/update/delete helpers)
  registry.rs     -- EmbeddingSlotRegistry, EmbeddingSlotState, EmbeddingSource, HnswParams
```

### Public API

```rust
// === storage/vector/lifecycle.rs ===

use super::{VectorError, VectorId, VectorIndex};
use crate::schema::EntityId;
use crate::storage::{StorageEngine, Tag, encode_key};

/// L2-normalize a vector to unit length.
///
/// Computes `v[i] = v[i] / ||v||` where `||v|| = sqrt(sum(v[i]^2))`.
///
/// For L2-normalized vectors, squared L2 distance is a decreasing function of
/// cosine similarity, so both produce the same ranking:
/// `||a - b||^2 = 2 - 2 * cos(a, b)`.
///
/// # Errors
///
/// Returns `VectorError::ZeroNormVector` if the vector has zero norm (all zeros).
/// A zero vector has no direction and cannot participate in cosine similarity.
///
/// # Post-conditions
///
/// The returned vector has L2 norm within `1e-5` of 1.0
/// (checked with `debug_assert!`, i.e. in debug builds only).
pub fn l2_normalize(v: &[f32]) -> Result<Vec<f32>, VectorError> {
    let norm_sq: f32 = v.iter().map(|x| x * x).sum();
    // The threshold applies to the squared norm, so this rejects any vector
    // with ||v|| < sqrt(f32::EPSILON) (about 3.45e-4), not only exact zeros.
    if norm_sq < f32::EPSILON {
        return Err(VectorError::ZeroNormVector);
    }
    let norm = norm_sq.sqrt();
    let result: Vec<f32> = v.iter().map(|x| x / norm).collect();

    // Post-condition: verify normalization
    debug_assert!({
        let result_norm: f32 = result.iter().map(|x| x * x).sum::<f32>().sqrt();
        (1.0 - result_norm).abs() < 1e-5
    });

    Ok(result)
}

/// Build the entity store key for an embedding slot.
///
/// Format: `encode_key(entity_id, Tag::Meta, b"EMB:slot_name")`
pub fn embedding_store_key(entity_id: EntityId, slot_name: &str) -> Vec<u8> {
    let suffix = format!("EMB:{slot_name}");
    encode_key(entity_id, Tag::Meta, suffix.as_bytes())
}

/// Serialize an embedding vector for entity store storage.
///
/// Format: `[dimensions: 4 bytes LE][vector: dimensions * 4 bytes, f32 LE]`
pub fn serialize_embedding(v: &[f32]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + v.len() * 4);
    buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
    for &x in v {
        buf.extend_from_slice(&x.to_le_bytes());
    }
    buf
}

/// Deserialize an embedding vector from entity store storage.
///
/// Returns the f32 vector or an error if the data is corrupt.
pub fn deserialize_embedding(bytes: &[u8]) -> Result<Vec<f32>, VectorError> {
    if bytes.len() < 4 {
        return Err(VectorError::CorruptedIndex(
            "embedding data too short for dimension header".into()));
    }
    let dim = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    let expected_len = 4 + dim * 4;
    if bytes.len() != expected_len {
        return Err(VectorError::CorruptedIndex(
            format!("embedding data length {} != expected {expected_len}", bytes.len())));
    }
    let mut v = Vec::with_capacity(dim);
    for i in 0..dim {
        let offset = 4 + i * 4;
        let x = f32::from_le_bytes([
            bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3],
        ]);
        v.push(x);
    }
    Ok(v)
}

/// Insert an embedding for an entity.
///
/// 1. Validates dimensions match the expected `dimensions`.
/// 2. L2-normalizes the vector.
/// 3. Stores the normalized f32 vector in the entity store.
/// 4. Inserts the normalized vector into the HNSW index.
///
/// The entity store is the source of truth. The HNSW index is derived state.
pub fn insert_embedding(
    entity_id: EntityId,
    slot_name: &str,
    raw_vector: &[f32],
    expected_dimensions: usize,
    index: &dyn VectorIndex,
    storage: &dyn StorageEngine,
) -> Result<(), VectorError> {
    // Validate dimensions
    if raw_vector.len() != expected_dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: expected_dimensions,
            got: raw_vector.len(),
        });
    }

    // Normalize
    let normalized = l2_normalize(raw_vector)?;

    // Store in entity store (source of truth)
    let key = embedding_store_key(entity_id, slot_name);
    let value = serialize_embedding(&normalized);
    storage.put(&key, &value)
        .map_err(|e| VectorError::Backend(format!("entity store write failed: {e}")))?;

    // Insert into HNSW index
    index.insert(entity_id.as_u64(), &normalized)?;

    Ok(())
}

/// Update an embedding for an entity.
///
/// 1. Validates dimensions.
/// 2. L2-normalizes the new vector.
/// 3. Updates the entity store.
/// 4. Tombstones the old vector in HNSW.
/// 5. Inserts the new vector into HNSW.
///
/// Note: Between steps 4 and 5, the entity is absent from ANN results.
/// This window is microseconds and is acceptable per Spec 07, Section 6.
pub fn update_embedding(
    entity_id: EntityId,
    slot_name: &str,
    raw_vector: &[f32],
    expected_dimensions: usize,
    index: &dyn VectorIndex,
    storage: &dyn StorageEngine,
) -> Result<(), VectorError> {
    if raw_vector.len() != expected_dimensions {
        return Err(VectorError::DimensionMismatch {
            expected: expected_dimensions,
            got: raw_vector.len(),
        });
    }

    let normalized = l2_normalize(raw_vector)?;

    // Update entity store
    let key = embedding_store_key(entity_id, slot_name);
    let value = serialize_embedding(&normalized);
    storage.put(&key, &value)
        .map_err(|e| VectorError::Backend(format!("entity store write failed: {e}")))?;

    // Tombstone old in HNSW, insert new.
    // delete() may return NotFound if the entity was never indexed (first embedding).
    // That is fine: ignore NotFound, but propagate every other error.
    // `NotFound` is the variant name implied by the note above; the `{ .. }`
    // pattern matches whatever shape Task 01 gives that variant.
    match index.delete(entity_id.as_u64()) {
        Ok(_) | Err(VectorError::NotFound { .. }) => {}
        Err(e) => return Err(e),
    }
    index.insert(entity_id.as_u64(), &normalized)?;

    Ok(())
}

/// Delete an embedding for an entity.
///
/// 1. Tombstones the vector in HNSW.
/// 2. Optionally removes the embedding from the entity store.
///
/// For archive (soft delete): tombstone HNSW only, keep entity store data.
/// For hard delete: tombstone HNSW and remove entity store key.
pub fn delete_embedding(
    entity_id: EntityId,
    slot_name: &str,
    index: &dyn VectorIndex,
    storage: &dyn StorageEngine,
    hard_delete: bool,
) -> Result<(), VectorError> {
    // Tombstone in HNSW
    index.delete(entity_id.as_u64())?;

    // Optionally remove from entity store
    if hard_delete {
        let key = embedding_store_key(entity_id, slot_name);
        storage.delete(&key)
            .map_err(|e| VectorError::Backend(format!("entity store delete failed: {e}")))?;
    }

    Ok(())
}
```
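
One consequence of the zero-norm check in `l2_normalize` is easy to miss: the threshold is compared against the squared norm, so small but non-zero vectors are rejected too. The sketch below reproduces only that check, with `NormalizeError` as a stand-in for the relevant `VectorError` variant (the real enum comes from Task 01).

```rust
/// Stand-in for `VectorError::ZeroNormVector`; the real enum is defined in Task 01.
#[derive(Debug, PartialEq)]
enum NormalizeError {
    ZeroNorm,
}

/// Same check as `l2_normalize`: the threshold applies to the squared norm,
/// so any vector with norm below sqrt(f32::EPSILON) (about 3.45e-4) is rejected.
fn normalize_sketch(v: &[f32]) -> Result<Vec<f32>, NormalizeError> {
    let norm_sq: f32 = v.iter().map(|x| x * x).sum();
    if norm_sq < f32::EPSILON {
        return Err(NormalizeError::ZeroNorm);
    }
    let norm = norm_sq.sqrt();
    Ok(v.iter().map(|x| x / norm).collect())
}
```

With this check, `[1e-4, 0.0]` (squared norm about 1e-8) is rejected while `[1e-3, 0.0]` (squared norm about 1e-6) normalizes to roughly `[1.0, 0.0]`. Applications whose embeddings can legitimately have very small magnitudes would need to rescale before writing.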

### EmbeddingSlotRegistry

```rust
// === storage/vector/registry.rs ===

use std::collections::HashMap;
use crate::schema::EntityKind;
// `VectorError` is needed by the `register`, `save_all`, and `load_all` signatures.
use super::{VectorError, VectorIndex, VectorIndexConfig, QuantizationLevel};

/// HNSW parameters for an embedding slot.
#[derive(Debug, Clone)]
pub struct HnswParams {
    pub connectivity: usize,
    pub ef_construction: usize,
    pub ef_search: usize,
}

impl Default for HnswParams {
    fn default() -> Self {
        Self {
            connectivity: 16,
            ef_construction: 200,
            ef_search: 200,
        }
    }
}

/// Source of an embedding slot's vectors.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmbeddingSource {
    /// Provided by the application via `write_item()` or `write_user()`.
    External,
    /// Computed and maintained by tidalDB (e.g., user preference vector,
    /// creator catalog embedding).
    DatabaseManaged,
}

/// State for a single embedding slot.
pub struct EmbeddingSlotState {
    /// The HNSW index for this slot.
    pub index: Box<dyn VectorIndex>,
    /// Number of dimensions for this slot.
    pub dimensions: usize,
    /// Quantization level used in the HNSW index.
    pub quantization: QuantizationLevel,
    /// Whether this embedding is externally provided or database-managed.
    pub source: EmbeddingSource,
    /// HNSW graph parameters.
    pub params: HnswParams,
}

/// Registry of all embedding slots across all entity types.
///
/// Constructed from the schema at `TidalDB::open()` time. Each entity type
/// can define up to 4 embedding slots (per Entity Model Specification).
///
/// # Example
///
/// ```text
/// Item "content"     -> 1536d, f16, External, M=16
/// Item "visual"      -> 512d,  f16, External, M=16
/// User "preference"  -> 1536d, f16, DatabaseManaged, M=16
/// ```
// `Default` is derived so that `new()` does not trip clippy's
// `new_without_default` lint under `-D warnings`.
#[derive(Default)]
pub struct EmbeddingSlotRegistry {
    slots: HashMap<(EntityKind, String), EmbeddingSlotState>,
}

impl EmbeddingSlotRegistry {
    /// Create an empty registry.
    pub fn new() -> Self {
        Self { slots: HashMap::new() }
    }

    /// Register an embedding slot.
    ///
    /// # Errors
    ///
    /// Returns an error if a slot with the same `(entity_kind, slot_name)` already exists.
    pub fn register(
        &mut self,
        entity_kind: EntityKind,
        slot_name: String,
        state: EmbeddingSlotState,
    ) -> Result<(), VectorError>;

    /// Look up an embedding slot by entity kind and slot name.
    ///
    /// Returns `None` if the slot is not registered.
    pub fn get(&self, entity_kind: EntityKind, slot_name: &str) -> Option<&EmbeddingSlotState>;

    /// Look up an embedding slot mutably.
    pub fn get_mut(
        &mut self,
        entity_kind: EntityKind,
        slot_name: &str,
    ) -> Option<&mut EmbeddingSlotState>;

    /// List all slot names for a given entity kind.
    pub fn slots_for(&self, entity_kind: EntityKind) -> Vec<&str>;

    /// Total number of registered slots across all entity kinds.
    pub fn slot_count(&self) -> usize {
        self.slots.len()
    }

    /// Save all indexes to disk under the given directory.
    ///
    /// File naming: `{data_dir}/vector/{entity_kind}_{slot_name}.usearch`
    pub fn save_all(&self, data_dir: &std::path::Path) -> Result<(), VectorError>;

    /// Load all indexes from disk.
    ///
    /// Uses `view()` for immediate read serving, then optionally `load()` for
    /// writable access in the background.
    pub fn load_all(&mut self, data_dir: &std::path::Path) -> Result<(), VectorError>;
}
```
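
Several registry methods above are given as signatures only. As a rough sketch of how `register`, `get`, and `slots_for` could behave, the following uses stand-ins throughout: a local `EntityKind` enum (assumed to be `Copy + Eq + Hash`), a `Registry<S>` generic over the slot state, and `String` in place of `VectorError::Backend`. It is not the planned implementation.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Stand-in for `crate::schema::EntityKind` (assumed `Copy + Eq + Hash`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum EntityKind {
    Item,
    User,
}

/// Registry generic over the slot state so the sketch stays self-contained.
struct Registry<S> {
    slots: HashMap<(EntityKind, String), S>,
}

impl<S> Registry<S> {
    fn new() -> Self {
        Self { slots: HashMap::new() }
    }

    /// Reject duplicates instead of silently replacing an existing slot.
    fn register(&mut self, kind: EntityKind, name: String, state: S) -> Result<(), String> {
        match self.slots.entry((kind, name)) {
            Entry::Occupied(e) => {
                Err(format!("slot already registered: {:?}/{}", e.key().0, e.key().1))
            }
            Entry::Vacant(e) => {
                e.insert(state);
                Ok(())
            }
        }
    }

    /// A `(EntityKind, String)` key cannot be borrowed as `(EntityKind, &str)`,
    /// so this simple lookup allocates a `String` per call.
    fn get(&self, kind: EntityKind, name: &str) -> Option<&S> {
        self.slots.get(&(kind, name.to_owned()))
    }

    /// Slot names for one entity kind, sorted for deterministic output.
    fn slots_for(&self, kind: EntityKind) -> Vec<&str> {
        let mut names: Vec<&str> = self
            .slots
            .keys()
            .filter(|(k, _)| *k == kind)
            .map(|(_, name)| name.as_str())
            .collect();
        names.sort_unstable();
        names
    }
}
```

The per-lookup allocation in `get` is a trade-off of the tuple key. With at most 4 slots per entity type the registry is tiny, so a nested `HashMap<EntityKind, HashMap<String, S>>` would avoid the allocation if lookups ever show up in profiles.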

### Error Handling

- `l2_normalize()` with zero vector: returns `VectorError::ZeroNormVector`.
- Dimension mismatch on insert/update: returns `VectorError::DimensionMismatch`.
- Entity store I/O failure: returns `VectorError::Backend` wrapping the storage error.
- Corrupt embedding data on deserialize: returns `VectorError::CorruptedIndex`.
- Duplicate slot registration: returns `VectorError::Backend("slot already registered: ...")`.
- Slot not found in registry: returns `None` (not an error -- callers check before use).

## Test Strategy

### Property Tests

```rust
use proptest::prelude::*;

// l2_normalize produces unit vectors.
proptest! {
    #[test]
    fn normalize_produces_unit_vector(
        v in prop::collection::vec(-100.0f32..100.0, 2..256),
    ) {
        // Skip zero vectors (they fail normalization, which is correct)
        let norm_sq: f32 = v.iter().map(|x| x * x).sum();
        prop_assume!(norm_sq > f32::EPSILON);

        let normalized = l2_normalize(&v).unwrap();
        let result_norm: f32 = normalized.iter().map(|x| x * x).sum::<f32>().sqrt();
        prop_assert!(
            (1.0 - result_norm).abs() < 1e-5,
            "norm was {result_norm}, expected ~1.0"
        );
    }
}

// l2_normalize is idempotent: normalizing a unit vector returns the same vector.
proptest! {
    #[test]
    fn normalize_idempotent(
        v in prop::collection::vec(-100.0f32..100.0, 2..256),
    ) {
        let norm_sq: f32 = v.iter().map(|x| x * x).sum();
        prop_assume!(norm_sq > f32::EPSILON);

        let first = l2_normalize(&v).unwrap();
        let second = l2_normalize(&first).unwrap();

        for (a, b) in first.iter().zip(second.iter()) {
            prop_assert!((a - b).abs() < 1e-5,
                "idempotent check failed: {a} vs {b}");
        }
    }
}

// l2_normalize preserves direction (cosine similarity with original = 1.0).
proptest! {
    #[test]
    fn normalize_preserves_direction(
        v in prop::collection::vec(1.0f32..100.0, 2..256),
    ) {
        let normalized = l2_normalize(&v).unwrap();

        // Cosine similarity between v and normalized(v) should be ~1.0
        let dot: f32 = v.iter().zip(normalized.iter()).map(|(a, b)| a * b).sum();
        let norm_v: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        let cosine = dot / norm_v; // normalized already has norm 1

        prop_assert!(
            (1.0 - cosine).abs() < 1e-4,
            "cosine similarity with original was {cosine}, expected ~1.0"
        );
    }
}

// Embedding serialize/deserialize roundtrip.
proptest! {
    #[test]
    fn embedding_serde_roundtrip(
        v in prop::collection::vec(-1.0f32..1.0, 1..512),
    ) {
        let bytes = serialize_embedding(&v);
        let restored = deserialize_embedding(&bytes).unwrap();
        prop_assert_eq!(v.len(), restored.len());
        for (a, b) in v.iter().zip(restored.iter()) {
            prop_assert!((a - b).abs() < 1e-7,
                "serde mismatch: {a} vs {b}");
        }
    }
}

// Insert + search roundtrip via BruteForceIndex.
proptest! {
    #[test]
    fn insert_embedding_searchable(
        dim in 2usize..64,
        n in 1usize..50,
    ) {
        let config = VectorIndexConfig {
            dimensions: dim,
            ..VectorIndexConfig::default()
        };
        let index = BruteForceIndex::new(config);
        let storage = InMemoryBackend::new();

        for id in 0..n as u64 {
            let raw: Vec<f32> = (0..dim)
                .map(|i| ((id as usize + i) % 100) as f32 / 100.0 + 0.01)
                .collect();
            insert_embedding(
                EntityId::new(id + 1),
                "content",
                &raw,
                dim,
                &index,
                &storage,
            ).unwrap();
        }

        // Verify all are searchable
        prop_assert_eq!(index.len(), n);

        // Verify entity store has the normalized vectors
        for id in 0..n as u64 {
            let key = embedding_store_key(EntityId::new(id + 1), "content");
            let bytes = storage.get(&key).unwrap();
            prop_assert!(bytes.is_some(), "entity store should have embedding for id {id}");
            let stored = deserialize_embedding(&bytes.unwrap()).unwrap();
            let norm: f32 = stored.iter().map(|x| x * x).sum::<f32>().sqrt();
            prop_assert!((1.0 - norm).abs() < 1e-5, "stored embedding should be normalized");
        }
    }
}
```

### Unit Tests

```rust
#[test]
fn l2_normalize_unit_vector() {
    let v = vec![1.0, 0.0, 0.0];
    let normalized = l2_normalize(&v).unwrap();
    assert!((normalized[0] - 1.0).abs() < 1e-6);
    assert!(normalized[1].abs() < 1e-6);
    assert!(normalized[2].abs() < 1e-6);
}

#[test]
fn l2_normalize_non_unit_vector() {
    let v = vec![3.0, 4.0]; // norm = 5
    let normalized = l2_normalize(&v).unwrap();
    assert!((normalized[0] - 0.6).abs() < 1e-5);
    assert!((normalized[1] - 0.8).abs() < 1e-5);
    let norm: f32 = normalized.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((1.0 - norm).abs() < 1e-5);
}

#[test]
fn l2_normalize_zero_vector_fails() {
    let v = vec![0.0, 0.0, 0.0];
    let result = l2_normalize(&v);
    assert!(matches!(result, Err(VectorError::ZeroNormVector)));
}

#[test]
fn l2_normalize_near_zero_vector_fails() {
    let v = vec![1e-40, 0.0, 0.0]; // norm^2 < f32::EPSILON
    let result = l2_normalize(&v);
    assert!(matches!(result, Err(VectorError::ZeroNormVector)));
}

#[test]
fn serialize_deserialize_embedding() {
    let v = vec![1.0, 2.0, 3.0];
    let bytes = serialize_embedding(&v);
    assert_eq!(bytes.len(), 4 + 3 * 4); // 4 dim header + 12 data
    let restored = deserialize_embedding(&bytes).unwrap();
    assert_eq!(v, restored);
}

#[test]
fn deserialize_embedding_truncated() {
    let result = deserialize_embedding(&[0x03, 0x00, 0x00]); // too short for header
    assert!(matches!(result, Err(VectorError::CorruptedIndex(_))));
}

#[test]
fn deserialize_embedding_wrong_length() {
    let mut bytes = serialize_embedding(&[1.0, 2.0]);
    bytes.pop(); // truncate one byte
    let result = deserialize_embedding(&bytes);
    assert!(matches!(result, Err(VectorError::CorruptedIndex(_))));
}

#[test]
fn embedding_store_key_format() {
    let key = embedding_store_key(EntityId::new(42), "content");
    let (eid, tag, suffix) = parse_key(&key).unwrap();
    assert_eq!(eid, EntityId::new(42));
    assert_eq!(tag, Tag::Meta);
    assert_eq!(suffix, b"EMB:content");
}

#[test]
fn embedding_store_key_different_slots() {
    let key_content = embedding_store_key(EntityId::new(1), "content");
    let key_visual = embedding_store_key(EntityId::new(1), "visual");
    assert_ne!(key_content, key_visual);
}

#[test]
fn insert_embedding_validates_dimensions() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let storage = InMemoryBackend::new();

    let result = insert_embedding(
        EntityId::new(1), "content", &[1.0, 2.0], 3, &index, &storage,
    );
    assert!(matches!(result, Err(VectorError::DimensionMismatch { expected: 3, got: 2 })));
}

#[test]
fn insert_embedding_stores_normalized_vector() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let storage = InMemoryBackend::new();

    insert_embedding(
        EntityId::new(1), "content", &[3.0, 4.0, 0.0], 3, &index, &storage,
    ).unwrap();

    // Read from entity store
    let key = embedding_store_key(EntityId::new(1), "content");
    let bytes = storage.get(&key).unwrap().unwrap();
    let stored = deserialize_embedding(&bytes).unwrap();

    // Should be normalized (norm = 5, so [0.6, 0.8, 0.0])
    assert!((stored[0] - 0.6).abs() < 1e-5);
    assert!((stored[1] - 0.8).abs() < 1e-5);
    assert!(stored[2].abs() < 1e-5);
}

#[test]
fn insert_embedding_zero_vector_fails() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let storage = InMemoryBackend::new();

    let result = insert_embedding(
        EntityId::new(1), "content", &[0.0, 0.0, 0.0], 3, &index, &storage,
    );
    assert!(matches!(result, Err(VectorError::ZeroNormVector)));
}

#[test]
fn update_embedding_replaces_vector() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let storage = InMemoryBackend::new();

    // Insert original
    insert_embedding(
        EntityId::new(1), "content", &[1.0, 0.0, 0.0], 3, &index, &storage,
    ).unwrap();

    // Update
    update_embedding(
        EntityId::new(1), "content", &[0.0, 1.0, 0.0], 3, &index, &storage,
    ).unwrap();

    // Search should find the updated vector
    let results = index.search(&[0.0, 1.0, 0.0], 1, 0).unwrap();
    assert_eq!(results[0].id, 1);
    assert!(results[0].distance < 1e-5, "should match updated vector");
}

#[test]
fn delete_embedding_removes_from_index() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let storage = InMemoryBackend::new();

    insert_embedding(
        EntityId::new(1), "content", &[1.0, 0.0, 0.0], 3, &index, &storage,
    ).unwrap();

    delete_embedding(EntityId::new(1), "content", &index, &storage, false).unwrap();

    // Should not appear in search results
    let results = index.search(&[1.0, 0.0, 0.0], 10, 0).unwrap();
    assert!(results.is_empty());

    // Soft delete: entity store still has the embedding
    let key = embedding_store_key(EntityId::new(1), "content");
    assert!(storage.get(&key).unwrap().is_some());
}

#[test]
fn delete_embedding_hard_removes_from_store() {
    let config = VectorIndexConfig { dimensions: 3, ..VectorIndexConfig::default() };
    let index = BruteForceIndex::new(config);
    let storage = InMemoryBackend::new();

    insert_embedding(
        EntityId::new(1), "content", &[1.0, 0.0, 0.0], 3, &index, &storage,
    ).unwrap();

    delete_embedding(EntityId::new(1), "content", &index, &storage, true).unwrap();

    // Entity store should not have the embedding
    let key = embedding_store_key(EntityId::new(1), "content");
    assert!(storage.get(&key).unwrap().is_none());
}

#[test]
fn registry_register_and_lookup() {
    let mut registry = EmbeddingSlotRegistry::new();
    let config = VectorIndexConfig { dimensions: 1536, ..VectorIndexConfig::default() };
    let state = EmbeddingSlotState {
        index: Box::new(BruteForceIndex::new(config)),
        dimensions: 1536,
        quantization: QuantizationLevel::F16,
        source: EmbeddingSource::External,
        params: HnswParams::default(),
    };

    registry.register(EntityKind::Item, "content".into(), state).unwrap();

    let slot = registry.get(EntityKind::Item, "content");
    assert!(slot.is_some());
    assert_eq!(slot.unwrap().dimensions, 1536);
    assert_eq!(slot.unwrap().source, EmbeddingSource::External);
}

#[test]
fn registry_duplicate_slot_fails() {
    let mut registry = EmbeddingSlotRegistry::new();
    let config = VectorIndexConfig { dimensions: 1536, ..VectorIndexConfig::default() };

    let state1 = EmbeddingSlotState {
        index: Box::new(BruteForceIndex::new(config.clone())),
        dimensions: 1536,
        quantization: QuantizationLevel::F16,
        source: EmbeddingSource::External,
        params: HnswParams::default(),
    };
    let state2 = EmbeddingSlotState {
        index: Box::new(BruteForceIndex::new(config)),
        dimensions: 1536,
        quantization: QuantizationLevel::F16,
        source: EmbeddingSource::External,
        params: HnswParams::default(),
    };

    registry.register(EntityKind::Item, "content".into(), state1).unwrap();
    let result = registry.register(EntityKind::Item, "content".into(), state2);
    assert!(result.is_err());
}

#[test]
fn registry_different_entity_kinds_same_name() {
    let mut registry = EmbeddingSlotRegistry::new();
    let config = VectorIndexConfig { dimensions: 1536, ..VectorIndexConfig::default() };

    let state_item = EmbeddingSlotState {
        index: Box::new(BruteForceIndex::new(config.clone())),
        dimensions: 1536,
        quantization: QuantizationLevel::F16,
        source: EmbeddingSource::External,
        params: HnswParams::default(),
    };
    let state_user = EmbeddingSlotState {
        index: Box::new(BruteForceIndex::new(config)),
        dimensions: 1536,
        quantization: QuantizationLevel::F16,
        source: EmbeddingSource::DatabaseManaged,
        params: HnswParams::default(),
    };

    registry.register(EntityKind::Item, "content".into(), state_item).unwrap();
    registry.register(EntityKind::User, "content".into(), state_user).unwrap();

    let item_slot = registry.get(EntityKind::Item, "content").unwrap();
    let user_slot = registry.get(EntityKind::User, "content").unwrap();
    assert_eq!(item_slot.source, EmbeddingSource::External);
    assert_eq!(user_slot.source, EmbeddingSource::DatabaseManaged);
}

#[test]
fn registry_slots_for_entity_kind() {
    let mut registry = EmbeddingSlotRegistry::new();
    let config = VectorIndexConfig { dimensions: 128, ..VectorIndexConfig::default() };

    for name in &["content", "visual", "audio"] {
        let state = EmbeddingSlotState {
            index: Box::new(BruteForceIndex::new(config.clone())),
            dimensions: 128,
            quantization: QuantizationLevel::F16,
            source: EmbeddingSource::External,
            params: HnswParams::default(),
        };
        registry.register(EntityKind::Item, (*name).to_string(), state).unwrap();
    }

    let slots = registry.slots_for(EntityKind::Item);
    assert_eq!(slots.len(), 3);
    assert!(slots.contains(&"content"));
    assert!(slots.contains(&"visual"));
    assert!(slots.contains(&"audio"));

    // No user slots
    let user_slots = registry.slots_for(EntityKind::User);
    assert!(user_slots.is_empty());
}

#[test]
fn registry_nonexistent_slot_returns_none() {
    let registry = EmbeddingSlotRegistry::new();
    assert!(registry.get(EntityKind::Item, "content").is_none());
}
```

## Acceptance Criteria

- [ ] `l2_normalize()` normalizes vectors to unit length within `1e-5` tolerance
- [ ] `l2_normalize()` fails with `VectorError::ZeroNormVector` on zero-norm input
- [ ] `l2_normalize()` is idempotent (normalizing a unit vector returns the same vector, within `1e-5` per component)
- [ ] `serialize_embedding()` / `deserialize_embedding()` roundtrip produces identical vectors
- [ ] `embedding_store_key()` produces correct key: `[entity_id][NUL][Tag::Meta][EMB:slot_name]`
- [ ] `insert_embedding()` validates dimensions, normalizes, stores in entity store, inserts into HNSW
- [ ] `update_embedding()` updates entity store, tombstones old vector, inserts new
- [ ] `delete_embedding()` with `hard_delete=false` tombstones HNSW only, preserves entity store
- [ ] `delete_embedding()` with `hard_delete=true` removes from both HNSW and entity store
- [ ] Entity store always contains the full-precision normalized f32 vector (source of truth)
- [ ] `EmbeddingSlotRegistry::register()` stores slot state, rejects duplicates
- [ ] `EmbeddingSlotRegistry::get()` returns the correct slot by `(EntityKind, name)`
- [ ] `EmbeddingSlotRegistry::slots_for()` lists all slots for an entity kind
- [ ] Different entity kinds can have same-named slots without collision
- [ ] All property tests pass: normalize produces unit vectors, normalize is idempotent, normalize preserves direction, serde roundtrip, insert+search roundtrip
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All unit and property tests pass

## Research References

- [docs/research/ann_for_tidaldb.md](../../../research/ann_for_tidaldb.md) -- "Normalize vectors at insertion time and use L2 distance (equivalent to cosine for unit vectors, and more SIMD-friendly)", capacity planning with 2x over-provision

## Spec References

- [docs/specs/07-vector-retrieval.md](../../../specs/07-vector-retrieval.md) -- Section 1 (design principle: "Embeddings are L2-normalized at insertion. Cosine similarity is computed as L2 distance over unit vectors"), Section 5 (multiple embedding spaces: EmbeddingSlotRegistry, slot configuration per entity type, up to 4 slots), Section 6 (embedding lifecycle: insert path steps 1-6, update path, delete path, batch operations, normalization edge case), Section 11 (BruteForceIndex as correctness verifier)
- [docs/specs/02-entity-model.md](../../../specs/) -- Embedding slot constraints (up to 4 per entity type), embedding source (External vs DatabaseManaged)

## Implementation Notes

- `l2_normalize` compares the squared norm against `f32::EPSILON` (~1.19e-7), so it rejects any vector whose norm is below roughly 3.45e-4. This catches both exact zero vectors and vectors with components so small that normalization would overflow or produce denormalized results.
- The entity store key uses `Tag::Meta` (not a new tag) because embeddings are entity metadata. The `EMB:` prefix in the suffix distinguishes embedding keys from other metadata keys. This keeps the key encoding scheme from m1p3 intact without adding new Tag variants.
- `EmbeddingSlotRegistry` is NOT `Send + Sync` by default: a bare `Box<dyn VectorIndex>` carries no `Send`/`Sync` guarantee unless `VectorIndex` declares them as supertraits (or the field is typed `Box<dyn VectorIndex + Send + Sync>`), so sharing the registry across threads requires external synchronization. In production, the registry is owned by `TidalDB`, which provides appropriate access control. The registry is constructed once at startup and then used for reads only (except for index persistence operations).
- Do NOT implement batch insert via rayon in this task. Batch insert is an optimization for initial data load that can be added when the RETRIEVE executor (m2p5) needs it. The sequential insert path is correct and sufficient for M2 acceptance criteria.
- Do NOT implement the delta journal for incremental persistence. Full `save()` is fast enough at 10K vectors. Delta journal is deferred to M7 per Open Question 4 in OVERVIEW.md.
- The `save_all()` / `load_all()` methods on the registry coordinate persistence across all embedding slots. The directory structure follows Spec 07, Section 7: `{data_dir}/vector/{entity_kind}_{slot_name}.usearch`.
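
The `Send + Sync` note above can be made concrete with a compile-time check. This is a sketch under an assumption: the local `VectorIndex` trait here is a stand-in that declares `Send + Sync` as supertraits, which may or may not match the real trait from Task 01. `EmptyIndex`, `SlotState`, and `assert_send_sync` are illustrative names.

```rust
/// Stand-in for the `VectorIndex` trait from Task 01, declared here with
/// `Send + Sync` supertraits (an assumption about the real definition).
trait VectorIndex: Send + Sync {
    fn len(&self) -> usize;
}

/// Toy index used only to instantiate the trait object.
struct EmptyIndex;

impl VectorIndex for EmptyIndex {
    fn len(&self) -> usize {
        0
    }
}

/// Holds a boxed trait object the same way `EmbeddingSlotState` does.
struct SlotState {
    index: Box<dyn VectorIndex>,
}

/// Compiles only when `T` is both `Send` and `Sync`.
fn assert_send_sync<T: Send + Sync>() {}
```

Calling `assert_send_sync::<SlotState>()` compiles with the supertraits in place. Removing `: Send + Sync` from the trait turns that call into a compile error, which is the situation the note describes.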