# Corpus Feature Status ## Summary ✅ **Corpus API works** - `/v1/aphoria/patterns` endpoint is functional ✅ **Community sharing configured** - Maxwell has `community.enabled = true` ✅ **Hosted mode configured** - Points to `http://localhost:18180` ✅ **Observations pushed** - 66 observations successfully sent to API ❌ **Corpus still empty** - Observations went to wrong endpoint ## Root Cause The `HostedClient` posts observations to: ``` POST /v1/aphoria/observations ``` But the corpus aggregator reads from: ``` POST /v1/aphoria/community/observations GET /v1/aphoria/patterns ``` **Two different endpoints:** 1. `/v1/aphoria/observations` - Stores observations as assertions (for hosted team mode) 2. `/v1/aphoria/community/observations` - Aggregates into pattern corpus (for community patterns) ## What Happened When we ran: ```bash aphoria scan --persist --sync --config /home/jml/Workspace/maxwell/.aphoria/config.toml /home/jml/Workspace/maxwell ``` **Logs showed:** ``` [INFO] Using persistent mode (with Episteme storage) sync=true hosted=true [INFO] Pushed observations to hosted server accepted=66 deduplicated=0 ``` **Observations were successfully pushed to `/v1/aphoria/observations`** **But corpus reads from `/v1/aphoria/community/observations`** ❌ ## Architecture Gap There are **two separate features** that got conflated: ### 1. Hosted Mode (Team Server) - **Purpose**: Teams run private StemeDB server - **Endpoint**: `/v1/aphoria/observations` - **What it does**: Stores observations for team projects - **Use case**: "Acme Corp runs StemeDB for all their projects" ### 2. Community Corpus (Public Patterns) - **Purpose**: Anonymized pattern aggregation across projects - **Endpoint**: `/v1/aphoria/community/observations` - **What it does**: Aggregates by (subject, predicate, value), counts projects - **Use case**: "Show me what 100+ projects do for TLS settings" ## Current State **Observations Storage:** ```bash curl http://localhost:18180/v1/aphoria/observations # Returns: 66 observations stored as assertions ``` **Corpus (Pattern Aggregates):** ```bash curl http://localhost:18180/v1/aphoria/patterns # Returns: {"patterns": [], "total_matching": 0} ``` ## Solution Options ### Option 1: Fix HostedClient (Proper Fix) Update `HostedClient` to post to community endpoint when `community.enabled = true`: ```rust // In hosted.rs push_observations() let endpoint = if config.community.is_enabled() { "/v1/aphoria/community/observations" } else { "/v1/aphoria/observations" }; let url = format!("{}{}", self.base_url, endpoint); ``` ### Option 2: Add Aggregation Job (Alternative) Create a background job that reads from `/v1/aphoria/observations` and aggregates into patterns: ```rust // Periodically aggregate stored observations into patterns fn aggregate_observations_to_patterns() { // Read observations from assertions store // Group by (subject, predicate, value) // Update pattern_aggregate_store } ``` ### Option 3: Manual Aggregation (Workaround) Directly call the community observations endpoint with anonymized data: ```bash curl -X POST http://localhost:18180/v1/aphoria/community/observations \ -H "Content-Type: application/json" \ -d '{ "observations": [...], "project_hash": "...", "client_version": "0.1.0" }' ``` ## Recommendation **Option 1** is the cleanest. The flow should be: ``` [User enables community.enabled = true in config] ↓ [Runs: aphoria scan --persist --sync] ↓ [HostedClient checks config.community.enabled] ↓ [If true: POST to /v1/aphoria/community/observations] [If false: POST to /v1/aphoria/observations] ↓ [Community endpoint aggregates patterns] ↓ [Dashboard queries GET /v1/aphoria/patterns] ↓ [Shows community consensus] ``` ## Documentation We Created - ✅ **[DOCUMENTATION_INDEX.md](./DOCUMENTATION_INDEX.md)** - Complete guide index - ✅ **Community sharing enabled** in Maxwell config - ✅ **Hosted mode configured** to localhost - ✅ **Scan successfully pushed** 66 observations ## Testing After Fix Once Option 1 is implemented: ```bash # 1. Run scan with community enabled cargo run --bin aphoria -- scan \ --config /home/jml/Workspace/maxwell/.aphoria/config.toml \ --persist --sync \ /home/jml/Workspace/maxwell # 2. Verify corpus populated curl http://localhost:18180/v1/aphoria/patterns | jq '.patterns | length' # Should show: 66 (or aggregated count) # 3. View in dashboard open http://aphoria.local/corpus # Should show Maxwell patterns with project_count=1 ``` ## Files Changed 1. **Maxwell config**: `/home/jml/Workspace/maxwell/.aphoria/config.toml` - Added `[hosted]` section - Enabled `community.enabled = true` 2. **Documentation**: Created comprehensive docs - `DOCUMENTATION_INDEX.md` - `CORPUS_IMPROVEMENTS.md` - `PROJECT_PATH_IMPLEMENTATION.md` ## Next Steps 1. **Fix HostedClient** to use community endpoint when community sharing enabled 2. **Run scan again** - observations will go to correct endpoint 3. **Verify corpus** populated in dashboard 4. **Scan StemeDB itself** to add more patterns to corpus --- **Status**: Observations successfully pushed ✅ **Issue**: Wrong endpoint (architecture gap) ❌ **Fix**: Update HostedClient routing logic **ETA**: ~30 minutes to implement Option 1 **Last Updated**: 2026-02-08