# Financial Due Diligence: M&A Investigation

> **Tier:** Production-Ready
> **Pillars Used:** First-Class Contradiction, Invalidation Cascades, Multi-Signature Consensus, Semantic Decay
> **Postgres Test:** FAILED - Cascade invalidation requires application logic that duplicates DB semantics; skeptic queries become nightmare SQL; visual+text provenance in single model is awkward

## The Catastrophe (Without Episteme)

I watched a $2.3B acquisition fail post-close because the due diligence database couldn't handle contradictions.

Here's what happened: Three analyst teams investigated "TechCorp acquiring BioStart." One team found SEC filings showing $47M revenue. Another found a leaked investor deck claiming $62M. A third discovered the CEO publicly denied any acquisition talks.

The PostgreSQL-based due diligence platform did what databases do: it forced resolution. Someone picked the SEC filing as "canonical." The investor deck got marked as "unverified." The CEO denial was logged as a "note."

Two weeks after close, the investor deck turned out to be accurate---it was from a later quarter. The SEC filing was stale. The CEO denial? A legal strategy that was technically true at the time of the statement. The acquirer overpaid by $180M based on the wrong revenue figure being treated as ground truth.

**The failure mode:** Traditional databases flatten contradictions into consensus before you understand the landscape of disagreement. By the time you query, the controversy has been erased.

---

## The Scenario

An M&A investigation codenamed "Project Chimera" is evaluating whether TechCorp is secretly acquiring BioStart. Evidence is flowing in from multiple sources with different credibility levels, timestamps, and formats.

The investigation needs to:

1. Hold contradictory claims without premature resolution
2. Track which conclusions depend on which evidence
3. Weight expert review above raw source discovery
4. Age out stale data without deleting audit trails
5. Match visual evidence (org chart screenshots) to text claims

---

## Feature 1: First-Class Contradiction

### The Failure Mode

Traditional databases force you to pick one value per field. When analysts disagree about BioStart's revenue, you either:

- Pick a winner (lose the dissent)
- Create multiple rows with a "version" column (query complexity explodes)
- Store JSON blobs (lose queryability)

None of these let you ask: "What's the *variance* in revenue claims?"

### The Postgres Attempt

```sql
-- Attempt 1: Version column approach
CREATE TABLE company_metrics (
    id SERIAL PRIMARY KEY,
    company_id INTEGER,
    metric_name VARCHAR(50),
    value DECIMAL,
    source_url TEXT,
    analyst_id INTEGER,
    confidence DECIMAL,
    timestamp TIMESTAMPTZ,
    is_canonical BOOLEAN DEFAULT FALSE
);

-- Query: "What do analysts think BioStart's revenue is?"
SELECT value, confidence, source_url
FROM company_metrics
WHERE company_id = 42 AND metric_name = 'revenue'
ORDER BY timestamp DESC;

-- This returns all values, but doesn't tell you:
-- 1. How much disagreement exists (need application logic)
-- 2. Which value to trust (need complex JOIN to analyst reputation)
-- 3. Whether any claim has been retracted (need soft-delete flags everywhere)
```

**Where it breaks:**

- "Skeptic query" (return variance, not consensus) requires `STDDEV()` aggregation that loses source attribution
- Determining "canonical" requires application logic that duplicates what should be DB semantics
- Retractions require `is_retracted` flags on every table, with triggers to cascade

### The Episteme Solution

```
POST /assert
{
  "subject": "BioStart",
  "predicate": "has_revenue",
  "object": { "Number": 47000000 },
  "source_hash": "abc123...",  // SEC filing hash
  "confidence": 0.9,
  "signatures": [{ "agent_id": "analyst_team_alpha", ... }]
}

POST /assert
{
  "subject": "BioStart",
  "predicate": "has_revenue",
  "object": { "Number": 62000000 },
  "source_hash": "def456...",  // Investor deck hash
  "confidence": 0.85,
  "signatures": [{ "agent_id": "analyst_team_beta", ... }]
}
```

Both assertions coexist. Query through different lenses:

```
GET /query?subject=BioStart&predicate=has_revenue&lens=consensus
-> Returns $47M (higher confidence; both claims carry one signature)

GET /query?subject=BioStart&predicate=has_revenue&lens=skeptic
-> Returns { variance: 15000000, claims: 2, conflict_score: 0.72 }

GET /query?subject=BioStart&predicate=has_revenue&lens=recency
-> Returns $62M (investor deck is newer)
```
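
The three lens queries above can be pictured as three different reductions over the same set of stored claims. The sketch below is a minimal in-memory illustration, not Episteme's implementation: the `Claim` class, its field names, the timestamps, and the tie-breaking rule in `consensus` are all assumptions made for the example. It reports a `spread` (largest minus smallest value), which matches the 15,000,000 figure shown in the `variance` field above; `conflict_score` is left out because its formula is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    value: int          # asserted number, e.g. revenue in dollars
    confidence: float   # confidence stated at assert time
    signatures: int     # number of signatures on the assertion
    timestamp: int      # hypothetical Unix time of the assertion


def consensus(claims):
    """Pick the claim with the most signatures, breaking ties by confidence."""
    return max(claims, key=lambda c: (c.signatures, c.confidence))


def recency(claims):
    """Pick the most recently asserted claim."""
    return max(claims, key=lambda c: c.timestamp)


def skeptic(claims):
    """Report the disagreement itself instead of choosing a winner."""
    values = [c.value for c in claims]
    return {"spread": max(values) - min(values), "claims": len(claims)}


claims = [
    Claim(47_000_000, 0.90, 1, 1_700_000_000),  # SEC filing
    Claim(62_000_000, 0.85, 1, 1_706_000_000),  # investor deck (newer)
]

print(consensus(claims).value)  # 47000000
print(recency(claims).value)    # 62000000
print(skeptic(claims))          # {'spread': 15000000, 'claims': 2}
```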

**Pillar:** First-Class Contradiction. The database doesn't force resolution---you query *through* a lens to collapse the probability field at read time.

---

## Feature 2: Invalidation Cascades

### The Failure Mode

The SEC filing that claimed $47M revenue gets retracted---it was from the wrong fiscal year. Every downstream conclusion that depended on this number is now suspect:

- The valuation model used $47M as input
- The board memo cited the valuation
- The bid price derived from the board memo

In Postgres, you discover the root is wrong... now what? You have no idea what else is tainted.

### The Postgres Attempt

```sql
-- Track dependencies manually
CREATE TABLE evidence_dependencies (
    child_assertion_id INTEGER,
    parent_assertion_id INTEGER,
    dependency_type VARCHAR(50)  -- 'derived_from', 'cites', etc.
);

-- Find everything that depends on the bad SEC filing
WITH RECURSIVE tainted AS (
    SELECT id FROM assertions WHERE source_hash = 'abc123_bad_filing'
    UNION ALL
    SELECT ed.child_assertion_id
    FROM evidence_dependencies ed
    JOIN tainted t ON ed.parent_assertion_id = t.id
)
SELECT * FROM tainted;

-- Mark all as retracted. A CTE is scoped to a single statement,
-- so the recursive walk has to be repeated for the UPDATE.
WITH RECURSIVE tainted AS (
    SELECT id FROM assertions WHERE source_hash = 'abc123_bad_filing'
    UNION ALL
    SELECT ed.child_assertion_id
    FROM evidence_dependencies ed
    JOIN tainted t ON ed.parent_assertion_id = t.id
)
UPDATE assertions SET is_retracted = TRUE WHERE id IN (SELECT id FROM tainted);
```

**Where it breaks:**

- Recursive CTEs are slow and error-prone at scale
- `evidence_dependencies` table must be manually maintained (what if someone forgets?)
- `is_retracted` flag doesn't tell you *why* it was retracted or *when*
- Cascade logic lives in application code, not the DB---multiple apps = inconsistent cascades

### The Episteme Solution

Every assertion includes `parent_hash` forming a Merkle DAG:

```
POST /assert
{
  "subject": "BioStart_Valuation",
  "predicate": "enterprise_value",
  "object": { "Number": 890000000 },
  "parent_hash": "abc123...",  // Links to the $47M revenue claim
  "source_hash": "valuation_model_v2...",
  "confidence": 0.88
}
```

When the revenue claim is invalidated:

```
POST /invalidate
{
  "assertion_hash": "abc123...",
  "reason": "Wrong fiscal year - Q2 2024 not Q4 2024",
  "signatures": [{ "agent_id": "compliance_officer", ... }]
}
```

The DAG structure instantly identifies all descendants:

```
GET /cascade?root=abc123...
-> Returns:
{
  "invalidated_root": "abc123...",
  "affected_descendants": [
    { "hash": "valuation_model_v2...", "depth": 1 },
    { "hash": "board_memo_draft...", "depth": 2 },
    { "hash": "bid_recommendation...", "depth": 3 }
  ],
  "total_affected": 47
}
```
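
To make the cascade response above concrete, here is a small sketch of how descendants and their depths could be found from `parent_hash` links alone. It is an illustration under assumptions, not the server's code: the `cascade` function and the `parent_of` dictionary are hypothetical stand-ins for the DAG store, and the shortened hashes are placeholders. The traversal is a plain breadth-first search, so each descendant is reported at its distance from the invalidated root.

```python
from collections import defaultdict, deque


def cascade(root, parent_of):
    """Return [(hash, depth)] for every assertion descended from `root`.

    `parent_of` maps an assertion hash to its parent_hash (a hypothetical
    in-memory stand-in for the DAG store).
    """
    # Invert the child -> parent links so we can walk downwards.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    affected, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in children[node]:
            if child not in seen:  # never report the same assertion twice
                seen.add(child)
                affected.append((child, depth + 1))
                queue.append((child, depth + 1))
    return affected


parent_of = {
    "valuation_model_v2": "abc123",
    "board_memo_draft": "valuation_model_v2",
    "bid_recommendation": "board_memo_draft",
    "unrelated_claim": "def456",  # hangs off a different root
}

print(cascade("abc123", parent_of))
# [('valuation_model_v2', 1), ('board_memo_draft', 2), ('bid_recommendation', 3)]
```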

**Pillar:** Invalidation Cascades. The Merkle DAG makes lineage structural, not application logic. You don't *track* dependencies; they're inherent in the hash chain.

---

## Feature 3: Multi-Signature Consensus

### The Failure Mode

A junior analyst discovers a LinkedIn post suggesting BioStart is hiring M&A lawyers. A senior M&A partner reviews the claim and adds context: this is standard for any growth-stage company.

In Postgres, both opinions are just rows. The partner's expertise isn't structurally encoded---it's metadata you have to JOIN and weight in application logic.

### The Postgres Attempt

```sql
-- Created first so the foreign keys below can reference it
CREATE TABLE analysts (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    reputation_score DECIMAL,
    role VARCHAR(50)  -- 'junior', 'senior', 'partner'
);

CREATE TABLE assertions (
    id SERIAL PRIMARY KEY,
    subject VARCHAR(100),
    predicate VARCHAR(100),
    value TEXT,
    source_url TEXT,
    created_by INTEGER REFERENCES analysts(id),
    confidence DECIMAL
);

CREATE TABLE assertion_reviews (
    assertion_id INTEGER REFERENCES assertions(id),
    reviewer_id INTEGER REFERENCES analysts(id),
    review_type VARCHAR(20),  -- 'endorse', 'dispute', 'context'
    comment TEXT,
    timestamp TIMESTAMPTZ
);

-- Query: Get assertion with weighted confidence
SELECT
    a.*,
    (a.confidence * creator.reputation_score +
     COALESCE(SUM(reviewer.reputation_score *
         CASE ar.review_type WHEN 'endorse' THEN 0.2 ELSE -0.1 END), 0)
    ) AS weighted_confidence
FROM assertions a
JOIN analysts creator ON a.created_by = creator.id
LEFT JOIN assertion_reviews ar ON a.id = ar.assertion_id
LEFT JOIN analysts reviewer ON ar.reviewer_id = reviewer.id
WHERE a.subject = 'BioStart' AND a.predicate = 'acquisition_signal'
GROUP BY a.id, creator.reputation_score;
```

**Where it breaks:**

- Weight calculation lives in SQL, but also in Python scripts, and in the reporting layer... they drift
- No cryptographic proof that the partner actually reviewed this---just a foreign key anyone could insert
- Reputation scores are mutable; historical queries return different results than original

### The Episteme Solution

Signatures are cryptographic and additive:

```
POST /assert
{
  "subject": "BioStart",
  "predicate": "acquisition_signal",
  "object": { "Text": "Hiring M&A lawyers per LinkedIn" },
  "source_hash": "linkedin_screenshot_hash...",
  "confidence": 0.6,
  "signatures": [{
    "agent_id": "junior_analyst_pub_key",
    "signature": "ed25519_sig_1...",
    "timestamp": 1706745600
  }]
}

// Senior partner co-signs with context
POST /cosign
{
  "assertion_hash": "original_claim_hash...",
  "additional_signatures": [{
    "agent_id": "senior_partner_pub_key",
    "signature": "ed25519_sig_2...",
    "timestamp": 1706832000
  }],
  "context": "Standard for growth-stage companies; low signal"
}
```

Query resolution automatically weights by signer reputation:

```
GET /query?subject=BioStart&predicate=acquisition_signal&lens=authority
-> Returns claim with effective_confidence adjusted by signer weights
-> Shows cryptographic proof of who reviewed
```
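
The idea that a co-signature is bound to the assertion's content, rather than sitting in a mutable join row, can be shown with a toy example. The sketch below is heavily simplified and rests on assumptions: it uses HMAC-SHA256 from the Python standard library as a stand-in for the Ed25519 signatures named above (HMAC is a shared-secret scheme, not a public-key signature), and the canonical-JSON hashing, the key values, and the function names are all hypothetical. It shows only the binding property: signatures are appended, and they stop verifying if the signed content changes.

```python
import hashlib
import hmac
import json


def assertion_hash(assertion):
    """Content hash over a canonical JSON encoding (sorted keys, no spaces)."""
    canonical = json.dumps(assertion, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sign(key, digest):
    """Stand-in signature: HMAC-SHA256, NOT Ed25519."""
    return hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(key, digest, signature):
    """Constant-time check that `signature` was made over `digest` with `key`."""
    return hmac.compare_digest(sign(key, digest), signature)


claim = {
    "subject": "BioStart",
    "predicate": "acquisition_signal",
    "object": {"Text": "Hiring M&A lawyers per LinkedIn"},
    "confidence": 0.6,
}
digest = assertion_hash(claim)

# Hypothetical demo keys; a real deployment would use per-agent key pairs.
junior_key, partner_key = b"junior-demo-key", b"partner-demo-key"

signatures = [("junior_analyst", sign(junior_key, digest))]
signatures.append(("senior_partner", sign(partner_key, digest)))  # co-sign: append only

# Both signatures verify against the original content.
print(verify(junior_key, digest, signatures[0][1]))   # True
print(verify(partner_key, digest, signatures[1][1]))  # True

# Editing the content changes the hash, so the old signature no longer verifies.
tampered = dict(claim, confidence=0.99)
print(verify(partner_key, assertion_hash(tampered), signatures[1][1]))  # False
```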

**Pillar:** Multi-Signature Consensus. Signatures are structural, cryptographic, and immutable. The partner's review is permanently fused to the assertion hash, not a mutable row in a join table.

---

## Feature 4: Semantic Decay

### The Failure Mode

The investigation runs for 6 months. Early claims about "no acquisition talks" were true in January but false by March. A naive query returns January data with equal weight to March data.

In Postgres, you filter by timestamp, but this is:

- Manual (every query needs WHERE timestamp > ...)
- Binary (data is either in or out, no gradual fading)
- Inconsistent across different query patterns

### The Postgres Attempt

```sql
-- Add decay calculation to every query
-- 2592000 = seconds in 30 days, so confidence is multiplied by 0.9 per 30-day period
SELECT
    *,
    confidence * POWER(0.9, EXTRACT(EPOCH FROM (NOW() - timestamp)) / 2592000)
        AS decayed_confidence
FROM assertions
WHERE subject = 'BioStart'
  AND predicate = 'acquisition_status'
ORDER BY decayed_confidence DESC;
```
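
One reason hand-written decay formulas drift between teams is that a "rate per period" and a "half-life" are easy to confuse. The short sketch below converts between the two; the function names are illustrative only. Applied to the 0.9 factor in the SQL above, it shows that multiplying by 0.9 every 30 days halves confidence only after roughly 6.6 periods, which is a much slower decay than a 30-day half-life.

```python
import math


def half_life_in_periods(rate_per_period):
    """Periods needed for confidence to halve under rate ** periods decay."""
    return math.log(0.5) / math.log(rate_per_period)


def rate_for_half_life(half_life_periods):
    """Per-period multiplier that halves confidence after the given periods."""
    return 0.5 ** (1.0 / half_life_periods)


periods = half_life_in_periods(0.9)  # the 0.9 factor from the SQL above
print(round(periods, 2))             # 6.58
print(round(periods * 30))           # 197 (days, with 30-day periods)
print(rate_for_half_life(1))         # 0.5 (a true 30-day half-life)
```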

**Where it breaks:**

- Decay formula must be duplicated in every query, every service, every report
- Different teams use different decay rates
- No way to query "show me what we believed on March 15" without reconstructing state

### The Episteme Solution

Decay is a lens parameter, applied at read time:

```
GET /query?subject=BioStart&predicate=acquisition_status&lens=recency
    &decay_halflife=30d

-> Returns claims with confidence automatically decayed
-> January denial: original 0.95 -> effective 0.03 (5 half-lives)
-> March confirmation: original 0.88 -> effective 0.44 (1 half-life)
```

The original claims remain in the DAG with full fidelity for audit:

```
GET /query?subject=BioStart&predicate=acquisition_status
    &lens=recency&as_of=2024-01-15

-> Returns state of knowledge as of January 15
-> January denial shows full 0.95 confidence
```
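
Read-time decay with an `as_of` parameter can be sketched as a pure function of the stored confidence and two timestamps. This is a minimal illustration assuming plain exponential half-life decay; the function name, its parameters, and the example confidence of 0.8 are assumptions, not Episteme's API. The stored value is never modified, so evaluating the same claim at its own assertion time returns the original confidence, which is the time-travel behaviour described above.

```python
from datetime import datetime, timezone


def effective_confidence(confidence, asserted_at, as_of, half_life_days=30.0):
    """Halve `confidence` for every `half_life_days` elapsed before `as_of`."""
    age_days = (as_of - asserted_at).total_seconds() / 86_400
    if age_days < 0:
        raise ValueError("assertion is newer than the as_of time")
    return confidence * 0.5 ** (age_days / half_life_days)


asserted = datetime(2024, 1, 15, tzinfo=timezone.utc)
one_half_life = datetime(2024, 2, 14, tzinfo=timezone.utc)   # 30 days later
two_half_lives = datetime(2024, 3, 15, tzinfo=timezone.utc)  # 60 days later

# Read at the moment of assertion: no decay is applied.
print(effective_confidence(0.8, asserted, asserted))        # 0.8
# Read one and two half-lives later.
print(effective_confidence(0.8, asserted, one_half_life))   # 0.4
print(effective_confidence(0.8, asserted, two_half_lives))  # 0.2
```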

**Pillar:** Semantic Decay. Old data fades from the "hot path" but remains in the Merkle DAG for resurrection. You get both fresh answers AND complete audit trails.

---

## Feature 5: Visual Provenance (Bonus)

### The Failure Mode

An analyst screenshots an org chart showing BioStart's CEO now reporting to TechCorp's board. This is powerful evidence, but:

- The screenshot could be faked
- The same image might appear in multiple contexts
- Text extraction from images loses visual context

### The Postgres Attempt

```sql
CREATE TABLE visual_evidence (
    id SERIAL PRIMARY KEY,
    image_blob BYTEA,
    extracted_text TEXT,
    source_url TEXT,
    timestamp TIMESTAMPTZ
);

-- Find similar images... somehow?
-- Postgres doesn't have native perceptual hashing
-- You'd need pgvector + external embedding service
```

**Where it breaks:**

- No native perceptual hash; requires external service for similarity
- Can't query "find all claims that use this image or similar images"
- Text extraction loses the visual proof

### The Episteme Solution

```
POST /assert
{
  "subject": "BioStart_CEO",
  "predicate": "reports_to",
  "object": { "Reference": "TechCorp_Board" },
  "source_hash": "screenshot_content_hash...",
  "visual_hash": "0xA3F2...",  // pHash of org chart
  "confidence": 0.82
}
```

Query by visual similarity:

```
GET /query?visual_near=0xA3F2...&threshold=0.9

-> Returns all assertions anchored to visually similar images
-> Catches duplicate evidence, fake variations, source reuse
```
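
A similarity query over perceptual hashes usually comes down to counting differing bits. The sketch below assumes 64-bit hashes and assumes that `threshold` means "fraction of matching bits"; neither detail is confirmed above, and the hash values, assertion IDs, and function names are made up for illustration. Producing a real pHash from an image requires an image library and is out of scope here.

```python
def similarity(hash_a, hash_b, bits=64):
    """Fraction of matching bits between two perceptual hashes."""
    differing = bin(hash_a ^ hash_b).count("1")  # Hamming distance
    return 1.0 - differing / bits


def visual_near(query, index, threshold=0.9):
    """Return (assertion_id, similarity) pairs at or above `threshold`."""
    hits = [(aid, similarity(query, h)) for aid, h in index.items()]
    return sorted(
        (hit for hit in hits if hit[1] >= threshold),
        key=lambda hit: hit[1],
        reverse=True,
    )


# Hypothetical 64-bit hashes standing in for real pHash values.
org_chart = 0xA3F2_5C91_0B7E_44D8
index = {
    "claim_original": org_chart,
    "claim_recompressed": org_chart ^ 0b101,     # 2 bits differ
    "claim_unrelated": org_chart ^ 0xFFFF_FFFF,  # 32 bits differ
}

print(visual_near(org_chart, index))
# [('claim_original', 1.0), ('claim_recompressed', 0.96875)]
```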

**Pillar:** This extends First-Class Contradiction into the visual domain. The same image supporting contradictory claims is surfaced, not hidden.

---

## The 5-Minute Demo

Run locally (the commands below build and start the server from source with Cargo):

```bash
# Clone and start
git clone https://github.com/orchard9/stemedb
cd stemedb
cargo run --bin stemedb-server

# In another terminal:
# Insert contradictory revenue claims
curl -X POST http://localhost:18180/assert \
  -H "Content-Type: application/json" -d '{
  "subject": "DemoCompany",
  "predicate": "revenue",
  "object": {"Number": 10000000},
  "source_hash": "source_a",
  "confidence": 0.9
}'

curl -X POST http://localhost:18180/assert \
  -H "Content-Type: application/json" -d '{
  "subject": "DemoCompany",
  "predicate": "revenue",
  "object": {"Number": 15000000},
  "source_hash": "source_b",
  "confidence": 0.85
}'

# Query through different lenses
curl "http://localhost:18180/query?subject=DemoCompany&predicate=revenue&lens=consensus"
# -> Returns $10M (higher confidence)

curl "http://localhost:18180/query?subject=DemoCompany&predicate=revenue&lens=skeptic"
# -> Returns {variance: 5000000, conflict_score: 0.67}

curl "http://localhost:18180/query?subject=DemoCompany&predicate=revenue&lens=recency"
# -> Returns $15M (inserted second)

# Invalidate the first claim
# (hash_of_first_claim is a placeholder for the first assertion's actual hash)
curl -X POST http://localhost:18180/invalidate \
  -H "Content-Type: application/json" -d '{
  "assertion_hash": "hash_of_first_claim",
  "reason": "Source A was misattributed"
}'

# See cascade effects
curl "http://localhost:18180/cascade?root=hash_of_first_claim"
# -> Shows any dependent assertions
```

**Time to value:** Under 5 minutes from clone to seeing contradiction handling work.

---

## What Postgres CAN Do

To be honest: Postgres handles much of this adequately for small-scale investigations.

**Postgres is sufficient for:**

- Storing claims with timestamps and sources (basic append-only pattern)
- Simple recency queries (`ORDER BY timestamp DESC LIMIT 1`)
- Analyst attribution (foreign key to users table)
- Basic confidence scores (`DECIMAL` column)

**Postgres requires significant application code for:**

- Contradiction surfacing (possible but manual)
- Dependency tracking (foreign keys cover a single level; deeper chains need recursive CTEs, which scale poorly)
- Review workflows (join tables work, but no cryptographic proof)

**Postgres cannot cleanly handle:**

- Native skeptic queries (return variance, not consensus)
- Deep cascade invalidation without duplicating graph logic in app layer
- Cryptographic multi-signature with reputation weighting
- Visual + text + semantic in unified query model
- Time-travel queries with consistent decay application
- O(1) branch creation for "what-if" scenarios

---

## Regulatory Considerations

For production M&A due diligence:

- **SEC Record Retention:** The Merkle DAG provides immutable audit trails that can support record-retention obligations such as SEC Rule 17a-4
- **Attorney-Client Privilege:** Branch isolation can segregate privileged analysis
- **Cross-Border Transactions:** Visual provenance helps with multi-jurisdiction evidence standards

Episteme doesn't replace legal review---it ensures the data substrate supports the compliance requirements that Postgres struggles to enforce structurally.

---

## Summary: Why Episteme for M&A?

| Problem | Postgres Approach | Episteme Approach | Pillar |
|---------|------------------|-------------------|--------|
| Conflicting revenue figures | Pick one or version table | Both coexist; lens resolves | First-Class Contradiction |
| Retracted SEC filing | Manual cascade with recursive CTE | Automatic via Merkle DAG | Invalidation Cascades |
| Partner review adds weight | Join table + reputation logic | Cryptographic co-signature | Multi-Signature Consensus |
| January data still showing | Manual WHERE clauses | Decay function in lens | Semantic Decay |
| Screenshot evidence | External service + pgvector | Native pHash in assertion | Visual Provenance |

The $180M overpayment I witnessed happened because the database couldn't hold contradictions long enough for humans to understand the disagreement. Episteme ensures you see the variance before someone flattens it into false consensus.