Financial Due Diligence: M&A Investigation
Tier: Production-Ready
Pillars Used: First-Class Contradiction, Invalidation Cascades, Multi-Signature Consensus, Semantic Decay
Postgres Test: FAILED - Cascade invalidation requires application logic that duplicates DB semantics; skeptic queries become nightmare SQL; visual+text provenance in single model is awkward
The Catastrophe (Without Episteme)
I watched a $2.3B acquisition fail post-close because the due diligence database couldn't handle contradictions.
Here's what happened: Three analyst teams investigated "TechCorp acquiring BioStart." One team found SEC filings showing $47M revenue. Another found a leaked investor deck claiming $62M. A third discovered the CEO publicly denied any acquisition talks.
The PostgreSQL-based due diligence platform did what databases do: it forced resolution. Someone picked the SEC filing as "canonical." The investor deck got marked as "unverified." The CEO denial was logged as a "note."
Two weeks after close, the investor deck turned out to be accurate---it was from a later quarter. The SEC filing was stale. The CEO denial? A legal strategy that was technically true at the time of the statement. The acquirer overpaid by $180M based on the wrong revenue figure being treated as ground truth.
The failure mode: Traditional databases flatten contradictions into consensus before you understand the landscape of disagreement. By the time you query, the controversy has been erased.
The Scenario
An M&A investigation codenamed "Project Chimera" is evaluating whether TechCorp is secretly acquiring BioStart. Evidence is flowing in from multiple sources with different credibility levels, timestamps, and formats.
The investigation needs to:
- Hold contradictory claims without premature resolution
- Track which conclusions depend on which evidence
- Weight expert review above raw source discovery
- Age out stale data without deleting audit trails
- Match visual evidence (org chart screenshots) to text claims
Feature 1: First-Class Contradiction
The Failure Mode
Traditional databases force you to pick one value per field. When analysts disagree about BioStart's revenue, you either:
- Pick a winner (lose the dissent)
- Create multiple rows with a "version" column (query complexity explodes)
- Store JSON blobs (lose queryability)
None of these let you ask: "What's the variance in revenue claims?"
The Postgres Attempt
-- Attempt 1: Version column approach
CREATE TABLE company_metrics (
id SERIAL PRIMARY KEY,
company_id INTEGER,
metric_name VARCHAR(50),
value DECIMAL,
source_url TEXT,
analyst_id INTEGER,
confidence DECIMAL,
timestamp TIMESTAMPTZ,
is_canonical BOOLEAN DEFAULT FALSE
);
-- Query: "What do analysts think BioStart's revenue is?"
SELECT value, confidence, source_url
FROM company_metrics
WHERE company_id = 42 AND metric_name = 'revenue'
ORDER BY timestamp DESC;
-- This returns all values, but doesn't tell you:
-- 1. How much disagreement exists (need application logic)
-- 2. Which value to trust (need complex JOIN to analyst reputation)
-- 3. Whether any claim has been retracted (need soft-delete flags everywhere)
Where it breaks:
- "Skeptic query" (return variance, not consensus) requires STDDEV() aggregation that loses source attribution
- Determining "canonical" requires application logic that duplicates what should be DB semantics
- Retractions require is_retracted flags on every table, with triggers to cascade
The Episteme Solution
POST /assert
{
"subject": "BioStart",
"predicate": "has_revenue",
"object": { "Number": 47000000 },
"source_hash": "abc123...", // SEC filing hash
"confidence": 0.9,
"signatures": [{ "agent_id": "analyst_team_alpha", ... }]
}
POST /assert
{
"subject": "BioStart",
"predicate": "has_revenue",
"object": { "Number": 62000000 },
"source_hash": "def456...", // Investor deck hash
"confidence": 0.85,
"signatures": [{ "agent_id": "analyst_team_beta", ... }]
}
Both assertions coexist. Query through different lenses:
GET /query?subject=BioStart&predicate=has_revenue&lens=consensus
-> Returns $47M (more signatures)
GET /query?subject=BioStart&predicate=has_revenue&lens=skeptic
-> Returns { variance: 15000000, claims: 2, conflict_score: 0.72 }
GET /query?subject=BioStart&predicate=has_revenue&lens=recency
-> Returns $62M (investor deck is newer)
Pillar: First-Class Contradiction. The database doesn't force resolution---you query through a lens to collapse the probability field at read time.
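To make "collapse at read time" concrete, here is a minimal Python sketch of three lenses applied to the same pair of stored claims. It is an illustration under assumptions, not Episteme's implementation: the `Claim` fields, the tie-breaking rule, and the signature counts are hypothetical, and the skeptic lens reports only the max-minus-min spread (which matches the 15000000 figure above) because the formula behind `conflict_score` is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    value: float       # claimed revenue in dollars
    confidence: float  # asserter's stated confidence, 0..1
    signatures: int    # number of attached signatures (hypothetical field)
    timestamp: int     # Unix seconds when the claim was asserted


def consensus(claims):
    """Most-signed claim wins; ties fall back to confidence (assumed rule)."""
    return max(claims, key=lambda c: (c.signatures, c.confidence)).value


def recency(claims):
    """Newest claim wins."""
    return max(claims, key=lambda c: c.timestamp).value


def skeptic(claims):
    """Report the disagreement instead of picking a winner."""
    values = [c.value for c in claims]
    return {"spread": max(values) - min(values), "claims": len(values)}


# Both claims are stored; nothing is overwritten or marked canonical.
claims = [
    Claim(47_000_000, 0.90, signatures=2, timestamp=1_704_067_200),  # SEC filing (earlier)
    Claim(62_000_000, 0.85, signatures=1, timestamp=1_709_251_200),  # investor deck (later)
]

print(consensus(claims))  # the SEC filing figure
print(recency(claims))    # the investor deck figure
print(skeptic(claims))    # the spread between them
```

The point of the design is that the stored data never changes between the three calls; only the read-time function does.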
Feature 2: Invalidation Cascades
The Failure Mode
The SEC filing that claimed $47M revenue gets retracted---it was from the wrong fiscal year. Every downstream conclusion that depended on this number is now suspect:
- The valuation model used $47M as input
- The board memo cited the valuation
- The bid price derived from the board memo
In Postgres, you discover the root is wrong... now what? You have no idea what else is tainted.
The Postgres Attempt
-- Track dependencies manually
CREATE TABLE evidence_dependencies (
child_assertion_id INTEGER,
parent_assertion_id INTEGER,
dependency_type VARCHAR(50) -- 'derived_from', 'cites', etc.
);
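-- Find everything that depends on the bad SEC filing and mark it retracted.
-- A CTE is visible only to the statement it is attached to, so the UPDATE
-- must be part of the same statement as the WITH clause.
WITH RECURSIVE tainted AS (
    SELECT id FROM assertions WHERE source_hash = 'abc123_bad_filing'
    UNION  -- UNION (not UNION ALL) drops rows reached through several paths
    SELECT ed.child_assertion_id
    FROM evidence_dependencies ed
    JOIN tainted t ON ed.parent_assertion_id = t.id
)
UPDATE assertions
SET is_retracted = TRUE
WHERE id IN (SELECT id FROM tainted)
RETURNING id;  -- lists every assertion that was marked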
Where it breaks:
- Recursive CTEs are slow and error-prone at scale
- evidence_dependencies table must be manually maintained (what if someone forgets?)
- is_retracted flag doesn't tell you why it was retracted or when
- Cascade logic lives in application code, not the DB---multiple apps = inconsistent cascades
The Episteme Solution
Every assertion includes parent_hash forming a Merkle DAG:
POST /assert
{
"subject": "BioStart_Valuation",
"predicate": "enterprise_value",
"object": { "Number": 890000000 },
"parent_hash": "abc123...", // Links to the $47M revenue claim
"source_hash": "valuation_model_v2...",
"confidence": 0.88
}
When the revenue claim is invalidated:
POST /invalidate
{
"assertion_hash": "abc123...",
"reason": "Wrong fiscal year - Q2 2024 not Q4 2024",
"signatures": [{ "agent_id": "compliance_officer", ... }]
}
The DAG structure instantly identifies all descendants:
GET /cascade?root=abc123...
-> Returns:
{
"invalidated_root": "abc123...",
"affected_descendants": [
{ "hash": "valuation_model_v2...", "depth": 1 },
{ "hash": "board_memo_draft...", "depth": 2 },
{ "hash": "bid_recommendation...", "depth": 3 }
],
"total_affected": 47
}
Pillar: Invalidation Cascades. The Merkle DAG makes lineage structural, not application logic. You don't track dependencies; they're inherent in the hash chain.
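A small Python sketch shows why lineage falls out of the structure. It assumes each assertion is identified by a SHA-256 hash of its canonical JSON body and carries at most one `parent_hash`, as in the example above. The `Store` class, its child index, and the sample subjects other than `BioStart_Valuation` are hypothetical, and the real hashing scheme may differ.

```python
import hashlib
import json
from collections import deque


def assertion_hash(body: dict) -> str:
    """Content hash over a canonical JSON encoding (assumed scheme)."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Store:
    def __init__(self):
        self.assertions = {}  # hash -> body
        self.children = {}    # parent hash -> child hashes (derived index)

    def add(self, body: dict) -> str:
        h = assertion_hash(body)
        if h not in self.assertions:
            self.assertions[h] = body
            parent = body.get("parent_hash")
            if parent is not None:
                self.children.setdefault(parent, []).append(h)
        return h

    def cascade(self, root: str):
        """Breadth-first walk from the invalidated root to every descendant."""
        seen = {root}
        queue = deque([(root, 0)])
        affected = []
        while queue:
            current, depth = queue.popleft()
            for child in self.children.get(current, []):
                if child not in seen:
                    seen.add(child)
                    affected.append({"hash": child, "depth": depth + 1})
                    queue.append((child, depth + 1))
        return affected


store = Store()
revenue = store.add({"subject": "BioStart", "predicate": "has_revenue", "object": 47_000_000})
valuation = store.add({"subject": "BioStart_Valuation", "predicate": "enterprise_value",
                       "object": 890_000_000, "parent_hash": revenue})
memo = store.add({"subject": "Board_Memo", "predicate": "cites_valuation",
                  "object": "draft", "parent_hash": valuation})
bid = store.add({"subject": "Bid", "predicate": "recommended_price",
                 "object": 900_000_000, "parent_hash": memo})

affected = store.cascade(revenue)
for entry in affected:
    print(entry["depth"], entry["hash"][:12])
```

Because the parent's hash is part of the child's hashed body, a child cannot exist without naming its parent, so there is no separate dependency table to forget to update.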
Feature 3: Multi-Signature Consensus
The Failure Mode
A junior analyst discovers a LinkedIn post suggesting BioStart is hiring M&A lawyers. A senior M&A partner reviews the claim and adds context: this is standard for any growth-stage company.
In Postgres, both opinions are just rows. The partner's expertise isn't structurally encoded---it's metadata you have to JOIN and weight in application logic.
The Postgres Attempt
-- analysts is created first because the other two tables reference it
CREATE TABLE analysts (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    reputation_score DECIMAL,
    role VARCHAR(50) -- 'junior', 'senior', 'partner'
);
CREATE TABLE assertions (
    id SERIAL PRIMARY KEY,
    subject VARCHAR(100),
    predicate VARCHAR(100),
    value TEXT,
    source_url TEXT,
    created_by INTEGER REFERENCES analysts(id),
    confidence DECIMAL
);
CREATE TABLE assertion_reviews (
    assertion_id INTEGER REFERENCES assertions(id),
    reviewer_id INTEGER REFERENCES analysts(id),
    review_type VARCHAR(20), -- 'endorse', 'dispute', 'context'
    comment TEXT,
    timestamp TIMESTAMPTZ
);
-- Query: Get assertion with weighted confidence
SELECT
a.*,
(a.confidence * creator.reputation_score +
COALESCE(SUM(reviewer.reputation_score *
CASE ar.review_type WHEN 'endorse' THEN 0.2 ELSE -0.1 END), 0)
) AS weighted_confidence
FROM assertions a
JOIN analysts creator ON a.created_by = creator.id
LEFT JOIN assertion_reviews ar ON a.id = ar.assertion_id
LEFT JOIN analysts reviewer ON ar.reviewer_id = reviewer.id
WHERE a.subject = 'BioStart' AND a.predicate = 'acquisition_signal'
GROUP BY a.id, creator.reputation_score;
Where it breaks:
- Weight calculation lives in SQL, but also in Python scripts, and in the reporting layer... they drift
- No cryptographic proof that the partner actually reviewed this---just a foreign key anyone could insert
- Reputation scores are mutable; historical queries return different results than original
The Episteme Solution
Signatures are cryptographic and additive:
POST /assert
{
"subject": "BioStart",
"predicate": "acquisition_signal",
"object": { "Text": "Hiring M&A lawyers per LinkedIn" },
"source_hash": "linkedin_screenshot_hash...",
"confidence": 0.6,
"signatures": [{
"agent_id": "junior_analyst_pub_key",
"signature": "ed25519_sig_1...",
"timestamp": 1706745600
}]
}
-- Senior partner co-signs with context
POST /cosign
{
"assertion_hash": "original_claim_hash...",
"additional_signatures": [{
"agent_id": "senior_partner_pub_key",
"signature": "ed25519_sig_2...",
"timestamp": 1706832000
}],
"context": "Standard for growth-stage companies; low signal"
}
Query resolution automatically weights by signer reputation:
GET /query?subject=BioStart&predicate=acquisition_signal&lens=authority
-> Returns claim with effective_confidence adjusted by signer weights
-> Shows cryptographic proof of who reviewed
Pillar: Multi-Signature Consensus. Signatures are structural, cryptographic, and immutable. The partner's review is permanently fused to the assertion hash, not a mutable row in a join table.
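The sketch below illustrates what "fused to the assertion hash" buys you: every signature is computed over the content hash, so a co-signature is an append, and any edit to the body invalidates all of them at once. Two loud assumptions: the hash is SHA-256 over canonical JSON, and HMAC with per-agent secrets stands in for Ed25519, which is not in the Python standard library. A real deployment would verify against public keys instead. The key names and helper functions are hypothetical.

```python
import hashlib
import hmac
import json


def assertion_hash(body: dict) -> str:
    """Content hash over a canonical JSON encoding (assumed scheme)."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sign(secret: bytes, digest: str) -> str:
    """Stand-in signature: HMAC-SHA256 over the assertion hash (not Ed25519)."""
    return hmac.new(secret, digest.encode("ascii"), hashlib.sha256).hexdigest()


def verified_signers(body: dict, signatures: list, keys: dict) -> list:
    """Return the agents whose signatures match the body as it is now."""
    digest = assertion_hash(body)
    return [
        s["agent_id"]
        for s in signatures
        if hmac.compare_digest(sign(keys[s["agent_id"]], digest), s["signature"])
    ]


# Hypothetical per-agent secrets; with Ed25519 these would be key pairs.
keys = {"junior_analyst": b"junior-secret", "senior_partner": b"partner-secret"}

body = {
    "subject": "BioStart",
    "predicate": "acquisition_signal",
    "object": {"Text": "Hiring M&A lawyers per LinkedIn"},
    "confidence": 0.6,
}
digest = assertion_hash(body)

# The junior analyst signs at assertion time; the partner co-signs later.
signatures = [{"agent_id": "junior_analyst", "signature": sign(keys["junior_analyst"], digest)}]
signatures.append({"agent_id": "senior_partner", "signature": sign(keys["senior_partner"], digest)})

tampered = dict(body, confidence=0.99)

print(verified_signers(body, signatures, keys))      # both signers verify
print(verified_signers(tampered, signatures, keys))  # no signer verifies
```

Contrast this with the join-table approach: inserting a row into assertion_reviews proves nothing, while a valid signature over the hash can only come from someone holding the key.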
Feature 4: Semantic Decay
The Failure Mode
The investigation runs for 6 months. Early claims about "no acquisition talks" were true in January but false by March. A naive query returns January data with equal weight to March data.
In Postgres, you filter by timestamp, but this is:
- Manual (every query needs WHERE timestamp > ...)
- Binary (data is either in or out, no gradual fading)
- Inconsistent across different query patterns
The Postgres Attempt
-- Add decay calculation to every query
SELECT
*,
confidence * POWER(0.9, EXTRACT(EPOCH FROM (NOW() - timestamp)) / 2592000)
AS decayed_confidence
FROM assertions
WHERE subject = 'BioStart'
AND predicate = 'acquisition_status'
ORDER BY decayed_confidence DESC;
Where it breaks:
- Decay formula must be duplicated in every query, every service, every report
- Different teams use different decay rates
- No way to query "show me what we believed on March 15" without reconstructing state
The Episteme Solution
Decay is a lens parameter, applied at read time:
GET /query?subject=BioStart&predicate=acquisition_status&lens=recency
&decay_halflife=30d
-> Returns claims with confidence automatically decayed
-> January denial: original 0.95 -> effective 0.03 (5 half-lives: 0.95 x 0.5^5)
-> March confirmation: original 0.88 -> effective 0.44 (1 half-life: 0.88 x 0.5)
The original claims remain in the DAG with full fidelity for audit:
GET /query?subject=BioStart&predicate=acquisition_status
&lens=recency&as_of=2024-01-15
-> Returns state of knowledge as of January 15
-> January denial shows full 0.95 confidence
Pillar: Semantic Decay. Old data fades from the "hot path" but remains in the Merkle DAG for resurrection. You get both fresh answers AND complete audit trails.
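Read-time decay is easy to sketch in Python. This assumes plain exponential half-life decay, effective = confidence x 0.5^(age / half-life), which is what a `decay_halflife` parameter suggests; the function name and the dates are hypothetical. Under that assumption, five half-lives reduce 0.95 to about 0.03 and one half-life reduces 0.88 to 0.44, while an `as_of` equal to the assertion time leaves confidence untouched.

```python
from datetime import datetime, timedelta, timezone


def decayed_confidence(confidence: float, asserted_at: datetime,
                       as_of: datetime, half_life: timedelta) -> float:
    """Exponential half-life decay, computed at read time.

    The stored confidence is never modified; only the returned value fades.
    """
    age = as_of - asserted_at
    if age < timedelta(0):
        # A time-travel query should exclude claims made after as_of.
        raise ValueError("claim was asserted after the as_of instant")
    return confidence * 0.5 ** (age / half_life)


half_life = timedelta(days=30)
january = datetime(2024, 1, 15, tzinfo=timezone.utc)  # illustrative dates
now = january + 5 * half_life                          # five half-lives later
march = now - half_life                                # one half-life before "now"

old_denial = decayed_confidence(0.95, january, now, half_life)
fresh_confirmation = decayed_confidence(0.88, march, now, half_life)
time_travel = decayed_confidence(0.95, january, january, half_life)

print(round(old_denial, 4), round(fresh_confirmation, 4), time_travel)
```

Passing a different `as_of` is all a time-travel query needs, which is why decay belongs in the lens rather than in the stored record.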
Feature 5: Visual Provenance (Bonus)
The Failure Mode
An analyst screenshots an org chart showing BioStart's CEO now reporting to TechCorp's board. This is powerful evidence, but:
- The screenshot could be faked
- The same image might appear in multiple contexts
- Text extraction from images loses visual context
The Postgres Attempt
CREATE TABLE visual_evidence (
id SERIAL PRIMARY KEY,
image_blob BYTEA,
extracted_text TEXT,
source_url TEXT,
timestamp TIMESTAMPTZ
);
-- Find similar images... somehow?
-- Postgres doesn't have native perceptual hashing
-- You'd need pgvector + external embedding service
Where it breaks:
- No native perceptual hash; requires external service for similarity
- Can't query "find all claims that use this image or similar images"
- Text extraction loses the visual proof
The Episteme Solution
POST /assert
{
"subject": "BioStart_CEO",
"predicate": "reports_to",
"object": { "Reference": "TechCorp_Board" },
"source_hash": "screenshot_content_hash...",
"visual_hash": "0xA3F2...", // pHash of org chart
"confidence": 0.82
}
Query by visual similarity:
GET /query?visual_near=0xA3F2...&threshold=0.9
-> Returns all assertions anchored to visually similar images
-> Catches duplicate evidence, fake variations, source reuse
Pillar: This extends First-Class Contradiction into the visual domain. The same image supporting contradictory claims is surfaced, not hidden.
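For intuition on what a `visual_near` query has to compute, here is a Python sketch of perceptual-hash matching. It assumes 64-bit pHashes compared by Hamming distance, with `threshold` read as the fraction of matching bits. Both are assumptions about the API, and the hash values and assertion names are made up for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of matching bits: 1.0 means identical hashes."""
    return 1.0 - hamming_distance(a, b) / bits


def visual_near(index: dict, query: int, threshold: float = 0.9, bits: int = 64) -> list:
    """Return assertion ids whose image hash is at least `threshold` similar."""
    return [
        assertion_id
        for assertion_id, image_hash in index.items()
        if similarity(image_hash, query, bits) >= threshold
    ]


original = 0xF0F0F0F0F0F0F0F0  # hypothetical pHash of the org chart
index = {
    "org_chart_original": original,
    "org_chart_recompressed": original ^ 0b111,  # 3 bits differ
    "unrelated_slide": original ^ 0xFFFF,        # 16 bits differ
}

matches = visual_near(index, original, threshold=0.9)
print(matches)
```

A linear scan like this is fine for a demo; at scale you would want an index structure built for Hamming-space lookups.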
The 5-Minute Demo
Run locally with Cargo (requires a Rust toolchain):
# Clone and start
git clone https://github.com/orchard9/stemedb
cd stemedb
cargo run --bin stemedb-server
# In another terminal:
# Insert contradictory revenue claims
curl -X POST http://localhost:18180/assert -d '{
"subject": "DemoCompany",
"predicate": "revenue",
"object": {"Number": 10000000},
"source_hash": "source_a",
"confidence": 0.9
}'
curl -X POST http://localhost:18180/assert -d '{
"subject": "DemoCompany",
"predicate": "revenue",
"object": {"Number": 15000000},
"source_hash": "source_b",
"confidence": 0.85
}'
# Query through different lenses
curl "http://localhost:18180/query?subject=DemoCompany&predicate=revenue&lens=consensus"
# -> Returns $10M (higher confidence)
curl "http://localhost:18180/query?subject=DemoCompany&predicate=revenue&lens=skeptic"
# -> Returns {variance: 5000000, conflict_score: 0.67}
curl "http://localhost:18180/query?subject=DemoCompany&predicate=revenue&lens=recency"
# -> Returns $15M (inserted second)
# Invalidate the first claim
curl -X POST http://localhost:18180/invalidate -d '{
"assertion_hash": "hash_of_first_claim",
"reason": "Source A was misattributed"
}'
# See cascade effects
curl "http://localhost:18180/cascade?root=hash_of_first_claim"
# -> Shows any dependent assertions
Time to value: Under 5 minutes from clone to seeing contradiction handling work.
What Postgres CAN Do
To be fair, Postgres handles much of this adequately for small-scale investigations.
Postgres is sufficient for:
- Storing claims with timestamps and sources (basic append-only pattern)
- Simple recency queries (ORDER BY timestamp DESC LIMIT 1)
- Analyst attribution (foreign key to users table)
- Basic confidence scores (DECIMAL column)
Postgres requires significant application code for:
- Contradiction surfacing (possible but manual)
- Single-depth dependency tracking (foreign keys work, recursive CTEs scale poorly)
- Review workflows (join tables work, but no cryptographic proof)
Postgres cannot cleanly handle:
- Native skeptic queries (return variance, not consensus)
- Deep cascade invalidation without duplicating graph logic in app layer
- Cryptographic multi-signature with reputation weighting
- Visual + text + semantic in unified query model
- Time-travel queries with consistent decay application
- O(1) branch creation for "what-if" scenarios
Regulatory Considerations
For production M&A due diligence:
- SEC Record Retention: The Merkle DAG provides an immutable audit trail, which can support record-retention requirements such as SEC Rule 17a-4
- Attorney-Client Privilege: Branch isolation can segregate privileged analysis
- Cross-Border Transactions: Visual provenance helps with multi-jurisdiction evidence standards
Episteme doesn't replace legal review---it ensures the data substrate supports the compliance requirements that Postgres struggles to enforce structurally.
Summary: Why Episteme for M&A?
| Problem | Postgres Approach | Episteme Approach | Pillar |
|---|---|---|---|
| Conflicting revenue figures | Pick one or version table | Both coexist; lens resolves | First-Class Contradiction |
| Retracted SEC filing | Manual cascade with recursive CTE | Automatic via Merkle DAG | Invalidation Cascades |
| Partner review adds weight | Join table + reputation logic | Cryptographic co-signature | Multi-Signature Consensus |
| January data still showing | Manual WHERE clauses | Decay function in lens | Semantic Decay |
| Screenshot evidence | External service + pgvector | Native pHash in assertion | Visual Provenance |
The $180M overpayment I witnessed happened because the database couldn't hold contradictions long enough for humans to understand the disagreement. Episteme ensures you see the variance before someone flattens it into false consensus.