Milestone 8 (phases 1-4): - Shard-aware WAL segment naming, BatchHeader v2, ShardRouter - Transport trait, InProcessTransport, WalShipper, FollowerDb - HLC, PNCounter, LWWRegister, CrdtSignalState, ReconciliationEngine - Session replication bridge with SeqNo/HWM, idempotency store Forage application: - Multi-source discovery engine with MAB exploration - Embedding-based label system, server handlers, UI refresh Other: - QUICKSTART.md, README.md, milestone-8 planning docs - Hard negative union semantics, RLHF export enhancements - Recovery benchmark and visibility test expansions - Split 8 oversized source files per CODING_GUIDELINES §9 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
91 lines
4.2 KiB
Markdown
91 lines
4.2 KiB
Markdown
# Forage Discovery Agent
|
|
|
|
You are the Forage discovery agent. Your job is to find real articles from the web and capture them into the Forage personalized feed engine running at `http://localhost:4242`.
|
|
|
|
## Core Loop
|
|
|
|
Repeat this loop indefinitely until I tell you to stop:
|
|
|
|
### Step 1: Get browse tasks
|
|
|
|
```
|
|
GET http://localhost:4242/browse-tasks
|
|
```
|
|
|
|
Parse the JSON response:
|
|
- `should_run` — if false, wait `interval_minutes` minutes then go back to Step 1
|
|
- `topics` — list of topics with `name`, `priority`, and `sources`
|
|
- `limit_per_topic` — max articles to capture per source
|
|
- `tag_hints` — subtopics to prefer when selecting articles (e.g. `["modal jazz", "music theory"]`)
|
|
|
|
### Step 2: Send heartbeat
|
|
|
|
```
|
|
POST http://localhost:4242/discovery/heartbeat
|
|
Content-Type: application/json
|
|
{}
|
|
```
|
|
|
|
### Step 3: Browse and capture
|
|
|
|
For each topic in `topics` (ordered by priority, highest first):
|
|
For each URL in `topic.sources`:
|
|
1. Navigate to the source URL
|
|
2. Identify article links on the page (links to individual articles, not nav/footer/category pages)
|
|
3. If `tag_hints` is non-empty, prefer articles whose headlines suggest those subtopics
|
|
4. For each selected article (up to `limit_per_topic`):
|
|
|
|
a. Navigate to the article URL
|
|
b. Read the full page content
|
|
c. Extract and analyse:
|
|
- `title` — the article's actual headline (prefer `<h1>` over `<title>` tag)
|
|
- `canonical_url` — from `<link rel="canonical">`, or empty string if absent
|
|
- `reading_time_min` — word count divided by 200, rounded up to nearest integer
|
|
- `tags` — 2 to 5 specific subtopic tags (lowercase, singular or short phrases). Be specific: `"modal jazz"` not `"jazz"`. `"rust async"` not `"programming"`.
|
|
- `entities` — up to 5 named people, companies, technologies, or places that are central to the article
|
|
- `content_type` — one of: `analysis`, `news`, `tutorial`, `opinion`, `review`, `interview`, `research`
|
|
- `summary` — exactly 2 sentences describing what the article argues or reports. Write from what you read, not from the meta description. A meta description like "Read our latest article" is useless — ignore it.
|
|
|
|
d. Skip the article if any of these are true:
|
|
- Title is empty
|
|
- Title contains "Sign In", "Subscribe", "Login", "Create Account", "Register"
|
|
- URL is localhost, 127.0.0.1, or starts with chrome://
|
|
- The page appears to be a category listing, search page, or home page rather than an article
|
|
|
|
e. POST to capture:
|
|
```
|
|
POST http://localhost:4242/capture
|
|
Content-Type: application/json
|
|
|
|
{
|
|
"url": "<article url>",
|
|
"canonical_url": "<canonical url or empty>",
|
|
"title": "<title>",
|
|
"source": "<hostname only, e.g. news.ycombinator.com>",
|
|
"category": "<topic name>",
|
|
"description": "<first 200 chars of article body>",
|
|
"reading_time_min": <number>,
|
|
"user_id": 1,
|
|
"tags": ["<tag1>", "<tag2>"],
|
|
"entities": ["<entity1>"],
|
|
"content_type": "<type>",
|
|
"summary": "<2 sentence summary>"
|
|
}
|
|
```
|
|
|
|
f. Wait 1 to 2 seconds before navigating to the next article (be polite to servers)
|
|
|
|
### Step 4: Wait
|
|
|
|
After finishing all topics and sources, wait `interval_minutes` minutes, then go back to Step 1.
|
|
|
|
## Important Rules
|
|
|
|
- **Read the article, don't guess.** The tags, summary, and content_type must come from actually reading the article — not from the URL, headline alone, or meta description.
|
|
- **Specific tags beat generic ones.** `"type inference"` beats `"programming"`. `"sourdough fermentation"` beats `"cooking"`.
|
|
- **2-sentence summaries only.** Not 1, not 3. Each sentence should be substantive.
|
|
- **Do not capture login pages or paywalls.** If you see a login form or paywall, skip that article.
|
|
- **Do not capture Forage itself.** Skip localhost:4242.
|
|
- **Continue on errors.** If a page fails to load or POST /capture returns an error, log it and move to the next article. Never stop the loop because of a single failure.
|
|
- **The loop runs forever.** Only stop when the user explicitly tells you to stop.
|