jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs

Milestone 8 (phases 1-4):
- Shard-aware WAL segment naming, BatchHeader v2, ShardRouter
- Transport trait, InProcessTransport, WalShipper, FollowerDb
- HLC, PNCounter, LWWRegister, CrdtSignalState, ReconciliationEngine
- Session replication bridge with SeqNo/HWM, idempotency store

Forage application:
- Multi-source discovery engine with MAB exploration
- Embedding-based label system, server handlers, UI refresh

Other:
- QUICKSTART.md, README.md, milestone-8 planning docs
- Hard negative union semantics, RLHF export enhancements
- Recovery benchmark and visibility test expansions
- Split 8 oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-24 13:17:19 -07:00

4.7 KiB


iknowyou — Dev Setup

Infrastructure

GPU Server


Host	`msd5685.mjhst.com`
SSH	`ssh ubuntu@msd5685.mjhst.com`
GPU	NVIDIA RTX 6000 Ada Generation (48 GB VRAM)
RAM	94 GB
CPUs	20
Disk	243 GB (172 GB free)
OS	Ubuntu 22.04, kernel 5.15.0-161
CUDA	13.0 (nvcc 13.0.88)
Driver	535.288.01
Public IP	208.122.213.81

vLLM + Qwen3-8B

Model: Qwen/Qwen3-8B (BF16, ~15.3 GB on GPU)

API: OpenAI-compatible at http://msd5685.mjhst.com:8000/v1

Service: systemd unit vllm.service — starts on boot, restarts on failure.

# Check status
ssh ubuntu@msd5685.mjhst.com "sudo systemctl status vllm"

# View logs
ssh ubuntu@msd5685.mjhst.com "sudo journalctl -u vllm -f"

# Restart
ssh ubuntu@msd5685.mjhst.com "sudo systemctl restart vllm"

Config: /etc/systemd/system/vllm.service

[Service]
ExecStart=/home/ubuntu/vllm-env/bin/vllm serve Qwen/Qwen3-8B \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85

Python env: /home/ubuntu/vllm-env (Python 3.10, vLLM 0.15.1)
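As a rough sanity check on the flags above, the memory left over for the KV cache can be estimated from the numbers in this section. This is a back-of-the-envelope sketch, not something vLLM reports: the layer count, KV-head count, and head dimension are assumptions about the Qwen3-8B architecture and should be verified against the model's config.json. It also ignores activation and CUDA graph overhead and treats GB and GiB as the same unit.

```python
# Rough KV-cache budget for the vllm.service flags above.
GPU_VRAM_GB = 48.0           # RTX 6000 Ada
GPU_MEM_UTILIZATION = 0.85   # --gpu-memory-utilization
WEIGHTS_GB = 15.3            # Qwen3-8B in BF16, as measured on the GPU
MAX_MODEL_LEN = 32768        # --max-model-len

# Assumed model shape (hypothetical until checked against config.json).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2          # BF16


def kv_bytes_per_token():
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_budget_gb():
    """Memory vLLM may use, minus the model weights."""
    return GPU_VRAM_GB * GPU_MEM_UTILIZATION - WEIGHTS_GB


def max_cached_tokens():
    """Upper bound on tokens that fit in the remaining budget."""
    return int(kv_budget_gb() * 1024**3 // kv_bytes_per_token())


def full_length_sequences():
    """How many sequences of MAX_MODEL_LEN tokens fit at once."""
    return max_cached_tokens() // MAX_MODEL_LEN
```

Under these assumptions the budget is about 25.5 GB, which is enough for a handful of full-length 32K contexts at once. The real figure is logged by vLLM at startup and should be preferred over this estimate.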

Using the API

Chat completion

curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 512
  }'
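The same request can be issued from Python without extra dependencies. The following is a minimal sketch using only the standard library; the helper names build_chat_request and chat are illustrative and not part of any existing iknowyou code.

```python
import json
import urllib.request

BASE_URL = "http://msd5685.mjhst.com:8000/v1"
MODEL = "Qwen/Qwen3-8B"


def build_chat_request(messages, base_url=BASE_URL, **params):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": MODEL, "messages": messages, **params}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, timeout=60, **params):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(messages, **params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, chat([{"role": "user", "content": "Hello"}], temperature=0.7, top_p=0.8, max_tokens=512) mirrors the curl call above. When going through the SSH tunnel, pass base_url="http://localhost:8000/v1".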

Thinking mode

Qwen3 reasons by default. Thinking can be toggled with a /think or /no_think tag in the user message, or per request via chat_template_kwargs:

# Thinking enabled (default — model reasons in <think> blocks before answering)
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 23 * 47?"}],
    "temperature": 0.6,
    "top_p": 0.95
  }'

# Thinking disabled (faster, no reasoning trace)
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 23 * 47?"}],
    "temperature": 0.7,
    "top_p": 0.8,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Recommended sampling:

Thinking mode: temperature=0.6, top_p=0.95, top_k=20
Non-thinking mode: temperature=0.7, top_p=0.8, top_k=20
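The curl examples above omit top_k. One way to keep the recommended values in a single place is a small preset helper, sketched below. The function names are illustrative. top_k is not part of the OpenAI schema; the sketch assumes vLLM accepts it as an extra top-level field in the request body. The split_reasoning helper is also an assumption: with --reasoning-parser enabled, the reasoning trace is expected in a separate message field, but the field name (reasoning or reasoning_content) may differ between vLLM versions, so both are checked.

```python
THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
NON_THINKING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}


def sampling_params(thinking=True):
    """Return the recommended sampling settings for the chosen mode."""
    params = dict(THINKING if thinking else NON_THINKING)
    if not thinking:
        # Disable the reasoning trace through the chat template.
        params["chat_template_kwargs"] = {"enable_thinking": False}
    return params


def build_payload(prompt, thinking=True, model="Qwen/Qwen3-8B"):
    """Assemble a chat completion request body for a single user prompt."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling_params(thinking),
    }


def split_reasoning(message):
    """Return (reasoning, answer) from a response message dict.

    The reasoning field name is an assumption; check one real response.
    """
    reasoning = message.get("reasoning") or message.get("reasoning_content")
    return reasoning, message.get("content")
```

The dict returned by build_payload can be serialized with json.dumps and posted to /v1/chat/completions as in the curl examples.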

Structured output (for Observer)

curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Extract sentiment from: I love this idea!"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "sentiment",
        "schema": {
          "type": "object",
          "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "confidence": {"type": "number"}
          },
          "required": ["sentiment", "confidence"]
        }
      }
    },
    "chat_template_kwargs": {"enable_thinking": false}
  }'
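With response_format set, the message content should be a JSON string matching the schema, but it is still worth checking on the client side before the Observer acts on it. The sketch below parses and validates a response body for the sentiment schema above; parse_sentiment is a hypothetical helper, and the checks are written by hand rather than with a JSON Schema library.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_sentiment(response_body):
    """Extract (sentiment, confidence) from a chat completion response dict.

    Raises ValueError if the content does not match the sentiment schema.
    """
    content = response_body["choices"][0]["message"]["content"]
    try:
        result = json.loads(content)
    except (TypeError, json.JSONDecodeError) as exc:
        raise ValueError(f"content is not valid JSON: {content!r}") from exc
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")

    sentiment = result.get("sentiment")
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")

    confidence = result.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError(f"confidence is not a number: {confidence!r}")

    return sentiment, float(confidence)
```

Note that the schema does not bound confidence, so a range check (for example 0 to 1) would need either a schema change or an extra client-side test.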

Streaming

curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true,
    "temperature": 0.7
  }'
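A streamed response arrives as server-sent events: each event is a line of the form "data: {json chunk}", and the stream ends with "data: [DONE]". The sketch below shows one way to reassemble the text from those lines; iter_stream_text is an illustrative name, and it only reads choices[0].delta.content, ignoring any reasoning deltas.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

An HTTP response object returned by urllib.request.urlopen iterates over lines as bytes, so it can be passed to iter_stream_text directly and the fragments printed as they arrive.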

Check model status

curl http://msd5685.mjhst.com:8000/v1/models
curl http://msd5685.mjhst.com:8000/health
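After a restart the service needs some time to load the model before it answers, so scripts may want to poll /health instead of calling the API immediately. A minimal sketch, assuming /health answers with HTTP 200 once the server is ready; the function names and the retry defaults are illustrative.

```python
import time
import urllib.request

HEALTH_URL = "http://msd5685.mjhst.com:8000/health"


def probe_health(url=HEALTH_URL, timeout=5):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError, refused connections, and timeouts.
        return False


def wait_until_ready(probe=probe_health, attempts=30, interval=2.0, sleep=time.sleep):
    """Poll until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

The probe and sleep arguments are injectable so the loop can be tested without a live server or real delays.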

NVIDIA Driver Notes

The server had a driver version mismatch (kernel module 535.274 vs userspace 535.288) on first setup. Fixed by:

# Unload old modules
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
# Reload with new version
sudo modprobe nvidia && sudo modprobe nvidia_uvm

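After a reboot, the DKMS-built 535.288 module loads automatically. If nvidia-smi ever shows "Driver/library version mismatch" again, either reboot or run the rmmod/modprobe sequence above. Stop vllm first (sudo systemctl stop vllm), because rmmod fails while a process still holds the GPU.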
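The mismatch can also be detected without nvidia-smi by comparing the loaded kernel module version with the userspace library version. The sketch below only does the parsing and comparison. It assumes the kernel module version appears on the "NVRM version:" line of /proc/driver/nvidia/version and that the userspace version is the suffix of the libnvidia-ml.so.<version> file name; both are conventions of typical Linux installs and should be confirmed on this host. The sample strings in the usage note are illustrative, not captured output.

```python
import re

VERSION_PATTERN = r"\d+\.\d+(?:\.\d+)?"


def kernel_module_version(proc_text):
    """Extract the driver version from /proc/driver/nvidia/version contents."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version:"):
            match = re.search(rf"\b({VERSION_PATTERN})\b", line)
            return match.group(1) if match else None
    return None


def library_version(lib_filename):
    """Extract the version suffix from a name like libnvidia-ml.so.535.288.01."""
    match = re.search(rf"\.so\.({VERSION_PATTERN})$", lib_filename)
    return match.group(1) if match else None


def versions_match(kernel, library):
    """True only when both versions are known and identical."""
    return kernel is not None and kernel == library
```

For instance, a kernel module line ending in "Kernel Module  535.274.02 ..." together with a library named libnvidia-ml.so.535.288.01 would be reported as a mismatch, which is the situation described above.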

Topology

Local machine (macOS)
  │
  │  SSH tunnel or direct HTTP
  │
  ▼
msd5685.mjhst.com (Ubuntu 22.04)
  │
  ├── vLLM (systemd, port 8000)
  │     └── Qwen/Qwen3-8B (BF16, 48GB RTX 6000 Ada)
  │
  └── [future] iknowyou server (port TBD)
        └── embedded tidalDB

For local development, use an SSH tunnel to reach the API:

ssh -L 8000:localhost:8000 ubuntu@msd5685.mjhst.com
# Then: curl http://localhost:8000/v1/models

Or hit it directly at http://msd5685.mjhst.com:8000 (port 8000 must be open in the firewall).
