# iknowyou: Dev Setup

## Infrastructure

### Local Personalization Engine (tidalDB-backed)

Run the personalization engine server locally (default bind: `127.0.0.1:7777`):

```bash
cargo run -p iknowyou-engine --bin server --features synap-aux
```

Environment variables:

- `IKY_ENGINE_BIND` (default `127.0.0.1:7777`)
- `IKY_ENGINE_DATA_DIR` (default temp dir `iknowyou_engine_data`)
- `IKY_ENGINE_URL` (used by Next.js API route; default `http://127.0.0.1:7777`)
- `SYNAP_URL` / `SYNAP_API_KEY` (optional; enables auxiliary memory writes only)

Health check:

```bash
curl http://127.0.0.1:7777/healthz
```

The `app/api/chat/route.ts` path now writes observer-driven personalization feedback to this service (`/v1/feedback`, `/v1/sessions/*`) while Synap remains optional auxiliary memory.

### GPU Server

| | |
|---|---|
| **Host** | `msd5685.mjhst.com` |
| **SSH** | `ssh ubuntu@msd5685.mjhst.com` |
| **GPU** | NVIDIA RTX 6000 Ada Generation (48 GB VRAM) |
| **RAM** | 94 GB |
| **CPUs** | 20 |
| **Disk** | 243 GB (172 GB free) |
| **OS** | Ubuntu 22.04, kernel 5.15.0-161 |
| **CUDA** | 13.0 (nvcc 13.0.88) |
| **Driver** | 535.288.01 |
| **Public IP** | 208.122.213.81 |

### vLLM + Qwen3-8B

**Model:** `Qwen/Qwen3-8B` (BF16, ~15.3 GB on GPU)

**API:** OpenAI-compatible at `http://msd5685.mjhst.com:8000/v1`

**Service:** systemd unit `vllm.service` (starts on boot, restarts on failure).

```bash
# Check status
ssh ubuntu@msd5685.mjhst.com "sudo systemctl status vllm"

# View logs
ssh ubuntu@msd5685.mjhst.com "sudo journalctl -u vllm -f"

# Restart
ssh ubuntu@msd5685.mjhst.com "sudo systemctl restart vllm"
```

**Config:** `/etc/systemd/system/vllm.service`

```ini
[Service]
ExecStart=/home/ubuntu/vllm-env/bin/vllm serve Qwen/Qwen3-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --reasoning-parser qwen3 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85
```

**Python env:** `/home/ubuntu/vllm-env` (Python 3.10, vLLM 0.15.1)

## Using the API

### Chat completion

```bash
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 512
  }'
```

### Thinking mode

Qwen3 supports `/think` and `/no_think` toggles in the user message. Thinking can also be switched per request via `chat_template_kwargs`:

```bash
# Thinking enabled (default; the model reasons in <think> blocks before answering)
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 23 * 47?"}],
    "temperature": 0.6,
    "top_p": 0.95
  }'

# Thinking disabled (faster, no reasoning trace)
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 23 * 47?"}],
    "temperature": 0.7,
    "top_p": 0.8,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

**Recommended sampling:**

- Thinking mode: `temperature=0.6, top_p=0.95, top_k=20`
- Non-thinking mode: `temperature=0.7, top_p=0.8, top_k=20`

### Structured output (for Observer)

```bash
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Extract sentiment from: I love this idea!"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "sentiment",
        "schema": {
          "type": "object",
          "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "confidence": {"type": "number"}
          },
          "required": ["sentiment", "confidence"]
        }
      }
    },
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

### Streaming

```bash
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true,
    "temperature": 0.7
  }'
```

### Check model status

```bash
curl http://msd5685.mjhst.com:8000/v1/models
curl http://msd5685.mjhst.com:8000/health
```

## NVIDIA Driver Notes

The server had a driver version mismatch (kernel module 535.274 vs userspace 535.288) on first setup. Fixed by:

```bash
# Unload old modules
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia

# Reload with new version
sudo modprobe nvidia && sudo modprobe nvidia_uvm
```

After a reboot, the DKMS-built 535.288 module loads automatically. If `nvidia-smi` ever shows "Driver/library version mismatch" again, either reboot or run the rmmod/modprobe sequence above.

## Topology

```
Local machine (macOS)
  │
  │  SSH tunnel or direct HTTP
  │
  ▼
msd5685.mjhst.com (Ubuntu 22.04)
  │
  ├── vLLM (systemd, port 8000)
  │     └── Qwen/Qwen3-8B (BF16, 48GB RTX 6000 Ada)
  │
  └── [future] iknowyou server (port TBD)
        └── embedded tidalDB
```

For local development, use an SSH tunnel to reach the API:

```bash
ssh -L 8000:localhost:8000 ubuntu@msd5685.mjhst.com
# Then: curl http://localhost:8000/v1/models
```

Or hit it directly at `http://msd5685.mjhst.com:8000` (the port must be open in the firewall).
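
The tunnel-or-direct choice and the recommended sampling settings can be wrapped in two small shell helpers so they do not need retyping for every request. This is a minimal sketch, not part of the repository: the `VLLM_BASE_URL` variable and both function names are hypothetical, the default assumes the SSH tunnel on `localhost:8000`, and the prompt is assumed to contain no double quotes, backslashes, or newlines (it is not JSON-escaped). `top_k` is passed as an extra sampling field, which vLLM's OpenAI-compatible server is expected to accept.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: VLLM_BASE_URL, vllm_base_url, and chat_payload are
# illustrative names, not something the project defines.

# Print the API base URL: the SSH tunnel by default, or whatever
# VLLM_BASE_URL is set to (e.g. http://msd5685.mjhst.com:8000).
vllm_base_url() {
  printf '%s\n' "${VLLM_BASE_URL:-http://localhost:8000}"
}

# Print a chat-completions request body that uses the recommended sampling
# settings for the given mode ("thinking" or "no-thinking").
# Assumption: the prompt needs no JSON escaping.
chat_payload() {
  local mode="$1" prompt="$2" sampling
  case "$mode" in
    thinking)
      sampling='"temperature": 0.6, "top_p": 0.95, "top_k": 20' ;;
    no-thinking)
      sampling='"temperature": 0.7, "top_p": 0.8, "top_k": 20, "chat_template_kwargs": {"enable_thinking": false}' ;;
    *)
      echo "usage: chat_payload thinking|no-thinking PROMPT" >&2
      return 1 ;;
  esac
  printf '{"model": "Qwen/Qwen3-8B", "messages": [{"role": "user", "content": "%s"}], %s}\n' \
    "$prompt" "$sampling"
}

# Example (requires the tunnel or direct access to be up):
#   chat_payload no-thinking "What is 23 * 47?" |
#     curl -s "$(vllm_base_url)/v1/chat/completions" \
#       -H "Content-Type: application/json" -d @-
```

Keeping the payload builder separate from the `curl` call means the request body can be inspected or diffed before anything is sent to the GPU server.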