# iknowyou — Dev Setup

## Infrastructure

### GPU Server

| Property | Value |
|---|---|
| **Host** | `msd5685.mjhst.com` |
| **SSH** | `ssh ubuntu@msd5685.mjhst.com` |
| **GPU** | NVIDIA RTX 6000 Ada Generation (48 GB VRAM) |
| **RAM** | 94 GB |
| **CPUs** | 20 |
| **Disk** | 243 GB (172 GB free) |
| **OS** | Ubuntu 22.04, kernel 5.15.0-161 |
| **CUDA toolkit** | 13.0 (nvcc 13.0.88) |
| **Driver** | 535.288.01 |
| **Public IP** | 208.122.213.81 |

### vLLM + Qwen3-8B

**Model:** `Qwen/Qwen3-8B` (BF16, ~15.3 GB on GPU)

**API:** OpenAI-compatible at `http://msd5685.mjhst.com:8000/v1`

**Service:** systemd unit `vllm.service` — starts on boot, restarts on failure.

```bash
# Check status
ssh ubuntu@msd5685.mjhst.com "sudo systemctl status vllm"

# View logs
ssh ubuntu@msd5685.mjhst.com "sudo journalctl -u vllm -f"

# Restart
ssh ubuntu@msd5685.mjhst.com "sudo systemctl restart vllm"
```

**Config:** `/etc/systemd/system/vllm.service`

```ini
[Service]
ExecStart=/home/ubuntu/vllm-env/bin/vllm serve Qwen/Qwen3-8B \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85
```

**Python env:** `/home/ubuntu/vllm-env` (Python 3.10, vLLM 0.15.1)
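
As a rough sanity check on `--gpu-memory-utilization 0.85` and `--max-model-len 32768`, the sketch below estimates how much VRAM is left for the KV cache once the weights are loaded. The architecture numbers (36 layers, 8 KV heads, head dimension 128) are assumptions and should be checked against the model's `config.json`. Activation and CUDA-graph overhead is ignored, so the real figure will be somewhat lower.

```bash
#!/usr/bin/env bash
# Rough KV-cache budget for the vLLM flags above (integer arithmetic, MiB).
vram_mib=$((48 * 1024))      # 48 GB card, treated as 48 GiB
util_pct=85                  # --gpu-memory-utilization 0.85
weights_mib=15667            # ~15.3 GB of BF16 weights (15.3 * 1024)
max_model_len=32768          # --max-model-len

# ASSUMPTION: Qwen3-8B has 36 layers, 8 KV heads, head_dim 128 (check config.json).
layers=36
kv_heads=8
head_dim=128
bytes_per_value=2            # BF16

# Each token stores one K and one V vector per layer.
kv_bytes_per_token=$((2 * layers * kv_heads * head_dim * bytes_per_value))

budget_mib=$((vram_mib * util_pct / 100))
kv_mib=$((budget_mib - weights_mib))
kv_tokens=$((kv_mib * 1024 * 1024 / kv_bytes_per_token))
full_len_seqs=$((kv_tokens / max_model_len))

echo "KV cache per token:     ${kv_bytes_per_token} bytes"
echo "VRAM left for KV cache: ${kv_mib} MiB"
echo "Tokens that fit:        ${kv_tokens}"
echo "Concurrent ${max_model_len}-token sequences: ${full_len_seqs}"
```

Under these assumptions the cache costs about 144 KiB per token, which leaves room for only a handful of full-length sequences at once. vLLM logs the actual KV-cache size at startup (`journalctl -u vllm`), and that is the number to trust.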

## Using the API

### Chat completion

```bash
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 512
  }'
```

### Thinking mode

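Qwen3 thinks by default. Because the server runs with `--reasoning-parser qwen3`, vLLM returns the reasoning trace in a separate field of the response message rather than inline in `content`. Thinking can be toggled per request, either with a `/think` or `/no_think` tag in the user message or with `enable_thinking` in `chat_template_kwargs`: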

```bash
# Thinking enabled (default — model reasons in <think> blocks before answering)
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 23 * 47?"}],
    "temperature": 0.6,
    "top_p": 0.95
  }'

# Thinking disabled (faster, no reasoning trace)
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What is 23 * 47?"}],
    "temperature": 0.7,
    "top_p": 0.8,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

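**Recommended sampling** (`top_k` is a vLLM extension to the OpenAI schema and goes in the request body alongside `temperature`):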
- Thinking mode: `temperature=0.6, top_p=0.95, top_k=20`
- Non-thinking mode: `temperature=0.7, top_p=0.8, top_k=20`
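
To keep the two presets in one place, a small helper can emit the sampling fields for either mode. This is only a sketch: `sampling_fields` is a hypothetical name, and it prints a JSON fragment (no surrounding braces) to be spliced into a request body.

```bash
# Print the recommended sampling fields for a mode as a JSON fragment.
# Hypothetical helper, not part of vLLM or any existing script.
sampling_fields() {
  case $1 in
    thinking)
      printf '%s' '"temperature": 0.6, "top_p": 0.95, "top_k": 20' ;;
    no_thinking)
      # Non-thinking preset also switches the chat template's thinking off.
      printf '%s' '"temperature": 0.7, "top_p": 0.8, "top_k": 20, "chat_template_kwargs": {"enable_thinking": false}' ;;
    *)
      echo "usage: sampling_fields thinking|no_thinking" >&2
      return 2 ;;
  esac
}

# Usage (hits the server, so left commented out):
# curl http://msd5685.mjhst.com:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @- <<EOF
# {"model": "Qwen/Qwen3-8B",
#  "messages": [{"role": "user", "content": "What is 23 * 47?"}],
#  $(sampling_fields thinking)}
# EOF
```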

### Structured output (for Observer)

```bash
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Extract sentiment from: I love this idea!"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "sentiment",
        "schema": {
          "type": "object",
          "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "confidence": {"type": "number"}
          },
          "required": ["sentiment", "confidence"]
        }
      }
    },
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```
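
The structured result comes back as a JSON string inside `choices[0].message.content`, so a consumer has to decode it a second time. A minimal sketch of that step, assuming `jq` is installed on the calling machine (`parse_sentiment` is a hypothetical helper name):

```bash
# Read a chat-completions response on stdin and print "<sentiment> <confidence>".
# Exits non-zero if the content is not valid JSON or does not fit the schema.
# ASSUMPTION: jq is available; parse_sentiment is an illustrative name.
parse_sentiment() {
  jq -er '
    .choices[0].message.content
    | fromjson
    | select(.sentiment == "positive" or .sentiment == "negative" or .sentiment == "neutral")
    | select(.confidence | type == "number")
    | "\(.sentiment) \(.confidence)"
  '
}

# Usage: append "| parse_sentiment" to the curl command above (add -s to silence progress).
```

Re-checking the enum on the client is redundant when guided decoding works, but it turns a silent schema drift into a non-zero exit status.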

### Streaming

```bash
curl http://msd5685.mjhst.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true,
    "temperature": 0.7
  }'
```
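
The stream arrives as server-sent events: one `data: {...}` line per chunk, terminated by `data: [DONE]`. The sketch below strips that framing so the chunks can be piped into a JSON tool. `sse_payloads` is a hypothetical helper, and the `jq` call in the usage comment assumes `jq` is installed.

```bash
# Read an SSE stream on stdin, print each JSON payload on its own line,
# and stop at the terminating "data: [DONE]" marker.
sse_payloads() {
  local line cr
  cr=$(printf '\r')
  while IFS= read -r line; do
    line=${line%"$cr"}            # tolerate CRLF line endings
    case $line in
      data:*) ;;                  # keep data lines
      *) continue ;;              # skip blank separators and comments
    esac
    line=${line#data:}
    line=${line# }                # optional single space after the colon
    [ "$line" = "[DONE]" ] && break
    printf '%s\n' "$line"
  done
}

# Usage (hits the server, so left commented out); -N disables curl's buffering:
# curl -sN http://msd5685.mjhst.com:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "Qwen/Qwen3-8B", "stream": true,
#        "messages": [{"role": "user", "content": "Tell me a short story."}]}' \
#   | sse_payloads | jq -rj '.choices[0].delta.content // empty'
```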

### Check model status

```bash
curl http://msd5685.mjhst.com:8000/v1/models
curl http://msd5685.mjhst.com:8000/health
```
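
After a restart the model takes a while to load, so scripts that depend on the API may want to poll `/health` before sending requests. A minimal sketch; `wait_for_vllm`, the 300-second default, and the 2-second poll interval are illustrative choices, not existing tooling:

```bash
# Block until GET <base_url>/health succeeds, or fail after timeout_s seconds.
wait_for_vllm() {
  local base_url timeout_s deadline
  base_url=${1:-http://msd5685.mjhst.com:8000}
  timeout_s=${2:-300}
  deadline=$(( $(date +%s) + timeout_s ))
  # -f makes curl return non-zero on HTTP errors as well as connection failures.
  until curl -sf -o /dev/null --max-time 5 "$base_url/health"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "vLLM not healthy after ${timeout_s}s" >&2
      return 1
    fi
    sleep 2
  done
}

# Usage: wait_for_vllm http://localhost:8000 120 && curl http://localhost:8000/v1/models
```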

## NVIDIA Driver Notes

On first setup the loaded kernel module (535.274) did not match the userspace libraries (535.288). Reloading the module fixed it:

```bash
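# Stop anything holding the GPU, otherwise rmmod fails because the module is in use
sudo systemctl stop vllm
# Unload the old modules (dependents first)
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
# Load the newly installed version
sudo modprobe nvidia && sudo modprobe nvidia_uvm
# Bring the API back up
sudo systemctl start vllm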
```

After a reboot, the DKMS-built 535.288 module loads automatically. If `nvidia-smi` ever shows "Driver/library version mismatch" again, either reboot or run the rmmod/modprobe sequence above.
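
For a monitoring script, the mismatch can be detected from the text `nvidia-smi` prints. This is a sketch that matches on the error string quoted above and on the `Driver Version:` header of a normal run; `classify_nvidia_smi` is a hypothetical helper name.

```bash
# Classify nvidia-smi output passed as $1: prints "ok", "mismatch", or "error".
classify_nvidia_smi() {
  case $1 in
    *"Driver/library version mismatch"*) echo mismatch ;;
    *"Driver Version:"*)                 echo ok ;;
    *)                                   echo error ;;
  esac
}

# Usage on the server (left commented out because it needs the GPU host):
# status=$(classify_nvidia_smi "$(nvidia-smi 2>&1)")
# [ "$status" = mismatch ] && echo "Reboot or reload the nvidia modules" >&2
```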

## Topology

```
Local machine (macOS)
  │
  │  SSH tunnel or direct HTTP
  │
  ▼
msd5685.mjhst.com (Ubuntu 22.04)
  │
  ├── vLLM (systemd, port 8000)
  │     └── Qwen/Qwen3-8B (BF16, 48GB RTX 6000 Ada)
  │
  └── [future] iknowyou server (port TBD)
        └── embedded tidalDB
```

For local development, use an SSH tunnel to reach the API:

```bash
ssh -L 8000:localhost:8000 ubuntu@msd5685.mjhst.com
# Then: curl http://localhost:8000/v1/models
```

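Alternatively, call `http://msd5685.mjhst.com:8000` directly. This only works if port 8000 is open in the firewall, and the `ExecStart` line shown earlier sets no `--api-key`, so anyone who can reach the port can use the API.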