# LoRA/Adapters for Persistent Intelligence

## Purpose
Clarify the role of LoRA/adapter fine-tunes in the Persistent Intelligence MVP and beyond. Define where adapters add the most value, the minimal data needed, how to evaluate impact, and practical deployment paths for local vs cloud LLMs.
## Background: What are LoRA/Adapters?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method. Instead of updating all model weights, we insert small low-rank matrices (adapters) into specific layers (e.g., attention projections) and train only those. At inference, the base model + adapter weights are combined to produce adapted behavior with minimal additional memory/compute.
- Core idea: ΔW ≈ A · B where A∈R^{d×r}, B∈R^{r×k}, rank r≪min(d,k)
- We train A and B while keeping original W frozen
- Typical target modules: q_proj, k_proj, v_proj, o_proj, sometimes ffn/down/up
Key terms:

- PEFT: Family of methods like LoRA, Prefix-Tuning, P-Tuning, Adapters
- Rank (r): Adapter capacity. Higher r = more capacity, more VRAM
- Alpha: Scaling factor for the adapter contribution
- Target Modules: Which layers get adapters (attention vs MLP)
- Merged vs Dynamic: Merge adapters into base weights offline vs load adapters dynamically at runtime
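To make the mechanics concrete, here is a minimal PyTorch sketch of the idea. This is illustrative only; in practice we use the PEFT library rather than hand-rolled layers, and `LoRALinear` and its init values are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative).

    Effective weight: W + (alpha / r) * (B @ A), matching the dW ~ A*B idea
    above up to transposition conventions. Only A and B receive gradients.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep original W frozen
        d, k = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # down-projection (r x d)
        self.B = nn.Parameter(torch.zeros(k, r))         # up-projection, zero-init so dW starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```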
Why this matters here: We can steer specific agent behaviors (how memories are written/queried) without retraining the whole model, keeping deployment lightweight on local hardware.
## When to Use LoRA vs Alternatives

Use LoRA when you need the model to:

- Consistently emit strict schemas (JSON) for memory writes
- Reformulate queries and synthesize retrieved context reliably
- Reduce prompt dependence for specialized tasks you repeat often

Prefer Prompting/RAG alone when:

- Behavior is already acceptable with few-shot prompting
- You need maximum flexibility across many tasks without model variants

Prefer Full Finetuning when:

- You control ample compute and need broad behavior changes across many tasks (out of scope for our hardware/time budget)

Prefer Small Tooling Changes when:

- Issues are primarily in retrieval quality (embedding model choice, index hygiene, thresholds) rather than LLM behavior

Rule of thumb for ShadowHound:

- Start with RAG + strong prompting templates
- If memory JSON validity < 98% or retrieval success@k plateaus → add LoRA
Pros and Cons (quick)¶
Pros: - Lightweight to train and serve (few additional MB) - Task-specific improvements without altering base model - Reversible and swappable per request (if server supports)
Cons: - More artifacts to manage (which adapter for which task) - Runtime support varies (dynamic adapters not always available) - Risk of overfitting to narrow data if not curated well
## Where LoRA Helps Most
- Memory Writing Adapter: produce compact, schema-consistent memory records from observations (↓ tokens, ↑ retrievability)
- Memory Recall Adapter: reformulate queries and synthesize retrieved memories into grounded answers (↑ hit rate, ↑ factuality)
- Skills/Schema Grounding Adapter: reduce tool-call and JSON schema drift (↓ execution errors)
## When to Apply
- After baseline memory scaffolding (backends, persistence, spatial tags, RAG thresholds) is stable
- Start with Memory Writing, then Recall; defer the Skills/Schema adapter unless tool-call errors become a bottleneck
## Decision Guide (ShadowHound)

1) Backend/Serving
   - Cloud (OpenAI/Anthropic): no custom adapters → use prompts/few-shot
   - Local (vLLM/other): adapters OK → check dynamic LoRA support; else prepare merged variants
2) Pain Point
   - Messy memory JSON or long/wordy entries → Memory Writing adapter
   - Weak recall (missed context, poor synthesis) → Recall adapter
   - Tool-call schema deviations → Skills/Schema adapter (optional)
3) Resources
   - GPU VRAM ≥ 12–16 GB recommended for 7B with adapters
   - If constrained, choose a smaller base (Mistral 7B over 13B+) and keep r small (8–16)
4) Deployment
   - Dynamic adapter loading available? Use per-request adapter routing
   - Else: serve merged-weight variants per adapter as distinct model IDs
## Data & Curation
- Use internal logs and synthetic scenes
- Memory Writing pairs: (observation+pose+mission_ctx) → normalized JSON memory
- Recall pairs: (question + top-k memory set) → grounded answer with citations
- Quality gates: JSON validity, retrieval success@k, grounded accuracy, token length per memory
## Minimal Data Format

- JSONL rows for supervised finetuning (one object per line; pretty-printed below for readability)
- Memory Writing example:

```json
{"input": "Saw a red ball on table at (x=1.2,y=3.5)",
 "output": {"event": "object_seen",
            "text": "Red ball on table.",
            "pose": {"x": 1.2, "y": 3.5, "yaw": 1.57},
            "tags": ["object:red_ball", "room:living_room"],
            "mission_id": "m-2025-10-15-1",
            "timestamp": 1697370000.1}}
```

- Recall example (with retrieved docs snippet):

```json
{"input": {"question": "Where did we see the red ball?",
           "docs": ["Red ball on table. pose=(1.2,3.5)"]},
 "output": "You saw the red ball on the table near x=1.2, y=3.5."}
```
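A minimal loading sketch for these rows, assuming the layout above; `load_sft_rows` and `_as_text` are illustrative helpers, not existing project code:

```python
import json

def _as_text(value):
    # Structured fields (dicts/lists) are serialized to compact JSON strings
    # so both sides of the pair are plain text for the tokenizer.
    return value if isinstance(value, str) else json.dumps(value, separators=(",", ":"))

def load_sft_rows(path: str) -> list[dict]:
    """Read one supervised pair per JSONL line into text input/output fields."""
    rows = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            rows.append({"input": _as_text(row["input"]), "output": _as_text(row["output"])})
    return rows
```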
## Training (Local Models)

- Framework: PEFT (LoRA) with HF Transformers
- Base models: Mistral-7B, Llama 3.x, Qwen 2.x (choose by GPU budget)
- Hyperparams (starting): r=8–16, alpha=16–32, dropout=0.05, lr=1e-4, effective batch 64 (via gradient accumulation), epochs=1–3
- Eval set: held-out missions with gold memories and QA
## Minimal Training Recipe (pseudo)

1) Prepare JSONL train/val
2) Tokenize input/output with chat template or instruct format
3) Freeze base weights, enable LoRA on target modules (attention projections)
4) Train 1–3 epochs with early stopping by eval loss/metrics
5) Save adapter weights (.safetensors) and card metadata
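A condensed PEFT sketch of steps 3–5. The model ID, paths, and chosen rank are assumptions, and the actual training loop (e.g. trl's SFTTrainer over the JSONL pairs) is omitted:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-Instruct-v0.2"   # assumed base; pick per GPU budget
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")

lora_cfg = LoraConfig(
    r=16,                                        # starting rank from the hyperparams above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)          # freezes base weights, injects adapters
model.print_trainable_parameters()               # sanity check: well under 1% trainable

# ... run the SFT loop (e.g. trl's SFTTrainer) on the JSONL pairs here ...

model.save_pretrained("adapters/memory-writing") # writes adapter .safetensors + config
```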
## Deployment Patterns
- vLLM with adapter support (if available): route per-request to adapter ("memory-writing", "recall")
- If not supported: serve merged-weights variants as separate endpoints (model-id-suffix)
- Cloud models: no custom adapters → use prompts/few-shot; adapters apply only to local models
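Where dynamic LoRA is not supported, merged variants can be produced offline with PEFT's merge_and_unload. A sketch, with illustrative model IDs and paths:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-Instruct-v0.2"   # illustrative IDs/paths
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")

# Load the trained adapter on top of the base, fold the low-rank update into
# the original weights, and save the result as a standalone model variant.
merged = PeftModel.from_pretrained(base, "adapters/memory-writing").merge_and_unload()
merged.save_pretrained("models/mistral7b-memory-writer")
AutoTokenizer.from_pretrained(base_id).save_pretrained("models/mistral7b-memory-writer")
```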
## Inference Routing
- Adapter selection by task:
- memory-writing: apply writing adapter
- recall: apply recall adapter
- none: base model
- If dynamic loading unsupported: call specific model endpoint (e.g., mistral7b-memory-writer)
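A routing sketch against an OpenAI-compatible local server (e.g. vLLM). With dynamic LoRA, the adapter is selected via the model name registered at server start; with merged variants, each variant is simply its own model ID. All names here are assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

ADAPTER_BY_TASK = {
    "memory-writing": "memory-writing",   # registered adapter or merged model ID
    "recall": "recall",
    "none": "mistral-7b-base",
}

def run_task(task: str, prompt: str) -> str:
    """Route a request to the right adapter/variant by model name."""
    resp = client.chat.completions.create(
        model=ADAPTER_BY_TASK.get(task, ADAPTER_BY_TASK["none"]),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```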
## Integration Points
- Agent: add adapter selection by task (writing vs recall)
- MemoryManager: enforce schema when adapter absent (validator), log JSON validity
- Config: MEMORY_BACKEND (openai|local|skip), ADAPTER_MODE (none|writing|recall)
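A sketch of the MemoryManager-side validator. The schema here is illustrative (field names follow the Minimal Data Format example); the real normalized schema is defined as part of "Try Next" below:

```python
from jsonschema import Draft7Validator

MEMORY_SCHEMA = {
    "type": "object",
    "required": ["event", "text", "pose", "tags", "mission_id", "timestamp"],
    "properties": {
        "event": {"type": "string"},
        "text": {"type": "string", "maxLength": 280},
        "pose": {
            "type": "object",
            "required": ["x", "y", "yaw"],
            "properties": {k: {"type": "number"} for k in ("x", "y", "yaw")},
        },
        "tags": {"type": "array", "items": {"type": "string"}},
        "mission_id": {"type": "string"},
        "timestamp": {"type": "number"},
    },
}

_validator = Draft7Validator(MEMORY_SCHEMA)

def validate_memory(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record is valid."""
    return [error.message for error in _validator.iter_errors(record)]
```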
## Metrics
- JSON validity rate (%) for written memories
- Retrieval success@k and similarity
- RAG token budget usage and cap compliance
- Answer groundedness (manual rubric or weak-supervision checks)
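Sketches for the first two metrics, reusing the hypothetical validate_memory helper from Integration Points:

```python
def json_validity_rate(records: list[dict]) -> float:
    """Share of written memories passing the schema validator sketched above."""
    if not records:
        return 0.0
    return sum(1 for r in records if not validate_memory(r)) / len(records)

def success_at_k(retrieved_ids: list[list[str]], gold_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold memory ID appears in the top-k results."""
    if not gold_ids:
        return 0.0
    hits = sum(gold in ids[:k] for ids, gold in zip(retrieved_ids, gold_ids))
    return hits / len(gold_ids)
```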
## Risks & Mitigations
- Adapter serving limitations: use merged models if dynamic LoRA not supported
- Data leakage: isolate mission collections; anonymize where needed
- Overfitting to synthetic: blend real and synthetic, use paraphrasing
## Phased Plan

1) Baseline (no LoRA): implement MemoryManager, backends, persistence, spatial tags, tests
2) Memory Writing Adapter (local): train/eval, deploy; measure JSON validity↑, tokens↓
3) Recall Adapter (local): train/eval; measure success@k↑, groundedness↑
4) Optional: Skills/Schema Adapter; prioritize only if tool-call errors persist
## Try Next
- Define the normalized memory schema and validator
- Collect 200 synthetic memory-writing pairs
- Stand up a small local embedding model (bge-small/e5-small) and run baseline metrics
- Prototype adapter training on a 1k-sample subset; compare metrics vs baseline
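A minimal baseline-retrieval check with a small local embedding model, as suggested above. bge-small is one real option; the texts and use of a plain dot product are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

memories = ["Red ball on table. pose=(1.2,3.5)", "Charging dock in hallway."]
query = "Where did we see the red ball?"

# Normalized vectors make the dot product a cosine similarity.
mem_vecs = model.encode(memories, normalize_embeddings=True)
q_vec = model.encode([query], normalize_embeddings=True)[0]

scores = mem_vecs @ q_vec
best = int(np.argmax(scores))
print(f"top hit: {memories[best]!r} (similarity={scores[best]:.3f})")
```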
## System Architecture (High-Level)

### Context in ShadowHound Stack
- Application (launch/config) → Agent (DIMOS) → Skills → Robot
- Persistent Intelligence spans Agent + Memory layer; LoRA enhances Agent behavior
- Memory backends (local/cloud) are pluggable, unaffected by LoRA choice
### Serving Topologies

1) Cloud-first
   - LLM: Cloud
   - Embeddings: Cloud or Local
   - LoRA: not applicable (use prompting)
   - Pros: simplicity, low ops; Cons: latency, cost, no adapters
2) Local-first with Dynamic Adapters
   - LLM: local server supports per-request adapter loading
   - Embeddings: Local
   - LoRA: apply per task (writing vs recall)
   - Pros: flexible, single base model; Cons: requires server support
3) Local-first with Merged Variants
   - LLM: multiple served variants (base, writer, recall)
   - Embeddings: Local
   - Pros: simple serving; Cons: more endpoints to manage
### Agent Integration (Conceptual)

1) Plan → Observe → Write Memory
   - Agent routes “write” prompts through the writing adapter (if enabled)
2) Query → Retrieve → Answer
   - Agent routes “recall” prompts through the recall adapter (if enabled)
3) Tool Calling
   - Optional schema-grounding adapter to reduce drift; validators remain the primary guardrail
### Data & Observability
- Memory records carry mission_id, timestamp, pose, tags
- Metrics: JSON validity, retrieval success@k, token usage, adapter usage rates
- Traces: request → retrieved docs → final answer (with citations)
### Lifecycle & Safety
- Modes: cloud | local | skip-memory
- Fallbacks: if adapter unavailable, use base model + strict prompting
- Retention: time-based purge + quotas; mission-scoped collections
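A sketch of the fallback path, reusing the hypothetical run_task router from Inference Routing; the strict prompt text is illustrative:

```python
def answer_with_fallback(task: str, prompt: str) -> str:
    """Try the adapter route; on failure, fall back to base model + strict prompting."""
    try:
        return run_task(task, prompt)
    except Exception:
        strict = "Respond ONLY with JSON matching the memory schema.\n\n" + prompt
        return run_task("none", strict)
```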
### Risks (Architectural)
- Adapter sprawl (too many variants) → use dynamic adapters or strict naming/versioning
- Inconsistent behavior across backends → define capability matrix and fallbacks
- Data governance → ensure anonymization where needed, clear retention policy
## FAQ

Q: Do we need LoRA if we have RAG?
A: RAG retrieves facts; LoRA improves how the model writes/reads those facts. They’re complementary. Start with RAG; add LoRA when behavior needs to be consistent and compact.

Q: Will adapters help tool-calling reliability?
A: A schema-grounding adapter can reduce deviations, but start with strict prompts and validators. Use an adapter only if drift remains a top error source.

Q: Which layers to target?
A: Start with attention projections (q_proj, v_proj, o_proj). Add MLP only if gains stall.

Q: How big should rank r be?
A: Start with r=8–16 for 7B models. Increase gradually if underfitting; watch VRAM.

Q: Can we use LoRA with cloud models?
A: Generally no; use prompting for cloud models. LoRA applies to local models you serve.
## Glossary
- PEFT: Parameter-Efficient Fine-Tuning
- LoRA: Low-Rank Adaptation
- Adapter: Trainable module injected into a frozen base model
- Dynamic Adapters: Load/unload adapters at runtime without merging
- RAG: Retrieval-Augmented Generation (memory retrieval → prompt)