Technical Brief
Not a chatbot, not a RAG app.
AKLUS is a memory-first system. The LLM is the muscle; the memory layer is the brain. Every conversation deepens a structured model of the user — their goals, patterns, and reflections — across years, not turns.
Most AI assistants have no memory between sessions. AKLUS is the opposite: the memory layer is the product. The LLM is just the generation surface — smart, but stateless by itself. AKLUS makes it stateful.
The system is designed around three principles:
- Memory compounds. Each conversation adds structured knowledge about the user. Episodic events abstract into semantic truths over time. Patterns surface. Reflections form.
- It works on our hardware. Local-first by design. The RTX 5090 runs LLM, TTS, and embeddings inference. No cloud dependency for the core intelligence loop.
- It thinks on its own overnight. A nightly background job reviews the day's episodes, generates reflections, consolidates near-duplicate memories, and surfaces a weekly insight digest. The user wakes up to a system that has done its own thinking.
The result is an AI companion that knows you — not from your current session, but from the accumulated record of your conversations, goals, and patterns over time. This is not a chatbot with context window tricks. It's a real memory system.
What it is not
- Not a RAG app (there is no static document corpus to query)
- Not a chatbot (it has persistent state that compounds)
- Not an agent (it doesn't act autonomously on your behalf — not yet)
- Not a journal app (it extracts structure from natural conversation)
- Not a therapy tool (it avoids clinically-toned or presumptuous statements by design)
The differentiator
Most memory-augmented systems bolt memory onto an LLM as a feature. AKLUS is architected memory-first: the schema, the lifecycle, the retrieval logic, the reflection loop — these are the product. The LLM is swappable. The memory layer is not.
Local-first by default. Existing hardware in-house: Hostinger VPS + a workstation with RTX 5090 (16 GB VRAM) and 64 GB RAM. The GPU box runs local LLM, TTS, and embeddings inference. Cloud APIs are kept as a small fallback budget for quality-critical paths.
Full layer breakdown
| Layer | Local-first (primary) | Cloud alternative | Trade-off |
|---|---|---|---|
| Backend | Python | (same) | No diff. |
| Frontend | React + Next.js | (same) | No diff. |
| LLM | Qwen 2.5 14B (4-bit quant) or Llama 3.1 8B via Ollama / vLLM on the RTX 5090 | Claude Sonnet 4.5 (reflections), GPT-4o-mini (orchestration) | Local: zero per-token cost + full data privacy. 16 GB VRAM caps us at ~14B quantized; weaker than Claude on nuanced reflection (~70–85% of quality). |
| Structured memory DB | PostgreSQL on the Hostinger VPS | RDS / Supabase | Local: free, but manual backups and WAL config. Already in-house. |
| Vector search | pgvector on the same Postgres | Pinecone / Qdrant Cloud | No separate service. Sufficient up to millions of vectors. |
| Orchestration | LangGraph (OSS library) | (same) | Library runs anywhere. No diff. |
| Embeddings | bge-large-en-v1.5 or nomic-embed-text via Sentence Transformers on the GPU box | Voyage AI (voyage-4) | Local: free. ~5–10% lower retrieval quality on MTEB benchmarks; uses ~2 GB VRAM. |
| Memory framework | Custom build on Postgres + pgvector. Self-host mem0 OSS or Letta OSS as starting template. | mem0.ai cloud (~€17/mo), letta.com cloud (~€18/mo) | Memory IS the product — we don't want to outsource it. Both OSS versions run on the VPS. |
| Background jobs | Celery + Redis on the VPS | (same) | Already local. |
| Observability | Langfuse self-hosted on the VPS | LangSmith ($39/seat/mo) | Langfuse OSS is feature-complete. Saves ~€72/mo; eats ~30 min/week of upkeep. |
| Voice (TTS) | Coqui XTTS v2 or F5-TTS on the RTX 5090 | ElevenLabs Pro (€91/mo) | Local: ~94% of ElevenLabs quality on naturalness; weaker on emotional consistency at scale. |
| Avatar | Custom Unreal Engine + MetaHuman + Audio2Face. | ElevenLabs LiveAvatar / Hedra (~€0.09–€0.90 / min) | Three delivery paths — see callout below. |
| Hosting | Hostinger VPS + RTX 5090 workstation. Cloudflare Tunnel for routing. | AWS Fargate + RDS + ElastiCache (~€118/mo) | Local: hardware already paid. Single-region, single-box, manual ops. |
| Auth | Authentik or SuperTokens self-hosted | Clerk / Auth0 (~€23/mo) | Saves €23/mo; adds Docker stack to maintain. |
| Object storage | MinIO self-hosted on the VPS | S3 + CloudFront (~€14/mo) | Saves €14/mo; lose CDN edge speed for voice/avatar assets. |
| Error tracking | GlitchTip self-hosted (Sentry-compatible) | Sentry cloud (~€24/mo) | Saves €24/mo; adds Docker containers to maintain. |
| Transactional email | (cannot go local cleanly) | AWS SES or Postmark (~€2–€14/mo) | Self-hosting SMTP is a deliverability nightmare. Use a provider. |
| DNS + CDN | Cloudflare (free) | (same) | Free tier is fine. |
| Path | Who renders | Customer needs | AKLUS cost | Trade-off |
|---|---|---|---|---|
| A. Customer-GPU Unreal | Customer's machine | Dedicated GPU (RTX 3060+ / M1 Pro+), 16 GB RAM | €0 / min | Best quality + zero ongoing cost, but cuts off low-spec users. |
| B. Cloud-GPU Unreal (streamed) | Our cloud GPU, streamed as video | Just internet + video decode | ~€0.01 / min per session | Accessible to anyone, but scales linearly with concurrent users. ~€450/mo at 50 users × 30 min/day. |
| C. ElevenLabs LiveAvatar / HeyGen | Their pooled GPUs | Just internet + video decode | ~€0.09/min (LiveAvatar) to €0.90/min (HeyGen) | No infra needed, locked to preset avatars. ~€4,000/mo at same usage on LiveAvatar. |
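The monthly figures in the table follow from per-minute arithmetic. A minimal sketch, assuming a 30-day month; the function name and the 30-day default are illustrative and not part of the brief:

```python
def monthly_avatar_cost(users: int, minutes_per_day: float,
                        eur_per_minute: float, days: int = 30) -> float:
    """Estimated monthly cost in euros for a path billed per streamed minute."""
    total_minutes = users * minutes_per_day * days
    return total_minutes * eur_per_minute


# Beta-scale usage from the table: 50 users x 30 min/day.
path_b = monthly_avatar_cost(50, 30, 0.01)  # Path B, cloud-GPU Unreal
path_c = monthly_avatar_cost(50, 30, 0.09)  # Path C, LiveAvatar rate
print(f"Path B: ~EUR {path_b:,.0f}/mo | Path C: ~EUR {path_c:,.0f}/mo")
```

At 45,000 streamed minutes per month this gives about €450 for Path B and about €4,050 for Path C, in line with the rounded figures quoted in the table.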
Recommended for premium-persona MVP: Path A (customer-GPU Unreal). Target users have the hardware, per-session cost stays at zero. Keep Path C wired in as fallback.
- ElevenLabs-tier voice consistency — local TTS hits ~94% on naturalness but loses on long-form emotional consistency. If voice is core to the product, keep a small ElevenLabs plan as fallback.
Five memory types, separated by lifecycle, retrieval shape, and update rules. This is the schema that makes the system actually know someone over time.
| Type | Purpose | Example |
|---|---|---|
| Episodic | Specific events and conversations | "User felt burned out after client meeting." |
| Semantic | Abstracted truths from many episodes | "User dislikes micromanagement." |
| Procedural | How the user works | "User performs best in short focused bursts." |
| Goal | Long-term objectives | "Reach passive income through apps." |
| Reflection | AI-generated observations | "User overthinks before publishing." |
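One way the five types could be represented in application code is sketched here. The field names follow the columns listed for the memories table in the week 2 tasks; the class names, defaults, and validation are assumptions for illustration only:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Optional


class MemoryType(str, Enum):
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"
    GOAL = "goal"
    REFLECTION = "reflection"


@dataclass
class Memory:
    type: MemoryType
    content: str
    importance: int = 3                      # assumed default on the 1-5 scale
    embedding: Optional[list[float]] = None  # filled in later by the embedding service
    source_message_id: Optional[int] = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.type = MemoryType(self.type)    # accept plain strings such as "goal"
        if not 1 <= self.importance <= 5:
            raise ValueError("importance must be between 1 and 5")


example = Memory(type="semantic", content="User dislikes micromanagement.", importance=4)
```

Keeping the type as an enum makes it harder for a mistyped label from the extraction prompt to reach the database silently.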
Conversation turn — the intelligence loop
Every message runs through a LangGraph graph. The system retrieves relevant memories before generating, then extracts new memories from the response and writes them back. Memory compounds with every turn.
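The shape of one turn can be sketched with plain functions standing in for LangGraph nodes. Everything here is hypothetical: the in-memory list replaces the Postgres store, word overlap replaces vector search, and `fake_generate` / `fake_extract` replace the LLM calls.

```python
from typing import Callable

MemoryStore = list[str]  # stand-in for the Postgres-backed memory store


def retrieve(store: MemoryStore, message: str, k: int = 3) -> list[str]:
    """Toy retrieval: rank stored memories by word overlap with the message."""
    words = set(message.lower().split())
    ranked = sorted(store, key=lambda m: len(words & set(m.lower().split())), reverse=True)
    return ranked[:k]


def run_turn(store: MemoryStore, message: str,
             generate: Callable[[str, str], str],
             extract: Callable[[str, str], list[str]]) -> str:
    memories = retrieve(store, message)                       # 1. retrieve
    context = "Known about the user:\n" + "\n".join(f"- {m}" for m in memories)
    reply = generate(context, message)                        # 2. generate
    store.extend(extract(message, reply))                     # 3. extract, 4. store
    return reply


def fake_generate(context: str, message: str) -> str:
    count = len(context.splitlines()) - 1  # lines after the header are memories
    return f"[reply to {message!r} using {count} memories]"


def fake_extract(message: str, reply: str) -> list[str]:
    return [f"User said: {message}"]
```

Calling `run_turn` twice on the same store shows the compounding effect: the second turn retrieves what the first turn wrote.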
Background jobs (nightly)
- Reflection generation — reviews recent episodic memories, detects patterns, writes reflection memories
- Memory consolidation — promotes repeated episodic signals into semantic memories
- Deduplication — merges near-duplicate memories by cosine threshold
- Embedding generation for any un-embedded entries
- Weekly insight summary — a short digest surfaced to the user
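The deduplication step can be illustrated with a greedy pass over embeddings. This is a simplified sketch: it drops later near-duplicates rather than merging their content, and the 0.95 threshold is an assumed starting value that would need tuning against real data.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(memories: list[tuple[str, list[float]]],
           threshold: float = 0.95) -> list[tuple[str, list[float]]]:
    """Keep a memory only if it is not too similar to one already kept."""
    kept: list[tuple[str, list[float]]] = []
    for text, vec in memories:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept
```

The pass is quadratic in the number of memories, which is likely acceptable for a nightly job over one user's recent entries; a larger store would push the comparison into a pgvector nearest-neighbour query instead.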
Retrieval strategy: hybrid, 6 signals
Every retrieval blends six weighted signals. The composite score determines which memories surface for the current turn. Everything is logged so the system is debuggable.
Recency (from created_at): recent memories score higher unless overridden by importance.
A clear-eyed view of what the current state of LLMs and memory systems can actually deliver versus what remains research-grade.
Possible now (MVP)
- Persistent memory across sessions
- Long-term user profiles
- Reflection generation
- Goal tracking
- Pattern detection
- Weekly insights
- Strategic questioning
- Cross-session continuity
- Adaptive tone
- Context-aware coaching
Hard / research-grade
- Deep emotional understanding
- Robust forgetting
- Hallucination-free psychology
- Lifelong identity modeling
- Truly autonomous reasoning
Benchmarks still show weakness in long-horizon memory, memory updates, stale memory removal, and temporal reasoning. We build around these limits, not through them.
In the MVP vs deferred to v2
| In the MVP (8 weeks) | Deferred to v2 |
|---|---|
|
|
AI Systems Engineer / Architect
For MVP, Davide or Ash can handle this. For production, we may need a specialist in this domain later.
Owns: memory abstraction, behavioral modeling, reflection quality, evaluation frameworks, cognitive architectures, memory pipelines, orchestration, retrieval systems, agent workflows.
Profile: applied AI infrastructure engineer, not a pure ML scientist.
Role cards
- Memory abstraction · behavioral modeling · reflection quality
- Evaluation frameworks · cognitive architectures · memory pipelines
- Orchestration · retrieval · agent workflows
- Voice + avatar pipeline (memory context → ElevenLabs → phoneme stream)


- Personality, tone, conversational rhythm · how the system "thinks" and responds
- Reflection prompts · memory framing · emotional calibration
- Behavioral guardrails · psychological surface · what the AI should and shouldn't do

- APIs · databases · scaling · auth · queues · realtime systems · data pipelines


- Conversation UI · timeline UI · memory visualization · reflection interfaces
- Frictionless journaling · emotional tone balance
- Avatar assistant UI (phoneme/viseme animation, lip-sync, idle states)


- Avatar character design (look, expressions, idle behavior, on-brand presence)



- Test each feature · memory consistency · reflection quality · regressions

- Communication · task management · deadlines · reviews and tests

Ownership summary
| Area | Owners |
|---|---|
| AI / memory architecture | New hire (lead) · Davide · Ash |
| AI behavior / cognitive design | Ace · Alessandra (with Ash) |
| Local LLM + TTS + embeddings inference | New hire (lead) · Davide |
| Backend | Davide · Ash |
| Frontend / Product engineering | Ash · Gray |
| Voice + avatar pipeline (backend) | New hire (lead) · Davide |
| Avatar (Unreal + MetaHuman + Audio2Face) | Contractor / Unreal specialist · Ash · Gray (integration) |
| Avatar character design | Ace · Gray · Ash |
| Self-hosted ops (Langfuse, GlitchTip, Authentik, MinIO, backups) | Davide · Ash |
| Product design | Ash · Gray · Ace |
| QA | Erica · Gray · Elena |
| Team lead | Gray |
Two scenarios: local-first leans on in-house hardware and self-hosted services; cloud-managed leans on AWS + paid APIs. Beta-scale: ~50 active users, moderate LLM traffic, nightly reflection jobs, voice + avatar on conversation surfaces.
Excludes sunk costs already in-house (Hostinger VPS, RTX 5090 workstation, Cloudflare, AWS SES, domains, dev tooling, design tools). These are not counted as incremental AKLUS cost.
Scenario summary
| Scenario | Monthly | 3-month total |
|---|---|---|
| Local-first — existing hardware, self-hosted services, no paid APIs | €0 net new | €0 net new |
| Cloud-managed — AWS + Claude + ElevenLabs + LangSmith + Clerk | ~€757 | ~€2,271 |
| Hybrid (recommended) — local primary, cloud only for quality-critical paths | ~€225–€380 | ~€680–€1,140 |
Hardware caps: 16 GB VRAM limits us to Qwen 2.5 14B (4-bit) or Llama 3.1 8B. TTS, embeddings, and LLM share the GPU — concurrency is the binding constraint. Single point of failure: one workstation.
Cloud-managed breakdown (alternative)
| Category | Monthly | 3-month total |
|---|---|---|
| AI / LLM API (Claude + GPT-4o-mini + Voyage) | €212 | €636 |
| Voice + avatar (ElevenLabs Pro + LiveAvatar) | €165 | €495 |
| AWS infrastructure (Fargate + RDS + ElastiCache + S3/CloudFront) | €118 | €354 |
| Observability + reliability (LangSmith + Sentry + uptime) | €113 | €339 |
| Auth (Clerk) | €23 | €69 |
| Subtotal | €631 | €1,893 |
| Contingency buffer (20%) | €126 | €378 |
| Cloud-managed total | ~€757 / mo | ~€2,271 |
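The totals in the breakdown can be re-derived from the line items. A short check using the monthly figures from the table (variable names are illustrative):

```python
# Monthly line items in euros, copied from the cloud-managed breakdown.
line_items = {
    "AI / LLM API": 212,
    "Voice + avatar": 165,
    "AWS infrastructure": 118,
    "Observability + reliability": 113,
    "Auth": 23,
}

subtotal = sum(line_items.values())
contingency = round(subtotal * 0.20)   # 20% buffer, rounded to whole euros
monthly_total = subtotal + contingency
three_month_total = monthly_total * 3

print(subtotal, contingency, monthly_total, three_month_total)
```

The results (631, 126, 757, 2,271) match the subtotal, buffer, and totals stated in the table.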
Optional adds
| Item | Notes | Monthly est. |
|---|---|---|
| mem0.ai cloud (if not self-hosting OSS) | swap-out, decide after spike | €17 |
| letta.com cloud (memory-native alt) | swap-out, decide after spike | €18 |
| Heavier API fallback (Claude on full reflection load) | shifts toward cloud-managed scenario | +€100–€180 |
| Cloud GPU rental during traffic spikes (RunPod / Vast.ai) | RTX A6000 / A100 hourly | €0.50–€2 / hour |
Cost levers
- Hybrid routing: local Qwen handles 80–90% of turns, Claude reserved for memory consolidation and nightly reflections that matter. Keeps quality where it counts, cost where it doesn't.
- Self-host mem0 OSS or Letta OSS on the VPS: zero subscription cost; ~1 day of setup.
- Self-host Langfuse instead of LangSmith: saves €72 / month.
- Defer Unreal Engine avatar; ship three.js + MetaHuman blendshapes for v1: cuts ~€170/month and weeks of engineering.
- Voyage embeddings: first 200M tokens free — effectively €0 for the full MVP.
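The hybrid routing lever reduces to a small decision function. A sketch under stated assumptions: the task names, the set of cloud-eligible tasks, and the returned model labels are placeholders chosen for illustration, not real API model identifiers.

```python
from enum import Enum


class Task(Enum):
    CHAT_TURN = "chat_turn"
    MEMORY_EXTRACTION = "memory_extraction"
    CONSOLIDATION = "consolidation"
    NIGHTLY_REFLECTION = "nightly_reflection"


# Assumed policy: only the quality-critical background tasks may use the cloud model.
CLOUD_ELIGIBLE = {Task.CONSOLIDATION, Task.NIGHTLY_REFLECTION}


def choose_model(task: Task, cloud_fallback_enabled: bool) -> str:
    """Return a placeholder label for the model that should serve this task."""
    if cloud_fallback_enabled and task in CLOUD_ELIGIBLE:
        return "cloud-claude"
    return "local-qwen"
```

Keeping the policy in one function makes the cost lever auditable: logging its return value per call shows what share of traffic actually stays local.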
Text + voice (TTS and STT), no avatar in this phase. The goal is a working brain: persistent memory, smart retrieval, and overnight reflections, used daily on our own hardware. Planned at 8 weeks with a 1–2 week buffer, so realistically expect 8–10 weeks.
Roadmap
[Roadmap chart. Legend: primary focus, supporting work, buffer, milestone.]
Day-1 tech locks (decide once, do not revisit)
| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI (Python), async, SSE streaming | Fast to write, great for LLM streaming, one language end to end. |
| Frontend | Next.js + React, Tailwind | Web first. Wrap in Tauri later if a desktop app is needed. |
| LLM | Ollama — Qwen 2.5 14B (4-bit) on RTX 5090 | Fastest path to a local API. Swap to vLLM only if throughput hurts. |
| Fallback LLM | Claude API, behind a feature flag | For reflections that need more nuance. Keep it optional. |
| Embeddings | bge-large-en-v1.5 via sentence-transformers on GPU | Free, local, strong. ~2 GB VRAM alongside the LLM. |
| DB | Postgres 16 + pgvector on Hostinger VPS | One store for structured memory and vectors. Already in-house. |
| Orchestration | LangGraph | The conversation graph (retrieve, generate, extract, store) lives here. |
| TTS | Coqui XTTS v2 on RTX 5090 (ElevenLabs as fallback) | Local first and free. Chunk LLM stream into sentences, synthesize as it generates. |
| STT | faster-whisper on RTX 5090 | Local, fast, accurate. Record in browser, transcribe server-side. |
| Scheduled jobs | APScheduler or cron script (not Celery yet) | Nightly reflection is one job. Celery is overkill for MVP. |
| Auth | JWT + bcrypt, single user to start | Do not burn days on auth. Lock it down properly in v2. |
| Tracing | Structured logging to Postgres (every retrieval, prompt, output) | Get the data first. Langfuse can read it later. |
Week by week
Week 1: Foundation and thin vertical slice · End-to-end conversation
Tasks
- Repo, monorepo or two folders, env config, Makefile / scripts
- FastAPI app, health check, settings, Postgres connection
- Postgres + pgvector on the VPS; create users, conversations, messages tables
- Ollama on the RTX 5090 serving Qwen 2.5 14B; expose to the VPS over Cloudflare Tunnel or Tailscale
- Chat endpoint: POST message, stream tokens back (SSE)
- Next.js chat screen: input, streaming bubble, message history from DB
- Persist every user and assistant message
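For the streaming endpoint, the wire format is the part most worth pinning down early. A minimal sketch of Server-Sent Events framing using only the standard library; the JSON payload shape and the final `done` event are assumptions, not a fixed protocol:

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one SSE message: optional event name, data line(s), blank-line terminator."""
    lines = [f"event: {event}"] if event else []
    lines += [f"data: {line}" for line in data.split("\n")]
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Turn an LLM token stream into SSE frames."""
    for token in tokens:
        # JSON-encode so spaces and newlines inside a token survive transport.
        yield sse_event(json.dumps({"token": token}))
    yield sse_event("[DONE]", event="done")  # lets the client close the stream cleanly


frames = list(stream_tokens(["Hi", " there"]))
```

In FastAPI, a generator like `stream_tokens` can be passed to `StreamingResponse` with `media_type="text/event-stream"`; the browser side can read it with `EventSource` or a streaming `fetch`.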
Week 2: Memory write and basic retrieval · It starts to remember
Tasks
- Embedding service: bge-large on the GPU, batch endpoint, cache
- memories table: type, content, embedding (vector), importance, created_at, source_message_id, metadata JSONB
- Memory extraction step: after each turn, a prompt asks the LLM to extract candidate memories (start with episodic + semantic + goal) as structured JSON
- Write extracted memories with embeddings
- pgvector similarity search (cosine, IVFFlat or HNSW index)
- Inject top-k retrieved memories into the system prompt before generation
- Manual test: tell it a fact in session A, start session B, confirm it recalls
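Because the extraction step depends on the model returning structured JSON, a defensive parser is a sensible companion to the prompt. A sketch, assuming the model is asked for a JSON array of objects with `type` and `content` keys; that output shape is an assumption, not a settled contract:

```python
import json

ALLOWED_TYPES = {"episodic", "semantic", "goal"}  # the week 2 starting set


def parse_candidates(raw: str) -> list[dict]:
    """Keep only well-formed memory candidates from the model's raw output."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # model returned non-JSON; store nothing for this turn
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        mem_type, content = item.get("type"), item.get("content")
        if (isinstance(mem_type, str) and mem_type in ALLOWED_TYPES
                and isinstance(content, str) and content.strip()):
            valid.append({"type": mem_type, "content": content.strip()})
    return valid
```

Silently dropping malformed candidates keeps junk out of the store; counting the drops in the retrieval log would show how often the extraction prompt misbehaves.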
Week 3: Memory types, hybrid retrieval, LangGraph · Structured + smart
Tasks
- Finalize the 5 memory types with tagging rules
- Importance scoring at write time (LLM-rated 1–5, plus heuristics)
- Hybrid retrieval scorer: semantic similarity + recency decay + importance (add emotional and goal relevance if time)
- Rewrite the turn as a LangGraph graph: retrieve → assemble context → generate → extract → store
- Dedup on write (skip near-duplicate memories by cosine threshold)
- Log every retrieval: which memories, what scores, why selected
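A first version of the hybrid scorer might blend the three core signals named in the tasks. The weights, the 30-day half-life, and the linear mapping of importance onto 0 to 1 are all assumptions to be tuned; emotional and goal relevance are left out of this sketch.

```python
from datetime import datetime, timedelta, timezone

WEIGHTS = {"similarity": 0.5, "recency": 0.2, "importance": 0.3}  # assumed weights
HALF_LIFE_DAYS = 30.0                                             # assumed decay rate


def recency_score(created_at: datetime, now: datetime) -> float:
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life."""
    age_days = max((now - created_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def composite_score(similarity: float, importance: int,
                    created_at: datetime, now: datetime) -> tuple[float, dict]:
    signals = {
        "similarity": similarity,                   # cosine similarity from vector search
        "recency": recency_score(created_at, now),
        "importance": (importance - 1) / 4.0,       # map the 1-5 rating onto 0-1
    }
    score = sum(WEIGHTS[name] * value for name, value in signals.items())
    return score, signals  # return the parts too, so each retrieval can be logged
```

With these weights, a year-old memory rated 5 outranks a day-old memory rated 1 at equal similarity, which is one way to express recency being overridden by importance.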
Week 4: Reflections and the nightly job · It thinks on its own
Tasks
- Scheduled job (APScheduler) running nightly per user
- Reflection generation: summarize recent episodes, detect patterns, write reflection memories
- Consolidation: promote repeated episodic signals into semantic memories; merge near-duplicates
- Weekly insight summary (a short digest)
- Reflections stored with provenance (which episodes triggered them) — explainability built in
- Optional: route reflection generation to Claude behind the flag for higher quality
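Provenance is easiest to guarantee if the reflection record cannot be created without its source episodes. A sketch with hypothetical names; `summarize` stands in for the LLM call, and the one-day window and two-episode minimum are assumed defaults:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class Episode:
    id: int
    content: str
    created_at: datetime


@dataclass
class Reflection:
    content: str
    source_episode_ids: list[int]  # provenance: which episodes triggered this


def nightly_reflection(episodes: list[Episode],
                       summarize: Callable[[list[str]], str],
                       now: datetime,
                       window: timedelta = timedelta(days=1),
                       min_episodes: int = 2) -> Optional[Reflection]:
    """Build one reflection from recent episodes, or None if there is too little signal."""
    recent = [e for e in episodes if now - e.created_at <= window]
    if len(recent) < min_episodes:
        return None
    text = summarize([e.content for e in recent])
    return Reflection(content=text, source_episode_ids=[e.id for e in recent])
```

Returning `None` on thin days avoids forcing the model to invent a pattern from a single event, which is one source of generic reflections.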
Week 5: Reflection quality and cognitive design · The soul of the product
Tasks
- Define personality and reflection tone with Ace and Alessandra: how it speaks, how blunt it is, when it stays quiet
- Iterate the extraction and reflection prompts against real journaling data
- Build the eval harness early: a golden set of reflection cases scored for accuracy, safety, and usefulness (LLM-as-judge + human review)
- A/B local Qwen vs Claude on reflection quality; lock the routing rule
- Tune importance and emotional-relevance signals based on what surfaces well
- Guardrails: avoid harmful, presumptuous, or clinically-toned statements
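The eval harness can start very small. In this sketch `generate` and `judge` are injected callables (the judge could be an LLM or a human-entered score sheet); the 1 to 5 scale, the case format, and the 3.5 pass mark are assumptions:

```python
from statistics import mean
from typing import Callable

CRITERIA = ("accuracy", "safety", "usefulness")


def run_eval(golden_cases: list[dict],
             generate: Callable[[list[str]], str],
             judge: Callable[[dict, str], dict],
             pass_mark: float = 3.5) -> dict:
    """Score every golden case and report the mean per criterion."""
    if not golden_cases:
        raise ValueError("golden set is empty")
    per_case = []
    for case in golden_cases:
        output = generate(case["episodes"])    # reflection produced for this case
        per_case.append(judge(case, output))   # expected: one 1-5 score per criterion
    summary = {c: mean(scores[c] for scores in per_case) for c in CRITERIA}
    summary["passed"] = all(summary[c] >= pass_mark for c in CRITERIA)
    return summary
```

Running the same golden set through the local model and the cloud model gives the A/B comparison a common yardstick before the routing rule is locked.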
Week 6: Product surfaces and real-time UI · Feels like a product
Tasks
- Memory timeline: browse what it remembers, filter by type, search
- Reflection feed: accept, reject, or edit reflections (rejections feed back into quality)
- Goal tracking view: active goals, progress notes
- In-place updates with loading spinners; restore the previous state on error
- Basic auth + a real login screen
- Empty states and onboarding (first-run journaling prompt)
Week 7: Voice in and out · It speaks and listens
Tasks
- TTS: Coqui XTTS v2 on the RTX 5090 behind a simple synth endpoint (ElevenLabs API as fallback)
- Chunk the LLM token stream into sentences and synthesize each as it arrives — audio starts before the full reply is done
- STT: record audio in the browser, transcribe with faster-whisper server-side, feed the text into the same chat pipeline
- Audio playback UI: play, pause, mute toggle, autoplay setting; push-to-talk or mic button for input
- Pick or clone one voice; consistent with the personality from week 5
- Latency pass: pipeline the stages so first spoken word lands in ~1 second
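The sentence-chunking step can be sketched as a generator that buffers tokens and emits a sentence as soon as one is complete. The splitting rule here is deliberately naive (terminal punctuation followed by whitespace) and is an assumption; abbreviations such as "Mr. Smith" would be split too early.

```python
import re
from typing import Iterable, Iterator

# Terminal punctuation followed by whitespace marks a sentence boundary.
SENTENCE_END = re.compile(r"([.!?]+)(\s+)")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = buffer[:match.end(1)].strip()
            if sentence:
                yield sentence             # hand off to TTS immediately
            buffer = buffer[match.end():]  # keep the remainder for the next sentence
    if buffer.strip():
        yield buffer.strip()               # flush whatever is left at end of stream


chunks = list(sentences_from_tokens(["Hel", "lo the", "re. How", " are you", "? Fine"]))
```

Requiring whitespace after the punctuation keeps decimals such as "3.5" intact, at the cost of holding back the final sentence until the stream ends.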
Week 8: Harden, evaluate, ship · Stable and real
Tasks
- Run the full eval harness; fix regressions in memory recall and reflection quality
- Error handling: LLM, TTS, and STT timeouts, retries, graceful degradation
- Tracing review: confirm you can debug any odd answer from logs
- Performance pass: retrieval latency, prompt size, GPU memory with LLM + embeddings + XTTS + Whisper sharing the card
- Deploy on the VPS, automated nightly DB backup to Cloudflare R2
- Dogfood: use it as your own daily journaling tool, fix what annoys you
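Retries with graceful degradation can share one small helper across the LLM, TTS, and STT calls. A sketch; the exception types caught, the retry count, and the backoff values are assumptions, and `sleep` is injectable so the helper can be tested without waiting:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T],
                       retries: int = 2,
                       backoff_s: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `primary` up to retries + 1 times, then degrade to `fallback`."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except (TimeoutError, ConnectionError):
            if attempt < retries:
                sleep(backoff_s * 2 ** attempt)  # exponential backoff between attempts
    return fallback()
```

For voice, `primary` might wrap the local synth call and `fallback` might return a text-only reply, so a stalled GPU degrades the experience instead of breaking the turn.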
Risks and mitigations
| Risk | How I will handle it |
|---|---|
| Memory extraction quality is poor (junk memories) | Spend real time on the extraction prompt in week 2 and add importance filtering and dedup early. This is the highest-leverage prompt in the system. |
| 16 GB VRAM too tight for LLM + embeddings + XTTS together | Run embeddings on CPU, quantize the LLM harder, or run XTTS on CPU. If it still fights, use ElevenLabs API for voice and keep the GPU for LLM. With one user during MVP, concurrency is not the bottleneck yet. |
| Retrieval returns the wrong memories | Log every retrieval with its scores from day one of week 3. Cannot tune what cannot be seen. |
| Reflections feel generic or wrong | Ground every reflection in specific episodes with provenance. Route to Claude fallback if local model quality lags. |
| Scope creep toward the avatar | The avatar is explicitly v2. Will not touch it during 8 weeks — starting it would put the memory core at risk. |
| Week lost to infra (Ollama, tunnel, pgvector) | Timebox setup to the first two days of week 1. If something fights, fall back to the simplest working option and move on. |