Meet OpenJarvis: A Local-First Framework for On-Device Personal AI Agents with Tools, Memory, and Learning

Researchers at Stanford University and Lambda Labs have published a research paper on OpenJarvis, an open-source framework that runs inference, agents, memory, and learning entirely on-device.

According to the paper, open-weight models configured through OpenJarvis land within 3.2 percentage points of the best cloud model on average, at roughly 800× lower marginal API cost per query and roughly 4× lower latency under the paper’s benchmark protocol. The work builds on the team’s earlier Intelligence Per Watt study, which reported that local models already handle 88.7% of single-turn chat and reasoning queries at interactive latency, with intelligence efficiency improving 5.3× from 2023 to 2025.

Model Overview & Access

OpenJarvis is not a single model. It is a framework that composes any supported model with a configurable agent stack; the paper evaluates it across 11 local models from four families.

Property: Value
License: Apache 2.0
Framework release: March 12, 2026
Paper: arXiv:2605.17172 (posted May 16, 2026)
Repository: github.com/open-jarvis/OpenJarvis
Stars / forks: ~5.4k / ~1.2k (June 2026)
Languages: Python (~83%), Rust (~9%), TypeScript (~7%)
Evaluated models: 11 local models across 4 families (Qwen3.5, Gemma4, Nemotron, Granite)
Cloud baselines: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro
Supported engines: Ollama, vLLM, SGLang, llama.cpp, Apple Foundation Models, Exo (among others)
Context window: Model-dependent
Installation: Single command; ~3 minutes on broadband
Hardware: Tested on 7 platforms, from Mac Mini M4 to NVIDIA DGX Spark

Architecture: Five Primitives and a Spec

OpenJarvis decomposes a personal AI system into five typed primitives, composed through a single declarative configuration object called a spec.

  • Intelligence — the model, weights, generation parameters, and quantization format.
  • Engine — the inference runtime (Ollama, vLLM, SGLang, etc.), batching, KV-cache settings, and hardware path.
  • Agents — the reasoning loop (ReAct or CodeAct), system prompts, tool-use policy, and turn limits.
  • Tools & Memory — external interfaces, retrieval backends, 25+ data connectors, and 32+ messaging channels, with native MCP support and interchangeable memory backends.
  • Learning — the optimizer that updates the spec from traces. This slot accepts LoRA, DSPy, GEPA, or LLM-guided spec search.

Each primitive is independently swappable, and a spec serializes all five into a TOML file. Two specs can share the same agent and tool configuration and differ only in model and engine, so the same behavior runs on a Mac Mini and a workstation without rewriting prompts.

The paper’s second contribution is LLM-guided spec search, a local–cloud collaboration. At search time, a frontier cloud model acts as a teacher: it reads traces, diagnoses failure clusters, and proposes edits across Intelligence, Engine, Agents, and Tools & Memory. An edit is accepted only if it improves the target failure cluster without causing meaningful regressions elsewhere; the research team calls this check the gate (default tolerance 1%). The optimized spec then runs entirely on-device at inference time, with zero cloud calls. Because the teacher is used only at search time, its cost amortizes: at 100 queries per day, the amortized teacher cost falls below $0.001 per query within six months.
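
The gate can be pictured as a simple acceptance test. The sketch below is one plausible reading, not the paper's implementation: it assumes accuracy is tracked per failure cluster as a fraction in [0, 1], and that the 1% tolerance applies to each non-target cluster separately. The function name and the cluster names are hypothetical.

```python
def gate_accepts(
    before: dict[str, float],
    after: dict[str, float],
    target: str,
    tolerance: float = 0.01,
) -> bool:
    """Accept an edit only if the target cluster improves and no other
    cluster regresses by more than `tolerance`.

    Assumption: scores are per-cluster accuracies in [0, 1] and the
    tolerance is applied to each non-target cluster independently.
    """
    if after[target] <= before[target]:
        return False  # the edit did not fix what it was meant to fix
    return all(
        before[name] - after[name] <= tolerance
        for name in before
        if name != target
    )


# Hypothetical per-cluster accuracies before and after two candidate edits.
baseline = {"tool_args": 0.60, "planning": 0.80, "retrieval": 0.75}
good_edit = {"tool_args": 0.70, "planning": 0.795, "retrieval": 0.75}
bad_edit = {"tool_args": 0.70, "planning": 0.75, "retrieval": 0.75}

print(gate_accepts(baseline, good_edit, target="tool_args"))
print(gate_accepts(baseline, bad_edit, target="tool_args"))
```

The first edit passes because the small dip in the planning cluster stays inside the tolerance; the second is rejected because the same gain on the target comes with a five-point regression elsewhere.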

Prior work (GEPA, DSPy, LoRA) optimizes one primitive at a time, and prompt optimizers alone recover only about 5 pp of the cloud–local gap. LLM-guided spec search recovers 13–32 pp because it edits across primitives jointly, at 7–11× lower optimization cost than single-primitive baselines. The four-primitive move space contributes 5.5–16.5 pp, and the LLM proposer adds about 10 pp on average over an evolutionary search at the same move space.

https://arxiv.org/pdf/2605.17172v1

Capabilities & Performance

OpenJarvis was evaluated across 8 benchmarks spanning 508 tasks: tool calling (ToolCall-15), agentic workflows (PinchBench), coding (LiveCodeBench), customer service (τ-Bench V2, τ²-Bench Telecom), general assistance (GAIA), and deep research (LiveResearchBench, DeepResearchBench).

The swap test: Replacing the intended cloud model with Qwen3.5-9B in existing frameworks (OpenClaw, Hermes Agent) drops accuracy by 25–39 pp. With the same model under an OpenJarvis spec, the residual drop shrinks to 5.6–16.5 pp — recovering 56–77% of the portability loss.
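
A rough sketch of how a "recovered" share can be computed from these drops follows. The per-framework pairs are not given in this article, so pairing the endpoints of the two ranges is an assumption; the results land near the reported 56–77% but are not expected to match it exactly. The function name is hypothetical.

```python
def recovered_fraction(drop_before_pp: float, drop_after_pp: float) -> float:
    """Share of the original accuracy drop that the spec wins back."""
    return 1.0 - drop_after_pp / drop_before_pp


# Assumed pairing of range endpoints (not stated in the article).
smallest_drop_case = recovered_fraction(25.0, 5.6)
largest_drop_case = recovered_fraction(39.0, 16.5)

print(f"{smallest_drop_case:.1%} and {largest_drop_case:.1%} recovered")
```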

The accuracy frontier: The best single local model, Qwen3.5-122B, reaches 80.3% average accuracy versus Claude Opus 4.6 at 83.5% — a 3.2 pp gap. Local specs match or exceed cloud on 4 of 8 benchmarks: ToolCall-15, PinchBench, LiveCodeBench, and τ-Bench V2.

Cost and latency: Local configurations form the accuracy–efficiency frontier. Qwen3.5-122B delivers its 80.3% at roughly a thousandth of a cent per query, versus $0.009 per query for Claude Opus 4.6 — an approximately 800× marginal API-cost advantage. End-to-end latency drops by roughly 4× on the agentic workloads, though the paper notes single-shot prompts can favor cloud serving.
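
The cost figures can be sanity-checked with a few lines of arithmetic. This sketch uses only the rounded numbers quoted in this article; the 182-day length for "six months" is an assumption, and the teacher budget it derives is an upper bound implied by those numbers, not a figure from the paper.

```python
CLOUD_COST_USD = 0.009   # Claude Opus 4.6 per query, as quoted above
REPORTED_RATIO = 800     # approximate marginal-cost advantage, as quoted

# Local per-query cost implied by the ~800x ratio.
implied_local_usd = CLOUD_COST_USD / REPORTED_RATIO
implied_local_cents = implied_local_usd * 100  # about a thousandth of a cent

# Amortized teacher cost: the article says it falls below $0.001 per query
# within six months at 100 queries per day.
QUERIES_PER_DAY = 100
DAYS_IN_SIX_MONTHS = 182          # assumption
TARGET_USD_PER_QUERY = 0.001

queries_served = QUERIES_PER_DAY * DAYS_IN_SIX_MONTHS
max_teacher_spend_usd = TARGET_USD_PER_QUERY * queries_served

print(f"implied local cost: {implied_local_cents:.6f} cents per query")
print(f"teacher spend bound: ${max_teacher_spend_usd:.2f} over {queries_served} queries")
```

Under these assumptions, the one-time search cost would need to stay under roughly $18 for the amortized figure to hold.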

Search gains: LLM-guided spec search improves the Qwen3.5-9B student to 100% on PinchBench, 83% on LiveCodeBench, and 91% on LiveResearchBench. Across the full eight-benchmark suite, average gains per student model range from 13.1 to 31.5 pp. The authors report that these gains survive their robustness checks (reward-weight variants, search-seed variance, and random restarts).

How to Use It

Installation is one command. On macOS, Linux, or WSL2:

curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash

Windows users run an equivalent PowerShell script (irm … | iex). The installer provisions uv, a Python virtual environment, Ollama, and a starter model in about three minutes on broadband. A desktop GUI ships as a .dmg, .exe, .deb, .rpm, or .AppImage from the releases page.

After installation, running jarvis starts a chat session. Starter presets cover common workflows:

jarvis init --preset morning-digest-mac    # daily briefing with TTS
jarvis init --preset deep-research         # multi-hop research with citations
jarvis init --preset code-assistant        # agent with code execution and shell access
jarvis init --preset scheduled-monitor     # stateful agent on a schedule

The framework ships with eight built-in agents across three execution modes — on-demand, scheduled, and continuous. It connects to 25+ data sources (Gmail, Calendar, iMessage, Notion, Obsidian, Slack, GitHub, and others) and exposes agents over 32+ messaging channels (WhatsApp, Telegram, Discord, iMessage, Signal, and others).

Skills can be imported from external catalogs (about 150 from Hermes Agent and about 13,700 community skills from OpenClaw), all following the agentskills.io specification. The command jarvis optimize skills --policy dspy refines them from local trace history.

Marktechpost’s Visual Explainer

 
Stanford · Hazy Research + Scaling Intelligence Lab
OpenJarvis

An open-source, local-first framework for personal AI agents that run inference, agents, memory, and learning entirely on-device.

Within 3.2 pp of best cloud
~800× lower marginal API cost
~4× lower latency

Apache 2.0  •  arXiv:2605.17172  •  Framework released March 12, 2026

What it is

Personal AI that runs on your hardware

Most “personal” AI still routes every query through a cloud API. OpenJarvis makes local-first the default and calls the cloud only when needed — building on the team’s Intelligence Per Watt finding that local models already handle 88.7% of single-turn queries.

License: Apache 2.0
Repository: github.com/open-jarvis/OpenJarvis
Models: 11 local models · 4 families (Qwen3.5, Gemma4, Nemotron, Granite)
Engines: Ollama, vLLM, SGLang, llama.cpp, Apple FM, Exo

Architecture

Five primitives, one spec

A personal AI system is decomposed into five typed, independently swappable primitives, composed through a single declarative spec serialized to portable TOML.

  • Intelligence — model, weights, generation params, quantization
  • Engine — inference runtime, batching, KV-cache, hardware path
  • Agents — reasoning loop (ReAct or CodeAct), prompts, tool policy
  • Tools & Memory — 25+ connectors, 32+ channels, native MCP
  • Learning — optimizer slot: LoRA, DSPy, GEPA, or spec search

Key method

LLM-guided spec search

A frontier cloud model acts as a teacher at search time: it reads traces, diagnoses failure clusters, and proposes edits across primitives. A gate accepts only non-regressing edits. The optimized spec then runs entirely on-device — zero cloud calls at inference time.

13–32 pp of the cloud–local gap closed
7–11× lower optimization cost vs single-primitive baselines

The four-primitive move space adds 5.5–16.5 pp; the LLM proposer adds ~10 pp over evolutionary search at the same move space.

Performance

Close to cloud, far cheaper

3.2 pp gap: Qwen3.5-122B 80.3% vs Claude Opus 4.6 83.5%
4 / 8 benchmarks where local matches or beats cloud
  • Matches/exceeds cloud on ToolCall-15, PinchBench, LiveCodeBench, τ-Bench V2
  • ~800× lower marginal API cost; ~4× lower latency (paper’s protocol)
  • Swap test: a 25–39 pp drop shrinks to 5.6–16.5 pp under a spec (56–77% recovered)

Developer experience

From zero to an agent in minutes

One command provisions uv, a Python virtual environment, Ollama, and a starter model (~3 minutes on broadband):

curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash

  • 8 built-in agents across on-demand, scheduled, and continuous modes
  • 25+ data connectors · 32+ messaging channels
  • Skills via agentskills.io: ~150 from Hermes Agent, ~13,700 from OpenClaw

The bottom line

A research platform and a production foundation

OpenJarvis trades roughly 3.2 pp of accuracy — the gap concentrating on reasoning- and research-heavy tasks — for major cost, latency, and privacy gains. Inference, agent state, and memory stay on-device by construction; the cloud teacher is optional and bounded.

Caveats: results average 5 runs per configuration, use GPT-5-mini as judge, and were run on a single machine. Apache 2.0 and actively maintained — built, in the authors’ words, “in the spirit of PyTorch” for local AI.

 


Key Takeaways

  • OpenJarvis runs inference, agents, memory, and learning fully on-device, landing within 3.2 pp of the best cloud model at ~800× lower marginal API cost and ~4× lower latency.
  • A typed “spec” decomposes the stack into five swappable primitives — Intelligence, Engine, Agents, Tools & Memory, and Learning — serialized to portable TOML.
  • LLM-guided spec search uses a frontier cloud model as a search-time teacher to recover 13–32 pp of the cloud–local gap at 7–11× lower optimization cost, then runs locally with zero cloud calls.
  • Local specs match or exceed cloud on 4 of 8 benchmarks (ToolCall-15, PinchBench, LiveCodeBench, τ-Bench V2); the remaining gap concentrates on reasoning- and research-heavy tasks.

 
