OpenSquilla: Token-Efficient AI Agent with Higher Intelligence Density

Token-efficient AI agent design has become critical as LLM costs scale with usage, and OpenSquilla addresses this challenge head-on with a framework that delivers higher intelligence density within the same token budget. Built in Python with support for Model Context Protocol (MCP), persistent memory, and a modular skills system, OpenSquilla enables developers to build agents that accomplish more with fewer tokens – reducing costs while improving output quality. With 3,747 stars and growing community adoption, OpenSquilla represents a shift toward efficiency-first agent architecture.

OpenSquilla is a microkernel AI agent runtime that runs locally on your machine, providing a unified gateway for CLI, Web UI, and chat channels. Its core innovation is SquillaRouter, an on-device LightGBM and ONNX classifier that scores each turn on length, language, code presence, keywords, and semantic embeddings, then routes it across four tiers (T0 through T3) to the cheapest capable model. Classification runs entirely on-device, meaning your prompt never leaves the machine to make that routing decision. The result: PinchBench 1.2.1 benchmarks show OpenSquilla achieving a 0.9251 average score across 25 tasks at $0.688 total cost, compared to a single-model approach scoring 0.9255 at $6.233 – nearly identical quality at roughly one-ninth the cost.

How It Works

OpenSquilla Architecture

Understanding the Architecture

The architecture diagram above illustrates the core components and their interactions within OpenSquilla. Let us break down each component:

User Request: The entry point for the system. A developer submits a natural language task or query through any of the supported interfaces – CLI, Web UI, or chat channels like Slack, Telegram, Discord, Feishu, DingTalk, WeCom, Matrix, or QQ. Every entry point runs through the same shared TurnRunner loop, ensuring consistent behavior across all surfaces.

Token Optimizer: The prompt compression engine that reduces token usage through context pruning, prompt compression, and intelligent summarization. OpenSquilla requests extended reasoning only for turns the router scores as complex, and the system prompt scales with task complexity – lightweight for trivial turns, full instructions for complex ones. This adaptive approach means simple queries consume far fewer tokens than they would with a one-size-fits-all prompt.

Agent Core (TurnRunner Orchestrator): The central orchestrator that coordinates LLM calls, tool execution, memory access, and response generation. It manages the complete lifecycle of each turn, from initial prompt construction through tool dispatch to final response delivery. The TurnRunner handles retries, error recovery, and approval-gated operations for sensitive tool calls.

LLM Backend (20+ Providers): The foundation model layer supports OpenRouter, OpenAI, Anthropic, Ollama, DeepSeek, Gemini, DashScope/Qwen, Moonshot, Mistral, Groq, Zhipu, SiliconFlow, vLLM, LM Studio, and more. Primary-plus-fallback selection ensures reliability, and the SquillaRouter classifier determines which tier each turn should use – from lightweight local models for simple tasks to powerful frontier models for complex reasoning.

Memory System (Persistent + Semantic): A curated MEMORY.md plus dated Markdown notes, searched with SQLite full-text keyword search and sqlite-vec semantic recall. Embeddings run on-device via bundled ONNX, or can swap to OpenAI or Ollama. Optional exponential decay and opt-in “dream” consolidation are available. This persistent memory eliminates the need to re-send context across sessions, directly reducing token consumption.

Skills Engine (15 Bundled Skills): Modular skill definitions that load only when the task needs them. The 15 bundled skills cover coding, GitHub, cron, PPTX/DOCX/XLSX/PDF generation, summarization, tmux, weather, and more. Skills can be authored, installed, and published from the CLI. On-demand loading means the prompt context stays lean.

MCP Layer (Client + Server): Model Context Protocol integration for connecting to external tools and services. OpenSquilla is both an MCP client and an MCP server – opensquilla mcp-server run exposes its capabilities to other MCP-compatible tools. This bidirectional connectivity extends agent reach without bloating the prompt.

OpenClaw Compatibility: A migration path from existing OpenClaw or Hermes Agent installations. The opensquilla migrate command previews and applies imports from existing homes, including memory, persona files, skills, MCP/channel config, and conflict handling. This makes switching from other agent frameworks straightforward.

High-Density Response: The final output – an intelligence-dense response delivered within the token budget. By optimizing every token in the request pipeline, OpenSquilla ensures that the response carries maximum information value per token consumed.

Key Insight: OpenSquilla’s core innovation is intelligence density – the idea that an AI agent should produce more useful output per token consumed. Rather than simply reducing token count, OpenSquilla optimizes the information content of each token through prompt compression, context pruning, and intelligent summarization, ensuring that every token in the LLM request contributes meaningfully to the response.

Key Features

OpenSquilla Features

Understanding the Features

The features diagram above shows OpenSquilla’s eight core capabilities radiating from the central hub. Each branch represents a distinct feature that contributes to the overall intelligence density goal.

Token Efficiency: The flagship feature. SquillaRouter classifies each turn into one of four tiers (T0 through T3) based on complexity, language, code presence, and semantic embeddings. Simple queries get routed to cheaper models, complex reasoning tasks to more capable ones. The system prompt also scales with complexity – lightweight for trivial turns, full instructions for complex ones. This dual optimization of model selection and prompt sizing delivers the core intelligence density promise.

MCP Integration: OpenSquilla speaks the Model Context Protocol as both client and server. As a client, it connects to external MCP-compatible tools and services. As a server, it exposes its own capabilities to other tools via opensquilla mcp-server run. This bidirectional connectivity requires the mcp extra (opensquilla[recommended,mcp]).

Persistent Memory: A curated MEMORY.md file plus dated Markdown notes, searched with SQLite full-text keyword search and sqlite-vec semantic recall. Embeddings run on-device via bundled ONNX, or can swap to OpenAI or Ollama. Optional exponential decay and opt-in “dream” consolidation help manage memory growth over time. This persistent context eliminates redundant token transmission across sessions.

Modular Skills: 15 bundled skills that load on-demand, covering coding, GitHub, cron, PPTX/DOCX/XLSX/PDF generation, summarization, tmux, weather, and more. Skills can be authored, installed, and published from the CLI. On-demand loading keeps the prompt context lean – only the skills needed for the current task are included in the context window.

OpenClaw Compatible: A migration path from existing OpenClaw or Hermes Agent installations. The opensquilla migrate command handles memory, persona files, skills, MCP/channel config, and conflict handling. Dry-run mode (--json) lets you preview changes before applying them.

Python-Native: Built for Python 3.12+ developers, OpenSquilla runs as a local gateway on 127.0.0.1:18791 with a Starlette ASGI server. The CLI provides commands for onboarding, configuration, session management, cost tracking, and more. A Web UI at /control/ offers a visual interface for monitoring and management.

Foundation Models: Support for 20+ LLM providers including OpenRouter, OpenAI, Anthropic, Ollama, DeepSeek, Gemini, DashScope/Qwen, Moonshot, Mistral, Groq, Zhipu, SiliconFlow, vLLM, and LM Studio. Primary-plus-fallback selection ensures reliability, and the SquillaRouter classifier determines the optimal model for each turn.

Deep Learning: The SquillaRouter uses LightGBM and ONNX for on-device classification. The recommended install profile includes numpy, scikit-learn, onnxruntime, and tokenizers – all running locally without sending prompts to external services for routing decisions.

Amazing: OpenSquilla combines three powerful efficiency mechanisms in one framework: token optimization that reduces redundant context, persistent memory that eliminates re-sending information across sessions, and a modular skills system that lets agents execute complex tasks with minimal prompt overhead. The result is an AI agent that achieves higher quality output within the same token budget that other agents spend on verbose, low-information content.

Feature Description
Token-Efficient Routing SquillaRouter classifies turns across 4 tiers (T0-T3) using on-device LightGBM + ONNX
Adaptive Prompts System prompt scales with task complexity – lightweight for simple, full for complex
20+ LLM Providers OpenRouter, OpenAI, Anthropic, Ollama, DeepSeek, Gemini, and more
On-Demand Skills 15 bundled skills load only when needed, keeping context lean
MCP Client and Server Bidirectional Model Context Protocol connectivity
Persistent Memory SQLite FTS + sqlite-vec semantic recall with on-device embeddings
Layered Sandbox Standard / Strict / Locked policy tiers with denial ledger
Unified Gateway CLI, Web UI, and 9 chat channels share one TurnRunner
Durable Sessions SQLite-backed sessions, subagents, and cron scheduling
Operator Controls Human-in-the-loop approvals, per-turn cost tracking, diagnostics

Installation

OpenSquilla runs on Windows, macOS, and Linux. The recommended installation path uses uv for isolated environment management.

1. Install uv (skip if uv --version already works):

Linux / macOS:

curl -LsSf https://astral.sh/uv/install.sh | sh
. "$HOME/.local/bin/env"

Windows PowerShell:

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
$env:Path = "$env:USERPROFILE\.local\bin;" + $env:Path

2. Install OpenSquilla:

uv tool install --python 3.12 "opensquilla[recommended] @ https://github.com/opensquilla/opensquilla/releases/download/v0.3.1/opensquilla-0.3.1-py3-none-any.whl"

The recommended extra includes SquillaRouter runtime dependencies (ONNX Runtime, LightGBM, NumPy, and tokenizers). For a minimal install without the router, use OPENSQUILLA_INSTALL_PROFILE=core.

3. Configure and run:

opensquilla onboard
opensquilla gateway run

The onboard command is an interactive first-run wizard that writes the active config file and keeps provider secrets in environment variables. Open the Web UI at http://127.0.0.1:18791/control/.

Install from Source

For users tracking the main branch or contributing to development:

git lfs install
git clone https://github.com/opensquilla/opensquilla.git
cd opensquilla
git lfs pull --include="src/opensquilla/squilla_router/models/**"
uv sync --extra recommended --extra dev
uv run opensquilla --help

Docker

Build from a source checkout with Git LFS assets pulled:

docker build -t opensquilla:local .
./start.sh

Usage

OpenSquilla Workflow

Understanding the Workflow

The workflow diagram above shows the 7-step process that OpenSquilla follows for each user interaction, along with a decision loop that refines the output when needed.

Step 1 – User Submits Request: A natural language task or query enters the system through any supported interface – CLI, Web UI, or chat channel. The unified TurnRunner ensures identical behavior regardless of entry point.

Step 2 – Token Optimizer Compresses Context: SquillaRouter scores the turn on length, language, code presence, keywords, and semantic embeddings. The system prompt scales with task complexity – lightweight for trivial turns, full instructions for complex ones. This compression step ensures that only information-rich tokens reach the LLM.

Step 3 – Agent Core Processes Request: The TurnRunner orchestrator constructs the optimized prompt, selects the appropriate model tier, and prepares any tool definitions needed for the turn. It manages the complete lifecycle including retries, error recovery, and approval-gated operations.

Step 4 – LLM Generates Response: The foundation model produces an intelligence-dense output. The model tier (T0 through T3) was determined in Step 2, ensuring that simple queries use cheaper models while complex reasoning tasks get the full power of frontier models.

Step 5 – Skills Engine Executes Tools: If the LLM response includes tool calls, the modular skills engine dispatches them to the appropriate skill handler. Skills load on-demand, so only the relevant skill definitions are included in the context. The 15 bundled skills cover coding, GitHub, cron, document generation, and more.

Step 6 – Memory System Stores Context: Persistent memory retains key information from the interaction for future sessions. SQLite full-text search and sqlite-vec semantic recall enable efficient retrieval. This storage eliminates the need to re-send context in subsequent turns, directly reducing token consumption.

Step 7 – MCP Layer Connects Services: The Model Context Protocol layer connects to external tools and services when needed. OpenSquilla can act as both MCP client (connecting to external tools) and MCP server (exposing its capabilities to other tools).

Decision – Task Complete?: If the task is complete, the high-density response is delivered to the user. If not, the agent loops back to Step 3 with refined context, incorporating the results from tool execution and memory recall. This iterative refinement ensures thorough task completion without wasting tokens on unnecessary re-processing.

Takeaway: With OpenSquilla’s MCP integration and skills system, developers can extend agent capabilities without bloating the prompt context. Each skill is loaded on-demand, and the memory system ensures that context from previous interactions is retained without re-transmitting it – making every API call more efficient and every response more relevant.

Configuration

OpenSquilla uses a layered config system with environment variable overrides:

# Interactive setup
opensquilla onboard

# Non-interactive setup (CI/SSH)
export OPENROUTER_API_KEY="sk-..."
opensquilla onboard --provider openrouter --api-key-env OPENROUTER_API_KEY

# Reconfigure individual sections
opensquilla configure provider --provider openai --model gpt-4o --api-key-env OPENAI_API_KEY
opensquilla configure router --router recommended
opensquilla configure search --search-provider brave --api-key-env BRAVE_SEARCH_API_KEY

Config load order: OPENSQUILLA_GATEWAY_CONFIG_PATH -> ./opensquilla.toml -> ~/.opensquilla/config.toml -> built-in defaults.

Migrating from OpenClaw or Hermes

# Preview migration
opensquilla migrate openclaw --json

# Apply migration
opensquilla migrate openclaw --apply

Health Checks and Diagnostics

opensquilla doctor          # Check readiness
opensquilla doctor --json   # Machine-readable status
opensquilla cost            # Per-turn and per-session token/cost rollups

Conclusion

OpenSquilla represents a fundamental shift in AI agent design – from brute-force token consumption to intelligence density. By treating tokens as a scarce resource that should carry maximum information value, OpenSquilla achieves nearly identical quality to single-model approaches at roughly one-ninth the cost, as demonstrated by PinchBench 1.2.1 benchmarks.

The combination of on-device model routing (SquillaRouter), persistent memory with semantic recall, modular on-demand skills, and bidirectional MCP connectivity creates an agent that does more with less. Whether you are building CLI tools, web interfaces, or chat-integrated assistants, OpenSquilla’s unified TurnRunner ensures consistent behavior across all surfaces while keeping token costs under control.

With Python 3.12+ support, 20+ LLM providers, 15 bundled skills, and a clear migration path from OpenClaw and Hermes, OpenSquilla is ready for production use. Install it today and start building agents that deliver higher intelligence density within your budget.

Important: The shift from brute-force token consumption to intelligence density is not just a cost optimization – it is a fundamental change in how AI agents should be designed. OpenSquilla demonstrates that agents can be both more capable and more efficient by treating tokens as a scarce resource that should carry maximum information value.

Links:

Watch PyShine on YouTube

Contents