Future AGI: The Open-Source Platform for Self-Improving AI Agents

Future AGI is an open-source platform that collapses the entire AI agent lifecycle into one self-improving feedback loop. Instead of stitching together separate tools for evaluations, observability, and guardrails, Future AGI provides simulate, evaluate, protect, monitor, optimize, and gateway capabilities in a single platform – so agents do not just get monitored, they self-improve.

Architecture Overview

Understanding the Six-Pillar Architecture

The architecture diagram above illustrates Future AGI’s six core pillars and how they connect through a self-improving feedback loop. Let’s break down each component:

Simulate – Personas and Edge Cases The simulation engine generates thousands of multi-turn conversations against realistic personas, adversarial inputs, and edge cases. It supports both text and voice agents (LiveKit, VAPI, Retell, Pipecat), enabling teams to test their agents under conditions that mirror real-world usage before deployment.

Evaluate – 50+ Metrics Under One Call A single evaluate() call runs 50+ metrics: groundedness, hallucination detection, tool-use correctness, PII leakage, tone analysis, and custom rubrics. The evaluation engine combines three approaches – LLM-as-judge, heuristic checks, and ML classifiers – to provide comprehensive coverage without requiring multiple tools.

Protect – 18 Built-in Scanners + 15 Vendor Adapters The protection layer includes 18 built-in scanners (PII, jailbreak, injection, and more) plus 15 vendor adapters (Lakera, Presidio, Llama Guard, and others). These can run inline in the gateway or as a standalone SDK, providing real-time guardrails for production agents.

Monitor – OpenTelemetry-Native Tracing Monitoring is built on OpenTelemetry with 50+ framework instrumentors (LangChain, LlamaIndex, CrewAI, DSPy, and more). Span graphs, latency tracking, token cost analysis, and live dashboards come with zero configuration.

Command Center – 100+ Provider Gateway The Go-based gateway handles ~29k requests per second on t3.xlarge with P99 latency under 21ms even with guardrails enabled. It supports 100+ LLM providers, 15 routing strategies, semantic caching, virtual keys, MCP, and A2A protocols.

Optimize – 6 Prompt Optimization Algorithms Six algorithms (GEPA, PromptWizard, ProTeGi, Bayesian, Meta-Prompt, and Random) automatically improve prompts using production traces as training data, closing the feedback loop.

Platform Stack

Platform Stack

Understanding the Platform Stack

The platform stack diagram shows the four-layer architecture of Future AGI, where every arrow represents an open, documented interface:

Client Layer – traceAI SDKs The traceAI instrumentation libraries provide zero-config OpenTelemetry tracing for Python, TypeScript, Java, and C#. With just two lines of code, your existing OpenAI (or other framework) calls are automatically traced, with spans flowing into the platform via OTLP.

Edge Layer – Agent Command Center The Go-based gateway sits at the edge, handling all incoming requests with OpenAI-compatible HTTP. It provides weighted routing (~9.9ns per decision), semantic caching, virtual keys, and inline guardrail enforcement. The 29k req/s throughput with P99 under 21ms makes it production-ready for high-traffic deployments.

Platform Layer – Django 4.2 + Channels The core platform runs on Django 4.2 with Channels for real-time WebSocket support, plus Temporal for durable workflow orchestration. This layer houses all six pillars: simulate, evaluate, protect, monitor, optimize, and the command center management.

Data Layer – PostgreSQL + ClickHouse + Redis + RabbitMQ PostgreSQL stores metadata and configuration, ClickHouse handles spans and time-series data at scale, Redis manages state and caching, and RabbitMQ plus Temporal handle job queues and durable workflows. Every interface uses standard SQL or OTLP, so you can drop in your own infrastructure at any layer.

Evaluation Pipeline

Evaluation Pipeline

Understanding the Evaluation Pipeline

The evaluation pipeline diagram shows how agent traces and outputs flow through three parallel evaluator types before being aggregated into a single score set:

LLM-as-Judge The LLM judge uses a separate language model to evaluate outputs on dimensions that require semantic understanding: groundedness (is the response based on provided context?), hallucination detection (does the response invent facts?), tool-use correctness (did the agent call the right tools with the right parameters?), and overall response quality. This approach catches subtle errors that rule-based systems miss.

Heuristic Evaluators Heuristic checks provide deterministic, fast evaluation for well-defined patterns: PII detection via regex, length constraints, format validation (JSON schema, URL patterns), and custom rule-based checks. These run in milliseconds and provide consistent, reproducible scores.

ML Classifiers Custom ML models handle classification tasks that sit between heuristic rules and LLM judgment: embedding similarity for semantic deduplication, toxicity detection, topic classification, and custom classifiers trained on your specific domain data. These models run locally for low-latency production use.

Aggregation via evaluate() All three evaluator types feed into a single evaluate() call that returns 50+ metrics in one response. This eliminates the need to manage separate evaluation tools and ensures all scores are computed consistently against the same trace data.

Self-Improving Feedback Loop

Feedback Loop

Understanding the Self-Improving Feedback Loop

The feedback loop diagram illustrates the core value proposition of Future AGI: agents that do not just get monitored, they self-improve. Here is how each stage connects:

1. Deploy Agent to Production You deploy your AI agent (customer support, voice assistant, RAG pipeline, or coding agent) with traceAI instrumentation. From the first request, all interactions are captured as OpenTelemetry spans.

2. Monitor – OTel Traces, Cost, and Latency Every request flows through the monitoring layer, producing span graphs, latency histograms, token cost breakdowns, and live dashboards. You see exactly what your agent is doing, how long it takes, and how much it costs – with zero configuration.

3. Evaluate – 50+ Metrics Production traces are automatically evaluated against your chosen metrics. Hallucination rates, groundedness scores, tool-use accuracy, PII leakage, and custom rubrics are all computed and tracked over time. Anomalies trigger alerts.

4. Simulate – Persona-Driven Edge Cases Based on evaluation results, the simulation engine generates targeted test scenarios. If your agent struggles with adversarial inputs, the simulator creates more of them. If voice agents fail on specific turn patterns, LiveKit-driven simulations reproduce those patterns at scale.

5. Optimize – 6 Prompt Optimization Algorithms Production traces and simulation results feed into the optimization engine. Six algorithms (GEPA, PromptWizard, ProTeGi, Bayesian, Meta-Prompt, and Random) automatically improve prompts, system instructions, and agent configurations. The best-performing variants are promoted.

6. Protect – 18 Scanners + 15 Adapters Guardrails run inline in the gateway, blocking PII, jailbreaks, prompt injections, and other threats in real time. The protection layer also feeds signal back into the optimization engine, creating a virtuous cycle where the agent becomes both safer and more effective over time.

The loop then returns to deployment, where the improved agent version replaces the previous one – and the cycle continues.

Installation

Quick Start with Docker Compose

# Clone the repository
git clone https://github.com/future-agi/future-agi.git
cd future-agi

# Copy environment configuration
cp futureagi/.env.example futureagi/.env

# Start the full stack
docker compose up -d

Open http://localhost:3031 to access the dashboard.

Instrument Your First Agent (Python)

from fi_instrumentation import register
from traceai_openai import OpenAIInstrumentor

register(project_name="my-agent")
OpenAIInstrumentor().instrument()

# Your existing OpenAI code is now traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": query}],
)

Instrument Your First Agent (TypeScript)

import { register } from "@traceai/fi-core";
import { OpenAIInstrumentation } from "@traceai/openai";

register({ projectName: "my-agent" });
new OpenAIInstrumentation().instrument();

// Your existing OpenAI code is now traced
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: query }],
});

Cloud Option (No Install)

Sign up at app.futureagi.com for the free cloud tier, then install the evaluation SDK:

pip install ai-evaluation

Key Features

Feature Description
Agent Simulation Thousands of multi-turn conversations with realistic personas and adversarial inputs, including voice agents
50+ Evaluation Metrics Groundedness, hallucination, tool-use correctness, PII, tone, custom rubrics via LLM-as-judge + heuristic + ML
18 Built-in Scanners PII, jailbreak, injection detection, and more, plus 15 vendor adapters (Lakera, Presidio, Llama Guard)
OpenTelemetry Tracing 50+ framework instrumentors with span graphs, latency, token cost, and live dashboards
Agent Command Center Go-based gateway with ~29k req/s, P99 under 21ms, 100+ providers, 15 routing strategies
Prompt Optimization 6 algorithms (GEPA, PromptWizard, ProTeGi, Bayesian, Meta-Prompt, Random) using production traces
Self-Improving Loop Production traces feed back as training data for automatic prompt and agent improvement
Open Source Apache 2.0 license – inspect every evaluator, prompt, and trace with no black-box scoring
Self-Hostable Docker Compose, Kubernetes, or any container runtime (ECS, Cloud Run, AKS, EKS, GKE)

Comparison with Alternatives

Feature Future AGI Langfuse Phoenix Braintrust Helicone
Open Source Apache 2.0 MIT Elastic v2 No Apache 2.0
Self-Host Yes Yes Yes No Yes
Agent Simulation Yes No No No No
Voice Agent Eval Yes No Limited No No
LLM Gateway Yes (100+ providers) No No Yes Yes
Guardrails Yes (18 + 15 adapters) No No No No
Prompt Optimization Yes (6 algorithms) No No No No

SDKs and Integrations

Future AGI provides independently usable, Apache/MIT-licensed SDKs:

SDK Install Purpose
traceAI pip install fi-instrumentation-otel / npm i @traceai/fi-core Zero-config OTel tracing for 50+ AI frameworks
ai-evaluation pip install ai-evaluation / npm i @future-agi/ai-evaluation 50+ evaluation metrics + guardrail scanners
futureagi pip install futureagi Platform SDK for datasets, prompts, KB, experiments
agent-opt pip install agent-opt 6 prompt optimization algorithms
simulate-sdk pip install agent-simulate Voice-agent simulation via LiveKit + Silero VAD
agentcc pip install agentcc / npm i @agentcc/client Gateway client SDKs

Troubleshooting

Issue Solution
Docker Compose fails to start Ensure ports 3031, 5432, 8123, 6379, and 5672 are not in use by other services
traceAI not showing traces Verify register() is called before any LLM client initialization
Gateway returns 401 Check that your API key is configured in the dashboard under Settings > API Keys
ClickHouse connection refused Ensure ClickHouse is running: docker compose ps clickhouse
Evaluation scores are 0 Confirm your LLM judge API key is set in the environment configuration
High latency on gateway Check routing strategy; weighted routing adds ~9.9ns overhead per request

Conclusion

Future AGI stands out as the only open-source platform that combines agent simulation, evaluation, guardrails, monitoring, gateway, and prompt optimization into a single self-improving feedback loop. With Apache 2.0 licensing, OpenTelemetry-native tracing, a Go-based gateway handling 29k req/s, and 50+ evaluation metrics out of the box, it eliminates the need to stitch together Langfuse, Braintrust, Helicone, and Guardrails AI separately.

Whether you are deploying customer support agents, voice assistants, RAG pipelines, or coding agents, Future AGI provides the infrastructure to simulate edge cases before launch, evaluate what happens in production, protect users in real time, and turn every trace into signal for the next version.

Links:

Watch PyShine on YouTube

Contents