X25: Autonomous LLM Routing
X25 routes every LLM call to the cheapest model that meets your quality bar. It learns per project using Thompson Sampling, judges quality automatically, escalates when needed, and trains a custom model on your call history at Stage 4.
5-minute setup
- Install the SDK

      pip install x25-sdk

- Generate your API key
  Create an account and log into your dashboard. Open the hamburger menu in the top navigation bar and click Generate API Key. Confirm your organisation name, then click Proceed. Copy the generated API key and use it to authenticate your SDK calls.
- Route your first call

      from x25 import X25

      agent = X25(
          api_key="sk-x25-a1b2c3d4...",
          org="my-research-lab",
      )
      result = agent.complete(prompt="your-prompt")
      print(result)
Installation
The SDK has two components: a lightweight Python client whose only dependency is httpx, and the gateway server, which runs locally or on your own infrastructure.
SDK only
    pip install x25-sdk
Gateway + full stack
    pip install -r requirements.txt  # includes FastAPI, LangGraph, LangChain, OpenAI, aiosqlite, numpy
Environment variables
    # .env
    OPENROUTER_API_KEY=sk-or-...   # required, routes to 300+ models
    OPENAI_API_KEY=sk-...          # required, LLM-as-judge + classifier
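A start-up script could check for both keys before launching the gateway. The sketch below is illustrative only: `missing_env_vars` is a hypothetical helper, not part of the X25 SDK, and it assumes the two variable names shown above are the only required ones.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("OPENROUTER_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Pass a plain dict to try the helper without touching the real environment.
print(missing_env_vars({"OPENROUTER_API_KEY": "sk-or-..."}))
```

Calling `missing_env_vars()` with no argument checks the real process environment.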
Authentication
Every project gets an API key scoped to org. X25 uses this to isolate learning,
savings, and audit records per project.
| Field | Type | Description |
|---|---|---|
| org | string | Your project identifier. Used to track learning, savings, and stage. |
| rate_limit_rpm | int | Requests per minute. Default 0 = unlimited. |
Pass the key as a Bearer token on every request:
    Authorization: Bearer sk-x25-a1b2c3d4...
Save the key to your .env file immediately.
X25()
Create an agent instance. The agent holds your project config and re-uses it across calls.
    X25(
        org: str = "default",
        optimize_for: dict = {"cost": 0.33, "quality": 0.34, "latency": 0.33},
        policy: dict = {},
        gateway_url: str = "http://localhost:8000",
        api_key: str = None,
    )
| Parameter | Type | Description |
|---|---|---|
| org | str | Your project ID. X25 learns per-project. Separate projects get separate Thompson arms, audit logs, and stage progression. |
| optimize_for | dict | Weights summing to 1.0 across cost, quality, latency. See optimize_for. |
| policy | dict | Hard constraints. E.g. {"max_cost_per_call_usd": 0.02}. |
| gateway_url | str | Where the X25 gateway is running. |
| api_key | str | Your project API key from POST /keys/create. |
agent.complete()
Route a prompt to the best available model. In a single call, X25 classifies the task, selects a tier, dispatches the prompt, judges the quality of the answer, escalates if needed, and updates its own learning.
    agent.complete(
        prompt: str,
        hint: str = None,
    ) -> X25Response
| Parameter | Type | Description |
|---|---|---|
| prompt | str | The text to route. |
| hint | str | Optional task type hint. One of: summary, code, classification, extraction, qa, reasoning, creative. If omitted, X25 classifies automatically. |
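The dispatch, judge, and escalate steps can be pictured as a loop over tiers. This is a minimal sketch under stated assumptions, not the gateway's implementation: the `cascade` function, the 0.7 quality bar, and the stub dispatcher and judge are all hypothetical. The tier names are the ones that appear in the stats responses.

```python
from typing import Callable, Tuple

TIERS = ["slm", "mid", "frontier"]  # ordered from cheapest to most capable


def cascade(
    prompt: str,
    start_tier: str,
    dispatch: Callable[[str, str], str],
    judge: Callable[[str, str], float],
    quality_bar: float = 0.7,  # assumed threshold, not a documented default
) -> Tuple[str, str, float, int]:
    """Try start_tier, then escalate until the judged score meets quality_bar.

    Returns (text, tier_used, quality_score, cascade_steps).
    """
    steps = 0
    text, tier, score = "", start_tier, 0.0
    for tier in TIERS[TIERS.index(start_tier):]:
        steps += 1
        text = dispatch(tier, prompt)
        score = judge(prompt, text)
        if score >= quality_bar:
            break
    return text, tier, score, steps


# Stub dispatcher and judge so the sketch runs without any network access.
fake_scores = {"slm": 0.5, "mid": 0.9, "frontier": 0.95}
result = cascade(
    "Summarise this paragraph.",
    "slm",
    dispatch=lambda tier, prompt: f"answer from {tier}",
    judge=lambda prompt, text: fake_scores[text.rsplit(" ", 1)[-1]],
)
print(result)  # ('answer from mid', 'mid', 0.9, 2)
```

Here the first tier fails the assumed bar, so the call escalates once and finishes with two cascade steps.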
X25Response
    @dataclass
    class X25Response:
        text: str             # the model's answer
        model_used: str       # e.g. "deepseek/deepseek-v4-flash"
        provider: str         # e.g. "deepseek"
        task_type: str        # classified task: "summary", "code", etc.
        cost_usd: float       # actual cost of this call
        latency_ms: float     # end-to-end latency in milliseconds
        quality_score: float  # 0.0–1.0, LLM-as-judge
        cascade_steps: int    # 1 = first tier passed, 2+ = escalated
        audit_hash: str       # SHA-256 hash, tamper-evident
        goal_match: dict      # per-dimension score vs your optimize_for
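One way to use these fields is to log a one-line summary per call and flag escalations. The sketch below is illustrative: `describe` is a hypothetical helper, and the sample object is a stand-in with made-up values so the code runs without the SDK or a gateway.

```python
from types import SimpleNamespace


def describe(resp) -> str:
    """One-line summary of a response; flags calls that were escalated."""
    escalated = resp.cascade_steps > 1
    note = f"escalated {resp.cascade_steps - 1}x" if escalated else "first tier passed"
    return (
        f"{resp.model_used} [{resp.task_type}] "
        f"quality={resp.quality_score:.2f} cost=${resp.cost_usd:.4f} "
        f"latency={resp.latency_ms:.0f}ms ({note})"
    )


# Stand-in object with made-up values; a real call would return an X25Response.
sample = SimpleNamespace(
    model_used="deepseek/deepseek-v4-flash",
    task_type="summary",
    cost_usd=0.0002,
    latency_ms=1840.0,
    quality_score=0.93,
    cascade_steps=1,
)
print(describe(sample))
```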
optimize_for
Tell X25 what matters most for your project. Weights should sum to 1.0; if they don't, X25 normalises them automatically.
| Key | What it controls | Example value |
|---|---|---|
| cost | How aggressively to prefer cheaper tiers | 0.6 |
| quality | Minimum quality bar and reward weight | 0.3 |
| latency | Preference for faster models | 0.1 |
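The normalisation mentioned above presumably divides each weight by the total. The sketch below shows that idea only; `normalise_weights` is a hypothetical function, and treating a missing key as 0 and rejecting negative weights are assumptions rather than documented behaviour.

```python
def normalise_weights(weights: dict) -> dict:
    """Scale cost/quality/latency weights so they sum to 1.0."""
    keys = ("cost", "quality", "latency")
    # Assumption: a missing key counts as weight 0.
    values = {k: float(weights.get(k, 0.0)) for k in keys}
    if any(v < 0 for v in values.values()):
        raise ValueError("weights must be non-negative")
    total = sum(values.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {k: v / total for k, v in values.items()}


print(normalise_weights({"cost": 6, "quality": 3, "latency": 1}))
```

With these inputs the weights come out as 0.6, 0.3, and 0.1, matching the example values in the table.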
Common profiles
    # Cost-first, for high-volume pipelines
    optimize_for={"cost": 0.7, "quality": 0.2, "latency": 0.1}

    # Quality-first, for research and publication
    optimize_for={"cost": 0.1, "quality": 0.8, "latency": 0.1}

    # Balanced, default
    optimize_for={"cost": 0.33, "quality": 0.34, "latency": 0.33}

    # Latency-first, for real-time use
    optimize_for={"cost": 0.1, "quality": 0.2, "latency": 0.7}
POST /route
The core routing endpoint. The SDK calls this on every agent.complete().
| Field | Type | Description |
|---|---|---|
| prompt | string | The text to route. |
| org | string | Project ID. Resolved from API key if provided. |
| optimize_for | object | Cost / quality / latency weights. |
| hint | string? | Optional task type hint. |
| policy | object? | Hard constraints. |
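For callers that are not using the SDK, a request to this endpoint might be assembled as follows. This is a sketch under assumptions: the JSON body encoding and the Content-Type header are not stated above, and `build_route_request` is a hypothetical helper. The request is built but not sent, so it runs without a gateway.

```python
import json
import urllib.request


def build_route_request(api_key, prompt, org, optimize_for, hint=None, policy=None,
                        gateway_url="http://localhost:8000"):
    """Build (but do not send) a POST /route request."""
    body = {"prompt": prompt, "org": org, "optimize_for": optimize_for}
    # Optional fields are included only when supplied.
    if hint is not None:
        body["hint"] = hint
    if policy is not None:
        body["policy"] = policy
    return urllib.request.Request(
        url=gateway_url.rstrip("/") + "/route",
        data=json.dumps(body).encode("utf-8"),  # assumed: JSON request body
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_route_request(
    api_key="sk-x25-...",
    prompt="Summarise this abstract.",
    org="my-research-lab",
    optimize_for={"cost": 0.33, "quality": 0.34, "latency": 0.33},
    hint="summary",
)
print(req.get_method(), req.full_url)
# Sending it requires a running gateway: urllib.request.urlopen(req)
```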
GET /stats/{project}
Aggregated stats for a project. Pass all as the project to aggregate across every project.
{
"total_calls": 301,
"total_cost_usd": 0.055,
"total_saved_usd": 0.307,
"avg_quality": 0.957,
"avg_latency_ms": 10341,
"model_distribution": { "deepseek-v4-flash": 233, "...": 68 },
"tier_distribution": { "slm": 269, "mid": 14, "frontier": 18 }
}
GET /stage/{project}
Current stage and progression for a project.
{
"stage": 3,
"stage_name": "Feedback",
"total_calls": 235,
"calls_to_next_stage": 265,
"progress_in_stage": 0.117,
"improvement_available": true
}
GET /thompson/{project}
Live Thompson Sampling state. Shows Beta distributions for each tier.
{
"org": "my-research-lab",
"arms": [
{ "tier": "slm", "alpha": 207.4, "beta": 2.0, "mean_reward": 0.99, "confidence": 1.0 },
{ "tier": "mid", "alpha": 3.0, "beta": 2.6, "mean_reward": 0.54, "confidence": 0.11 },
{ "tier": "frontier", "alpha": 5.0, "beta": 4.2, "mean_reward": 0.54, "confidence": 0.18 }
]
}
Stages
X25 autonomously advances through four stages as calls accumulate. Each stage unlocks new capabilities.
POST /feedback unlocked. Submit labelled examples to improve routing accuracy.
Thompson Sampling
X25 models each tier as a Beta distribution Beta(α, β) representing the posterior over "reward probability": how often this tier passes quality on your tasks.
    # Each call:
    θ_slm      = Beta(α_slm, β_slm).sample()
    θ_mid      = Beta(α_mid, β_mid).sample()
    θ_frontier = Beta(α_frontier, β_frontier).sample()
    tier = argmax([θ_slm, θ_mid, θ_frontier])

    # After observing quality, update the dispatched arm:
    α += reward
    β += (1 - reward)
Cheaper tiers naturally receive more traffic as their α grows. Frontier is selected only when SLM and mid lack sufficient evidence.
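The sampling and update rule above can be tried out with the standard library alone. The following is a minimal sketch, not X25's implementation: the `TierBandit` class, the uniform Beta(1, 1) prior, and the simulated rewards are assumptions made for illustration.

```python
import random


class TierBandit:
    """Beta-Bernoulli Thompson Sampling over routing tiers."""

    def __init__(self, tiers=("slm", "mid", "frontier"), seed=None):
        # Beta(1, 1) is a uniform prior; the prior X25 actually uses is not documented here.
        self.arms = {t: {"alpha": 1.0, "beta": 1.0} for t in tiers}
        self.rng = random.Random(seed)

    def select(self) -> str:
        """Draw one sample per arm and return the tier with the largest draw."""
        draws = {t: self.rng.betavariate(a["alpha"], a["beta"]) for t, a in self.arms.items()}
        return max(draws, key=draws.get)

    def update(self, tier: str, reward: float) -> None:
        """Apply the update rule: alpha += reward, beta += 1 - reward."""
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must be in [0, 1]")
        self.arms[tier]["alpha"] += reward
        self.arms[tier]["beta"] += 1.0 - reward

    def mean_reward(self, tier: str) -> float:
        """Posterior mean alpha / (alpha + beta)."""
        a = self.arms[tier]
        return a["alpha"] / (a["alpha"] + a["beta"])


bandit = TierBandit(seed=0)
for _ in range(200):
    tier = bandit.select()
    # Simulated rewards for illustration only: pretend "slm" passes quality most often.
    reward = {"slm": 0.95, "mid": 0.6, "frontier": 0.6}[tier]
    bandit.update(tier, reward)
print({t: round(bandit.mean_reward(t), 2) for t in bandit.arms})
```

The `mean_reward` values in the /thompson example response are consistent with the posterior mean α / (α + β): for the slm arm, 207.4 / (207.4 + 2.0) ≈ 0.99.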
Audit trail
Every routing decision is written to a hash-chained audit log. Any modification to a historical record breaks the chain.
    # Each record contains:
    record_hash = SHA256(prev_hash + record_content)

    # Verify integrity:
    curl http://localhost:8000/verify
    # → {"intact": true, "total_records": 301}
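To see why editing a historical record breaks the chain, here is a small self-contained sketch of a hash-chained log. It follows the `SHA256(prev_hash + record_content)` rule above, but the record layout, the all-zero genesis hash, and the JSON serialisation are assumptions, not the gateway's actual storage format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first record


def record_hash(prev_hash: str, content: dict) -> str:
    """SHA256(prev_hash + record_content), with content serialised deterministically."""
    payload = prev_hash + json.dumps(content, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: list, content: dict) -> None:
    """Link a new record to the hash of the previous one."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"content": content, "prev_hash": prev, "hash": record_hash(prev, content)})


def verify(chain: list) -> dict:
    """Recompute every hash; an edited record no longer matches its stored hash."""
    prev = GENESIS
    for rec in chain:
        if rec["prev_hash"] != prev or rec["hash"] != record_hash(prev, rec["content"]):
            return {"intact": False, "total_records": len(chain)}
        prev = rec["hash"]
    return {"intact": True, "total_records": len(chain)}


log = []
append(log, {"model_used": "deepseek/deepseek-v4-flash", "quality_score": 0.93})
append(log, {"model_used": "deepseek/deepseek-v4-flash", "quality_score": 0.88})
print(verify(log))  # {'intact': True, 'total_records': 2}
log[0]["content"]["quality_score"] = 1.0  # tamper with history
print(verify(log))  # {'intact': False, 'total_records': 2}
```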
Fine-tuning (Stage 4)
At 500 calls, X25 extracts your highest-quality call history and generates a LoRA fine-tuning script for Llama 3.2 3B.
    # Trigger at Stage 4:
    curl -X POST http://localhost:8000/improve/my-research-lab \
      -H "Authorization: Bearer sk-x25-..."
| Step | What happens |
|---|---|
| Extract | Audit log filtered for quality_score > 0.7, formatted as Alpaca JSONL |
| Train | Unsloth LoRA on Llama 3.2 3B, ~45 min on Colab T4, ~8GB VRAM |
| Register | POST /improve/{project}/register. Custom model enters routing pool at tier=slm |
| Route | Thompson Sampling explores the custom model. If reward >= 0.6 consistently, selected for ~85-92% of calls at $0.01/1M tokens |
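The Extract step can be approximated in a few lines. This sketch assumes each audit record exposes `prompt`, `text`, and `quality_score` fields, which is a guess at the record shape; `to_alpaca_jsonl` is a hypothetical helper. The output uses the usual Alpaca keys (`instruction`, `input`, `output`), one JSON object per line.

```python
import json


def to_alpaca_jsonl(records, min_quality=0.7):
    """Keep records with quality_score > min_quality and emit Alpaca-style JSONL."""
    lines = []
    for rec in records:
        if rec["quality_score"] > min_quality:
            lines.append(json.dumps({
                "instruction": rec["prompt"],
                "input": "",
                "output": rec["text"],
            }))
    return "\n".join(lines)


# Made-up audit records; the field names are assumptions.
records = [
    {"prompt": "Summarise: ...", "text": "A short summary.", "quality_score": 0.95},
    {"prompt": "Classify: ...", "text": "spam", "quality_score": 0.60},
]
print(to_alpaca_jsonl(records))
```

Only the first record clears the 0.7 threshold, so the output is a single JSONL line.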
Manager terminal
A business-facing view of X25. Shows weekly savings, call volume, model mix, and a chat interface where managers can ask questions about AI spend and routing decisions. No code required.
What managers see
| Panel | What it shows |
|---|---|
| Hero banner | Weekly savings vs naive frontier, total calls, quality delta |
| Chat | Ask questions about AI spend. X25 answers with routing receipts showing exactly which model ran and why |
| Observatory | 30-day savings curve, model mix by call volume, recent routing decisions table |
| Ticker | Live model prices, router uptime, queue depth, saved today |
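Headline figures like those in the hero banner could be derived from a /stats payload. The sketch below assumes that the baseline spend equals `total_cost_usd + total_saved_usd`; that relationship is not stated in this document, and `savings_summary` is a hypothetical helper. The input values are copied from the /stats example.

```python
def savings_summary(stats: dict) -> dict:
    """Derive headline figures from a /stats payload."""
    cost = stats["total_cost_usd"]
    saved = stats["total_saved_usd"]
    baseline = cost + saved  # assumption: savings are measured against this baseline
    tiers = stats["tier_distribution"]
    calls = sum(tiers.values())
    return {
        "baseline_usd": round(baseline, 3),
        "saved_pct": round(100 * saved / baseline, 1) if baseline else 0.0,
        "tier_share_pct": {t: round(100 * n / calls, 1) for t, n in tiers.items()} if calls else {},
    }


stats = {
    "total_calls": 301,
    "total_cost_usd": 0.055,
    "total_saved_usd": 0.307,
    "tier_distribution": {"slm": 269, "mid": 14, "frontier": 18},
}
print(savings_summary(stats))
```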