Overview
MCP gives you a growing ecosystem of tool servers — web fetch, filesystem, databases, search, your own APIs. But wiring those tools into every app and every model is repetitive. aiproxy makes that infrastructure reusable and model-agnostic: define your MCP servers and LLM backends once, compose them into named assistants, and every OpenAI-compatible app in your stack gets a tool-augmented model for free — no client changes, no SDK lock-in.
```mermaid
flowchart LR
    client(["Any OpenAI client"])
    client -- "POST /v1/chat/completions" --> gate
    subgraph proxy["aiproxy"]
        direction TB
        gate["auth chain<br/>static keys | Apiman"]
        loop(["agent loop"])
        backend["backend adapter<br/>OpenAI-compat | native Anthropic"]
        mcp["MCP servers<br/>fetch | filesystem | http ..."]
        gate --> loop
        loop -- "LLM turn" --> backend
        backend -. "assistant / tool_calls" .-> loop
        loop -- "tool calls" --> mcp
        mcp -. "results" .-> loop
    end
    backend -- "chat / messages API" --> llm(["Upstream LLM"])
    proxy -- "OpenAI response" --> client
```
- Streaming & non-streaming /v1/chat/completions + /v1/models. Works with the OpenAI SDKs, LangChain, LlamaIndex, curl.
- OpenAI-compatible backends (OpenAI, Groq, vLLM, Ollama, …) and native Anthropic, behind one interface.
- MCP over stdio, sse, streamable-http. Persistent sessions, namespaced tools, concurrent execution.
- Add / edit / remove assistants, backends and MCP servers without a restart.
Quick start
Run the prebuilt multi-arch image from the GitHub Container Registry:
```bash
# configure secrets + assistants
cp .env.example .env
cp config.example.yaml config.yaml

docker run --rm -p 8000:8000 --env-file .env \
  -v "$PWD/config.yaml:/app/config.yaml:ro" \
  ghcr.io/sirmmo/aiproxy:latest
```
Then talk to it exactly like OpenAI:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "research-assistant",
    "messages": [{"role":"user","content":"Summarize https://modelcontextprotocol.io"}]
  }'
```
…or with the OpenAI Python SDK — no code changes beyond the base URL:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-or-PROXY_API_KEY")

resp = client.chat.completions.create(
    model="research-assistant",  # an assistant, not a raw model
    messages=[{"role": "user", "content": "What's on the MCP homepage?"}],
)
print(resp.choices[0].message.content)
```
Streaming works the same way (stream=True): content tokens flow through while tool rounds run transparently between them.
Configuration
Everything is declared in config.yaml. ${VAR} / ${VAR:-default} are
expanded from the environment, so keep secrets in .env.
```yaml
mcp_servers:
  fetch:                       # a reusable MCP server
    transport: stdio
    command: uvx
    args: ["mcp-server-fetch"]

backends:
  anthropic:                   # a wrapped LLM provider
    kind: anthropic            # or "openai" for any compat endpoint
    base_url: https://api.anthropic.com/v1
    api_key: ${ANTHROPIC_API_KEY}

assistants:
  - name: research-assistant   # ← clients pass this as `model`
    backend: anthropic
    model: claude-sonnet-5
    system_prompt: "You are a meticulous research assistant. Cite your sources."
    mcp_servers: [fetch]
    max_tool_iterations: 8
    temperature: 0.2
```
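To make the `${VAR}` / `${VAR:-default}` expansion concrete, here is a minimal sketch of how such substitution could work. It is not aiproxy's actual code: the function name `expand_env`, the accepted variable-name grammar, and the choice to raise on an unset variable without a default are all assumptions made for illustration.

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}. The exact grammar aiproxy accepts is an assumption.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(text, env=None):
    """Expand ${VAR} and ${VAR:-default} in text using env (defaults to os.environ)."""
    env = os.environ if env is None else env

    def replace(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:
            return value              # set and non-empty
        if default is not None:
            return default            # shell-style ':-' applies when unset OR empty
        if value is not None:
            return value              # set but empty, no default given
        # Hypothetical policy: fail loudly rather than silently inserting nothing.
        raise KeyError(f"environment variable {name} is not set")

    return _VAR.sub(replace, text)


if __name__ == "__main__":
    print(expand_env("api_key: ${ANTHROPIC_API_KEY:-missing}", {}))
```

Run on the raw config text before YAML parsing, this keeps secrets out of config.yaml while still letting non-secret settings carry inline defaults.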
Backends
| kind | Talks to | Auth header |
|---|---|---|
| openai | Any OpenAI-compatible /chat/completions: OpenAI, Groq, Together, Mistral, vLLM, Ollama (/v1), LM Studio, OpenRouter… | Authorization: Bearer |
| anthropic | Native Anthropic /messages | x-api-key |
The Anthropic backend translates between the canonical chat messages and the Messages API (system prompt,
tool_use/tool_result blocks, streaming events, stop-reason mapping), so tool use
works first-class with Claude.
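The translation described above can be sketched roughly as follows for the non-streaming case. This is an illustration of the mapping between the two public message formats, not aiproxy's adapter: the names `to_anthropic` and `STOP_REASONS` are hypothetical, and merging consecutive tool results into one user turn is a design assumption.

```python
import json

# Anthropic stop_reason -> OpenAI finish_reason
STOP_REASONS = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def to_anthropic(messages):
    """Convert OpenAI-style chat messages to (system, messages) for the Messages API."""
    system_parts, out = [], []
    for msg in messages:
        role = msg["role"]
        if role == "system":
            # Anthropic takes the system prompt as a top-level field, not a message.
            system_parts.append(msg["content"])
        elif role == "tool":
            block = {
                "type": "tool_result",
                "tool_use_id": msg["tool_call_id"],
                "content": msg["content"],
            }
            last = out[-1] if out else None
            if (last and last["role"] == "user" and isinstance(last["content"], list)
                    and last["content"][-1].get("type") == "tool_result"):
                last["content"].append(block)   # merge consecutive tool results
            else:
                out.append({"role": "user", "content": [block]})
        elif role == "assistant" and msg.get("tool_calls"):
            blocks = []
            if msg.get("content"):
                blocks.append({"type": "text", "text": msg["content"]})
            for call in msg["tool_calls"]:
                fn = call["function"]
                blocks.append({
                    "type": "tool_use",
                    "id": call["id"],
                    "name": fn["name"],
                    # OpenAI carries arguments as a JSON string; Anthropic wants an object.
                    "input": json.loads(fn["arguments"] or "{}"),
                })
            out.append({"role": "assistant", "content": blocks})
        else:
            out.append({"role": role, "content": msg["content"]})
    return "\n\n".join(system_parts), out
```

The reverse direction (Messages API response back to a `chat.completion`) would use `STOP_REASONS` to fill `finish_reason` and turn `tool_use` blocks back into `tool_calls`.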
MCP servers
| transport | Fields |
|---|---|
| stdio | command, args, env, cwd |
| sse | url, headers |
| http / streamable-http | url, headers |
Tools are exposed to the model as <server>__<tool> and routed back to the right
server on call. Sessions are persistent (one subprocess per stdio server, reused across requests) and
started lazily on first use. Node (npx) and uvx are baked into the image, so most
community MCP servers install on demand.
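As an illustration of the `<server>__<tool>` naming scheme, the sketch below shows one way to build a namespaced tool name and route it back to its server. The helper names `qualify` and `route` are hypothetical; splitting on the first `__` is an assumption that only holds if server names never contain a double underscore.

```python
SEP = "__"


def qualify(server, tool):
    """Name under which a server's tool is shown to the model."""
    return f"{server}{SEP}{tool}"


def route(qualified, servers):
    """Split a namespaced tool name back into (server, tool).

    Splits on the first separator, so tool names may contain '__'
    but server names may not (an assumption of this sketch).
    """
    server, sep, tool = qualified.partition(SEP)
    if not sep or not tool or server not in servers:
        raise ValueError(f"unknown tool: {qualified!r}")
    return server, tool


if __name__ == "__main__":
    known = {"fetch", "filesystem"}
    print(route(qualify("filesystem", "read_file"), known))
```

Namespacing this way lets two servers expose tools with the same name without colliding in the schema sent to the model.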
Assistants
An assistant is a virtual model exposed via the OpenAI model field. It binds one backend,
a system prompt, and a set of MCP servers, plus a tool-loop budget.
| Field | Meaning |
|---|---|
| name | What clients pass as model |
| backend | Which configured backend to call |
| model | The upstream model id (e.g. gpt-4o, claude-sonnet-5) |
| system_prompt | Prepended if the request has no system message |
| mcp_servers | List of MCP servers whose tools are attached |
| max_tool_iterations | Tool-loop budget; the final turn drops tools to force an answer |
| temperature, top_p, max_tokens | Defaults; client-supplied params override them |
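The two merge rules in the table (system prompt only when the request has none, client parameters over assistant defaults) could look like this in code. It is a sketch under the assumption that assistants and requests are plain dicts; `build_upstream_request` is a hypothetical name, not part of aiproxy.

```python
SAMPLING_KEYS = ("temperature", "top_p", "max_tokens")


def build_upstream_request(assistant, request):
    """Combine an assistant definition with a client request."""
    messages = list(request["messages"])

    # system_prompt is prepended only if the client sent no system message.
    has_system = any(m["role"] == "system" for m in messages)
    if assistant.get("system_prompt") and not has_system:
        messages.insert(0, {"role": "system", "content": assistant["system_prompt"]})

    # Assistant values are defaults; anything the client supplies wins.
    params = {}
    for key in SAMPLING_KEYS:
        if request.get(key) is not None:
            params[key] = request[key]
        elif assistant.get(key) is not None:
            params[key] = assistant[key]

    # The client's `model` is the assistant name; upstream sees the real model id.
    return {"model": assistant["model"], "messages": messages, **params}
```

For example, an assistant with `temperature: 0.2` receiving a request with `temperature: 0.9` would call upstream with 0.9, while a request that omits the field gets 0.2.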
OpenAI API
| Method & path | Purpose |
|---|---|
| GET /v1/models | List configured assistants as OpenAI models |
| POST /v1/chat/completions | Chat completion; runs the MCP tool loop. Supports stream |
How a request flows
- Client posts to /v1/chat/completions with model: "<assistant>".
- The gateway resolves the assistant → backend + MCP servers, and ensures those servers are connected.
- It builds the OpenAI tool schema and enters the agent loop: call the LLM; if it requests tools, execute them concurrently against the MCP servers and feed results back; repeat until the model answers or max_tool_iterations is hit.
- Returns a standard chat.completion (or streams chat.completion.chunks), with the assistant name as model and aggregated token usage.
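The agent loop in the steps above can be sketched as follows for the non-streaming case. This is a simplified illustration, not aiproxy's implementation: `call_llm` and `call_tool` are hypothetical async callables standing in for the backend adapter and the MCP sessions, and the budget is interpreted here as "up to `max_tool_iterations` tool rounds, then one last turn without tools", which is an assumption.

```python
import asyncio
import json


async def run_agent_loop(call_llm, call_tool, messages, tools, max_tool_iterations=8):
    """Drive LLM turns and tool rounds until the model produces an answer."""
    history = list(messages)
    for turn in range(max_tool_iterations + 1):
        # On the final turn no tools are offered, which forces a plain answer.
        offered = tools if turn < max_tool_iterations else None
        reply = await call_llm(history, offered)
        history.append(reply)

        calls = reply.get("tool_calls") or []
        if not calls or offered is None:
            return reply

        # Execute all requested tool calls concurrently.
        results = await asyncio.gather(*(
            call_tool(c["function"]["name"], json.loads(c["function"]["arguments"] or "{}"))
            for c in calls
        ))
        # Feed each result back, keyed by the id of the call that produced it.
        for call, result in zip(calls, results):
            history.append({"role": "tool", "tool_call_id": call["id"], "content": result})
```

Because `asyncio.gather` preserves argument order, each result is paired with the call that requested it even when the tools finish out of order.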
Admin API
Mutate the live registry without restarting (set ADMIN_API_KEY to protect it):
```bash
# see what tools a server actually advertises
curl localhost:8000/admin/mcp/fetch/tools

# add / replace an assistant at runtime
curl -X PUT localhost:8000/admin/assistants/coder \
  -H "Content-Type: application/json" \
  -d '{"backend":"openai","model":"gpt-4o","mcp_servers":["filesystem"],
       "system_prompt":"You are a coding agent."}'
```
| Method & path | Purpose |
|---|---|
| GET /admin/config | Dump current registry (secrets redacted) |
| GET/PUT/DELETE /admin/assistants[/{name}] | Manage assistants |
| GET/PUT/DELETE /admin/backends[/{name}] | Manage LLM backends |
| GET/PUT/DELETE /admin/mcp[/{name}] | Manage MCP servers |
| GET /admin/mcp/{name}/tools | Introspect a server's tools |
Use GET /admin/config to export current state and persist it into config.yaml yourself.
Auth
/v1/* auth is a pluggable chain — a request is authorized if any enabled
provider accepts it, so static keys and Apiman run in parallel. The
caller's key is read from Authorization: Bearer, the X-API-Key header, or the
?apikey= query param.
| Provider | Enable with | Accepts when… |
|---|---|---|
| Static keys | non-empty proxy_api_keys | the key matches an entry |
| Apiman gateway_probe | apiman.mode: gateway_probe | the key validates via a round-trip through the Apiman gateway (2xx), cached |
| Apiman trusted_header | apiman.mode: trusted_header | the request carries the shared secret the gateway injects |
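A minimal sketch of the "any provider accepts" chain, including key extraction from the three locations listed above. The function names are hypothetical, the precedence order between header and query parameter is an assumption, and what happens when no provider is enabled is deliberately left out because the text does not specify it.

```python
import hmac
from urllib.parse import parse_qs, urlsplit


def extract_key(headers, url):
    """Read the caller's key from Authorization: Bearer, X-API-Key, or ?apikey=."""
    lowered = {k.lower(): v for k, v in headers.items()}
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    if lowered.get("x-api-key"):
        return lowered["x-api-key"]
    values = parse_qs(urlsplit(url).query).get("apikey")
    return values[0] if values else None


def static_keys(allowed):
    """Provider that accepts a key if it matches any configured entry."""
    encoded = [k.encode() for k in allowed]

    def accepts(key):
        if key is None:
            return False
        # compare_digest avoids leaking match position through timing.
        return any(hmac.compare_digest(key.encode(), a) for a in encoded)

    return accepts


def is_authorized(key, providers):
    """A request passes if ANY enabled provider accepts it."""
    return any(provider(key) for provider in providers)
```

An Apiman provider would plug into the same list as another callable, so static keys and Apiman checks can coexist without knowing about each other.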
Apiman — gateway_probe
aiproxy stays directly reachable and validates each caller's key against Apiman. Register a small "auth check" API whose backend is aiproxy's /health:
```yaml
apiman:
  enabled: true
  mode: gateway_probe
  gateway_url: http://apiman-gateway:8080/apiman-gateway
  probe_api: aiproxy/authcheck/1.0   # {org}/{api}/{version}
  probe_path: health                 # backend path that returns 2xx
  cache_ttl: 60
```
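The effect of `cache_ttl` can be illustrated with a small cache wrapped around the probe. This is a sketch only: `ProbeCache` is a hypothetical class, the `probe` callable stands in for the HTTP round-trip through the Apiman gateway, and caching rejections as well as acceptances is an assumption.

```python
import time


class ProbeCache:
    """Remember probe verdicts per key for ttl seconds."""

    def __init__(self, probe, ttl=60.0, clock=time.monotonic):
        self._probe = probe      # callable(key) -> bool, e.g. "did the gateway return 2xx?"
        self._ttl = ttl
        self._clock = clock      # injectable so the cache can be tested without sleeping
        self._entries = {}       # key -> (accepted, expires_at)

    def accepts(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]        # still fresh: no round-trip
        accepted = self._probe(key)
        self._entries[key] = (accepted, now + self._ttl)
        return accepted
```

With `cache_ttl: 60`, a caller sending many requests per minute would trigger roughly one gateway round-trip per minute instead of one per request; a key revoked in Apiman may remain accepted until its entry expires.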
Apiman — trusted_header
Put the Apiman gateway in front of aiproxy; an "Add Header" policy injects a shared secret aiproxy trusts (it never sees raw keys):
```yaml
apiman:
  enabled: true
  mode: trusted_header
  header: X-Apiman-Gateway-Token
  secret: ${APIMAN_SHARED_SECRET}
```
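Checking the injected header comes down to a constant-time comparison against the configured secret. The sketch below assumes headers arrive as a plain dict and that an empty or missing secret never authorizes anything; `trusted_header_ok` is a hypothetical helper, not aiproxy's function.

```python
import hmac


def trusted_header_ok(headers, header_name, secret):
    """Accept the request only if the gateway-injected header equals the shared secret."""
    lowered = {k.lower(): v for k, v in headers.items()}   # header names are case-insensitive
    presented = lowered.get(header_name.lower())
    if not presented or not secret:
        return False
    # Constant-time comparison so the secret cannot be guessed byte by byte via timing.
    return hmac.compare_digest(presented.encode(), secret.encode())
```

This mode is only safe if aiproxy is unreachable except through the gateway; otherwise anyone who learns the secret can set the header directly.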
/admin/* is separate: if ADMIN_API_KEY is set, admin calls must send Authorization: Bearer <ADMIN_API_KEY>. When it is unset, the admin API is open (handy for local dev; lock it down in production).
Development
The recommended path is Docker — a clean Python 3.12 with node/uvx available.
```bash
docker build -t aiproxy:latest .

# end-to-end check: spawns the demo MCP server and drives the full agent
# loop (streaming + non-streaming) with a scripted fake LLM. No API key needed:
docker run --rm aiproxy:latest python scripts/smoke_test.py
```
See the README for the full reference and a no-Docker (uv) workflow.