
Multi-Provider — pick the right model per phase

AIFactory routes each pipeline phase to its own model. You can plan with Claude Opus, code with a local Ollama qwen3:14b, and validate with Claude Sonnet — all in one task.

Supported providers

| Provider | Models | Use case |
|---|---|---|
| Anthropic (via Claude Agent SDK) | Opus 4.x, Sonnet 4.x, Haiku 4.x | Default for planning + QA: highest quality, integrated with MCP servers |
| Codex CLI | gpt-5.3-codex and other OpenAI Codex models via the local CLI | Fastest reliable agentic coding (see the provider benchmark) |
| GitHub Copilot CLI | copilot:claude-sonnet-4.5, copilot:claude-sonnet-4, copilot:gpt-5 | Run builds on your existing GitHub Copilot subscription, with no extra API key. Copilot is a router over Claude/GPT-5 backends |
| Gemini CLI | Google Gemini 3.x Pro | Capable agentic coding; the isolated worktree is trusted automatically so it can edit files |
| Ollama | qwen2.5-coder:*, qwen3-coder:*, llama3.x:*, deepseek-coder:*, any local model | Free, offline, air-gapped coding. Needs an adequately sized model + GPU; see Local models: sizing & hardware |
| OpenAI | gpt-4o, gpt-4.1, o3-mini | Drop-in alternative where licensing or compliance prefers it |
| OpenAI-compatible | LM Studio, vLLM, OpenRouter, Together, Groq, LocalAI | Any endpoint that speaks the OpenAI /v1/chat/completions shape |
| OpenCode CLI (community / self-host tier) | opencode:<provider/model>, e.g. opencode:anthropic/claude-sonnet-4-5 | Run builds through the OpenCode CLI runtime. Not enterprise-certified: its model catalogue comes from the remote models.dev registry, so models can change or disappear (there is no guaranteed free default). See the tier note below |

Provider tiers

Not every provider carries the same support guarantees:

  • Enterprise-certified — Claude (Agent SDK), Codex, AWS Bedrock, and Azure OpenAI. Stable model catalogues, compliance posture, and the integrations enterprise deployments depend on. Use one of these for production / regulated workloads.
  • Community / self-host tier — OpenCode and other self-hosted/OpenAI-compatible runtimes. Fully supported for self-hosting and evaluation, but not enterprise-certified: OpenCode in particular resolves its model list from the remote models.dev registry, so individual models (including "free" ones such as the former opencode/sonic) can be removed without notice. There is no hardcoded default — you must pass an explicit opencode:<provider/model> or set OPENCODE_DEFAULT_MODEL; otherwise the build fails fast with an actionable error rather than silently using a dead model.
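
The fail-fast behaviour described for OpenCode can be sketched as below. This is a minimal illustration, not AIFactory's actual code: the function name `resolve_opencode_model`, the exception class, and the assumption that `OPENCODE_DEFAULT_MODEL` may be written with or without the `opencode:` prefix are all hypothetical.

```python
import os
from typing import Mapping, Optional


class OpenCodeModelError(ValueError):
    """Raised when no usable OpenCode model has been configured (hypothetical name)."""


def resolve_opencode_model(requested: Optional[str],
                           env: Optional[Mapping[str, str]] = None) -> str:
    """Return an explicit opencode:<provider/model> string or fail fast.

    Assumption: an explicit request wins over OPENCODE_DEFAULT_MODEL, and
    there is deliberately no hardcoded fallback model.
    """
    env = os.environ if env is None else env
    candidate = (requested or env.get("OPENCODE_DEFAULT_MODEL", "")).strip()
    if candidate.startswith("opencode:"):
        candidate = candidate[len("opencode:"):]
    provider, sep, model = candidate.partition("/")
    if not (sep and provider and model):
        # Actionable error instead of silently picking a model that may be gone.
        raise OpenCodeModelError(
            "No OpenCode model configured: pass opencode:<provider/model> "
            "or set OPENCODE_DEFAULT_MODEL."
        )
    return f"opencode:{candidate}"
```

Refusing to guess a default is the point of the design: a registry-backed catalogue can drop a model at any time, so an explicit choice plus a clear error is safer than a stale fallback.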

How routing works

Each task has a phase profile — a mapping from phase name to model string. Example:

{
  "phaseModels": {
    "spec": "sonnet",
    "planning": "opus",
    "coding": "ollama:qwen3:14b",
    "qa": "sonnet",
    "qa_fixer": "sonnet"
  }
}

The backend's phase_config.infer_provider_from_model() parses the model string and picks the right provider:

  • sonnet, opus, haiku, claude-* → Claude Agent SDK
  • ollama:<model> → Ollama
  • copilot:<backend> → GitHub Copilot CLI (checked before the claude-*/gpt-* rules, since Copilot's own backend names are claude-sonnet-4.5 / gpt-5)
  • gpt-*, *codex* → Codex CLI
  • gemini-* → Gemini CLI
  • <endpoint>:<model> (with custom endpoint registered in Settings → LLM Providers) → OpenAI-compatible
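
The rules above can be approximated with a short function. This is an illustrative sketch, not the real `phase_config.infer_provider_from_model()`: the name `infer_provider`, the provider label strings, and the `custom_endpoints` parameter are assumptions, and only the rules listed above are covered (the OpenAI and OpenCode rows of the provider table are left out because their routing rules are not stated here).

```python
from typing import Iterable

# Hypothetical provider labels; the real backend may use different identifiers.
CLAUDE = "claude-agent-sdk"
OLLAMA = "ollama"
COPILOT = "copilot-cli"
CODEX = "codex-cli"
GEMINI = "gemini-cli"
OPENAI_COMPAT = "openai-compatible"


def infer_provider(model: str, custom_endpoints: Iterable[str] = ()) -> str:
    """Map a model string to a provider label, following the documented rules."""
    name = model.strip().lower()
    # Prefixed forms are checked first, so "copilot:claude-sonnet-4.5"
    # is never mistaken for a bare claude-* or gpt-* model.
    if name.startswith("copilot:"):
        return COPILOT
    if name.startswith("ollama:"):
        return OLLAMA
    prefix, sep, _rest = name.partition(":")
    if sep and prefix in {e.lower() for e in custom_endpoints}:
        return OPENAI_COMPAT
    if name in {"sonnet", "opus", "haiku"} or name.startswith("claude-"):
        return CLAUDE
    if name.startswith("gpt-") or "codex" in name:
        return CODEX
    if name.startswith("gemini-"):
        return GEMINI
    raise ValueError(f"cannot infer provider for model string: {model!r}")


# Resolve every phase of the example profile shown earlier.
profile = {"spec": "sonnet", "planning": "opus", "coding": "ollama:qwen3:14b",
           "qa": "sonnet", "qa_fixer": "sonnet"}
routes = {phase: infer_provider(m) for phase, m in profile.items()}
```

Note that `ollama:qwen3:14b` contains two colons; the sketch only inspects the text before the first one, so the model tag is left intact for the provider.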

Where to configure

  • Per task — Task Creation Wizard → Agent Profile dropdown
  • Per profile — Settings → Agent Profile (create reusable profiles)
  • Per endpoint — Settings → LLM Providers (register your endpoints; API keys are encrypted at rest)

Local models: sizing & hardware

Local (Ollama) coding is free and offline, but unlike a one-shot chat it has to drive a multi-step agentic loop: read files, call tools, write code, react to results. That asks far more of a model than autocomplete, and model size matters a lot. From our provider benchmark:

| Model class | What to expect on a real multi-file task |
|---|---|
| ≤ 7B | Not recommended: rarely sustains the tool-calling loop. |
| 14B (e.g. qwen2.5-coder:14b) | Now produces real code (after the small-model fix below), but typically can't finish a whole multi-file feature. Good for single-file edits and smoke tests. |
| 27–32B (e.g. qwen3-coder, qwen2.5-coder:32b, deepseek-coder-v2) | The realistic floor for completing full tasks locally. Slower than cloud, review more. |
| 70B+ | Best local quality, but needs serious hardware. |

The small-model fix. Small local models often emit a tool call as a ```json {…}``` text block instead of the native tool_calls field, and tend to loop on Read without ever writing. AIFactory now parses those text-emitted tool calls and nudges the model to write once it has read enough — so a 14B goes from writing nothing to writing real code. (Ported from the sister TFactory project.)
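
A minimal sketch of the first half of that fix, recovering tool calls that a model printed as fenced JSON text, might look like this. It is not AIFactory's implementation: the function name and the assumed payload shape (an object with a `name` string and an `arguments` object) are illustrative assumptions.

```python
import json
import re

# A fenced block (three backticks, optional "json" tag) wrapping one JSON object.
_FENCED_JSON = re.compile(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", re.DOTALL)


def extract_text_tool_calls(text: str) -> list:
    """Return tool calls found in fenced JSON blocks of a model's text reply.

    Assumed shape: {"name": "<tool>", "arguments": {...}}. Blocks that are
    not valid JSON or lack a string "name" are ignored.
    """
    calls = []
    for match in _FENCED_JSON.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # ordinary code sample or malformed JSON, not a tool call
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            args = payload.get("arguments", {})
            calls.append({"name": payload["name"],
                          "arguments": args if isinstance(args, dict) else {}})
    return calls
```

The extracted calls can then be dispatched exactly like entries from the native tool_calls field, which is what lets a small model make progress despite using the wrong output channel.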

Hardware, rough guide (4-bit quantized, with a 32K context window):

| Model size | VRAM (approx) | Example GPU |
|---|---|---|
| 14B | ~10–12 GB | RTX 4070/4080, or a 16 GB card |
| 27–32B | ~24–32 GB | RTX 3090/4090 (24 GB), or better |
| 70B | ~48 GB+ | A6000 / dual-24 GB / data-center cards |

Run Ollama on a dedicated GPU box, not your daily-driver desktop — a 27B model will pin the GPU and can take down a desktop session. Keep some VRAM headroom for the context window.
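
As a rough sanity check on the VRAM figures, the weights alone can be estimated from the parameter count. The sketch below assumes exactly 4 bits per weight and 1 GB = 1e9 bytes; real 4-bit formats usually store somewhat more than 4 bits per weight, and the function name is hypothetical.

```python
def quantized_weights_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for size in (14, 32, 70):
    print(f"{size}B weights at 4-bit: ~{quantized_weights_gb(size):.0f} GB")
# prints ~7 GB, ~16 GB and ~35 GB respectively
```

The gap between these weight-only numbers and the table's VRAM column is the headroom that a 32K context window (KV cache) and runtime overhead consume.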

The rule we never break

Claude interactions always route through apps/backend/core/client.py::create_client(). Never raw anthropic.Anthropic(). This is enforced in code review and is the only way OAuth-token auth + MCP server integration + per-agent tool permissions all work together.
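
Code review is the stated enforcement mechanism; as a complement, a reviewer could automate the check. The sketch below is a hypothetical helper, not part of AIFactory, that flags direct `anthropic.Anthropic(...)` calls in a Python source string using the standard `ast` module.

```python
import ast


def find_raw_anthropic_clients(source: str) -> list:
    """Return sorted line numbers of direct anthropic.Anthropic(...) calls.

    Limitation: only the `anthropic.Anthropic(...)` spelling is detected;
    `from anthropic import Anthropic` followed by `Anthropic()` is not.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == "Anthropic"
                and isinstance(func.value, ast.Name)
                and func.value.id == "anthropic"):
            hits.append(node.lineno)
    return sorted(hits)
```

A file that obtains its client through create_client() produces an empty list, so the helper could run in CI over the backend sources and fail on any non-empty result.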