AIFactory Blog

We ran the same build through every LLM — what won, what broke, and where local models stand

2026-05-31T00:00:00.000Z

AIFactory claims to be provider-agnostic: the same build task should run on Claude, Codex, Gemini, GitHub Copilot, or a local Ollama model. A claim like that is worth nothing until you test it — so we did, with the same task, on every provider, and we re-ran the tests ourselves rather than trusting the agent's word.

The short version: every managed provider produced a working, tested feature — and the process surfaced (and fixed) four real bugs. Local models are a different story, and that story is the interesting one.

The test

One small but non-trivial, uniformly testable feature, added to the same FastAPI demo repo by every provider:

A tictactoe module (3×3 board, win/draw detection, invalid-move handling), exposed via a POST /tictactoe/move endpoint, plus a pytest suite covering a win, a draw, and a rejected move.

Python + pytest on purpose: "result = working" is objectively verifiable by running the tests, and the bar is identical for everyone. Each provider got the same prompt, ran in its own isolated git worktree, and stopped at the human-review gate. Then we re-ran pytest on each branch in a clean virtualenv — these are not self-reported numbers.
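To make the size of the task concrete, here is a minimal sketch of the kind of module and tests the prompt asks for. It is not any provider's actual output: the names `Game`, `InvalidMove`, and `play` are hypothetical, the cell numbering (0 to 8, row by row) is an assumption, and the POST /tictactoe/move endpoint is left out so the block runs with the standard library alone.

```python
from typing import List, Optional

# All eight winning lines on a 3x3 board, cells numbered 0..8 row by row.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


class InvalidMove(ValueError):
    """Raised for an out-of-range cell, a taken cell, or a move after game end."""


class Game:
    """Hypothetical game state; X always moves first."""

    def __init__(self) -> None:
        self.board: List[Optional[str]] = [None] * 9
        self.next_player = "X"
        self.status = "in_progress"  # or "X_won", "O_won", "draw"

    def move(self, cell: int) -> str:
        if self.status != "in_progress":
            raise InvalidMove("game is already over")
        if not 0 <= cell <= 8:
            raise InvalidMove(f"cell {cell} is outside the board")
        if self.board[cell] is not None:
            raise InvalidMove(f"cell {cell} is already taken")

        player = self.next_player
        self.board[cell] = player
        if any(all(self.board[i] == player for i in line) for line in WIN_LINES):
            self.status = f"{player}_won"
        elif all(c is not None for c in self.board):
            self.status = "draw"
        else:
            self.next_player = "O" if player == "X" else "X"
        return self.status


def play(cells: List[int]) -> Game:
    """Replay a list of moves on a fresh game and return it."""
    game = Game()
    for cell in cells:
        game.move(cell)
    return game


# pytest collects plain functions named test_*; no pytest import is needed.
def test_win() -> None:
    assert play([0, 3, 1, 4, 2]).status == "X_won"


def test_draw() -> None:
    assert play([0, 1, 2, 4, 3, 5, 7, 6, 8]).status == "draw"


def test_rejected_move() -> None:
    game = play([0])
    try:
        game.move(0)  # cell 0 is already taken
    except InvalidMove:
        return
    raise AssertionError("expected InvalidMove")
```

Saved as a single file, this can be run with pytest, which is the same pass/fail bar applied to every branch in the benchmark.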

The results

Provider	Model	Time	Tests (independent `pytest`)
Codex	`gpt-5.3-codex`	4m 56s	✅ 7 / 7 pass
Claude	`claude-sonnet-4-6`	9m 29s	✅ 9 / 9 pass
Gemini	`gemini-3.1-pro-preview`	19m 37s	✅ 16 / 16 pass
Copilot	`copilot:claude-sonnet-4.5`	~25m	✅ 4 / 4 pass
Ollama (local)	`qwen2.5-coder:14b`	~3m	⚠️ partial — code, not a finished feature

Fastest working build: Codex. Roughly twice as fast as the runner-up (Claude) and four to five times faster than Gemini and Copilot, with passing tests.
Cleanest first pass: Claude. Followed the spec exactly, 9/9 on the first run, no fixes.
Most thorough: Gemini. Wrote the largest suite, including idiomatic async API tests.
Most convenient: Copilot. Runs on an existing GitHub Copilot subscription, no extra key — it's a router over Claude/GPT-5 backends. Also the slowest.

The four bugs the benchmark found (and we fixed)

Running every provider through the same pipeline is a great way to find provider bugs:

Codex wrote nothing on the first run. Root cause: the agentic provider drained only stdout while the Codex MCP server filled its stderr pipe buffer and blocked. Fixed by draining stderr concurrently. The after-fix run is the fastest in the table.
Gemini's code was correct but its tests "failed". It wrote idiomatic async tests, but the repo had no pytest-asyncio configured. Fixed by configuring async test support — 8/10 became 16/16 on a fresh run.
Copilot wasn't supported at all — so we added a provider for it.
Ollama wrote nothing — the most interesting one. More below.
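The Codex stall is a classic pipe deadlock: if a parent reads only stdout, a child that writes enough to stderr fills that pipe's buffer and blocks. Below is a minimal Python sketch of the general remedy (drain both pipes concurrently). It illustrates the pattern only and is not AIFactory's actual fix; `run_and_drain` and the demo child command are hypothetical.

```python
import subprocess
import sys
import threading
from typing import List, Tuple


def run_and_drain(cmd: List[str]) -> Tuple[int, str, str]:
    """Run cmd, reading stdout and stderr on separate threads so neither pipe fills."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    out_lines: List[str] = []
    err_lines: List[str] = []

    def drain(stream, sink: List[str]) -> None:
        # Iterating the stream reads until the child closes its end of the pipe.
        for line in stream:
            sink.append(line)
        stream.close()

    threads = [
        threading.Thread(target=drain, args=(proc.stdout, out_lines)),
        threading.Thread(target=drain, args=(proc.stderr, err_lines)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return proc.wait(), "".join(out_lines), "".join(err_lines)


# Demo child: writes far more to stderr than a typical OS pipe buffer holds,
# then prints one line on stdout. Reading only stdout could hang on this child.
CHILD = "import sys; sys.stderr.write('x' * 200_000 + '\\n'); print('done')"
code, out, err = run_and_drain([sys.executable, "-c", CHILD])
```

For a one-shot command, `Popen.communicate()` handles both pipes for you; separate reader threads are the shape to reach for when output has to be consumed while the process is still running, as with a long-lived agent or MCP server.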

Ollama: what works, what needs doing

Agentic coding on a local model is fundamentally harder than chatting with a cloud model. It isn't autocomplete: the model has to drive a multi-step agentic loop (read files, call tools, write code, react to results), and that asks a lot of a small model.

What we fixed. Small local models (qwen, llama) frequently describe a tool call as text — a JSON blob in the message body — instead of using the native tool-call field, and they tend to loop on reading without ever writing. AIFactory now parses those text-emitted tool calls and nudges the model to write once it has read enough. With that fix, qwen2.5-coder:14b went from writing zero files to writing real code. (We ported the fix from our sister project, TFactory, which had already solved it.)
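As a rough illustration of the parsing half of that fix, the sketch below scans a message body for a JSON object that looks like a tool call. The `{"name": ..., "arguments": {...}}` shape, the `read_file` tool name, and the function name `extract_text_tool_call` are assumptions made for the example, not AIFactory's actual schema or code.

```python
import json
from typing import Any, Dict, Optional


def extract_text_tool_call(message: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object in message shaped like a tool call, else None."""
    decoder = json.JSONDecoder()
    pos = message.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores
            # whatever prose follows it.
            obj, _end = decoder.raw_decode(message, pos)
        except json.JSONDecodeError:
            obj = None
        if (
            isinstance(obj, dict)
            and isinstance(obj.get("name"), str)
            and isinstance(obj.get("arguments"), dict)
        ):
            return {"name": obj["name"], "arguments": obj["arguments"]}
        pos = message.find("{", pos + 1)  # try the next candidate brace
    return None


reply = (
    "I will read the file first.\n"
    '{"name": "read_file", "arguments": {"path": "app/main.py"}}'
)
call = extract_text_tool_call(reply)
```

A caller would presumably check the provider's native tool-call field first and fall back to a parser like this only when that field is empty, so well-behaved models are unaffected.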

What still needs doing. A 14B model now starts the job but usually can't finish a full multi-file feature — in our run it produced a partial module and stopped after the first subtask. That's not an AIFactory limit; it's a model-capacity limit.

Which Ollama models, and what hardware

Model class	What to expect on a real multi-file task
≤ 7B	Not recommended — rarely sustains the tool-calling loop.
14B (`qwen2.5-coder:14b`)	Produces real code; good for single-file edits and smoke tests; usually won't finish a whole feature.
27–32B (`qwen3-coder`, `qwen2.5-coder:32b`, `deepseek-coder-v2`)	The realistic floor for completing full tasks locally.
70B+	Best local quality, needs serious hardware.

Rough hardware guide (4-bit quantized, 32K context):

Model size	VRAM (approx)	Example GPU
14B	~10–12 GB	RTX 4070/4080
27–32B	~24–32 GB	RTX 3090/4090 (24 GB) or better
70B	~48 GB+	A6000 / dual-24 GB / data-center
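A back-of-the-envelope check on those numbers: at 4 bits per weight, the weights alone take about half a gigabyte per billion parameters. The sketch below computes only that part. The helper name `weight_gb` is hypothetical, and real 4-bit formats typically store somewhat more than 4 bits per weight, so treat the results as a lower bound.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the quantized weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Weights only; the KV cache and runtime overhead come on top of this.
for size in (14, 32, 70):
    print(f"{size}B -> ~{weight_gb(size):.0f} GB of weights")
```

The gap between these figures (about 7, 16, and 35 GB) and the table is the room left for the 32K-context KV cache and runtime overhead. How much that actually needs depends on the model architecture, which is why the table gives ranges.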

One hard-won tip: run Ollama on a dedicated GPU box, not your daily-driver desktop. While testing, a 27B model pinned the GPU hard enough to take down a desktop session. Keep VRAM headroom for the context window.

Why the cloud models are still king

If you want one honest sentence: for almost any real task today, a managed/cloud model is the right default, and local models are the special case — not the other way around.

Capability. The frontier cloud models are far larger than anything you'll run on a single GPU. They sustain the agentic loop, follow multi-file instructions, and write passing tests. All four managed providers ended up with a fully passing suite here, and Claude did it on the first run with no fixes; the local 14B couldn't.
Speed. Codex finished in under five minutes. A capable local model that can actually finish the task (27B+) runs several times slower on consumer hardware.
Cost, in context. Cloud runs of a task this size cost on the order of a few cents. "Free" local coding costs $0 in API fees but needs a real GPU and your patience — and below ~27B it often can't finish at all.

So when is local the right call? When you can't send code to a third party — air-gapped, regulated, or privacy-mandated environments. (That's the whole reason AIFactory is self-hostable; see Why we can't use Cursor at a bank.) In that case: use a 27B+ coder model on a 24 GB+ GPU, expect to review more, and lean on AIFactory's human-review gate. For everyone else, pick a managed provider and move fast — AIFactory lets you switch with a single model string, and even mix them per phase.

See it / run it yourself

The full, independently-verified results: Provider Benchmark.
Every provider run is reproducible with scripts/benchmark-provider.mjs in the repo — re-run pytest on each branch and check our numbers.

Provider-agnostic isn't a marketing line if you can prove it. Now you can.

Why we can't use Cursor at a bank — and what I built instead

2026-05-30T00:00:00.000Z

A friend who works at a bank told me their security team had just banned every cloud AI coding tool. Not because they're luddites — these are sharp engineers who'd love the productivity. They banned them because they can't send proprietary source code to a third party, and they can't explain to an auditor where a given line of code came from. They wanted AI's help. They weren't allowed to have it.

I kept hearing versions of this. The more I looked, the more I realized it isn't a niche complaint; it's the unspoken default for a huge slice of the industry. So I built something for it, open-sourced it, and this post is about why.

The problem isn't capability. It's trust.

We're past the point of arguing whether AI can write code. It can. The interesting question has moved: can you trust what it produced, and can you prove where it came from?

The data says no, and it's getting worse:

96% of developers don't fully trust AI-generated code — yet only 48% actually verify it (Sonar, 2026). That gap is where bugs and vulnerabilities live.
For 38% of teams, reviewing AI-written code now takes more effort than reviewing a human's. We automated the writing and quietly moved the cost to review.
~74% of organizations can't provide security provenance for AI-generated code. When the auditor asks "where did this come from and who approved it?", there's no answer.
Depending on the study, 40–62% of AI-generated code contains vulnerabilities or design flaws.

For a solo dev on a side project, fine. For a regulated team, that's a wall.

Why the cloud tools structurally can't fix this

It's tempting to think the SaaS vendors will just add an "enterprise mode" and the problem goes away. But the issues are structural, not cosmetic:

Data residency. The tool's value comes from ingesting your repo. If your compliance regime says source can't leave the perimeter, "enterprise SSO" doesn't help — the architecture is wrong.
No air-gap. Many regulated environments are network-isolated. A tool that phones home to a hosted model service can't run there at all.
Opaque actions. Most agents hand you a diff, not a defensible record of what they did, in what order, and what you approved.
Lock-in. Betting your whole dev workflow on one vendor's model and pricing is its own risk.

You can't bolt provenance and air-gap onto an architecture designed around "send us your code."

The thesis: autonomy and governance aren't opposites

Here's the conviction the whole project rests on: you can have an agent that ships code and a trail you can defend. Those aren't in tension — you just have to design for both from the start.

Concretely, that means:

Spec-first. Every run begins with a written spec and acceptance criteria — intent you can read and edit before anything happens.
Review-gated. You approve the plan before code is written, and the diff before it merges. A QA agent checks the result against the spec.
Isolated. Each task runs in its own git worktree. Nothing touches your working tree until you decide to merge.
Provenance by default. Every action is journaled in a hash-chained audit log. The spec, the plan, and the QA report all live on disk and in version control.
Self-hosted. It runs in your perimeter — your Kubernetes cluster, or just docker-compose on a laptop — against your choice of model, including a fully local one.
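For readers unfamiliar with the term, a hash-chained log stores, with each entry, a hash that covers both the entry and the previous entry's hash, so editing any past entry invalidates everything after it. The sketch below shows the general idea under stated assumptions: SHA-256, JSON-serialized events, and the field names `event`, `prev`, and `hash` are all choices made for the example, not AIFactory's actual on-disk format.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Append event, chaining its hash to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash from the start; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: List[Dict[str, Any]] = []
append(log, {"actor": "agent", "action": "plan_written"})
append(log, {"actor": "human", "action": "plan_approved"})
assert verify(log)

log[0]["event"]["action"] = "plan_skipped"  # tamper with history
assert not verify(log)
```

A chain like this makes tampering detectable, not impossible: someone who can rewrite the whole file can recompute every hash, which is presumably where signed evidence comes in.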

What I built

That's AIFactory. It turns a task into shipping code through a pipeline you watch and verify: spec → plan → code → QA, with human-review gates at each step. You bring your own model — Claude, OpenAI, Gemini, Codex, or a local Ollama / OpenAI-compatible endpoint — and you own the infrastructure it runs on. Every task lands in a hash-chained audit log, so afterwards you can show exactly what happened and who approved it.

It's open source (MIT) and I build it solo, full-time. There's a separate enterprise edition for organizations that need multi-tenant isolation, SAML/SCIM, and signed audit evidence — that's what funds the open core — but the core pipeline is free, and it's the part most people need.

If this is your problem too

If you're somewhere that wants AI's productivity but can't use the cloud tools — or you just don't want to merge code you can't account for — I'd genuinely like to hear what would make this usable for you. The repo is here: github.com/olafkfreund/AIFactory. Open an issue, or tell me where it falls short.

Autonomy you can't defend isn't worth much in the places that matter most. I think we can do better than "trust the diff."
