<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://aifactory.freundcloud.com/blog</id>
    <title>AIFactory Blog</title>
    <updated>2026-05-31T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://aifactory.freundcloud.com/blog"/>
    <subtitle>Governed, auditable, self-hostable autonomous coding.</subtitle>
    <icon>https://aifactory.freundcloud.com/img/favicon.ico</icon>
    <rights>Copyright © 2026 AIFactory contributors.</rights>
    <entry>
        <title type="html"><![CDATA[We ran the same build through every LLM — what won, what broke, and where local models stand]]></title>
        <id>https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider</id>
        <link href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider"/>
        <updated>2026-05-31T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[AIFactory is provider-agnostic, so we put it to the test: the identical coding task through Claude, Codex, Gemini, GitHub Copilot, and local Ollama, with independently verified tests. Here are the real results, the bugs we fixed along the way, and an honest take on local vs cloud.]]></summary>
        <content type="html"><![CDATA[<p>AIFactory claims to be provider-agnostic: the same build task should run on Claude, Codex,
Gemini, GitHub Copilot, or a local Ollama model. A claim like that is worth nothing until you
test it — so we did, with the <em>same</em> task, on <em>every</em> provider, and we re-ran the tests
ourselves rather than trusting the agent's word.</p>
<p>The short version: <strong>every managed provider produced a working, tested feature</strong> — and the
process surfaced (and fixed) four real bugs. Local models are a different story, and that story
is the interesting one.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-test">The test<a href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider#the-test" class="hash-link" aria-label="Direct link to The test" title="Direct link to The test" translate="no">​</a></h2>
<p>One small but non-trivial, uniformly testable feature, added to the same FastAPI demo repo by
every provider:</p>
<blockquote>
<p>A <code>tictactoe</code> module (3×3 board, win/draw detection, invalid-move handling), exposed via a
<code>POST /tictactoe/move</code> endpoint, plus a pytest suite covering a win, a draw, and a rejected move.</p>
</blockquote>
<p>Python + pytest on purpose: "result = working" is <strong>objectively verifiable</strong> by running the
tests, and the bar is identical for everyone. Each provider got the same prompt, ran in its own
isolated git worktree, and stopped at the human-review gate. Then we re-ran <code>pytest</code> on each
branch in a clean virtualenv — these are not self-reported numbers.</p>
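<p>For a sense of scale, here is a minimal sketch of the kind of module and tests the prompt asks for. It is an illustration only, not any provider's actual output: the names <code>Board</code>, <code>play</code>, <code>InvalidMove</code>, and <code>play_all</code> are assumptions, and the FastAPI endpoint is left out to keep it short.</p>
<pre><code class="language-python">WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


class InvalidMove(ValueError):
    """Raised for an out-of-range cell, an occupied cell, or a finished game."""


class Board:
    def __init__(self):
        self.cells = [None] * 9      # row-major 3x3 grid
        self.turn = "X"
        self.status = "in_progress"  # or "X_wins", "O_wins", "draw"

    def play(self, cell):
        """Place the current player's mark and return the new status."""
        if self.status != "in_progress":
            raise InvalidMove("game is already over")
        if cell not in range(9) or self.cells[cell] is not None:
            raise InvalidMove(f"cell {cell} is not playable")
        self.cells[cell] = self.turn
        if any(all(self.cells[i] == self.turn for i in line) for line in WIN_LINES):
            self.status = f"{self.turn}_wins"
        elif all(c is not None for c in self.cells):
            self.status = "draw"
        else:
            self.turn = "O" if self.turn == "X" else "X"
        return self.status


def play_all(moves):
    """Helper for tests: apply a sequence of moves to a fresh board."""
    board = Board()
    for cell in moves:
        board.play(cell)
    return board


def test_win():
    # X takes the top row while O plays two cells in the middle row.
    assert play_all([0, 3, 1, 4, 2]).status == "X_wins"


def test_draw():
    # The board fills up without any completed line.
    assert play_all([0, 1, 2, 4, 3, 5, 7, 6, 8]).status == "draw"


def test_rejected_move():
    board = play_all([0])
    try:
        board.play(0)  # cell 0 is already taken
    except InvalidMove:
        return
    raise AssertionError("expected InvalidMove")
</code></pre>
<p>The test functions use plain <code>assert</code>, so pytest collects them as-is; a real suite would more likely use <code>pytest.raises</code> for the rejected move.</p>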
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-results">The results<a href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider#the-results" class="hash-link" aria-label="Direct link to The results" title="Direct link to The results" translate="no">​</a></h2>
<table><thead><tr><th>Provider</th><th>Model</th><th>Time</th><th>Tests (independent <code>pytest</code>)</th></tr></thead><tbody><tr><td><strong>Codex</strong></td><td><code>gpt-5.3-codex</code></td><td><strong>4m 56s</strong></td><td>✅ 7 / 7 pass</td></tr><tr><td><strong>Claude</strong></td><td><code>claude-sonnet-4-6</code></td><td>9m 29s</td><td>✅ 9 / 9 pass</td></tr><tr><td><strong>Gemini</strong></td><td><code>gemini-3.1-pro-preview</code></td><td>19m 37s</td><td>✅ 16 / 16 pass</td></tr><tr><td><strong>Copilot</strong></td><td><code>copilot:claude-sonnet-4.5</code></td><td>~25m</td><td>✅ 4 / 4 pass</td></tr><tr><td>Ollama (local)</td><td><code>qwen2.5-coder:14b</code></td><td>~3m</td><td>⚠️ partial — code, not a finished feature</td></tr></tbody></table>
<ul>
<li class=""><strong>Fastest working build: Codex.</strong> Roughly twice as fast as the runner-up (Claude) and four to five times faster than Gemini and Copilot, with passing tests.</li>
<li class=""><strong>Cleanest first pass: Claude.</strong> Followed the spec exactly, 9/9 on the first run, no fixes.</li>
<li class=""><strong>Most thorough: Gemini.</strong> Wrote the largest suite, including idiomatic async API tests.</li>
<li class=""><strong>Most convenient: Copilot.</strong> Runs on an existing GitHub Copilot subscription, no extra key —
it's a router over Claude/GPT-5 backends. Also the slowest.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-four-bugs-the-benchmark-found-and-we-fixed">The four bugs the benchmark found (and we fixed)<a href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider#the-four-bugs-the-benchmark-found-and-we-fixed" class="hash-link" aria-label="Direct link to The four bugs the benchmark found (and we fixed)" title="Direct link to The four bugs the benchmark found (and we fixed)" translate="no">​</a></h2>
<p>Running every provider through the <em>same</em> pipeline is a great way to find provider bugs:</p>
<ol>
<li class=""><strong>Codex wrote nothing</strong> on the first run. Root cause: the agentic provider drained only
stdout while the Codex MCP server filled its stderr pipe buffer and blocked. Fixed by draining
stderr concurrently. The after-fix run is the fastest in the table.</li>
<li class=""><strong>Gemini's code was correct but its tests "failed".</strong> It wrote idiomatic async tests, but the
repo had no <code>pytest-asyncio</code> configured. Fixed by configuring async test support — 8/10 became
16/16 on a fresh run.</li>
<li class=""><strong>Copilot wasn't supported at all</strong> — so we added a provider for it.</li>
<li class=""><strong>Ollama wrote nothing</strong> — the most interesting one. More below.</li>
</ol>
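<p>The stderr deadlock in the first item is a classic pipe problem, and it is easy to reproduce. The sketch below shows the general technique in Python; it is not AIFactory's actual provider code, and <code>run_and_drain</code> is a hypothetical helper named for this example.</p>
<pre><code class="language-python">import subprocess
import sys
import threading


def run_and_drain(cmd):
    """Run cmd and read stdout and stderr at the same time.

    Reading only stdout lets the child block once its stderr pipe buffer
    is full; giving each pipe its own reader thread avoids that.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    captured = {"stdout": [], "stderr": []}

    def drain(stream, name):
        for line in stream:
            captured[name].append(line)
        stream.close()

    readers = [
        threading.Thread(target=drain, args=(proc.stdout, "stdout")),
        threading.Thread(target=drain, args=(proc.stderr, "stderr")),
    ]
    for t in readers:
        t.start()
    for t in readers:
        t.join()
    returncode = proc.wait()
    return returncode, "".join(captured["stdout"]), "".join(captured["stderr"])


# A child that writes far more to stderr than a typical pipe buffer holds.
noisy_child = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('x' * 1000000); print('done')",
]
code, out, err = run_and_drain(noisy_child)
</code></pre>
<p>When the output is only needed after the process exits, <code>proc.communicate()</code> reads both pipes for you; explicit reader threads are mainly useful when output has to be handled while the process is still running.</p>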
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ollama-what-works-what-needs-doing">Ollama: what works, what needs doing<a href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider#ollama-what-works-what-needs-doing" class="hash-link" aria-label="Direct link to Ollama: what works, what needs doing" title="Direct link to Ollama: what works, what needs doing" translate="no">​</a></h2>
<p>Agentic coding on a local model is fundamentally harder than chatting with a cloud model. It isn't
autocomplete: the model has to drive a multi-step <em>agentic loop</em> (read files, call tools, write
code, react to results), and small models struggle to sustain it.</p>
<p><strong>What we fixed.</strong> Small local models (qwen, llama) frequently describe a tool call as text — a
JSON blob in the message body — instead of using the native tool-call field, and they tend to
loop on reading without ever writing. AIFactory now parses those text-emitted tool calls and
nudges the model to write once it has read enough. With that fix, <code>qwen2.5-coder:14b</code> went from
<strong>writing zero files</strong> to <strong>writing real code</strong>. (We ported the fix from our sister project,
<a href="https://github.com/olafkfreund/TFactory" target="_blank" rel="noopener noreferrer" class="">TFactory</a>, which had already solved it.)</p>
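<p>A rough sketch of the parsing half of that fix, in Python. It is an illustration under stated assumptions, not the code that ships in AIFactory or TFactory: the JSON shape (<code>name</code> plus <code>arguments</code>) and the function name <code>extract_text_tool_call</code> are made up for this example.</p>
<pre><code class="language-python">import json


def extract_text_tool_call(message_text, known_tools):
    """Return (name, arguments) if the text contains a JSON tool call, else None."""
    decoder = json.JSONDecoder()
    start = message_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever text follows it.
            candidate, _end = decoder.raw_decode(message_text, start)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict):
            name = candidate.get("name")
            arguments = candidate.get("arguments", {})
            if isinstance(name, str) and name in known_tools and isinstance(arguments, dict):
                return name, arguments
        start = message_text.find("{", start + 1)
    return None


reply = 'I will read the file first.\n{"name": "read_file", "arguments": {"path": "app/main.py"}}'
call = extract_text_tool_call(reply, {"read_file", "write_file"})
</code></pre>
<p>Checking the name against the set of known tools keeps an unrelated JSON snippet in the model's reply from being mistaken for a tool call.</p>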
<p><strong>What still needs doing.</strong> A 14B model now <em>starts</em> the job but usually can't <em>finish</em> a full
multi-file feature — in our run it produced a partial module and stopped after the first subtask.
That's not an AIFactory limit; it's a model-capacity limit.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="which-ollama-models-and-what-hardware">Which Ollama models, and what hardware<a href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider#which-ollama-models-and-what-hardware" class="hash-link" aria-label="Direct link to Which Ollama models, and what hardware" title="Direct link to Which Ollama models, and what hardware" translate="no">​</a></h3>
<table><thead><tr><th>Model class</th><th>What to expect on a real multi-file task</th></tr></thead><tbody><tr><td>≤ 7B</td><td>Not recommended — rarely sustains the tool-calling loop.</td></tr><tr><td>14B (<code>qwen2.5-coder:14b</code>)</td><td>Produces real code; good for single-file edits and smoke tests; usually won't finish a whole feature.</td></tr><tr><td><strong>27–32B</strong> (<code>qwen3-coder</code>, <code>qwen2.5-coder:32b</code>, <code>deepseek-coder-v2</code>)</td><td>The realistic floor for completing full tasks locally.</td></tr><tr><td>70B+</td><td>Best local quality, needs serious hardware.</td></tr></tbody></table>
<p>Rough hardware guide (4-bit quantized, 32K context):</p>
<table><thead><tr><th>Model size</th><th>VRAM (approx)</th><th>Example GPU</th></tr></thead><tbody><tr><td>14B</td><td>~10–12 GB</td><td>RTX 4070/4080</td></tr><tr><td>27–32B</td><td>~24–32 GB</td><td>RTX 3090/4090 (24 GB) or better</td></tr><tr><td>70B</td><td>~48 GB+</td><td>A6000 / dual-24 GB / data-center</td></tr></tbody></table>
<p>One hard-won tip: run Ollama on a <strong>dedicated GPU box</strong>, not your daily-driver desktop. While
testing, a 27B model pinned the GPU hard enough to take down a desktop session. Keep VRAM
headroom for the context window.</p>
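<p>The arithmetic behind the VRAM column is simple for the weights themselves: at 4 bits per weight, a model needs about half a byte per parameter. The sketch below covers only that part and should be read as a lower bound, since real quantization formats carry some extra metadata; the gap up to the table's figures is KV cache for the context window plus runtime overhead, which depends on the model architecture and is not modeled here. The function name <code>weights_gb</code> is an assumption for this example.</p>
<pre><code class="language-python">def weights_gb(params_billion, bits_per_weight=4):
    """Approximate memory for the quantized weights alone, in GB."""
    return params_billion * bits_per_weight / 8


for size in (14, 32, 70):
    print(f"{size}B at 4-bit: about {weights_gb(size):.0f} GB of weights, before KV cache and overhead")
</code></pre>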
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-the-cloud-models-are-still-king">Why the cloud models are still king<a href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider#why-the-cloud-models-are-still-king" class="hash-link" aria-label="Direct link to Why the cloud models are still king" title="Direct link to Why the cloud models are still king" translate="no">​</a></h2>
<p>If you want one honest sentence: <strong>for almost any real task today, a managed/cloud model is the
right default, and local models are the special case</strong> — not the other way around.</p>
<ul>
<li class=""><strong>Capability.</strong> The frontier cloud models are far larger than anything you'll run on a single
GPU. They sustain the agentic loop, follow multi-file instructions, and write passing tests.
All four managed providers ended with a fully passing suite here, and Claude did it on the first run with no fixes; the local 14B couldn't.</li>
<li class=""><strong>Speed.</strong> Codex finished in under five minutes. A capable <em>local</em> model that can actually
finish the task (27B+) runs several times slower on consumer hardware.</li>
<li class=""><strong>Cost, in context.</strong> Cloud runs of a task this size cost on the order of a few cents. "Free"
local coding costs $0 in API fees but needs a real GPU and your patience — and below ~27B it
often can't finish at all.</li>
</ul>
<p><strong>So when is local the right call?</strong> When you <em>can't</em> send code to a third party — air-gapped,
regulated, or privacy-mandated environments. (That's the whole reason AIFactory is self-hostable;
see <a class="" href="https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank">Why we can't use Cursor at a bank</a>.) In that case:
use a <strong>27B+ coder model on a 24 GB+ GPU</strong>, expect to review more, and lean on AIFactory's
human-review gate. For everyone else, pick a managed provider and move fast — AIFactory lets you
switch with a single model string, and even mix them per phase.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="see-it--run-it-yourself">See it / run it yourself<a href="https://aifactory.freundcloud.com/blog/we-tested-every-llm-provider#see-it--run-it-yourself" class="hash-link" aria-label="Direct link to See it / run it yourself" title="Direct link to See it / run it yourself" translate="no">​</a></h2>
<ul>
<li class="">The full, independently verified results: <strong><a class="" href="https://aifactory.freundcloud.com/showcase/benchmark-results">Provider Benchmark</a></strong>.</li>
<li class="">Every provider run is reproducible with <code>scripts/benchmark-provider.mjs</code> in the repo — re-run
<code>pytest</code> on each branch and check our numbers.</li>
</ul>
<p>Provider-agnostic isn't a marketing line if you can prove it. Now you can.</p>]]></content>
        <author>
            <name>Olaf Krasicki-Freund</name>
            <uri>https://github.com/olafkfreund</uri>
        </author>
        <category label="ai-coding" term="ai-coding"/>
        <category label="benchmark" term="benchmark"/>
        <category label="ollama" term="ollama"/>
        <category label="local-llm" term="local-llm"/>
        <category label="providers" term="providers"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Why we can't use Cursor at a bank — and what I built instead]]></title>
        <id>https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank</id>
        <link href="https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank"/>
        <updated>2026-05-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[AI can write code. For regulated teams the blocker isn't capability — it's trust and provenance. Here's the case for governed, self-hostable, auditable autonomous coding.]]></summary>
        <content type="html"><![CDATA[<p>A friend who works at a bank told me their security team had just banned every cloud AI coding
tool. Not because they're luddites — these are sharp engineers who'd love the productivity. They
banned them because they can't send proprietary source code to a third party, and they can't
explain to an auditor where a given line of code came from. They wanted AI's help. They weren't
allowed to have it.</p>
<p>I kept hearing versions of this. The more I looked, the more I realized it isn't a niche complaint;
it's the unspoken default for a huge slice of the industry. So I built something for it,
open-sourced it, and this post is about why.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-isnt-capability-its-trust">The problem isn't capability. It's trust.<a href="https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank#the-problem-isnt-capability-its-trust" class="hash-link" aria-label="Direct link to The problem isn't capability. It's trust." title="Direct link to The problem isn't capability. It's trust." translate="no">​</a></h2>
<p>We're past the point of arguing whether AI can write code. It can. The interesting question has
moved: <em>can you trust what it produced, and can you prove where it came from?</em></p>
<p>The data says no, and it's getting worse:</p>
<ul>
<li class=""><strong>96% of developers don't fully trust AI-generated code</strong> — yet only <strong>48% actually verify it</strong>
(Sonar, 2026). That gap is where bugs and vulnerabilities live.</li>
<li class="">For <strong>38% of teams, reviewing AI-written code now takes <em>more</em> effort</strong> than reviewing a human's.
We automated the writing and quietly moved the cost to review.</li>
<li class=""><strong>~74% of organizations can't provide security provenance</strong> for AI-generated code. When the
auditor asks "where did this come from and who approved it?", there's no answer.</li>
<li class="">Depending on the study, <strong>40–62% of AI-generated code contains vulnerabilities or design flaws.</strong></li>
</ul>
<p>For a solo dev on a side project, fine. For a regulated team, that's a wall.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-the-cloud-tools-structurally-cant-fix-this">Why the cloud tools structurally can't fix this<a href="https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank#why-the-cloud-tools-structurally-cant-fix-this" class="hash-link" aria-label="Direct link to Why the cloud tools structurally can't fix this" title="Direct link to Why the cloud tools structurally can't fix this" translate="no">​</a></h2>
<p>It's tempting to think the SaaS vendors will just add an "enterprise mode" and the problem goes
away. But the issues are structural, not cosmetic:</p>
<ul>
<li class=""><strong>Data residency.</strong> The tool's value comes from ingesting your repo. If your compliance regime
says source can't leave the perimeter, "enterprise SSO" doesn't help — the architecture is wrong.</li>
<li class=""><strong>No air-gap.</strong> Many regulated environments are network-isolated. A tool that phones home to a
hosted model service can't run there at all.</li>
<li class=""><strong>Opaque actions.</strong> Most agents hand you a diff, not a defensible record of <em>what they did, in
what order, and what you approved.</em></li>
<li class=""><strong>Lock-in.</strong> Betting your whole dev workflow on one vendor's model and pricing is its own risk.</li>
</ul>
<p>You can't bolt provenance and air-gap onto an architecture designed around "send us your code."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-thesis-autonomy-and-governance-arent-opposites">The thesis: autonomy and governance aren't opposites<a href="https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank#the-thesis-autonomy-and-governance-arent-opposites" class="hash-link" aria-label="Direct link to The thesis: autonomy and governance aren't opposites" title="Direct link to The thesis: autonomy and governance aren't opposites" translate="no">​</a></h2>
<p>Here's the conviction the whole project rests on: <strong>you can have an agent that ships code <em>and</em> a
trail you can defend.</strong> Those aren't in tension — you just have to design for both from the start.</p>
<p>Concretely, that means:</p>
<ul>
<li class=""><strong>Spec-first.</strong> Every run begins with a written spec and acceptance criteria — intent you can read
and edit before anything happens.</li>
<li class=""><strong>Review-gated.</strong> You approve the plan before code is written, and the diff before it merges. A QA
agent checks the result against the spec.</li>
<li class=""><strong>Isolated.</strong> Each task runs in its own git worktree. Nothing touches your working tree until you
decide to merge.</li>
<li class=""><strong>Provenance by default.</strong> Every action is journaled in a hash-chained audit log. The spec, the
plan, and the QA report all live on disk and in version control.</li>
<li class=""><strong>Self-hosted.</strong> It runs in <em>your</em> perimeter — your Kubernetes cluster, or just docker-compose on
a laptop — against <em>your</em> choice of model, including a fully local one.</li>
</ul>
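<p>To make "hash-chained" concrete, here is a minimal Python sketch of the idea. It is an illustration only, not AIFactory's actual log format: the field names (<code>prev</code>, <code>event</code>, <code>hash</code>), the example events, and the helper functions are assumptions made up for this example.</p>
<pre><code class="language-python">import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, event):
    """Hash the event together with the previous entry's hash."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, event):
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "event": event, "hash": entry_hash(prev_hash, event)})


def verify(log):
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"action": "spec_approved", "by": "alice"})
append(log, {"action": "diff_approved", "by": "bob"})
intact = verify(log)

log[0]["event"]["by"] = "mallory"  # tamper with an earlier entry
tampered_detected = not verify(log)
</code></pre>
<p>Because each entry's hash covers the previous hash, rewriting one record means recomputing every record after it, which is what makes after-the-fact edits visible.</p>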
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-i-built">What I built<a href="https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank#what-i-built" class="hash-link" aria-label="Direct link to What I built" title="Direct link to What I built" translate="no">​</a></h2>
<p>That's <a href="https://github.com/olafkfreund/AIFactory" target="_blank" rel="noopener noreferrer" class="">AIFactory</a>. It turns a task into shipping code
through a pipeline you watch and verify: <strong>spec → plan → code → QA</strong>, with human-review gates at
each step. You bring your own model — Claude, OpenAI, Gemini, Codex, or a local Ollama /
OpenAI-compatible endpoint — and you own the infrastructure it runs on. Every task lands in a
hash-chained audit log, so afterwards you can show exactly what happened and who approved it.</p>
<p>It's open source (MIT) and I build it solo, full-time. There's a separate enterprise edition for
organizations that need multi-tenant isolation, SAML/SCIM, and signed audit evidence — that's what
funds the open core — but the core pipeline is free, and it's the part most people need.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="if-this-is-your-problem-too">If this is your problem too<a href="https://aifactory.freundcloud.com/blog/why-we-cant-use-cursor-at-a-bank#if-this-is-your-problem-too" class="hash-link" aria-label="Direct link to If this is your problem too" title="Direct link to If this is your problem too" translate="no">​</a></h2>
<p>If you're somewhere that wants AI's productivity but can't use the cloud tools — or you just don't
want to merge code you can't account for — I'd genuinely like to hear what would make this usable
for you. The repo is here: <strong><a href="https://github.com/olafkfreund/AIFactory" target="_blank" rel="noopener noreferrer" class="">github.com/olafkfreund/AIFactory</a></strong>.
Open an issue, or tell me where it falls short.</p>
<p>Autonomy you can't defend isn't worth much in the places that matter most. I think we can do better
than "trust the diff."</p>]]></content>
        <author>
            <name>Olaf Krasicki-Freund</name>
            <uri>https://github.com/olafkfreund</uri>
        </author>
        <category label="ai-coding" term="ai-coding"/>
        <category label="self-hosted" term="self-hosted"/>
        <category label="compliance" term="compliance"/>
        <category label="governance" term="governance"/>
    </entry>
</feed>