What a 4B Model Can Actually Do: Field Notes from 155 Experiments

The single most useful thing I learned in two weeks fits on one line of code, and it made a 4B model meaningfully better at a task it was supposed to be bad at.

The task was Foofah table transforms — messy spreadsheet-to-spreadsheet reshaping, the kind of thing that usually wants a hand-written script. Base Qwen 3.5 4B, asked to emit the transformed table directly as JSON, got 138/250 (55.2%). Respectable for a 4B model and not something you’d ship. Then I gave it a second way to answer: write an executable transform(table) program, repair it once against the one visible input-output example, and — this is the whole trick — commit the program’s output whenever it passes that visible example. No learned reranker, no confidence threshold, no consensus vote. One gate.

That moved the score to 156/250 (62.4%): eighteen recoveries, zero regressions, 79.5% precision on the programs it committed. And the slice that should have been impossible to call from the outside told the real story — in the 26 cases where the program disagreed with the direct answer but still passed the visible example, the program was right on 18 of 26 and the direct answer on 0 of 26. The model didn’t get smarter. I just gave it something it could check, and the check was load-bearing.

That pattern — give a small model a checkable, executable surface and it punches far above its benchmark weight — was the spine of the whole sprint, and it kept paying off in places I didn’t expect. This is the field guide: the levers that worked, how I know they’re real, and the genuinely exciting frontier they opened up.

The shape of the sprint

Roughly two weeks, 155 experiments, two parallel tracks, all run inside a disciplined, semi-automated research operating system: I directed it, and agents executed the experiments. Track-y (91 experiments) was structured execution: latent and typed-bytecode compilers, VM-echo trace distillation, slot and register executors, trace verifiers. Track-z (64 experiments) was applied selection science: Foofah table transforms, code-ABI coverage ladders, retrieval-adapt pipelines, pass@K RL, DPO/GRPO/DAgger posttraining, tool controllers.

The model was Qwen 3.5 4B, mostly base, some instruct, served with vLLM, finetuned with QLoRA / LoRA (4-bit NF4, bf16 compute), on a single workstation GPU.

One honest note up front: a handful of the earliest compiler and generalization experiments used adjacent models — Qwen 3 4B for the latent and bytecode compilers, Qwen 2.5 Coder 3B Instruct for the factor-recombination ladders — because those substrates predate the two-track focus. The bulk of the work, and every headline number unless I say otherwise, is Qwen 3.5 4B.

The benchmarks were Foofah table transforms (250 cases), MBPP and HumanEval for code, and synthetic modular-arithmetic, typed-bytecode, and operator-inventory substrates built to isolate one mechanism at a time. And the thing that made it productive rather than just busy: every experiment attached to a research program, ran a cheap smoke path before the expensive run, carried its controls, and updated a claim ledger — confirmed, promising, open, negative, retired. That instrument is what turned 155 fast experiments into a map you can actually navigate by, instead of 155 disconnected runs.

Lever one: give it something to check

The executable-intermediate result reproduced across every setup I tried it in, and it is the most immediately useful thing I can hand you.

The Foofah ensemble made the point from a harder starting position. Three independently prompted program variants, direct JSON baseline 111/250 (44.4%), and the simple first-visible-passing-program gate reached 130/250 (52.0%) — twenty-three recoveries against four losses. The win is not subtle, and it costs you almost nothing: a sandbox, one visible example, and the willingness to run the candidate before you trust it.

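def commit(direct_answer, program, visible_example, held_out_input):
    visible_input, visible_output = visible_example
    try:
        # Gate: the program must reproduce the one visible example exactly.
        if program is not None and program(visible_input) == visible_output:
            # The held-out input is visible; its correct answer is not.
            return program(held_out_input)  # commit the checked program's output
    except Exception:
        pass  # a program that crashes never earns a commit
    return direct_answer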
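With several program variants, as in the three-variant ensemble, the same gate becomes "commit the first candidate that passes the visible example." Here is a minimal sketch of that idea, assuming each candidate is a plain Python callable and the visible example is an (input, output) pair. The names `passes_visible` and `first_visible_pass` and the toy transforms are illustrative assumptions, not the sprint's actual code, and a real setup would run candidates in a sandbox.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

Table = Any  # hypothetical: any value comparable with ==, e.g. a list of rows


def passes_visible(program: Callable[[Table], Table],
                   visible_example: Tuple[Table, Table]) -> bool:
    """True only if the program runs and reproduces the visible output."""
    visible_input, visible_output = visible_example
    try:
        return program(visible_input) == visible_output
    except Exception:
        return False  # a crash counts as a failed check


def first_visible_pass(direct_answer: Table,
                       programs: Iterable[Optional[Callable[[Table], Table]]],
                       visible_example: Tuple[Table, Table],
                       held_out_input: Table) -> Table:
    """Commit the first candidate that passes the gate, else the direct answer."""
    for program in programs:
        if program is None or not passes_visible(program, visible_example):
            continue
        try:
            return program(held_out_input)
        except Exception:
            continue  # passed the example but crashed on the real input
    return direct_answer


# Toy usage: only the transpose candidate reproduces the visible example.
def reverse_rows(table):
    return table[::-1]


def transpose(table):
    return [list(col) for col in zip(*table)]


visible = ([[1, 2], [3, 4]], [[1, 3], [2, 4]])
result = first_visible_pass("direct guess", [reverse_rows, None, transpose],
                            visible, [[5, 6], [7, 8]])
print(result)  # [[5, 7], [6, 8]]
```

Falling back to the direct answer when nothing passes is what keeps the gate from costing anything on cases where no program checks out.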

The same instinct — make the model’s work checkable — is what made the structured-execution track sing.

Lever two: small models really can learn structured execution

This is the section I’m most excited about, because it pushes against the easy assumption that structured, multi-step execution is something only big models do.

A Qwen-attached executable latent compiler learned to run modular-arithmetic programs and then scaled from 8 to 24 operation slots by copying its learned short-program parameters into longer structures and continuing training. Executor accuracy ranged from 82.8% on the standard length-24 split to 100.0% on the paraphrase split, with state-prefix recovery at 97.9%–100%, and it got there with no beam search, no candidate reranking, and no tokenized program output. A small model carrying a 24-step typed computation in latent state and getting the answer right is a genuinely encouraging result.

Typed bytecode confirmed the interface is fully learnable: dense gold-trace supervision near-saturated at 99.6% on fresh-paired splits. And answer-verified expert iteration — let the model generate programs, keep the ones whose output is correct, train on those — moved a frozen-Qwen bytecode head from 17.6% to 50.4% (+32.8 points) without any gold traces at all. The model bootstrapped a big chunk of the way to competence on its own verified successes.
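The filtering step of answer-verified expert iteration can be sketched in a few lines. This is a hedged illustration only: the `sample` and `execute` callables, the tuple-of-op-names program format, and the toy op table are assumptions standing in for the model and the bytecode VM, and the training step that would follow is omitted.

```python
from typing import Callable, Iterable, List, Tuple

Program = Tuple[str, ...]  # hypothetical: a program as a tuple of op names


def verified_targets(
    tasks: Iterable[Tuple[int, int]],
    sample: Callable[[int], List[Program]],
    execute: Callable[[Program, int], int],
) -> List[Tuple[int, Program]]:
    """Keep (input, program) pairs whose executed output matches the answer.

    Only the final answer is checked; no gold program or trace is consulted.
    """
    kept = []
    for task_input, answer in tasks:
        for program in sample(task_input):
            try:
                if execute(program, task_input) == answer:
                    kept.append((task_input, program))
                    break  # one verified program per task is enough here
            except Exception:
                continue  # malformed programs are simply dropped
    return kept


# Toy stand-ins for the model and the VM (assumptions for illustration).
OPS = {"inc": lambda x: x + 1, "dbl": lambda x: x * 2}


def toy_execute(program: Program, x: int) -> int:
    for op in program:
        x = OPS[op](x)
    return x


def toy_sample(_task_input: int) -> List[Program]:
    return [("inc",), ("dbl", "inc"), ("inc", "dbl"), ("bad_op",)]


tasks = [(3, 7), (3, 8), (3, 100)]
train_set = verified_targets(tasks, toy_sample, toy_execute)
print(train_set)  # [(3, ('dbl', 'inc')), (3, ('inc', 'dbl'))]
```

The third task contributes nothing because no sampled program reaches its answer, which is the expected behavior: unverified programs never become training targets.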

The result that genuinely changed how I think about small-model agents was verifier-guided process control. I put the model inside a deterministic verifier MDP: it sees a compact state (candidate set, a few executed observations, eight concrete probe choices) and picks the next action; the verifier runs it and filters. Trained with process-DPO and GRPO, accuracy went from 32.8% at base to 43.8%, which is 77.8% of the entire base-to-oracle headroom captured by a 4B controller orchestrating exact tools. This is the neurosymbolic bet paying off: let execution and search make the answer reachable, then train a small model to drive. The shuffled-reward control reached only 36.7% and the scrambled-feature control 31.2%, so the bulk of the gain comes from the real verifier signal, not formatting luck.
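For readers who want the headroom arithmetic spelled out, here is a small sketch. The oracle accuracy is not stated in these notes, so the second function backs it out from the reported numbers; treat that value as an inference, not a reported result. Both function names are illustrative.

```python
def headroom_captured(base: float, policy: float, oracle: float) -> float:
    """Fraction of the base-to-oracle gap closed by the trained policy."""
    if oracle <= base:
        raise ValueError("oracle must exceed base for headroom to be defined")
    return (policy - base) / (oracle - base)


def implied_oracle(base: float, policy: float, captured: float) -> float:
    """Invert the formula: the oracle accuracy a reported fraction implies."""
    return base + (policy - base) / captured


# Reported: base 32.8%, trained controller 43.8%, 77.8% of headroom captured.
oracle_estimate = implied_oracle(32.8, 43.8, 0.778)
print(round(oracle_estimate, 1))  # about 46.9 (inferred, not reported)
print(round(headroom_captured(32.8, 43.8, oracle_estimate), 3))  # 0.778
```

Read this way, the 11-point gain sits inside a gap of roughly 14 points between the base policy and an oracle controller.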

And operator-inventory search was the cleanest “look what opens up” moment of the two weeks. Ask the model to solve held-out aggregate operators from a closed vocabulary and it gets 0.0% — it simply doesn’t have the pieces. Give it an inventory to search and execute against, and held-out target coverage jumps to 100.0%, with 92.5% of held-out records solved from just six visible examples and zero active queries, reaching 100% with a single query. The capability was never missing. The access to it was.
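To make "search and execute against an inventory" concrete, here is a minimal sketch under stated assumptions: the inventory of seven aggregate operators, the tolerance, and the disagreement-based query rule are all illustrative choices, not the substrate used in the experiments.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical closed inventory of aggregate operators.
INVENTORY: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": sum,
    "min": min,
    "max": max,
    "mean": mean,
    "median": median,
    "range": lambda xs: max(xs) - min(xs),
    "count": len,
}


def consistent_operators(
    examples: List[Tuple[Sequence[float], float]],
) -> List[str]:
    """Return every operator that reproduces all visible examples."""
    survivors = []
    for name, op in INVENTORY.items():
        if all(abs(op(xs) - y) < 1e-9 for xs, y in examples):
            survivors.append(name)
    return survivors


def pick_query(candidates: List[str],
               pool: List[Sequence[float]]) -> Sequence[float]:
    """Choose the input on which the surviving operators disagree most."""
    def distinct_outputs(xs: Sequence[float]) -> int:
        return len({round(INVENTORY[name](xs), 9) for name in candidates})

    return max(pool, key=distinct_outputs)


# Both visible examples happen to fit "max" and "median" equally well.
visible = [([1, 5, 5], 5), ([2, 2, 2], 2)]
survivors = consistent_operators(visible)
query = pick_query(survivors, [[3, 3, 3], [1, 2, 9]])
print(survivors)  # ['max', 'median']
print(query)      # [1, 2, 9]
```

Executing the inventory against the visible examples does most of the work, and a single well-chosen query separates whatever remains, which matches the shape of the result above.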

Lever	Result	Setup
Executable program gate (Foofah)	55.2% → 62.4%, +18, 0 losses	Qwen 3.5 4B, visible-pass commit
Latent compiler, 8→24 slots	82.8%–100% executor accuracy	Qwen 3 4B + QLoRA, no beam search
Expert iteration (frozen-Qwen bytecode)	17.6% → 50.4% (+32.8)	self-verified targets, no gold traces
Verifier-guided process control	32.8% → 43.8% (77.8% of headroom)	process-DPO / GRPO in a verifier MDP
Operator-inventory search	0% → 100% held-out coverage	search + execute, 6 visible examples

Lever three: the supervision is the secret, not the size

If there’s one practitioner takeaway from track-y, it’s that how you supervise a small model matters more than its parameter count — and the levers are concrete and reusable.

Aligned execution traces were transformative for program repair: showing the model the actual computation path took held-out accuracy to 75.0% (54/72), versus 33.3% with no trace and 37.5% with a shuffled trace. Same model, same data budget; the alignment of the demonstration was the whole difference. Curriculum was the other quiet hero — a length curriculum, staged from short to long programs, took a length-24 compiler to 96.9% where training without the curriculum reached only 4.7%. These aren’t exotic tricks. They’re a recipe: show the real steps, in the right order, shortest-first.
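A length curriculum of the "shortest-first" kind can be sketched as a generator of training stages. The stage caps (8, 16, 24), the cumulative pooling of shorter programs into later stages, and the toy program format are assumptions for illustration; the notes do not specify the actual schedule.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def length_curriculum(
    programs: Sequence[Sequence[str]],
    stage_max_lengths: Sequence[int],
    seed: int = 0,
) -> Iterator[Tuple[int, List[Sequence[str]]]]:
    """Yield (max_length, shuffled training pool) per stage, short to long.

    Each stage keeps every program up to its length cap, so shorter programs
    stay in the mix while longer ones are introduced.
    """
    rng = random.Random(seed)
    for cap in sorted(stage_max_lengths):
        pool = [p for p in programs if len(p) <= cap]
        rng.shuffle(pool)
        yield cap, pool


# Toy data: three programs each of 4, 8, 16, and 24 steps (hypothetical op).
toy_programs = [["add"] * n for n in (4, 8, 16, 24) for _ in range(3)]
stages = [(cap, len(pool))
          for cap, pool in length_curriculum(toy_programs, [8, 16, 24])]
print(stages)  # [(8, 6), (16, 9), (24, 12)]
```

A training loop would consume one stage at a time, continuing from the previous stage's weights instead of restarting.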

Lever four: memory and evidence that pay their way

Retrieval earned its place too, when it fed the model something executable rather than something to read. Semantic retrieval over a 364-algorithm verified library, plus Qwen adaptation, recovered 8/24 (33.3%) of the tasks direct sampling had missed on MBPP — and it was control-clean, beating random retrieval (4/24) and shuffled-query retrieval (3/24) decisively. That pushed the reachable ceiling from 56/80 (70.0%) to 64/80 (80.0%): real coverage, attributable to genuine semantic matching, for about 24k forward tokens. Retrieved algorithms the model can adapt and run are a real external memory.
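The two retrieval controls are easy to set up alongside the real retriever. The sketch below uses a toy bag-of-words similarity as a stand-in for a real embedding model, and the three-entry library and function names are hypothetical; the point is only the structure of "semantic vs. random vs. shuffled-query" at equal budget.

```python
import math
import random
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_retrieve(query: str, library: Dict[str, str], k: int) -> List[str]:
    q = embed(query)
    ranked = sorted(library, key=lambda name: cosine(q, embed(library[name])),
                    reverse=True)
    return ranked[:k]


def random_retrieve(library: Dict[str, str], k: int,
                    rng: random.Random) -> List[str]:
    """Control 1: same budget, no use of the query."""
    return rng.sample(sorted(library), k)


def shuffled_query_retrieve(queries: List[str], index: int,
                            library: Dict[str, str], k: int,
                            rng: random.Random) -> List[str]:
    """Control 2: the real retriever, fed some other task's query."""
    others = [q for i, q in enumerate(queries) if i != index]
    return semantic_retrieve(rng.choice(others), library, k)


library = {
    "binary_search": "find target index in sorted list by halving",
    "gcd": "greatest common divisor of two integers",
    "dedupe": "remove duplicate items from list preserving order",
}
queries = ["remove duplicate values from a list", "common divisor of integers"]
top = semantic_retrieve(queries[0], library, k=1)
control = shuffled_query_retrieve(queries, 0, library, 1, random.Random(0))
print(top, control)  # ['dedupe'] ['gcd']
```

If the semantic arm does not clearly beat both controls at the same `k`, the recovered tasks are more likely coming from extra sampling than from matching.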

Even active evidence showed a true signal: entropy-and-disagreement example selection reached 70.0% against a 66.7% baseline and beat its shuffled-label control by ten points. Modest, but real and reproducible — a foothold for the budgeted-evidence work I want to do next.
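One simple way to score "entropy and disagreement" is vote entropy over sampled predictions. The exact scoring rule used in the experiment is not given here, so the sketch below is an assumption: it ranks unlabeled inputs by the Shannon entropy of the answers sampled for each one and spends the labeling budget on the most contested.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List


def vote_entropy(predictions: List[Hashable]) -> float:
    """Shannon entropy (bits) of the empirical distribution of predictions."""
    counts = Counter(predictions)
    total = len(predictions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_examples(pool: Dict[str, List[Hashable]], budget: int) -> List[str]:
    """Pick the inputs whose sampled predictions disagree the most."""
    ranked = sorted(pool, key=lambda key: vote_entropy(pool[key]), reverse=True)
    return ranked[:budget]


# Hypothetical sampled answers for three unlabeled inputs.
pool = {
    "input_a": ["x", "x", "x", "x"],  # unanimous: nothing to learn
    "input_b": ["x", "y", "x", "y"],  # two-way split: 1 bit
    "input_c": ["x", "y", "z", "w"],  # four-way split: 2 bits
}
chosen = select_examples(pool, budget=2)
print(chosen)  # ['input_c', 'input_b']
```

The shuffled-label control then asks whether labels acquired this way help more than the same number of labels assigned to the wrong inputs.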

And a clarifying surprise on the easy end: when I ran a genuine sample-verify-commit loop, HumanEval coverage hit 96.7% and the simplest possible selector already committed 96.7%. On the tasks a 4B model can do, it mostly just does them, and you don’t need anything clever at all. Knowing where you need machinery and where you don’t is its own kind of win.

The frontier all of this revealed

Here is the part I find genuinely thrilling, and it’s the through-line that ties the whole sprint together: the levers got good enough that they relocated the hard problem.

Across pipeline after pipeline, I could make the right answer reachable — present in the candidate pool, recoverable under an oracle. A finite verified code-ABI reached 83.8% oracle coverage on MBPP. Retrieval pushed the reachable ceiling to 80.0%. On one HumanEval slice the pool contained a correct implementation for 100% of tasks. The generation problem, the thing everyone worries about with small models, was substantially solved.

What’s left is selection: committing the right candidate from the evidence a deployment actually has. The first-visible selector on that code-ABI pool committed 95/160 (59.4%) against the 83.8% ceiling; on the HumanEval slice, every policy I tried landed at 16.7% — 2 of 12.
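The distinction between reachable and committed is worth computing explicitly. The following is a minimal sketch with hypothetical names (`Candidate`, `first_visible`, `reachable_vs_committed`) and toy pools; the selector sees only the visible-check flag, while the hidden correctness flag is used purely for scoring.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    passes_visible: bool  # evidence available at deployment time
    hidden_correct: bool  # ground truth, used only for scoring


def first_visible(pool: Sequence[Candidate]) -> Optional[int]:
    """Selector: index of the first candidate passing the visible check."""
    for i, cand in enumerate(pool):
        if cand.passes_visible:
            return i
    return None


def reachable_vs_committed(pools: List[Sequence[Candidate]]) -> dict:
    """Oracle coverage, committed accuracy, and the gap between them."""
    reachable = sum(any(c.hidden_correct for c in pool) for pool in pools)
    committed = 0
    for pool in pools:
        choice = first_visible(pool)
        if choice is not None and pool[choice].hidden_correct:
            committed += 1
    n = len(pools)
    return {
        "oracle_coverage": reachable / n,
        "committed_accuracy": committed / n,
        "selection_gap": (reachable - committed) / n,
    }


pools = [
    [Candidate(True, True)],                            # reachable, committed
    [Candidate(True, False), Candidate(True, True)],    # reachable, missed
    [Candidate(False, False), Candidate(True, False)],  # not reachable
    [Candidate(False, False)],                          # not reachable
]
report = reachable_vs_committed(pools)
print(report)
```

On the toy pools this reports 50% oracle coverage, 25% committed accuracy, and a 25-point selection gap; the second pool is the interesting case, where a wrong program passes the visible check before the right one is considered.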

I don’t read that as a wall. I read it as the most concentrated pile of headroom in the entire stack: about twenty-four points on the code-ABI pool and more than eighty on the HumanEval slice, sitting in pools I already generated, waiting for nothing more exotic than better evidence to pick it. After two weeks, “selection under visible evidence” is the single most exciting place I know to point effort, precisely because generation got so good.

That reframing — generation is largely solved at 4B; selection is the open prize — is, to me, the biggest thing the sprint taught. It tells you where to dig.

How I know it’s real (and what I cleared out of the way)

The reason I trust the wins above is the same reason I can hand you a map instead of a hunch: every result carried a control, and I let the controls win. That discipline is also how I cleared the brush fast — and a few of those clearings are genuinely useful to know about, because they save you from plausible ideas that don’t pay off.

A favorite: I built an elaborate counterexample-probe selector for the Foofah programs — generate stress-test inputs, measure agreement between the program and the model’s direct answer, trust the ones that agree under pressure. Clean story, and the simple visible-pass gate beat it outright. The probe signal even came back inverted (mean support 25.0% for correct programs vs 33.3% for wrong ones), because the model’s two channels shared a bias and agreed with each other on the same wrong rule. The lesson is a gift: independent-looking evidence that comes from the same model isn’t independent. I’d rather learn that in an afternoon than in production.

A few more, briefly, because honest negatives are how a map earns trust:

Online pass@K RL optimized a clean signal (mean utility 0.988) but landed at 43.8% coverage against tuned hot sampling at 68.8% — at this scale, turning the temperature up is a strong, cheap baseline worth respecting before you reach for RL.
Prompt-only skill memory retrieved the right skill 77.5% of the time and still helped 0 of 40 tasks — which is exactly why memory has to enter as candidates and tests you execute, not prose you hope the model abstracts from. A negative that points straight at the better design.
And the posttraining methods — DPO, GRPO, DAgger — every one of them beat its label-shuffled control, so the learning signal is unmistakably real; they just didn’t always beat a strong frozen baseline like sample-more. That’s not a disappointment, it’s a calibration: posttraining at 4B is a real tool with a known envelope.

None of these are the story. They’re the guardrails that make the story trustworthy, and several of them are directly useful as “don’t bother, do this instead.”

The threads I can’t wait to pull

The best part of a reconnaissance trip is the rivers you spot but don’t have time to follow. I came out of two weeks with a whole set of them — promising leads that already show signal and just need their next experiment.

Sampler portfolios. One of my favorite accidental findings: a constrained-DPO policy at K=4 and plain hot sampling at K=8 solved different MBPP tasks — their wins barely overlapped. I went in expecting one approach to win and found that the right move is to keep several. A learned scheduler over a handful of cheap generation policies could compound coverage no single policy reaches. That is a research line hiding inside a result I almost filed as a tie.
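Whether two samplers are complementary or redundant can be checked with set arithmetic on the tasks each one solved. A minimal sketch, assuming per-policy sets of solved task ids; the policy names and task ids below are hypothetical placeholders, not results.

```python
from typing import Dict, Set, Tuple


def portfolio_summary(solved: Dict[str, Set[str]]) -> dict:
    """Union coverage and pairwise Jaccard overlap of per-policy solved sets."""
    union = set().union(*solved.values())
    names = sorted(solved)
    overlaps: Dict[Tuple[str, str], float] = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            both = solved[a] & solved[b]
            either = solved[a] | solved[b]
            overlaps[(a, b)] = len(both) / len(either) if either else 0.0
    return {"union_size": len(union), "jaccard": overlaps}


# Hypothetical task ids; real inputs would be the tasks each policy solved.
solved = {
    "dpo_k4": {"t1", "t2", "t3"},
    "hot_k8": {"t3", "t4", "t5", "t6"},
}
summary = portfolio_summary(solved)
print(summary["union_size"])                  # 6
print(summary["jaccard"][("dpo_k4", "hot_k8")])  # 1/6, about 0.167
```

A low Jaccard overlap with a union well above either policy alone is the signature that makes a scheduler worth building.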

Process control with obvious room to grow. The verifier-MDP controller already captured 77.8% of the base-to-oracle headroom with three probes and a tiny budget. Jointly optimizing the policy and the probe design — richer actions, a bigger budget — is sitting right there, with real ceiling above it.

Active evidence, properly coupled. The active-acquisition signal was small, +3.3 points, but it beat its shuffled control by ten — a genuine foothold. Tie the acquisition objective directly to the selector’s decision and that foothold could become a lever.

Expert iteration and trace-prediction on the model’s own candidates. Both already work when teacher-forced on gold targets; both are only weakly coupled to choosing better programs. Condition them on what the compiler actually samples — its own decisions, its own consequences — and the next experiment writes itself.

Counterexamples without an answer key. My probe selector failed because the model’s channels shared a bias. But generating evidence that genuinely discriminates correct from wrong without peeking at hidden answers is exactly the kind of problem that looks hard until the right framing makes it easy — and the payoff is turning that 24-point reachable-but-uncommitted gap into committed accuracy. That is the single most leveraged open problem I found.

Every one of these is a half-open door, and not one of them needs a cluster.

The surface area is the point

Here’s what I find genuinely inspiring, two weeks in: 155 experiments barely scratched it. The instrument I built to run them already has a launchpad queued behind it — 35 structured proposals, a dozen of them top priority, all eleven research programs covered, and five entirely new program lines waiting on their first probe:

Multimodal and embodied small models — do executable intermediates and evidence-selection survive when the input is an image, a UI state, or a physical observation? First probe: one visual-table transform.
Small-model collaboration — give specialized small models the roles of generator, verifier, critic, and evidence-acquirer and see if the team beats one single-policy model at a fixed budget. Given how much of what worked here was checking, role-specialized small models feel like one of the most promising bets on the board.
On-device, latency-constrained agents — which of these levers survive a hard token, memory, and tool-call ceiling? The ones that do are the ones that ship to a phone.
Synthetic curriculum design — curriculum was one of my biggest wins; learning which synthetic curricula create transfer rather than benchmark-specific competence is a program unto itself.
The structured-representation head-to-head — typed bytecode versus latent slots versus text programs versus execution feedback, on one shared suite, finally telling me which part of each carries the weight. (It is already the top-priority probe in the queue.)

And the thing that makes all of that real rather than a wishlist: it runs on one GPU, minutes per experiment. The surface area is enormous and cheap to explore — the most exciting combination in research. The bottleneck is not compute and it is not ideas; it is how fast you can run an honest experiment and believe the result. That is exactly what the instrument is for, and it is the highest-leverage thing I built in two weeks — the programs, the required controls, the claim ledger, the habit of separating what’s reachable from what’s committable. It is why 155 experiments compounded into a map instead of evaporating into 155 vibes.

So this is not a finish line. It is a coastline I spent two weeks mapping, with a dozen rivers already visible running inland and a boat that costs minutes to launch. The levers are real and reusable: give a small model something to check, supervise its steps, let it search and verify, and it punches far above its weight. The frontier — turning everything reachable into something committable — is wide open and unusually well-defined. And there is so much more to try than one person can try alone.

That is the part I keep coming back to. We learned an enormous amount in two weeks, and the best thing we learned is how much is still right there to find. So, see you out there. Go pull the next thread.
