Quantization is the cheapest lever in the inference stack and the one most often pulled wrong. You can quarter a model’s memory footprint in an afternoon with no gradient steps, but the method, the granularity, and the calibration set decide whether you ship a model that’s two percent worse on your eval or one that quietly forgets how to close parentheses. The thing to internalize first: for single-stream LLM decode you are bandwidth-bound, not compute-bound, so the win from quantization is mostly about moving fewer bytes per token, and that reframes which methods are even worth the trouble.
Quantize the bottleneck, not the FLOPs
Decode (autoregressive generation at small batch) reads every weight matrix out of HBM to produce a single token. Arithmetic intensity is terrible: each layer is a matrix-vector product, so you’re limited by how fast you can stream weights, not by tensor-core throughput. Cutting weights from FP16 to INT4 cuts that read by roughly 4x, and with a good kernel the decode step speeds up close to proportionally, even though you dequantize back to FP16 to actually do the multiply. That is the entire reason weight-only INT4 is the default for latency-sensitive serving.
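A back-of-envelope version of that argument, as a minimal sketch: if every weight is read once per generated token, the weight bytes divided by memory bandwidth give a floor on per-token latency. The parameter count and bandwidth below are hypothetical round numbers rather than measurements, and `decode_floor_ms` is an illustrative name.

```python
def decode_floor_ms(n_params, bits_per_weight, hbm_gb_per_s):
    """Lower bound on per-token decode time if every weight is read once per token."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / (hbm_gb_per_s * 1e9) * 1e3

# Hypothetical: a 7e9-parameter model on a GPU with 2000 GB/s of HBM bandwidth.
fp16_ms = decode_floor_ms(7e9, 16, 2000)
int4_ms = decode_floor_ms(7e9, 4, 2000)
print(f"FP16 floor: {fp16_ms:.2f} ms/token, INT4 floor: {int4_ms:.2f} ms/token")
```

The floor ignores KV-cache reads, scale metadata, and kernel overhead, so real numbers will be higher; the ratio between the two formats is the part worth remembering.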
Prefill (processing the prompt) and large-batch serving are different animals. Arithmetic intensity is high, you’re compute-bound, and weight-only INT4 buys you little — you still do the matmul in FP16 after dequantizing, and the dequant itself is now pure overhead. There you want the multiply to happen in low precision, on INT8 or FP8 tensor cores, which means quantizing the activations too.
So the first question is not “GPTQ or AWQ” — it’s “what is my serving regime.” Latency-bound chat at batch 1 to 8: weight-only. Throughput farm at batch 128, or heavy prefill from RAG and long prompts: activation quant or FP8. KV-cache-dominated long context: quantize the KV cache, which is a separate axis entirely.
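One rough way to answer the regime question is to compare arithmetic intensity against the hardware's compute-to-bandwidth ratio. The sketch below is a crude roofline check that counts only weight traffic; the peak FLOP/s and bandwidth figures are hypothetical, and `regime` is an illustrative helper, not part of any library.

```python
def regime(tokens_per_weight_read, bytes_per_weight, peak_flops, bytes_per_s):
    """Classify a matmul as bandwidth-bound or compute-bound (weights-only roofline)."""
    # Each weight contributes one multiply and one add per token that reuses it.
    intensity = 2 * tokens_per_weight_read / bytes_per_weight   # FLOP per weight byte
    balance = peak_flops / bytes_per_s                          # FLOP the chip can do per byte
    return "compute-bound" if intensity > balance else "bandwidth-bound"

# Hypothetical accelerator: 4e14 FLOP/s dense, 2e12 bytes/s of HBM bandwidth.
PEAK, BW = 4e14, 2e12
print("batch-1 decode, FP16 weights:", regime(1, 2.0, PEAK, BW))
print("2048-token prefill, FP16 weights:", regime(2048, 2.0, PEAK, BW))
```

The crossover point moves with the hardware and with the weight format, so treat the output as a first guess to confirm with a benchmark.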
The taxonomy you have to get right
Three orthogonal choices, and most confusion comes from conflating them.
What you quantize.
- Weight-only (W4A16, W3A16): weights in INT4/INT3, activations stay FP16/BF16. GPTQ and AWQ live here. Helps bandwidth and memory, not compute.
- Weight-and-activation (W8A8): both in INT8 or FP8. SmoothQuant and FP8 pipelines. Helps compute on INT8/FP8 tensor cores.
Granularity of scales. A quantized tensor is integers plus a scale (and maybe a zero-point). Where that scale applies trades accuracy against kernel speed and metadata overhead:
- Per-tensor: one scale for the whole matrix. Cheapest, least accurate.
- Per-channel: one scale per output channel. The standard for weights.
- Group-wise: one scale per contiguous group of N weights along the input dimension. Group size 128 is the de facto default for INT4 — smaller groups mean more accuracy, more scale overhead, and you had better have a kernel that supports your group size.
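To make the granularity trade-off concrete, here is a minimal pure-Python sketch of symmetric group-wise INT4 rounding, plus the metadata cost of storing one scale per group. The function names and the toy weight row are illustrative, and 16-bit scales are an assumption.

```python
def quantize_groupwise(weights, group_size, qmax=7):
    """Symmetric INT4: one scale per contiguous group of `group_size` weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0   # avoid a zero scale
        scales.append(scale)
        codes.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize_groupwise(codes, scales, group_size):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def bits_per_weight(weight_bits, scale_bits, group_size):
    """Effective storage cost once per-group scales are counted."""
    return weight_bits + scale_bits / group_size

# Toy row: one large weight in the second half.
row = [0.1, -0.2, 0.05, 0.15, 8.0, 0.1, -0.1, 0.2]

def max_err_first_half(group_size):
    codes, scales = quantize_groupwise(row, group_size)
    deq = dequantize_groupwise(codes, scales, group_size)
    return max(abs(a - b) for a, b in zip(row[:4], deq[:4]))

print("one scale for the row:", max_err_first_half(8))
print("one scale per 4 weights:", max_err_first_half(4))
print("INT4 at group size 128:", bits_per_weight(4, 16, 128), "bits/weight")
```

With a single scale the large weight sets the grid for everyone; with groups it only coarsens its own group, at the price of extra scale storage.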
Symmetric vs asymmetric. Symmetric uses a scale only (zero maps to integer zero); asymmetric adds a zero-point to use the full integer range on skewed distributions. Asymmetric is slightly more accurate for weights with non-zero mean, slightly more expensive in the kernel.
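A small sketch of the difference, assuming 4-bit codes and a deliberately skewed (all non-negative) set of weights; both function names are illustrative.

```python
def quantize_symmetric(weights, bits=4):
    """Scale only: the grid is centered on zero, codes in [-8, 7] for INT4."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes]

def quantize_asymmetric(weights, bits=4):
    """Scale plus zero-point: the grid spans [min, max], codes in [0, 15] for 4 bits."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0
    zero_point = round(-lo / scale)
    codes = [max(0, min(levels, round(w / scale) + zero_point)) for w in weights]
    return [(c - zero_point) * scale for c in codes]

def max_error(original, reconstructed):
    return max(abs(a - b) for a, b in zip(original, reconstructed))

skewed = [i / 20 for i in range(21)]          # 0.0, 0.05, ..., 1.0
sym_err = max_error(skewed, quantize_symmetric(skewed))
asym_err = max_error(skewed, quantize_asymmetric(skewed))
print(f"symmetric max error:  {sym_err:.4f}")
print(f"asymmetric max error: {asym_err:.4f}")
```

On this skewed input the symmetric grid leaves its negative codes unused, so the asymmetric grid ends up with roughly half the step size.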
The bits themselves are not the hard part. Rounding weights to a 4-bit grid is trivial; doing it so the layer’s output barely changes is the whole game. GPTQ and AWQ are two different theories of which rounding errors matter.
GPTQ: spend a Hessian to round better
GPTQ descends from Optimal Brain Surgeon by way of OBQ. The framing: treat quantizing one linear layer as minimizing the reconstruction error of its output on calibration data. That is, minimize ‖X·Wᵀ − X·Ŵᵀ‖² over the quantized weights Ŵ, where X stacks the calibration activations (one row per token) and W is the original [out_features, in_features] weight matrix. Quantize the weights one column at a time; after you round a column you’ve introduced an error, so update all the not-yet-quantized columns to compensate, using second-order (curvature) information. That curvature is the Hessian of the quadratic, which for a linear layer is proportional to XᵀX (the Gram matrix of calibration inputs) and is the same matrix for every output row.
The OBS update needs the inverse Hessian, and applying it naively per weight is the expensive part. GPTQ’s three engineering moves make it tractable at 100B-plus scale:
- Fixed column order. OBQ greedily picks the lowest-error weight next, implying a different order and a separate Hessian per row. GPTQ quantizes columns in a single shared left-to-right order, so all rows share one Hessian and one inverse. For large layers this barely costs accuracy and massively parallelizes.
- Lazy batched updates. Apply the compensating updates in blocks to keep the GEMMs cache-friendly instead of doing rank-1 updates one weight at a time.
- Cholesky reformulation. The repeated inverse-Hessian application is numerically fragile; GPTQ precomputes the needed columns of the inverse via a Cholesky factor, which is what makes it stable in floating point.
The GPTQ paper reports quantizing OPT-175B to 3-4 bits in roughly four GPU-hours on a single A100 — which is what turned INT4 LLMs into something anyone could produce rather than a research-lab event.
Two knobs bite in practice. Calibration data: GPTQ estimates the Hessian from a small set (commonly 128 sequences of 2048 tokens). It’s reasonably robust, but calibrate a code model on Wikipedia prose and you will leave accuracy on the table. Act-order (desc_act in AutoGPTQ/GPTQModel): quantize columns in order of decreasing Hessian diagonal — most “important” first. It helps at INT4 and small groups, sometimes noticeably, at the cost of a more complex kernel access pattern. Turn it on, benchmark, keep it if your runtime supports it.
A mental model in ten lines
    # Per linear layer. W: [out_features, in_features]; X: calibration inputs [n_samples, in_features]
    # cholesky_inverse, quantize, group_scale, damp and I are pseudocode placeholders.
    H = X.T @ X                                   # layer Hessian (shared across rows)
    Hinv = cholesky_inverse(H + damp * I)         # damped inverse, factorized for stability

    for j in range(in_features):                  # left-to-right (or by desc Hessian diag)
        w = W[:, j].clone()
        q = quantize(w, group_scale[j])           # round this column to the grid
        err = (w - q) / Hinv[j, j]                # error scaled by curvature
        W[:, j] = q
        W[:, j+1:] -= err.outer(Hinv[j, j+1:])    # smear error onto remaining cols

The point is not the code; it’s that GPTQ never quantizes activations and never backprops. It’s a closed-form-ish rounding that uses calibration statistics to decide how to smear each rounding error across the weights it hasn’t frozen yet.
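For readers who want to run the idea end to end, here is a self-contained pure-Python sketch of the same column-by-column loop, compared against plain round-to-nearest on a toy layer with correlated inputs. It is an illustration under assumptions, not the reference implementation: it uses one symmetric INT4 scale per output row instead of groups, skips the lazy block updates and act-order, and all helper names (`rtn`, `inverse`, `cholesky_upper`, `gptq`, `output_error`) are made up for the example.

```python
import math
import random

def rtn(w, scale, qmax=7):
    """Round one weight to a symmetric INT4 grid; return the dequantized value."""
    return max(-qmax - 1, min(qmax, round(w / scale))) * scale

def inverse(A):
    """Gauss-Jordan inverse, adequate for a small positive-definite matrix."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def cholesky_upper(A):
    """Upper-triangular U with A = U^T U."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return [[L[j][i] for j in range(n)] for i in range(n)]

def gptq(W, X, scales, damp=0.01):
    """Quantize columns left to right, pushing each rounding error onto later columns."""
    n_in = len(W[0])
    H = [[sum(x[a] * x[b] for x in X) for b in range(n_in)] for a in range(n_in)]
    mean_diag = sum(H[i][i] for i in range(n_in)) / n_in
    for i in range(n_in):
        H[i][i] += damp * mean_diag                      # damping for stability
    U = cholesky_upper(inverse(H))                       # factor of the inverse Hessian
    W = [row[:] for row in W]
    for j in range(n_in):
        for r, row in enumerate(W):
            q = rtn(row[j], scales[r])
            err = (row[j] - q) / U[j][j]
            row[j] = q
            for k in range(j + 1, n_in):
                row[k] -= err * U[j][k]
    return W

def output_error(W, W_hat, X):
    """Squared error of the layer output X @ W^T on the calibration inputs."""
    return sum(sum(x[k] * (w[k] - v[k]) for k in range(len(x))) ** 2
               for x in X for w, v in zip(W, W_hat))

random.seed(0)
n_samples, n_in, n_out = 64, 16, 8
X = []
for _ in range(n_samples):
    shared = random.gauss(0, 1)                          # makes input channels correlated
    X.append([shared + 0.5 * random.gauss(0, 1) for _ in range(n_in)])
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
scales = [max(abs(w) for w in row) / 7 for row in W]

W_rtn = [[rtn(w, s) for w in row] for row, s in zip(W, scales)]
W_gptq = gptq(W, X, scales)
err_rtn = output_error(W, W_rtn, X)
err_gptq = output_error(W, W_gptq, X)
print(f"round-to-nearest output error: {err_rtn:.3f}")
print(f"GPTQ-style output error:       {err_gptq:.3f}")
```

Both results live on the same 4-bit grid; the only difference is which grid point each weight lands on. The gain depends on how correlated the input channels are, since correlation is what lets one column's error be absorbed by another.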
AWQ: protect the channels the activations care about
AWQ starts from a different observation: weights are not equally important, and their importance is set by the activations, not the weights. A small fraction of input channels — on the order of 1% — carry large-magnitude activations, and the weight columns multiplying them dominate the layer’s output. The AWQ paper shows that keeping just those roughly 1% salient channels in FP16 recovers most of the INT4 degradation. But mixed precision is a kernel nightmare: ragged dtypes per channel kill your GEMM.
AWQ’s trick is to get the same protection without mixed precision. Scale the salient weight channels up by a per-channel factor before quantizing, and divide the corresponding activations down by the same factor. The layer computes the same product, so the FP16 output is preserved, but the salient channels now occupy more of the quantization grid and their relative rounding error shrinks. Scale too aggressively and you inflate the group’s scale and hurt the other channels — so AWQ searches for the per-layer migration strength (a single knob driven by per-channel activation magnitude) that minimizes output error on the calibration set.
Compared to GPTQ:
- No Hessian, no second-order solve, no error feedback — just a per-channel scaling plus a small grid search. It’s fast and embarrassingly parallel.
- The AWQ authors argue it overfits the calibration set less than GPTQ’s full reconstruction objective, since it only uses activation-magnitude statistics, so it tends to generalize better across domains and to instruction-tuned models.
- Same deployment target: weight-only INT4/INT3, group-wise, paired with a fused dequant GEMM.
In practice GPTQ and AWQ land in roughly the same accuracy neighborhood at W4A16 for most models; which one wins is model- and eval-specific, and the honest answer is to run both on your task evals. Neither is exotic — both are one call in llm-compressor / AutoAWQ / GPTQModel, and both emit checkpoints that vLLM and TensorRT-LLM load directly.
Activations are where it gets hard: SmoothQuant and FP8
The moment you want INT8 activations, for the compute-bound regime, you hit the outlier problem. Transformer activations have persistent outlier channels — specific feature dimensions with magnitudes orders of magnitude above the rest — and they sharpen as models grow. Per-tensor INT8 quantization then spends almost its entire integer range representing a few channels, and accuracy collapses. Weights, by contrast, are flat and easy.
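A tiny illustration of the outlier problem, using made-up activation values: the same four small activations are quantized to per-tensor INT8 once on their own and once alongside a single large channel.

```python
def int8_per_tensor(values):
    """Symmetric per-tensor INT8: one scale for every value."""
    scale = max(abs(v) for v in values) / 127
    return [max(-128, min(127, round(v / scale))) * scale for v in values]

calm = [0.3, -0.2, 0.1, 0.25]
with_outlier = calm + [60.0]          # one hypothetical outlier channel

calm_hat = int8_per_tensor(calm)
outlier_hat = int8_per_tensor(with_outlier)[:4]

calm_err = max(abs(a - b) for a, b in zip(calm, calm_hat))
outlier_err = max(abs(a - b) for a, b in zip(calm, outlier_hat))
print("max error without outlier:", calm_err)
print("max error with outlier:   ", outlier_err)
print("small channels after quantizing with the outlier:", outlier_hat)
```

With the outlier present the step size is about 0.47, so the small channels round to zero or to a single step.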
SmoothQuant (Xiao et al., 2022) resolves this by migrating the difficulty from activations to weights. A linear layer’s output is X·W, so you can insert a per-channel smoothing vector and its inverse without changing the result: X·W = (X·diag(s)⁻¹)·(diag(s)·W). Choose the per-channel s to balance the dynamic ranges — pull the activation outliers down by dividing, push the easy weights up by multiplying — with a migration strength (typically around 0.5) controlling how much pain you move. Now both tensors are quantization-friendly: per-token INT8 activations, per-channel INT8 weights, and the multiply runs on INT8 tensor cores. The smoothing scale folds into the preceding LayerNorm/RMSNorm, so it costs nothing at runtime. This is the standard route to W8A8 INT8 with near-FP16 accuracy, and it shines exactly where INT4 weight-only does nothing: compute-bound, large-batch serving.
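A minimal sketch of the smoothing step, assuming the per-channel scale from the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1−α), with W laid out as [in_features, out_features] so the output is X·W. The matrices are toy values and the helper names are illustrative.

```python
def smooth(X, W, alpha=0.5):
    """Move dynamic range from activation channels into the matching weight rows."""
    n_in = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_in)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n_in)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    X_s = [[row[j] / s[j] for j in range(n_in)] for row in X]   # X · diag(s)^-1
    W_s = [[v * s[j] for v in W[j]] for j in range(n_in)]       # diag(s) · W
    return X_s, W_s

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def abs_max(M):
    return max(abs(v) for row in M for v in row)

# Toy data: input channel 1 is an activation outlier.
X = [[1.0, -50.0, 0.5],
     [-0.8, 40.0, 0.3]]
W = [[0.2, -0.5],
     [0.4, 0.1],
     [-0.3, 0.5]]

X_s, W_s = smooth(X, W)
print("activation range before/after:", abs_max(X), abs_max(X_s))
print("weight range before/after:    ", abs_max(W), abs_max(W_s))
```

With α = 0.5 each channel ends up with the same maximum on both sides, which is the sense in which the difficulty is split between activations and weights.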
FP8 sidesteps a lot of this
FP8 has largely eaten the W8A8 use case on recent hardware, because a floating exponent handles outliers far more gracefully than INT8’s uniform grid. Two formats, both 8 bits:
- E4M3 (4 exponent, 3 mantissa): more precision, narrower range — max magnitude 448 in the OCP/NVIDIA variant. The default for forward-pass weights and activations.
- E5M2 (5 exponent, 2 mantissa): more range (max 57344), less precision, IEEE-style with inf/NaN. Conventionally used for gradients in training.
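Both maxima follow from the bit layouts. As a worked sketch (the function name is illustrative): E5M2 reserves the all-ones exponent for inf/NaN in the IEEE style, while the OCP E4M3 variant keeps that exponent for ordinary numbers and gives up only the all-ones mantissa pattern as NaN.

```python
def fp8_max(exp_bits, man_bits, ieee_specials):
    """Largest finite magnitude of an 8-bit float with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        max_exp_field = 2 ** exp_bits - 2      # all-ones exponent reserved for inf/NaN
        max_mantissa = 2 ** man_bits - 1
    else:
        max_exp_field = 2 ** exp_bits - 1      # all-ones exponent still encodes numbers
        max_mantissa = 2 ** man_bits - 2       # only the all-ones mantissa is NaN
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2 ** man_bits)

e4m3_max = fp8_max(4, 3, ieee_specials=False)
e5m2_max = fp8_max(5, 2, ieee_specials=True)
print("E4M3 max:", e4m3_max)
print("E5M2 max:", e5m2_max)
```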
The win is that FP8’s dynamic range usually makes per-tensor or per-channel scaling enough — you often don’t need SmoothQuant-style migration to keep activations sane, because the exponent absorbs the outliers. Hardware support is the gate: FP8 tensor cores arrived on NVIDIA Hopper (H100), Ada (L40S), and AMD MI300; Blackwell pushes further into block-scaled FP4/FP6, the MX and NVFP4 microscaling formats where a small block of values shares a low-precision scale. On Hopper-class silicon, FP8 W8A8 is frequently the best throughput-per-accuracy point for serving, and the same format is increasingly used in training (Transformer Engine’s delayed scaling with an amax history).
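The idea behind delayed scaling can be sketched without any framework: pick the scale for the current step from the largest absolute value seen over the last few steps, then record the current tensor's amax for later. The class below is a hypothetical illustration of that bookkeeping and is not Transformer Engine's API; the history length and the fallback scale of 1.0 are assumptions.

```python
from collections import deque

class DelayedScaler:
    """Per-tensor FP8 scale derived from a rolling history of amax values."""

    def __init__(self, fp8_max=448.0, history_len=4):
        self.fp8_max = fp8_max
        self.history = deque(maxlen=history_len)

    def scale(self):
        """Scale to apply now, based only on previously observed tensors."""
        amax = max(self.history) if self.history else 0.0
        return self.fp8_max / amax if amax > 0 else 1.0

    def observe(self, tensor):
        """Record this step's amax so later steps can use it."""
        self.history.append(max(abs(v) for v in tensor))

scaler = DelayedScaler(history_len=2)
for step, tensor in enumerate([[1.0, -2.0], [4.0, 0.5], [0.5], [1.0]]):
    s = scaler.scale()                 # uses history from earlier steps only
    print(f"step {step}: scale = {s}")
    scaler.observe(tensor)
```

Using stale statistics avoids an extra pass over the tensor before casting, at the risk of clipping when the range jumps between steps.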
Evaluating it without fooling yourself
Quantization failures are sneaky: perplexity barely moves while a specific capability quietly degrades. A few rules the desk holds to.
Don’t trust perplexity alone. WikiText perplexity is the field’s favorite quantization metric because it’s cheap and it’s what the papers report, but it’s insensitive to exactly the failures that matter — long-context retrieval, code syntax, multi-step arithmetic, instruction following, structured output. Evaluate on task-relevant evals from your own harness, and weight the ones that touch low-probability tokens, because that is where a coarse grid bites first.
Calibration data matters more than its reputation suggests. The calibration set anchors GPTQ’s Hessian and AWQ’s scale search. Use in-domain data (chat-formatted samples for a chat model, code for a code model), enough of it (a few hundred sequences), and the right sequence length. Calibrating a 128k-context model on 512-token snippets is a common own-goal.
The kernel decides whether you actually got faster. A quantized checkpoint is a data format, not a speedup. The speedup comes from a kernel that reads the packed integers and dequantizes in registers — vLLM’s Marlin/Machete for INT4, native FP8 paths on Hopper. Check three things: whether a fast kernel exists for your exact config (bits, group size, symmetric vs asymmetric, act-order) on your GPU; what happens at your batch size, since weight-only INT4 loses its edge as batch grows into the compute-bound regime and dequant becomes overhead; and the latency you actually get — measure TTFT and inter-token latency, not just memory saved. Benchmark, don’t assert.
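A minimal timing harness for those two latency numbers, assuming your serving client can be wrapped as a Python iterator that yields tokens as they arrive. `fake_stream` is a stand-in that only sleeps; swap in a real streaming call to measure anything meaningful.

```python
import time

def measure_stream(token_stream):
    """Return (time to first token, list of inter-token gaps) in seconds."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in token_stream]   # one timestamp per token
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return ttft, gaps

def fake_stream(n_tokens, prefill_s=0.02, decode_s=0.005):
    """Hypothetical stand-in for a streaming client: sleep, then yield tokens."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(decode_s)
        yield f"tok{i}"

ttft, gaps = measure_stream(fake_stream(5))
print(f"TTFT: {ttft * 1e3:.1f} ms")
print(f"mean inter-token latency: {sum(gaps) / len(gaps) * 1e3:.1f} ms")
```

Run it at the batch sizes and prompt lengths you actually serve, and repeat enough times to report percentiles rather than a single run.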
KV cache is its own quantization target. At long context or high concurrency, the KV cache, not the weights, dominates memory and bandwidth — quantizing it (FP8 KV is the common, low-risk choice; INT8/INT4 KV pushes harder) can matter more than the weight format. vLLM exposes this as kv_cache_dtype="fp8". It interacts with everything else: a model can tolerate INT4 weights yet smear attention if you also crush its KV to INT4. Treat it as a separate knob with its own eval.
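The memory side of that claim is simple arithmetic: two tensors (K and V) per layer, each holding one vector per KV head per token. A sketch with a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) at a 131072-token context:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Total KV-cache size: K and V, per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16_kv = kv_cache_bytes(**shape, seq_len=131072, batch=1, bytes_per_elem=2)
fp8_kv = kv_cache_bytes(**shape, seq_len=131072, batch=1, bytes_per_elem=1)
print(f"FP16 KV cache: {fp16_kv / 2**30:.0f} GiB per sequence")
print(f"FP8 KV cache:  {fp8_kv / 2**30:.0f} GiB per sequence")
```

The figure scales linearly with batch size and context length, and it ignores any per-block scale metadata a quantized cache may carry.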
The bottom line
Pick the method from the serving regime, not the leaderboard. Latency-bound decode at small batch: weight-only INT4 with GPTQ or AWQ, group size 128, on a Marlin-class kernel — the easy, high-return default. Compute-bound prefill or large-batch throughput: W8A8, and on Hopper-or-newer that means FP8 (E4M3) before reaching for INT8 plus SmoothQuant, because the exponent does the outlier work for you. Long context or high concurrency: quantize the KV cache and measure attention quality separately. Across all of them the failure mode is the same — shipping a checkpoint you validated on perplexity and a kernel that doesn’t accelerate your config — so the discipline is boring and non-negotiable: calibrate in-domain, evaluate on the tasks you actually serve, and benchmark the latency you actually get on the hardware you actually have.