Quantization Interview Preparation Guide

Introduction

Quantization is the process of reducing the numerical precision of model weights and/or activations, from FP32 or BF16 down to INT8, INT4, or even binary formats, to shrink a model's memory footprint and accelerate inference. In 2026, quantization is an essential skill for any engineer deploying large language models. A 70B parameter model in BF16 requires ~140GB of VRAM for its weights alone (70B parameters × 2 bytes), so it cannot be deployed on a single A100 80GB. INT4 quantization reduces the weights to ~35GB, which fits on a single GPU.
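The memory figures above follow from simple arithmetic: parameter count times bits per parameter. The sketch below counts weights only (no KV cache, activations, or quantization metadata such as scales and zero-points); `weight_memory_gb` is a hypothetical helper name used for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Weight-only footprint of a 70B-parameter model at common precisions.
footprints = {
    name: weight_memory_gb(70e9, bits)
    for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gigabytes in footprints.items():
    print(f"70B in {name}: ~{gigabytes:.0f} GB")
```

Real INT4 checkpoints are slightly larger than this estimate because they also store per-group scales and, for some formats, zero-points.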

Quantization interview questions assess whether a candidate understands the accuracy vs efficiency tradeoff at the weights and activations level. Junior engineers are expected to know the difference between INT8 and INT4, and when to apply bitsandbytes vs GPTQ. Mid-level engineers must reason about calibration dataset selection, per-channel vs per-tensor granularity, and activation outlier handling. Senior engineers are assessed on GPTQ's Hessian-based compensation, AWQ's activation-aware scaling, and FP8 training with hardware-native support on H100 and H200 GPUs.

Why It Matters

The economics of LLM deployment are dominated by GPU cost, which scales with the number of GPUs needed to fit the model. A 70B model in BF16 needs at least two A100 80GB GPUs for its ~140GB of weights, and deployments commonly provision four to leave room for the KV cache and activations. The same model quantized to INT4 (AWQ) fits on a single A100 with room for the KV cache. At scale, this can mean up to a 4x infrastructure cost reduction for comparable model capability.

Beyond memory, quantization improves throughput. LLM decoding is memory-bandwidth-bound: the bottleneck is moving weights from GPU HBM to the compute units, not the arithmetic itself. INT4 weights require 4x less bandwidth than BF16, enabling 2-4x higher tokens-per-second on the same GPU. This is why serving frameworks like vLLM integrate AWQ and GPTQ natively.
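The bandwidth argument can be turned into a rough ceiling on decode speed. The sketch below assumes batch-size-1 decoding in which every generated token reads all weights from memory exactly once, and a hypothetical aggregate bandwidth of 2000 GB/s; both the bandwidth figure and the helper name are illustrative assumptions, not measurements.

```python
def max_decode_tokens_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized upper bound: one full pass over the weights per generated token."""
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 2000.0  # assumed round figure, not a vendor specification

bf16_ceiling = max_decode_tokens_per_s(140.0, BANDWIDTH_GB_S)  # 70B in BF16
int4_ceiling = max_decode_tokens_per_s(35.0, BANDWIDTH_GB_S)   # 70B in INT4

print(f"BF16 ceiling: {bf16_ceiling:.1f} tok/s")
print(f"INT4 ceiling: {int4_ceiling:.1f} tok/s")
print(f"Ratio: {int4_ceiling / bf16_ceiling:.1f}x")
```

The 4x ratio is a ceiling. Measured gains are usually lower because of dequantization overhead, KV cache reads, and compute-bound phases such as prefill, which is consistent with the 2-4x range quoted above.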

As an interview topic, quantization reveals hardware-aware engineering depth. Understanding why per-channel quantization preserves accuracy better than per-tensor, what causes activation outliers in LLMs and how SmoothQuant addresses them, and why calibration dataset quality matters for PTQ demonstrates the systems-level understanding that MLOps and AI infrastructure roles require.

Core Concepts

Architecture Overview

Quantization transforms the model weights and activations from high-precision floating point to lower-precision integers, requiring specific kernels for dequantization during the forward pass.

Data Flow

Calibration data passes through the model to collect activation statistics, which are then used to compute optimal scales for weight quantization. During inference, quantized weights are loaded and dequantized on-the-fly by optimized kernels before matrix multiplication.

  [Calibration Data]
          ↓
  [Activation Statistics]
          ↓
  [Weight Quantizer] ← [Scaling Factors]
          ↓
  [Quantized Weights]
          ↓
  [Inference Engine]
    ↓            ↓
[Dequant Kernel] [MatMul]
    ↓            ↓
  [Output Activation]
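The inference half of this flow can be sketched in a few lines. This is a minimal pure-Python illustration, assuming symmetric absmax INT8 quantization with one scale per output row; production engines fuse these steps into GPU kernels, and the function names here are hypothetical.

```python
def quantize_rows_int8(weights):
    """Symmetric absmax INT8 quantization with one scale per output row."""
    q_rows, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127 or 1.0  # guard all-zero rows
        scales.append(scale)
        q_rows.append([round(w / scale) for w in row])
    return q_rows, scales


def dequant_matvec(q_rows, scales, x):
    """Dequantize each row on the fly, then take its dot product with x."""
    return [
        scale * sum(q * xi for q, xi in zip(row, x))
        for row, scale in zip(q_rows, scales)
    ]


W = [[0.5, -1.2, 0.3], [2.0, 0.1, -0.4]]
x = [1.0, 2.0, 3.0]

q_rows, scales = quantize_rows_int8(W)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in W]
y_quant = dequant_matvec(q_rows, scales, x)
print("reference:", y_ref)
print("quantized:", y_quant)
```

Only the integer rows and their scales need to be stored; the floating-point weights are reconstructed transiently during the multiply.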

Key Components

Tools & Frameworks

Design Patterns

Weight-only Quantization Deployment Pattern

Quantize weights to 4-bit while keeping activations in FP16 to maintain accuracy.

Trade-offs: Reduces memory footprint significantly but incurs dequantization overhead on every forward pass.
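Part of that overhead comes from how 4-bit values are stored: two weights share one byte and must be unpacked before use. The sketch below assumes signed 4-bit integers packed low nibble first; real formats use different and often interleaved layouts, so treat the layout and function names as illustrative.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


quantized = [-8, 7, 3, -1, 0]
packed = pack_int4(quantized)
restored = unpack_int4(packed, len(quantized))
print(len(packed), "bytes hold", len(quantized), "weights:", restored)
```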

Per-channel Quantization Granularity Pattern

Calculate separate scale factors for each output channel in a weight matrix.

Trade-offs: Higher accuracy than per-tensor but requires more complex kernel logic.
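A small numeric comparison shows why the finer granularity helps. This sketch assumes symmetric INT8 quantization and a toy weight matrix whose two output channels differ in magnitude by orders of magnitude; the matrix and the `row_relative_errors` helper are made up for illustration.

```python
def row_relative_errors(weights, per_channel, qmax=127):
    """Relative L1 error of each row after a quantize-dequantize round trip."""
    tensor_scale = max(abs(w) for row in weights for w in row) / qmax
    errors = []
    for row in weights:
        # Per-channel: each row gets its own scale. Per-tensor: one shared scale.
        scale = max(abs(w) for w in row) / qmax if per_channel else tensor_scale
        dequantized = [round(w / scale) * scale for w in row]
        abs_error = sum(abs(w - d) for w, d in zip(row, dequantized))
        errors.append(abs_error / sum(abs(w) for w in row))
    return errors


W = [
    [0.012, -0.02, 0.015],  # small-magnitude output channel
    [8.0, -5.0, 3.0],       # large-magnitude output channel
]

per_tensor = row_relative_errors(W, per_channel=False)
per_channel = row_relative_errors(W, per_channel=True)
print("per-tensor :", per_tensor)
print("per-channel:", per_channel)
```

With a single shared scale, the large channel sets the step size and the small channel collapses to zero. With one scale per channel, both rows keep their resolution.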

Calibration-based PTQ Workflow Pattern

Use a small representative dataset to tune scale factors post-training.

Trade-offs: Faster than QAT but can be sensitive to the choice of calibration data.
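The sensitivity to calibration data is easy to reproduce. The sketch below derives one symmetric activation scale from calibration batches, first from the absolute maximum and then from a clipped percentile. Percentile clipping is a common heuristic brought in here for illustration rather than something the pattern above prescribes, and the data is synthetic.

```python
import math


def calibrate_scale(batches, qmax=127, percentile=100.0):
    """Derive one symmetric activation scale from calibration batches.

    percentile=100 uses the absolute maximum; lower values clip outliers.
    """
    magnitudes = sorted(abs(a) for batch in batches for a in batch)
    index = max(0, math.ceil(len(magnitudes) * percentile / 100.0) - 1)
    return magnitudes[index] / qmax


typical = [0.1 * i for i in range(1, 100)]       # 99 ordinary activations
batches = [typical[:50], typical[50:] + [60.0]]  # plus a single outlier

scale_absmax = calibrate_scale(batches)
scale_p99 = calibrate_scale(batches, percentile=99.0)
print(f"absmax scale: {scale_absmax:.4f}")
print(f"p99 scale:    {scale_p99:.4f}")
```

One outlier in the calibration set inflates the absmax scale roughly sixfold here, which coarsens the grid for every ordinary activation. Clipping trades that for saturation error on the outlier itself.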

Common Mistakes

Production Considerations

Reliability	Quantization can lead to silent failure where model outputs become nonsensical; implement rigorous evaluation on validation sets.
Scalability	Quantization enables horizontal scaling by fitting larger models on cheaper, smaller GPU instances.
Performance	Expect 2x-4x throughput improvements due to reduced memory bandwidth requirements.
Cost	Reduces cloud infrastructure costs by allowing smaller instance types (e.g., A10 vs A100).
Security	Quantized models may be more susceptible to adversarial attacks; ensure robust input sanitization.
Monitoring	Track 'quantization error' metrics and monitor latency distributions in production.
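The table does not define a specific "quantization error" metric. One possible choice, sketched here as an assumption, is the signal-to-quantization-noise ratio (SQNR) between reference and quantized outputs on the same inputs, where higher is better.

```python
import math


def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in decibels (higher is better)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf  # outputs are identical
    return 10 * math.log10(signal / noise)


reference_logits = [1.0, 0.0]
quantized_logits = [0.9, 0.0]
print(f"SQNR: {sqnr_db(reference_logits, quantized_logits):.1f} dB")
```

A metric like this catches numerical drift but not task-level regressions, so it complements rather than replaces evaluation on validation sets.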

Key Trade-offs

• Perplexity vs Memory
• Latency vs Accuracy
• Calibration Time vs Quality

Scaling Strategies

• Dynamic quantization for CPU
• Static quantization for GPU
• Weight-only for LLM serving

Optimisation Tips

• Use fused kernels for dequantization
• Align memory buffers to cache lines
• Profile with Nsight Systems

FAQ

What is the difference between PTQ and QAT?

PTQ (Post-Training Quantization) is applied to a pre-trained model without retraining, making it fast but potentially less accurate. QAT (Quantization-Aware Training) incorporates quantization effects during training, resulting in higher accuracy at the cost of significantly more compute and time.
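The way QAT "incorporates quantization effects" is usually a fake-quantize step in the forward pass: values are quantized and immediately dequantized, so they stay in floating point but sit on the integer grid. A minimal scalar sketch, assuming a symmetric INT8 range and a hypothetical function name:

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize then immediately dequantize a single value.

    The result is still a float, but it carries the rounding and clamping
    error that real integer inference would introduce.
    """
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


rounded = fake_quantize(0.26, scale=0.1)     # snaps to the nearest grid point
saturated = fake_quantize(100.0, scale=0.1)  # clamps at qmax * scale
print(rounded, saturated)
```

During training, the rounding step has zero gradient almost everywhere, so QAT implementations typically treat it as the identity in the backward pass (the straight-through estimator). PTQ applies the same rounding only once, after training, with no chance for the weights to adapt.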

Why is 4-bit quantization popular for LLMs?

4-bit quantization lets large LLMs fit into the limited VRAM of consumer-grade GPUs, enabling inference on hardware that would otherwise be unable to load the model. It cuts weight memory roughly 4x relative to FP16/BF16 with manageable accuracy degradation.

What is the role of calibration data?

Calibration data is a small, representative subset of the training or validation data used to calculate optimal scale factors and zero-points. It ensures the quantized model maintains performance on real-world data distributions.

How does AWQ differ from GPTQ?

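GPTQ quantizes weights layer by layer and uses approximate second-order information (a Hessian estimated from calibration inputs) to adjust the not-yet-quantized weights so that they compensate for the error already introduced. AWQ (Activation-Aware Weight Quantization) identifies salient weight channels from activation magnitudes and protects them with per-channel scaling before quantization. It needs no backpropagation or weight reconstruction, which makes it cheaper to apply.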
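The scaling idea behind AWQ rests on an identity: multiplying an input channel's weights by s and dividing that channel's activations by s leaves the output unchanged in full precision, but changes where the weights land on the quantization grid. The toy example below uses hand-picked numbers and a hand-picked scale. It is not the AWQ search procedure, which derives scales from calibration activations, and a single example only illustrates the mechanism rather than the average effect.

```python
def quant_dequant_int4(row):
    """Symmetric INT4 round trip with one absmax scale for the row."""
    scale = max(abs(w) for w in row) / 7
    return [round(w / scale) * scale for w in row]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


w = [0.15, -0.52, 0.33, 0.74]  # one output row of a weight matrix
x = [10.0, 0.1, 0.2, 0.1]      # channel 0 carries a large activation
s = [2.0, 1.0, 1.0, 1.0]       # hand-picked scale for the salient channel

w_scaled = [wi * si for wi, si in zip(w, s)]
x_scaled = [xi / si for xi, si in zip(x, s)]

y_ref = dot(w, x)
err_plain = abs(y_ref - dot(quant_dequant_int4(w), x))
err_scaled = abs(y_ref - dot(quant_dequant_int4(w_scaled), x_scaled))
print(f"error without scaling: {err_plain:.4f}")
print(f"error with scaling:    {err_scaled:.4f}")
```

In a real deployment the division of the activations is folded into the preceding layer, so the scaling adds no inference-time cost.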

What is group-wise quantization?

Group-wise quantization divides a weight tensor into smaller blocks (groups) and applies independent scale factors to each. This increases granularity, allowing for more precise mapping and reducing the accuracy loss common in low-bit quantization.
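A short sketch makes the effect of group size concrete. It assumes symmetric INT4 quantization of a single weight row with made-up values, and `groupwise_quant_dequant` is a hypothetical helper.

```python
def groupwise_quant_dequant(weights, group_size, qmax=7):
    """Quantize-dequantize a row using one absmax scale per group."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # guard zero groups
        restored.extend(round(w / scale) * scale for w in group)
    return restored


def total_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored))


row = [0.021, -0.03, 0.01, 0.04, 2.0, -1.5, 0.9, 1.1]

err_one_group = total_abs_error(row, groupwise_quant_dequant(row, group_size=8))
err_two_groups = total_abs_error(row, groupwise_quant_dequant(row, group_size=4))
print(f"one group of 8:  {err_one_group:.4f}")
print(f"two groups of 4: {err_two_groups:.4f}")
```

Smaller groups cost extra storage for the scales. For example, an FP16 scale per group of 128 weights adds 16 / 128 = 0.125 bits per weight.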

Can I quantize models without losing accuracy?

Quantization almost always involves some loss of precision. However, techniques like QAT, AWQ, and smaller group sizes can minimize this loss to negligible levels for most applications, balancing accuracy against memory and speed gains.

What is the difference between symmetric and asymmetric quantization?

Symmetric quantization maps the floating point range symmetrically around zero to an integer range, simplifying hardware implementation. Asymmetric quantization uses a zero-point offset, allowing it to handle skewed distributions more effectively.
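The two schemes differ only in how scale and zero-point are derived. The sketch below assumes 8-bit quantization, a signed range for the symmetric case, an unsigned range for the asymmetric case, and a made-up skewed set of values; the function names are hypothetical.

```python
def symmetric_params(values, bits=8):
    """Scale from the absolute maximum; the zero-point is fixed at 0."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax, 0, -qmax, qmax


def asymmetric_params(values, bits=8):
    """Scale from the full min-max range; the zero-point shifts the grid."""
    qmin, qmax = 0, 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quant_dequant(values, scale, zero_point, qmin, qmax):
    restored = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        restored.append((q - zero_point) * scale)
    return restored


def total_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored))


skewed = [-0.4, 0.1, 1.3, 2.7, 6.0]  # mostly positive distribution

sym = symmetric_params(skewed)
asym = asymmetric_params(skewed)
err_sym = total_abs_error(skewed, quant_dequant(skewed, *sym))
err_asym = total_abs_error(skewed, quant_dequant(skewed, *asym))
print(f"symmetric error:  {err_sym:.4f}")
print(f"asymmetric error: {err_asym:.4f} (zero_point={asym[1]})")
```

Symmetric quantization spends half of its integer range on negative values that this distribution barely uses. The asymmetric grid covers only the observed range, so its step size is roughly half as large.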

Why do I need specific kernels for quantization?

Standard matrix multiplication kernels expect FP16/BF16 inputs. Quantized kernels must perform on-the-fly dequantization or use specialized hardware instructions (like INT4/INT8 tensor core ops) to compute results efficiently without full dequantization.

What is the impact of quantization on latency?

Quantization reduces memory bandwidth pressure, which is often the primary bottleneck for LLM decoding. This leads to higher throughput and lower latency, provided the hardware and kernels are optimized for the specific quantization format.

What are the risks of using quantization in production?

The main risk is accuracy degradation, which can manifest as silent failures or nonsensical outputs. It is critical to perform rigorous validation and testing on domain-specific datasets before deploying quantized models.